Project Velocity - Synthetic Client Graph Dataset
Generated: 2026-04-18
Dataset Version: 1.0.0
Target: 250 full synthetic client graphs
Owner: Sagnik
Alignment: Founder CRM and Platform Delivery Pack (Doc 16)
Overview
This dataset contains 250 fully synthetic client graphs aligned to the Project Velocity canonical domain model. It is designed for:
- CRM module validation and testing
- Import pipeline replay testing
- Client 360 aggregation validation
- Oracle intelligence and writeback testing
- QD score and timeseries validation
- Communication capture and transcript processing
- Workflow and approval governance testing
The data simulates premium real-estate sales behavior in the Kolkata market across 14 projects.
Geography and Inventory
Market: Kolkata and surrounding micro-markets
Projects: 14 premium residential projects
| Project ID | Project Name | Developer | Micro-Market |
|---|---|---|---|
| PRJ-001 | Eden Devprayag | Eden Group | Rajarhat |
| PRJ-002 | Sugam Prakriti | Sugam Homes | Barasat |
| PRJ-003 | Atri Aqua | Atri Developers | New Town |
| PRJ-004 | Atri Surya Toron | Atri Developers | Rajarhat |
| PRJ-005 | Siddha Suburbia Bungalow | Siddha Group | Madanpur |
| PRJ-006 | Merlin Avana | Merlin Group | Tangra |
| PRJ-007 | DTC Good Earth | DTC Projects | New Town |
| PRJ-008 | Siddha Serena | Siddha Group | New Town |
| PRJ-009 | Siddha Sky Waterfront | Siddha Group | Beliaghata |
| PRJ-010 | Godrej Blue | Godrej Properties | New Town |
| PRJ-011 | DTC Sojon | DTC Projects | Rajarhat |
| PRJ-012 | Shriram Grand City | Shriram Properties | Howrah |
| PRJ-013 | Godrej Elevate | Godrej Properties | Dum Dum |
| PRJ-014 | Ambuja Utpaala | Ambuja Neotia | Tollygunge |
Dataset Composition
Primary Entities
| Entity | Count | Description |
|---|---|---|
| Primary Clients (People) | 250 | Main decision-makers and buyers |
| Co-buyers/Family | 91 | Secondary contacts linked to households |
| Accounts (Organizations) | 153 | Employers, businesses, referral partners |
| Households | 118 | Family decision units |
| Relationships | 91 | Spouse, parent, sibling, business partner links |
| Leads | 250 | Funnel-stage qualification records |
| Opportunities | 400 | Deal pipeline objects (1-3 per client) |
| Property Interests | 400 | Project/unit preference records |
| Stage History | 1,373 | Lead stage transition audit trail |
Interaction Graph
| Artifact | Count | Description |
|---|---|---|
| Interactions | 1,897 | Umbrella communication events |
| WhatsApp Messages | 3,367 | Text messages with realistic dialogue |
| WhatsApp Threads | 606 | Conversation thread summaries |
| Phone Calls | 478 | Call records with duration and direction |
| Transcripts | 231 | Speaker-segmented call transcripts |
| Emails | 149 | Business correspondence with subjects and bodies |
| Site Visits | 305 | Physical site visit records with notes |
| Reminders/Tasks | 759 | Follow-up items and action reminders |
Intelligence & Enrichment
| Artifact | Count | Description |
|---|---|---|
| QD Scores | 250 | Latest qualification/disposition scores |
| QD Timeseries | 1,953 | Historical score propagation (4-12 points per client) |
| Vehicle Events | 80 | Number-plate detection events |
| Perception Events | 60 | Behavioral/dwell-time intelligence |
| CCTV Links | 120 | Video clip references linked to visits |
Workflow & Governance
| Artifact | Count | Description |
|---|---|---|
| Workflow Actions | 100 | Import reviews, merge proposals, writebacks |
| Approvals | 49 | Human review decisions |
| Writebacks | 28 | Approved canonical mutations |
Inventory
| Artifact | Count | Description |
|---|---|---|
| Projects | 14 | Master project records |
| Units | 209 | Individual unit inventory (8-20 per project) |
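The per-parent ranges quoted in the tables above (1-3 opportunities per client, 4-12 QD timeseries points per client, 8-20 units per project) can be spot-checked after loading the CSVs. The following is a minimal sketch, not the generator's own validation: the helper names are hypothetical, and it assumes that inventory_units.csv carries a `project_id` column.

```python
from collections import Counter


def children_per_parent(child_rows, parent_key):
    """Count how many child rows reference each parent ID."""
    return Counter(row[parent_key] for row in child_rows)


def out_of_range(child_rows, parent_key, parent_ids, lo, hi):
    """Return {parent_id: count} for parents whose child count is outside [lo, hi].

    Parents with no child rows at all are reported with a count of 0.
    """
    counts = children_per_parent(child_rows, parent_key)
    return {
        pid: counts.get(pid, 0)
        for pid in parent_ids
        if not lo <= counts.get(pid, 0) <= hi
    }


# Illustrative rows only; real rows would come from csv.DictReader.
# ASSUMPTION: the unit file exposes a "project_id" column.
units = [{"project_id": "PRJ-001"}] * 8 + [{"project_id": "PRJ-002"}] * 2
print(out_of_range(units, "project_id", ["PRJ-001", "PRJ-002"], 8, 20))
# {'PRJ-002': 2}
```

The same function covers the other ranges by swapping the child rows, the key column, and the bounds.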
File Structure
synthetic_client_graphs/
├── csv/
│ ├── inventory_projects.csv
│ ├── inventory_units.csv
│ ├── crm_people.csv
│ ├── crm_accounts.csv
│ ├── crm_households.csv
│ ├── crm_relationships.csv
│ ├── crm_leads.csv
│ ├── crm_opportunities.csv
│ ├── crm_property_interests.csv
│ ├── crm_stage_history.csv
│ ├── intel_interactions.csv
│ ├── intel_messages.csv
│ ├── intel_calls.csv
│ ├── intel_transcripts.csv
│ ├── intel_emails.csv
│ ├── intel_whatsapp_threads.csv
│ ├── intel_visits.csv
│ ├── intel_reminders.csv
│ ├── intel_qd_scores.csv
│ ├── intel_qd_timeseries.csv
│ ├── intel_vehicle_events.csv
│ ├── intel_perception_events.csv
│ ├── intel_cctv_links.csv
│ ├── workflow_actions.csv
│ ├── workflow_approvals.csv
│ └── workflow_writebacks.csv
├── json/
│ ├── client_360_snapshots_batch_1.json (Clients 1-50)
│ ├── client_360_snapshots_batch_2.json (Clients 51-100)
│ ├── client_360_snapshots_batch_3.json (Clients 101-150)
│ ├── client_360_snapshots_batch_4.json (Clients 151-200)
│ ├── client_360_snapshots_batch_5.json (Clients 201-250)
│ ├── import_mapping_manifest_example.json
│ ├── relationship_graph_map.json
│ └── transcript_sidecars.json
└── README.md
Buyer Persona Distribution
The 250 primary clients are distributed across realistic premium real-estate buyer personas:
| Persona | Percentage | Count | Characteristics |
|---|---|---|---|
| High-Intent Buyer | 20% | ~50 | Quick decision cycle, clear requirements, responsive |
| Slow-Burn Investor | 18% | ~45 | Long horizon, price-sensitive, comparison-heavy |
| NRI Buyer | 12% | ~30 | Remote decision-making, video calls, family proxies |
| Family Decision Unit | 20% | ~50 | Multiple stakeholders, consensus-driven, Vastu-conscious |
| Price-Sensitive Aspirational | 15% | ~37 | Stretch budget, EMI-focused, festival-offer hunters |
| Broker/Referral Chain | 8% | ~20 | Multiple client representations, commission-focused |
| Repeat Visitor | 7% | ~18 | High engagement, multiple visits, decision paralysis |
Canonical Domain Alignment
This dataset maps to the planned Velocity canonical domains:
crm_* Domain
- crm_people: Contact identity and demographics
- crm_accounts: Organization and employer records
- crm_households: Family and co-buyer structures
- crm_relationships: Person-to-person linkages
- crm_leads: Funnel stage and qualification
- crm_opportunities: Deal pipeline and valuation
- crm_property_interests: Project/unit preferences
- crm_stage_history: Audit trail of stage transitions
intel_* Domain
- intel_interactions: Unified communication events
- intel_messages: Text-level message records
- intel_calls: Call metadata and duration
- intel_transcripts: Speaker-segmented conversation text
- intel_emails: Email correspondence
- intel_whatsapp_threads: Thread-level summaries
- intel_visits: Site visit records and notes
- intel_reminders: Task and follow-up tracking
- intel_qd_scores: Qualification/disposition scores
- intel_qd_timeseries: Temporal score evolution
- intel_vehicle_events: Parking/entry detection
- intel_perception_events: Behavioral intelligence
- intel_cctv_links: Video evidence references
inventory_* Domain
- inventory_projects: Master project catalog
- inventory_units: Unit-level availability and pricing
workflow_* Domain
- workflow_actions: Proposed AI/human actions
- workflow_approvals: Review decisions
- workflow_writebacks: Committed mutations
Quality Assurance
Referential Integrity
All foreign key relationships have been validated:
- ✅ All lead.person_id values exist in crm_people
- ✅ All opportunity.lead_id values exist in crm_leads
- ✅ All interaction.person_id values exist in crm_people
- ✅ All visit.person_id values exist in crm_people
- ✅ All qd_score.person_id values exist in crm_people
- ✅ No orphaned stage history records
- ✅ All opportunity.project_id values exist in inventory_projects
- ✅ All property_interest.project_id values exist in inventory_projects
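Each check in the list above is an anti-join: collect the foreign-key values on the child side and subtract the primary keys on the parent side. The sketch below shows one way to re-run such a check independently. The function name, the sample rows, and the ID formats (`P-001`, `L-001`) are illustrative assumptions, not taken from the dataset.

```python
import csv
import io


def missing_references(child_rows, fk_column, parent_rows, pk_column):
    """Return the sorted foreign-key values that have no matching parent row."""
    parent_ids = {row[pk_column] for row in parent_rows}
    return sorted({row[fk_column] for row in child_rows} - parent_ids)


# Tiny in-memory stand-ins for crm_people.csv and crm_leads.csv.
# ASSUMPTION: column names other than person_id are hypothetical.
people_csv = "person_id,full_name\nP-001,Example A\nP-002,Example B\n"
leads_csv = "lead_id,person_id\nL-001,P-001\nL-002,P-003\n"

people = list(csv.DictReader(io.StringIO(people_csv)))
leads = list(csv.DictReader(io.StringIO(leads_csv)))

orphans = missing_references(leads, "person_id", people, "person_id")
print(orphans)  # ['P-003']
```

An empty list for every parent/child pair corresponds to the all-green checklist; against the real files, replace the `io.StringIO` objects with opened CSV files.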
Temporal Consistency
- ✅ Lead creation dates precede interaction dates
- ✅ Stage history transitions are monotonic in time
- ✅ QD timeseries points are chronologically ordered
- ✅ Visit dates align with lead stage progression
- ✅ Reminder due dates follow interaction dates
Realism Rules Applied
- Names: Realistic Indian names (Bengali, Hindi, mixed demographics)
- Organizations: Major Indian IT, banking, manufacturing, and consulting firms
- Communication: Premium property sales tone, not generic retail
- Stage Transitions: Narratively coherent (enquiry → visit → negotiation → booking)
- Sales Cadence: Realistic follow-up intervals (3-15 days between touches)
- Dialogue: Context-aware transcripts referencing specific projects, prices, and objections
- Budgets: Aligned to the Kolkata premium market (INR 1.5 Cr to 25 Cr)
Usage Instructions
CSV-First Import Testing
- Start with crm_people.csv as the identity anchor
- Join crm_leads.csv on person_id
- Join crm_opportunities.csv on lead_id
- Join inventory_projects.csv and inventory_units.csv on project/unit IDs
- Map intel_interactions.csv on person_id for communication history
- Aggregate intel_qd_scores.csv and intel_qd_timeseries.csv for intelligence
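The join order above can be sketched with plain dictionaries before wiring it into a real import pipeline. The example below is one possible shape, not the Velocity pipeline itself: `index_by` and `build_client_graph` are hypothetical helpers, and the rows are expected as dictionaries such as those produced by `csv.DictReader`. Only the `person_id` and `lead_id` join keys come from this README.

```python
from collections import defaultdict


def index_by(rows, key):
    """Group rows into {key_value: [rows, ...]} for one-to-many joins."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row)
    return grouped


def build_client_graph(people, leads, opportunities, interactions):
    """Assemble one nested record per person, anchored on person_id."""
    leads_by_person = index_by(leads, "person_id")
    opps_by_lead = index_by(opportunities, "lead_id")
    interactions_by_person = index_by(interactions, "person_id")

    graph = {}
    for person in people:
        pid = person["person_id"]
        person_leads = leads_by_person.get(pid, [])
        graph[pid] = {
            "person": person,
            "leads": person_leads,
            # Opportunities hang off leads, so walk person -> lead -> opportunity.
            "opportunities": [
                opp
                for lead in person_leads
                for opp in opps_by_lead.get(lead["lead_id"], [])
            ],
            "interactions": interactions_by_person.get(pid, []),
        }
    return graph
```

Indexing each child file once keeps the join linear in the number of rows, which matters little at 250 clients but makes replay tests cheap to repeat.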
Client 360 Validation
Load json/client_360_snapshots_batch_*.json to validate:
- Aggregation accuracy
- Cross-domain joining
- Derived field computation
- Missing data handling
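A first pass over the snapshot batches might simply load them all and confirm that no client appears twice. The sketch below rests on two unverified assumptions about the snapshot schema: that each batch file is a top-level JSON array, and that each snapshot carries a `person_id` field. Adjust both to the actual structure.

```python
import json
from pathlib import Path


def load_snapshots(json_dir):
    """Load every client_360_snapshots_batch_*.json under json_dir into one list."""
    snapshots = []
    for path in sorted(Path(json_dir).glob("client_360_snapshots_batch_*.json")):
        with path.open(encoding="utf-8") as fh:
            batch = json.load(fh)  # ASSUMPTION: top level is a list of snapshots
        snapshots.extend(batch)
    return snapshots


def duplicate_ids(snapshots, id_key="person_id"):
    """Return the sorted IDs that appear in more than one snapshot."""
    seen, dupes = set(), set()
    for snap in snapshots:
        sid = snap[id_key]  # ASSUMPTION: id_key exists on every snapshot
        (dupes if sid in seen else seen).add(sid)
    return sorted(dupes)
```

With the five batches loaded, `len(load_snapshots("json"))` would be expected to equal 250 and `duplicate_ids(...)` to be empty, given the client ranges listed in the file structure.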
Oracle Writeback Testing
Use workflow_actions.csv and workflow_writebacks.csv to test:
- Proposal generation
- Approval flow simulation
- Canonical mutation application
- Audit trail completeness
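One audit-trail property worth asserting is that every writeback traces back to an approved action. The following sketch assumes column names that this README does not confirm: `action_id` on both files, a `decision` column on workflow_approvals.csv, and the literal value `approved`. Treat all three as placeholders.

```python
def unapproved_writebacks(writebacks, approvals, approved_value="approved"):
    """Return writeback rows whose action has no approval carrying approved_value.

    ASSUMPTION: "action_id" and "decision" are hypothetical column names.
    """
    approved_actions = {
        row["action_id"] for row in approvals if row["decision"] == approved_value
    }
    return [row for row in writebacks if row["action_id"] not in approved_actions]


# Illustrative rows only.
approvals = [
    {"action_id": "A-1", "decision": "approved"},
    {"action_id": "A-2", "decision": "rejected"},
]
writebacks = [{"action_id": "A-1"}, {"action_id": "A-2"}]
print(unapproved_writebacks(writebacks, approvals))  # [{'action_id': 'A-2'}]
```

An empty result is consistent with the funnel described in the composition tables, where writebacks are a subset of approvals.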
Transcript Processing
Load json/transcript_sidecars.json for:
- Speaker diarization validation
- Conversation context extraction
- Sentiment and intent inference testing
Evidence Placeholders
The dataset includes metadata placeholders for:
- CCTV clip references (clips/VIS_{visit_id}_{random}.mp4)
- Call recording references (rec/CAL_{call_id}.mp3)
- Transcript references (trx/CAL_{call_id}.json)
- Camera IDs and gate references
These are structured metadata only. Actual media payloads are not included.
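Because these references are strings rather than files, a consumer can validate their shape without touching storage. The sketch below turns the three path templates into regular expressions. It assumes that IDs contain no `/` and that the random clip suffix contains no `_`; the README specifies neither, and the sample IDs are invented.

```python
import re

# One pattern per placeholder template listed above.
PATTERNS = {
    "cctv_clip": re.compile(r"^clips/VIS_(?P<visit_id>[^/]+)_(?P<suffix>[^_/]+)\.mp4$"),
    "call_recording": re.compile(r"^rec/CAL_(?P<call_id>[^/]+)\.mp3$"),
    "call_transcript": re.compile(r"^trx/CAL_(?P<call_id>[^/]+)\.json$"),
}


def classify_reference(path):
    """Return (kind, extracted_ids) for a known placeholder path, else None."""
    for kind, pattern in PATTERNS.items():
        match = pattern.match(path)
        if match:
            return kind, match.groupdict()
    return None


# Hypothetical sample value; the real ID format may differ.
print(classify_reference("rec/CAL_C-17.mp3"))
# ('call_recording', {'call_id': 'C-17'})
```

The extracted `visit_id` or `call_id` can then be checked against intel_visits.csv or intel_calls.csv in the same way as any other foreign key.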
Synthetic Data Limitations
- Names and addresses are fictional but culturally realistic
- Phone numbers follow Indian format but are not real
- Email addresses are synthetic and non-deliverable
- Prices are representative of Kolkata premium market but approximate
- Communication text is template-generated but contextually coherent
- Transcripts are structured dialogue, not actual ASR output
Acceptance Criteria Verification
| Criterion | Status |
|---|---|
| 250 complete synthetic client graphs | ✅ |
| All 14 project names represented | ✅ |
| Spans CRM, interaction, opportunity, reminder, transcript, enrichment layers | ✅ |
| Files structured for CSV-first import testing | ✅ |
| Human reviewer can inspect a graph and believe it is coherent | ✅ (sample review recommended) |
| Referential integrity across all IDs | ✅ |
| No impossible date ordering | ✅ |
| No orphaned opportunities or interactions | ✅ |
| Every QD artifact points back to plausible evidence | ✅ |
Next Steps
- Import Replay: Load CSVs into the Velocity import pipeline and validate mapping proposals
- Client 360 Render: Use JSON snapshots to test frontend dossier rendering
- QD Validation: Verify score computation logic against interaction density
- Oracle Testing: Use workflow items to test writeback proposal generation
- Synthetic Expansion: Add more projects, cities, or persona types as needed
Generated for: Project Velocity Founder CRM and Platform Planning
Canonical Source: Doc 16 - Coding Agent Swarm Brief: Synthetic Client Graph Generation
Reviewers: Sayan, Sourik