# Project Velocity - Synthetic Client Graph Dataset

**Generated:** 2026-04-18  
**Dataset Version:** 1.0.0  
**Target:** 250 full synthetic client graphs  
**Owner:** Sagnik  
**Alignment:** Founder CRM and Platform Delivery Pack (Doc 16)

---

## Overview

This dataset contains 250 fully synthetic client graphs aligned to the Project Velocity canonical domain model. It is designed for:

- CRM module validation and testing
- Import pipeline replay testing
- Client 360 aggregation validation
- Oracle intelligence and writeback testing
- QD score and timeseries validation
- Communication capture and transcript processing
- Workflow and approval governance testing

The data simulates premium real-estate sales behavior in the Kolkata market across 14 projects.

---

## Geography and Inventory

**Market:** Kolkata and surrounding micro-markets  
**Projects:** 14 premium residential projects

| Project ID | Project Name | Developer | Micro-Market |
|------------|--------------|-----------|--------------|
| PRJ-001 | Eden Devprayag | Eden Group | Rajarhat |
| PRJ-002 | Sugam Prakriti | Sugam Homes | Barasat |
| PRJ-003 | Atri Aqua | Atri Developers | New Town |
| PRJ-004 | Atri Surya Toron | Atri Developers | Rajarhat |
| PRJ-005 | Siddha Suburbia Bungalow | Siddha Group | Madanpur |
| PRJ-006 | Merlin Avana | Merlin Group | Tangra |
| PRJ-007 | DTC Good Earth | DTC Projects | New Town |
| PRJ-008 | Siddha Serena | Siddha Group | New Town |
| PRJ-009 | Siddha Sky Waterfront | Siddha Group | Beliaghata |
| PRJ-010 | Godrej Blue | Godrej Properties | New Town |
| PRJ-011 | DTC Sojon | DTC Projects | Rajarhat |
| PRJ-012 | Shriram Grand City | Shriram Properties | Howrah |
| PRJ-013 | Godrej Elevate | Godrej Properties | Dum Dum |
| PRJ-014 | Ambuja Utpaala | Ambuja Neotia | Tollygunge |

---

## Dataset Composition

### Primary Entities

| Entity | Count | Description |
|--------|-------|-------------|
| Primary Clients (People) | 250 | Main decision-makers and buyers |
| Co-buyers/Family | 91 | Secondary contacts linked to households |
| Accounts (Organizations) | 153 | Employers, businesses, referral partners |
| Households | 118 | Family decision units |
| Relationships | 91 | Spouse, parent, sibling, business partner links |
| Leads | 250 | Funnel-stage qualification records |
| Opportunities | 400 | Deal pipeline objects (1-3 per client) |
| Property Interests | 400 | Project/unit preference records |
| Stage History | 1,373 | Lead stage transition audit trail |

### Interaction Graph

| Artifact | Count | Description |
|----------|-------|-------------|
| Interactions | 1,897 | Umbrella communication events |
| WhatsApp Messages | 3,367 | Text messages with realistic dialogue |
| WhatsApp Threads | 606 | Conversation thread summaries |
| Phone Calls | 478 | Call records with duration and direction |
| Transcripts | 231 | Speaker-segmented call transcripts |
| Emails | 149 | Business correspondence with subjects and bodies |
| Site Visits | 305 | Physical site visit records with notes |
| Reminders/Tasks | 759 | Follow-up items and action reminders |

### Intelligence & Enrichment

| Artifact | Count | Description |
|----------|-------|-------------|
| QD Scores | 250 | Latest qualification/disposition scores |
| QD Timeseries | 1,953 | Historical score propagation (4-12 pts/client) |
| Vehicle Events | 80 | Number-plate detection events |
| Perception Events | 60 | Behavioral/dwell-time intelligence |
| CCTV Links | 120 | Video clip references linked to visits |

### Workflow & Governance

| Artifact | Count | Description |
|----------|-------|-------------|
| Workflow Actions | 100 | Import reviews, merge proposals, writebacks |
| Approvals | 49 | Human review decisions |
| Writebacks | 28 | Approved canonical mutations |

### Inventory

| Artifact | Count | Description |
|----------|-------|-------------|
| Projects | 14 | Master project records |
| Units | 209 | Individual unit inventory (8-20 per project) |

---

## File Structure

```
synthetic_client_graphs/
├── csv/
│   ├── inventory_projects.csv
│   ├── inventory_units.csv
│   ├── crm_people.csv
│   ├── crm_accounts.csv
│   ├── crm_households.csv
│   ├── crm_relationships.csv
│   ├── crm_leads.csv
│   ├── crm_opportunities.csv
│   ├── crm_property_interests.csv
│   ├── crm_stage_history.csv
│   ├── intel_interactions.csv
│   ├── intel_messages.csv
│   ├── intel_calls.csv
│   ├── intel_transcripts.csv
│   ├── intel_emails.csv
│   ├── intel_whatsapp_threads.csv
│   ├── intel_visits.csv
│   ├── intel_reminders.csv
│   ├── intel_qd_scores.csv
│   ├── intel_qd_timeseries.csv
│   ├── intel_vehicle_events.csv
│   ├── intel_perception_events.csv
│   ├── intel_cctv_links.csv
│   ├── workflow_actions.csv
│   ├── workflow_approvals.csv
│   └── workflow_writebacks.csv
├── json/
│   ├── client_360_snapshots_batch_1.json  (Clients 1-50)
│   ├── client_360_snapshots_batch_2.json  (Clients 51-100)
│   ├── client_360_snapshots_batch_3.json  (Clients 101-150)
│   ├── client_360_snapshots_batch_4.json  (Clients 151-200)
│   ├── client_360_snapshots_batch_5.json  (Clients 201-250)
│   ├── import_mapping_manifest_example.json
│   ├── relationship_graph_map.json
│   └── transcript_sidecars.json
└── README.md
```

---

## Buyer Persona Distribution

The 250 primary clients are distributed across realistic premium real-estate buyer personas:

| Persona | Percentage | Count | Characteristics |
|---------|-----------|-------|-----------------|
| High-Intent Buyer | 20% | ~50 | Quick decision cycle, clear requirements, responsive |
| Slow-Burn Investor | 18% | ~45 | Long horizon, price-sensitive, comparison-heavy |
| NRI Buyer | 12% | ~30 | Remote decision-making, video calls, family proxies |
| Family Decision Unit | 20% | ~50 | Multiple stakeholders, consensus-driven, Vastu-conscious |
| Price-Sensitive Aspirational | 15% | ~37 | Stretch budget, EMI-focused, festival-offer hunters |
| Broker/Referral Chain | 8% | ~20 | Multiple client representations, commission-focused |
| Repeat Visitor | 7% | ~18 | High engagement, multiple visits, decision paralysis |

---

## Canonical Domain Alignment

This dataset maps to the planned Velocity canonical domains:

### `crm_*` Domain

- `crm_people`: Contact identity and demographics
- `crm_accounts`: Organization and employer records
- `crm_households`: Family and co-buyer structures
- `crm_relationships`: Person-to-person linkages
- `crm_leads`: Funnel stage and qualification
- `crm_opportunities`: Deal pipeline and valuation
- `crm_property_interests`: Project/unit preferences
- `crm_stage_history`: Audit trail of stage transitions

### `intel_*` Domain

- `intel_interactions`: Unified communication events
- `intel_messages`: Text-level message records
- `intel_calls`: Call metadata and duration
- `intel_transcripts`: Speaker-segmented conversation text
- `intel_emails`: Email correspondence
- `intel_whatsapp_threads`: Thread-level summaries
- `intel_visits`: Site visit records and notes
- `intel_reminders`: Task and follow-up tracking
- `intel_qd_scores`: Qualification/disposition scores
- `intel_qd_timeseries`: Temporal score evolution
- `intel_vehicle_events`: Parking/entry detection
- `intel_perception_events`: Behavioral intelligence
- `intel_cctv_links`: Video evidence references

### `inventory_*` Domain

- `inventory_projects`: Master project catalog
- `inventory_units`: Unit-level availability and pricing

### `workflow_*` Domain

- `workflow_actions`: Proposed AI/human actions
- `workflow_approvals`: Review decisions
- `workflow_writebacks`: Committed mutations

---

## Quality Assurance

### Referential Integrity

All foreign key relationships have been validated:

- ✅ All `lead.person_id` values exist in `crm_people`
- ✅ All `opportunity.lead_id` values exist in `crm_leads`
- ✅ All `interaction.person_id` values exist in `crm_people`
- ✅ All `visit.person_id` values exist in `crm_people`
- ✅ All `qd_score.person_id` values exist in `crm_people`
- ✅ No orphaned stage history records
- ✅ All `opportunity.project_id` values exist in `inventory_projects`
- ✅ All `property_interest.project_id` values exist in `inventory_projects`

### Temporal Consistency

- ✅ Lead creation dates precede interaction dates
- ✅ Stage history transitions are monotonic in time
- ✅ QD timeseries points are chronologically ordered
- ✅ Visit dates align with lead stage progression
- ✅ Reminder due dates follow interaction dates

### Realism Rules Applied

- **Names:** Realistic Indian names (Bengali, Hindi, mixed demographics)
- **Organizations:** Major Indian IT, banking, manufacturing, and consulting firms
- **Communication:** Premium property sales tone, not generic retail
- **Stage Transitions:** Narratively coherent (enquiry → visit → negotiation → booking)
- **Sales Cadence:** Realistic follow-up intervals (3-15 days between touches)
- **Dialogue:** Context-aware transcripts referencing specific projects, prices, and objections
- **Budgets:** Aligned to Kolkata premium market (1.5 Cr - 25 Cr range)

---

## Usage Instructions

### CSV-First Import Testing

1. Start with `crm_people.csv` as the identity anchor
2. Join `crm_leads.csv` on `person_id`
3. Join `crm_opportunities.csv` on `lead_id`
4. Join `inventory_projects.csv` and `inventory_units.csv` on project/unit IDs
5. Map `intel_interactions.csv` on `person_id` for communication history
6. Aggregate `intel_qd_scores.csv` and `intel_qd_timeseries.csv` for intelligence

### Client 360 Validation

Load `json/client_360_snapshots_batch_*.json` to validate:

- Aggregation accuracy
- Cross-domain joining
- Derived field computation
- Missing data handling

### Oracle Writeback Testing

Use `workflow_actions.csv` and `workflow_writebacks.csv` to test:

- Proposal generation
- Approval flow simulation
- Canonical mutation application
- Audit trail completeness

### Transcript Processing

Load `json/transcript_sidecars.json` for:

- Speaker diarization validation
- Conversation context extraction
- Sentiment and intent inference testing

---

## Evidence Placeholders

The dataset includes metadata placeholders for:

- CCTV clip references (`clips/VIS_{visit_id}_{random}.mp4`)
- Call recording references (`rec/CAL_{call_id}.mp3`)
- Transcript references (`trx/CAL_{call_id}.json`)
- Camera IDs and gate references

These are structured metadata only. Actual media payloads are not included.

---

## Synthetic Data Limitations

1. **Names and addresses** are fictional but culturally realistic
2. **Phone numbers** follow Indian format but are not real
3. **Email addresses** are synthetic and non-deliverable
4. **Prices** are representative of Kolkata premium market but approximate
5. **Communication text** is template-generated but contextually coherent
6. **Transcripts** are structured dialogue, not actual ASR output

---

## Acceptance Criteria Verification

| Criterion | Status |
|-----------|--------|
| 250 complete synthetic client graphs | ✅ |
| All 14 project names represented | ✅ |
| Spans CRM, interaction, opportunity, reminder, transcript, enrichment layers | ✅ |
| Files structured for CSV-first import testing | ✅ |
| Human reviewer can inspect a graph and believe it is coherent | ✅ (sample review recommended) |
| Referential integrity across all IDs | ✅ |
| No impossible date ordering | ✅ |
| No orphaned opportunities or interactions | ✅ |
| Every QD artifact points back to plausible evidence | ✅ |

---

## Next Steps

1. **Import Replay:** Load CSVs into the Velocity import pipeline and validate mapping proposals
2. **Client 360 Render:** Use JSON snapshots to test frontend dossier rendering
3. **QD Validation:** Verify score computation logic against interaction density
4. **Oracle Testing:** Use workflow items to test writeback proposal generation
5. **Synthetic Expansion:** Add more projects, cities, or persona types as needed

---

**Generated for:** Project Velocity Founder CRM and Platform Planning  
**Canonical Source:** Doc 16 - Coding Agent Swarm Brief: Synthetic Client Graph Generation  
**Reviewers:** Sayan, Sourik
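
## Example: Referential Integrity Check

The referential-integrity claims in the Quality Assurance section (for example, that every `lead.person_id` exists in `crm_people`) can be re-verified with a small script. The following is a minimal sketch using only the Python standard library. It runs against tiny in-memory stand-ins rather than the shipped files. The ID formats (`P-001`, `L-001`, `O-001`), the `opportunity_id` column name, and the helpers `read_rows` and `find_orphans` are all illustrative assumptions, not part of the dataset specification. Only `person_id`, `lead_id`, and `project_id` are taken from this README.

```python
import csv
import io

# Tiny in-memory stand-ins for crm_people.csv, crm_leads.csv and
# crm_opportunities.csv. Row values and ID formats are made up for this
# sketch; the real column layout may differ.
PEOPLE_CSV = "person_id\nP-001\nP-002\n"
LEADS_CSV = "lead_id,person_id\nL-001,P-001\nL-002,P-002\nL-003,P-999\n"
OPPS_CSV = (
    "opportunity_id,lead_id,project_id\n"
    "O-001,L-001,PRJ-001\n"
    "O-002,L-404,PRJ-003\n"
)


def read_rows(text):
    """Parse CSV text into a list of dict rows (hypothetical helper)."""
    return list(csv.DictReader(io.StringIO(text)))


def find_orphans(child_rows, fk, parent_rows, pk):
    """Return child rows whose foreign key value is absent from the parent."""
    parent_keys = {row[pk] for row in parent_rows}
    return [row for row in child_rows if row[fk] not in parent_keys]


people = read_rows(PEOPLE_CSV)
leads = read_rows(LEADS_CSV)
opps = read_rows(OPPS_CSV)

# Deliberately broken sample rows: L-003 points at a missing person and
# O-002 points at a missing lead, so both checks report one orphan.
orphan_leads = find_orphans(leads, "person_id", people, "person_id")
orphan_opps = find_orphans(opps, "lead_id", leads, "lead_id")

print([row["lead_id"] for row in orphan_leads])
print([row["opportunity_id"] for row in orphan_opps])
```

To run the same check against the real files, `read_rows` could be swapped for `csv.DictReader(open(path, newline=""))` over the files in `csv/`. If the integrity claims above hold, both orphan lists should then come back empty.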
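
## Example: CSV-First Join Walk

The first steps of the CSV-first import order (people as the identity anchor, then leads on `person_id`, then opportunities on `lead_id`, then projects on `project_id`) can be sketched as a chain of dictionary lookups. This is an illustrative sketch under stated assumptions, not the Velocity import pipeline. The columns `full_name`, `opportunity_id`, and `project_name`, the ID formats for people, leads, and opportunities, and the helpers `read_rows`, `group_by`, and `build_client_view` are hypothetical. The project IDs and names are taken from the project table in this README.

```python
import csv
import io
from collections import defaultdict

# In-memory stand-ins for the CSV files; values other than the project
# IDs and names are invented for this sketch.
PEOPLE_CSV = "person_id,full_name\nP-001,Example Person A\nP-002,Example Person B\n"
LEADS_CSV = "lead_id,person_id\nL-001,P-001\nL-002,P-002\n"
OPPS_CSV = (
    "opportunity_id,lead_id,project_id\n"
    "O-001,L-001,PRJ-003\n"
    "O-002,L-001,PRJ-010\n"
    "O-003,L-002,PRJ-006\n"
)
PROJECTS_CSV = (
    "project_id,project_name\n"
    "PRJ-003,Atri Aqua\n"
    "PRJ-006,Merlin Avana\n"
    "PRJ-010,Godrej Blue\n"
)


def read_rows(text):
    """Parse CSV text into a list of dict rows (hypothetical helper)."""
    return list(csv.DictReader(io.StringIO(text)))


def group_by(rows, key):
    """Index rows by a key column, keeping every row that shares a key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return groups


def build_client_view(people, leads, opps, projects):
    """Walk people -> leads -> opportunities -> projects for each person."""
    leads_by_person = group_by(leads, "person_id")
    opps_by_lead = group_by(opps, "lead_id")
    project_names = {p["project_id"]: p["project_name"] for p in projects}
    view = {}
    for person in people:
        names = []
        for lead in leads_by_person.get(person["person_id"], []):
            for opp in opps_by_lead.get(lead["lead_id"], []):
                # Unknown project IDs are flagged rather than dropped.
                names.append(project_names.get(opp["project_id"], "UNKNOWN"))
        view[person["person_id"]] = names
    return view


client_view = build_client_view(
    read_rows(PEOPLE_CSV),
    read_rows(LEADS_CSV),
    read_rows(OPPS_CSV),
    read_rows(PROJECTS_CSV),
)
print(client_view)
```

Starting from people and walking outward keeps every client in the result even when a client has no leads or opportunities, which is useful when comparing against the Client 360 snapshots for missing-data handling.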