Project Velocity - Synthetic Client Graph Dataset

Generated: 2026-04-18
Dataset Version: 1.0.0
Target: 250 full synthetic client graphs
Owner: Sagnik
Alignment: Founder CRM and Platform Delivery Pack (Doc 16)


Overview

This dataset contains 250 fully synthetic client graphs aligned to the Project Velocity canonical domain model. It is designed for:

  • CRM module validation and testing
  • Import pipeline replay testing
  • Client 360 aggregation validation
  • Oracle intelligence and writeback testing
  • QD score and timeseries validation
  • Communication capture and transcript processing
  • Workflow and approval governance testing

The data simulates premium real-estate sales behavior in the Kolkata market across 14 projects.


Geography and Inventory

Market: Kolkata and surrounding micro-markets
Projects: 14 premium residential projects

Project ID  Project Name              Developer           Micro-Market
PRJ-001     Eden Devprayag            Eden Group          Rajarhat
PRJ-002     Sugam Prakriti            Sugam Homes         Barasat
PRJ-003     Atri Aqua                 Atri Developers     New Town
PRJ-004     Atri Surya Toron          Atri Developers     Rajarhat
PRJ-005     Siddha Suburbia Bungalow  Siddha Group        Madanpur
PRJ-006     Merlin Avana              Merlin Group        Tangra
PRJ-007     DTC Good Earth            DTC Projects        New Town
PRJ-008     Siddha Serena             Siddha Group        New Town
PRJ-009     Siddha Sky Waterfront     Siddha Group        Beliaghata
PRJ-010     Godrej Blue               Godrej Properties   New Town
PRJ-011     DTC Sojon                 DTC Projects        Rajarhat
PRJ-012     Shriram Grand City        Shriram Properties  Howrah
PRJ-013     Godrej Elevate            Godrej Properties   Dum Dum
PRJ-014     Ambuja Utpaala            Ambuja Neotia       Tollygunge

Dataset Composition

Primary Entities

Entity                    Count  Description
Primary Clients (People)  250    Main decision-makers and buyers
Co-buyers/Family          91     Secondary contacts linked to households
Accounts (Organizations)  153    Employers, businesses, referral partners
Households                118    Family decision units
Relationships             91     Spouse, parent, sibling, business partner links
Leads                     250    Funnel-stage qualification records
Opportunities             400    Deal pipeline objects (1-3 per client)
Property Interests        400    Project/unit preference records
Stage History             1,373  Lead stage transition audit trail

Interaction Graph

Artifact           Count  Description
Interactions       1,897  Umbrella communication events
WhatsApp Messages  3,367  Text messages with realistic dialogue
WhatsApp Threads   606    Conversation thread summaries
Phone Calls        478    Call records with duration and direction
Transcripts        231    Speaker-segmented call transcripts
Emails             149    Business correspondence with subjects and bodies
Site Visits        305    Physical site visit records with notes
Reminders/Tasks    759    Follow-up items and action reminders

Intelligence & Enrichment

Artifact           Count  Description
QD Scores          250    Latest qualification/disposition scores
QD Timeseries      1,953  Historical score propagation (4-12 points per client)
Vehicle Events     80     Number-plate detection events
Perception Events  60     Behavioral/dwell-time intelligence
CCTV Links         120    Video clip references linked to visits

Workflow & Governance

Artifact          Count  Description
Workflow Actions  100    Import reviews, merge proposals, writebacks
Approvals         49     Human review decisions
Writebacks        28     Approved canonical mutations

Inventory

Artifact  Count  Description
Projects  14     Master project records
Units     209    Individual unit inventory (8-20 per project)
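
The per-group ranges quoted in the tables above (1-3 opportunities per client, 4-12 QD points per client, 8-20 units per project) can be spot-checked with a small grouping helper. The following is a minimal sketch, not part of the dataset tooling. It assumes rows are dictionaries as produced by csv.DictReader, and that inventory_units.csv carries a project_id column; that column name is inferred from the integrity rules in this README rather than confirmed.

```python
from collections import Counter

def counts_outside_range(rows, key, low, high):
    """Return {group_value: row_count} for groups whose size is outside [low, high].

    Groups with zero rows never appear in `rows`, so they are not reported here.
    """
    counts = Counter(row[key] for row in rows)
    return {value: n for value, n in counts.items() if not (low <= n <= high)}

# Tiny in-memory sample standing in for inventory_units.csv rows.
# The project_id column name is an assumption.
sample_units = (
    [{"project_id": "PRJ-001"}] * 8    # within the documented 8-20 range
    + [{"project_id": "PRJ-002"}] * 3  # too few units
)

violations = counts_outside_range(sample_units, "project_id", 8, 20)
print(violations)  # {'PRJ-002': 3}
```

The same helper would apply to opportunities or QD timeseries rows by changing the key and the bounds.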

File Structure

synthetic_client_graphs/
├── csv/
│   ├── inventory_projects.csv
│   ├── inventory_units.csv
│   ├── crm_people.csv
│   ├── crm_accounts.csv
│   ├── crm_households.csv
│   ├── crm_relationships.csv
│   ├── crm_leads.csv
│   ├── crm_opportunities.csv
│   ├── crm_property_interests.csv
│   ├── crm_stage_history.csv
│   ├── intel_interactions.csv
│   ├── intel_messages.csv
│   ├── intel_calls.csv
│   ├── intel_transcripts.csv
│   ├── intel_emails.csv
│   ├── intel_whatsapp_threads.csv
│   ├── intel_visits.csv
│   ├── intel_reminders.csv
│   ├── intel_qd_scores.csv
│   ├── intel_qd_timeseries.csv
│   ├── intel_vehicle_events.csv
│   ├── intel_perception_events.csv
│   ├── intel_cctv_links.csv
│   ├── workflow_actions.csv
│   ├── workflow_approvals.csv
│   └── workflow_writebacks.csv
├── json/
│   ├── client_360_snapshots_batch_1.json (Clients 1-50)
│   ├── client_360_snapshots_batch_2.json (Clients 51-100)
│   ├── client_360_snapshots_batch_3.json (Clients 101-150)
│   ├── client_360_snapshots_batch_4.json (Clients 151-200)
│   ├── client_360_snapshots_batch_5.json (Clients 201-250)
│   ├── import_mapping_manifest_example.json
│   ├── relationship_graph_map.json
│   └── transcript_sidecars.json
└── README.md
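
Before running any import test it can help to confirm that the CSV files hold the documented number of rows. The sketch below is illustrative only. It assumes each CSV has a single header row, and it assumes a one-to-one mapping between the files listed and the entity counts in the composition tables, which is inferred from the file names. Only the least ambiguous mappings are included.

```python
import csv
from pathlib import Path

# Expected data-row counts, copied from the composition tables.
# The file-to-entity mapping is an assumption based on file names.
EXPECTED_ROWS = {
    "inventory_projects.csv": 14,
    "inventory_units.csv": 209,
    "crm_leads.csv": 250,
    "crm_opportunities.csv": 400,
}

def row_count_report(csv_dir, expected=EXPECTED_ROWS):
    """Return {filename: (expected, actual)} for every mismatch.

    `actual` is None when the file does not exist. An empty dict means
    every listed file matched its expected count.
    """
    report = {}
    for name, want in expected.items():
        path = Path(csv_dir) / name
        if not path.is_file():
            report[name] = (want, None)
            continue
        with path.open(newline="", encoding="utf-8") as fh:
            actual = sum(1 for _ in csv.DictReader(fh))  # header is not counted
        if actual != want:
            report[name] = (want, actual)
    return report
```

Calling row_count_report("synthetic_client_graphs/csv") on an intact copy of the dataset should return an empty dict if the assumed mapping holds.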

Buyer Persona Distribution

The 250 primary clients are distributed across realistic premium real-estate buyer personas:

Persona                       Percentage  Count  Characteristics
High-Intent Buyer             20%         ~50    Quick decision cycle, clear requirements, responsive
Slow-Burn Investor            18%         ~45    Long horizon, price-sensitive, comparison-heavy
NRI Buyer                     12%         ~30    Remote decision-making, video calls, family proxies
Family Decision Unit          20%         ~50    Multiple stakeholders, consensus-driven, Vastu-conscious
Price-Sensitive Aspirational  15%         ~37    Stretch budget, EMI-focused, festival-offer hunters
Broker/Referral Chain         8%          ~20    Multiple client representations, commission-focused
Repeat Visitor                7%          ~18    High engagement, multiple visits, decision paralysis
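
The approximate counts follow directly from the percentages applied to 250 clients. As a quick worked check (a sketch only, using the figures from the table above):

```python
TOTAL_CLIENTS = 250

# Percentages copied from the persona table.
PERSONA_SHARE = {
    "High-Intent Buyer": 20,
    "Slow-Burn Investor": 18,
    "NRI Buyer": 12,
    "Family Decision Unit": 20,
    "Price-Sensitive Aspirational": 15,
    "Broker/Referral Chain": 8,
    "Repeat Visitor": 7,
}

# Exact (possibly fractional) head count implied by each percentage.
exact = {name: TOTAL_CLIENTS * pct / 100 for name, pct in PERSONA_SHARE.items()}

for name, value in exact.items():
    print(f"{name}: {value:g}")
```

Two personas land on half values (37.5 and 17.5), which the table lists as ~37 and ~18, so the listed counts still total 250.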

Canonical Domain Alignment

This dataset maps to the planned Velocity canonical domains:

crm_* Domain

  • crm_people: Contact identity and demographics
  • crm_accounts: Organization and employer records
  • crm_households: Family and co-buyer structures
  • crm_relationships: Person-to-person linkages
  • crm_leads: Funnel stage and qualification
  • crm_opportunities: Deal pipeline and valuation
  • crm_property_interests: Project/unit preferences
  • crm_stage_history: Audit trail of stage transitions

intel_* Domain

  • intel_interactions: Unified communication events
  • intel_messages: Text-level message records
  • intel_calls: Call metadata and duration
  • intel_transcripts: Speaker-segmented conversation text
  • intel_emails: Email correspondence
  • intel_whatsapp_threads: Thread-level summaries
  • intel_visits: Site visit records and notes
  • intel_reminders: Task and follow-up tracking
  • intel_qd_scores: Qualification/disposition scores
  • intel_qd_timeseries: Temporal score evolution
  • intel_vehicle_events: Parking/entry detection
  • intel_perception_events: Behavioral intelligence
  • intel_cctv_links: Video evidence references

inventory_* Domain

  • inventory_projects: Master project catalog
  • inventory_units: Unit-level availability and pricing

workflow_* Domain

  • workflow_actions: Proposed AI/human actions
  • workflow_approvals: Review decisions
  • workflow_writebacks: Committed mutations

Quality Assurance

Referential Integrity

All foreign key relationships have been validated:

  • All lead.person_id values exist in crm_people
  • All opportunity.lead_id values exist in crm_leads
  • All interaction.person_id values exist in crm_people
  • All visit.person_id values exist in crm_people
  • All qd_score.person_id values exist in crm_people
  • No orphaned stage history records
  • All opportunity.project_id values exist in inventory_projects
  • All property_interest.project_id values exist in inventory_projects
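
Each rule above is the same check: every foreign-key value in a child table must appear as a key in the parent table. A minimal sketch of that check with the standard csv module follows. The sample rows and ID formats are invented for illustration; only the person_id and lead_id column names come from this README.

```python
import csv
import io

def orphaned_keys(child_rows, fk, parent_rows, pk):
    """Return the sorted foreign-key values in child_rows with no match in parent_rows."""
    parent_ids = {row[pk] for row in parent_rows}
    return sorted({row[fk] for row in child_rows} - parent_ids)

# In-memory stand-ins for crm_people.csv and crm_leads.csv (invented rows).
people_csv = "person_id\nP001\nP002\n"
leads_csv = "lead_id,person_id\nL001,P001\nL002,P003\n"

people = list(csv.DictReader(io.StringIO(people_csv)))
leads = list(csv.DictReader(io.StringIO(leads_csv)))

print(orphaned_keys(leads, "person_id", people, "person_id"))  # ['P003']
```

An empty list for every child/parent pair corresponds to the "validated" state described above.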

Temporal Consistency

  • Lead creation dates precede interaction dates
  • Stage history transitions are monotonic in time
  • QD timeseries points are chronologically ordered
  • Visit dates align with lead stage progression
  • Reminder due dates follow interaction dates
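
The ordering rules can be re-verified by walking each group of rows and flagging any step backwards in time. The sketch below is one possible check, not the generator's own validation. It assumes timestamps are ISO 8601 strings, and the changed_at column name is hypothetical.

```python
from datetime import datetime

def out_of_order(rows, group_key, time_key):
    """Return sorted group ids where some row is earlier than the row before it.

    Rows are compared in the order given, per group, so this checks that the
    file itself is chronologically ordered within each group.
    """
    last_seen = {}
    bad = set()
    for row in rows:
        ts = datetime.fromisoformat(row[time_key])
        gid = row[group_key]
        if gid in last_seen and ts < last_seen[gid]:
            bad.add(gid)
        last_seen[gid] = ts
    return sorted(bad)

# Invented sample rows; `changed_at` is a hypothetical column name.
stage_history = [
    {"lead_id": "L001", "changed_at": "2026-01-05T10:00:00"},
    {"lead_id": "L001", "changed_at": "2026-01-12T09:30:00"},
    {"lead_id": "L002", "changed_at": "2026-02-01T11:00:00"},
    {"lead_id": "L002", "changed_at": "2026-01-20T11:00:00"},  # goes backwards
]

print(out_of_order(stage_history, "lead_id", "changed_at"))  # ['L002']
```

The same function would cover QD timeseries ordering by grouping on person_id instead.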

Realism Rules Applied

  • Names: Realistic Indian names (Bengali, Hindi, mixed demographics)
  • Organizations: Major Indian IT, banking, manufacturing, and consulting firms
  • Communication: Premium property sales tone, not generic retail
  • Stage Transitions: Narratively coherent (enquiry → visit → negotiation → booking)
  • Sales Cadence: Realistic follow-up intervals (3-15 days between touches)
  • Dialogue: Context-aware transcripts referencing specific projects, prices, and objections
  • Budgets: Aligned to Kolkata premium market (1.5 Cr - 25 Cr range)

Usage Instructions

CSV-First Import Testing

  1. Start with crm_people.csv as the identity anchor
  2. Join crm_leads.csv on person_id
  3. Join crm_opportunities.csv on lead_id
  4. Join inventory_projects.csv and inventory_units.csv on project/unit IDs
  5. Map intel_interactions.csv on person_id for communication history
  6. Aggregate intel_qd_scores.csv and intel_qd_timeseries.csv for intelligence
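
Steps 1 to 3 can be sketched with plain dictionaries, without any database. This is an illustrative outline under stated assumptions: person_id and lead_id are the join keys named above, while opportunity_id and the sample ID values are hypothetical.

```python
import csv
import io
from collections import defaultdict

def index_by(rows, key):
    """Group rows into {key_value: [rows...]}."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row)
    return grouped

def build_client_view(people, leads, opportunities):
    """Anchor on people, then attach lead ids and opportunity ids per person."""
    leads_by_person = index_by(leads, "person_id")
    opps_by_lead = index_by(opportunities, "lead_id")
    view = {}
    for person in people:
        person_leads = leads_by_person.get(person["person_id"], [])
        view[person["person_id"]] = {
            "leads": [lead["lead_id"] for lead in person_leads],
            "opportunities": [
                opp["opportunity_id"]  # hypothetical column name
                for lead in person_leads
                for opp in opps_by_lead.get(lead["lead_id"], [])
            ],
        }
    return view

def read(text):
    return list(csv.DictReader(io.StringIO(text)))

# Invented in-memory samples standing in for the three CSV files.
people = read("person_id\nP001\nP002\n")
leads = read("lead_id,person_id\nL001,P001\n")
opps = read("opportunity_id,lead_id\nO001,L001\nO002,L001\n")

client_view = build_client_view(people, leads, opps)
print(client_view["P001"])  # {'leads': ['L001'], 'opportunities': ['O001', 'O002']}
```

People with no leads still appear in the result with empty lists, which keeps crm_people.csv as the identity anchor.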

Client 360 Validation

Load json/client_360_snapshots_batch_*.json to validate:

  • Aggregation accuracy
  • Cross-domain joining
  • Derived field computation
  • Missing data handling
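
A loader for the five batch files might look like the sketch below. The internal layout of the snapshot files is not described in this README, so the code assumes each file holds a top-level JSON array of per-client objects and raises if that assumption is wrong.

```python
import json
from pathlib import Path

def load_snapshots(json_dir):
    """Load every client_360_snapshots_batch_*.json file into one list, in batch order.

    Assumes each file contains a top-level JSON array (not confirmed by the README).
    """
    paths = sorted(
        Path(json_dir).glob("client_360_snapshots_batch_*.json"),
        key=lambda p: int(p.stem.rsplit("_", 1)[1]),  # numeric batch suffix
    )
    snapshots = []
    for path in paths:
        with path.open(encoding="utf-8") as fh:
            batch = json.load(fh)
        if not isinstance(batch, list):
            raise ValueError(f"{path.name}: expected a top-level JSON array")
        snapshots.extend(batch)
    return snapshots
```

If the array assumption holds, len(load_snapshots("synthetic_client_graphs/json")) should equal 250, one snapshot per primary client.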

Oracle Writeback Testing

Use workflow_actions.csv, workflow_approvals.csv, and workflow_writebacks.csv to test:

  • Proposal generation
  • Approval flow simulation
  • Canonical mutation application
  • Audit trail completeness
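
One audit-trail invariant implied by the counts (writebacks are described as approved canonical mutations) is that every writeback traces back to an approved review decision. The sketch below checks that invariant. All column names (writeback_id, action_id, decision) and the "approved" value are hypothetical and would need to be matched to the real headers in the workflow_* files.

```python
def writebacks_without_approval(writebacks, approvals, approved_value="approved"):
    """Return ids of writebacks whose action has no approval with the approved decision."""
    approved_actions = {
        row["action_id"] for row in approvals if row["decision"] == approved_value
    }
    return [
        row["writeback_id"]
        for row in writebacks
        if row["action_id"] not in approved_actions
    ]

# Invented sample rows; every column name here is an assumption.
approvals = [
    {"action_id": "A001", "decision": "approved"},
    {"action_id": "A002", "decision": "rejected"},
]
writebacks = [
    {"writeback_id": "W001", "action_id": "A001"},
    {"writeback_id": "W002", "action_id": "A002"},  # action was rejected
    {"writeback_id": "W003", "action_id": "A003"},  # action never reviewed
]

print(writebacks_without_approval(writebacks, approvals))  # ['W002', 'W003']
```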

Transcript Processing

Load json/transcript_sidecars.json for:

  • Speaker diarization validation
  • Conversation context extraction
  • Sentiment and intent inference testing

Evidence Placeholders

The dataset includes metadata placeholders for:

  • CCTV clip references (clips/VIS_{visit_id}_{random}.mp4)
  • Call recording references (rec/CAL_{call_id}.mp3)
  • Transcript references (trx/CAL_{call_id}.json)
  • Camera IDs and gate references

These are structured metadata only. Actual media payloads are not included.
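
The three path templates above can be recognized with regular expressions, which is useful when validating that a placeholder points back to a known visit or call. This is a sketch under assumptions: the ID formats in the examples are invented, and the clip pattern treats the text after the last underscore as the random suffix.

```python
import re

# One pattern per placeholder template listed above.
PATTERNS = {
    "cctv_clip": re.compile(r"clips/VIS_(?P<visit_id>.+)_(?P<random>[^_/]+)\.mp4"),
    "call_recording": re.compile(r"rec/CAL_(?P<call_id>[^/]+)\.mp3"),
    "transcript": re.compile(r"trx/CAL_(?P<call_id>[^/]+)\.json"),
}

def classify_reference(path):
    """Return (kind, extracted_ids) for a recognized placeholder path, else None."""
    for kind, pattern in PATTERNS.items():
        match = pattern.fullmatch(path)
        if match:
            return kind, match.groupdict()
    return None

# Example paths use made-up ID formats.
print(classify_reference("clips/VIS_V0012_a1b2c3.mp4"))
# ('cctv_clip', {'visit_id': 'V0012', 'random': 'a1b2c3'})
print(classify_reference("rec/CAL_C0456.mp3"))
# ('call_recording', {'call_id': 'C0456'})
```

The extracted visit_id or call_id can then be checked against intel_visits.csv or intel_calls.csv, assuming those files expose matching ID columns.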


Synthetic Data Limitations

  1. Names and addresses are fictional but culturally realistic
  2. Phone numbers follow Indian format but are not real
  3. Email addresses are synthetic and non-deliverable
  4. Prices are representative of Kolkata premium market but approximate
  5. Communication text is template-generated but contextually coherent
  6. Transcripts are structured dialogue, not actual ASR output

Acceptance Criteria Verification

Criterion Status
250 complete synthetic client graphs
All 14 project names represented
Spans CRM, interaction, opportunity, reminder, transcript, enrichment layers
Files structured for CSV-first import testing
Human reviewer can inspect a graph and believe it is coherent (sample review recommended)
Referential integrity across all IDs
No impossible date ordering
No orphaned opportunities or interactions
Every QD artifact points back to plausible evidence

Next Steps

  1. Import Replay: Load CSVs into the Velocity import pipeline and validate mapping proposals
  2. Client 360 Render: Use JSON snapshots to test frontend dossier rendering
  3. QD Validation: Verify score computation logic against interaction density
  4. Oracle Testing: Use workflow items to test writeback proposal generation
  5. Synthetic Expansion: Add more projects, cities, or persona types as needed

Generated for: Project Velocity Founder CRM and Platform Planning
Canonical Source: Doc 16 - Coding Agent Swarm Brief: Synthetic Client Graph Generation
Reviewers: Sayan, Sourik