Project Velocity - Synthetic Client Graph Dataset

Generated: 2026-04-18
Dataset Version: 1.0.0
Target: 250 full synthetic client graphs
Owner: Sagnik
Alignment: Founder CRM and Platform Delivery Pack (Doc 16)

Overview

This dataset contains 250 fully synthetic client graphs aligned to the Project Velocity canonical domain model. It is designed for:

CRM module validation and testing
Import pipeline replay testing
Client 360 aggregation validation
Oracle intelligence and writeback testing
QD score and timeseries validation
Communication capture and transcript processing
Workflow and approval governance testing

The data simulates premium real-estate sales behavior in the Kolkata market across 14 projects.

Geography and Inventory

Market: Kolkata and surrounding micro-markets
Projects: 14 premium residential projects

Project ID	Project Name	Developer	Micro-Market
PRJ-001	Eden Devprayag	Eden Group	Rajarhat
PRJ-002	Sugam Prakriti	Sugam Homes	Barasat
PRJ-003	Atri Aqua	Atri Developers	New Town
PRJ-004	Atri Surya Toron	Atri Developers	Rajarhat
PRJ-005	Siddha Suburbia Bungalow	Siddha Group	Madanpur
PRJ-006	Merlin Avana	Merlin Group	Tangra
PRJ-007	DTC Good Earth	DTC Projects	New Town
PRJ-008	Siddha Serena	Siddha Group	New Town
PRJ-009	Siddha Sky Waterfront	Siddha Group	Beliaghata
PRJ-010	Godrej Blue	Godrej Properties	New Town
PRJ-011	DTC Sojon	DTC Projects	Rajarhat
PRJ-012	Shriram Grand City	Shriram Properties	Howrah
PRJ-013	Godrej Elevate	Godrej Properties	Dum Dum
PRJ-014	Ambuja Utpaala	Ambuja Neotia	Tollygunge

Dataset Composition

Primary Entities

Entity	Count	Description
Primary Clients (People)	250	Main decision-makers and buyers
Co-buyers/Family	91	Secondary contacts linked to households
Accounts (Organizations)	153	Employers, businesses, referral partners
Households	118	Family decision units
Relationships	91	Spouse, parent, sibling, business partner links
Leads	250	Funnel-stage qualification records
Opportunities	400	Deal pipeline objects (1-3 per client)
Property Interests	400	Project/unit preference records
Stage History	1,373	Lead stage transition audit trail

Interaction Graph

Artifact	Count	Description
Interactions	1,897	Umbrella communication events
WhatsApp Messages	3,367	Text messages with realistic dialogue
WhatsApp Threads	606	Conversation thread summaries
Phone Calls	478	Call records with duration and direction
Transcripts	231	Speaker-segmented call transcripts
Emails	149	Business correspondence with subjects and bodies
Site Visits	305	Physical site visit records with notes
Reminders/Tasks	759	Follow-up items and action reminders

Intelligence & Enrichment

Artifact	Count	Description
QD Scores	250	Latest qualification/disposition scores
QD Timeseries	1,953	Historical score propagation (4-12 points per client)
Vehicle Events	80	Number-plate detection events
Perception Events	60	Behavioral/dwell-time intelligence
CCTV Links	120	Video clip references linked to visits

Workflow & Governance

Artifact	Count	Description
Workflow Actions	100	Import reviews, merge proposals, writebacks
Approvals	49	Human review decisions
Writebacks	28	Approved canonical mutations

Inventory

Artifact	Count	Description
Projects	14	Master project records
Units	209	Individual unit inventory (8-20 per project)

File Structure

synthetic_client_graphs/
├── csv/
│   ├── inventory_projects.csv
│   ├── inventory_units.csv
│   ├── crm_people.csv
│   ├── crm_accounts.csv
│   ├── crm_households.csv
│   ├── crm_relationships.csv
│   ├── crm_leads.csv
│   ├── crm_opportunities.csv
│   ├── crm_property_interests.csv
│   ├── crm_stage_history.csv
│   ├── intel_interactions.csv
│   ├── intel_messages.csv
│   ├── intel_calls.csv
│   ├── intel_transcripts.csv
│   ├── intel_emails.csv
│   ├── intel_whatsapp_threads.csv
│   ├── intel_visits.csv
│   ├── intel_reminders.csv
│   ├── intel_qd_scores.csv
│   ├── intel_qd_timeseries.csv
│   ├── intel_vehicle_events.csv
│   ├── intel_perception_events.csv
│   ├── intel_cctv_links.csv
│   ├── workflow_actions.csv
│   ├── workflow_approvals.csv
│   └── workflow_writebacks.csv
├── json/
│   ├── client_360_snapshots_batch_1.json (Clients 1-50)
│   ├── client_360_snapshots_batch_2.json (Clients 51-100)
│   ├── client_360_snapshots_batch_3.json (Clients 101-150)
│   ├── client_360_snapshots_batch_4.json (Clients 151-200)
│   ├── client_360_snapshots_batch_5.json (Clients 201-250)
│   ├── import_mapping_manifest_example.json
│   ├── relationship_graph_map.json
│   └── transcript_sidecars.json
└── README.md
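
As a quick sanity check, the counts listed under Dataset Composition can be compared with the row counts of the CSV files. The sketch below is illustrative only: it assumes each table row maps to the similarly named file under csv/, and that every CSV has exactly one header row. Only a subset of files is listed; crm_people.csv is left out because the README does not say whether co-buyers share that file.

```python
import csv
from pathlib import Path

# Counts copied from the Dataset Composition tables above.
# Assumption: each table row corresponds to the similarly named CSV file,
# and every CSV starts with a single header row.
EXPECTED_ROWS = {
    "crm_leads.csv": 250,
    "crm_opportunities.csv": 400,
    "crm_property_interests.csv": 400,
    "crm_stage_history.csv": 1373,
    "intel_interactions.csv": 1897,
    "intel_calls.csv": 478,
    "inventory_projects.csv": 14,
    "inventory_units.csv": 209,
}


def count_rows(path):
    """Count data rows (header excluded) in one CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        return sum(1 for _ in csv.DictReader(handle))


def check_row_counts(csv_dir, expected=EXPECTED_ROWS):
    """Return {filename: (expected, actual)} for each mismatch.

    actual is None when the file is missing.
    """
    mismatches = {}
    for name, want in expected.items():
        path = Path(csv_dir) / name
        actual = count_rows(path) if path.exists() else None
        if actual != want:
            mismatches[name] = (want, actual)
    return mismatches
```

Calling check_row_counts("synthetic_client_graphs/csv") should return an empty dict if the files match the documented counts.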

Buyer Persona Distribution

The 250 primary clients are distributed across realistic premium real-estate buyer personas:

Persona	Percentage	Count	Characteristics
High-Intent Buyer	20%	~50	Quick decision cycle, clear requirements, responsive
Slow-Burn Investor	18%	~45	Long horizon, price-sensitive, comparison-heavy
NRI Buyer	12%	~30	Remote decision-making, video calls, family proxies
Family Decision Unit	20%	~50	Multiple stakeholders, consensus-driven, Vastu-conscious
Price-Sensitive Aspirational	15%	~37	Stretch budget, EMI-focused, festival-offer hunters
Broker/Referral Chain	8%	~20	Multiple client representations, commission-focused
Repeat Visitor	7%	~18	High engagement, multiple visits, decision paralysis
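
The counts are marked approximate because 15% and 7% of 250 are not whole numbers (37.5 and 17.5). The short check below, using only the figures from the table, confirms that the percentages sum to 100, the counts sum to 250, and no count is more than half a client away from its exact share.

```python
TOTAL_CLIENTS = 250

# (percentage, documented count) per persona, copied from the table above.
PERSONAS = {
    "High-Intent Buyer": (20, 50),
    "Slow-Burn Investor": (18, 45),
    "NRI Buyer": (12, 30),
    "Family Decision Unit": (20, 50),
    "Price-Sensitive Aspirational": (15, 37),
    "Broker/Referral Chain": (8, 20),
    "Repeat Visitor": (7, 18),
}


def exact_share(percentage, total=TOTAL_CLIENTS):
    """Exact (possibly fractional) number of clients for a percentage."""
    return total * percentage / 100


total_pct = sum(pct for pct, _ in PERSONAS.values())
total_count = sum(count for _, count in PERSONAS.values())

# Difference between the documented count and the exact share.
deviations = {
    name: count - exact_share(pct) for name, (pct, count) in PERSONAS.items()
}
```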

Canonical Domain Alignment

This dataset maps to the planned Velocity canonical domains:

`crm_*` Domain

crm_people: Contact identity and demographics
crm_accounts: Organization and employer records
crm_households: Family and co-buyer structures
crm_relationships: Person-to-person linkages
crm_leads: Funnel stage and qualification
crm_opportunities: Deal pipeline and valuation
crm_property_interests: Project/unit preferences
crm_stage_history: Audit trail of stage transitions

`intel_*` Domain

intel_interactions: Unified communication events
intel_messages: Text-level message records
intel_calls: Call metadata and duration
intel_transcripts: Speaker-segmented conversation text
intel_emails: Email correspondence
intel_whatsapp_threads: Thread-level summaries
intel_visits: Site visit records and notes
intel_reminders: Task and follow-up tracking
intel_qd_scores: Qualification/disposition scores
intel_qd_timeseries: Temporal score evolution
intel_vehicle_events: Parking/entry detection
intel_perception_events: Behavioral intelligence
intel_cctv_links: Video evidence references

`inventory_*` Domain

inventory_projects: Master project catalog
inventory_units: Unit-level availability and pricing

`workflow_*` Domain

workflow_actions: Proposed AI/human actions
workflow_approvals: Review decisions
workflow_writebacks: Committed mutations

Quality Assurance

Referential Integrity

All foreign key relationships have been validated:

✅ All lead.person_id values exist in crm_people
✅ All opportunity.lead_id values exist in crm_leads
✅ All interaction.person_id values exist in crm_people
✅ All visit.person_id values exist in crm_people
✅ All qd_score.person_id values exist in crm_people
✅ No orphaned stage history records
✅ All opportunity.project_id values exist in inventory_projects
✅ All property_interest.project_id values exist in inventory_projects
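
These checks can be re-run after any regeneration. The following is a minimal sketch, not the validator actually used for this dataset. The foreign-key column names (person_id, lead_id, project_id) come from the list above; the assumption that the parent tables use the same names for their primary keys is ours.

```python
import csv


def load_rows(path):
    """Read a CSV file into a list of dicts keyed by header name."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def find_orphans(child_rows, fk_column, parent_rows, pk_column):
    """Return child rows whose foreign key has no matching parent row."""
    parent_keys = {row[pk_column] for row in parent_rows}
    return [row for row in child_rows if row[fk_column] not in parent_keys]


# Tiny in-memory illustration with hypothetical IDs.
people = [{"person_id": "P1"}, {"person_id": "P2"}]
leads = [
    {"lead_id": "L1", "person_id": "P1"},
    {"lead_id": "L2", "person_id": "P9"},  # P9 does not exist: an orphan
]
orphans = find_orphans(leads, "person_id", people, "person_id")
```

Against the real files, the first check would read something like find_orphans(load_rows("csv/crm_leads.csv"), "person_id", load_rows("csv/crm_people.csv"), "person_id"), and an empty list means the check passes.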

Temporal Consistency

✅ Lead creation dates precede interaction dates
✅ Stage history transitions are monotonic in time
✅ QD timeseries points are chronologically ordered
✅ Visit dates align with lead stage progression
✅ Reminder due dates follow interaction dates
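
A chronological-order check of the kind described above might look like the sketch below. It assumes timestamps are stored as ISO 8601 strings and that rows appear in the file in the order they were recorded. The column names in the example (lead_id, changed_at) are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def out_of_order_keys(rows, key_column, time_column):
    """Return the keys whose rows are not in non-decreasing time order.

    Rows are grouped by key_column and compared in the order given.
    Timestamps must be ISO 8601 strings (an assumption about the data).
    """
    stamps_by_key = defaultdict(list)
    for row in rows:
        stamps_by_key[row[key_column]].append(
            datetime.fromisoformat(row[time_column])
        )
    return sorted(
        key
        for key, stamps in stamps_by_key.items()
        if any(later < earlier for earlier, later in zip(stamps, stamps[1:]))
    )


# Hypothetical stage-history rows: lead L2 moves backwards in time.
history = [
    {"lead_id": "L1", "changed_at": "2026-01-01T10:00:00"},
    {"lead_id": "L1", "changed_at": "2026-01-05T09:30:00"},
    {"lead_id": "L2", "changed_at": "2026-02-10T12:00:00"},
    {"lead_id": "L2", "changed_at": "2026-02-01T12:00:00"},
]
violations = out_of_order_keys(history, "lead_id", "changed_at")
```

The same function applies to the QD timeseries by grouping on person_id instead of lead_id.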

Realism Rules Applied

Names: Realistic Indian names (Bengali, Hindi, mixed demographics)
Organizations: Major Indian IT, banking, manufacturing, and consulting firms
Communication: Premium property sales tone, not generic retail
Stage Transitions: Narratively coherent (enquiry → visit → negotiation → booking)
Sales Cadence: Realistic follow-up intervals (3-15 days between touches)
Dialogue: Context-aware transcripts referencing specific projects, prices, and objections
Budgets: Aligned to the Kolkata premium market (₹1.5 Cr to ₹25 Cr)

Usage Instructions

CSV-First Import Testing

Start with crm_people.csv as the identity anchor
Join crm_leads.csv on person_id
Join crm_opportunities.csv on lead_id
Join inventory_projects.csv and inventory_units.csv on project/unit IDs
Map intel_interactions.csv on person_id for communication history
Aggregate intel_qd_scores.csv and intel_qd_timeseries.csv for intelligence
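
The join order above can be sketched with plain dictionaries. This is one possible way to assemble a per-client graph, not the Velocity import pipeline itself. It assumes crm_people.csv and crm_leads.csv expose person_id and lead_id columns as their keys, and that rows have already been loaded as dicts (for example with csv.DictReader).

```python
def build_client_graphs(people, leads, opportunities, interactions):
    """Group leads, opportunities and interactions under each person.

    Raises KeyError if a row points at an unknown person or lead,
    which should not happen if referential integrity holds.
    """
    graphs = {
        person["person_id"]: {
            "person": person,
            "leads": [],
            "opportunities": [],
            "interactions": [],
        }
        for person in people
    }

    # Leads attach directly to people; remember who owns each lead.
    lead_owner = {}
    for lead in leads:
        graphs[lead["person_id"]]["leads"].append(lead)
        lead_owner[lead["lead_id"]] = lead["person_id"]

    # Opportunities reach the person through their lead.
    for opportunity in opportunities:
        owner = lead_owner[opportunity["lead_id"]]
        graphs[owner]["opportunities"].append(opportunity)

    # Interactions attach directly to people.
    for interaction in interactions:
        graphs[interaction["person_id"]]["interactions"].append(interaction)

    return graphs
```

Keying everything on person_id mirrors the "identity anchor" role of crm_people.csv: every other record is reached from a person, either directly or through a lead.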

Client 360 Validation

Load json/client_360_snapshots_batch_*.json to validate:

Aggregation accuracy
Cross-domain joining
Derived field computation
Missing data handling
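
A first step before any of these validations is loading all five batches and confirming that each client appears exactly once. The snapshot schema is not documented here, so the sketch below makes two explicit assumptions: each batch file holds a JSON array of snapshot objects, and each snapshot carries a person_id field. Adjust both to the actual structure.

```python
import json
from collections import Counter
from pathlib import Path


def load_snapshots(json_dir):
    """Load every client_360_snapshots_batch_*.json file into one list.

    Assumption: each file contains a JSON array of snapshot objects.
    """
    snapshots = []
    pattern = "client_360_snapshots_batch_*.json"
    for path in sorted(Path(json_dir).glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            snapshots.extend(json.load(handle))
    return snapshots


def duplicate_ids(snapshots, id_key="person_id"):
    """Return IDs that appear in more than one snapshot (id_key is assumed)."""
    counts = Counter(snapshot[id_key] for snapshot in snapshots)
    return sorted(key for key, seen in counts.items() if seen > 1)
```

With the full dataset, len(load_snapshots("synthetic_client_graphs/json")) would be expected to equal 250 and duplicate_ids(...) to return an empty list.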

Oracle Writeback Testing

Use workflow_actions.csv, workflow_approvals.csv, and workflow_writebacks.csv to test:

Proposal generation
Approval flow simulation
Canonical mutation application
Audit trail completeness
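
One concrete audit-trail check is that every writeback traces back to an approved action. The sketch below illustrates the idea only. The column names action_id and decision, and the value "approved", are hypothetical and should be replaced with the real headers from the workflow CSVs.

```python
def unapproved_writebacks(writebacks, approvals, approved_value="approved"):
    """Return writebacks whose action has no approving review decision.

    Column names (action_id, decision) are assumptions for illustration.
    """
    approved_actions = {
        approval["action_id"]
        for approval in approvals
        if approval["decision"].strip().lower() == approved_value
    }
    return [
        writeback
        for writeback in writebacks
        if writeback["action_id"] not in approved_actions
    ]


# Hypothetical rows: W2 was written back although its action was rejected.
approvals = [
    {"action_id": "A1", "decision": "approved"},
    {"action_id": "A2", "decision": "rejected"},
]
writebacks = [
    {"writeback_id": "W1", "action_id": "A1"},
    {"writeback_id": "W2", "action_id": "A2"},
]
gaps = unapproved_writebacks(writebacks, approvals)
```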

Transcript Processing

Load json/transcript_sidecars.json for:

Speaker diarization validation
Conversation context extraction
Sentiment and intent inference testing

Evidence Placeholders

The dataset includes metadata placeholders for:

CCTV clip references (clips/VIS_{visit_id}_{random}.mp4)
Call recording references (rec/CAL_{call_id}.mp3)
Transcript references (trx/CAL_{call_id}.json)
Camera IDs and gate references

These are structured metadata only. Actual media payloads are not included.
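
Because the placeholders follow fixed path patterns, they can be classified and parsed with regular expressions. The patterns below are derived from the three templates listed above. The assumption that the random suffix on clip names contains no underscore is ours, and the IDs in the example are made up.

```python
import re

# One pattern per placeholder template documented above.
# Assumption: the {random} suffix of a clip name contains no underscore,
# so the last underscore separates visit_id from the suffix.
PLACEHOLDER_PATTERNS = {
    "cctv_clip": re.compile(
        r"^clips/VIS_(?P<visit_id>.+)_(?P<suffix>[^_/]+)\.mp4$"
    ),
    "call_recording": re.compile(r"^rec/CAL_(?P<call_id>[^/]+)\.mp3$"),
    "transcript": re.compile(r"^trx/CAL_(?P<call_id>[^/]+)\.json$"),
}


def parse_placeholder(reference):
    """Return (kind, fields) for a known placeholder path, else None."""
    for kind, pattern in PLACEHOLDER_PATTERNS.items():
        match = pattern.match(reference)
        if match:
            return kind, match.groupdict()
    return None


# Hypothetical reference value for illustration.
example = parse_placeholder("clips/VIS_V0042_a1b2c3.mp4")
```

A None result flags a reference that does not follow any documented template, which is useful when validating intel_cctv_links.csv and the call records.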

Synthetic Data Limitations

Names and addresses are fictional but culturally realistic
Phone numbers follow Indian format but are not real
Email addresses are synthetic and non-deliverable
Prices are representative of Kolkata premium market but approximate
Communication text is template-generated but contextually coherent
Transcripts are structured dialogue, not actual ASR output

Acceptance Criteria Verification

Criterion	Status
250 complete synthetic client graphs	✅
All 14 project names represented	✅
Spans CRM, interaction, opportunity, reminder, transcript, enrichment layers	✅
Files structured for CSV-first import testing	✅
A human reviewer can inspect a graph and find it coherent	✅ (sample review recommended)
Referential integrity across all IDs	✅
No impossible date ordering	✅
No orphaned opportunities or interactions	✅
Every QD artifact points back to plausible evidence	✅

Next Steps

Import Replay: Load CSVs into the Velocity import pipeline and validate mapping proposals
Client 360 Render: Use JSON snapshots to test frontend dossier rendering
QD Validation: Verify score computation logic against interaction density
Oracle Testing: Use workflow items to test writeback proposal generation
Synthetic Expansion: Add more projects, cities, or persona types as needed

Generated for: Project Velocity Founder CRM and Platform Planning
Canonical Source: Doc 16 - Coding Agent Swarm Brief: Synthetic Client Graph Generation
Reviewers: Sayan, Sourik
