# Project Velocity - Synthetic Client Graph Dataset
**Generated:** 2026-04-18
**Dataset Version:** 1.0.0
**Target:** 250 full synthetic client graphs
**Owner:** Sagnik
**Alignment:** Founder CRM and Platform Delivery Pack (Doc 16)

---
## Overview
This dataset contains 250 fully synthetic client graphs aligned to the Project Velocity canonical domain model. It is designed for:
- CRM module validation and testing
- Import pipeline replay testing
- Client 360 aggregation validation
- Oracle intelligence and writeback testing
- QD score and timeseries validation
- Communication capture and transcript processing
- Workflow and approval governance testing
The data simulates premium real-estate sales behavior in the Kolkata market across 14 projects.

---
## Geography and Inventory
**Market:** Kolkata and surrounding micro-markets
**Projects:** 14 premium residential projects
| Project ID | Project Name | Developer | Micro-Market |
|------------|--------------|-----------|--------------|
| PRJ-001 | Eden Devprayag | Eden Group | Rajarhat |
| PRJ-002 | Sugam Prakriti | Sugam Homes | Barasat |
| PRJ-003 | Atri Aqua | Atri Developers | New Town |
| PRJ-004 | Atri Surya Toron | Atri Developers | Rajarhat |
| PRJ-005 | Siddha Suburbia Bungalow | Siddha Group | Madanpur |
| PRJ-006 | Merlin Avana | Merlin Group | Tangra |
| PRJ-007 | DTC Good Earth | DTC Projects | New Town |
| PRJ-008 | Siddha Serena | Siddha Group | New Town |
| PRJ-009 | Siddha Sky Waterfront | Siddha Group | Beliaghata |
| PRJ-010 | Godrej Blue | Godrej Properties | New Town |
| PRJ-011 | DTC Sojon | DTC Projects | Rajarhat |
| PRJ-012 | Shriram Grand City | Shriram Properties | Howrah |
| PRJ-013 | Godrej Elevate | Godrej Properties | Dum Dum |
| PRJ-014 | Ambuja Utpaala | Ambuja Neotia | Tollygunge |
---
## Dataset Composition
### Primary Entities
| Entity | Count | Description |
|--------|-------|-------------|
| Primary Clients (People) | 250 | Main decision-makers and buyers |
| Co-buyers/Family | 91 | Secondary contacts linked to households |
| Accounts (Organizations) | 153 | Employers, businesses, referral partners |
| Households | 118 | Family decision units |
| Relationships | 91 | Spouse, parent, sibling, business partner links |
| Leads | 250 | Funnel-stage qualification records |
| Opportunities | 400 | Deal pipeline objects (1-3 per client) |
| Property Interests | 400 | Project/unit preference records |
| Stage History | 1,373 | Lead stage transition audit trail |
### Interaction Graph
| Artifact | Count | Description |
|----------|-------|-------------|
| Interactions | 1,897 | Umbrella communication events |
| WhatsApp Messages | 3,367 | Text messages with realistic dialogue |
| WhatsApp Threads | 606 | Conversation thread summaries |
| Phone Calls | 478 | Call records with duration and direction |
| Transcripts | 231 | Speaker-segmented call transcripts |
| Emails | 149 | Business correspondence with subjects and bodies |
| Site Visits | 305 | Physical site visit records with notes |
| Reminders/Tasks | 759 | Follow-up items and action reminders |
### Intelligence & Enrichment
| Artifact | Count | Description |
|----------|-------|-------------|
| QD Scores | 250 | Latest qualification/disposition scores |
| QD Timeseries | 1,953 | Historical score propagation (4-12 points per client) |
| Vehicle Events | 80 | Number-plate detection events |
| Perception Events | 60 | Behavioral/dwell-time intelligence |
| CCTV Links | 120 | Video clip references linked to visits |
### Workflow & Governance
| Artifact | Count | Description |
|----------|-------|-------------|
| Workflow Actions | 100 | Import reviews, merge proposals, writebacks |
| Approvals | 49 | Human review decisions |
| Writebacks | 28 | Approved canonical mutations |
### Inventory
| Artifact | Count | Description |
|----------|-------|-------------|
| Projects | 14 | Master project records |
| Units | 209 | Individual unit inventory (8-20 per project) |
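The counts in the tables above can be checked mechanically after a download or regeneration. The following is a minimal sketch, not part of the dataset tooling: it assumes each CSV has a single header row, that the files sit in a local `csv/` directory, and that the file-to-entity mapping implied by the file names is correct. Only entities whose row count is unambiguous from the tables are included.

```python
import csv
from pathlib import Path

# Expected row counts taken from the composition tables in this README.
# The file-to-entity mapping is an assumption based on the file names.
EXPECTED_ROWS = {
    "crm_leads.csv": 250,
    "crm_opportunities.csv": 400,
    "crm_property_interests.csv": 400,
    "crm_stage_history.csv": 1373,
    "intel_qd_scores.csv": 250,
    "inventory_projects.csv": 14,
    "inventory_units.csv": 209,
}


def count_rows(path):
    """Count data rows in a CSV file, assuming the first row is a header."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)


def check_row_counts(csv_dir, expected=EXPECTED_ROWS):
    """Return {file_name: (expected, actual)} for every file that differs.

    A missing file is reported with an actual count of None.
    """
    mismatches = {}
    for name, want in expected.items():
        path = Path(csv_dir) / name
        got = count_rows(path) if path.exists() else None
        if got != want:
            mismatches[name] = (want, got)
    return mismatches


if __name__ == "__main__":
    # "csv" is the assumed relative location of the CSV directory.
    for name, (want, got) in check_row_counts("csv").items():
        print(f"{name}: expected {want}, found {got}")
```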
---
## File Structure
```
synthetic_client_graphs/
├── csv/
│ ├── inventory_projects.csv
│ ├── inventory_units.csv
│ ├── crm_people.csv
│ ├── crm_accounts.csv
│ ├── crm_households.csv
│ ├── crm_relationships.csv
│ ├── crm_leads.csv
│ ├── crm_opportunities.csv
│ ├── crm_property_interests.csv
│ ├── crm_stage_history.csv
│ ├── intel_interactions.csv
│ ├── intel_messages.csv
│ ├── intel_calls.csv
│ ├── intel_transcripts.csv
│ ├── intel_emails.csv
│ ├── intel_whatsapp_threads.csv
│ ├── intel_visits.csv
│ ├── intel_reminders.csv
│ ├── intel_qd_scores.csv
│ ├── intel_qd_timeseries.csv
│ ├── intel_vehicle_events.csv
│ ├── intel_perception_events.csv
│ ├── intel_cctv_links.csv
│ ├── workflow_actions.csv
│ ├── workflow_approvals.csv
│ └── workflow_writebacks.csv
├── json/
│ ├── client_360_snapshots_batch_1.json (Clients 1-50)
│ ├── client_360_snapshots_batch_2.json (Clients 51-100)
│ ├── client_360_snapshots_batch_3.json (Clients 101-150)
│ ├── client_360_snapshots_batch_4.json (Clients 151-200)
│ ├── client_360_snapshots_batch_5.json (Clients 201-250)
│ ├── import_mapping_manifest_example.json
│ ├── relationship_graph_map.json
│ └── transcript_sidecars.json
└── README.md
```
---
## Buyer Persona Distribution
The 250 primary clients are distributed across realistic premium real-estate buyer personas:
| Persona | Percentage | Count | Characteristics |
|---------|-----------|-------|-----------------|
| High-Intent Buyer | 20% | ~50 | Quick decision cycle, clear requirements, responsive |
| Slow-Burn Investor | 18% | ~45 | Long horizon, price-sensitive, comparison-heavy |
| NRI Buyer | 12% | ~30 | Remote decision-making, video calls, family proxies |
| Family Decision Unit | 20% | ~50 | Multiple stakeholders, consensus-driven, Vastu-conscious |
| Price-Sensitive Aspirational | 15% | ~37 | Stretch budget, EMI-focused, festival-offer hunters |
| Broker/Referral Chain | 8% | ~20 | Multiple client representations, commission-focused |
| Repeat Visitor | 7% | ~18 | High engagement, multiple visits, decision paralysis |
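The approximate counts in the persona table can be sanity-checked against the percentages. The sketch below is illustrative only: it copies the table values into a list and checks that the percentages sum to 100, the counts sum to 250, and each count is within half a client of its exact share (15% of 250 is 37.5 and 7% of 250 is 17.5, which is why those two rows are rounded).

```python
TOTAL_CLIENTS = 250

# (persona, percentage, approximate count) as listed in the table above.
PERSONAS = [
    ("High-Intent Buyer", 20, 50),
    ("Slow-Burn Investor", 18, 45),
    ("NRI Buyer", 12, 30),
    ("Family Decision Unit", 20, 50),
    ("Price-Sensitive Aspirational", 15, 37),
    ("Broker/Referral Chain", 8, 20),
    ("Repeat Visitor", 7, 18),
]


def check_distribution(rows, total):
    """Return a list of problems; an empty list means the table is self-consistent."""
    problems = []
    if sum(pct for _, pct, _ in rows) != 100:
        problems.append("percentages do not sum to 100")
    if sum(count for _, _, count in rows) != total:
        problems.append("counts do not sum to the client total")
    for name, pct, count in rows:
        exact = total * pct / 100
        if abs(exact - count) > 0.5:
            problems.append(f"{name}: {count} is not within 0.5 of {exact}")
    return problems


print(check_distribution(PERSONAS, TOTAL_CLIENTS))  # prints []
```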
---
## Canonical Domain Alignment
This dataset maps to the planned Velocity canonical domains:
### `crm_*` Domain
- `crm_people`: Contact identity and demographics
- `crm_accounts`: Organization and employer records
- `crm_households`: Family and co-buyer structures
- `crm_relationships`: Person-to-person linkages
- `crm_leads`: Funnel stage and qualification
- `crm_opportunities`: Deal pipeline and valuation
- `crm_property_interests`: Project/unit preferences
- `crm_stage_history`: Audit trail of stage transitions
### `intel_*` Domain
- `intel_interactions`: Unified communication events
- `intel_messages`: Text-level message records
- `intel_calls`: Call metadata and duration
- `intel_transcripts`: Speaker-segmented conversation text
- `intel_emails`: Email correspondence
- `intel_whatsapp_threads`: Thread-level summaries
- `intel_visits`: Site visit records and notes
- `intel_reminders`: Task and follow-up tracking
- `intel_qd_scores`: Qualification/disposition scores
- `intel_qd_timeseries`: Temporal score evolution
- `intel_vehicle_events`: Parking/entry detection
- `intel_perception_events`: Behavioral intelligence
- `intel_cctv_links`: Video evidence references
### `inventory_*` Domain
- `inventory_projects`: Master project catalog
- `inventory_units`: Unit-level availability and pricing
### `workflow_*` Domain
- `workflow_actions`: Proposed AI/human actions
- `workflow_approvals`: Review decisions
- `workflow_writebacks`: Committed mutations
---
## Quality Assurance
### Referential Integrity
All foreign key relationships have been validated:
- ✅ All `lead.person_id` values exist in `crm_people`
- ✅ All `opportunity.lead_id` values exist in `crm_leads`
- ✅ All `interaction.person_id` values exist in `crm_people`
- ✅ All `visit.person_id` values exist in `crm_people`
- ✅ All `qd_score.person_id` values exist in `crm_people`
- ✅ No orphaned stage history records
- ✅ All `opportunity.project_id` values exist in `inventory_projects`
- ✅ All `property_interest.project_id` values exist in `inventory_projects`
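Each check above is the same operation: find child rows whose foreign key has no match in the parent table. A minimal sketch of that check follows. It is not the validator used to produce this dataset; the inline sample rows, the `full_name` column, and the ID formats are illustrative assumptions, while `person_id` and `lead_id` come from the checks listed above.

```python
import csv
import io


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by header name."""
    return list(csv.DictReader(io.StringIO(text)))


def find_orphans(child_rows, fk_column, parent_rows, pk_column):
    """Return child rows whose foreign key has no match in the parent table."""
    parent_keys = {row[pk_column] for row in parent_rows}
    return [row for row in child_rows if row[fk_column] not in parent_keys]


# Tiny inline sample; IDs and non-key column names are illustrative only.
people = read_rows("person_id,full_name\nP-001,Sample One\nP-002,Sample Two\n")
leads = read_rows("lead_id,person_id\nL-001,P-001\nL-002,P-999\n")

orphans = find_orphans(leads, "person_id", people, "person_id")
print([row["lead_id"] for row in orphans])  # prints ['L-002']
```

Against the real files, `read_rows` would be replaced by `csv.DictReader` over an open file handle, and the same `find_orphans` call repeated for each parent/child pair.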
### Temporal Consistency
- ✅ Lead creation dates precede interaction dates
- ✅ Stage history transitions are monotonic in time
- ✅ QD timeseries points are chronologically ordered
- ✅ Visit dates align with lead stage progression
- ✅ Reminder due dates follow interaction dates
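The ordering checks above can be reproduced with a small helper that groups rows by an ID and flags any group whose timestamps go backwards in file order. This is a sketch under assumptions: timestamps are taken to be ISO 8601 strings, and the `changed_at` column name is hypothetical.

```python
from datetime import datetime
from itertools import groupby


def out_of_order_keys(rows, key_column, time_column):
    """Return keys whose rows, taken in file order, ever go backwards in time."""
    bad = []
    # sorted() is stable, so rows keep their file order within each key.
    ordered = sorted(rows, key=lambda row: row[key_column])
    for key, group in groupby(ordered, key=lambda row: row[key_column]):
        stamps = [datetime.fromisoformat(row[time_column]) for row in group]
        if any(later < earlier for earlier, later in zip(stamps, stamps[1:])):
            bad.append(key)
    return bad


# Illustrative stage-history rows; "changed_at" is an assumed column name.
stage_history = [
    {"lead_id": "L-001", "changed_at": "2026-01-05"},
    {"lead_id": "L-002", "changed_at": "2026-02-10"},
    {"lead_id": "L-001", "changed_at": "2026-01-12"},
    {"lead_id": "L-002", "changed_at": "2026-02-01"},
]

print(out_of_order_keys(stage_history, "lead_id", "changed_at"))  # prints ['L-002']
```

The same helper applies to the QD timeseries check by grouping on the client ID instead of the lead ID.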
### Realism Rules Applied
- **Names:** Realistic Indian names (Bengali, Hindi, mixed demographics)
- **Organizations:** Major Indian IT, banking, manufacturing, and consulting firms
- **Communication:** Premium property sales tone, not generic retail
- **Stage Transitions:** Narratively coherent (enquiry → visit → negotiation → booking)
- **Sales Cadence:** Realistic follow-up intervals (3-15 days between touches)
- **Dialogue:** Context-aware transcripts referencing specific projects, prices, and objections
- **Budgets:** Aligned to Kolkata premium market (₹1.5 Cr to ₹25 Cr)
---
## Usage Instructions
### CSV-First Import Testing
1. Start with `crm_people.csv` as the identity anchor
2. Join `crm_leads.csv` on `person_id`
3. Join `crm_opportunities.csv` on `lead_id`
4. Join `inventory_projects.csv` on `project_id`, then `inventory_units.csv` on the unit ID
5. Join `intel_interactions.csv` on `person_id` for communication history
6. Aggregate `intel_qd_scores.csv` (latest score) and `intel_qd_timeseries.csv` (score history) per client to build the intelligence view
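The first three steps can be sketched with the standard library alone. This is a hedged illustration, not the Velocity import pipeline: it nests leads under people and opportunities under leads using the `person_id` and `lead_id` keys named above. The inline sample rows and the `full_name`, `stage`, and `opportunity_id` columns are assumptions made for the example.

```python
import csv
import io
from collections import defaultdict


def load(text):
    """Parse CSV text into a list of dicts keyed by header name."""
    return list(csv.DictReader(io.StringIO(text)))


def group_by(rows, column):
    """Index rows by the value of one column."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[column]].append(row)
    return grouped


def build_client_graphs(people, leads, opportunities):
    """Nest leads under people and opportunities under leads (steps 1 to 3)."""
    leads_by_person = group_by(leads, "person_id")
    opps_by_lead = group_by(opportunities, "lead_id")
    graphs = []
    for person in people:
        person_leads = [
            {**lead, "opportunities": opps_by_lead.get(lead["lead_id"], [])}
            for lead in leads_by_person.get(person["person_id"], [])
        ]
        graphs.append({**person, "leads": person_leads})
    return graphs


# Illustrative rows only; real files would be read from the csv/ directory.
people = load("person_id,full_name\nP-001,Sample Buyer\nP-002,Sample Investor\n")
leads = load("lead_id,person_id,stage\nL-001,P-001,negotiation\n")
opportunities = load(
    "opportunity_id,lead_id,project_id\nO-001,L-001,PRJ-003\nO-002,L-001,PRJ-010\n"
)

graphs = build_client_graphs(people, leads, opportunities)
for graph in graphs:
    projects = [o["project_id"] for l in graph["leads"] for o in l["opportunities"]]
    print(graph["person_id"], projects)
```

People without leads are kept with an empty `leads` list, which mirrors a left join and makes gaps in the funnel visible instead of silently dropping rows.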
### Client 360 Validation
Load `json/client_360_snapshots_batch_*.json` to validate:
- Aggregation accuracy
- Cross-domain joining
- Derived field computation
- Missing data handling
### Oracle Writeback Testing
Use `workflow_actions.csv` and `workflow_writebacks.csv` to test:
- Proposal generation
- Approval flow simulation
- Canonical mutation application
- Audit trail completeness
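One audit-trail check that follows from the counts (100 actions, 49 approvals, 28 writebacks) is that every writeback traces back to an approved action. The sketch below shows that check under stated assumptions: the `action_id`, `decision`, and `writeback_id` column names and the `"approved"` value are hypothetical and should be replaced with the actual schema.

```python
def unapproved_writebacks(writebacks, approvals, approved_value="approved"):
    """Return IDs of writebacks whose action has no approving decision.

    Column names (action_id, decision, writeback_id) are assumed, not
    confirmed by the dataset schema.
    """
    approved_actions = {
        row["action_id"] for row in approvals if row["decision"] == approved_value
    }
    return [
        row["writeback_id"]
        for row in writebacks
        if row["action_id"] not in approved_actions
    ]


# Illustrative rows only.
approvals = [
    {"approval_id": "APR-001", "action_id": "ACT-001", "decision": "approved"},
    {"approval_id": "APR-002", "action_id": "ACT-002", "decision": "rejected"},
]
writebacks = [
    {"writeback_id": "WB-001", "action_id": "ACT-001"},
    {"writeback_id": "WB-002", "action_id": "ACT-002"},
    {"writeback_id": "WB-003", "action_id": "ACT-003"},
]

print(unapproved_writebacks(writebacks, approvals))  # prints ['WB-002', 'WB-003']
```

An empty result is the expected outcome for this dataset if writebacks are only ever created from approved actions.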
### Transcript Processing
Load `json/transcript_sidecars.json` for:
- Speaker diarization validation
- Conversation context extraction
- Sentiment and intent inference testing
---
## Evidence Placeholders
The dataset includes metadata placeholders for:
- CCTV clip references (`clips/VIS_{visit_id}_{random}.mp4`)
- Call recording references (`rec/CAL_{call_id}.mp3`)
- Transcript references (`trx/CAL_{call_id}.json`)
- Camera IDs and gate references
These are structured metadata only. Actual media payloads are not included.
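Because the placeholders follow fixed templates, a consumer can classify a reference and recover the embedded ID without touching any media. The sketch below is an assumption-laden illustration: the templates come from the list above, but the character set allowed in the ID and random-suffix segments is not specified in this README, so `[A-Za-z0-9-]+` is a guess, and the sample paths are invented.

```python
import re

# The shape of the ID and suffix segments is not documented; this is a guess.
SEGMENT = r"[A-Za-z0-9-]+"

# One pattern per placeholder template listed above.
PATTERNS = {
    "cctv_clip": re.compile(
        rf"^clips/VIS_(?P<visit_id>{SEGMENT})_(?P<suffix>{SEGMENT})\.mp4$"
    ),
    "call_recording": re.compile(rf"^rec/CAL_(?P<call_id>{SEGMENT})\.mp3$"),
    "transcript": re.compile(rf"^trx/CAL_(?P<call_id>{SEGMENT})\.json$"),
}


def classify(reference):
    """Return (kind, extracted_fields) for a known placeholder, else None."""
    for kind, pattern in PATTERNS.items():
        match = pattern.match(reference)
        if match:
            return kind, match.groupdict()
    return None


# Invented sample references for illustration.
print(classify("clips/VIS_V-0042_a1b2c3.mp4"))
print(classify("rec/CAL_C-0007.mp3"))
print(classify("rec/CAL_C-0007.wav"))  # prints None
```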

---
## Synthetic Data Limitations
1. **Names and addresses** are fictional but culturally realistic
2. **Phone numbers** follow Indian format but are not real
3. **Email addresses** are synthetic and non-deliverable
4. **Prices** are representative of Kolkata premium market but approximate
5. **Communication text** is template-generated but contextually coherent
6. **Transcripts** are structured dialogue, not actual ASR output
---
## Acceptance Criteria Verification
| Criterion | Status |
|-----------|--------|
| 250 complete synthetic client graphs | ✅ |
| All 14 project names represented | ✅ |
| Spans CRM, interaction, opportunity, reminder, transcript, enrichment layers | ✅ |
| Files structured for CSV-first import testing | ✅ |
| A human reviewer can inspect a graph and find it coherent | ✅ (sample review recommended) |
| Referential integrity across all IDs | ✅ |
| No impossible date ordering | ✅ |
| No orphaned opportunities or interactions | ✅ |
| Every QD artifact points back to plausible evidence | ✅ |
---
## Next Steps
1. **Import Replay:** Load CSVs into the Velocity import pipeline and validate mapping proposals
2. **Client 360 Render:** Use JSON snapshots to test frontend dossier rendering
3. **QD Validation:** Verify score computation logic against interaction density
4. **Oracle Testing:** Use workflow items to test writeback proposal generation
5. **Synthetic Expansion:** Add more projects, cities, or persona types as needed
---
**Generated for:** Project Velocity Founder CRM and Platform Planning
**Canonical Source:** Doc 16 - Coding Agent Swarm Brief: Synthetic Client Graph Generation
**Reviewers:** Sayan, Sourik