# Project Velocity - Synthetic Client Graph Dataset

**Generated:** 2026-04-18
**Dataset Version:** 1.0.0
**Target:** 250 full synthetic client graphs
**Owner:** Sagnik
**Alignment:** Founder CRM and Platform Delivery Pack (Doc 16)

---

## Overview

This dataset contains 250 fully synthetic client graphs aligned to the Project Velocity canonical domain model. It is designed for:

- CRM module validation and testing
- Import pipeline replay testing
- Client 360 aggregation validation
- Oracle intelligence and writeback testing
- QD score and timeseries validation
- Communication capture and transcript processing
- Workflow and approval governance testing

The data simulates premium real-estate sales behavior in the Kolkata market across 14 projects.

---

## Geography and Inventory

**Market:** Kolkata and surrounding micro-markets
**Projects:** 14 premium residential projects

| Project ID | Project Name | Developer | Micro-Market |
|------------|--------------|-----------|--------------|
| PRJ-001 | Eden Devprayag | Eden Group | Rajarhat |
| PRJ-002 | Sugam Prakriti | Sugam Homes | Barasat |
| PRJ-003 | Atri Aqua | Atri Developers | New Town |
| PRJ-004 | Atri Surya Toron | Atri Developers | Rajarhat |
| PRJ-005 | Siddha Suburbia Bungalow | Siddha Group | Madanpur |
| PRJ-006 | Merlin Avana | Merlin Group | Tangra |
| PRJ-007 | DTC Good Earth | DTC Projects | New Town |
| PRJ-008 | Siddha Serena | Siddha Group | New Town |
| PRJ-009 | Siddha Sky Waterfront | Siddha Group | Beliaghata |
| PRJ-010 | Godrej Blue | Godrej Properties | New Town |
| PRJ-011 | DTC Sojon | DTC Projects | Rajarhat |
| PRJ-012 | Shriram Grand City | Shriram Properties | Howrah |
| PRJ-013 | Godrej Elevate | Godrej Properties | Dum Dum |
| PRJ-014 | Ambuja Utpaala | Ambuja Neotia | Tollygunge |

---

## Dataset Composition

### Primary Entities

| Entity | Count | Description |
|--------|-------|-------------|
| Primary Clients (People) | 250 | Main decision-makers and buyers |
| Co-buyers/Family | 91 | Secondary contacts linked to households |
| Accounts (Organizations) | 153 | Employers, businesses, referral partners |
| Households | 118 | Family decision units |
| Relationships | 91 | Spouse, parent, sibling, business partner links |
| Leads | 250 | Funnel-stage qualification records |
| Opportunities | 400 | Deal pipeline objects (1-3 per client) |
| Property Interests | 400 | Project/unit preference records |
| Stage History | 1,373 | Lead stage transition audit trail |

### Interaction Graph

| Artifact | Count | Description |
|----------|-------|-------------|
| Interactions | 1,897 | Umbrella communication events |
| WhatsApp Messages | 3,367 | Text messages with realistic dialogue |
| WhatsApp Threads | 606 | Conversation thread summaries |
| Phone Calls | 478 | Call records with duration and direction |
| Transcripts | 231 | Speaker-segmented call transcripts |
| Emails | 149 | Business correspondence with subjects and bodies |
| Site Visits | 305 | Physical site visit records with notes |
| Reminders/Tasks | 759 | Follow-up items and action reminders |

### Intelligence & Enrichment

| Artifact | Count | Description |
|----------|-------|-------------|
| QD Scores | 250 | Latest qualification/disposition scores |
| QD Timeseries | 1,953 | Historical score propagation (4-12 pts/client) |
| Vehicle Events | 80 | Number-plate detection events |
| Perception Events | 60 | Behavioral/dwell-time intelligence |
| CCTV Links | 120 | Video clip references linked to visits |

### Workflow & Governance

| Artifact | Count | Description |
|----------|-------|-------------|
| Workflow Actions | 100 | Import reviews, merge proposals, writebacks |
| Approvals | 49 | Human review decisions |
| Writebacks | 28 | Approved canonical mutations |

### Inventory

| Artifact | Count | Description |
|----------|-------|-------------|
| Projects | 14 | Master project records |
| Units | 209 | Individual unit inventory (8-20 per project) |

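Several rows above state a per-parent range (1-3 opportunities per client, 4-12 QD timeseries points per client, 8-20 units per project). The sketch below shows one way such a range could be checked once the rows are loaded. The function name and the sample IDs are illustrative, and grouping opportunities by `lead_id` is an assumption based on the one-lead-per-client counts. Parents with zero child rows never appear in the counter, so they would need a separate check.

```python
from collections import Counter


def cardinality_violations(rows, key, lo, hi):
    """Return {parent_id: child_count} for parents outside the range [lo, hi]."""
    counts = Counter(row[key] for row in rows)
    return {parent: n for parent, n in counts.items() if not lo <= n <= hi}


# In-memory stand-in for crm_opportunities.csv rows (IDs are made up).
opportunities = [
    {"opportunity_id": "O-1", "lead_id": "L-1"},
    {"opportunity_id": "O-2", "lead_id": "L-1"},
    {"opportunity_id": "O-3", "lead_id": "L-2"},
    {"opportunity_id": "O-4", "lead_id": "L-3"},
    {"opportunity_id": "O-5", "lead_id": "L-3"},
    {"opportunity_id": "O-6", "lead_id": "L-3"},
    {"opportunity_id": "O-7", "lead_id": "L-3"},
]

# L-3 has four opportunities, which is above the documented 1-3 range.
violations = cardinality_violations(opportunities, "lead_id", lo=1, hi=3)
```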
---

## File Structure

```
synthetic_client_graphs/
├── csv/
│   ├── inventory_projects.csv
│   ├── inventory_units.csv
│   ├── crm_people.csv
│   ├── crm_accounts.csv
│   ├── crm_households.csv
│   ├── crm_relationships.csv
│   ├── crm_leads.csv
│   ├── crm_opportunities.csv
│   ├── crm_property_interests.csv
│   ├── crm_stage_history.csv
│   ├── intel_interactions.csv
│   ├── intel_messages.csv
│   ├── intel_calls.csv
│   ├── intel_transcripts.csv
│   ├── intel_emails.csv
│   ├── intel_whatsapp_threads.csv
│   ├── intel_visits.csv
│   ├── intel_reminders.csv
│   ├── intel_qd_scores.csv
│   ├── intel_qd_timeseries.csv
│   ├── intel_vehicle_events.csv
│   ├── intel_perception_events.csv
│   ├── intel_cctv_links.csv
│   ├── workflow_actions.csv
│   ├── workflow_approvals.csv
│   └── workflow_writebacks.csv
├── json/
│   ├── client_360_snapshots_batch_1.json    (Clients 1-50)
│   ├── client_360_snapshots_batch_2.json    (Clients 51-100)
│   ├── client_360_snapshots_batch_3.json    (Clients 101-150)
│   ├── client_360_snapshots_batch_4.json    (Clients 151-200)
│   ├── client_360_snapshots_batch_5.json    (Clients 201-250)
│   ├── import_mapping_manifest_example.json
│   ├── relationship_graph_map.json
│   └── transcript_sidecars.json
└── README.md
```

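Before an import run it can be useful to confirm that every CSV in the tree above is actually present. The following is a minimal sketch using only the standard library. The helper name `missing_csv_files` is illustrative, and the demo runs against a throwaway temporary directory rather than the real dataset.

```python
import tempfile
from pathlib import Path

# Base names of the 26 CSV files listed in the tree above.
EXPECTED_CSV = [
    "inventory_projects", "inventory_units",
    "crm_people", "crm_accounts", "crm_households", "crm_relationships",
    "crm_leads", "crm_opportunities", "crm_property_interests", "crm_stage_history",
    "intel_interactions", "intel_messages", "intel_calls", "intel_transcripts",
    "intel_emails", "intel_whatsapp_threads", "intel_visits", "intel_reminders",
    "intel_qd_scores", "intel_qd_timeseries", "intel_vehicle_events",
    "intel_perception_events", "intel_cctv_links",
    "workflow_actions", "workflow_approvals", "workflow_writebacks",
]


def missing_csv_files(root):
    """Return the expected CSV base names that are absent under <root>/csv/."""
    csv_dir = Path(root) / "csv"
    return [name for name in EXPECTED_CSV if not (csv_dir / f"{name}.csv").is_file()]


# Demo: a temporary dataset root containing only crm_people.csv.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "csv").mkdir()
    (Path(tmp) / "csv" / "crm_people.csv").touch()
    missing = missing_csv_files(tmp)
```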
---

## Buyer Persona Distribution

The 250 primary clients are distributed across realistic premium real-estate buyer personas:

| Persona | Percentage | Count | Characteristics |
|---------|-----------|-------|-----------------|
| High-Intent Buyer | 20% | ~50 | Quick decision cycle, clear requirements, responsive |
| Slow-Burn Investor | 18% | ~45 | Long horizon, price-sensitive, comparison-heavy |
| NRI Buyer | 12% | ~30 | Remote decision-making, video calls, family proxies |
| Family Decision Unit | 20% | ~50 | Multiple stakeholders, consensus-driven, Vastu-conscious |
| Price-Sensitive Aspirational | 15% | ~37 | Stretch budget, EMI-focused, festival-offer hunters |
| Broker/Referral Chain | 8% | ~20 | Multiple client representations, commission-focused |
| Repeat Visitor | 7% | ~18 | High engagement, multiple visits, decision paralysis |

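The approximate counts can be cross-checked against the percentages using only the figures in the table. Two rows (15% and 7% of 250) land on a half client, which is why the counts are marked approximate. The short check below confirms that the percentages sum to 100, the counts sum to 250, and no count drifts more than half a client from its exact share.

```python
TOTAL_CLIENTS = 250

# persona -> (percentage, approximate count), copied from the table above
PERSONAS = {
    "High-Intent Buyer": (20, 50),
    "Slow-Burn Investor": (18, 45),
    "NRI Buyer": (12, 30),
    "Family Decision Unit": (20, 50),
    "Price-Sensitive Aspirational": (15, 37),
    "Broker/Referral Chain": (8, 20),
    "Repeat Visitor": (7, 18),
}

pct_sum = sum(pct for pct, _ in PERSONAS.values())
count_sum = sum(count for _, count in PERSONAS.values())

# Largest gap between a listed count and the exact share of 250 clients.
max_drift = max(
    abs(count - TOTAL_CLIENTS * pct / 100) for pct, count in PERSONAS.values()
)
```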
---

## Canonical Domain Alignment

This dataset maps to the planned Velocity canonical domains:

### `crm_*` Domain
- `crm_people`: Contact identity and demographics
- `crm_accounts`: Organization and employer records
- `crm_households`: Family and co-buyer structures
- `crm_relationships`: Person-to-person linkages
- `crm_leads`: Funnel stage and qualification
- `crm_opportunities`: Deal pipeline and valuation
- `crm_property_interests`: Project/unit preferences
- `crm_stage_history`: Audit trail of stage transitions

### `intel_*` Domain
- `intel_interactions`: Unified communication events
- `intel_messages`: Text-level message records
- `intel_calls`: Call metadata and duration
- `intel_transcripts`: Speaker-segmented conversation text
- `intel_emails`: Email correspondence
- `intel_whatsapp_threads`: Thread-level summaries
- `intel_visits`: Site visit records and notes
- `intel_reminders`: Task and follow-up tracking
- `intel_qd_scores`: Qualification/disposition scores
- `intel_qd_timeseries`: Temporal score evolution
- `intel_vehicle_events`: Parking/entry detection
- `intel_perception_events`: Behavioral intelligence
- `intel_cctv_links`: Video evidence references

### `inventory_*` Domain
- `inventory_projects`: Master project catalog
- `inventory_units`: Unit-level availability and pricing

### `workflow_*` Domain
- `workflow_actions`: Proposed AI/human actions
- `workflow_approvals`: Review decisions
- `workflow_writebacks`: Committed mutations

---

## Quality Assurance

### Referential Integrity
All foreign key relationships have been validated:
- ✅ All `lead.person_id` values exist in `crm_people`
- ✅ All `opportunity.lead_id` values exist in `crm_leads`
- ✅ All `interaction.person_id` values exist in `crm_people`
- ✅ All `visit.person_id` values exist in `crm_people`
- ✅ All `qd_score.person_id` values exist in `crm_people`
- ✅ No orphaned stage history records
- ✅ All `opportunity.project_id` values exist in `inventory_projects`
- ✅ All `property_interest.project_id` values exist in `inventory_projects`

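A foreign-key check of the kind listed above can be reproduced with a set lookup. This is a minimal sketch, not the validation script used to produce the dataset. The function name is illustrative, the sample rows are made up, and it assumes the parent key column carries the same name as the foreign key (`person_id`).

```python
def orphaned_keys(child_rows, fk, parent_rows, pk):
    """Return the sorted foreign-key values in child_rows that have no parent row."""
    parent_ids = {row[pk] for row in parent_rows}
    return sorted({row[fk] for row in child_rows if row[fk] not in parent_ids})


# In-memory stand-ins for crm_people.csv and crm_leads.csv (IDs are made up).
people = [{"person_id": "P-001"}, {"person_id": "P-002"}]
leads = [
    {"lead_id": "L-001", "person_id": "P-001"},
    {"lead_id": "L-002", "person_id": "P-002"},
    {"lead_id": "L-003", "person_id": "P-999"},  # deliberately broken reference
]

orphans = orphaned_keys(leads, "person_id", people, "person_id")
```

On the shipped CSVs every such call is expected to return an empty list. The same helper applies to the other pairs above, for example `opportunity.lead_id` against `crm_leads`.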
### Temporal Consistency
- ✅ Lead creation dates precede interaction dates
- ✅ Stage history transitions are monotonic in time
- ✅ QD timeseries points are chronologically ordered
- ✅ Visit dates align with lead stage progression
- ✅ Reminder due dates follow interaction dates

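The monotonicity rule for stage history can be sketched as a pairwise comparison of consecutive timestamps per lead. The column name `changed_at`, the ISO 8601 timestamp format, and the assumption that rows appear in transition order are all hypothetical. Adjust them to the actual `crm_stage_history.csv` header.

```python
from collections import defaultdict
from datetime import datetime


def leads_with_backward_transitions(stage_rows, lead_key="lead_id", ts_key="changed_at"):
    """Return lead IDs whose stage timestamps ever move backwards in row order."""
    stamps_by_lead = defaultdict(list)
    for row in stage_rows:
        stamps_by_lead[row[lead_key]].append(datetime.fromisoformat(row[ts_key]))
    return sorted(
        lead
        for lead, stamps in stamps_by_lead.items()
        if any(later < earlier for earlier, later in zip(stamps, stamps[1:]))
    )


# Made-up sample rows: L-002 records a visit dated before its enquiry.
stage_history = [
    {"lead_id": "L-001", "stage": "enquiry", "changed_at": "2026-01-05T10:00:00"},
    {"lead_id": "L-001", "stage": "visit", "changed_at": "2026-01-12T16:30:00"},
    {"lead_id": "L-002", "stage": "enquiry", "changed_at": "2026-02-01T09:00:00"},
    {"lead_id": "L-002", "stage": "visit", "changed_at": "2026-01-28T11:00:00"},
]

bad_leads = leads_with_backward_transitions(stage_history)
```

The same pairwise comparison would work for QD timeseries points grouped by `person_id`.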
### Realism Rules Applied
- **Names:** Realistic Indian names (Bengali, Hindi, mixed demographics)
- **Organizations:** Major Indian IT, banking, manufacturing, and consulting firms
- **Communication:** Premium property sales tone, not generic retail
- **Stage Transitions:** Narratively coherent (enquiry → visit → negotiation → booking)
- **Sales Cadence:** Realistic follow-up intervals (3-15 days between touches)
- **Dialogue:** Context-aware transcripts referencing specific projects, prices, and objections
- **Budgets:** Aligned to Kolkata premium market (1.5 Cr - 25 Cr range)

---

## Usage Instructions

### CSV-First Import Testing
1. Start with `crm_people.csv` as the identity anchor
2. Join `crm_leads.csv` on `person_id`
3. Join `crm_opportunities.csv` on `lead_id`
4. Join `inventory_projects.csv` and `inventory_units.csv` on project/unit IDs
5. Map `intel_interactions.csv` on `person_id` for communication history
6. Aggregate `intel_qd_scores.csv` and `intel_qd_timeseries.csv` for intelligence

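The first three join steps above can be sketched with the standard `csv` module. The inline CSV text stands in for the real files, and the columns `full_name` and `stage` as well as all sample values are hypothetical. Only `person_id`, `lead_id`, and `project_id` are taken from this README.

```python
import csv
import io

# Tiny inline stand-ins for crm_people.csv, crm_leads.csv, crm_opportunities.csv.
PEOPLE_CSV = "person_id,full_name\nP-001,Sample Client A\nP-002,Sample Client B\n"
LEADS_CSV = "lead_id,person_id,stage\nL-001,P-001,visit\nL-002,P-002,enquiry\n"
OPPS_CSV = "opportunity_id,lead_id,project_id\nO-001,L-001,PRJ-003\nO-002,L-001,PRJ-010\n"


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def build_client_view(people, leads, opportunities):
    """Join people -> leads (person_id) -> opportunities (lead_id)."""
    leads_by_person = {}
    for lead in leads:
        leads_by_person.setdefault(lead["person_id"], []).append(lead)
    opps_by_lead = {}
    for opp in opportunities:
        opps_by_lead.setdefault(opp["lead_id"], []).append(opp)

    view = {}
    for person in people:
        person_leads = leads_by_person.get(person["person_id"], [])
        view[person["person_id"]] = {
            "name": person["full_name"],
            "lead_ids": [lead["lead_id"] for lead in person_leads],
            "project_ids": [
                opp["project_id"]
                for lead in person_leads
                for opp in opps_by_lead.get(lead["lead_id"], [])
            ],
        }
    return view


view = build_client_view(read_rows(PEOPLE_CSV), read_rows(LEADS_CSV), read_rows(OPPS_CSV))
```

Starting from people and using `.get(..., [])` keeps clients with no opportunities in the result, which mirrors a left join anchored on `crm_people.csv`.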
### Client 360 Validation
Load `json/client_360_snapshots_batch_*.json` to validate:
- Aggregation accuracy
- Cross-domain joining
- Derived field computation
- Missing data handling

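A loader for the batch files might look like the sketch below. The snapshot schema is not described in this README, so the code assumes each batch file holds a top-level JSON array of objects and that each object has a `person_id` field. Both are assumptions to verify against the real files. The demo writes two small batches to a temporary directory.

```python
import json
import tempfile
from pathlib import Path


def load_snapshots(json_dir):
    """Concatenate all client_360_snapshots_batch_*.json files in json_dir."""
    snapshots = []
    for path in sorted(Path(json_dir).glob("client_360_snapshots_batch_*.json")):
        with path.open(encoding="utf-8") as fh:
            snapshots.extend(json.load(fh))  # assumes a top-level JSON array
    return snapshots


def duplicate_ids(snapshots, key="person_id"):
    """Return the sorted IDs that appear in more than one snapshot."""
    seen, dupes = set(), set()
    for snap in snapshots:
        (dupes if snap[key] in seen else seen).add(snap[key])
    return sorted(dupes)


# Demo with made-up data: P-002 appears in both batches.
with tempfile.TemporaryDirectory() as tmp:
    batch_1 = [{"person_id": "P-001"}, {"person_id": "P-002"}]
    batch_2 = [{"person_id": "P-002"}]
    Path(tmp, "client_360_snapshots_batch_1.json").write_text(json.dumps(batch_1), encoding="utf-8")
    Path(tmp, "client_360_snapshots_batch_2.json").write_text(json.dumps(batch_2), encoding="utf-8")
    loaded = load_snapshots(tmp)
    repeated = duplicate_ids(loaded)
```

Against the shipped batches, the expectation from the file listing is 250 snapshots in total with no repeated IDs.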
### Oracle Writeback Testing
Use `workflow_actions.csv`, `workflow_approvals.csv`, and `workflow_writebacks.csv` to test:
- Proposal generation
- Approval flow simulation
- Canonical mutation application
- Audit trail completeness

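One audit-trail check implied by the list above is that every writeback traces back to an approved action. The sketch below illustrates the idea. The column names `action_id`, `writeback_id`, and `decision`, and the literal value `"approved"`, are hypothetical and should be replaced with the actual headers and values in the workflow CSVs.

```python
def unapproved_writebacks(writebacks, approvals):
    """Return writeback IDs whose action has no approval with decision 'approved'."""
    approved_actions = {
        row["action_id"] for row in approvals if row["decision"] == "approved"
    }
    return [
        row["writeback_id"]
        for row in writebacks
        if row["action_id"] not in approved_actions
    ]


# Made-up sample rows: WB-2 points at an action that was rejected.
approvals = [
    {"approval_id": "AP-1", "action_id": "ACT-1", "decision": "approved"},
    {"approval_id": "AP-2", "action_id": "ACT-2", "decision": "rejected"},
]
writebacks = [
    {"writeback_id": "WB-1", "action_id": "ACT-1"},
    {"writeback_id": "WB-2", "action_id": "ACT-2"},
]

suspect = unapproved_writebacks(writebacks, approvals)
```

Since writebacks are described as approved canonical mutations, an empty result is the expected outcome on the real data.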
### Transcript Processing
Load `json/transcript_sidecars.json` for:
- Speaker diarization validation
- Conversation context extraction
- Sentiment and intent inference testing

---

## Evidence Placeholders

The dataset includes metadata placeholders for:
- CCTV clip references (`clips/VIS_{visit_id}_{random}.mp4`)
- Call recording references (`rec/CAL_{call_id}.mp3`)
- Transcript references (`trx/CAL_{call_id}.json`)
- Camera IDs and gate references

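The three path templates above can be recognized with regular expressions, for example to route a reference to the right resolver. This is a sketch under assumptions: the character classes allowed for `visit_id`, `call_id`, and the random suffix are guesses, and the sample paths are made up.

```python
import re

# One pattern per template. The ID character classes are assumptions.
PATTERNS = {
    "cctv_clip": re.compile(r"clips/VIS_(?P<visit_id>[A-Za-z0-9-]+)_(?P<suffix>[A-Za-z0-9]+)\.mp4"),
    "call_recording": re.compile(r"rec/CAL_(?P<call_id>[A-Za-z0-9-]+)\.mp3"),
    "transcript": re.compile(r"trx/CAL_(?P<call_id>[A-Za-z0-9-]+)\.json"),
}


def classify_reference(path):
    """Return (kind, captured_ids) for a known placeholder path, else None."""
    for kind, pattern in PATTERNS.items():
        match = pattern.fullmatch(path)
        if match:
            return kind, match.groupdict()
    return None


clip = classify_reference("clips/VIS_0042_ab12cd.mp4")
recording = classify_reference("rec/CAL_0007.mp3")
unknown = classify_reference("rec/CAL_0007.wav")
```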
These are structured metadata only. Actual media payloads are not included.

---

## Synthetic Data Limitations

1. **Names and addresses** are fictional but culturally realistic
2. **Phone numbers** follow Indian format but are not real
3. **Email addresses** are synthetic and non-deliverable
4. **Prices** are representative of Kolkata premium market but approximate
5. **Communication text** is template-generated but contextually coherent
6. **Transcripts** are structured dialogue, not actual ASR output

---

## Acceptance Criteria Verification

| Criterion | Status |
|-----------|--------|
| 250 complete synthetic client graphs | ✅ |
| All 14 project names represented | ✅ |
| Spans CRM, interaction, opportunity, reminder, transcript, enrichment layers | ✅ |
| Files structured for CSV-first import testing | ✅ |
| Human reviewer can inspect a graph and believe it is coherent | ✅ (sample review recommended) |
| Referential integrity across all IDs | ✅ |
| No impossible date ordering | ✅ |
| No orphaned opportunities or interactions | ✅ |
| Every QD artifact points back to plausible evidence | ✅ |

---

## Next Steps

1. **Import Replay:** Load CSVs into the Velocity import pipeline and validate mapping proposals
2. **Client 360 Render:** Use JSON snapshots to test frontend dossier rendering
3. **QD Validation:** Verify score computation logic against interaction density
4. **Oracle Testing:** Use workflow items to test writeback proposal generation
5. **Synthetic Expansion:** Add more projects, cities, or persona types as needed

---

**Generated for:** Project Velocity Founder CRM and Platform Planning
**Canonical Source:** Doc 16 - Coding Agent Swarm Brief: Synthetic Client Graph Generation
**Reviewers:** Sayan, Sourik