Project_Velocity/docs/KIMI_SYNTHETIC_DATA_DOWNSTREAM_PLAN.md
sayan 84e439712c feat/#24 WebOS Completion (#25)
Co-authored-by: Sayan Datta <sayan@Sayans-MacBook-Air.local>
Reviewed-on: #25
2026-04-18 18:59:04 +05:30

Kimi Synthetic Data Downstream Plan

Goal

Use the Oracle template taxonomy as the control surface for generating structured synthetic examples. The examples should be replayable into analytics, training, QA, and demo environments without coupling generation logic to any single UI surface.

Inputs

  • backend/oracle/oracle_template_seed_db.json for chapters, subchapters, and exemplar prompts
  • schema_extension_v2.sql tables for templates, synthetic jobs, and auditability
  • Admin surface actions for publish, archive, trigger, and cancel workflows

Downstream stages

  1. Template selection
    • Admin or operator selects a published template chapter and revision.
    • The request binds tenant, locale, target channel, and generation volume.
  2. Prompt expansion
    • Seed examples are expanded into structured prompt packs.
    • Each pack should preserve chapter lineage and example provenance.
  3. Synthetic generation
    • Queue work into oracle_synthetic_generation_jobs.
    • Persist an idempotency key with each job so reruns can be traced and do not publish duplicates.
  4. Validation and scoring
    • Run schema validation on every generated artifact.
    • Score each artifact for completeness, realism, and chapter coverage.
  5. Distribution
    • Publish accepted outputs to analytics sandboxes, QA fixtures, or demonstration bundles.
    • Keep rejected artifacts attached to the job for review rather than dropping them silently.
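
Stages 3 to 5 can be illustrated with a minimal Python sketch. It assumes the idempotency key is a hash of the canonicalised request and that an artifact is a plain dict; the `REQUIRED_FIELDS` tuple and the function names `idempotency_key`, `validate`, and `partition` are hypothetical and are not defined anywhere in this plan.

```python
import hashlib
import json

# Hypothetical artifact schema; the real one would come from schema validation rules.
REQUIRED_FIELDS = ("chapter_key", "subchapter_key", "body")


def idempotency_key(request: dict) -> str:
    """Derive a stable key so a rerun of the same request maps to the same key."""
    # Sorting keys makes the serialisation independent of dict insertion order.
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def validate(artifact: dict) -> list:
    """Return a list of schema problems; an empty list means the artifact passes."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in artifact]
    if "body" in artifact and not str(artifact["body"]).strip():
        problems.append("empty body")
    return problems


def partition(artifacts: list) -> tuple:
    """Split artifacts into accepted and rejected, keeping rejects with their reasons."""
    accepted, rejected = [], []
    for artifact in artifacts:
        problems = validate(artifact)
        if problems:
            # Rejected artifacts stay attached for review instead of being dropped.
            rejected.append({"artifact": artifact, "problems": problems})
        else:
            accepted.append(artifact)
    return accepted, rejected
```

Keeping the rejection reasons next to each rejected artifact is one way to satisfy the review requirement in stage 5; scoring for realism and coverage would be a separate step layered on top of `validate`.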

Contract shape

  • Request
    • template_id
    • template_revision
    • chapter_key
    • subchapter_key
    • tenant_id
    • locale
    • record_count
    • target_surface
  • Result
    • job_id
    • status
    • accepted_records
    • rejected_records
    • output_manifest
    • lineage
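
The contract above could be expressed as typed records. The following is a sketch only: the field names come from the lists above, but every type (for example, `template_revision` as an integer, or `accepted_records` as a count instead of a list) and the positive `record_count` check are assumptions, as are the class names `GenerationRequest` and `GenerationResult`.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GenerationRequest:
    """Request half of the contract; all field types are assumed."""
    template_id: str
    template_revision: int
    chapter_key: str
    subchapter_key: str
    tenant_id: str
    locale: str
    record_count: int
    target_surface: str

    def __post_init__(self):
        # Assumed rule: a job must ask for at least one record.
        if self.record_count <= 0:
            raise ValueError("record_count must be positive")


@dataclass
class GenerationResult:
    """Result half of the contract; counts and dict payloads are assumed shapes."""
    job_id: str
    status: str
    accepted_records: int = 0
    rejected_records: int = 0
    output_manifest: dict = field(default_factory=dict)
    lineage: dict = field(default_factory=dict)
```

Freezing the request makes it safe to hash or log once it has been bound to a job, while the result stays mutable so a worker can update counts as it progresses.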

Guardrails

  • Only published templates can be used for production synthetic jobs.
  • Every output record must retain template lineage metadata.
  • Cancelled jobs remain queryable from the admin surface.
  • Generated content should never overwrite operator-authored production data.
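
As an illustration, three of these guardrails can be written as small pre-flight checks. This sketch assumes a template carries a status string with the value `"published"`, that lineage lives under a `lineage` key on each record, and that existing rows carry an `origin` marker; `GuardrailViolation`, `LINEAGE_FIELDS`, and the three function names are hypothetical.

```python
from typing import Optional


class GuardrailViolation(Exception):
    """Raised when a job or record would break one of the guardrails."""


# Assumed minimum lineage metadata; the real set is not specified in this plan.
LINEAGE_FIELDS = ("template_id", "template_revision", "chapter_key")


def check_template_usable(template_status: str, production: bool) -> None:
    """Only published templates may back production synthetic jobs."""
    if production and template_status != "published":
        raise GuardrailViolation(
            f"template status {template_status!r} is not allowed for production jobs"
        )


def check_lineage(record: dict) -> None:
    """Every output record must retain template lineage metadata."""
    lineage = record.get("lineage") or {}
    missing = [name for name in LINEAGE_FIELDS if name not in lineage]
    if missing:
        raise GuardrailViolation(f"record is missing lineage fields: {missing}")


def can_write(existing: Optional[dict]) -> bool:
    """Generated content may fill an empty slot or replace earlier synthetic data only."""
    if existing is None:
        return True
    # Assumed convention: rows are tagged with where they came from.
    return existing.get("origin") == "synthetic"
```

Running these checks before distribution, not after, keeps a violating record from ever reaching a downstream environment.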

Immediate next step

Implement a background worker that consumes pending rows from oracle_synthetic_generation_jobs, writes structured manifests, and exposes completion state through the admin surface queue endpoints.
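
A single iteration of such a worker might look like the sketch below. It uses the standard-library `sqlite3` module purely for illustration; the actual database, the column names (`id`, `status`, `payload`, `manifest`), and the status values other than pending are assumptions, and `generate` stands in for whatever produces the manifest.

```python
import json
import sqlite3

TABLE = "oracle_synthetic_generation_jobs"


def claim_next_job(conn: sqlite3.Connection):
    """Claim the oldest pending job; returns (job_id, payload) or None."""
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            f"SELECT id, payload FROM {TABLE} WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        job_id, payload = row
        # The status guard means a job already claimed elsewhere is not claimed twice.
        updated = conn.execute(
            f"UPDATE {TABLE} SET status = 'running' WHERE id = ? AND status = 'pending'",
            (job_id,),
        )
        if updated.rowcount != 1:
            return None
    return job_id, json.loads(payload)


def finish_job(conn: sqlite3.Connection, job_id: int, status: str, manifest: dict) -> None:
    """Record the final state and the structured manifest for the admin queue to read."""
    with conn:
        conn.execute(
            f"UPDATE {TABLE} SET status = ?, manifest = ? WHERE id = ?",
            (status, json.dumps(manifest), job_id),
        )


def run_once(conn: sqlite3.Connection, generate) -> bool:
    """Process at most one pending job; returns False when the queue is empty."""
    claimed = claim_next_job(conn)
    if claimed is None:
        return False
    job_id, payload = claimed
    try:
        finish_job(conn, job_id, "completed", generate(payload))
    except Exception as exc:
        # Failed jobs keep their error so they remain reviewable.
        finish_job(conn, job_id, "failed", {"error": str(exc)})
    return True
```

A real worker would call `run_once` in a loop with a sleep or notification between empty polls, and would also need to honour the cancel action exposed by the admin surface.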