Deployment, Operations, and Release Readiness Spec

Date: 2026-04-14
Status: Draft implementation artifact
Purpose: Define how the colony system is deployed, operated, observed, released, and judged ready for internal demos and early sales use.

1. Purpose

The colony cannot be sold if it only exists as a local developer runtime. It needs a disciplined deployment and operating model that fits current Project Velocity infrastructure.

2. Deployment Topology

Recommended topology for Sprint 1:

  • FastAPI root remains the public application authority
  • TypeScript colony service is an internal service behind the root
  • PostgreSQL remains canonical persistence
  • Nemoclaw and MCP services remain root-governed append layers

Traffic pattern:

  1. UI calls FastAPI root
  2. root authenticates and normalizes mission
  3. root calls colony service over private service boundary
  4. colony persists artifacts back through root APIs or root persistence bridge
  5. root returns status and reviewed output to UI
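
The traffic pattern above can be sketched as a single root-side handler. This is a minimal illustration under stated assumptions, not the project's implementation: `handle_mission`, `normalize_mission`, the injected `authenticate` and `call_colony` callables, and every payload field name are hypothetical. The colony call is injected so the sketch runs without a network.

```python
from typing import Any, Callable, Dict

Payload = Dict[str, Any]


def normalize_mission(raw: Payload) -> Payload:
    """Coerce UI input into one canonical shape (field names are assumed)."""
    return {
        "mission_type": str(raw.get("mission_type", "")).strip().lower(),
        "prompt": str(raw.get("prompt", "")).strip(),
    }


def handle_mission(
    raw: Payload,
    token: str,
    authenticate: Callable[[str], bool],
    call_colony: Callable[[Payload], Payload],
) -> Payload:
    # Step 2: the root authenticates before anything reaches the colony.
    if not authenticate(token):
        return {"status": "rejected", "reason": "unauthenticated"}
    mission = normalize_mission(raw)
    # Step 3: the root is the only caller of the internal colony service.
    result = call_colony(mission)
    # Step 5: the root, not the colony, answers the UI.
    return {
        "status": result.get("status", "unknown"),
        "output": result.get("reviewed_output"),
    }
```

Step 4 (artifact persistence back through the root) is deliberately left out of the sketch; it would happen inside the colony service during `call_colony`.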

3. Environment Contract

3.1 Root Backend

Required environment values:

  • COLONY_SERVICE_BASE_URL
  • COLONY_SERVICE_API_KEY
  • COLONY_ENABLED
  • COLONY_TIMEOUT_MS
  • COLONY_DEFAULT_TIME_BUDGET_MS
  • COLONY_DEFAULT_TOKEN_BUDGET
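
One way the root could enforce this contract at startup is to fail fast when any value is missing. The sketch below assumes value types the spec does not state (a boolean for COLONY_ENABLED, integers for the timeout and budgets); `ColonyClientSettings` and `load_colony_settings` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Mapping

REQUIRED_KEYS = (
    "COLONY_SERVICE_BASE_URL",
    "COLONY_SERVICE_API_KEY",
    "COLONY_ENABLED",
    "COLONY_TIMEOUT_MS",
    "COLONY_DEFAULT_TIME_BUDGET_MS",
    "COLONY_DEFAULT_TOKEN_BUDGET",
)


@dataclass(frozen=True)
class ColonyClientSettings:
    base_url: str
    api_key: str
    enabled: bool
    timeout_ms: int
    default_time_budget_ms: int
    default_token_budget: int


def load_colony_settings(env: Mapping[str, str]) -> ColonyClientSettings:
    """Build settings from an environment mapping, rejecting missing values."""
    missing = [k for k in REQUIRED_KEYS if not env.get(k, "").strip()]
    if missing:
        raise ValueError("missing required environment values: " + ", ".join(missing))
    return ColonyClientSettings(
        base_url=env["COLONY_SERVICE_BASE_URL"].rstrip("/"),
        api_key=env["COLONY_SERVICE_API_KEY"],
        # Assumed truthy spellings; the spec does not define the format.
        enabled=env["COLONY_ENABLED"].strip().lower() in ("1", "true", "yes"),
        timeout_ms=int(env["COLONY_TIMEOUT_MS"]),
        default_time_budget_ms=int(env["COLONY_DEFAULT_TIME_BUDGET_MS"]),
        default_token_budget=int(env["COLONY_DEFAULT_TOKEN_BUDGET"]),
    )
```

In a real process this would be called as `load_colony_settings(os.environ)`; taking a mapping keeps the function testable without touching the process environment.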

3.2 Colony Service

Required environment values:

  • PORT
  • ROOT_API_BASE_URL
  • ROOT_API_KEY
  • DEFAULT_MODEL_ROUTE
  • RESEARCH_PROVIDER
  • BROWSER_PROVIDER
  • MAX_CONCURRENT_MISSIONS
  • MAX_WORKERS_PER_MISSION
  • MISSION_TIMEOUT_MS
  • TASK_TIMEOUT_MS
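
A deploy-time sanity check over these values might look like the sketch below. It is written in Python purely for illustration (the colony service itself is TypeScript). The rules it enforces beyond presence, namely that the numeric values are positive integers and that TASK_TIMEOUT_MS does not exceed MISSION_TIMEOUT_MS, are assumptions rather than requirements stated in this spec, and `check_colony_env` is a hypothetical name.

```python
from typing import Dict, List, Mapping

STRING_KEYS = (
    "ROOT_API_BASE_URL",
    "ROOT_API_KEY",
    "DEFAULT_MODEL_ROUTE",
    "RESEARCH_PROVIDER",
    "BROWSER_PROVIDER",
)
NUMERIC_KEYS = (
    "PORT",
    "MAX_CONCURRENT_MISSIONS",
    "MAX_WORKERS_PER_MISSION",
    "MISSION_TIMEOUT_MS",
    "TASK_TIMEOUT_MS",
)


def check_colony_env(env: Mapping[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means the contract is met."""
    problems: List[str] = []
    for key in STRING_KEYS:
        if not env.get(key, "").strip():
            problems.append(f"{key} is missing or empty")
    numbers: Dict[str, int] = {}
    for key in NUMERIC_KEYS:
        value = env.get(key, "").strip()
        if not value.isdecimal() or int(value) <= 0:
            problems.append(f"{key} must be a positive integer")
        else:
            numbers[key] = int(value)
    # Assumed rule: a single task should not outlive its mission.
    if (
        "TASK_TIMEOUT_MS" in numbers
        and "MISSION_TIMEOUT_MS" in numbers
        and numbers["TASK_TIMEOUT_MS"] > numbers["MISSION_TIMEOUT_MS"]
    ):
        problems.append("TASK_TIMEOUT_MS exceeds MISSION_TIMEOUT_MS")
    return problems
```

Returning a list of problems instead of raising on the first one lets a deployment pipeline report every misconfiguration in a single pass.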

4. Release Environments

Three environments are required:

  • local development
  • shared staging
  • production

Rules:

  • staging must be used for Oracle and CRM mission replay before production enablement
  • production must begin with assisted-mode missions only
  • no production writeback automation before approval route testing is complete
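
The three rules above can be expressed as a pre-enablement check. The following is a minimal sketch: `ReleaseState`, `production_blockers`, and the mode string `"assisted"` are hypothetical, and applying the staging-replay rule to any mission type (not only Oracle and CRM) is an assumption.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class ReleaseState:
    replayed_in_staging: FrozenSet[str]  # mission types already replayed in staging
    approval_routes_tested: bool


def production_blockers(
    mission_type: str, mode: str, writeback_automation: bool, state: ReleaseState
) -> List[str]:
    """List which rules a production enablement request would violate."""
    blockers: List[str] = []
    if mission_type not in state.replayed_in_staging:
        blockers.append(f"{mission_type} missions have not been replayed in staging")
    if mode != "assisted":
        blockers.append("production must begin with assisted-mode missions only")
    if writeback_automation and not state.approval_routes_tested:
        blockers.append("writeback automation requires completed approval route testing")
    return blockers
```

An empty list means none of the three rules blocks the request; anything else is a reason to keep the mission type disabled in production.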

5. Observability Requirements

Required operational outputs:

  • structured logs
  • mission traces
  • stage latency metrics
  • provider failure metrics
  • approval queue metrics
  • mission success and failure counters by mission type
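
As an illustration of the first and last items, the sketch below emits one structured log line per finished mission and keeps success and failure counters keyed by mission type. The event name, the field names, and the in-memory `Counter` (a stand-in for whatever metrics backend is actually used) are all assumptions.

```python
import json
from collections import Counter
from typing import Dict

# (mission_type, outcome) -> count; a stand-in for a real metrics backend.
mission_counters: Counter = Counter()


def record_mission_outcome(
    mission_id: str,
    mission_type: str,
    outcome: str,
    stage_latency_ms: Dict[str, float],
) -> str:
    """Update the per-type counter and return one structured (JSON) log line."""
    if outcome not in ("success", "failure"):
        raise ValueError(f"unknown outcome: {outcome}")
    mission_counters[(mission_type, outcome)] += 1
    return json.dumps(
        {
            "event": "mission_finished",
            "mission_id": mission_id,
            "mission_type": mission_type,
            "outcome": outcome,
            "stage_latency_ms": stage_latency_ms,
        },
        sort_keys=True,
    )
```

Keeping the per-stage latencies in the same record as the outcome means a single log line is enough to feed both the latency metrics and the mission health dashboard.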

Minimum dashboards:

  • mission health dashboard
  • policy block dashboard
  • provider health dashboard
  • approval backlog dashboard

6. Release Gates

The colony is not release-ready until all gates below pass.

Gate 1: Technical Integrity

  • schema applies cleanly
  • root and colony health endpoints are green
  • one Oracle mission completes in staging
  • one CRM mission completes in staging

Gate 2: Governance Integrity

  • blocked tool case is denied correctly
  • blocked writeback case is denied correctly
  • approved writeback case requires explicit operator action

Gate 3: Operational Integrity

  • missions are replayable through artifact inspection
  • failures preserve enough artifacts for debugging
  • approval queue is visible to operators

Gate 4: Sales Integrity

  • output is stable enough for live demo
  • reviewer packet can explain why the answer is trustworthy
  • operator can inspect mission evidence quickly
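
The all-gates-must-pass rule can be made mechanical with a small checklist evaluator. This is a sketch only: the check identifiers are hypothetical shorthand for the bullets above, and treating a check with no recorded result as a failure is a design assumption.

```python
from typing import Dict, List, Mapping

# Check identifiers are hypothetical; the grouping mirrors the four gates above.
GATES: Dict[str, List[str]] = {
    "technical": ["schema_applies", "health_green", "oracle_staging_run", "crm_staging_run"],
    "governance": ["blocked_tool_denied", "blocked_writeback_denied", "writeback_needs_operator"],
    "operational": ["missions_replayable", "failure_artifacts_kept", "approval_queue_visible"],
    "sales": ["demo_stable", "reviewer_packet_explains", "evidence_inspectable"],
}


def failing_checks(results: Mapping[str, bool]) -> Dict[str, List[str]]:
    """Map each gate to its failing checks; a check with no recorded result fails."""
    failures: Dict[str, List[str]] = {}
    for gate, checks in GATES.items():
        failed = [c for c in checks if not results.get(c, False)]
        if failed:
            failures[gate] = failed
    return failures


def release_ready(results: Mapping[str, bool]) -> bool:
    """The colony is release-ready only when no gate has a failing check."""
    return not failing_checks(results)
```

Defaulting unknown checks to failed keeps the evaluator fail-safe: a gate cannot pass merely because nobody ran its checks.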

7. Failure Runbooks

Documented responses are required for:

  • colony service unavailable
  • root-to-colony authentication failure
  • provider outage
  • malformed contract payload
  • approval queue backlog
  • stuck mission in non-terminal state
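
For the last case, a runbook needs some way to find stuck missions in the first place. The sketch below flags missions that are in a non-terminal state and have not been updated within a threshold. The state names, the record fields (`id`, `state`, `updated_at`), and the idea of an idle threshold are assumptions, not part of this spec.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Iterable, List, Mapping

TERMINAL_STATES = frozenset({"completed", "failed", "cancelled"})  # assumed names


def find_stuck_missions(
    missions: Iterable[Mapping[str, Any]], now: datetime, max_idle: timedelta
) -> List[str]:
    """Return ids of missions that are non-terminal and idle longer than max_idle."""
    stuck: List[str] = []
    for mission in missions:
        if mission["state"] in TERMINAL_STATES:
            continue
        if now - mission["updated_at"] > max_idle:
            stuck.append(mission["id"])
    return stuck


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [{"id": "m1", "state": "running", "updated_at": now - timedelta(hours=1)}]
    print(find_stuck_missions(sample, now, timedelta(minutes=30)))
```

Passing `now` in explicitly, instead of reading the clock inside the function, makes the check deterministic and easy to test.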

Required runtime behavior:

  • root returns structured degraded-state response
  • mission remains auditable
  • operator can mark mission failed or replay it
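
A structured degraded-state response could take a shape like the one sketched here. The reason codes mirror the runbook cases listed above, but their identifiers, the response field names, and the `degraded_response` helper are hypothetical.

```python
from typing import Any, Dict, Optional

# Reason codes mirror the runbook cases; the identifiers are hypothetical.
DEGRADED_REASONS = {
    "colony_unavailable": "colony service unavailable",
    "colony_auth_failed": "root-to-colony authentication failure",
    "provider_outage": "provider outage",
    "malformed_payload": "malformed contract payload",
    "approval_backlog": "approval queue backlog",
    "mission_stuck": "stuck mission in non-terminal state",
}


def degraded_response(reason: str, mission_id: Optional[str] = None) -> Dict[str, Any]:
    """Build the body the root returns instead of failing silently."""
    if reason not in DEGRADED_REASONS:
        raise ValueError(f"unknown degraded reason: {reason}")
    return {
        "status": "degraded",
        "reason": reason,
        "detail": DEGRADED_REASONS[reason],
        # Keeping the id in the response is what keeps the mission auditable.
        "mission_id": mission_id,
        # Operator actions only make sense once a mission record exists.
        "operator_actions": ["mark_failed", "replay"] if mission_id else [],
    }
```

Using a closed set of reason codes means each code can be linked to exactly one runbook, and an unexpected failure mode surfaces as an error during development instead of as an unexplained response in production.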

8. Rollout Strategy

Recommended rollout:

  1. enable health and dry-run mission creation
  2. enable Oracle assisted missions in staging
  3. enable CRM assisted missions in staging
  4. enable Oracle assisted missions in production
  5. enable CRM assisted missions in production
  6. enable Catalyst strategy missions in staging

Do not enable autonomous external research by default in production on day one.
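
The rollout order can be encoded as data so that enablement is a lookup instead of a judgment call. In this sketch the step numbers follow the list above, while the mission type keys, the environment names, and the function names are hypothetical.

```python
from typing import Dict, Tuple

DRY_RUN_STEP = 1  # step 1: health checks and dry-run mission creation

# (mission_type, environment) -> rollout step at which it becomes enabled.
ENABLED_AT_STEP: Dict[Tuple[str, str], int] = {
    ("oracle", "staging"): 2,
    ("crm", "staging"): 3,
    ("oracle", "production"): 4,
    ("crm", "production"): 5,
    ("catalyst", "staging"): 6,
}


def dry_run_enabled(current_step: int) -> bool:
    return current_step >= DRY_RUN_STEP


def missions_enabled(mission_type: str, environment: str, current_step: int) -> bool:
    """True once the rollout has reached the step that enables this combination."""
    step = ENABLED_AT_STEP.get((mission_type, environment))
    # Combinations absent from the plan (e.g. Catalyst in production) stay off.
    return step is not None and current_step >= step
```

Because unlisted combinations always return `False`, anything the rollout plan does not mention is disabled by default, which matches the cautious posture of the final rule above.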

9. Sales Readiness Criteria

The system is sellable for early Project Velocity demos only if:

  • one Oracle mission reliably returns project-aware reviewed output
  • one CRM mission reliably returns lead intelligence with evidence trail
  • operator can show auditability in under two minutes
  • approval workflow prevents accidental mutations
  • mission failures are graceful and legible

It is not sellable if:

  • output quality depends on manual developer intervention
  • mission replay is impossible
  • provider outages create silent failure
  • writebacks can happen without approval

10. Ownership Model

Operational ownership should be split clearly:

  • root owner: backend routes, auth, persistence, approval flows
  • colony owner: runtime, workers, orchestration behavior
  • policy owner: governance, model routing, tool permissions
  • product owner: mission definitions, demo scenarios, release decision

11. Ticket Breakdown

  1. define environment contracts
  2. implement health checks
  3. add mission trace and metrics
  4. create staging rollout checklist
  5. create production assisted-mode rollout checklist
  6. add approval queue observability
  7. document runbooks for outage and failed mission recovery

12. Bottom Line

The colony becomes commercially usable only when it is deployable, inspectable, and fail-safe. Release readiness is not a polish task. It is part of the core product contract.