Deployment, Operations, and Release Readiness Spec
Date: 2026-04-14
Status: Draft implementation artifact
Purpose: Define how the colony system is deployed, operated, observed, released, and judged ready for internal demos and early sales use.
1. Purpose
The colony cannot be sold if it only exists as a local developer runtime. It needs a disciplined deployment and operating model that fits current Project Velocity infrastructure.
2. Deployment Topology
Recommended topology for Sprint 1:
- FastAPI root remains the public application authority
- TypeScript colony service is an internal service behind the root
- PostgreSQL remains canonical persistence
- Nemoclaw and MCP services remain root-governed append layers
Traffic pattern:
- UI calls FastAPI root
- root authenticates the request and normalizes the mission
- root calls the colony service over a private service boundary
- colony persists artifacts back through root APIs or a root persistence bridge
- root returns status and reviewed output to UI
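The traffic pattern above can be sketched as a single root-side handler. This is a minimal illustration, not the project's actual code: `Mission`, `RootStore`, `authenticate`, `normalize_mission`, and `handle_ui_request` are hypothetical names, the colony is passed in as a plain callable instead of being reached over a private HTTP boundary, and the token table stands in for whatever auth the root really uses.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Mission:
    mission_type: str
    prompt: str
    operator: str


@dataclass
class RootStore:
    """Stand-in for root-owned canonical persistence (hypothetical)."""
    artifacts: Dict[str, List[str]] = field(default_factory=dict)

    def save_artifact(self, mission_id: str, artifact: str) -> None:
        self.artifacts.setdefault(mission_id, []).append(artifact)


def authenticate(token: str, valid_tokens: Dict[str, str]) -> str:
    """Return the operator for a token, or raise PermissionError."""
    if token not in valid_tokens:
        raise PermissionError("unknown token")
    return valid_tokens[token]


def normalize_mission(raw: Dict[str, str], operator: str) -> Mission:
    """Trim and lowercase fields so the colony sees one canonical shape."""
    return Mission(
        mission_type=raw["mission_type"].strip().lower(),
        prompt=raw["prompt"].strip(),
        operator=operator,
    )


def handle_ui_request(
    token: str,
    raw: Dict[str, str],
    valid_tokens: Dict[str, str],
    colony: Callable[[Mission], List[str]],
    store: RootStore,
    mission_id: str,
) -> Dict[str, object]:
    # 1. Root authenticates and normalizes before anything reaches the colony.
    operator = authenticate(token, valid_tokens)
    mission = normalize_mission(raw, operator)
    # 2. Root calls the colony (an in-process callable in this sketch).
    artifacts = colony(mission)
    # 3. Artifacts are persisted through the root-owned store.
    for artifact in artifacts:
        store.save_artifact(mission_id, artifact)
    # 4. Root returns status and output to the UI.
    return {
        "mission_id": mission_id,
        "status": "completed",
        "output": artifacts[-1] if artifacts else None,
    }
```

Passing the colony in as a callable keeps the root as the only component that authenticates callers and touches persistence, which is the property the topology is meant to protect.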
3. Environment Contract
3.1 Root Backend
Required environment values:
- COLONY_SERVICE_BASE_URL
- COLONY_SERVICE_API_KEY
- COLONY_ENABLED
- COLONY_TIMEOUT_MS
- COLONY_DEFAULT_TIME_BUDGET_MS
- COLONY_DEFAULT_TOKEN_BUDGET
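One way the root backend might enforce this contract at startup is to fail fast when any value is missing. The sketch below uses only the six variable names listed above; the `RootColonyConfig` type, the `load_root_config` helper, and the accepted boolean spellings for `COLONY_ENABLED` are assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED_ROOT_VARS = (
    "COLONY_SERVICE_BASE_URL",
    "COLONY_SERVICE_API_KEY",
    "COLONY_ENABLED",
    "COLONY_TIMEOUT_MS",
    "COLONY_DEFAULT_TIME_BUDGET_MS",
    "COLONY_DEFAULT_TOKEN_BUDGET",
)


@dataclass(frozen=True)
class RootColonyConfig:
    base_url: str
    api_key: str
    enabled: bool
    timeout_ms: int
    default_time_budget_ms: int
    default_token_budget: int


def load_root_config(env: Mapping[str, str] = os.environ) -> RootColonyConfig:
    """Read the root-side colony settings, raising if any value is absent."""
    missing = [name for name in REQUIRED_ROOT_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(
            "missing required environment values: " + ", ".join(missing)
        )
    return RootColonyConfig(
        base_url=env["COLONY_SERVICE_BASE_URL"].rstrip("/"),
        api_key=env["COLONY_SERVICE_API_KEY"],
        # Assumed encoding: "1", "true", or "yes" (case-insensitive) mean enabled.
        enabled=env["COLONY_ENABLED"].strip().lower() in ("1", "true", "yes"),
        timeout_ms=int(env["COLONY_TIMEOUT_MS"]),
        default_time_budget_ms=int(env["COLONY_DEFAULT_TIME_BUDGET_MS"]),
        default_token_budget=int(env["COLONY_DEFAULT_TOKEN_BUDGET"]),
    )
```

Reporting every missing name in one error, instead of stopping at the first, tends to shorten the fix cycle when a new environment is being stood up.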
3.2 Colony Service
Required environment values:
- PORT
- ROOT_API_BASE_URL
- ROOT_API_KEY
- DEFAULT_MODEL_ROUTE
- RESEARCH_PROVIDER
- BROWSER_PROVIDER
- MAX_CONCURRENT_MISSIONS
- MAX_WORKERS_PER_MISSION
- MISSION_TIMEOUT_MS
- TASK_TIMEOUT_MS
4. Release Environments
Three environments are required:
- local development
- shared staging
- production
Rules:
- staging must be used for Oracle and CRM mission replay before production enablement
- production must begin with assisted-mode missions only
- no production writeback automation before approval route testing is complete
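The three rules above could be enforced as a guard that runs before any production enablement. The following is a sketch under stated assumptions: `EnablementRequest`, `ReleaseState`, and `check_enablement` are hypothetical names, the mode values `"assisted"` and `"autonomous"` are illustrative, and the assisted-only rule is treated as a hard block for the initial rollout.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class EnablementRequest:
    environment: str            # "local", "staging", or "production"
    mission_type: str           # e.g. "oracle" or "crm"
    mode: str                   # "assisted" or "autonomous" (assumed values)
    writeback_automation: bool


@dataclass(frozen=True)
class ReleaseState:
    staging_replayed: FrozenSet[str]   # mission types already replayed in staging
    approval_route_tested: bool


def check_enablement(req: EnablementRequest, state: ReleaseState) -> List[str]:
    """Return the list of rule violations; an empty list means allowed."""
    violations: List[str] = []
    if req.environment != "production":
        return violations  # the rules above only constrain production
    if req.mission_type in ("oracle", "crm") and req.mission_type not in state.staging_replayed:
        violations.append("mission type has not been replayed in staging")
    if req.mode != "assisted":
        violations.append("production starts with assisted-mode missions only")
    if req.writeback_automation and not state.approval_route_tested:
        violations.append("writeback automation requires completed approval route testing")
    return violations
```

Returning all violations, not just the first, gives the release owner a complete picture of what still blocks enablement.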
5. Observability Requirements
Required operational outputs:
- structured logs
- mission traces
- stage latency metrics
- provider failure metrics
- approval queue metrics
- mission success and failure counters by mission type
Minimum dashboards:
- mission health dashboard
- policy block dashboard
- provider health dashboard
- approval backlog dashboard
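As a rough illustration of the operational outputs listed above, the sketch below emits structured log lines as JSON and keeps in-memory counters keyed by mission type. It uses only the Python standard library; `log_event`, `MissionMetrics`, and the field names are hypothetical, and a real deployment would presumably export these values to whatever metrics backend feeds the dashboards.

```python
import json
import time
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def log_event(event: str, **fields: object) -> str:
    """Emit one structured log line as JSON and return it."""
    record = {"ts": time.time(), "event": event, **fields}
    line = json.dumps(record, sort_keys=True)
    print(line)
    return line


class MissionMetrics:
    """In-memory counters; a stand-in for a real metrics client."""

    def __init__(self) -> None:
        # (mission_type, "success" | "failure") -> count
        self.outcomes: Counter = Counter()
        # provider name -> failure count
        self.provider_failures: Counter = Counter()
        # (mission_type, stage) -> observed latencies in milliseconds
        self.stage_latency_ms: Dict[Tuple[str, str], List[float]] = defaultdict(list)

    def record_stage(self, mission_type: str, stage: str, latency_ms: float) -> None:
        self.stage_latency_ms[(mission_type, stage)].append(latency_ms)

    def record_outcome(self, mission_type: str, succeeded: bool) -> None:
        outcome = "success" if succeeded else "failure"
        self.outcomes[(mission_type, outcome)] += 1

    def record_provider_failure(self, provider: str) -> None:
        self.provider_failures[provider] += 1
```

A call such as `log_event("mission_finished", mission_id="m-1", mission_type="oracle")` produces one machine-parseable line per event, which is what makes mission traces searchable after the fact.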
6. Release Gates
The colony is not release-ready until all gates below pass.
Gate 1: Technical Integrity
- schema applies cleanly
- root and colony health endpoints are green
- one Oracle mission completes in staging
- one CRM mission completes in staging
Gate 2: Governance Integrity
- blocked tool case is denied correctly
- blocked writeback case is denied correctly
- approved writeback case requires explicit operator action
Gate 3: Operational Integrity
- missions are replayable through artifact inspection
- failures preserve enough artifacts for debugging
- approval queue is visible to operators
Gate 4: Sales Integrity
- output is stable enough for live demo
- reviewer packet can explain why the answer is trustworthy
- operator can inspect mission evidence quickly
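The all-gates-must-pass rule can be expressed as a small checklist evaluator. This is a sketch only: the check identifiers are shorthand invented here for the bullets above, and `evaluate_gates` and `is_release_ready` are hypothetical helper names. A check with no recorded result is treated as failed.

```python
from typing import Dict, List

# Shorthand identifiers for the gate bullets above (names are assumptions).
RELEASE_GATES: Dict[str, List[str]] = {
    "technical": [
        "schema_applies_cleanly",
        "health_endpoints_green",
        "oracle_mission_completes_in_staging",
        "crm_mission_completes_in_staging",
    ],
    "governance": [
        "blocked_tool_denied",
        "blocked_writeback_denied",
        "approved_writeback_requires_operator_action",
    ],
    "operational": [
        "missions_replayable",
        "failures_preserve_artifacts",
        "approval_queue_visible",
    ],
    "sales": [
        "output_stable_for_live_demo",
        "reviewer_packet_explains_trust",
        "operator_can_inspect_evidence_quickly",
    ],
}


def evaluate_gates(results: Dict[str, bool]) -> Dict[str, List[str]]:
    """Return failing checks grouped by gate; missing results count as failures."""
    failures: Dict[str, List[str]] = {}
    for gate, checks in RELEASE_GATES.items():
        failed = [check for check in checks if not results.get(check, False)]
        if failed:
            failures[gate] = failed
    return failures


def is_release_ready(results: Dict[str, bool]) -> bool:
    """The colony is release-ready only when no gate has a failing check."""
    return not evaluate_gates(results)
```

Treating an unrecorded check as a failure keeps the default conservative: a gate cannot pass simply because nobody ran it.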
7. Failure Runbooks
Documented responses are required for:
- colony service unavailable
- root-to-colony authentication failure
- provider outage
- malformed contract payload
- approval queue backlog
- stuck mission in non-terminal state
Required runtime behavior:
- root returns structured degraded-state response
- mission remains auditable
- operator can mark mission failed or replay it
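A structured degraded-state response might look like the sketch below. The failure kinds mirror the runbook list above, but the payload shape, the `FailureKind` enum, the `degraded_response` helper, and the `mark_failed` / `replay` action names are assumptions, not a defined contract. The audit log is a plain list standing in for root persistence.

```python
from enum import Enum
from typing import Dict, List


class FailureKind(str, Enum):
    COLONY_UNAVAILABLE = "colony_unavailable"
    AUTH_FAILURE = "root_to_colony_auth_failure"
    PROVIDER_OUTAGE = "provider_outage"
    MALFORMED_PAYLOAD = "malformed_contract_payload"
    APPROVAL_BACKLOG = "approval_queue_backlog"
    STUCK_MISSION = "stuck_mission"


def degraded_response(
    mission_id: str,
    kind: FailureKind,
    detail: str,
    audit_log: List[Dict[str, object]],
) -> Dict[str, object]:
    """Build the degraded-state payload and record it so the mission stays auditable."""
    response: Dict[str, object] = {
        "status": "degraded",
        "mission_id": mission_id,
        "failure_kind": kind.value,
        "detail": detail,
        # Actions the operator can take next, per the required behavior above.
        "operator_actions": ["mark_failed", "replay"],
    }
    # Store a copy so later changes to the response do not alter the audit trail.
    audit_log.append(dict(response))
    return response
```

Recording the same payload that is returned to the UI means the operator and the audit trail never disagree about what the caller was told.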
8. Rollout Strategy
Recommended rollout:
- enable health and dry-run mission creation
- enable Oracle assisted missions in staging
- enable CRM assisted missions in staging
- enable Oracle assisted missions in production
- enable CRM assisted missions in production
- enable Catalyst strategy missions in staging
Do not enable autonomous external research by default in production on day one.
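If the rollout is treated as a strict sequence, it can be encoded as an ordered list with a check that no step is enabled before its predecessors. The step identifiers, the flag name, and the `can_enable` and `next_step` helpers below are hypothetical, and strict ordering is an assumption layered on top of the recommended order above.

```python
from typing import Dict, List, Optional, Set

# Steps in the recommended order (identifiers are assumptions).
ROLLOUT_ORDER: List[str] = [
    "health_and_dry_run",
    "oracle_assisted_staging",
    "crm_assisted_staging",
    "oracle_assisted_production",
    "crm_assisted_production",
    "catalyst_strategy_staging",
]

# Autonomous external research stays off by default in production.
DEFAULT_FLAGS: Dict[str, bool] = {
    "autonomous_external_research_production": False,
}


def can_enable(step: str, completed: Set[str]) -> bool:
    """A step may be enabled only when every earlier step is complete."""
    index = ROLLOUT_ORDER.index(step)  # raises ValueError for unknown steps
    return all(earlier in completed for earlier in ROLLOUT_ORDER[:index])


def next_step(completed: Set[str]) -> Optional[str]:
    """Return the first step not yet completed, or None when rollout is done."""
    for step in ROLLOUT_ORDER:
        if step not in completed:
            return step
    return None
```

For example, with only `health_and_dry_run` complete, `can_enable("oracle_assisted_production", ...)` is false because the staging steps have not run yet.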
9. Sales Readiness Criteria
The system is sellable for early Project Velocity demos only if:
- one Oracle mission reliably returns project-aware reviewed output
- one CRM mission reliably returns lead intelligence with evidence trail
- operator can show auditability in under two minutes
- approval workflow prevents accidental mutations
- mission failures are graceful and legible
It is not sellable if:
- output quality depends on manual developer intervention
- mission replay is impossible
- provider outages create silent failure
- writebacks can happen without approval
10. Ownership Model
Operational ownership should be split clearly:
- root owner: backend routes, auth, persistence, approval flows
- colony owner: runtime, workers, orchestration behavior
- policy owner: governance, model routing, tool permissions
- product owner: mission definitions, demo scenarios, release decision
11. Ticket Breakdown
- define environment contracts
- implement health checks
- add mission trace and metrics
- create staging rollout checklist
- create production assisted-mode rollout checklist
- add approval queue observability
- document runbooks for outage and failed mission recovery
12. Bottom Line
The colony becomes commercially usable only when it is deployable, inspectable, and fail-safe. Release readiness is not a polish task. It is part of the core product contract.