# Deployment, Operations, and Release Readiness Spec

**Date:** 2026-04-14
**Status:** Draft implementation artifact
**Purpose:** Define how the colony system is deployed, operated, observed, released, and judged ready for internal demos and early sales use.

## 1. Purpose

The colony cannot be sold if it exists only as a local developer runtime. It needs a disciplined deployment and operating model that fits the current Project Velocity infrastructure.

## 2. Deployment Topology

Recommended topology for Sprint 1:

- FastAPI root remains the public application authority
- TypeScript colony service is an internal service behind the root
- PostgreSQL remains canonical persistence
- Nemoclaw and MCP services remain root-governed append layers

Traffic pattern:

1. UI calls FastAPI root
2. root authenticates the request and normalizes the mission
3. root calls colony service over a private service boundary
4. colony persists artifacts back through root APIs or the root persistence bridge
5. root returns status and reviewed output to UI
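
The traffic pattern above can be sketched as a small in-memory model. This is only an illustration under stated assumptions: `RootGateway`, `Mission`, `fake_colony`, the token lookup, and the response shape are hypothetical names, not the real root or colony interfaces, and step 4 is simplified so that the colony hands artifacts back to the root, which stores them.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Mission:
    """Normalized mission passed from the root to the colony (hypothetical shape)."""
    mission_type: str
    payload: dict
    user_id: str


@dataclass
class RootGateway:
    """Stand-in for the FastAPI root; not the real service."""
    valid_tokens: dict                      # assumed auth store: token -> user id
    call_colony: Callable[[Mission], dict]  # the private service boundary
    artifacts: list = field(default_factory=list)  # stands in for PostgreSQL

    def handle(self, token: str, raw: dict) -> dict:
        # Step 2: authenticate the request and normalize the mission.
        user_id = self.valid_tokens.get(token)
        if user_id is None:
            return {"status": "rejected", "reason": "unauthenticated"}
        mission = Mission(
            mission_type=str(raw.get("mission_type", "")).strip().lower(),
            payload=dict(raw.get("payload", {})),
            user_id=user_id,
        )
        # Step 3: call the colony over the private boundary.
        result = self.call_colony(mission)
        # Step 4 (simplified): artifacts are persisted through the root.
        self.artifacts.extend(result.get("artifacts", []))
        # Step 5: return status and reviewed output to the UI.
        return {"status": result.get("status", "unknown"),
                "output": result.get("output")}


def fake_colony(mission: Mission) -> dict:
    """Placeholder colony that returns one artifact per mission."""
    return {
        "status": "completed",
        "output": f"reviewed:{mission.mission_type}",
        "artifacts": [{"kind": "trace", "mission_type": mission.mission_type}],
    }
```

The point of the sketch is that the UI never reaches the colony directly: every request passes through the root for authentication, and every artifact ends up in root-owned storage.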

## 3. Environment Contract

### 3.1 Root Backend

Required environment values:

- `COLONY_SERVICE_BASE_URL`
- `COLONY_SERVICE_API_KEY`
- `COLONY_ENABLED`
- `COLONY_TIMEOUT_MS`
- `COLONY_DEFAULT_TIME_BUDGET_MS`
- `COLONY_DEFAULT_TOKEN_BUDGET`
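
One way the root could enforce this contract at startup is to fail fast when a value is missing. The sketch below is a minimal example, not the actual root code: the `RootColonyConfig` class, the `load_root_colony_config` helper, and the value formats (integers for the `_MS` and budget values, `1`/`true`/`yes` for `COLONY_ENABLED`) are assumptions, since the spec only names the variables.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Names taken from the list above; their value formats are assumed here.
REQUIRED_ROOT_VALUES = (
    "COLONY_SERVICE_BASE_URL",
    "COLONY_SERVICE_API_KEY",
    "COLONY_ENABLED",
    "COLONY_TIMEOUT_MS",
    "COLONY_DEFAULT_TIME_BUDGET_MS",
    "COLONY_DEFAULT_TOKEN_BUDGET",
)


@dataclass(frozen=True)
class RootColonyConfig:
    base_url: str
    api_key: str
    enabled: bool
    timeout_ms: int
    default_time_budget_ms: int
    default_token_budget: int


def load_root_colony_config(env: Mapping[str, str] = os.environ) -> RootColonyConfig:
    """Read the root-side colony settings, raising if any value is absent or empty."""
    missing = [name for name in REQUIRED_ROOT_VALUES if not env.get(name)]
    if missing:
        raise RuntimeError("missing required environment values: " + ", ".join(missing))
    return RootColonyConfig(
        base_url=env["COLONY_SERVICE_BASE_URL"].rstrip("/"),
        api_key=env["COLONY_SERVICE_API_KEY"],
        # Assumed truthy spellings; the spec does not define the format.
        enabled=env["COLONY_ENABLED"].strip().lower() in ("1", "true", "yes"),
        timeout_ms=int(env["COLONY_TIMEOUT_MS"]),
        default_time_budget_ms=int(env["COLONY_DEFAULT_TIME_BUDGET_MS"]),
        default_token_budget=int(env["COLONY_DEFAULT_TOKEN_BUDGET"]),
    )
```

Passing the environment as a mapping keeps the loader testable without touching the real process environment.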

### 3.2 Colony Service

Required environment values:

- `PORT`
- `ROOT_API_BASE_URL`
- `ROOT_API_KEY`
- `DEFAULT_MODEL_ROUTE`
- `RESEARCH_PROVIDER`
- `BROWSER_PROVIDER`
- `MAX_CONCURRENT_MISSIONS`
- `MAX_WORKERS_PER_MISSION`
- `MISSION_TIMEOUT_MS`
- `TASK_TIMEOUT_MS`

## 4. Release Environments

Three environments are required:

- local development
- shared staging
- production

Rules:

- staging must be used for Oracle and CRM mission replay before production enablement
- production must begin with assisted-mode missions only
- no production writeback automation before approval route testing is complete
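
The three rules above could be checked mechanically before a mission type is switched on in production. The following is a rough sketch under assumptions: the function name, its parameters, and the mode label `"assisted"` are hypothetical, and the real enablement mechanism is not described in this spec.

```python
from typing import List, Set


def production_violations(
    mission_type: str,
    mode: str,
    writeback_automation: bool,
    replayed_in_staging: Set[str],
    approval_routes_tested: bool,
) -> List[str]:
    """Return the rules a proposed production enablement would break (empty if none)."""
    violations = []
    # Rule 1: Oracle and CRM missions must be replayed in staging first.
    if mission_type in ("oracle", "crm") and mission_type not in replayed_in_staging:
        violations.append(f"{mission_type} missions have not been replayed in staging")
    # Rule 2: production starts with assisted-mode missions only.
    if mode != "assisted":
        violations.append("production must begin with assisted-mode missions only")
    # Rule 3: no writeback automation until approval routes are tested.
    if writeback_automation and not approval_routes_tested:
        violations.append("writeback automation requires completed approval route testing")
    return violations
```

Returning the full list of violations, rather than stopping at the first, lets a rollout checklist show everything that still blocks enablement.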

## 5. Observability Requirements

Required operational outputs:

- structured logs
- mission traces
- stage latency metrics
- provider failure metrics
- approval queue metrics
- mission success and failure counters by mission type
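
As an illustration of two of these outputs, the sketch below emits one structured (JSON) log line per finished mission and keeps success and failure counters keyed by mission type. The field names, the logger name `colony.missions`, and the in-process `Counter` are assumptions for the example; a real deployment would presumably export these to whatever metrics backend is in use.

```python
import json
import logging
from collections import Counter
from typing import Dict

logger = logging.getLogger("colony.missions")  # hypothetical logger name

# (mission_type, outcome) -> count; an in-memory stand-in for a metrics backend.
mission_outcomes: Counter = Counter()


def record_mission_outcome(
    mission_id: str,
    mission_type: str,
    outcome: str,
    stage_latency_ms: Dict[str, float],
) -> str:
    """Count the outcome and emit a single JSON log line describing the mission."""
    mission_outcomes[(mission_type, outcome)] += 1
    line = json.dumps(
        {
            "event": "mission_finished",
            "mission_id": mission_id,
            "mission_type": mission_type,
            "outcome": outcome,
            "stage_latency_ms": stage_latency_ms,
            "total_latency_ms": sum(stage_latency_ms.values()),
        },
        sort_keys=True,
    )
    logger.info(line)
    return line
```

Keeping every field in one JSON object per line makes the logs directly usable for the dashboards listed in this section.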

Minimum dashboards:

- mission health dashboard
- policy block dashboard
- provider health dashboard
- approval backlog dashboard

## 6. Release Gates

The colony is not release-ready until all gates below pass.

### Gate 1: Technical Integrity

- schema applies cleanly
- root and colony health endpoints are green
- one Oracle mission completes in staging
- one CRM mission completes in staging

### Gate 2: Governance Integrity

- blocked tool case is denied correctly
- blocked writeback case is denied correctly
- approved writeback case requires explicit operator action

### Gate 3: Operational Integrity

- missions are replayable through artifact inspection
- failures preserve enough artifacts for debugging
- approval queue is visible to operators

### Gate 4: Sales Integrity

- output is stable enough for live demo
- reviewer packet can explain why the answer is trustworthy
- operator can inspect mission evidence quickly
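
The all-gates-must-pass rule can be expressed as a simple checklist evaluation. In this sketch the check identifiers are invented shorthand for the bullets above, and how each check result is actually produced (automated test, manual sign-off) is left open; a check with no recorded result is treated as failing.

```python
from typing import Dict, List

# Check identifiers are hypothetical shorthand for the bullets in each gate.
RELEASE_GATES: Dict[str, List[str]] = {
    "technical_integrity": [
        "schema_applies_cleanly",
        "health_endpoints_green",
        "oracle_mission_completes_in_staging",
        "crm_mission_completes_in_staging",
    ],
    "governance_integrity": [
        "blocked_tool_denied",
        "blocked_writeback_denied",
        "approved_writeback_requires_operator_action",
    ],
    "operational_integrity": [
        "missions_replayable",
        "failures_preserve_artifacts",
        "approval_queue_visible",
    ],
    "sales_integrity": [
        "output_stable_for_demo",
        "reviewer_packet_explains_trust",
        "evidence_inspectable_quickly",
    ],
}


def failed_checks(results: Dict[str, bool]) -> Dict[str, List[str]]:
    """Map each gate to its failing checks; a missing result counts as a failure."""
    return {
        gate: [check for check in checks if not results.get(check, False)]
        for gate, checks in RELEASE_GATES.items()
    }


def release_ready(results: Dict[str, bool]) -> bool:
    """True only when every check in every gate has passed."""
    return all(not failures for failures in failed_checks(results).values())
```

Treating an unrecorded check as a failure keeps the default conservative: a gate cannot pass by omission.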

## 7. Failure Runbooks

Documented responses are required for:

- colony service unavailable
- root-to-colony authentication failure
- provider outage
- malformed contract payload
- approval queue backlog
- stuck mission in non-terminal state

Required runtime behavior:

- root returns structured degraded-state response
- mission remains auditable
- operator can mark mission failed or replay it
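
A structured degraded-state response might look like the sketch below. The payload fields, the `ColonyFailure` enum values, and the audit list are assumptions made for illustration; the spec requires only that the response is structured, the mission stays auditable, and the operator can mark it failed or replay it.

```python
from enum import Enum
from typing import List


class ColonyFailure(str, Enum):
    """Failure kinds mirroring the runbook list above (identifiers are hypothetical)."""
    SERVICE_UNAVAILABLE = "colony_service_unavailable"
    AUTH_FAILURE = "root_to_colony_auth_failure"
    PROVIDER_OUTAGE = "provider_outage"
    MALFORMED_PAYLOAD = "malformed_contract_payload"
    APPROVAL_BACKLOG = "approval_queue_backlog"
    STUCK_MISSION = "stuck_mission"


def degraded_response(
    mission_id: str,
    failure: ColonyFailure,
    detail: str,
    audit_log: List[dict],
) -> dict:
    """Build the degraded-state payload and record it so the mission stays auditable."""
    response = {
        "status": "degraded",
        "mission_id": mission_id,
        "failure": failure.value,
        "detail": detail,
        # The two operator actions named in the required runtime behavior.
        "operator_actions": ["mark_failed", "replay"],
    }
    audit_log.append({"mission_id": mission_id, "event": "degraded", "failure": failure.value})
    return response
```

Using a closed set of failure kinds gives each runbook a stable identifier that logs, dashboards, and operator tooling can all key on.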

## 8. Rollout Strategy

Recommended rollout:

1. enable health and dry-run mission creation
2. enable Oracle assisted missions in staging
3. enable CRM assisted missions in staging
4. enable Oracle assisted missions in production
5. enable CRM assisted missions in production
6. enable Catalyst strategy missions in staging

Do not enable autonomous external research by default in production on day one.
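
If the rollout order is enforced in tooling rather than by convention, a minimal version could look like this. The step identifiers and the strict "all earlier steps first" rule are assumptions; the spec presents the sequence as a recommendation and does not say it must be enforced in code.

```python
from typing import Optional, Set

# Step identifiers are hypothetical shorthand for the numbered list above.
ROLLOUT_STEPS = (
    "health_and_dry_run_mission_creation",
    "oracle_assisted_staging",
    "crm_assisted_staging",
    "oracle_assisted_production",
    "crm_assisted_production",
    "catalyst_strategy_staging",
)

# Autonomous external research stays off by default in production.
AUTONOMOUS_EXTERNAL_RESEARCH_DEFAULT = False


def next_rollout_step(completed: Set[str]) -> Optional[str]:
    """Return the first step not yet completed, or None when the rollout is done."""
    for step in ROLLOUT_STEPS:
        if step not in completed:
            return step
    return None


def can_enable(step: str, completed: Set[str]) -> bool:
    """Allow a step only when every earlier step has been completed."""
    index = ROLLOUT_STEPS.index(step)  # raises ValueError for an unknown step
    return all(earlier in completed for earlier in ROLLOUT_STEPS[:index])
```

For example, `can_enable("oracle_assisted_production", ...)` stays false until both staging steps and the dry-run step are recorded as complete.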

## 9. Sales Readiness Criteria

The system is sellable for early Project Velocity demos only if:

- one Oracle mission reliably returns project-aware reviewed output
- one CRM mission reliably returns lead intelligence with evidence trail
- operator can show auditability in under two minutes
- approval workflow prevents accidental mutations
- mission failures are graceful and legible

It is not sellable if:

- output quality depends on manual developer intervention
- mission replay is impossible
- provider outages create silent failure
- writebacks can happen without approval

## 10. Ownership Model

Operational ownership should be split clearly:

- root owner: backend routes, auth, persistence, approval flows
- colony owner: runtime, workers, orchestration behavior
- policy owner: governance, model routing, tool permissions
- product owner: mission definitions, demo scenarios, release decision

## 11. Ticket Breakdown

1. define environment contracts
2. implement health checks
3. add mission trace and metrics
4. create staging rollout checklist
5. create production assisted-mode rollout checklist
6. add approval queue observability
7. document runbooks for outage and failed mission recovery

## 12. Bottom Line

The colony becomes commercially usable only when it is deployable, inspectable, and fail-safe. Release readiness is not a polish task. It is part of the core product contract.