feat: Oracle Canvas, Revision History and Canvas Sharing (#33)

Co-authored-by: Sagnik <sagnik7896@gmail.com>
Reviewed-on: #33
This commit was merged in pull request #33.
This commit is contained in:
2026-04-23 01:20:21 +05:30
parent e519339cc9
commit 6cdc366718
58 changed files with 3187 additions and 705 deletions

View File

@@ -0,0 +1,494 @@
# Desineuron AWS Coding Runtime Truth Book
Date: 2026-04-22
Scope: Coding runtime, Roo Code access, NemoClaw runtime, ingress routing, GPU recovery, model staging
## 1. Current Runtime Truth
The Desineuron shared coding runtime has been cut over from Ollama to SGLang while preserving the public contracts already used by the team.
Locked production decisions:
- Public contract remains stable.
- GPU inference remains on the AWS GPU worker, not on the Linux-origin box.
- Linux-origin remains the control plane.
- Ingress remains the stable routed entrypoint.
- `Qwen 3.6 35B A3B` remains the production target model for the current `4 x L4` rollout.
- `NemoClaw` moves onto the same shared runtime.
- There is no production fallback to Ollama after cutover.
Current live public routes:
- `https://velocity.desineuron.in/llm`
- `https://llm.desineuron.in`
Current live API shape after cutover:
- `https://velocity.desineuron.in/llm/v1/models`
- `https://velocity.desineuron.in/llm/v1/chat/completions`
- `https://llm.desineuron.in/v1/models`
- `https://llm.desineuron.in/v1/chat/completions`
- GPU SGLang bind: `172.31.46.190:30100`
- Linux-origin LLM route-sync target port: `30100`
## 2. Infra Split
### Linux-origin
Responsibilities:
- owns route-sync logic
- owns operational orchestration
- updates ingress upstream target when GPU private IP changes
- does not host the heavy model runtime
### Ingress
Responsibilities:
- terminates public hostname
- renders stable reverse-proxy contracts
- forwards `/llm/*` and `llm.desineuron.in` to the current GPU target
### GPU worker
Responsibilities:
- hosts SGLang
- hosts model payloads on NVMe only
- serves Roo Code, Oracle runtime, runtime LLM, and NemoClaw inference
Non-negotiable rules:
- do not use the GPU public IP directly
- do not keep model state on root disk
- keep all large model/runtime caches on GPU NVMe
## 3. Live Hardware Target
Current worker class:
- `g6.12xlarge`
- `4 x NVIDIA L4`
- `96 GB VRAM total`
Serving profile for this hardware:
- tensor parallel size `4`
- prompt-prefix caching enabled
- async / continuous batching enabled through SGLang
- FlashInfer preferred where supported by the live CUDA stack
Measured validation on the live GPU worker:
- host class: `g6.12xlarge`
- GPU layout: `4 x NVIDIA L4`
- model path used for the validated runtime: `/opt/dlami/nvme/models/Qwen-Qwen3.6-35B-A3B-FP8`
- SGLang served model ID used for the test: `qwen3.6-35b-a3b`
- validated SGLang launch profile:
- `--tp-size 4`
- `--attention-backend flashinfer`
- `--context-length 131072`
- `--mem-fraction-static 0.88`
- `--dist-init-addr 127.0.0.1:50000`
- `--enable-metrics`
- required bind rule on this SGLang build:
- public HTTP server must bind to the GPU private IP, not `0.0.0.0`
- internal scheduler keeps a loopback listener on the API port
- wildcard bind collides with that loopback listener on this build
- public validation after cutover:
- `https://velocity.desineuron.in/llm/v1/models` returns `200`
- `https://llm.desineuron.in/v1/models` returns `200`
- streamed chat TTFT through public ingress measured at about `2.36 s`
- one short non-stream completion measured about `33.86 completion tok/s`
## 4. Production Model Policy
### Primary production model
- user-facing family: `Qwen 3.6 35B A3B`
- exact SGLang served model ID: `qwen3.6-35b-a3b`
Why it remains live:
- fits the current `4 x L4` target
- already aligned with current team workflows
- suitable for coding/runtime use while the SGLang migration lands
- measured well enough for three concurrent coding users on the current hardware
### Staged future model on current L4 hardware
- `cyankiwi/Qwen3.5-122B-A10B-AWQ-4bit`
Status:
- acquisition/staging path is added
- not the live runtime on the current L4 cutover
- should be treated as a staged artifact for later runtime experimentation and hardware-fit validation
Why this is the right 122B staging path for the current worker:
- `4 x L4` is a better fit for an AWQ/int4 track than for an NVFP4 track
- this keeps the 122B experiment aligned with current hardware instead of assuming a Blackwell-oriented path
Why `txn545/Qwen3.5-122B-A10B-NVFP4` is not the active choice on L4:
- NVFP4 is not the safe default for the current L4 rollout
- if the team wants that track later, it should be treated as a separate hardware/runtime validation branch
Why no 122B model is the active live model in this round:
- the current migration is locked to preserving service continuity on the existing `4 x L4` worker
- the 122B track is a separate performance-fit and runtime-tuning exercise
## 5. Runtime Software Stack
Primary runtime after cutover:
- `SGLang`
Primary interface style:
- OpenAI-compatible `/v1/*`
Required runtime features:
- tensor parallel across all four GPUs
- prefix cache / prompt cache
- async scheduling
- continuous batching
- FlashInfer when supported by the live driver/runtime stack
Observed runtime note from the live bring-up:
- FlashInfer required `ninja-build` on the GPU box because it JIT-builds kernels on first run.
- The current GPU image needed:
- `ninja-build`
- `build-essential`
- After installing those packages, the FP8 runtime came up cleanly and served OpenAI-compatible traffic.
If stock SGLang underperforms:
- keep the same public routes
- tune CUDA/runtime behavior behind the same routed contract
- do not reintroduce Ollama fallback
## 6. Implemented Repo Changes
### Backend runtime service
File:
- `backend/services/runtime_llm_service.py`
Current state:
- provider catalog is standardized to `sglang`
- legacy provider names like `ollama` and `nemoclaw` are mapped into `sglang` to avoid immediate caller breakage
- model discovery uses `/v1/models`
### NemoClaw client
File:
- `backend/services/nemoclaw_client.py`
Current state:
- production path now targets the shared SGLang/OpenAI-compatible endpoint
- NVIDIA and Ollama production fallback logic is removed from the runtime path
- legacy env names still seed config where needed
### Prompt expander
File:
- `comfy_engine/scripts/prompt_expander.py`
Current state:
- now uses the shared OpenAI-compatible runtime instead of Ollama `/api/generate`
### NemoClaw deploy helper
File:
- `backend/scripts/nemoclaw_deploy.sh`
Current state:
- rewritten around SGLang-compatible inference
- no Ollama-era deployment assumptions
## 7. Route Sync And Stable Hostnames
Route-sync files:
- `infrastructure/desineuron_ingress/sync_llm_route.py`
- `infrastructure/desineuron_ingress/run_llm_route_sync.sh`
- `infrastructure/desineuron_ingress/desineuron-llm-route-sync.service`
- `infrastructure/desineuron_ingress/desineuron-llm-route-sync.timer`
- `infrastructure/desineuron_ingress/install_linux_llm_route_sync.sh`
Important behavior:
- Linux-origin discovers the current GPU private IP
- Linux-origin updates ingress-managed route state
- ingress forwards `llm.desineuron.in` and `/llm/*` to the GPU worker
Current safe default route-sync port in the repo:
- `11434`
Reason:
- the repo now contains the SGLang installer and watchdog, but the public route should not auto-cut from Ollama to SGLang until the GPU runtime is actually installed and validated on-host
- when SGLang is installed on the GPU worker, operators should flip `LLM_ROUTE_PORT` to the live SGLang port and then run route-sync
Manual operator-safe route sync entrypoint:
- `/usr/local/bin/run_llm_route_sync.sh`
This avoids the prior failure mode where operators accidentally used a system Python without `boto3`.
## 8. GPU Watchdog And Auto-Recovery
Added GPU-side scripts:
- `infrastructure/desineuron_ingress/install_gpu_sglang_runtime.sh`
- `infrastructure/desineuron_ingress/install_gpu_sglang_watchdog.sh`
Installed unit names expected on the GPU worker:
- `desineuron-sglang.service`
- `desineuron-sglang-watchdog.service`
- `desineuron-sglang-watchdog.timer`
Recovery policy:
- ensure the SGLang service is running
- verify `/v1/models` health locally
- if the configured model path is missing, rehydrate from the canonical source
- only report healthy after successful verification
Required recovery assertions for the SGLang watchdog:
- confirm the process is serving `/v1/models`
- confirm the returned model list contains `qwen3.6-35b-a3b`
- confirm all 4 GPUs are engaged during model load
- confirm FlashInfer dependencies are present before declaring runtime healthy
## 9. Model Rehydration And Staging
Added staging helper:
- `infrastructure/desineuron_ingress/acquire_qwen35_122b_nvfp4.sh`
Purpose:
- stages `cyankiwi/Qwen3.5-122B-A10B-AWQ-4bit` onto GPU NVMe by default
- does not automatically flip production traffic to that model
Expected current live model path style:
- `/opt/dlami/nvme/models/Qwen-Qwen3.6-35B-A3B-FP8`
Expected staged 122B path style:
- `/opt/dlami/nvme/models/cyankiwi-Qwen3.5-122B-A10B-AWQ-4bit`
## 10. Roo Code Team Setup
After SGLang cutover, team members should stop using the Ollama provider mode for Desineuron-hosted inference.
Canonical team profile:
- API Provider: OpenAI-compatible / custom OpenAI
- Base URL: `https://llm.desineuron.in/v1`
- Model: `qwen3.6-35b-a3b`
- Temperature: `0.1` to `0.2`
- Server context ceiling: `131072`
- Recommended Roo context: `131072`
Team decision for this wave:
- all three team members can target `128K` context through the same shared runtime
- if real concurrent repo-heavy usage causes OOM or latency regression, the first rollback knob is the client context setting, not the model family
- the current production-ready long-context path is pure VRAM on `4 x L4`, not host-RAM spill
## 11. Measured SGLang Performance
Benchmark date:
- `2026-04-22`
Benchmark topology:
- live AWS GPU worker
- `SGLang + Qwen 3.6 35B A3B FP8`
- tensor parallel `4`
- FlashInfer enabled
- async scheduler / SGLang default continuous batching path
- prompt-prefix caching available in runtime
- server context ceiling: `131072`
Measured results:
- time to first token: `0.12 s`
- streamed completion wall time for a short coding/planning answer: `1.31 s`
- test concurrency: `3`
- aggregate wall time for `3 x 256-token` responses: `3.61 s`
- aggregate completion tokens: `768`
- aggregate prompt tokens: `168`
- aggregate total tokens: `936`
- aggregate completion throughput: `212.76 tokens/s`
Per-request timing under `3` concurrent requests:
- request 1: `3.608 s` for `256` completion tokens
- request 2: `3.609 s` for `256` completion tokens
- request 3: `3.608 s` for `256` completion tokens
Long-context smoke validation:
- prompt size validated: `50010` prompt tokens
- completion size: `8` tokens
- total request size: `50018` tokens
- wall time: `8.345 s`
Operational interpretation:
- the runtime is fast enough for three simultaneous coding users
- TTFT is already in the sub-200 ms range on the warmed runtime
- aggregate decode throughput is materially better than the previous Ollama-backed path while holding a `128K` server context ceiling
- `Qwen 3.6 35B A3B` is the correct production choice for the current one-week delivery window
## 12. Cutover Guidance
Use this model ID consistently across SGLang-facing clients:
- `qwen3.6-35b-a3b`
Do not use this older Ollama-style model ID against SGLang:
- `qwen3.6:35b-a3b`
Why:
- SGLang rejects colons in `served_model_name`
- the colon is reserved internally for adapter syntax
Backend compatibility note:
- the Velocity backend can still map legacy provider naming internally
- external Roo Code and OpenAI-compatible clients should use the hyphenated SGLang model ID only
Canonical Roo configuration:
- API Provider: `OpenAI-compatible` or `Custom OpenAI`
- Base URL: `https://llm.desineuron.in/v1`
- Model: `qwen3.6-35b-a3b`
- Context window: `131072`
- Temperature: `0.1` to `0.2`
Recommended initial values:
- `Base URL`: `https://llm.desineuron.in/v1`
- `Model`: `qwen3.6-35b-a3b`
- `Context Window Size (num_ctx equivalent)`: `131072`
Do not use:
- Ollama provider mode pointing at the public Desineuron route after the cutover
Reason:
- the stable contract is moving to SGLang's OpenAI-compatible interface
## 13. Most Efficient Working Long-Context Strategy On Current Hardware
Strategies tested against the live `4 x L4` worker:
1. Pure-VRAM `131072` context on SGLang with tensor parallel `4`
Result:
- works
- preserves sub-200 ms TTFT on warm short prompts
- preserved about `212.76 tok/s` aggregate completion throughput in the 3-user benchmark
2. Hierarchical host-memory cache with `131072` context
Result:
- not production-safe on the current stack for this model
- first failed on a model-specific `page_size=1` requirement for the hybrid Mamba cache
- second attempt progressed further but one rank died with exit code `-9`
- current interpretation: this path is materially less stable than the pure-VRAM profile
Current decision:
- keep `131072` in VRAM as the production target
- do not use host-RAM hierarchical cache for this model in the current rollout
- if more headroom is needed later, tune kernels and scheduling first before re-opening host-memory spill
## 14. NemoClaw Runtime Policy
NemoClaw should use the same shared SGLang runtime as:
- Roo Code
- Oracle runtime
- backend runtime LLM jobs
This is a deliberate single-stack decision:
- one serving runtime
- one model family for the current wave
- one stable routed contract
If later profiles differ, express that with config, not with a second serving stack in this phase.
## 15. Endpoint Checklist
These should work after cutover:
- `https://velocity.desineuron.in/llm/v1/models`
- `https://velocity.desineuron.in/llm/v1/chat/completions`
- `https://llm.desineuron.in/v1/models`
- `https://llm.desineuron.in/v1/chat/completions`
Internal backend envs:
- `LLM_BASE_URL`
- `SGLANG_BASE_URL`
- `SGLANG_CHAT_URL`
- `SGLANG_MODELS_URL`
- `SGLANG_MODEL`
- `SGLANG_API_TOKEN`
## 16. What Is Left
Still required to complete the migration end to end:
1. Persist the `131072` launch profile into the GPU systemd runtime using the updated installer.
2. Reinstall or update the GPU watchdog so it validates the same `131072` service profile.
3. Repoint Linux-origin route-sync env from `11434` to the live SGLang port after GPU validation.
4. Validate both public routes against `/v1/models`.
5. Run one more public-route benchmark through ingress after cutover to capture real routed TTFT.
6. Generate tuned L4-specific runtime configs if we want to push further on throughput without lowering context.
7. Keep the 122B track separate; it is not part of the current production coding-runtime choice.
## 17. Team Hand-Off
For Roo Code today, once cutover is complete, the team only needs:
- Base URL: `https://llm.desineuron.in/v1`
- Model: `qwen3.6-35b-a3b`
- Context window: `131072`
- Provider type: OpenAI-compatible
For operators, the important truth is:
- Linux-origin controls routing
- ingress owns the stable hostname
- GPU box owns inference
- NVMe owns model state
- SGLang is the production runtime

View File

@@ -0,0 +1,10 @@
# Deprecated Title
This document has been superseded by:
- [Desineuron AWS Coding Runtime Truth Book](F:\Workin In Progress\DESINEURON\GITLAB\Project_Velocity\.Agent Context\Desineuron AWS Coding Runtime Truth Book.md)
Reason:
- the coding runtime is no longer being tracked as an Ollama-only Qwen note
- the canonical truth now covers SGLang, Roo Code access, NemoClaw runtime, route-sync, watchdog recovery, and staged support for `txn545/Qwen3.5-122B-A10B-NVFP4`

891
.Agent Context/README.md Normal file
View File

@@ -0,0 +1,891 @@
# Project Velocity — Truthbook
> **What this is:** The single source of truth for Project Velocity. If it's written down here, it's how the system works — not how someone hoped it would work.
---
## Table of Contents
1. [What Is Project Velocity](#what-is-project-velocity)
2. [Quick Start](#quick-start)
3. [Architecture Overview](#architecture-overview)
4. [Runtime Truth](#runtime-truth)
5. [Team Setup](#team-setup)
6. [GPU & Model Runtime](#gpu--model-runtime)
7. [Infrastructure](#infrastructure)
8. [Runbooks](#runbooks)
9. [API Reference](#api-reference)
10. [Contributing](#contributing)
---
## What Is Project Velocity
Project Velocity is a multi-agent AI development platform. It orchestrates intelligent agents (powered by Qwen 3.6 35B A3B and other models) to collaborate on software engineering tasks — code generation, review, testing, deployment — as a coordinated team rather than isolated tools.
**Why it exists:** Single-agent coding tools hit a ceiling. They lack context persistence, cross-task coordination, and operational reliability. Velocity solves this by:
- **Multi-agent collaboration** — Agents communicate via WebSocket channels and shared memory
- **Persistent state** — PostgreSQL backs user data, CRM records, and agent memory
- **GPU-accelerated inference** — Local Ollama runtime on NVIDIA GPU hardware
- **Role-based access control** — Admin and standard user tiers with avatar support
- **Live event broadcasting** — Real-time campaign and catalyst events via WebSocket
**Core stack:**
| Layer | Technology |
|-------|-----------|
| Backend API | Python / FastAPI |
| Database | PostgreSQL (via `databases` library with connection pooling) |
| Frontend | React 19 + TypeScript + Vite + Tailwind CSS + Framer Motion |
| Inference | Ollama (Qwen 3.6 35B A3B primary model) |
| Real-time | WebSocket (Catalyst channel, CRM channel) |
| Deployment | systemd services on Linux with NVIDIA GPU |
---
## Quick Start
### Prerequisites
- **GPU Machine:** NVIDIA GPU with sufficient VRAM (≥16GB recommended for Qwen 3.6 35B A3B)
- **NVMe Storage:** For model weights and cache
- **Linux OS:** Ubuntu 22.04+ or equivalent
- **Python 3.11+:** Backend runtime
- **Node.js 18+:** Frontend build
- **Ollama:** Latest stable with Qwen 3.6 35B A3B model pulled
- **PostgreSQL 15+:** Database backend
### One-Line Bootstrap
```bash
bash bootstrap/setup.sh
```
This script handles:
1. GPU driver verification
2. Ollama installation and model pull
3. PostgreSQL setup
4. Backend dependency installation
5. Frontend dependency installation
6. systemd service creation
### Manual Setup
#### 1. GPU & Ollama
```bash
# Verify GPU
nvidia-smi
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
# Pull the primary model
ollama pull qwen3.6:35b-a3b
# Verify model is loaded
curl http://localhost:11434/api/tags | jq '.models[] | select(.name == "qwen3.6:35b-a3b")'
```
#### 2. Database
```bash
# Start PostgreSQL
sudo systemctl start postgresql
# Create database and user
psql -U postgres -c "CREATE DATABASE velocity;"
psql -U postgres -c "CREATE USER velocity WITH PASSWORD 'secure_password';"
psql -U postgres -c "GRANT ALL PRIVILEGES ON DATABASE velocity TO velocity;"
```
#### 3. Backend
```bash
cd Project_Velocity/backend
# Install dependencies
pip install -r requirements.txt
# Configure environment
cp .env.example .env
# Edit .env with your database credentials and secrets
# Run migrations
python migrate.py
# Start server
uvicorn main:app --host 0.0.0.0 --port 8000
```
#### 4. Frontend
```bash
cd Project_Velocity/app
# Install dependencies
npm install
# Start dev server
npm run dev
```
Frontend is now available at `http://localhost:5173`.
#### 5. Verify Everything
```bash
# Backend health
curl http://localhost:8000/health
# Model availability
curl http://localhost:11434/api/tags
# Frontend
open http://localhost:5173
```
---
## Architecture Overview
### System Diagram
```
┌─────────────┐ ┌──────────────┐ ┌─────────────┐
│ React UI │────▶│ FastAPI │────▶│ PostgreSQL │
│ (Port 5173)│◀────│ (Port 8000) │◀────│ (Port 5432)│
└─────────────┘ └──────┬───────┘ └─────────────┘
┌──────────────┐
│ Ollama │
│ (Port 11434) │
│ Qwen 3.6 35B │
└──────────────┘
┌──────────────┐
│ NVIDIA GPU │
└──────────────┘
```
### Component Breakdown
#### Backend (`backend/`)
[`main.py`](Project_Velocity/backend/main.py) — FastAPI application with:
- **Auth system** — Login, profile lookup, user listing, avatar upload
- **WebSocket managers** — [`_CatalystManager()`](Project_Velocity/backend/main.py:296) and [`_CRMManager()`](Project_Velocity/backend/main.py:320) for real-time event broadcasting
- **Connection pooling** — PostgreSQL via `databases` library with async context management
- **Lifespan hooks** — [`lifespan()`](Project_Velocity/backend/main.py:83) initializes and cleans up resources
Key endpoints:
| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/api/auth/login` | POST | Authenticate user |
| `/api/auth/me` | GET | Get current user profile |
| `/api/auth/users` | GET | List all users (admin) |
| `/api/auth/profile/avatar` | POST | Upload profile avatar |
| `/ws/catalyst` | WS | Catalyst event channel |
| `/ws/crm` | WS | CRM event channel |
| `/health` | GET | Health check |
#### Frontend (`app/`)
[`App.tsx`](Project_Velocity/app/src/App.tsx) — React application with:
- **Protected routes** — [`ProtectedRoute()`](Project_Velocity/app/src/App.tsx:66) wraps authenticated paths
- **Route module sync** — [`RouteModuleSync()`](Project_Velocity/app/src/App.tsx:90) handles dynamic route loading
- **Main layout** — [`MainLayout()`](Project_Velocity/app/src/App.tsx:90) provides chrome (header, sidebar, content area)
- **Role rendering** — [`formatRoleLabel()`](Project_Velocity/app/src/App.tsx:379) converts role codes to display labels
- **Auth state management** — Dual `useEffect` hooks handle token persistence and user fetch
#### Agent Context (`.Agent Context/`)
Documents that define how agents operate within Velocity:
- [`Qwen 3.6 35B A3B Ollama Access, Recovery, and Team Setup.md`](Project_Velocity/.Agent%20Context/Qwen%203.6%2035B%20A3B%20Ollama%20Access,%20Recovery,%20and%20Team%20Setup.md) — Model runtime, recovery policies, team onboarding
- `README.md` — This file
#### Infrastructure (`.Infrastructure/`)
Deployment and operational documentation:
- systemd unit files for backend, frontend, Ollama services
- Network configuration and ingress rules
- Monitoring and alerting setup
---
## Runtime Truth
### What "Works" Means in Velocity
Velocity has three runtime layers, each with different failure modes:
#### Layer A: Fast Runtime Recovery
If the API crashes or restarts:
- PostgreSQL connection pool rebuilds automatically via [`lifespan()`](Project_Velocity/backend/main.py:83)
- WebSocket managers reinitialize and accept new connections
- No data loss — all state is in PostgreSQL
#### Layer B: Model Rehydration Recovery
If Ollama loses the Qwen model:
- Watchdog systemd unit detects absence via `/api/tags`
- Auto-registers model from NVMe cache or S3 artifact storage
- **Production requirement:** Same-run auto-hydration logic must complete before any agent request
#### Layer C: Full System Recovery
If everything goes down:
1. PostgreSQL recovers WAL logs
2. Ollama watchdog restores model
3. Backend systemd unit restarts API
4. Frontend rebuilds if artifacts are corrupted
### Critical Contracts
**Auth contract:**
```
Client → POST /api/auth/login {email, password}
→ 200 OK {token, user}
Client → GET /api/auth/me (Authorization: Bearer <token>)
→ 200 OK {id, email, role, avatar_url}
→ 401 Unauthorized
```
**WebSocket contract:**
```
Client → WS /ws/catalyst
→ Accepts live events: {event_type, campaign_name, value, timestamp}
Client → WS /ws/crm
→ Accepts CRM events: {type, payload, timestamp}
```
**Model contract:**
```
Ollama → GET /api/tags returns qwen3.6:35b-a3b
→ Context window: 131072 tokens
→ Provider: OpenAI-compatible interface at http://localhost:11434/v1
```
---
## Team Setup
### Developer Onboarding
#### 1. Clone & Bootstrap
```bash
git clone <repo-url>
cd Project_Velocity
bash bootstrap/setup.sh
```
#### 2. VS Code / Roo Code Configuration
Edit `.vscode/settings.json`:
```json
{
"roo-cline.provider": "openai-compatible",
"roo-cline.baseUrl": "http://localhost:11434/v1",
"roo-cline.modelId": "qwen3.6:35b-a3b",
"roo-cline.contextWindow": 131072,
"roo-cline.temperature": 0.7
}
```
#### 3. Verify Team Access
```bash
# Backend health
curl http://localhost:8000/health
# Expected: {"status": "ok"}
# Model loaded
curl http://localhost:11434/api/tags | jq -r '.models[].name'
# Expected: qwen3.6:35b-a3b
# Frontend
open http://localhost:5173
# Expected: Login screen
```
### Role Definitions
| Role | Access Level | Can Do |
|------|-------------|--------|
| `admin` | Full | User management, system config, agent orchestration |
| `developer` | Standard | Code generation, review, testing |
| `viewer` | Read-only | Dashboard, campaign monitoring |
### Performance Expectations
| Scenario | Tokens/sec | Latency |
|----------|-----------|---------|
| Single-stream (local GPU) | ~80-120 tok/s | ~200ms first token |
| Two concurrent requests | ~60-90 tok/s each | ~300ms first token |
| Four-way batch | ~40-60 tok/s each | ~500ms first token |
*Numbers vary by GPU hardware. Measure your setup.*
---
## GPU & Model Runtime
### Hardware Requirements
| Component | Minimum | Recommended |
|-----------|---------|-------------|
| GPU VRAM | 16GB | 24GB+ |
| GPU Compute | Turing architecture | Ada Lovelace / Hopper |
| NVMe Storage | 50GB free | 100GB+ NVMe Gen4 |
| RAM | 32GB | 64GB+ |
### Ollama Watchdog
The watchdog is a systemd-managed service that ensures the Qwen model stays loaded:
**Location:** `.Infrastructure/systemd/ollama-watchdog.service`
**Behavior:**
1. Every 60 seconds, queries `http://localhost:11434/api/tags`
2. If `qwen3.6:35b-a3b` is absent, triggers rehydration
3. Rehydration priority: NVMe cache → S3 artifact → remote pull
4. Logs all actions to journalctl
**Manual watchdog check:**
```bash
sudo systemctl status ollama-watchdog
journalctl -u ollama-watchdog --since "1 hour ago"
```
### Model Hydration Strategies
| Strategy | Speed | Use Case |
|----------|-------|----------|
| NVMe local registration | ~2 seconds | Primary recovery path |
| Local manifest `ollama create` | ~5 seconds | Fresh hydration from extracted weights |
| S3 cold hydrate | ~60-300 seconds | No local cache available |
### Critical: What Watchdog Must NOT Do
- ❌ Delete model layers during recovery
- ❌ Modify GPU memory directly
- ❌ Block agent requests during hydration (graceful degradation only)
- ❌ Restart Ollama process unless absolutely necessary
---
## Infrastructure
### Deployment Topology
```
┌─────────────────────────────────────────────────┐
│ Production Host │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ Backend │ │ Frontend │ │ Ollama │ │
│ │ :8000 │ │ :5173 │ │ :11434 │ │
│ │ systemd │ │ nginx │ │ systemd │ │
│ └────┬─────┘ └────┬─────┘ └──────┬───────┘ │
│ │ │ │ │
│ └─────────────┴───────────────┘ │
│ │ │
│ ┌──────▼───────┐ │
│ │ PostgreSQL │ │
│ │ :5432 │ │
│ │ systemd │ │
│ └──────────────┘ │
│ │
│ ┌──────────────────────────────────────────┐ │
│ │ NVIDIA GPU (CUDA + TensorRT) │ │
│ └──────────────────────────────────────────┘ │
└─────────────────────────────────────────────────┘
```
### systemd Services
| Service | File | Restart Policy |
|---------|------|---------------|
| Backend API | `velocity-backend.service` | always |
| Frontend (nginx) | `velocity-frontend.service` | always |
| Ollama | `ollama.service` | on-failure |
| Watchdog | `ollama-watchdog.service` | always |
| PostgreSQL | `postgresql.service` | on-failure |
### Network Rules
| Port | Protocol | Service | External Access |
|------|----------|---------|-----------------|
| 80 | HTTP | nginx → frontend | Yes (public) |
| 443 | HTTPS | nginx → frontend | Yes (public) |
| 8000 | TCP | FastAPI backend | No (internal only) |
| 5173 | TCP | Vite dev server | No (dev only) |
| 5432 | TCP | PostgreSQL | No (internal only) |
| 11434 | TCP | Ollama API | No (internal only) |
### Monitoring
```bash
# All service health
systemctl status velocity-backend ollama postgresql
# GPU utilization
nvidia-smi -l 1
# Model inference logs
journalctl -u ollama -f
# API error rate
curl -s http://localhost:8000/health | jq .
```
---
## Runbooks
### Runbook: Backend Crashes at 2 AM
**Symptom:** Frontend shows 500 errors on API calls.
**Steps:**
```bash
# 1. Check backend status
sudo systemctl status velocity-backend
# Expected: active (running)
# 2. If stopped, restart
sudo systemctl restart velocity-backend
# 3. Check logs for root cause
sudo journalctl -u velocity-backend --since "30 minutes ago" --no-pager
# 4. Verify recovery
curl http://localhost:8000/health
# Expected: {"status": "ok"}
# 5. If crash repeats, check database connectivity
psql -U velocity -d velocity -c "SELECT 1;"
# Expected: 1
```
**If still broken:**
1. Check disk space: `df -h /`
2. Check memory: `free -h`
3. Check PostgreSQL: `sudo systemctl status postgresql`
4. Escalate with logs from step 3
---
### Runbook: Ollama Model Disappeared
**Symptom:** Agents return empty responses or errors.
**Steps:**
```bash
# 1. Check if Ollama is running
sudo systemctl status ollama
# Expected: active (running)
# 2. Check loaded models
curl http://localhost:11434/api/tags | jq '.models[].name'
# Expected: qwen3.6:35b-a3b
# 3. If model is missing, check watchdog
sudo systemctl status ollama-watchdog
journalctl -u ollama-watchdog --since "1 hour ago" --no-pager
# 4. Manual recovery if watchdog failed
ollama pull qwen3.6:35b-a3b
# 5. Verify model is usable
curl http://localhost:11434/api/generate -d '{
"model": "qwen3.6:35b-a3b",
"prompt": "Hello",
"stream": false
}' | jq .done
# Expected: true
```
---
### Runbook: Database Connection Failures
**Symptom:** Backend logs show `connection refused` or `pool exhausted`.
**Steps:**
```bash
# 1. Check PostgreSQL status
sudo systemctl status postgresql
# Expected: active (running)
# 2. Check connection count
psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"
# Should be < max_connections (default 100)
# 3. Check disk space for WAL files
df -h /var/lib/postgresql
# 4. Restart if hung
sudo systemctl restart postgresql
# 5. Verify backend reconnects
sudo journalctl -u velocity-backend --since "1 minute ago" | grep -i "connected\|error"
```
---
### Runbook: GPU Memory Exhaustion
**Symptom:** Ollama returns `out of memory` errors.
**Steps:**
```bash
# 1. Check current GPU usage
nvidia-smi
# Note: PID, memory usage, temperature
# 2. Kill non-essential GPU processes if needed
nvidia-smi --id=0 --query-compute-apps=pid,name,used_memory --format=csv
kill <PID>
# 3. Check Ollama memory allocation
ollama show qwen3.6:35b-a3b | grep -i "layer\|memory"
# 4. If still exhausted, reduce model quantization
ollama pull qwen3.6:35b-a3b-q4_0
# 5. Monitor recovery
watch -n 1 nvidia-smi
```
---
## API Reference
### Auth Endpoints
#### `POST /api/auth/login`
Authenticate a user and receive a JWT token.
**Request:**
```json
{
"email": "user@example.com",
"password": "secure_password"
}
```
**Response (200 OK):**
```json
{
"token": "eyJhbGciOiJIUzI1NiIs...",
"user": {
"id": "uuid-here",
"email": "user@example.com",
"role": "developer",
"avatar_url": null
}
}
```
**Errors:**
| Status | Meaning |
|--------|---------|
| 401 | Invalid credentials |
| 422 | Malformed request body |
---
#### `GET /api/auth/me`
Get the current authenticated user's profile.
**Headers:**
```
Authorization: Bearer <token>
```
**Response (200 OK):**
```json
{
"id": "uuid-here",
"email": "user@example.com",
"role": "developer",
"avatar_url": "https://cdn.example.com/avatars/user.png"
}
```
**Errors:**
| Status | Meaning |
|--------|---------|
| 401 | Token missing or invalid |
| 403 | Token expired |
---
#### `GET /api/auth/users`
List all users in the system. Admin only.
**Headers:**
```
Authorization: Bearer <admin_token>
```
**Response (200 OK):**
```json
[
{
"id": "uuid-1",
"email": "admin@example.com",
"role": "admin",
"avatar_url": null
},
{
"id": "uuid-2",
"email": "dev@example.com",
"role": "developer",
"avatar_url": "https://cdn.example.com/avatars/dev.png"
}
]
```
**Errors:**
| Status | Meaning |
|--------|---------|
| 403 | User is not admin |
---
#### `POST /api/auth/profile/avatar`
Upload a profile avatar image.
**Headers:**
```
Authorization: Bearer <token>
Content-Type: multipart/form-data
```
**Form Data:**
| Field | Type | Required |
|-------|------|----------|
| avatar | file (image/jpeg, image/png) | Yes |
**Response (200 OK):**
```json
{
"avatar_url": "https://cdn.example.com/avatars/new-avatar.png"
}
```
**Errors:**
| Status | Meaning |
|--------|---------|
| 401 | Not authenticated |
| 422 | Invalid file type or size > 5MB |
---
### WebSocket Endpoints
#### `WS /ws/catalyst`
Real-time channel for Catalyst events (agent coordination, task updates).
**Connection:**
```javascript
const ws = new WebSocket('ws://localhost:8000/ws/catalyst');
ws.onmessage = (event) => {
const data = JSON.parse(event.data);
console.log(data.event_type, data.campaign_name, data.value);
};
```
**Event Format:**
```json
{
"event_type": "task_complete",
"campaign_name": "codegen-sprint-42",
"value": 0.97,
"timestamp": "2026-04-21T16:00:00Z"
}
```
---
#### `WS /ws/crm`
Real-time channel for CRM events (customer interactions, lead updates).
**Connection:**
```javascript
const ws = new WebSocket('ws://localhost:8000/ws/crm');
ws.onmessage = (event) => {
const data = JSON.parse(event.data);
console.log(data.type, data.payload);
};
```
**Event Format:**
```json
{
"type": "lead_created",
"payload": {
"id": "crm-uuid",
"name": "Acme Corp",
"status": "new"
},
"timestamp": "2026-04-21T16:00:00Z"
}
```
---
### Health Check
#### `GET /health`
Verify system health.
**Response (200 OK):**
```json
{
"status": "ok",
"database": "connected",
"ollama": "available",
"gpu": "present"
}
```
---
## Contributing
### Code Structure
```
Project_Velocity/
├── .Agent Context/ # Agent documentation, model specs
├── .Infrastructure/ # Deployment configs, systemd units
├── backend/ # FastAPI backend
│ ├── main.py # Application entry point
│ ├── requirements.txt # Python dependencies
│ └── migrate.py # Database migrations
├── app/ # React frontend
│ ├── src/
│ │ ├── App.tsx # Root component
│ │ └── ... # Components, routes, utils
│ ├── package.json # Node dependencies
│ └── vite.config.ts # Build config
├── bootstrap/ # Setup scripts
│ └── setup.sh # One-line bootstrap
└── README.md # This file
```
### Making a Contribution
1. **Fork and branch**
```bash
git checkout -b feature/your-feature-name
```
2. **Make changes**
- Backend: Follow FastAPI conventions, add type hints
- Frontend: Follow React + TypeScript patterns, use existing components
- Docs: Update this README if behavior changes
3. **Test locally**
```bash
# Backend tests
cd backend && pytest
# Frontend checks
cd app && npm run build
```
4. **Submit PR**
- Title: Clear, action-oriented
- Description: What + Why + How to test
- Link any related issues
### Documentation Standards
- **Every endpoint:** Document inputs, outputs, errors
- **Every component:** JSDoc for public APIs
- **Every runbook:** Write as if for on-call at 2am
- **Every decision:** Record in `DECISIONS.md` with rationale
---
## Appendix
### A. Environment Variables
| Variable | Required | Description |
|----------|----------|-------------|
| `DATABASE_URL` | Yes | PostgreSQL connection string |
| `SECRET_KEY` | Yes | JWT signing key |
| `OLLAMA_BASE_URL` | No | Ollama API URL (default: `http://localhost:11434`) |
| `GPU_ENABLED` | No | Enable GPU path (default: `true`) |
| `LOG_LEVEL` | No | Logging level (default: `INFO`) |
### B. Troubleshooting Matrix
| Symptom | Likely Cause | Fix |
|---------|-------------|-----|
| Frontend blank screen | Backend down | `curl http://localhost:8000/health` |
| 401 on all calls | Token expired | Re-login |
| Agent returns empty | Model unloaded | `ollama pull qwen3.6:35b-a3b` |
| Slow responses | GPU not used | Check `nvidia-smi`, verify CUDA |
| Database errors | Pool exhausted | Check `max_connections`, restart backend |
| WebSocket disconnects | Network issue | Check firewall, reverse proxy config |
### C. Useful Commands Cheat Sheet
```bash
# Full system status
systemctl status velocity-backend ollama postgresql ollama-watchdog
# GPU实时监控
watch -n 1 nvidia-smi
# Model check
curl http://localhost:11434/api/tags | jq '.models[].name'
# API health
curl -s http://localhost:8000/health | jq .
# Database connection test
psql -U velocity -d velocity -c "SELECT version();"
# Frontend rebuild
cd app && npm run build && cp -r dist/* ../nginx/html/
# Restart everything (nuclear option)
sudo systemctl restart velocity-backend ollama postgresql
```
---
> **Last verified:** 2026-04-21
> **Maintained by:** Velocity Team
> **If this doc is wrong, the system is broken. Fix the doc first.**