feat: Oracle Canvas, Revision History and Canvas Sharing (#33)

Co-authored-by: Sagnik <sagnik7896@gmail.com> Reviewed-on: #33
2026-04-23 01:20:21 +05:30
parent e519339cc9
commit 6cdc366718
58 changed files with 3187 additions and 705 deletions
--- a/Context/Desineuron
+++ b/Context/Desineuron
@@ -0,0 +1,494 @@
+# Desineuron AWS Coding Runtime Truth Book
+
+Date: 2026-04-22  
+Scope: Coding runtime, Roo Code access, NemoClaw runtime, ingress routing, GPU recovery, model staging
+
+## 1. Current Runtime Truth
+
+The Desineuron shared coding runtime has been cut over from Ollama to SGLang while preserving the public contracts already used by the team.
+
+Locked production decisions:
+
+- Public contract remains stable.
+- GPU inference remains on the AWS GPU worker, not on the Linux-origin box.
+- Linux-origin remains the control plane.
+- Ingress remains the stable routed entrypoint.
+- `Qwen 3.6 35B A3B` remains the production target model for the current `4 x L4` rollout.
+- `NemoClaw` moves onto the same shared runtime.
+- There is no production fallback to Ollama after cutover.
+
+Current live public routes:
+
+- `https://velocity.desineuron.in/llm`
+- `https://llm.desineuron.in`
+
+Current live API shape after cutover:
+
+- `https://velocity.desineuron.in/llm/v1/models`
+- `https://velocity.desineuron.in/llm/v1/chat/completions`
+- `https://llm.desineuron.in/v1/models`
+- `https://llm.desineuron.in/v1/chat/completions`
+- GPU SGLang bind: `172.31.46.190:30100`
+- Linux-origin LLM route-sync target port: `30100`
+
+## 2. Infra Split
+
+### Linux-origin
+
+Responsibilities:
+
+- owns route-sync logic
+- owns operational orchestration
+- updates ingress upstream target when GPU private IP changes
+- does not host the heavy model runtime
+
+### Ingress
+
+Responsibilities:
+
+- terminates public hostname
+- renders stable reverse-proxy contracts
+- forwards `/llm/*` and `llm.desineuron.in` to the current GPU target
+
+### GPU worker
+
+Responsibilities:
+
+- hosts SGLang
+- hosts model payloads on NVMe only
+- serves Roo Code, Oracle runtime, runtime LLM, and NemoClaw inference
+
+Non-negotiable rules:
+
+- do not use the GPU public IP directly
+- do not keep model state on root disk
+- keep all large model/runtime caches on GPU NVMe
+
+## 3. Live Hardware Target
+
+Current worker class:
+
+- `g6.12xlarge`
+- `4 x NVIDIA L4`
+- `96 GB VRAM total`
+
+Serving profile for this hardware:
+
+- tensor parallel size `4`
+- prompt-prefix caching enabled
+- async / continuous batching enabled through SGLang
+- FlashInfer preferred where supported by the live CUDA stack
+
+Measured validation on the live GPU worker:
+
+- host class: `g6.12xlarge`
+- GPU layout: `4 x NVIDIA L4`
+- model path used for the validated runtime: `/opt/dlami/nvme/models/Qwen-Qwen3.6-35B-A3B-FP8`
+- SGLang served model ID used for the test: `qwen3.6-35b-a3b`
+- validated SGLang launch profile:
+  - `--tp-size 4`
+  - `--attention-backend flashinfer`
+  - `--context-length 131072`
+  - `--mem-fraction-static 0.88`
+  - `--dist-init-addr 127.0.0.1:50000`
+  - `--enable-metrics`
+- required bind rule on this SGLang build:
+  - public HTTP server must bind to the GPU private IP, not `0.0.0.0`
+  - internal scheduler keeps a loopback listener on the API port
+  - wildcard bind collides with that loopback listener on this build
+- public validation after cutover:
+  - `https://velocity.desineuron.in/llm/v1/models` returns `200`
+  - `https://llm.desineuron.in/v1/models` returns `200`
+  - streamed chat TTFT through public ingress measured at about `2.36 s`
+  - one short non-stream completion measured about `33.86 completion tok/s`
+
+## 4. Production Model Policy
+
+### Primary production model
+
+- user-facing family: `Qwen 3.6 35B A3B`
+- exact SGLang served model ID: `qwen3.6-35b-a3b`
+
+Why it remains live:
+
+- fits the current `4 x L4` target
+- already aligned with current team workflows
+- suitable for coding/runtime use while the SGLang migration lands
+- measured well enough for three concurrent coding users on the current hardware
+
+### Staged future model on current L4 hardware
+
+- `cyankiwi/Qwen3.5-122B-A10B-AWQ-4bit`
+
+Status:
+
+- acquisition/staging path is added
+- not the live runtime on the current L4 cutover
+- should be treated as a staged artifact for later runtime experimentation and hardware-fit validation
+
+Why this is the right 122B staging path for the current worker:
+
+- `4 x L4` is a better fit for an AWQ/int4 track than for an NVFP4 track
+- this keeps the 122B experiment aligned with current hardware instead of assuming a Blackwell-oriented path
+
+Why `txn545/Qwen3.5-122B-A10B-NVFP4` is not the active choice on L4:
+
+- NVFP4 is not the safe default for the current L4 rollout
+- if the team wants that track later, it should be treated as a separate hardware/runtime validation branch
+
+Why no 122B model is the active live model in this round:
+
+- the current migration is locked to preserving service continuity on the existing `4 x L4` worker
+- the 122B track is a separate performance-fit and runtime-tuning exercise
+
+## 5. Runtime Software Stack
+
+Primary runtime after cutover:
+
+- `SGLang`
+
+Primary interface style:
+
+- OpenAI-compatible `/v1/*`
+
+Required runtime features:
+
+- tensor parallel across all four GPUs
+- prefix cache / prompt cache
+- async scheduling
+- continuous batching
+- FlashInfer when supported by the live driver/runtime stack
+
+Observed runtime note from the live bring-up:
+
+- FlashInfer required `ninja-build` on the GPU box because it JIT-builds kernels on first run.
+- The current GPU image needed:
+  - `ninja-build`
+  - `build-essential`
+- After installing those packages, the FP8 runtime came up cleanly and served OpenAI-compatible traffic.
+
+If stock SGLang underperforms:
+
+- keep the same public routes
+- tune CUDA/runtime behavior behind the same routed contract
+- do not reintroduce Ollama fallback
+
+## 6. Implemented Repo Changes
+
+### Backend runtime service
+
+File:
+
+- `backend/services/runtime_llm_service.py`
+
+Current state:
+
+- provider catalog is standardized to `sglang`
+- legacy provider names like `ollama` and `nemoclaw` are mapped into `sglang` to avoid immediate caller breakage
+- model discovery uses `/v1/models`
+
+### NemoClaw client
+
+File:
+
+- `backend/services/nemoclaw_client.py`
+
+Current state:
+
+- production path now targets the shared SGLang/OpenAI-compatible endpoint
+- NVIDIA and Ollama production fallback logic is removed from the runtime path
+- legacy env names still seed config where needed
+
+### Prompt expander
+
+File:
+
+- `comfy_engine/scripts/prompt_expander.py`
+
+Current state:
+
+- now uses the shared OpenAI-compatible runtime instead of Ollama `/api/generate`
+
+### NemoClaw deploy helper
+
+File:
+
+- `backend/scripts/nemoclaw_deploy.sh`
+
+Current state:
+
+- rewritten around SGLang-compatible inference
+- no Ollama-era deployment assumptions
+
+## 7. Route Sync And Stable Hostnames
+
+Route-sync files:
+
+- `infrastructure/desineuron_ingress/sync_llm_route.py`
+- `infrastructure/desineuron_ingress/run_llm_route_sync.sh`
+- `infrastructure/desineuron_ingress/desineuron-llm-route-sync.service`
+- `infrastructure/desineuron_ingress/desineuron-llm-route-sync.timer`
+- `infrastructure/desineuron_ingress/install_linux_llm_route_sync.sh`
+
+Important behavior:
+
+- Linux-origin discovers the current GPU private IP
+- Linux-origin updates ingress-managed route state
+- ingress forwards `llm.desineuron.in` and `/llm/*` to the GPU worker
+
+Current safe default route-sync port in the repo:
+
+- `11434`
+
+Reason:
+
+- the repo now contains the SGLang installer and watchdog, but the public route should not auto-cut from Ollama to SGLang until the GPU runtime is actually installed and validated on-host
+- when SGLang is installed on the GPU worker, operators should flip `LLM_ROUTE_PORT` to the live SGLang port and then run route-sync
+
+Manual operator-safe route sync entrypoint:
+
+- `/usr/local/bin/run_llm_route_sync.sh`
+
+This avoids the prior failure mode where operators accidentally used a system Python without `boto3`.
+
+## 8. GPU Watchdog And Auto-Recovery
+
+Added GPU-side scripts:
+
+- `infrastructure/desineuron_ingress/install_gpu_sglang_runtime.sh`
+- `infrastructure/desineuron_ingress/install_gpu_sglang_watchdog.sh`
+
+Installed unit names expected on the GPU worker:
+
+- `desineuron-sglang.service`
+- `desineuron-sglang-watchdog.service`
+- `desineuron-sglang-watchdog.timer`
+
+Recovery policy:
+
+- ensure the SGLang service is running
+- verify `/v1/models` health locally
+- if the configured model path is missing, rehydrate from the canonical source
+- only report healthy after successful verification
+
+Required recovery assertions for the SGLang watchdog:
+
+- confirm the process is serving `/v1/models`
+- confirm the returned model list contains `qwen3.6-35b-a3b`
+- confirm all 4 GPUs are engaged during model load
+- confirm FlashInfer dependencies are present before declaring runtime healthy
+
+## 9. Model Rehydration And Staging
+
+Added staging helper:
+
+- `infrastructure/desineuron_ingress/acquire_qwen35_122b_nvfp4.sh`
+
+Purpose:
+
+- stages `cyankiwi/Qwen3.5-122B-A10B-AWQ-4bit` onto GPU NVMe by default
+- does not automatically flip production traffic to that model
+
+Expected current live model path style:
+
+- `/opt/dlami/nvme/models/Qwen-Qwen3.6-35B-A3B-FP8`
+
+Expected staged 122B path style:
+
+- `/opt/dlami/nvme/models/cyankiwi-Qwen3.5-122B-A10B-AWQ-4bit`
+
+## 10. Roo Code Team Setup
+
+After SGLang cutover, team members should stop using the Ollama provider mode for Desineuron-hosted inference.
+
+Canonical team profile:
+
+- API Provider: OpenAI-compatible / custom OpenAI
+- Base URL: `https://llm.desineuron.in/v1`
+- Model: `qwen3.6-35b-a3b`
+- Temperature: `0.1` to `0.2`
+- Server context ceiling: `131072`
+- Recommended Roo context: `131072`
+
+Team decision for this wave:
+
+- all three team members can target `128K` context through the same shared runtime
+- if real concurrent repo-heavy usage causes OOM or latency regression, the first rollback knob is the client context setting, not the model family
+- the current production-ready long-context path is pure VRAM on `4 x L4`, not host-RAM spill
+
+## 11. Measured SGLang Performance
+
+Benchmark date:
+
+- `2026-04-22`
+
+Benchmark topology:
+
+- live AWS GPU worker
+- `SGLang + Qwen 3.6 35B A3B FP8`
+- tensor parallel `4`
+- FlashInfer enabled
+- async scheduler / SGLang default continuous batching path
+- prompt-prefix caching available in runtime
+- server context ceiling: `131072`
+
+Measured results:
+
+- time to first token: `0.12 s`
+- streamed completion wall time for a short coding/planning answer: `1.31 s`
+- test concurrency: `3`
+- aggregate wall time for `3 x 256-token` responses: `3.61 s`
+- aggregate completion tokens: `768`
+- aggregate prompt tokens: `168`
+- aggregate total tokens: `936`
+- aggregate completion throughput: `212.76 tokens/s`
+
+Per-request timing under `3` concurrent requests:
+
+- request 1: `3.608 s` for `256` completion tokens
+- request 2: `3.609 s` for `256` completion tokens
+- request 3: `3.608 s` for `256` completion tokens
+
+Long-context smoke validation:
+
+- prompt size validated: `50010` prompt tokens
+- completion size: `8` tokens
+- total request size: `50018` tokens
+- wall time: `8.345 s`
+
+Operational interpretation:
+
+- the runtime is fast enough for three simultaneous coding users
+- TTFT is already in the sub-200 ms range on the warmed runtime
+- aggregate decode throughput is materially better than the previous Ollama-backed path while holding a `128K` server context ceiling
+- `Qwen 3.6 35B A3B` is the correct production choice for the current one-week delivery window
+
+## 12. Cutover Guidance
+
+Use this model ID consistently across SGLang-facing clients:
+
+- `qwen3.6-35b-a3b`
+
+Do not use this older Ollama-style model ID against SGLang:
+
+- `qwen3.6:35b-a3b`
+
+Why:
+
+- SGLang rejects colons in `served_model_name`
+- the colon is reserved internally for adapter syntax
+
+Backend compatibility note:
+
+- the Velocity backend can still map legacy provider naming internally
+- external Roo Code and OpenAI-compatible clients should use the hyphenated SGLang model ID only
+
+Canonical Roo configuration:
+
+- API Provider: `OpenAI-compatible` or `Custom OpenAI`
+- Base URL: `https://llm.desineuron.in/v1`
+- Model: `qwen3.6-35b-a3b`
+- Context window: `131072`
+- Temperature: `0.1` to `0.2`
+
+Recommended initial values:
+
+- `Base URL`: `https://llm.desineuron.in/v1`
+- `Model`: `qwen3.6-35b-a3b`
+- `Context Window Size (num_ctx equivalent)`: `131072`
+
+Do not use:
+
+- Ollama provider mode pointing at the public Desineuron route after the cutover
+
+Reason:
+
+- the stable contract is moving to SGLang's OpenAI-compatible interface
+
+## 13. Most Efficient Working Long-Context Strategy On Current Hardware
+
+Strategies tested against the live `4 x L4` worker:
+
+1. Pure-VRAM `131072` context on SGLang with tensor parallel `4`
+Result:
+
+- works
+- preserves sub-200 ms TTFT on warm short prompts
+- preserved about `212.76 tok/s` aggregate completion throughput in the 3-user benchmark
+
+2. Hierarchical host-memory cache with `131072` context
+Result:
+
+- not production-safe on the current stack for this model
+- first failed on a model-specific `page_size=1` requirement for the hybrid Mamba cache
+- second attempt progressed further but one rank died with exit code `-9`
+- current interpretation: this path is materially less stable than the pure-VRAM profile
+
+Current decision:
+
+- keep `131072` in VRAM as the production target
+- do not use host-RAM hierarchical cache for this model in the current rollout
+- if more headroom is needed later, tune kernels and scheduling first before re-opening host-memory spill
+
+## 14. NemoClaw Runtime Policy
+
+NemoClaw should use the same shared SGLang runtime as:
+
+- Roo Code
+- Oracle runtime
+- backend runtime LLM jobs
+
+This is a deliberate single-stack decision:
+
+- one serving runtime
+- one model family for the current wave
+- one stable routed contract
+
+If later profiles differ, express that with config, not with a second serving stack in this phase.
+
+## 15. Endpoint Checklist
+
+These should work after cutover:
+
+- `https://velocity.desineuron.in/llm/v1/models`
+- `https://velocity.desineuron.in/llm/v1/chat/completions`
+- `https://llm.desineuron.in/v1/models`
+- `https://llm.desineuron.in/v1/chat/completions`
+
+Internal backend envs:
+
+- `LLM_BASE_URL`
+- `SGLANG_BASE_URL`
+- `SGLANG_CHAT_URL`
+- `SGLANG_MODELS_URL`
+- `SGLANG_MODEL`
+- `SGLANG_API_TOKEN`
+
+## 16. What Is Left
+
+Still required to complete the migration end to end:
+
+1. Persist the `131072` launch profile into the GPU systemd runtime using the updated installer.
+2. Reinstall or update the GPU watchdog so it validates the same `131072` service profile.
+3. Repoint Linux-origin route-sync env from `11434` to the live SGLang port after GPU validation.
+4. Validate both public routes against `/v1/models`.
+5. Run one more public-route benchmark through ingress after cutover to capture real routed TTFT.
+6. Generate tuned L4-specific runtime configs if we want to push further on throughput without lowering context.
+7. Keep the 122B track separate; it is not part of the current production coding-runtime choice.
+
+## 17. Team Hand-Off
+
+For Roo Code today, once cutover is complete, the team only needs:
+
+- Base URL: `https://llm.desineuron.in/v1`
+- Model: `qwen3.6-35b-a3b`
+- Context window: `131072`
+- Provider type: OpenAI-compatible
+
+For operators, the important truth is:
+
+- Linux-origin controls routing
+- ingress owns the stable hostname
+- GPU box owns inference
+- NVMe owns model state
+- SGLang is the production runtime
--- a/Context/Qwen
+++ b/Context/Qwen
@@ -0,0 +1,10 @@
+# Deprecated Title
+
+This document has been superseded by:
+
+- [Desineuron AWS Coding Runtime Truth Book](F:\Workin In Progress\DESINEURON\GITLAB\Project_Velocity\.Agent Context\Desineuron AWS Coding Runtime Truth Book.md)
+
+Reason:
+
+- the coding runtime is no longer being tracked as an Ollama-only Qwen note
+- the canonical truth now covers SGLang, Roo Code access, NemoClaw runtime, route-sync, watchdog recovery, and staged support for `txn545/Qwen3.5-122B-A10B-NVFP4`
--- a/Context/README.md
+++ b/Context/README.md
@@ -0,0 +1,891 @@
+# Project Velocity — Truthbook
+
+> **What this is:** The single source of truth for Project Velocity. If it's written down here, it's how the system works — not how someone hoped it would work.
+
+---
+
+## Table of Contents
+
+1. [What Is Project Velocity](#what-is-project-velocity)
+2. [Quick Start](#quick-start)
+3. [Architecture Overview](#architecture-overview)
+4. [Runtime Truth](#runtime-truth)
+5. [Team Setup](#team-setup)
+6. [GPU & Model Runtime](#gpu--model-runtime)
+7. [Infrastructure](#infrastructure)
+8. [Runbooks](#runbooks)
+9. [API Reference](#api-reference)
+10. [Contributing](#contributing)
+
+---
+
+## What Is Project Velocity
+
+Project Velocity is a multi-agent AI development platform. It orchestrates intelligent agents (powered by Qwen 3.6 35B A3B and other models) to collaborate on software engineering tasks — code generation, review, testing, deployment — as a coordinated team rather than isolated tools.
+
+**Why it exists:** Single-agent coding tools hit a ceiling. They lack context persistence, cross-task coordination, and operational reliability. Velocity solves this by:
+
+- **Multi-agent collaboration** — Agents communicate via WebSocket channels and shared memory
+- **Persistent state** — PostgreSQL backs user data, CRM records, and agent memory
+- **GPU-accelerated inference** — Local Ollama runtime on NVIDIA GPU hardware
+- **Role-based access control** — Admin and standard user tiers with avatar support
+- **Live event broadcasting** — Real-time campaign and catalyst events via WebSocket
+
+**Core stack:**
+
+| Layer | Technology |
+|-------|-----------|
+| Backend API | Python / FastAPI |
+| Database | PostgreSQL (via `databases` library with connection pooling) |
+| Frontend | React 19 + TypeScript + Vite + Tailwind CSS + Framer Motion |
+| Inference | Ollama (Qwen 3.6 35B A3B primary model) |
+| Real-time | WebSocket (Catalyst channel, CRM channel) |
+| Deployment | systemd services on Linux with NVIDIA GPU |
+
+---
+
+## Quick Start
+
+### Prerequisites
+
+- **GPU Machine:** NVIDIA GPU with sufficient VRAM (≥16GB recommended for Qwen 3.6 35B A3B)
+- **NVMe Storage:** For model weights and cache
+- **Linux OS:** Ubuntu 22.04+ or equivalent
+- **Python 3.11+:** Backend runtime
+- **Node.js 18+:** Frontend build
+- **Ollama:** Latest stable with Qwen 3.6 35B A3B model pulled
+- **PostgreSQL 15+:** Database backend
+
+### One-Line Bootstrap
+
+```bash
+bash bootstrap/setup.sh
+```
+
+This script handles:
+1. GPU driver verification
+2. Ollama installation and model pull
+3. PostgreSQL setup
+4. Backend dependency installation
+5. Frontend dependency installation
+6. systemd service creation
+
+### Manual Setup
+
+#### 1. GPU & Ollama
+
+```bash
+# Verify GPU
+nvidia-smi
+
+# Install Ollama
+curl -fsSL https://ollama.ai/install.sh | sh
+
+# Pull the primary model
+ollama pull qwen3.6:35b-a3b
+
+# Verify model is loaded
+curl http://localhost:11434/api/tags | jq '.models[] | select(.name == "qwen3.6:35b-a3b")'
+```
+
+#### 2. Database
+
+```bash
+# Start PostgreSQL
+sudo systemctl start postgresql
+
+# Create database and user
+psql -U postgres -c "CREATE DATABASE velocity;"
+psql -U postgres -c "CREATE USER velocity WITH PASSWORD 'secure_password';"
+psql -U postgres -c "GRANT ALL PRIVILEGES ON DATABASE velocity TO velocity;"
+```
+
+#### 3. Backend
+
+```bash
+cd Project_Velocity/backend
+
+# Install dependencies
+pip install -r requirements.txt
+
+# Configure environment
+cp .env.example .env
+# Edit .env with your database credentials and secrets
+
+# Run migrations
+python migrate.py
+
+# Start server
+uvicorn main:app --host 0.0.0.0 --port 8000
+```
+
+#### 4. Frontend
+
+```bash
+cd Project_Velocity/app
+
+# Install dependencies
+npm install
+
+# Start dev server
+npm run dev
+```
+
+Frontend is now available at `http://localhost:5173`.
+
+#### 5. Verify Everything
+
+```bash
+# Backend health
+curl http://localhost:8000/health
+
+# Model availability
+curl http://localhost:11434/api/tags
+
+# Frontend
+open http://localhost:5173
+```
+
+---
+
+## Architecture Overview
+
+### System Diagram
+
+```
+┌─────────────┐     ┌──────────────┐     ┌─────────────┐
+│   React UI  │────▶│  FastAPI     │────▶│  PostgreSQL │
+│  (Port 5173)│◀────│  (Port 8000) │◀────│  (Port 5432)│
+└─────────────┘     └──────┬───────┘     └─────────────┘
+                           │
+                           ▼
+                    ┌──────────────┐
+                    │   Ollama     │
+                    │ (Port 11434) │
+                    │ Qwen 3.6 35B │
+                    └──────────────┘
+                           │
+                           ▼
+                    ┌──────────────┐
+                    │  NVIDIA GPU  │
+                    └──────────────┘
+```
+
+### Component Breakdown
+
+#### Backend (`backend/`)
+
+[`main.py`](Project_Velocity/backend/main.py) — FastAPI application with:
+
+- **Auth system** — Login, profile lookup, user listing, avatar upload
+- **WebSocket managers** — [`_CatalystManager()`](Project_Velocity/backend/main.py:296) and [`_CRMManager()`](Project_Velocity/backend/main.py:320) for real-time event broadcasting
+- **Connection pooling** — PostgreSQL via `databases` library with async context management
+- **Lifespan hooks** — [`lifespan()`](Project_Velocity/backend/main.py:83) initializes and cleans up resources
+
+Key endpoints:
+
+| Endpoint | Method | Purpose |
+|----------|--------|---------|
+| `/api/auth/login` | POST | Authenticate user |
+| `/api/auth/me` | GET | Get current user profile |
+| `/api/auth/users` | GET | List all users (admin) |
+| `/api/auth/profile/avatar` | POST | Upload profile avatar |
+| `/ws/catalyst` | WS | Catalyst event channel |
+| `/ws/crm` | WS | CRM event channel |
+| `/health` | GET | Health check |
+
+#### Frontend (`app/`)
+
+[`App.tsx`](Project_Velocity/app/src/App.tsx) — React application with:
+
+- **Protected routes** — [`ProtectedRoute()`](Project_Velocity/app/src/App.tsx:66) wraps authenticated paths
+- **Route module sync** — [`RouteModuleSync()`](Project_Velocity/app/src/App.tsx:90) handles dynamic route loading
+- **Main layout** — [`MainLayout()`](Project_Velocity/app/src/App.tsx:90) provides chrome (header, sidebar, content area)
+- **Role rendering** — [`formatRoleLabel()`](Project_Velocity/app/src/App.tsx:379) converts role codes to display labels
+- **Auth state management** — Dual `useEffect` hooks handle token persistence and user fetch
+
+#### Agent Context (`.Agent Context/`)
+
+Documents that define how agents operate within Velocity:
+
+- [`Qwen 3.6 35B A3B Ollama Access, Recovery, and Team Setup.md`](Project_Velocity/.Agent%20Context/Qwen%203.6%2035B%20A3B%20Ollama%20Access,%20Recovery,%20and%20Team%20Setup.md) — Model runtime, recovery policies, team onboarding
+- `README.md` — This file
+
+#### Infrastructure (`.Infrastructure/`)
+
+Deployment and operational documentation:
+
+- systemd unit files for backend, frontend, Ollama services
+- Network configuration and ingress rules
+- Monitoring and alerting setup
+
+---
+
+## Runtime Truth
+
+### What "Works" Means in Velocity
+
+Velocity has three runtime layers, each with different failure modes:
+
+#### Layer A: Fast Runtime Recovery
+
+If the API crashes or restarts:
+- PostgreSQL connection pool rebuilds automatically via [`lifespan()`](Project_Velocity/backend/main.py:83)
+- WebSocket managers reinitialize and accept new connections
+- No data loss — all state is in PostgreSQL
+
+#### Layer B: Model Rehydration Recovery
+
+If Ollama loses the Qwen model:
+- Watchdog systemd unit detects absence via `/api/tags`
+- Auto-registers model from NVMe cache or S3 artifact storage
+- **Production requirement:** Same-run auto-hydration logic must complete before any agent request
+
+#### Layer C: Full System Recovery
+
+If everything goes down:
+1. PostgreSQL recovers WAL logs
+2. Ollama watchdog restores model
+3. Backend systemd unit restarts API
+4. Frontend rebuilds if artifacts are corrupted
+
+### Critical Contracts
+
+**Auth contract:**
+```
+Client → POST /api/auth/login {email, password}
+       → 200 OK {token, user}
+       
+Client → GET /api/auth/me (Authorization: Bearer <token>)
+       → 200 OK {id, email, role, avatar_url}
+       → 401 Unauthorized
+```
+
+**WebSocket contract:**
+```
+Client → WS /ws/catalyst
+       → Accepts live events: {event_type, campaign_name, value, timestamp}
+
+Client → WS /ws/crm
+       → Accepts CRM events: {type, payload, timestamp}
+```
+
+**Model contract:**
+```
+Ollama → GET /api/tags returns qwen3.6:35b-a3b
+       → Context window: 131072 tokens
+       → Provider: OpenAI-compatible interface at http://localhost:11434/v1
+```
+
+---
+
+## Team Setup
+
+### Developer Onboarding
+
+#### 1. Clone & Bootstrap
+
+```bash
+git clone <repo-url>
+cd Project_Velocity
+bash bootstrap/setup.sh
+```
+
+#### 2. VS Code / Roo Code Configuration
+
+Edit `.vscode/settings.json`:
+
+```json
+{
+  "roo-cline.provider": "openai-compatible",
+  "roo-cline.baseUrl": "http://localhost:11434/v1",
+  "roo-cline.modelId": "qwen3.6:35b-a3b",
+  "roo-cline.contextWindow": 131072,
+  "roo-cline.temperature": 0.7
+}
+```
+
+#### 3. Verify Team Access
+
+```bash
+# Backend health
+curl http://localhost:8000/health
+# Expected: {"status": "ok"}
+
+# Model loaded
+curl http://localhost:11434/api/tags | jq -r '.models[].name'
+# Expected: qwen3.6:35b-a3b
+
+# Frontend
+open http://localhost:5173
+# Expected: Login screen
+```
+
+### Role Definitions
+
+| Role | Access Level | Can Do |
+|------|-------------|--------|
+| `admin` | Full | User management, system config, agent orchestration |
+| `developer` | Standard | Code generation, review, testing |
+| `viewer` | Read-only | Dashboard, campaign monitoring |
+
+### Performance Expectations
+
+| Scenario | Tokens/sec | Latency |
+|----------|-----------|---------|
+| Single-stream (local GPU) | ~80-120 tok/s | ~200ms first token |
+| Two concurrent requests | ~60-90 tok/s each | ~300ms first token |
+| Four-way batch | ~40-60 tok/s each | ~500ms first token |
+
+*Numbers vary by GPU hardware. Measure your setup.*
+
+---
+
+## GPU & Model Runtime
+
+### Hardware Requirements
+
+| Component | Minimum | Recommended |
+|-----------|---------|-------------|
+| GPU VRAM | 16GB | 24GB+ |
+| GPU Compute | Turing architecture | Ada Lovelace / Hopper |
+| NVMe Storage | 50GB free | 100GB+ NVMe Gen4 |
+| RAM | 32GB | 64GB+ |
+
+### Ollama Watchdog
+
+The watchdog is a systemd-managed service that ensures the Qwen model stays loaded:
+
+**Location:** `.Infrastructure/systemd/ollama-watchdog.service`
+
+**Behavior:**
+1. Every 60 seconds, queries `http://localhost:11434/api/tags`
+2. If `qwen3.6:35b-a3b` is absent, triggers rehydration
+3. Rehydration priority: NVMe cache → S3 artifact → remote pull
+4. Logs all actions to journalctl
+
+**Manual watchdog check:**
+```bash
+sudo systemctl status ollama-watchdog
+journalctl -u ollama-watchdog --since "1 hour ago"
+```
+
+### Model Hydration Strategies
+
+| Strategy | Speed | Use Case |
+|----------|-------|----------|
+| NVMe local registration | ~2 seconds | Primary recovery path |
+| Local manifest `ollama create` | ~5 seconds | Fresh hydration from extracted weights |
+| S3 cold hydrate | ~60-300 seconds | No local cache available |
+
+### Critical: What Watchdog Must NOT Do
+
+- ❌ Delete model layers during recovery
+- ❌ Modify GPU memory directly
+- ❌ Block agent requests during hydration (graceful degradation only)
+- ❌ Restart Ollama process unless absolutely necessary
+
+---
+
+## Infrastructure
+
+### Deployment Topology
+
+```
+┌─────────────────────────────────────────────────┐
+│                  Production Host                 │
+│                                                  │
+│  ┌──────────┐  ┌──────────┐  ┌──────────────┐  │
+│  │ Backend  │  │ Frontend │  │   Ollama     │  │
+│  │ :8000    │  │ :5173    │  │  :11434      │  │
+│  │ systemd  │  │ nginx    │  │  systemd     │  │
+│  └────┬─────┘  └────┬─────┘  └──────┬───────┘  │
+│       │             │               │           │
+│       └─────────────┴───────────────┘           │
+│                         │                        │
+│                  ┌──────▼───────┐               │
+│                  │  PostgreSQL  │               │
+│                  │   :5432      │               │
+│                  │  systemd     │               │
+│                  └──────────────┘               │
+│                                                  │
+│  ┌──────────────────────────────────────────┐    │
+│  │        NVIDIA GPU (CUDA + TensorRT)      │    │
+│  └──────────────────────────────────────────┘    │
+└─────────────────────────────────────────────────┘
+```
+
+### systemd Services
+
+| Service | File | Restart Policy |
+|---------|------|---------------|
+| Backend API | `velocity-backend.service` | always |
+| Frontend (nginx) | `velocity-frontend.service` | always |
+| Ollama | `ollama.service` | on-failure |
+| Watchdog | `ollama-watchdog.service` | always |
+| PostgreSQL | `postgresql.service` | on-failure |
+
+### Network Rules
+
+| Port | Protocol | Service | External Access |
+|------|----------|---------|-----------------|
+| 80 | HTTP | nginx → frontend | Yes (public) |
+| 443 | HTTPS | nginx → frontend | Yes (public) |
+| 8000 | TCP | FastAPI backend | No (internal only) |
+| 5173 | TCP | Vite dev server | No (dev only) |
+| 5432 | TCP | PostgreSQL | No (internal only) |
+| 11434 | TCP | Ollama API | No (internal only) |
+
+### Monitoring
+
+```bash
+# All service health
+systemctl status velocity-backend ollama postgresql
+
+# GPU utilization
+nvidia-smi -l 1
+
+# Model inference logs
+journalctl -u ollama -f
+
+# API error rate
+curl -s http://localhost:8000/health | jq .
+```
+
+---
+
+## Runbooks
+
+### Runbook: Backend Crashes at 2 AM
+
+**Symptom:** Frontend shows 500 errors on API calls.
+
+**Steps:**
+
+```bash
+# 1. Check backend status
+sudo systemctl status velocity-backend
+# Expected: active (running)
+
+# 2. If stopped, restart
+sudo systemctl restart velocity-backend
+
+# 3. Check logs for root cause
+sudo journalctl -u velocity-backend --since "30 minutes ago" --no-pager
+
+# 4. Verify recovery
+curl http://localhost:8000/health
+# Expected: {"status": "ok"}
+
+# 5. If crash repeats, check database connectivity
+psql -U velocity -d velocity -c "SELECT 1;"
+# Expected: 1
+```
+
+**If still broken:**
+1. Check disk space: `df -h /`
+2. Check memory: `free -h`
+3. Check PostgreSQL: `sudo systemctl status postgresql`
+4. Escalate with logs from step 3
+
+---
+
+### Runbook: Ollama Model Disappeared
+
+**Symptom:** Agents return empty responses or errors.
+
+**Steps:**
+
+```bash
+# 1. Check if Ollama is running
+sudo systemctl status ollama
+# Expected: active (running)
+
+# 2. Check loaded models
+curl http://localhost:11434/api/tags | jq '.models[].name'
+# Expected: qwen3.6:35b-a3b
+
+# 3. If model is missing, check watchdog
+sudo systemctl status ollama-watchdog
+journalctl -u ollama-watchdog --since "1 hour ago" --no-pager
+
+# 4. Manual recovery if watchdog failed
+ollama pull qwen3.6:35b-a3b
+
+# 5. Verify model is usable
+curl http://localhost:11434/api/generate -d '{
+  "model": "qwen3.6:35b-a3b",
+  "prompt": "Hello",
+  "stream": false
+}' | jq .done
+# Expected: true
+```
+
+---
+
+### Runbook: Database Connection Failures
+
+**Symptom:** Backend logs show `connection refused` or `pool exhausted`.
+
+**Steps:**
+
+```bash
+# 1. Check PostgreSQL status
+sudo systemctl status postgresql
+# Expected: active (running)
+
+# 2. Check connection count
+psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"
+# Should be < max_connections (default 100)
+
+# 3. Check disk space for WAL files
+df -h /var/lib/postgresql
+
+# 4. Restart if hung
+sudo systemctl restart postgresql
+
+# 5. Verify backend reconnects
+sudo journalctl -u velocity-backend --since "1 minute ago" | grep -i "connected\|error"
+```
+
+---
+
+### Runbook: GPU Memory Exhaustion
+
+**Symptom:** Ollama returns `out of memory` errors.
+
+**Steps:**
+
+```bash
+# 1. Check current GPU usage
+nvidia-smi
+# Note: PID, memory usage, temperature
+
+# 2. Kill non-essential GPU processes if needed
+nvidia-smi --id=0 --query-compute-apps=pid,name,used_memory --format=csv
+kill <PID>
+
+# 3. Check Ollama memory allocation
+ollama show qwen3.6:35b-a3b | grep -i "layer\|memory"
+
+# 4. If still exhausted, reduce model quantization
+ollama pull qwen3.6:35b-a3b-q4_0
+
+# 5. Monitor recovery
+watch -n 1 nvidia-smi
+```
+
+---
+
+## API Reference
+
+### Auth Endpoints
+
+#### `POST /api/auth/login`
+
+Authenticate a user and receive a JWT token.
+
+**Request:**
+```json
+{
+  "email": "user@example.com",
+  "password": "secure_password"
+}
+```
+
+**Response (200 OK):**
+```json
+{
+  "token": "eyJhbGciOiJIUzI1NiIs...",
+  "user": {
+    "id": "uuid-here",
+    "email": "user@example.com",
+    "role": "developer",
+    "avatar_url": null
+  }
+}
+```
+
+**Errors:**
+| Status | Meaning |
+|--------|---------|
+| 401 | Invalid credentials |
+| 422 | Malformed request body |
+
+---
+
+#### `GET /api/auth/me`
+
+Get the current authenticated user's profile.
+
+**Headers:**
+```
+Authorization: Bearer <token>
+```
+
+**Response (200 OK):**
+```json
+{
+  "id": "uuid-here",
+  "email": "user@example.com",
+  "role": "developer",
+  "avatar_url": "https://cdn.example.com/avatars/user.png"
+}
+```
+
+**Errors:**
+| Status | Meaning |
+|--------|---------|
+| 401 | Token missing or invalid |
+| 403 | Token expired |
+
+---
+
+#### `GET /api/auth/users`
+
+List all users in the system. Admin only.
+
+**Headers:**
+```
+Authorization: Bearer <admin_token>
+```
+
+**Response (200 OK):**
+```json
+[
+  {
+    "id": "uuid-1",
+    "email": "admin@example.com",
+    "role": "admin",
+    "avatar_url": null
+  },
+  {
+    "id": "uuid-2",
+    "email": "dev@example.com",
+    "role": "developer",
+    "avatar_url": "https://cdn.example.com/avatars/dev.png"
+  }
+]
+```
+
+**Errors:**
+| Status | Meaning |
+|--------|---------|
+| 403 | User is not admin |
+
+---
+
+#### `POST /api/auth/profile/avatar`
+
+Upload a profile avatar image.
+
+**Headers:**
+```
+Authorization: Bearer <token>
+Content-Type: multipart/form-data
+```
+
+**Form Data:**
+| Field | Type | Required |
+|-------|------|----------|
+| avatar | file (image/jpeg, image/png) | Yes |
+
+**Response (200 OK):**
+```json
+{
+  "avatar_url": "https://cdn.example.com/avatars/new-avatar.png"
+}
+```
+
+**Errors:**
+| Status | Meaning |
+|--------|---------|
+| 401 | Not authenticated |
+| 422 | Invalid file type or size > 5MB |
+
+---
+
+### WebSocket Endpoints
+
+#### `WS /ws/catalyst`
+
+Real-time channel for Catalyst events (agent coordination, task updates).
+
+**Connection:**
+```javascript
+const ws = new WebSocket('ws://localhost:8000/ws/catalyst');
+ws.onmessage = (event) => {
+  const data = JSON.parse(event.data);
+  console.log(data.event_type, data.campaign_name, data.value);
+};
+```
+
+**Event Format:**
+```json
+{
+  "event_type": "task_complete",
+  "campaign_name": "codegen-sprint-42",
+  "value": 0.97,
+  "timestamp": "2026-04-21T16:00:00Z"
+}
+```
+
+---
+
+#### `WS /ws/crm`
+
+Real-time channel for CRM events (customer interactions, lead updates).
+
+**Connection:**
+```javascript
+const ws = new WebSocket('ws://localhost:8000/ws/crm');
+ws.onmessage = (event) => {
+  const data = JSON.parse(event.data);
+  console.log(data.type, data.payload);
+};
+```
+
+**Event Format:**
+```json
+{
+  "type": "lead_created",
+  "payload": {
+    "id": "crm-uuid",
+    "name": "Acme Corp",
+    "status": "new"
+  },
+  "timestamp": "2026-04-21T16:00:00Z"
+}
+```
+
+---
+
+### Health Check
+
+#### `GET /health`
+
+Verify system health.
+
+**Response (200 OK):**
+```json
+{
+  "status": "ok",
+  "database": "connected",
+  "ollama": "available",
+  "gpu": "present"
+}
+```
+
+---
+
+## Contributing
+
+### Code Structure
+
+```
+Project_Velocity/
+├── .Agent Context/          # Agent documentation, model specs
+├── .Infrastructure/         # Deployment configs, systemd units
+├── backend/                 # FastAPI backend
+│   ├── main.py              # Application entry point
+│   ├── requirements.txt     # Python dependencies
+│   └── migrate.py           # Database migrations
+├── app/                     # React frontend
+│   ├── src/
+│   │   ├── App.tsx          # Root component
+│   │   └── ...              # Components, routes, utils
+│   ├── package.json         # Node dependencies
+│   └── vite.config.ts       # Build config
+├── bootstrap/               # Setup scripts
+│   └── setup.sh             # One-line bootstrap
+└── README.md                # This file
+```
+
+### Making a Contribution
+
+1. **Fork and branch**
+   ```bash
+   git checkout -b feature/your-feature-name
+   ```
+
+2. **Make changes**
+   - Backend: Follow FastAPI conventions, add type hints
+   - Frontend: Follow React + TypeScript patterns, use existing components
+   - Docs: Update this README if behavior changes
+
+3. **Test locally**
+   ```bash
+   # Backend tests
+   cd backend && pytest
+   
+   # Frontend checks
+   cd app && npm run build
+   ```
+
+4. **Submit PR**
+   - Title: Clear, action-oriented
+   - Description: What + Why + How to test
+   - Link any related issues
+
+### Documentation Standards
+
+- **Every endpoint:** Document inputs, outputs, errors
+- **Every component:** JSDoc for public APIs
+- **Every runbook:** Write as if for on-call at 2am
+- **Every decision:** Record in `DECISIONS.md` with rationale
+
+---
+
+## Appendix
+
+### A. Environment Variables
+
+| Variable | Required | Description |
+|----------|----------|-------------|
+| `DATABASE_URL` | Yes | PostgreSQL connection string |
+| `SECRET_KEY` | Yes | JWT signing key |
+| `OLLAMA_BASE_URL` | No | Ollama API URL (default: `http://localhost:11434`) |
+| `GPU_ENABLED` | No | Enable GPU path (default: `true`) |
+| `LOG_LEVEL` | No | Logging level (default: `INFO`) |
+
+### B. Troubleshooting Matrix
+
+| Symptom | Likely Cause | Fix |
+|---------|-------------|-----|
+| Frontend blank screen | Backend down | `curl http://localhost:8000/health` |
+| 401 on all calls | Token expired | Re-login |
+| Agent returns empty | Model unloaded | `ollama pull qwen3.6:35b-a3b` |
+| Slow responses | GPU not used | Check `nvidia-smi`, verify CUDA |
+| Database errors | Pool exhausted | Check `max_connections`, restart backend |
+| WebSocket disconnects | Network issue | Check firewall, reverse proxy config |
+
+### C. Useful Commands Cheat Sheet
+
+```bash
+# Full system status
+systemctl status velocity-backend ollama postgresql ollama-watchdog
+
+# GPU实时监控
+watch -n 1 nvidia-smi
+
+# Model check
+curl http://localhost:11434/api/tags | jq '.models[].name'
+
+# API health
+curl -s http://localhost:8000/health | jq .
+
+# Database connection test
+psql -U velocity -d velocity -c "SELECT version();"
+
+# Frontend rebuild
+cd app && npm run build && cp -r dist/* ../nginx/html/
+
+# Restart everything (nuclear option)
+sudo systemctl restart velocity-backend ollama postgresql
+```
+
+---
+
+> **Last verified:** 2026-04-21
+> **Maintained by:** Velocity Team
+> **If this doc is wrong, the system is broken. Fix the doc first.**