feat: Oracle Canvas, Revision History and Canvas Sharing (#33)

Co-authored-by: Sagnik <sagnik7896@gmail.com> Reviewed-on: #33
2026-04-23 01:20:21 +05:30
parent e519339cc9
commit 6cdc366718
58 changed files with 3187 additions and 705 deletions
--- a/Context/Desineuron
+++ b/Context/Desineuron
@@ -0,0 +1,494 @@
+# Desineuron AWS Coding Runtime Truth Book
+
+Date: 2026-04-22  
+Scope: Coding runtime, Roo Code access, NemoClaw runtime, ingress routing, GPU recovery, model staging
+
+## 1. Current Runtime Truth
+
+The Desineuron shared coding runtime has been cut over from Ollama to SGLang while preserving the public contracts already used by the team.
+
+Locked production decisions:
+
+- Public contract remains stable.
+- GPU inference remains on the AWS GPU worker, not on the Linux-origin box.
+- Linux-origin remains the control plane.
+- Ingress remains the stable routed entrypoint.
+- `Qwen 3.6 35B A3B` remains the production target model for the current `4 x L4` rollout.
+- `NemoClaw` moves onto the same shared runtime.
+- There is no production fallback to Ollama after cutover.
+
+Current live public routes:
+
+- `https://velocity.desineuron.in/llm`
+- `https://llm.desineuron.in`
+
+Current live API shape after cutover:
+
+- `https://velocity.desineuron.in/llm/v1/models`
+- `https://velocity.desineuron.in/llm/v1/chat/completions`
+- `https://llm.desineuron.in/v1/models`
+- `https://llm.desineuron.in/v1/chat/completions`
+- GPU SGLang bind: `172.31.46.190:30100`
+- Linux-origin LLM route-sync target port: `30100`
+
+## 2. Infra Split
+
+### Linux-origin
+
+Responsibilities:
+
+- owns route-sync logic
+- owns operational orchestration
+- updates ingress upstream target when GPU private IP changes
+- does not host the heavy model runtime
+
+### Ingress
+
+Responsibilities:
+
+- terminates public hostname
+- renders stable reverse-proxy contracts
+- forwards `/llm/*` and `llm.desineuron.in` to the current GPU target
+
+### GPU worker
+
+Responsibilities:
+
+- hosts SGLang
+- hosts model payloads on NVMe only
+- serves Roo Code, Oracle runtime, runtime LLM, and NemoClaw inference
+
+Non-negotiable rules:
+
+- do not use the GPU public IP directly
+- do not keep model state on root disk
+- keep all large model/runtime caches on GPU NVMe
+
+## 3. Live Hardware Target
+
+Current worker class:
+
+- `g6.12xlarge`
+- `4 x NVIDIA L4`
+- `96 GB VRAM total`
+
+Serving profile for this hardware:
+
+- tensor parallel size `4`
+- prompt-prefix caching enabled
+- async / continuous batching enabled through SGLang
+- FlashInfer preferred where supported by the live CUDA stack
+
+Measured validation on the live GPU worker:
+
+- host class: `g6.12xlarge`
+- GPU layout: `4 x NVIDIA L4`
+- model path used for the validated runtime: `/opt/dlami/nvme/models/Qwen-Qwen3.6-35B-A3B-FP8`
+- SGLang served model ID used for the test: `qwen3.6-35b-a3b`
+- validated SGLang launch profile:
+  - `--tp-size 4`
+  - `--attention-backend flashinfer`
+  - `--context-length 131072`
+  - `--mem-fraction-static 0.88`
+  - `--dist-init-addr 127.0.0.1:50000`
+  - `--enable-metrics`
+- required bind rule on this SGLang build:
+  - public HTTP server must bind to the GPU private IP, not `0.0.0.0`
+  - internal scheduler keeps a loopback listener on the API port
+  - wildcard bind collides with that loopback listener on this build
+- public validation after cutover:
+  - `https://velocity.desineuron.in/llm/v1/models` returns `200`
+  - `https://llm.desineuron.in/v1/models` returns `200`
+  - streamed chat TTFT through public ingress measured at about `2.36 s`
+  - one short non-stream completion measured about `33.86 completion tok/s`
+
+## 4. Production Model Policy
+
+### Primary production model
+
+- user-facing family: `Qwen 3.6 35B A3B`
+- exact SGLang served model ID: `qwen3.6-35b-a3b`
+
+Why it remains live:
+
+- fits the current `4 x L4` target
+- already aligned with current team workflows
+- suitable for coding/runtime use while the SGLang migration lands
+- measured well enough for three concurrent coding users on the current hardware
+
+### Staged future model on current L4 hardware
+
+- `cyankiwi/Qwen3.5-122B-A10B-AWQ-4bit`
+
+Status:
+
+- acquisition/staging path is added
+- not the live runtime on the current L4 cutover
+- should be treated as a staged artifact for later runtime experimentation and hardware-fit validation
+
+Why this is the right 122B staging path for the current worker:
+
+- `4 x L4` is a better fit for an AWQ/int4 track than for an NVFP4 track
+- this keeps the 122B experiment aligned with current hardware instead of assuming a Blackwell-oriented path
+
+Why `txn545/Qwen3.5-122B-A10B-NVFP4` is not the active choice on L4:
+
+- NVFP4 is not the safe default for the current L4 rollout
+- if the team wants that track later, it should be treated as a separate hardware/runtime validation branch
+
+Why no 122B model is the active live model in this round:
+
+- the current migration is locked to preserving service continuity on the existing `4 x L4` worker
+- the 122B track is a separate performance-fit and runtime-tuning exercise
+
+## 5. Runtime Software Stack
+
+Primary runtime after cutover:
+
+- `SGLang`
+
+Primary interface style:
+
+- OpenAI-compatible `/v1/*`
+
+Required runtime features:
+
+- tensor parallel across all four GPUs
+- prefix cache / prompt cache
+- async scheduling
+- continuous batching
+- FlashInfer when supported by the live driver/runtime stack
+
+Observed runtime note from the live bring-up:
+
+- FlashInfer required `ninja-build` on the GPU box because it JIT-builds kernels on first run.
+- The current GPU image needed:
+  - `ninja-build`
+  - `build-essential`
+- After installing those packages, the FP8 runtime came up cleanly and served OpenAI-compatible traffic.
+
+If stock SGLang underperforms:
+
+- keep the same public routes
+- tune CUDA/runtime behavior behind the same routed contract
+- do not reintroduce Ollama fallback
+
+## 6. Implemented Repo Changes
+
+### Backend runtime service
+
+File:
+
+- `backend/services/runtime_llm_service.py`
+
+Current state:
+
+- provider catalog is standardized to `sglang`
+- legacy provider names like `ollama` and `nemoclaw` are mapped into `sglang` to avoid immediate caller breakage
+- model discovery uses `/v1/models`
+
+### NemoClaw client
+
+File:
+
+- `backend/services/nemoclaw_client.py`
+
+Current state:
+
+- production path now targets the shared SGLang/OpenAI-compatible endpoint
+- NVIDIA and Ollama production fallback logic is removed from the runtime path
+- legacy env names still seed config where needed
+
+### Prompt expander
+
+File:
+
+- `comfy_engine/scripts/prompt_expander.py`
+
+Current state:
+
+- now uses the shared OpenAI-compatible runtime instead of Ollama `/api/generate`
+
+### NemoClaw deploy helper
+
+File:
+
+- `backend/scripts/nemoclaw_deploy.sh`
+
+Current state:
+
+- rewritten around SGLang-compatible inference
+- no Ollama-era deployment assumptions
+
+## 7. Route Sync And Stable Hostnames
+
+Route-sync files:
+
+- `infrastructure/desineuron_ingress/sync_llm_route.py`
+- `infrastructure/desineuron_ingress/run_llm_route_sync.sh`
+- `infrastructure/desineuron_ingress/desineuron-llm-route-sync.service`
+- `infrastructure/desineuron_ingress/desineuron-llm-route-sync.timer`
+- `infrastructure/desineuron_ingress/install_linux_llm_route_sync.sh`
+
+Important behavior:
+
+- Linux-origin discovers the current GPU private IP
+- Linux-origin updates ingress-managed route state
+- ingress forwards `llm.desineuron.in` and `/llm/*` to the GPU worker
+
+Current safe default route-sync port in the repo:
+
+- `11434`
+
+Reason:
+
+- the repo now contains the SGLang installer and watchdog, but the public route should not auto-cut from Ollama to SGLang until the GPU runtime is actually installed and validated on-host
+- when SGLang is installed on the GPU worker, operators should flip `LLM_ROUTE_PORT` to the live SGLang port and then run route-sync
+
+Manual operator-safe route sync entrypoint:
+
+- `/usr/local/bin/run_llm_route_sync.sh`
+
+This avoids the prior failure mode where operators accidentally used a system Python without `boto3`.
+
+## 8. GPU Watchdog And Auto-Recovery
+
+Added GPU-side scripts:
+
+- `infrastructure/desineuron_ingress/install_gpu_sglang_runtime.sh`
+- `infrastructure/desineuron_ingress/install_gpu_sglang_watchdog.sh`
+
+Installed unit names expected on the GPU worker:
+
+- `desineuron-sglang.service`
+- `desineuron-sglang-watchdog.service`
+- `desineuron-sglang-watchdog.timer`
+
+Recovery policy:
+
+- ensure the SGLang service is running
+- verify `/v1/models` health locally
+- if the configured model path is missing, rehydrate from the canonical source
+- only report healthy after successful verification
+
+Required recovery assertions for the SGLang watchdog:
+
+- confirm the process is serving `/v1/models`
+- confirm the returned model list contains `qwen3.6-35b-a3b`
+- confirm all 4 GPUs are engaged during model load
+- confirm FlashInfer dependencies are present before declaring runtime healthy
+
+## 9. Model Rehydration And Staging
+
+Added staging helper:
+
+- `infrastructure/desineuron_ingress/acquire_qwen35_122b_nvfp4.sh`
+
+Purpose:
+
+- stages `cyankiwi/Qwen3.5-122B-A10B-AWQ-4bit` onto GPU NVMe by default
+- does not automatically flip production traffic to that model
+
+Expected current live model path style:
+
+- `/opt/dlami/nvme/models/Qwen-Qwen3.6-35B-A3B-FP8`
+
+Expected staged 122B path style:
+
+- `/opt/dlami/nvme/models/cyankiwi-Qwen3.5-122B-A10B-AWQ-4bit`
+
+## 10. Roo Code Team Setup
+
+After SGLang cutover, team members should stop using the Ollama provider mode for Desineuron-hosted inference.
+
+Canonical team profile:
+
+- API Provider: OpenAI-compatible / custom OpenAI
+- Base URL: `https://llm.desineuron.in/v1`
+- Model: `qwen3.6-35b-a3b`
+- Temperature: `0.1` to `0.2`
+- Server context ceiling: `131072`
+- Recommended Roo context: `131072`
+
+Team decision for this wave:
+
+- all three team members can target `128K` context through the same shared runtime
+- if real concurrent repo-heavy usage causes OOM or latency regression, the first rollback knob is the client context setting, not the model family
+- the current production-ready long-context path is pure VRAM on `4 x L4`, not host-RAM spill
+
+## 11. Measured SGLang Performance
+
+Benchmark date:
+
+- `2026-04-22`
+
+Benchmark topology:
+
+- live AWS GPU worker
+- `SGLang + Qwen 3.6 35B A3B FP8`
+- tensor parallel `4`
+- FlashInfer enabled
+- async scheduler / SGLang default continuous batching path
+- prompt-prefix caching available in runtime
+- server context ceiling: `131072`
+
+Measured results:
+
+- time to first token: `0.12 s`
+- streamed completion wall time for a short coding/planning answer: `1.31 s`
+- test concurrency: `3`
+- aggregate wall time for `3 x 256-token` responses: `3.61 s`
+- aggregate completion tokens: `768`
+- aggregate prompt tokens: `168`
+- aggregate total tokens: `936`
+- aggregate completion throughput: `212.76 tokens/s`
+
+Per-request timing under `3` concurrent requests:
+
+- request 1: `3.608 s` for `256` completion tokens
+- request 2: `3.609 s` for `256` completion tokens
+- request 3: `3.608 s` for `256` completion tokens
+
+Long-context smoke validation:
+
+- prompt size validated: `50010` prompt tokens
+- completion size: `8` tokens
+- total request size: `50018` tokens
+- wall time: `8.345 s`
+
+Operational interpretation:
+
+- the runtime is fast enough for three simultaneous coding users
+- TTFT is already in the sub-200 ms range on the warmed runtime
+- aggregate decode throughput is materially better than the previous Ollama-backed path while holding a `128K` server context ceiling
+- `Qwen 3.6 35B A3B` is the correct production choice for the current one-week delivery window
+
+## 12. Cutover Guidance
+
+Use this model ID consistently across SGLang-facing clients:
+
+- `qwen3.6-35b-a3b`
+
+Do not use this older Ollama-style model ID against SGLang:
+
+- `qwen3.6:35b-a3b`
+
+Why:
+
+- SGLang rejects colons in `served_model_name`
+- the colon is reserved internally for adapter syntax
+
+Backend compatibility note:
+
+- the Velocity backend can still map legacy provider naming internally
+- external Roo Code and OpenAI-compatible clients should use the hyphenated SGLang model ID only
+
+Canonical Roo configuration:
+
+- API Provider: `OpenAI-compatible` or `Custom OpenAI`
+- Base URL: `https://llm.desineuron.in/v1`
+- Model: `qwen3.6-35b-a3b`
+- Context window: `131072`
+- Temperature: `0.1` to `0.2`
+
+Recommended initial values:
+
+- `Base URL`: `https://llm.desineuron.in/v1`
+- `Model`: `qwen3.6-35b-a3b`
+- `Context Window Size (num_ctx equivalent)`: `131072`
+
+Do not use:
+
+- Ollama provider mode pointing at the public Desineuron route after the cutover
+
+Reason:
+
+- the stable contract is moving to SGLang's OpenAI-compatible interface
+
+## 13. Most Efficient Working Long-Context Strategy On Current Hardware
+
+Strategies tested against the live `4 x L4` worker:
+
+1. Pure-VRAM `131072` context on SGLang with tensor parallel `4`
+Result:
+
+- works
+- preserves sub-200 ms TTFT on warm short prompts
+- preserved about `212.76 tok/s` aggregate completion throughput in the 3-user benchmark
+
+2. Hierarchical host-memory cache with `131072` context
+Result:
+
+- not production-safe on the current stack for this model
+- first failed on a model-specific `page_size=1` requirement for the hybrid Mamba cache
+- second attempt progressed further but one rank died with exit code `-9`
+- current interpretation: this path is materially less stable than the pure-VRAM profile
+
+Current decision:
+
+- keep `131072` in VRAM as the production target
+- do not use host-RAM hierarchical cache for this model in the current rollout
+- if more headroom is needed later, tune kernels and scheduling first before re-opening host-memory spill
+
+## 14. NemoClaw Runtime Policy
+
+NemoClaw should use the same shared SGLang runtime as:
+
+- Roo Code
+- Oracle runtime
+- backend runtime LLM jobs
+
+This is a deliberate single-stack decision:
+
+- one serving runtime
+- one model family for the current wave
+- one stable routed contract
+
+If later profiles differ, express that with config, not with a second serving stack in this phase.
+
+## 15. Endpoint Checklist
+
+These should work after cutover:
+
+- `https://velocity.desineuron.in/llm/v1/models`
+- `https://velocity.desineuron.in/llm/v1/chat/completions`
+- `https://llm.desineuron.in/v1/models`
+- `https://llm.desineuron.in/v1/chat/completions`
+
+Internal backend envs:
+
+- `LLM_BASE_URL`
+- `SGLANG_BASE_URL`
+- `SGLANG_CHAT_URL`
+- `SGLANG_MODELS_URL`
+- `SGLANG_MODEL`
+- `SGLANG_API_TOKEN`
+
+## 16. What Is Left
+
+Still required to complete the migration end to end:
+
+1. Persist the `131072` launch profile into the GPU systemd runtime using the updated installer.
+2. Reinstall or update the GPU watchdog so it validates the same `131072` service profile.
+3. Repoint Linux-origin route-sync env from `11434` to the live SGLang port after GPU validation.
+4. Validate both public routes against `/v1/models`.
+5. Run one more public-route benchmark through ingress after cutover to capture real routed TTFT.
+6. Generate tuned L4-specific runtime configs if we want to push further on throughput without lowering context.
+7. Keep the 122B track separate; it is not part of the current production coding-runtime choice.
+
+## 17. Team Hand-Off
+
+For Roo Code today, once cutover is complete, the team only needs:
+
+- Base URL: `https://llm.desineuron.in/v1`
+- Model: `qwen3.6-35b-a3b`
+- Context window: `131072`
+- Provider type: OpenAI-compatible
+
+For operators, the important truth is:
+
+- Linux-origin controls routing
+- ingress owns the stable hostname
+- GPU box owns inference
+- NVMe owns model state
+- SGLang is the production runtime