# Desineuron AWS Coding Runtime Truth Book Date: 2026-04-22 Scope: Coding runtime, Roo Code access, NemoClaw runtime, ingress routing, GPU recovery, model staging ## 1. Current Runtime Truth The Desineuron shared coding runtime has been cut over from Ollama to SGLang while preserving the public contracts already used by the team. Locked production decisions: - Public contract remains stable. - GPU inference remains on the AWS GPU worker, not on the Linux-origin box. - Linux-origin remains the control plane. - Ingress remains the stable routed entrypoint. - `Qwen 3.6 35B A3B` remains the production target model for the current `4 x L4` rollout. - `NemoClaw` moves onto the same shared runtime. - There is no production fallback to Ollama after cutover. Current live public routes: - `https://velocity.desineuron.in/llm` - `https://llm.desineuron.in` Current live API shape after cutover: - `https://velocity.desineuron.in/llm/v1/models` - `https://velocity.desineuron.in/llm/v1/chat/completions` - `https://llm.desineuron.in/v1/models` - `https://llm.desineuron.in/v1/chat/completions` - GPU SGLang bind: `172.31.46.190:30100` - Linux-origin LLM route-sync target port: `30100` ## 2. Infra Split ### Linux-origin Responsibilities: - owns route-sync logic - owns operational orchestration - updates ingress upstream target when GPU private IP changes - does not host the heavy model runtime ### Ingress Responsibilities: - terminates public hostname - renders stable reverse-proxy contracts - forwards `/llm/*` and `llm.desineuron.in` to the current GPU target ### GPU worker Responsibilities: - hosts SGLang - hosts model payloads on NVMe only - serves Roo Code, Oracle runtime, runtime LLM, and NemoClaw inference Non-negotiable rules: - do not use the GPU public IP directly - do not keep model state on root disk - keep all large model/runtime caches on GPU NVMe ## 3. Live Hardware Target Current worker class: - `g6.12xlarge` - `4 x NVIDIA L4` - `96 GB VRAM total` Serving profile for this hardware: - tensor parallel size `4` - prompt-prefix caching enabled - async / continuous batching enabled through SGLang - FlashInfer preferred where supported by the live CUDA stack Measured validation on the live GPU worker: - host class: `g6.12xlarge` - GPU layout: `4 x NVIDIA L4` - model path used for the validated runtime: `/opt/dlami/nvme/models/Qwen-Qwen3.6-35B-A3B-FP8` - SGLang served model ID used for the test: `qwen3.6-35b-a3b` - validated SGLang launch profile: - `--tp-size 4` - `--attention-backend flashinfer` - `--context-length 131072` - `--mem-fraction-static 0.88` - `--dist-init-addr 127.0.0.1:50000` - `--enable-metrics` - required bind rule on this SGLang build: - public HTTP server must bind to the GPU private IP, not `0.0.0.0` - internal scheduler keeps a loopback listener on the API port - wildcard bind collides with that loopback listener on this build - public validation after cutover: - `https://velocity.desineuron.in/llm/v1/models` returns `200` - `https://llm.desineuron.in/v1/models` returns `200` - streamed chat TTFT through public ingress measured at about `2.36 s` - one short non-stream completion measured about `33.86 completion tok/s` ## 4. Production Model Policy ### Primary production model - user-facing family: `Qwen 3.6 35B A3B` - exact SGLang served model ID: `qwen3.6-35b-a3b` Why it remains live: - fits the current `4 x L4` target - already aligned with current team workflows - suitable for coding/runtime use while the SGLang migration lands - measured well enough for three concurrent coding users on the current hardware ### Staged future model on current L4 hardware - `cyankiwi/Qwen3.5-122B-A10B-AWQ-4bit` Status: - acquisition/staging path is added - not the live runtime on the current L4 cutover - should be treated as a staged artifact for later runtime experimentation and hardware-fit validation Why this is the right 122B staging path for the current worker: - `4 x L4` is a better fit for an AWQ/int4 track than for an NVFP4 track - this keeps the 122B experiment aligned with current hardware instead of assuming a Blackwell-oriented path Why `txn545/Qwen3.5-122B-A10B-NVFP4` is not the active choice on L4: - NVFP4 is not the safe default for the current L4 rollout - if the team wants that track later, it should be treated as a separate hardware/runtime validation branch Why no 122B model is the active live model in this round: - the current migration is locked to preserving service continuity on the existing `4 x L4` worker - the 122B track is a separate performance-fit and runtime-tuning exercise ## 5. Runtime Software Stack Primary runtime after cutover: - `SGLang` Primary interface style: - OpenAI-compatible `/v1/*` Required runtime features: - tensor parallel across all four GPUs - prefix cache / prompt cache - async scheduling - continuous batching - FlashInfer when supported by the live driver/runtime stack Observed runtime note from the live bring-up: - FlashInfer required `ninja-build` on the GPU box because it JIT-builds kernels on first run. - The current GPU image needed: - `ninja-build` - `build-essential` - After installing those packages, the FP8 runtime came up cleanly and served OpenAI-compatible traffic. If stock SGLang underperforms: - keep the same public routes - tune CUDA/runtime behavior behind the same routed contract - do not reintroduce Ollama fallback ## 6. Implemented Repo Changes ### Backend runtime service File: - `backend/services/runtime_llm_service.py` Current state: - provider catalog is standardized to `sglang` - legacy provider names like `ollama` and `nemoclaw` are mapped into `sglang` to avoid immediate caller breakage - model discovery uses `/v1/models` ### NemoClaw client File: - `backend/services/nemoclaw_client.py` Current state: - production path now targets the shared SGLang/OpenAI-compatible endpoint - NVIDIA and Ollama production fallback logic is removed from the runtime path - legacy env names still seed config where needed ### Prompt expander File: - `comfy_engine/scripts/prompt_expander.py` Current state: - now uses the shared OpenAI-compatible runtime instead of Ollama `/api/generate` ### NemoClaw deploy helper File: - `backend/scripts/nemoclaw_deploy.sh` Current state: - rewritten around SGLang-compatible inference - no Ollama-era deployment assumptions ## 7. Route Sync And Stable Hostnames Route-sync files: - `infrastructure/desineuron_ingress/sync_llm_route.py` - `infrastructure/desineuron_ingress/run_llm_route_sync.sh` - `infrastructure/desineuron_ingress/desineuron-llm-route-sync.service` - `infrastructure/desineuron_ingress/desineuron-llm-route-sync.timer` - `infrastructure/desineuron_ingress/install_linux_llm_route_sync.sh` Important behavior: - Linux-origin discovers the current GPU private IP - Linux-origin updates ingress-managed route state - ingress forwards `llm.desineuron.in` and `/llm/*` to the GPU worker Current safe default route-sync port in the repo: - `11434` Reason: - the repo now contains the SGLang installer and watchdog, but the public route should not auto-cut from Ollama to SGLang until the GPU runtime is actually installed and validated on-host - when SGLang is installed on the GPU worker, operators should flip `LLM_ROUTE_PORT` to the live SGLang port and then run route-sync Manual operator-safe route sync entrypoint: - `/usr/local/bin/run_llm_route_sync.sh` This avoids the prior failure mode where operators accidentally used a system Python without `boto3`. ## 8. GPU Watchdog And Auto-Recovery Added GPU-side scripts: - `infrastructure/desineuron_ingress/install_gpu_sglang_runtime.sh` - `infrastructure/desineuron_ingress/install_gpu_sglang_watchdog.sh` Installed unit names expected on the GPU worker: - `desineuron-sglang.service` - `desineuron-sglang-watchdog.service` - `desineuron-sglang-watchdog.timer` Recovery policy: - ensure the SGLang service is running - verify `/v1/models` health locally - if the configured model path is missing, rehydrate from the canonical source - only report healthy after successful verification Required recovery assertions for the SGLang watchdog: - confirm the process is serving `/v1/models` - confirm the returned model list contains `qwen3.6-35b-a3b` - confirm all 4 GPUs are engaged during model load - confirm FlashInfer dependencies are present before declaring runtime healthy ## 9. Model Rehydration And Staging Added staging helper: - `infrastructure/desineuron_ingress/acquire_qwen35_122b_nvfp4.sh` Purpose: - stages `cyankiwi/Qwen3.5-122B-A10B-AWQ-4bit` onto GPU NVMe by default - does not automatically flip production traffic to that model Expected current live model path style: - `/opt/dlami/nvme/models/Qwen-Qwen3.6-35B-A3B-FP8` Expected staged 122B path style: - `/opt/dlami/nvme/models/cyankiwi-Qwen3.5-122B-A10B-AWQ-4bit` ## 10. Roo Code Team Setup After SGLang cutover, team members should stop using the Ollama provider mode for Desineuron-hosted inference. Canonical team profile: - API Provider: OpenAI-compatible / custom OpenAI - Base URL: `https://llm.desineuron.in/v1` - Model: `qwen3.6-35b-a3b` - Temperature: `0.1` to `0.2` - Server context ceiling: `131072` - Recommended Roo context: `131072` Team decision for this wave: - all three team members can target `128K` context through the same shared runtime - if real concurrent repo-heavy usage causes OOM or latency regression, the first rollback knob is the client context setting, not the model family - the current production-ready long-context path is pure VRAM on `4 x L4`, not host-RAM spill ## 11. Measured SGLang Performance Benchmark date: - `2026-04-22` Benchmark topology: - live AWS GPU worker - `SGLang + Qwen 3.6 35B A3B FP8` - tensor parallel `4` - FlashInfer enabled - async scheduler / SGLang default continuous batching path - prompt-prefix caching available in runtime - server context ceiling: `131072` Measured results: - time to first token: `0.12 s` - streamed completion wall time for a short coding/planning answer: `1.31 s` - test concurrency: `3` - aggregate wall time for `3 x 256-token` responses: `3.61 s` - aggregate completion tokens: `768` - aggregate prompt tokens: `168` - aggregate total tokens: `936` - aggregate completion throughput: `212.76 tokens/s` Per-request timing under `3` concurrent requests: - request 1: `3.608 s` for `256` completion tokens - request 2: `3.609 s` for `256` completion tokens - request 3: `3.608 s` for `256` completion tokens Long-context smoke validation: - prompt size validated: `50010` prompt tokens - completion size: `8` tokens - total request size: `50018` tokens - wall time: `8.345 s` Operational interpretation: - the runtime is fast enough for three simultaneous coding users - TTFT is already in the sub-200 ms range on the warmed runtime - aggregate decode throughput is materially better than the previous Ollama-backed path while holding a `128K` server context ceiling - `Qwen 3.6 35B A3B` is the correct production choice for the current one-week delivery window ## 12. Cutover Guidance Use this model ID consistently across SGLang-facing clients: - `qwen3.6-35b-a3b` Do not use this older Ollama-style model ID against SGLang: - `qwen3.6:35b-a3b` Why: - SGLang rejects colons in `served_model_name` - the colon is reserved internally for adapter syntax Backend compatibility note: - the Velocity backend can still map legacy provider naming internally - external Roo Code and OpenAI-compatible clients should use the hyphenated SGLang model ID only Canonical Roo configuration: - API Provider: `OpenAI-compatible` or `Custom OpenAI` - Base URL: `https://llm.desineuron.in/v1` - Model: `qwen3.6-35b-a3b` - Context window: `131072` - Temperature: `0.1` to `0.2` Recommended initial values: - `Base URL`: `https://llm.desineuron.in/v1` - `Model`: `qwen3.6-35b-a3b` - `Context Window Size (num_ctx equivalent)`: `131072` Do not use: - Ollama provider mode pointing at the public Desineuron route after the cutover Reason: - the stable contract is moving to SGLang's OpenAI-compatible interface ## 13. Most Efficient Working Long-Context Strategy On Current Hardware Strategies tested against the live `4 x L4` worker: 1. Pure-VRAM `131072` context on SGLang with tensor parallel `4` Result: - works - preserves sub-200 ms TTFT on warm short prompts - preserved about `212.76 tok/s` aggregate completion throughput in the 3-user benchmark 2. Hierarchical host-memory cache with `131072` context Result: - not production-safe on the current stack for this model - first failed on a model-specific `page_size=1` requirement for the hybrid Mamba cache - second attempt progressed further but one rank died with exit code `-9` - current interpretation: this path is materially less stable than the pure-VRAM profile Current decision: - keep `131072` in VRAM as the production target - do not use host-RAM hierarchical cache for this model in the current rollout - if more headroom is needed later, tune kernels and scheduling first before re-opening host-memory spill ## 14. NemoClaw Runtime Policy NemoClaw should use the same shared SGLang runtime as: - Roo Code - Oracle runtime - backend runtime LLM jobs This is a deliberate single-stack decision: - one serving runtime - one model family for the current wave - one stable routed contract If later profiles differ, express that with config, not with a second serving stack in this phase. ## 15. Endpoint Checklist These should work after cutover: - `https://velocity.desineuron.in/llm/v1/models` - `https://velocity.desineuron.in/llm/v1/chat/completions` - `https://llm.desineuron.in/v1/models` - `https://llm.desineuron.in/v1/chat/completions` Internal backend envs: - `LLM_BASE_URL` - `SGLANG_BASE_URL` - `SGLANG_CHAT_URL` - `SGLANG_MODELS_URL` - `SGLANG_MODEL` - `SGLANG_API_TOKEN` ## 16. What Is Left Still required to complete the migration end to end: 1. Persist the `131072` launch profile into the GPU systemd runtime using the updated installer. 2. Reinstall or update the GPU watchdog so it validates the same `131072` service profile. 3. Repoint Linux-origin route-sync env from `11434` to the live SGLang port after GPU validation. 4. Validate both public routes against `/v1/models`. 5. Run one more public-route benchmark through ingress after cutover to capture real routed TTFT. 6. Generate tuned L4-specific runtime configs if we want to push further on throughput without lowering context. 7. Keep the 122B track separate; it is not part of the current production coding-runtime choice. ## 17. Team Hand-Off For Roo Code today, once cutover is complete, the team only needs: - Base URL: `https://llm.desineuron.in/v1` - Model: `qwen3.6-35b-a3b` - Context window: `131072` - Provider type: OpenAI-compatible For operators, the important truth is: - Linux-origin controls routing - ingress owns the stable hostname - GPU box owns inference - NVMe owns model state - SGLang is the production runtime