Co-authored-by: Sagnik <sagnik7896@gmail.com> Reviewed-on: sagnik/Project_Velocity#33
15 KiB
Desineuron AWS Coding Runtime Truth Book
Date: 2026-04-22
Scope: Coding runtime, Roo Code access, NemoClaw runtime, ingress routing, GPU recovery, model staging
1. Current Runtime Truth
The Desineuron shared coding runtime has been cut over from Ollama to SGLang while preserving the public contracts already used by the team.
Locked production decisions:
- Public contract remains stable.
- GPU inference remains on the AWS GPU worker, not on the Linux-origin box.
- Linux-origin remains the control plane.
- Ingress remains the stable routed entrypoint.
Qwen 3.6 35B A3Bremains the production target model for the current4 x L4rollout.NemoClawmoves onto the same shared runtime.- There is no production fallback to Ollama after cutover.
Current live public routes:
https://velocity.desineuron.in/llmhttps://llm.desineuron.in
Current live API shape after cutover:
https://velocity.desineuron.in/llm/v1/modelshttps://velocity.desineuron.in/llm/v1/chat/completionshttps://llm.desineuron.in/v1/modelshttps://llm.desineuron.in/v1/chat/completions- GPU SGLang bind:
172.31.46.190:30100 - Linux-origin LLM route-sync target port:
30100
2. Infra Split
Linux-origin
Responsibilities:
- owns route-sync logic
- owns operational orchestration
- updates ingress upstream target when GPU private IP changes
- does not host the heavy model runtime
Ingress
Responsibilities:
- terminates public hostname
- renders stable reverse-proxy contracts
- forwards
/llm/*andllm.desineuron.into the current GPU target
GPU worker
Responsibilities:
- hosts SGLang
- hosts model payloads on NVMe only
- serves Roo Code, Oracle runtime, runtime LLM, and NemoClaw inference
Non-negotiable rules:
- do not use the GPU public IP directly
- do not keep model state on root disk
- keep all large model/runtime caches on GPU NVMe
3. Live Hardware Target
Current worker class:
g6.12xlarge4 x NVIDIA L496 GB VRAM total
Serving profile for this hardware:
- tensor parallel size
4 - prompt-prefix caching enabled
- async / continuous batching enabled through SGLang
- FlashInfer preferred where supported by the live CUDA stack
Measured validation on the live GPU worker:
- host class:
g6.12xlarge - GPU layout:
4 x NVIDIA L4 - model path used for the validated runtime:
/opt/dlami/nvme/models/Qwen-Qwen3.6-35B-A3B-FP8 - SGLang served model ID used for the test:
qwen3.6-35b-a3b - validated SGLang launch profile:
--tp-size 4--attention-backend flashinfer--context-length 131072--mem-fraction-static 0.88--dist-init-addr 127.0.0.1:50000--enable-metrics
- required bind rule on this SGLang build:
- public HTTP server must bind to the GPU private IP, not
0.0.0.0 - internal scheduler keeps a loopback listener on the API port
- wildcard bind collides with that loopback listener on this build
- public HTTP server must bind to the GPU private IP, not
- public validation after cutover:
https://velocity.desineuron.in/llm/v1/modelsreturns200https://llm.desineuron.in/v1/modelsreturns200- streamed chat TTFT through public ingress measured at about
2.36 s - one short non-stream completion measured about
33.86 completion tok/s
4. Production Model Policy
Primary production model
- user-facing family:
Qwen 3.6 35B A3B - exact SGLang served model ID:
qwen3.6-35b-a3b
Why it remains live:
- fits the current
4 x L4target - already aligned with current team workflows
- suitable for coding/runtime use while the SGLang migration lands
- measured well enough for three concurrent coding users on the current hardware
Staged future model on current L4 hardware
cyankiwi/Qwen3.5-122B-A10B-AWQ-4bit
Status:
- acquisition/staging path is added
- not the live runtime on the current L4 cutover
- should be treated as a staged artifact for later runtime experimentation and hardware-fit validation
Why this is the right 122B staging path for the current worker:
4 x L4is a better fit for an AWQ/int4 track than for an NVFP4 track- this keeps the 122B experiment aligned with current hardware instead of assuming a Blackwell-oriented path
Why txn545/Qwen3.5-122B-A10B-NVFP4 is not the active choice on L4:
- NVFP4 is not the safe default for the current L4 rollout
- if the team wants that track later, it should be treated as a separate hardware/runtime validation branch
Why no 122B model is the active live model in this round:
- the current migration is locked to preserving service continuity on the existing
4 x L4worker - the 122B track is a separate performance-fit and runtime-tuning exercise
5. Runtime Software Stack
Primary runtime after cutover:
SGLang
Primary interface style:
- OpenAI-compatible
/v1/*
Required runtime features:
- tensor parallel across all four GPUs
- prefix cache / prompt cache
- async scheduling
- continuous batching
- FlashInfer when supported by the live driver/runtime stack
Observed runtime note from the live bring-up:
- FlashInfer required
ninja-buildon the GPU box because it JIT-builds kernels on first run. - The current GPU image needed:
ninja-buildbuild-essential
- After installing those packages, the FP8 runtime came up cleanly and served OpenAI-compatible traffic.
If stock SGLang underperforms:
- keep the same public routes
- tune CUDA/runtime behavior behind the same routed contract
- do not reintroduce Ollama fallback
6. Implemented Repo Changes
Backend runtime service
File:
backend/services/runtime_llm_service.py
Current state:
- provider catalog is standardized to
sglang - legacy provider names like
ollamaandnemoclaware mapped intosglangto avoid immediate caller breakage - model discovery uses
/v1/models
NemoClaw client
File:
backend/services/nemoclaw_client.py
Current state:
- production path now targets the shared SGLang/OpenAI-compatible endpoint
- NVIDIA and Ollama production fallback logic is removed from the runtime path
- legacy env names still seed config where needed
Prompt expander
File:
comfy_engine/scripts/prompt_expander.py
Current state:
- now uses the shared OpenAI-compatible runtime instead of Ollama
/api/generate
NemoClaw deploy helper
File:
backend/scripts/nemoclaw_deploy.sh
Current state:
- rewritten around SGLang-compatible inference
- no Ollama-era deployment assumptions
7. Route Sync And Stable Hostnames
Route-sync files:
infrastructure/desineuron_ingress/sync_llm_route.pyinfrastructure/desineuron_ingress/run_llm_route_sync.shinfrastructure/desineuron_ingress/desineuron-llm-route-sync.serviceinfrastructure/desineuron_ingress/desineuron-llm-route-sync.timerinfrastructure/desineuron_ingress/install_linux_llm_route_sync.sh
Important behavior:
- Linux-origin discovers the current GPU private IP
- Linux-origin updates ingress-managed route state
- ingress forwards
llm.desineuron.inand/llm/*to the GPU worker
Current safe default route-sync port in the repo:
11434
Reason:
- the repo now contains the SGLang installer and watchdog, but the public route should not auto-cut from Ollama to SGLang until the GPU runtime is actually installed and validated on-host
- when SGLang is installed on the GPU worker, operators should flip
LLM_ROUTE_PORTto the live SGLang port and then run route-sync
Manual operator-safe route sync entrypoint:
/usr/local/bin/run_llm_route_sync.sh
This avoids the prior failure mode where operators accidentally used a system Python without boto3.
8. GPU Watchdog And Auto-Recovery
Added GPU-side scripts:
infrastructure/desineuron_ingress/install_gpu_sglang_runtime.shinfrastructure/desineuron_ingress/install_gpu_sglang_watchdog.sh
Installed unit names expected on the GPU worker:
desineuron-sglang.servicedesineuron-sglang-watchdog.servicedesineuron-sglang-watchdog.timer
Recovery policy:
- ensure the SGLang service is running
- verify
/v1/modelshealth locally - if the configured model path is missing, rehydrate from the canonical source
- only report healthy after successful verification
Required recovery assertions for the SGLang watchdog:
- confirm the process is serving
/v1/models - confirm the returned model list contains
qwen3.6-35b-a3b - confirm all 4 GPUs are engaged during model load
- confirm FlashInfer dependencies are present before declaring runtime healthy
9. Model Rehydration And Staging
Added staging helper:
infrastructure/desineuron_ingress/acquire_qwen35_122b_nvfp4.sh
Purpose:
- stages
cyankiwi/Qwen3.5-122B-A10B-AWQ-4bitonto GPU NVMe by default - does not automatically flip production traffic to that model
Expected current live model path style:
/opt/dlami/nvme/models/Qwen-Qwen3.6-35B-A3B-FP8
Expected staged 122B path style:
/opt/dlami/nvme/models/cyankiwi-Qwen3.5-122B-A10B-AWQ-4bit
10. Roo Code Team Setup
After SGLang cutover, team members should stop using the Ollama provider mode for Desineuron-hosted inference.
Canonical team profile:
- API Provider: OpenAI-compatible / custom OpenAI
- Base URL:
https://llm.desineuron.in/v1 - Model:
qwen3.6-35b-a3b - Temperature:
0.1to0.2 - Server context ceiling:
131072 - Recommended Roo context:
131072
Team decision for this wave:
- all three team members can target
128Kcontext through the same shared runtime - if real concurrent repo-heavy usage causes OOM or latency regression, the first rollback knob is the client context setting, not the model family
- the current production-ready long-context path is pure VRAM on
4 x L4, not host-RAM spill
11. Measured SGLang Performance
Benchmark date:
2026-04-22
Benchmark topology:
- live AWS GPU worker
SGLang + Qwen 3.6 35B A3B FP8- tensor parallel
4 - FlashInfer enabled
- async scheduler / SGLang default continuous batching path
- prompt-prefix caching available in runtime
- server context ceiling:
131072
Measured results:
- time to first token:
0.12 s - streamed completion wall time for a short coding/planning answer:
1.31 s - test concurrency:
3 - aggregate wall time for
3 x 256-tokenresponses:3.61 s - aggregate completion tokens:
768 - aggregate prompt tokens:
168 - aggregate total tokens:
936 - aggregate completion throughput:
212.76 tokens/s
Per-request timing under 3 concurrent requests:
- request 1:
3.608 sfor256completion tokens - request 2:
3.609 sfor256completion tokens - request 3:
3.608 sfor256completion tokens
Long-context smoke validation:
- prompt size validated:
50010prompt tokens - completion size:
8tokens - total request size:
50018tokens - wall time:
8.345 s
Operational interpretation:
- the runtime is fast enough for three simultaneous coding users
- TTFT is already in the sub-200 ms range on the warmed runtime
- aggregate decode throughput is materially better than the previous Ollama-backed path while holding a
128Kserver context ceiling Qwen 3.6 35B A3Bis the correct production choice for the current one-week delivery window
12. Cutover Guidance
Use this model ID consistently across SGLang-facing clients:
qwen3.6-35b-a3b
Do not use this older Ollama-style model ID against SGLang:
qwen3.6:35b-a3b
Why:
- SGLang rejects colons in
served_model_name - the colon is reserved internally for adapter syntax
Backend compatibility note:
- the Velocity backend can still map legacy provider naming internally
- external Roo Code and OpenAI-compatible clients should use the hyphenated SGLang model ID only
Canonical Roo configuration:
- API Provider:
OpenAI-compatibleorCustom OpenAI - Base URL:
https://llm.desineuron.in/v1 - Model:
qwen3.6-35b-a3b - Context window:
131072 - Temperature:
0.1to0.2
Recommended initial values:
Base URL:https://llm.desineuron.in/v1Model:qwen3.6-35b-a3bContext Window Size (num_ctx equivalent):131072
Do not use:
- Ollama provider mode pointing at the public Desineuron route after the cutover
Reason:
- the stable contract is moving to SGLang's OpenAI-compatible interface
13. Most Efficient Working Long-Context Strategy On Current Hardware
Strategies tested against the live 4 x L4 worker:
- Pure-VRAM
131072context on SGLang with tensor parallel4Result:
- works
- preserves sub-200 ms TTFT on warm short prompts
- preserved about
212.76 tok/saggregate completion throughput in the 3-user benchmark
- Hierarchical host-memory cache with
131072context Result:
- not production-safe on the current stack for this model
- first failed on a model-specific
page_size=1requirement for the hybrid Mamba cache - second attempt progressed further but one rank died with exit code
-9 - current interpretation: this path is materially less stable than the pure-VRAM profile
Current decision:
- keep
131072in VRAM as the production target - do not use host-RAM hierarchical cache for this model in the current rollout
- if more headroom is needed later, tune kernels and scheduling first before re-opening host-memory spill
14. NemoClaw Runtime Policy
NemoClaw should use the same shared SGLang runtime as:
- Roo Code
- Oracle runtime
- backend runtime LLM jobs
This is a deliberate single-stack decision:
- one serving runtime
- one model family for the current wave
- one stable routed contract
If later profiles differ, express that with config, not with a second serving stack in this phase.
15. Endpoint Checklist
These should work after cutover:
https://velocity.desineuron.in/llm/v1/modelshttps://velocity.desineuron.in/llm/v1/chat/completionshttps://llm.desineuron.in/v1/modelshttps://llm.desineuron.in/v1/chat/completions
Internal backend envs:
LLM_BASE_URLSGLANG_BASE_URLSGLANG_CHAT_URLSGLANG_MODELS_URLSGLANG_MODELSGLANG_API_TOKEN
16. What Is Left
Still required to complete the migration end to end:
- Persist the
131072launch profile into the GPU systemd runtime using the updated installer. - Reinstall or update the GPU watchdog so it validates the same
131072service profile. - Repoint Linux-origin route-sync env from
11434to the live SGLang port after GPU validation. - Validate both public routes against
/v1/models. - Run one more public-route benchmark through ingress after cutover to capture real routed TTFT.
- Generate tuned L4-specific runtime configs if we want to push further on throughput without lowering context.
- Keep the 122B track separate; it is not part of the current production coding-runtime choice.
17. Team Hand-Off
For Roo Code today, once cutover is complete, the team only needs:
- Base URL:
https://llm.desineuron.in/v1 - Model:
qwen3.6-35b-a3b - Context window:
131072 - Provider type: OpenAI-compatible
For operators, the important truth is:
- Linux-origin controls routing
- ingress owns the stable hostname
- GPU box owns inference
- NVMe owns model state
- SGLang is the production runtime