495 lines
15 KiB
Markdown
495 lines
15 KiB
Markdown
# Desineuron AWS Coding Runtime Truth Book
|
|
|
|
Date: 2026-04-22
|
|
Scope: Coding runtime, Roo Code access, NemoClaw runtime, ingress routing, GPU recovery, model staging
|
|
|
|
## 1. Current Runtime Truth
|
|
|
|
The Desineuron shared coding runtime has been cut over from Ollama to SGLang while preserving the public contracts already used by the team.
|
|
|
|
Locked production decisions:
|
|
|
|
- Public contract remains stable.
|
|
- GPU inference remains on the AWS GPU worker, not on the Linux-origin box.
|
|
- Linux-origin remains the control plane.
|
|
- Ingress remains the stable routed entrypoint.
|
|
- `Qwen 3.6 35B A3B` remains the production target model for the current `4 x L4` rollout.
|
|
- `NemoClaw` moves onto the same shared runtime.
|
|
- There is no production fallback to Ollama after cutover.
|
|
|
|
Current live public routes:
|
|
|
|
- `https://velocity.desineuron.in/llm`
|
|
- `https://llm.desineuron.in`
|
|
|
|
Current live API shape after cutover:
|
|
|
|
- `https://velocity.desineuron.in/llm/v1/models`
|
|
- `https://velocity.desineuron.in/llm/v1/chat/completions`
|
|
- `https://llm.desineuron.in/v1/models`
|
|
- `https://llm.desineuron.in/v1/chat/completions`
|
|
- GPU SGLang bind: `172.31.46.190:30100`
|
|
- Linux-origin LLM route-sync target port: `30100`
|
|
|
|
## 2. Infra Split
|
|
|
|
### Linux-origin
|
|
|
|
Responsibilities:
|
|
|
|
- owns route-sync logic
|
|
- owns operational orchestration
|
|
- updates ingress upstream target when GPU private IP changes
|
|
- does not host the heavy model runtime
|
|
|
|
### Ingress
|
|
|
|
Responsibilities:
|
|
|
|
- terminates public hostname
|
|
- renders stable reverse-proxy contracts
|
|
- forwards `/llm/*` and `llm.desineuron.in` to the current GPU target
|
|
|
|
### GPU worker
|
|
|
|
Responsibilities:
|
|
|
|
- hosts SGLang
|
|
- hosts model payloads on NVMe only
|
|
- serves Roo Code, Oracle runtime, runtime LLM, and NemoClaw inference
|
|
|
|
Non-negotiable rules:
|
|
|
|
- do not use the GPU public IP directly
|
|
- do not keep model state on root disk
|
|
- keep all large model/runtime caches on GPU NVMe
|
|
|
|
## 3. Live Hardware Target
|
|
|
|
Current worker class:
|
|
|
|
- `g6.12xlarge`
|
|
- `4 x NVIDIA L4`
|
|
- `96 GB VRAM total`
|
|
|
|
Serving profile for this hardware:
|
|
|
|
- tensor parallel size `4`
|
|
- prompt-prefix caching enabled
|
|
- async / continuous batching enabled through SGLang
|
|
- FlashInfer preferred where supported by the live CUDA stack
|
|
|
|
Measured validation on the live GPU worker:
|
|
|
|
- host class: `g6.12xlarge`
|
|
- GPU layout: `4 x NVIDIA L4`
|
|
- model path used for the validated runtime: `/opt/dlami/nvme/models/Qwen-Qwen3.6-35B-A3B-FP8`
|
|
- SGLang served model ID used for the test: `qwen3.6-35b-a3b`
|
|
- validated SGLang launch profile:
|
|
- `--tp-size 4`
|
|
- `--attention-backend flashinfer`
|
|
- `--context-length 131072`
|
|
- `--mem-fraction-static 0.88`
|
|
- `--dist-init-addr 127.0.0.1:50000`
|
|
- `--enable-metrics`
|
|
- required bind rule on this SGLang build:
|
|
- public HTTP server must bind to the GPU private IP, not `0.0.0.0`
|
|
- internal scheduler keeps a loopback listener on the API port
|
|
- wildcard bind collides with that loopback listener on this build
|
|
- public validation after cutover:
|
|
- `https://velocity.desineuron.in/llm/v1/models` returns `200`
|
|
- `https://llm.desineuron.in/v1/models` returns `200`
|
|
- streamed chat TTFT through public ingress measured at about `2.36 s`
|
|
- one short non-stream completion measured about `33.86 completion tok/s`
|
|
|
|
## 4. Production Model Policy
|
|
|
|
### Primary production model
|
|
|
|
- user-facing family: `Qwen 3.6 35B A3B`
|
|
- exact SGLang served model ID: `qwen3.6-35b-a3b`
|
|
|
|
Why it remains live:
|
|
|
|
- fits the current `4 x L4` target
|
|
- already aligned with current team workflows
|
|
- suitable for coding/runtime use while the SGLang migration lands
|
|
- measured well enough for three concurrent coding users on the current hardware
|
|
|
|
### Staged future model on current L4 hardware
|
|
|
|
- `cyankiwi/Qwen3.5-122B-A10B-AWQ-4bit`
|
|
|
|
Status:
|
|
|
|
- acquisition/staging path is added
|
|
- not the live runtime on the current L4 cutover
|
|
- should be treated as a staged artifact for later runtime experimentation and hardware-fit validation
|
|
|
|
Why this is the right 122B staging path for the current worker:
|
|
|
|
- `4 x L4` is a better fit for an AWQ/int4 track than for an NVFP4 track
|
|
- this keeps the 122B experiment aligned with current hardware instead of assuming a Blackwell-oriented path
|
|
|
|
Why `txn545/Qwen3.5-122B-A10B-NVFP4` is not the active choice on L4:
|
|
|
|
- NVFP4 is not the safe default for the current L4 rollout
|
|
- if the team wants that track later, it should be treated as a separate hardware/runtime validation branch
|
|
|
|
Why no 122B model is the active live model in this round:
|
|
|
|
- the current migration is locked to preserving service continuity on the existing `4 x L4` worker
|
|
- the 122B track is a separate performance-fit and runtime-tuning exercise
|
|
|
|
## 5. Runtime Software Stack
|
|
|
|
Primary runtime after cutover:
|
|
|
|
- `SGLang`
|
|
|
|
Primary interface style:
|
|
|
|
- OpenAI-compatible `/v1/*`
|
|
|
|
Required runtime features:
|
|
|
|
- tensor parallel across all four GPUs
|
|
- prefix cache / prompt cache
|
|
- async scheduling
|
|
- continuous batching
|
|
- FlashInfer when supported by the live driver/runtime stack
|
|
|
|
Observed runtime note from the live bring-up:
|
|
|
|
- FlashInfer required `ninja-build` on the GPU box because it JIT-builds kernels on first run.
|
|
- The current GPU image needed:
|
|
- `ninja-build`
|
|
- `build-essential`
|
|
- After installing those packages, the FP8 runtime came up cleanly and served OpenAI-compatible traffic.
|
|
|
|
If stock SGLang underperforms:
|
|
|
|
- keep the same public routes
|
|
- tune CUDA/runtime behavior behind the same routed contract
|
|
- do not reintroduce Ollama fallback
|
|
|
|
## 6. Implemented Repo Changes
|
|
|
|
### Backend runtime service
|
|
|
|
File:
|
|
|
|
- `backend/services/runtime_llm_service.py`
|
|
|
|
Current state:
|
|
|
|
- provider catalog is standardized to `sglang`
|
|
- legacy provider names like `ollama` and `nemoclaw` are mapped into `sglang` to avoid immediate caller breakage
|
|
- model discovery uses `/v1/models`
|
|
|
|
### NemoClaw client
|
|
|
|
File:
|
|
|
|
- `backend/services/nemoclaw_client.py`
|
|
|
|
Current state:
|
|
|
|
- production path now targets the shared SGLang/OpenAI-compatible endpoint
|
|
- NVIDIA and Ollama production fallback logic is removed from the runtime path
|
|
- legacy env names still seed config where needed
|
|
|
|
### Prompt expander
|
|
|
|
File:
|
|
|
|
- `comfy_engine/scripts/prompt_expander.py`
|
|
|
|
Current state:
|
|
|
|
- now uses the shared OpenAI-compatible runtime instead of Ollama `/api/generate`
|
|
|
|
### NemoClaw deploy helper
|
|
|
|
File:
|
|
|
|
- `backend/scripts/nemoclaw_deploy.sh`
|
|
|
|
Current state:
|
|
|
|
- rewritten around SGLang-compatible inference
|
|
- no Ollama-era deployment assumptions
|
|
|
|
## 7. Route Sync And Stable Hostnames
|
|
|
|
Route-sync files:
|
|
|
|
- `infrastructure/desineuron_ingress/sync_llm_route.py`
|
|
- `infrastructure/desineuron_ingress/run_llm_route_sync.sh`
|
|
- `infrastructure/desineuron_ingress/desineuron-llm-route-sync.service`
|
|
- `infrastructure/desineuron_ingress/desineuron-llm-route-sync.timer`
|
|
- `infrastructure/desineuron_ingress/install_linux_llm_route_sync.sh`
|
|
|
|
Important behavior:
|
|
|
|
- Linux-origin discovers the current GPU private IP
|
|
- Linux-origin updates ingress-managed route state
|
|
- ingress forwards `llm.desineuron.in` and `/llm/*` to the GPU worker
|
|
|
|
Current safe default route-sync port in the repo:
|
|
|
|
- `11434`
|
|
|
|
Reason:
|
|
|
|
- the repo now contains the SGLang installer and watchdog, but the public route should not auto-cut from Ollama to SGLang until the GPU runtime is actually installed and validated on-host
|
|
- when SGLang is installed on the GPU worker, operators should flip `LLM_ROUTE_PORT` to the live SGLang port and then run route-sync
|
|
|
|
Manual operator-safe route sync entrypoint:
|
|
|
|
- `/usr/local/bin/run_llm_route_sync.sh`
|
|
|
|
This avoids the prior failure mode where operators accidentally used a system Python without `boto3`.
|
|
|
|
## 8. GPU Watchdog And Auto-Recovery
|
|
|
|
Added GPU-side scripts:
|
|
|
|
- `infrastructure/desineuron_ingress/install_gpu_sglang_runtime.sh`
|
|
- `infrastructure/desineuron_ingress/install_gpu_sglang_watchdog.sh`
|
|
|
|
Installed unit names expected on the GPU worker:
|
|
|
|
- `desineuron-sglang.service`
|
|
- `desineuron-sglang-watchdog.service`
|
|
- `desineuron-sglang-watchdog.timer`
|
|
|
|
Recovery policy:
|
|
|
|
- ensure the SGLang service is running
|
|
- verify `/v1/models` health locally
|
|
- if the configured model path is missing, rehydrate from the canonical source
|
|
- only report healthy after successful verification
|
|
|
|
Required recovery assertions for the SGLang watchdog:
|
|
|
|
- confirm the process is serving `/v1/models`
|
|
- confirm the returned model list contains `qwen3.6-35b-a3b`
|
|
- confirm all 4 GPUs are engaged during model load
|
|
- confirm FlashInfer dependencies are present before declaring runtime healthy
|
|
|
|
## 9. Model Rehydration And Staging
|
|
|
|
Added staging helper:
|
|
|
|
- `infrastructure/desineuron_ingress/acquire_qwen35_122b_nvfp4.sh`
|
|
|
|
Purpose:
|
|
|
|
- stages `cyankiwi/Qwen3.5-122B-A10B-AWQ-4bit` onto GPU NVMe by default
|
|
- does not automatically flip production traffic to that model
|
|
|
|
Expected current live model path style:
|
|
|
|
- `/opt/dlami/nvme/models/Qwen-Qwen3.6-35B-A3B-FP8`
|
|
|
|
Expected staged 122B path style:
|
|
|
|
- `/opt/dlami/nvme/models/cyankiwi-Qwen3.5-122B-A10B-AWQ-4bit`
|
|
|
|
## 10. Roo Code Team Setup
|
|
|
|
After SGLang cutover, team members should stop using the Ollama provider mode for Desineuron-hosted inference.
|
|
|
|
Canonical team profile:
|
|
|
|
- API Provider: OpenAI-compatible / custom OpenAI
|
|
- Base URL: `https://llm.desineuron.in/v1`
|
|
- Model: `qwen3.6-35b-a3b`
|
|
- Temperature: `0.1` to `0.2`
|
|
- Server context ceiling: `131072`
|
|
- Recommended Roo context: `131072`
|
|
|
|
Team decision for this wave:
|
|
|
|
- all three team members can target `128K` context through the same shared runtime
|
|
- if real concurrent repo-heavy usage causes OOM or latency regression, the first rollback knob is the client context setting, not the model family
|
|
- the current production-ready long-context path is pure VRAM on `4 x L4`, not host-RAM spill
|
|
|
|
## 11. Measured SGLang Performance
|
|
|
|
Benchmark date:
|
|
|
|
- `2026-04-22`
|
|
|
|
Benchmark topology:
|
|
|
|
- live AWS GPU worker
|
|
- `SGLang + Qwen 3.6 35B A3B FP8`
|
|
- tensor parallel `4`
|
|
- FlashInfer enabled
|
|
- async scheduler / SGLang default continuous batching path
|
|
- prompt-prefix caching available in runtime
|
|
- server context ceiling: `131072`
|
|
|
|
Measured results:
|
|
|
|
- time to first token: `0.12 s`
|
|
- streamed completion wall time for a short coding/planning answer: `1.31 s`
|
|
- test concurrency: `3`
|
|
- aggregate wall time for `3 x 256-token` responses: `3.61 s`
|
|
- aggregate completion tokens: `768`
|
|
- aggregate prompt tokens: `168`
|
|
- aggregate total tokens: `936`
|
|
- aggregate completion throughput: `212.76 tokens/s`
|
|
|
|
Per-request timing under `3` concurrent requests:
|
|
|
|
- request 1: `3.608 s` for `256` completion tokens
|
|
- request 2: `3.609 s` for `256` completion tokens
|
|
- request 3: `3.608 s` for `256` completion tokens
|
|
|
|
Long-context smoke validation:
|
|
|
|
- prompt size validated: `50010` prompt tokens
|
|
- completion size: `8` tokens
|
|
- total request size: `50018` tokens
|
|
- wall time: `8.345 s`
|
|
|
|
Operational interpretation:
|
|
|
|
- the runtime is fast enough for three simultaneous coding users
|
|
- TTFT is already in the sub-200 ms range on the warmed runtime
|
|
- aggregate decode throughput is materially better than the previous Ollama-backed path while holding a `128K` server context ceiling
|
|
- `Qwen 3.6 35B A3B` is the correct production choice for the current one-week delivery window
|
|
|
|
## 12. Cutover Guidance
|
|
|
|
Use this model ID consistently across SGLang-facing clients:
|
|
|
|
- `qwen3.6-35b-a3b`
|
|
|
|
Do not use this older Ollama-style model ID against SGLang:
|
|
|
|
- `qwen3.6:35b-a3b`
|
|
|
|
Why:
|
|
|
|
- SGLang rejects colons in `served_model_name`
|
|
- the colon is reserved internally for adapter syntax
|
|
|
|
Backend compatibility note:
|
|
|
|
- the Velocity backend can still map legacy provider naming internally
|
|
- external Roo Code and OpenAI-compatible clients should use the hyphenated SGLang model ID only
|
|
|
|
Canonical Roo configuration:
|
|
|
|
- API Provider: `OpenAI-compatible` or `Custom OpenAI`
|
|
- Base URL: `https://llm.desineuron.in/v1`
|
|
- Model: `qwen3.6-35b-a3b`
|
|
- Context window: `131072`
|
|
- Temperature: `0.1` to `0.2`
|
|
|
|
Recommended initial values:
|
|
|
|
- `Base URL`: `https://llm.desineuron.in/v1`
|
|
- `Model`: `qwen3.6-35b-a3b`
|
|
- `Context Window Size (num_ctx equivalent)`: `131072`
|
|
|
|
Do not use:
|
|
|
|
- Ollama provider mode pointing at the public Desineuron route after the cutover
|
|
|
|
Reason:
|
|
|
|
- the stable contract is moving to SGLang's OpenAI-compatible interface
|
|
|
|
## 13. Most Efficient Working Long-Context Strategy On Current Hardware
|
|
|
|
Strategies tested against the live `4 x L4` worker:
|
|
|
|
1. Pure-VRAM `131072` context on SGLang with tensor parallel `4`
|
|
Result:
|
|
|
|
- works
|
|
- preserves sub-200 ms TTFT on warm short prompts
|
|
- preserved about `212.76 tok/s` aggregate completion throughput in the 3-user benchmark
|
|
|
|
2. Hierarchical host-memory cache with `131072` context
|
|
Result:
|
|
|
|
- not production-safe on the current stack for this model
|
|
- first failed on a model-specific `page_size=1` requirement for the hybrid Mamba cache
|
|
- second attempt progressed further but one rank died with exit code `-9`
|
|
- current interpretation: this path is materially less stable than the pure-VRAM profile
|
|
|
|
Current decision:
|
|
|
|
- keep `131072` in VRAM as the production target
|
|
- do not use host-RAM hierarchical cache for this model in the current rollout
|
|
- if more headroom is needed later, tune kernels and scheduling first before re-opening host-memory spill
|
|
|
|
## 14. NemoClaw Runtime Policy
|
|
|
|
NemoClaw should use the same shared SGLang runtime as:
|
|
|
|
- Roo Code
|
|
- Oracle runtime
|
|
- backend runtime LLM jobs
|
|
|
|
This is a deliberate single-stack decision:
|
|
|
|
- one serving runtime
|
|
- one model family for the current wave
|
|
- one stable routed contract
|
|
|
|
If later profiles differ, express that with config, not with a second serving stack in this phase.
|
|
|
|
## 15. Endpoint Checklist
|
|
|
|
These should work after cutover:
|
|
|
|
- `https://velocity.desineuron.in/llm/v1/models`
|
|
- `https://velocity.desineuron.in/llm/v1/chat/completions`
|
|
- `https://llm.desineuron.in/v1/models`
|
|
- `https://llm.desineuron.in/v1/chat/completions`
|
|
|
|
Internal backend envs:
|
|
|
|
- `LLM_BASE_URL`
|
|
- `SGLANG_BASE_URL`
|
|
- `SGLANG_CHAT_URL`
|
|
- `SGLANG_MODELS_URL`
|
|
- `SGLANG_MODEL`
|
|
- `SGLANG_API_TOKEN`
|
|
|
|
## 16. What Is Left
|
|
|
|
Still required to complete the migration end to end:
|
|
|
|
1. Persist the `131072` launch profile into the GPU systemd runtime using the updated installer.
|
|
2. Reinstall or update the GPU watchdog so it validates the same `131072` service profile.
|
|
3. Repoint Linux-origin route-sync env from `11434` to the live SGLang port after GPU validation.
|
|
4. Validate both public routes against `/v1/models`.
|
|
5. Run one more public-route benchmark through ingress after cutover to capture real routed TTFT.
|
|
6. Generate tuned L4-specific runtime configs if we want to push further on throughput without lowering context.
|
|
7. Keep the 122B track separate; it is not part of the current production coding-runtime choice.
|
|
|
|
## 17. Team Hand-Off
|
|
|
|
For Roo Code today, once cutover is complete, the team only needs:
|
|
|
|
- Base URL: `https://llm.desineuron.in/v1`
|
|
- Model: `qwen3.6-35b-a3b`
|
|
- Context window: `131072`
|
|
- Provider type: OpenAI-compatible
|
|
|
|
For operators, the important truth is:
|
|
|
|
- Linux-origin controls routing
|
|
- ingress owns the stable hostname
|
|
- GPU box owns inference
|
|
- NVMe owns model state
|
|
- SGLang is the production runtime
|