Files
Project_Velocity/.Agent Context/Desineuron AWS Coding Runtime Truth Book.md

15 KiB

Desineuron AWS Coding Runtime Truth Book

Date: 2026-04-22
Scope: Coding runtime, Roo Code access, NemoClaw runtime, ingress routing, GPU recovery, model staging

1. Current Runtime Truth

The Desineuron shared coding runtime has been cut over from Ollama to SGLang while preserving the public contracts already used by the team.

Locked production decisions:

  • Public contract remains stable.
  • GPU inference remains on the AWS GPU worker, not on the Linux-origin box.
  • Linux-origin remains the control plane.
  • Ingress remains the stable routed entrypoint.
  • Qwen 3.6 35B A3B remains the production target model for the current 4 x L4 rollout.
  • NemoClaw moves onto the same shared runtime.
  • There is no production fallback to Ollama after cutover.

Current live public routes:

  • https://velocity.desineuron.in/llm
  • https://llm.desineuron.in

Current live API shape after cutover:

  • https://velocity.desineuron.in/llm/v1/models
  • https://velocity.desineuron.in/llm/v1/chat/completions
  • https://llm.desineuron.in/v1/models
  • https://llm.desineuron.in/v1/chat/completions
  • GPU SGLang bind: 172.31.46.190:30100
  • Linux-origin LLM route-sync target port: 30100

2. Infra Split

Linux-origin

Responsibilities:

  • owns route-sync logic
  • owns operational orchestration
  • updates ingress upstream target when GPU private IP changes
  • does not host the heavy model runtime

Ingress

Responsibilities:

  • terminates public hostname
  • renders stable reverse-proxy contracts
  • forwards /llm/* and llm.desineuron.in to the current GPU target

GPU worker

Responsibilities:

  • hosts SGLang
  • hosts model payloads on NVMe only
  • serves Roo Code, Oracle runtime, runtime LLM, and NemoClaw inference

Non-negotiable rules:

  • do not use the GPU public IP directly
  • do not keep model state on root disk
  • keep all large model/runtime caches on GPU NVMe

3. Live Hardware Target

Current worker class:

  • g6.12xlarge
  • 4 x NVIDIA L4
  • 96 GB VRAM total

Serving profile for this hardware:

  • tensor parallel size 4
  • prompt-prefix caching enabled
  • async / continuous batching enabled through SGLang
  • FlashInfer preferred where supported by the live CUDA stack

Measured validation on the live GPU worker:

  • host class: g6.12xlarge
  • GPU layout: 4 x NVIDIA L4
  • model path used for the validated runtime: /opt/dlami/nvme/models/Qwen-Qwen3.6-35B-A3B-FP8
  • SGLang served model ID used for the test: qwen3.6-35b-a3b
  • validated SGLang launch profile:
    • --tp-size 4
    • --attention-backend flashinfer
    • --context-length 131072
    • --mem-fraction-static 0.88
    • --dist-init-addr 127.0.0.1:50000
    • --enable-metrics
  • required bind rule on this SGLang build:
    • public HTTP server must bind to the GPU private IP, not 0.0.0.0
    • internal scheduler keeps a loopback listener on the API port
    • wildcard bind collides with that loopback listener on this build
  • public validation after cutover:
    • https://velocity.desineuron.in/llm/v1/models returns 200
    • https://llm.desineuron.in/v1/models returns 200
    • streamed chat TTFT through public ingress measured at about 2.36 s
    • one short non-stream completion measured about 33.86 completion tok/s

4. Production Model Policy

Primary production model

  • user-facing family: Qwen 3.6 35B A3B
  • exact SGLang served model ID: qwen3.6-35b-a3b

Why it remains live:

  • fits the current 4 x L4 target
  • already aligned with current team workflows
  • suitable for coding/runtime use while the SGLang migration lands
  • measured well enough for three concurrent coding users on the current hardware

Staged future model on current L4 hardware

  • cyankiwi/Qwen3.5-122B-A10B-AWQ-4bit

Status:

  • acquisition/staging path is added
  • not the live runtime on the current L4 cutover
  • should be treated as a staged artifact for later runtime experimentation and hardware-fit validation

Why this is the right 122B staging path for the current worker:

  • 4 x L4 is a better fit for an AWQ/int4 track than for an NVFP4 track
  • this keeps the 122B experiment aligned with current hardware instead of assuming a Blackwell-oriented path

Why txn545/Qwen3.5-122B-A10B-NVFP4 is not the active choice on L4:

  • NVFP4 is not the safe default for the current L4 rollout
  • if the team wants that track later, it should be treated as a separate hardware/runtime validation branch

Why no 122B model is the active live model in this round:

  • the current migration is locked to preserving service continuity on the existing 4 x L4 worker
  • the 122B track is a separate performance-fit and runtime-tuning exercise

5. Runtime Software Stack

Primary runtime after cutover:

  • SGLang

Primary interface style:

  • OpenAI-compatible /v1/*

Required runtime features:

  • tensor parallel across all four GPUs
  • prefix cache / prompt cache
  • async scheduling
  • continuous batching
  • FlashInfer when supported by the live driver/runtime stack

Observed runtime note from the live bring-up:

  • FlashInfer required ninja-build on the GPU box because it JIT-builds kernels on first run.
  • The current GPU image needed:
    • ninja-build
    • build-essential
  • After installing those packages, the FP8 runtime came up cleanly and served OpenAI-compatible traffic.

If stock SGLang underperforms:

  • keep the same public routes
  • tune CUDA/runtime behavior behind the same routed contract
  • do not reintroduce Ollama fallback

6. Implemented Repo Changes

Backend runtime service

File:

  • backend/services/runtime_llm_service.py

Current state:

  • provider catalog is standardized to sglang
  • legacy provider names like ollama and nemoclaw are mapped into sglang to avoid immediate caller breakage
  • model discovery uses /v1/models

NemoClaw client

File:

  • backend/services/nemoclaw_client.py

Current state:

  • production path now targets the shared SGLang/OpenAI-compatible endpoint
  • NVIDIA and Ollama production fallback logic is removed from the runtime path
  • legacy env names still seed config where needed

Prompt expander

File:

  • comfy_engine/scripts/prompt_expander.py

Current state:

  • now uses the shared OpenAI-compatible runtime instead of Ollama /api/generate

NemoClaw deploy helper

File:

  • backend/scripts/nemoclaw_deploy.sh

Current state:

  • rewritten around SGLang-compatible inference
  • no Ollama-era deployment assumptions

7. Route Sync And Stable Hostnames

Route-sync files:

  • infrastructure/desineuron_ingress/sync_llm_route.py
  • infrastructure/desineuron_ingress/run_llm_route_sync.sh
  • infrastructure/desineuron_ingress/desineuron-llm-route-sync.service
  • infrastructure/desineuron_ingress/desineuron-llm-route-sync.timer
  • infrastructure/desineuron_ingress/install_linux_llm_route_sync.sh

Important behavior:

  • Linux-origin discovers the current GPU private IP
  • Linux-origin updates ingress-managed route state
  • ingress forwards llm.desineuron.in and /llm/* to the GPU worker

Current safe default route-sync port in the repo:

  • 11434

Reason:

  • the repo now contains the SGLang installer and watchdog, but the public route should not auto-cut from Ollama to SGLang until the GPU runtime is actually installed and validated on-host
  • when SGLang is installed on the GPU worker, operators should flip LLM_ROUTE_PORT to the live SGLang port and then run route-sync

Manual operator-safe route sync entrypoint:

  • /usr/local/bin/run_llm_route_sync.sh

This avoids the prior failure mode where operators accidentally used a system Python without boto3.

8. GPU Watchdog And Auto-Recovery

Added GPU-side scripts:

  • infrastructure/desineuron_ingress/install_gpu_sglang_runtime.sh
  • infrastructure/desineuron_ingress/install_gpu_sglang_watchdog.sh

Installed unit names expected on the GPU worker:

  • desineuron-sglang.service
  • desineuron-sglang-watchdog.service
  • desineuron-sglang-watchdog.timer

Recovery policy:

  • ensure the SGLang service is running
  • verify /v1/models health locally
  • if the configured model path is missing, rehydrate from the canonical source
  • only report healthy after successful verification

Required recovery assertions for the SGLang watchdog:

  • confirm the process is serving /v1/models
  • confirm the returned model list contains qwen3.6-35b-a3b
  • confirm all 4 GPUs are engaged during model load
  • confirm FlashInfer dependencies are present before declaring runtime healthy

9. Model Rehydration And Staging

Added staging helper:

  • infrastructure/desineuron_ingress/acquire_qwen35_122b_nvfp4.sh

Purpose:

  • stages cyankiwi/Qwen3.5-122B-A10B-AWQ-4bit onto GPU NVMe by default
  • does not automatically flip production traffic to that model

Expected current live model path style:

  • /opt/dlami/nvme/models/Qwen-Qwen3.6-35B-A3B-FP8

Expected staged 122B path style:

  • /opt/dlami/nvme/models/cyankiwi-Qwen3.5-122B-A10B-AWQ-4bit

10. Roo Code Team Setup

After SGLang cutover, team members should stop using the Ollama provider mode for Desineuron-hosted inference.

Canonical team profile:

  • API Provider: OpenAI-compatible / custom OpenAI
  • Base URL: https://llm.desineuron.in/v1
  • Model: qwen3.6-35b-a3b
  • Temperature: 0.1 to 0.2
  • Server context ceiling: 131072
  • Recommended Roo context: 131072

Team decision for this wave:

  • all three team members can target 128K context through the same shared runtime
  • if real concurrent repo-heavy usage causes OOM or latency regression, the first rollback knob is the client context setting, not the model family
  • the current production-ready long-context path is pure VRAM on 4 x L4, not host-RAM spill

11. Measured SGLang Performance

Benchmark date:

  • 2026-04-22

Benchmark topology:

  • live AWS GPU worker
  • SGLang + Qwen 3.6 35B A3B FP8
  • tensor parallel 4
  • FlashInfer enabled
  • async scheduler / SGLang default continuous batching path
  • prompt-prefix caching available in runtime
  • server context ceiling: 131072

Measured results:

  • time to first token: 0.12 s
  • streamed completion wall time for a short coding/planning answer: 1.31 s
  • test concurrency: 3
  • aggregate wall time for 3 x 256-token responses: 3.61 s
  • aggregate completion tokens: 768
  • aggregate prompt tokens: 168
  • aggregate total tokens: 936
  • aggregate completion throughput: 212.76 tokens/s

Per-request timing under 3 concurrent requests:

  • request 1: 3.608 s for 256 completion tokens
  • request 2: 3.609 s for 256 completion tokens
  • request 3: 3.608 s for 256 completion tokens

Long-context smoke validation:

  • prompt size validated: 50010 prompt tokens
  • completion size: 8 tokens
  • total request size: 50018 tokens
  • wall time: 8.345 s

Operational interpretation:

  • the runtime is fast enough for three simultaneous coding users
  • TTFT is already in the sub-200 ms range on the warmed runtime
  • aggregate decode throughput is materially better than the previous Ollama-backed path while holding a 128K server context ceiling
  • Qwen 3.6 35B A3B is the correct production choice for the current one-week delivery window

12. Cutover Guidance

Use this model ID consistently across SGLang-facing clients:

  • qwen3.6-35b-a3b

Do not use this older Ollama-style model ID against SGLang:

  • qwen3.6:35b-a3b

Why:

  • SGLang rejects colons in served_model_name
  • the colon is reserved internally for adapter syntax

Backend compatibility note:

  • the Velocity backend can still map legacy provider naming internally
  • external Roo Code and OpenAI-compatible clients should use the hyphenated SGLang model ID only

Canonical Roo configuration:

  • API Provider: OpenAI-compatible or Custom OpenAI
  • Base URL: https://llm.desineuron.in/v1
  • Model: qwen3.6-35b-a3b
  • Context window: 131072
  • Temperature: 0.1 to 0.2

Recommended initial values:

  • Base URL: https://llm.desineuron.in/v1
  • Model: qwen3.6-35b-a3b
  • Context Window Size (num_ctx equivalent): 131072

Do not use:

  • Ollama provider mode pointing at the public Desineuron route after the cutover

Reason:

  • the stable contract is moving to SGLang's OpenAI-compatible interface

13. Most Efficient Working Long-Context Strategy On Current Hardware

Strategies tested against the live 4 x L4 worker:

  1. Pure-VRAM 131072 context on SGLang with tensor parallel 4 Result:
  • works
  • preserves sub-200 ms TTFT on warm short prompts
  • preserved about 212.76 tok/s aggregate completion throughput in the 3-user benchmark
  1. Hierarchical host-memory cache with 131072 context Result:
  • not production-safe on the current stack for this model
  • first failed on a model-specific page_size=1 requirement for the hybrid Mamba cache
  • second attempt progressed further but one rank died with exit code -9
  • current interpretation: this path is materially less stable than the pure-VRAM profile

Current decision:

  • keep 131072 in VRAM as the production target
  • do not use host-RAM hierarchical cache for this model in the current rollout
  • if more headroom is needed later, tune kernels and scheduling first before re-opening host-memory spill

14. NemoClaw Runtime Policy

NemoClaw should use the same shared SGLang runtime as:

  • Roo Code
  • Oracle runtime
  • backend runtime LLM jobs

This is a deliberate single-stack decision:

  • one serving runtime
  • one model family for the current wave
  • one stable routed contract

If later profiles differ, express that with config, not with a second serving stack in this phase.

15. Endpoint Checklist

These should work after cutover:

  • https://velocity.desineuron.in/llm/v1/models
  • https://velocity.desineuron.in/llm/v1/chat/completions
  • https://llm.desineuron.in/v1/models
  • https://llm.desineuron.in/v1/chat/completions

Internal backend envs:

  • LLM_BASE_URL
  • SGLANG_BASE_URL
  • SGLANG_CHAT_URL
  • SGLANG_MODELS_URL
  • SGLANG_MODEL
  • SGLANG_API_TOKEN

16. What Is Left

Still required to complete the migration end to end:

  1. Persist the 131072 launch profile into the GPU systemd runtime using the updated installer.
  2. Reinstall or update the GPU watchdog so it validates the same 131072 service profile.
  3. Repoint Linux-origin route-sync env from 11434 to the live SGLang port after GPU validation.
  4. Validate both public routes against /v1/models.
  5. Run one more public-route benchmark through ingress after cutover to capture real routed TTFT.
  6. Generate tuned L4-specific runtime configs if we want to push further on throughput without lowering context.
  7. Keep the 122B track separate; it is not part of the current production coding-runtime choice.

17. Team Hand-Off

For Roo Code today, once cutover is complete, the team only needs:

  • Base URL: https://llm.desineuron.in/v1
  • Model: qwen3.6-35b-a3b
  • Context window: 131072
  • Provider type: OpenAI-compatible

For operators, the important truth is:

  • Linux-origin controls routing
  • ingress owns the stable hostname
  • GPU box owns inference
  • NVMe owns model state
  • SGLang is the production runtime