Files

sagnik 6cdc366718 feat: Oracle Canvas, Revision History and Canvas Sharing (#33 )

Co-authored-by: Sagnik <sagnik7896@gmail.com>
Reviewed-on: sagnik/Project_Velocity#33

2026-04-23 01:20:21 +05:30

15 KiB

Raw Blame History

Desineuron AWS Coding Runtime Truth Book

Date: 2026-04-22
Scope: Coding runtime, Roo Code access, NemoClaw runtime, ingress routing, GPU recovery, model staging

1. Current Runtime Truth

The Desineuron shared coding runtime has been cut over from Ollama to SGLang while preserving the public contracts already used by the team.

Locked production decisions:

Public contract remains stable.
GPU inference remains on the AWS GPU worker, not on the Linux-origin box.
Linux-origin remains the control plane.
Ingress remains the stable routed entrypoint.
Qwen 3.6 35B A3B remains the production target model for the current 4 x L4 rollout.
NemoClaw moves onto the same shared runtime.
There is no production fallback to Ollama after cutover.

Current live public routes:

https://velocity.desineuron.in/llm
https://llm.desineuron.in

Current live API shape after cutover:

https://velocity.desineuron.in/llm/v1/models
https://velocity.desineuron.in/llm/v1/chat/completions
https://llm.desineuron.in/v1/models
https://llm.desineuron.in/v1/chat/completions
GPU SGLang bind: 172.31.46.190:30100
Linux-origin LLM route-sync target port: 30100

2. Infra Split

Linux-origin

Responsibilities:

owns route-sync logic
owns operational orchestration
updates ingress upstream target when GPU private IP changes
does not host the heavy model runtime

Ingress

Responsibilities:

terminates public hostname
renders stable reverse-proxy contracts
forwards /llm/* and llm.desineuron.in to the current GPU target

GPU worker

Responsibilities:

hosts SGLang
hosts model payloads on NVMe only
serves Roo Code, Oracle runtime, runtime LLM, and NemoClaw inference

Non-negotiable rules:

do not use the GPU public IP directly
do not keep model state on root disk
keep all large model/runtime caches on GPU NVMe

3. Live Hardware Target

Current worker class:

g6.12xlarge
4 x NVIDIA L4
96 GB VRAM total

Serving profile for this hardware:

tensor parallel size 4
prompt-prefix caching enabled
async / continuous batching enabled through SGLang
FlashInfer preferred where supported by the live CUDA stack

Measured validation on the live GPU worker:

host class: g6.12xlarge
GPU layout: 4 x NVIDIA L4
model path used for the validated runtime: /opt/dlami/nvme/models/Qwen-Qwen3.6-35B-A3B-FP8
SGLang served model ID used for the test: qwen3.6-35b-a3b
validated SGLang launch profile:
- --tp-size 4
- --attention-backend flashinfer
- --context-length 131072
- --mem-fraction-static 0.88
- --dist-init-addr 127.0.0.1:50000
- --enable-metrics
required bind rule on this SGLang build:
- public HTTP server must bind to the GPU private IP, not 0.0.0.0
- internal scheduler keeps a loopback listener on the API port
- wildcard bind collides with that loopback listener on this build
public validation after cutover:
- https://velocity.desineuron.in/llm/v1/models returns 200
- https://llm.desineuron.in/v1/models returns 200
- streamed chat TTFT through public ingress measured at about 2.36 s
- one short non-stream completion measured about 33.86 completion tok/s

4. Production Model Policy

Primary production model

user-facing family: Qwen 3.6 35B A3B
exact SGLang served model ID: qwen3.6-35b-a3b

Why it remains live:

fits the current 4 x L4 target
already aligned with current team workflows
suitable for coding/runtime use while the SGLang migration lands
measured well enough for three concurrent coding users on the current hardware

Staged future model on current L4 hardware

cyankiwi/Qwen3.5-122B-A10B-AWQ-4bit

Status:

acquisition/staging path is added
not the live runtime on the current L4 cutover
should be treated as a staged artifact for later runtime experimentation and hardware-fit validation

Why this is the right 122B staging path for the current worker:

4 x L4 is a better fit for an AWQ/int4 track than for an NVFP4 track
this keeps the 122B experiment aligned with current hardware instead of assuming a Blackwell-oriented path

Why txn545/Qwen3.5-122B-A10B-NVFP4 is not the active choice on L4:

NVFP4 is not the safe default for the current L4 rollout
if the team wants that track later, it should be treated as a separate hardware/runtime validation branch

Why no 122B model is the active live model in this round:

the current migration is locked to preserving service continuity on the existing 4 x L4 worker
the 122B track is a separate performance-fit and runtime-tuning exercise

5. Runtime Software Stack

Primary runtime after cutover:

SGLang

Primary interface style:

OpenAI-compatible /v1/*

Required runtime features:

tensor parallel across all four GPUs
prefix cache / prompt cache
async scheduling
continuous batching
FlashInfer when supported by the live driver/runtime stack

Observed runtime note from the live bring-up:

FlashInfer required ninja-build on the GPU box because it JIT-builds kernels on first run.
The current GPU image needed:
- ninja-build
- build-essential
After installing those packages, the FP8 runtime came up cleanly and served OpenAI-compatible traffic.

If stock SGLang underperforms:

keep the same public routes
tune CUDA/runtime behavior behind the same routed contract
do not reintroduce Ollama fallback

6. Implemented Repo Changes

Backend runtime service

File:

backend/services/runtime_llm_service.py

Current state:

provider catalog is standardized to sglang
legacy provider names like ollama and nemoclaw are mapped into sglang to avoid immediate caller breakage
model discovery uses /v1/models

NemoClaw client

File:

backend/services/nemoclaw_client.py

Current state:

production path now targets the shared SGLang/OpenAI-compatible endpoint
NVIDIA and Ollama production fallback logic is removed from the runtime path
legacy env names still seed config where needed

Prompt expander

File:

comfy_engine/scripts/prompt_expander.py

Current state:

now uses the shared OpenAI-compatible runtime instead of Ollama /api/generate

NemoClaw deploy helper

File:

backend/scripts/nemoclaw_deploy.sh

Current state:

rewritten around SGLang-compatible inference
no Ollama-era deployment assumptions

7. Route Sync And Stable Hostnames

Route-sync files:

infrastructure/desineuron_ingress/sync_llm_route.py
infrastructure/desineuron_ingress/run_llm_route_sync.sh
infrastructure/desineuron_ingress/desineuron-llm-route-sync.service
infrastructure/desineuron_ingress/desineuron-llm-route-sync.timer
infrastructure/desineuron_ingress/install_linux_llm_route_sync.sh

Important behavior:

Linux-origin discovers the current GPU private IP
Linux-origin updates ingress-managed route state
ingress forwards llm.desineuron.in and /llm/* to the GPU worker

Current safe default route-sync port in the repo:

11434

Reason:

the repo now contains the SGLang installer and watchdog, but the public route should not auto-cut from Ollama to SGLang until the GPU runtime is actually installed and validated on-host
when SGLang is installed on the GPU worker, operators should flip LLM_ROUTE_PORT to the live SGLang port and then run route-sync

Manual operator-safe route sync entrypoint:

/usr/local/bin/run_llm_route_sync.sh

This avoids the prior failure mode where operators accidentally used a system Python without boto3.

8. GPU Watchdog And Auto-Recovery

Added GPU-side scripts:

infrastructure/desineuron_ingress/install_gpu_sglang_runtime.sh
infrastructure/desineuron_ingress/install_gpu_sglang_watchdog.sh

Installed unit names expected on the GPU worker:

desineuron-sglang.service
desineuron-sglang-watchdog.service
desineuron-sglang-watchdog.timer

Recovery policy:

ensure the SGLang service is running
verify /v1/models health locally
if the configured model path is missing, rehydrate from the canonical source
only report healthy after successful verification

Required recovery assertions for the SGLang watchdog:

confirm the process is serving /v1/models
confirm the returned model list contains qwen3.6-35b-a3b
confirm all 4 GPUs are engaged during model load
confirm FlashInfer dependencies are present before declaring runtime healthy

9. Model Rehydration And Staging

Added staging helper:

infrastructure/desineuron_ingress/acquire_qwen35_122b_nvfp4.sh

Purpose:

stages cyankiwi/Qwen3.5-122B-A10B-AWQ-4bit onto GPU NVMe by default
does not automatically flip production traffic to that model

Expected current live model path style:

/opt/dlami/nvme/models/Qwen-Qwen3.6-35B-A3B-FP8

Expected staged 122B path style:

/opt/dlami/nvme/models/cyankiwi-Qwen3.5-122B-A10B-AWQ-4bit

10. Roo Code Team Setup

After SGLang cutover, team members should stop using the Ollama provider mode for Desineuron-hosted inference.

Canonical team profile:

API Provider: OpenAI-compatible / custom OpenAI
Base URL: https://llm.desineuron.in/v1
Model: qwen3.6-35b-a3b
Temperature: 0.1 to 0.2
Server context ceiling: 131072
Recommended Roo context: 131072

Team decision for this wave:

all three team members can target 128K context through the same shared runtime
if real concurrent repo-heavy usage causes OOM or latency regression, the first rollback knob is the client context setting, not the model family
the current production-ready long-context path is pure VRAM on 4 x L4, not host-RAM spill

11. Measured SGLang Performance

Benchmark date:

2026-04-22

Benchmark topology:

live AWS GPU worker
SGLang + Qwen 3.6 35B A3B FP8
tensor parallel 4
FlashInfer enabled
async scheduler / SGLang default continuous batching path
prompt-prefix caching available in runtime
server context ceiling: 131072

Measured results:

time to first token: 0.12 s
streamed completion wall time for a short coding/planning answer: 1.31 s
test concurrency: 3
aggregate wall time for 3 x 256-token responses: 3.61 s
aggregate completion tokens: 768
aggregate prompt tokens: 168
aggregate total tokens: 936
aggregate completion throughput: 212.76 tokens/s

Per-request timing under 3 concurrent requests:

request 1: 3.608 s for 256 completion tokens
request 2: 3.609 s for 256 completion tokens
request 3: 3.608 s for 256 completion tokens

Long-context smoke validation:

prompt size validated: 50010 prompt tokens
completion size: 8 tokens
total request size: 50018 tokens
wall time: 8.345 s

Operational interpretation:

the runtime is fast enough for three simultaneous coding users
TTFT is already in the sub-200 ms range on the warmed runtime
aggregate decode throughput is materially better than the previous Ollama-backed path while holding a 128K server context ceiling
Qwen 3.6 35B A3B is the correct production choice for the current one-week delivery window

12. Cutover Guidance

Use this model ID consistently across SGLang-facing clients:

qwen3.6-35b-a3b

Do not use this older Ollama-style model ID against SGLang:

qwen3.6:35b-a3b

Why:

SGLang rejects colons in served_model_name
the colon is reserved internally for adapter syntax

Backend compatibility note:

the Velocity backend can still map legacy provider naming internally
external Roo Code and OpenAI-compatible clients should use the hyphenated SGLang model ID only

Canonical Roo configuration:

API Provider: OpenAI-compatible or Custom OpenAI
Base URL: https://llm.desineuron.in/v1
Model: qwen3.6-35b-a3b
Context window: 131072
Temperature: 0.1 to 0.2

Recommended initial values:

Base URL: https://llm.desineuron.in/v1
Model: qwen3.6-35b-a3b
Context Window Size (num_ctx equivalent): 131072

Do not use:

Ollama provider mode pointing at the public Desineuron route after the cutover

Reason:

the stable contract is moving to SGLang's OpenAI-compatible interface

13. Most Efficient Working Long-Context Strategy On Current Hardware

Strategies tested against the live 4 x L4 worker:

Pure-VRAM 131072 context on SGLang with tensor parallel 4 Result:

works
preserves sub-200 ms TTFT on warm short prompts
preserved about 212.76 tok/s aggregate completion throughput in the 3-user benchmark

Hierarchical host-memory cache with 131072 context Result:

not production-safe on the current stack for this model
first failed on a model-specific page_size=1 requirement for the hybrid Mamba cache
second attempt progressed further but one rank died with exit code -9
current interpretation: this path is materially less stable than the pure-VRAM profile

Current decision:

keep 131072 in VRAM as the production target
do not use host-RAM hierarchical cache for this model in the current rollout
if more headroom is needed later, tune kernels and scheduling first before re-opening host-memory spill

14. NemoClaw Runtime Policy

NemoClaw should use the same shared SGLang runtime as:

Roo Code
Oracle runtime
backend runtime LLM jobs

This is a deliberate single-stack decision:

one serving runtime
one model family for the current wave
one stable routed contract

If later profiles differ, express that with config, not with a second serving stack in this phase.

15. Endpoint Checklist

These should work after cutover:

https://velocity.desineuron.in/llm/v1/models
https://velocity.desineuron.in/llm/v1/chat/completions
https://llm.desineuron.in/v1/models
https://llm.desineuron.in/v1/chat/completions

Internal backend envs:

LLM_BASE_URL
SGLANG_BASE_URL
SGLANG_CHAT_URL
SGLANG_MODELS_URL
SGLANG_MODEL
SGLANG_API_TOKEN

16. What Is Left

Still required to complete the migration end to end:

Persist the 131072 launch profile into the GPU systemd runtime using the updated installer.
Reinstall or update the GPU watchdog so it validates the same 131072 service profile.
Repoint Linux-origin route-sync env from 11434 to the live SGLang port after GPU validation.
Validate both public routes against /v1/models.
Run one more public-route benchmark through ingress after cutover to capture real routed TTFT.
Generate tuned L4-specific runtime configs if we want to push further on throughput without lowering context.
Keep the 122B track separate; it is not part of the current production coding-runtime choice.

17. Team Hand-Off

For Roo Code today, once cutover is complete, the team only needs:

Base URL: https://llm.desineuron.in/v1
Model: qwen3.6-35b-a3b
Context window: 131072
Provider type: OpenAI-compatible

For operators, the important truth is:

Linux-origin controls routing
ingress owns the stable hostname
GPU box owns inference
NVMe owns model state
SGLang is the production runtime

15 KiB Raw Blame History

Desineuron AWS Coding Runtime Truth Book

1. Current Runtime Truth

2. Infra Split

Linux-origin

Ingress

GPU worker

3. Live Hardware Target

4. Production Model Policy

Primary production model

Staged future model on current L4 hardware

5. Runtime Software Stack

6. Implemented Repo Changes

Backend runtime service

NemoClaw client

Prompt expander

NemoClaw deploy helper

7. Route Sync And Stable Hostnames

8. GPU Watchdog And Auto-Recovery

9. Model Rehydration And Staging

10. Roo Code Team Setup

11. Measured SGLang Performance

12. Cutover Guidance

13. Most Efficient Working Long-Context Strategy On Current Hardware

14. NemoClaw Runtime Policy

15. Endpoint Checklist

16. What Is Left

17. Team Hand-Off

15 KiB

Raw Blame History