Project_Velocity/.Agent Context/Bibels/Desineuron Ops Control Plane Bibel.md
2026-04-12 02:02:58 +05:30

Desineuron Ops Control Plane Bibel

Chapter Index

  1. Purpose and Operating Model
  2. Architecture Map
  3. Linux Control-Plane Stack
  4. AWS Machine Profiles
  5. Market Data and Pricing Logic
  6. S3 Asset Model and Bucket Structure
  7. Model Hydration Lifecycle
  8. Model Ingest From Linux to S3
  9. Route Management Through the t4g Ingress
  10. Daily Operations Guide
  11. Launching a GPU Box
  12. Hydrating a Model
  13. Starting ComfyUI or Another Workload
  14. Tracking Session Time and Cost
  15. CSV Exports and Reporting
  16. Failure Recovery Runbooks
  17. Security Model and Access Control
  18. Adding a New Model
  19. Adding a New Instance Profile
  20. Adding a New Route or Service
  21. Backup and Restore
  22. Validated Live Behaviors
  23. Operator Retrieval Commands

1. Purpose and Operating Model

The Desineuron Ops Control Plane is the persistent Linux-hosted operator surface for AWS infrastructure. It centralizes machine launch, model hydration, workload control, cost estimation, route management, and audit history so the team no longer depends on ad hoc Windows terminals or fragile one-off SSH sessions.

Core planes:

  • Linux box: control plane
  • S3: canonical asset plane
  • AWS GPU nodes: ephemeral compute plane
  • t4g.micro ingress: stable public edge

Current live endpoint:

  • https://ops.desineuron.in/login

Current canonical S3 bucket:

  • desineuron-ops-control-plane-819079556187-us-east-1

2. Architecture Map

Team
  -> ops.desineuron.in
  -> Linux control plane
     -> ops-web
     -> ops-api
     -> ops-worker
     -> ops-db
  -> AWS APIs
  -> S3 bucket
  -> ingress route helper
  -> GPU worker nodes

3. Linux Control-Plane Stack

  • Docker Compose stack under /opt/desineuron-ops-control-plane
  • PostgreSQL stores machine/session/job/audit state
  • API and web share the same FastAPI app
  • Worker refreshes markets, machines, and session costs
  • systemd keeps the stack persistent after reboot

Primary Linux service:

  • desineuron-ops-control-plane.service

4. AWS Machine Profiles

Initial curated profiles:

  • g6-xlarge
  • g6-2xlarge
  • g6-4xlarge
  • g6-12xlarge

Each profile contains:

  • instance type
  • GPU label
  • vCPU / RAM
  • intended workloads
  • launch config: AMI, subnet, SGs, key, role/profile, root volume
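A curated profile with the fields listed above could be modeled roughly as follows. This is a minimal sketch, not the control plane's actual schema: the class names, field names, and every placeholder ID are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LaunchConfig:
    """Launch settings for one profile (field names are hypothetical)."""
    ami_id: str
    subnet_id: str
    security_group_ids: Tuple[str, ...]
    key_name: str
    iam_instance_profile: str
    root_volume_gb: int


@dataclass(frozen=True)
class MachineProfile:
    """One curated machine profile (field names are hypothetical)."""
    key: str                      # profile key, e.g. "g6-xlarge"
    instance_type: str            # EC2 instance type, e.g. "g6.xlarge"
    gpu_label: str
    vcpu: int
    ram_gib: int
    intended_workloads: Tuple[str, ...]
    launch: LaunchConfig


# Placeholder values only; real AMI, subnet, and SG IDs are not in this document.
example_profile = MachineProfile(
    key="g6-xlarge",
    instance_type="g6.xlarge",
    gpu_label="1x NVIDIA L4",
    vcpu=4,
    ram_gib=16,
    intended_workloads=("comfyui",),
    launch=LaunchConfig(
        ami_id="ami-EXAMPLE",
        subnet_id="subnet-EXAMPLE",
        security_group_ids=("sg-EXAMPLE",),
        key_name="example-key",
        iam_instance_profile="example-instance-profile",
        root_volume_gb=200,
    ),
)
```

Keeping the launch config as a nested object makes it easy to validate a new profile before a test launch.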

5. Market Data and Pricing Logic

The control plane collects:

  • instance offerings by region
  • on-demand pricing from AWS Pricing API
  • latest spot price history from EC2
  • runtime state of all visible machines
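To illustrate the "latest spot price" signal, the sketch below reduces a list of spot price history entries to the most recent price per availability zone. It assumes entries shaped like the items EC2 returns from DescribeSpotPriceHistory (keys `AvailabilityZone`, `SpotPrice`, `Timestamp`); the function name and the sample prices are invented for illustration and are not real market data.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, Mapping


def latest_spot_by_az(history: Iterable[Mapping]) -> Dict[str, float]:
    """Return the most recent spot price per availability zone."""
    latest: Dict[str, Mapping] = {}
    for entry in history:
        az = entry["AvailabilityZone"]
        # Keep only the newest entry seen so far for this AZ.
        if az not in latest or entry["Timestamp"] > latest[az]["Timestamp"]:
            latest[az] = entry
    # EC2 reports SpotPrice as a string, so convert it to float here.
    return {az: float(e["SpotPrice"]) for az, e in latest.items()}


# Made-up sample entries, not real prices.
sample_history = [
    {"AvailabilityZone": "us-east-1a", "SpotPrice": "0.3000",
     "Timestamp": datetime(2026, 4, 11, 8, 0, tzinfo=timezone.utc)},
    {"AvailabilityZone": "us-east-1a", "SpotPrice": "0.3500",
     "Timestamp": datetime(2026, 4, 11, 12, 0, tzinfo=timezone.utc)},
    {"AvailabilityZone": "us-east-1b", "SpotPrice": "0.2800",
     "Timestamp": datetime(2026, 4, 11, 9, 0, tzinfo=timezone.utc)},
]
spot_prices = latest_spot_by_az(sample_history)
```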

Estimated cost model v1:

  • live instance price signal
  • gp3 storage cost estimate
  • public IPv4/EIP cost estimate

6. S3 Asset Model and Bucket Structure

Canonical bucket prefixes:

  • models/
  • workflows/
  • references/
  • outputs/
  • manifests/
  • bootstrap/

Models are defined in the model_catalog table and hydrated to the GPU node's local NVMe storage on demand.

7. Model Hydration Lifecycle

  1. Operator selects machine and model
  2. Worker ensures s5cmd exists on target
  3. Assets copy from S3 to /opt/dlami/nvme/models/...
  4. Operation result is logged
  5. Cache state is stored per machine

Hydration verification:

  • if a manifest exists at manifests/models/<model-key>.json, the control plane verifies the expected files are present on the GPU node after copy
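The verification step amounts to comparing the manifest's expected files against what is present on the node after the copy. The sketch below assumes a manifest shaped as `{"files": [{"path": ...}]}`; that schema and the function name are assumptions, since the real manifest format is not documented here.

```python
from typing import Iterable, List, Mapping


def missing_files(manifest: Mapping, present_paths: Iterable[str]) -> List[str]:
    """Return manifest paths that were not found on the target node."""
    present = set(present_paths)
    return [f["path"] for f in manifest["files"] if f["path"] not in present]


# Hypothetical manifest and a listing where one expected file is absent.
example_manifest = {
    "files": [
        {"path": "unet/weights.bin"},
        {"path": "vae/weights.bin"},
    ]
}
found_on_node = ["unet/weights.bin"]
missing = missing_files(example_manifest, found_on_node)
```

An empty result means the hydration can be recorded as verified; any returned path identifies exactly which file to re-copy.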

8. Model Ingest From Linux to S3

The control plane can ingest a model directory from the Linux box into S3 without manual bucket preparation.

Source of truth on Linux:

  • /mnt/ServerStorage/ai-models/models

Container mount inside the control plane:

  • /model-library

Operator flow:

  1. enter model key
  2. enter human label
  3. enter source path relative to the Linux model library root
  4. optionally set workload and compatibility tags
  5. submit Upload to S3 + Generate Manifest

Result:

  • every file is uploaded under models/<model-key>/
  • manifest JSON is written to manifests/models/<model-key>.json
  • the model catalog entry is upserted in PostgreSQL
  • future hydrations can use that manifest for verification
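A manifest for an ingested model could be generated by walking the source directory and recording each file's S3 key, size, and checksum. The following is a minimal sketch: the manifest fields, the use of SHA-256, and the function names are assumptions, and the upload itself is omitted. Only the `models/<model-key>/` and `manifests/models/<model-key>.json` key layouts come from this document.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large model files are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(model_key: str, source_dir: str) -> dict:
    """Describe every file under source_dir with its intended S3 key."""
    root = Path(source_dir)
    files = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix()
        files.append({
            "path": rel,
            "s3_key": f"models/{model_key}/{rel}",
            "size": path.stat().st_size,
            "sha256": sha256_of(path),
        })
    return {
        "model_key": model_key,
        "manifest_key": f"manifests/models/{model_key}.json",
        "files": files,
    }


# Demo against a throwaway directory standing in for a model folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "unet").mkdir()
    (Path(tmp) / "unet" / "weights.bin").write_bytes(b"hello")
    manifest = build_manifest("ops-smoke-model", tmp)
```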

9. Route Management Through the t4g Ingress

The ingress remains the stable public edge.

Managed route flow:

  1. control plane writes hostname mapping through manage_desineuron_routes.py
  2. helper renders managed Caddy snippets
  3. Caddy reloads
  4. route becomes live
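Step 2 of the flow above, rendering a managed Caddy snippet from a hostname mapping, might look roughly like this. It is an illustration only: the actual output of manage_desineuron_routes.py is not shown in this document, and the function name, validation rule, hostname, and upstream address are all hypothetical. The snippet uses the standard Caddyfile site block with a `reverse_proxy` directive.

```python
import re

# Conservative hostname check (lowercase labels, dot-separated, alphabetic TLD).
_HOSTNAME_RE = re.compile(
    r"^(?=.{1,253}$)([a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63}$"
)


def render_caddy_site(hostname: str, upstream: str) -> str:
    """Render one Caddyfile site block that proxies hostname to upstream."""
    if not _HOSTNAME_RE.match(hostname):
        raise ValueError(f"invalid hostname: {hostname!r}")
    return f"{hostname} {{\n\treverse_proxy {upstream}\n}}\n"


# Hypothetical mapping: a placeholder hostname to a private GPU node address.
snippet = render_caddy_site("demo.example.com", "10.0.0.12:8188")
```

Validating the hostname before rendering keeps malformed input out of the Caddy configuration, where it could otherwise cause the reload in step 3 to fail.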

Static Linux-origin routes still flow through the existing tunnel/nginx path.

10. Daily Operations Guide

  • open ops.desineuron.in
  • log in with internal ops credentials
  • review markets and costs
  • launch the required GPU profile
  • hydrate the model
  • start the workload
  • map the route if needed
  • export session CSVs for accounting or review

11. Launching a GPU Box

Use the Launch form in the GUI:

  • choose profile
  • choose spot or on-demand
  • submit

Result:

  • instance launches with Desineuron tags
  • session row is created
  • runtime and cost begin tracking
  • if spot capacity is unavailable, the UI records a failed launch job and shows an operator-facing error instead of crashing
  • the launcher automatically tries sibling subnets in the same VPC instead of hard-failing on one overloaded AZ
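The sibling-subnet fallback can be pictured as a loop that tries each subnet in turn and only fails once every subnet has reported a capacity error. This is a sketch of the general technique, not the launcher's code: `InsufficientCapacity`, `launch_with_subnet_fallback`, and the fake launcher are hypothetical stand-ins, and the real EC2 call is replaced by an injected function.

```python
from typing import Callable, List, Sequence, Tuple


class InsufficientCapacity(Exception):
    """Stand-in for an EC2 InsufficientInstanceCapacity error."""


def launch_with_subnet_fallback(
    subnet_ids: Sequence[str],
    launch_fn: Callable[[str], str],
) -> Tuple[str, str]:
    """Try each subnet in order; return (subnet_id, instance_id) on success."""
    failures: List[str] = []
    for subnet_id in subnet_ids:
        try:
            return subnet_id, launch_fn(subnet_id)
        except InsufficientCapacity as exc:
            # Record the failure and move on to the next sibling subnet.
            failures.append(f"{subnet_id}: {exc}")
    raise InsufficientCapacity("; ".join(failures) or "no subnets configured")


def fake_launch(subnet_id: str) -> str:
    """Simulated launcher: the first subnet has no capacity."""
    if subnet_id == "subnet-a":
        raise InsufficientCapacity("no capacity in this AZ")
    return "i-EXAMPLE"


used_subnet, instance_id = launch_with_subnet_fallback(
    ["subnet-a", "subnet-b"], fake_launch
)
```

Collecting the per-subnet failures into the final error gives the operator-facing message enough detail to show which AZs were tried.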

12. Hydrating a Model

Use the Hydrate form:

  • choose machine
  • choose model

Hydration copies from S3 to the instance NVMe path.

13. Starting ComfyUI or Another Workload

Use the workload form:

  • choose machine
  • choose workload

Current v1 workload profile:

  • comfyui

14. Tracking Session Time and Cost

Each machine session is tracked in the DB and can be exported to CSV.

Cost components:

  • compute
  • storage
  • public IPv4

v1 note:

  • machine cost is estimate-based
  • instance pricing comes from AWS live data where available
  • storage and public IPv4 are blended in as estimated infrastructure cost
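As a rough illustration of how an estimate accrues over a session, the sketch below multiplies elapsed session time by a blended hourly rate. The function name and rounding are assumptions; the actual worker may sample prices over time rather than apply a single rate.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def session_cost(
    started_at: datetime,
    hourly_rate: float,
    ended_at: Optional[datetime] = None,
) -> float:
    """Estimated cost of a session; open sessions are costed up to now."""
    end = ended_at or datetime.now(timezone.utc)
    hours = max((end - started_at).total_seconds(), 0.0) / 3600.0
    return round(hours * hourly_rate, 4)


# Illustrative values: a 90-minute session at a blended 1.00 USD/hour.
start = datetime(2026, 4, 12, 10, 0, tzinfo=timezone.utc)
cost = session_cost(start, 1.0, start + timedelta(minutes=90))
```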

15. CSV Exports and Reporting

CSV export path (relative to the stack root):

  • exports/sessions_latest.csv

Use this for:

  • session duration review
  • estimated expenditure review
  • internal ops accounting

Absolute export path on Linux:

  • /opt/desineuron-ops-control-plane/exports/sessions_latest.csv

The export is also logged in the database as a csv_exports record.
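A session export of this kind can be produced with Python's standard csv module. The sketch below is illustrative only: the column names and the sample row are assumptions, not the actual layout of sessions_latest.csv.

```python
import csv
import io
from typing import Iterable, Mapping

# Hypothetical column set; the real export may use different fields.
FIELDS = ["session_id", "instance_type", "started_at", "ended_at",
          "estimated_cost_usd"]


def sessions_to_csv(rows: Iterable[Mapping]) -> str:
    """Serialize session rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Made-up sample row for demonstration.
csv_text = sessions_to_csv([
    {"session_id": 1, "instance_type": "g6.xlarge",
     "started_at": "2026-04-12T10:00:00Z", "ended_at": "2026-04-12T11:30:00Z",
     "estimated_cost_usd": 1.5},
])
```

Building the text in memory first makes it straightforward to write the file and record the csv_exports row in one step.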

16. Failure Recovery Runbooks

If the worker stops:

  • restart the systemd unit on Linux (desineuron-ops-control-plane.service)

If a GPU node is unhealthy:

  • inspect machine state
  • inspect workload status
  • stop or terminate the node
  • relaunch from a clean profile

If route mapping fails:

  • inspect the ingress helper
  • inspect Caddy reload status
  • verify the ops container has SSH access to the ingress node

If redeploy breaks PostgreSQL permissions:

  • verify /opt/desineuron-ops-control-plane/data/postgres is owned by UID/GID 999:999
  • restart desineuron-ops-control-plane.service
  • never sync runtime directories from repo into the live stack
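The ownership check in the first step can be scripted. This is a minimal sketch with a hypothetical helper name; on the Linux box it would be called as `owned_by("/opt/desineuron-ops-control-plane/data/postgres", 999, 999)`, while the demo below runs against a temporary directory so it works anywhere.

```python
import os
import tempfile


def owned_by(path: str, uid: int, gid: int) -> bool:
    """Return True when path is owned by exactly the given UID and GID."""
    st = os.stat(path)
    return st.st_uid == uid and st.st_gid == gid


# Demo: a fresh temp directory is owned by whoever created it.
with tempfile.TemporaryDirectory() as tmp:
    st = os.stat(tmp)
    self_check = owned_by(tmp, st.st_uid, st.st_gid)
    mismatch = owned_by(tmp, st.st_uid + 1, st.st_gid)
```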

17. Security Model and Access Control

  • app is intended to be private
  • secrets stay on Linux, not in repo
  • actions are audited
  • AWS workers expose only minimal required ports
  • operator accounts can be provisioned as email-style usernames for team access

Current protected secrets:

  • /opt/desineuron-ops-control-plane/.env
  • /opt/desineuron-ops-control-plane/state/desineuron-l4-node.pem

18. Adding a New Model

Preferred method:

  1. place the model directory under /mnt/ServerStorage/ai-models/models
  2. use the Model Library Ingest form in the ops console
  3. let the control plane upload the files, create the manifest, and upsert the catalog entry

Fallback manual method:

  1. upload to S3 canonical bucket
  2. add catalog entry
  3. define expected prefix and optional manifest/checksum

19. Adding a New Instance Profile

  1. add curated profile definition
  2. set launch config
  3. verify market visibility
  4. test launch

20. Adding a New Route or Service

  1. define hostname
  2. define target backend
  3. add route through GUI or helper
  4. reload ingress
  5. validate health

If the route is for a new public hostname:

  1. create the Cloudflare DNS record pointing to 98.87.120.120
  2. keep the record in DNS only mode (not proxied through Cloudflare)
  3. validate TLS issuance on first public request

21. Backup and Restore

Persist:

  • Postgres data
  • .env
  • exported CSVs
  • state directory
  • route helper state on ingress

Restore by:

  • recreating the compose stack
  • restoring DB data
  • restoring config/env
  • validating machine, model, and route state

22. Validated Live Behaviors

As of the latest implementation pass, the following were validated against the live environment:

  • ops.desineuron.in login and dashboard render correctly
  • /api/markets/instances, /api/markets/pricing, /api/sessions, /api/costs, and /api/exports/csv return live data
  • a g6.xlarge on-demand launch was executed through the control plane and then terminated through the same surface
  • a g6.xlarge spot launch failure was handled cleanly and recorded as InsufficientInstanceCapacity
  • managed ingress route upsert/delete was executed successfully through the route helper
  • session and audit data now persist because API DB writes are committed per request
  • a model ingest smoke test uploaded ops-smoke-model from the Linux model library into S3 and generated a manifest

23. Operator Retrieval Commands

Retrieve the admin password on Linux:

sudo sed -n 's/^OPS_ADMIN_PASSWORD=//p' /opt/desineuron-ops-control-plane/.env

Check stack health:

sudo systemctl status desineuron-ops-control-plane.service
sudo docker compose -f /opt/desineuron-ops-control-plane/docker-compose.yml ps

Inspect recent API and worker logs:

sudo docker logs --tail 100 desineuron-ops-api
sudo docker logs --tail 100 desineuron-ops-worker

Inspect exports:

ls -lah /opt/desineuron-ops-control-plane/exports