# Desineuron Ops Control Plane Bible

## Chapter Index

1. Purpose and Operating Model
2. Architecture Map
3. Linux Control-Plane Stack
4. AWS Machine Profiles
5. Market Data and Pricing Logic
6. S3 Asset Model and Bucket Structure
7. Model Hydration Lifecycle
8. Model Ingest From Linux to S3
9. Route Management Through the t4g Ingress
10. Daily Operations Guide
11. Launching a GPU Box
12. Hydrating a Model
13. Starting ComfyUI or Another Workload
14. Tracking Session Time and Cost
15. CSV Exports and Reporting
16. Failure Recovery Runbooks
17. Security Model and Access Control
18. Adding a New Model
19. Adding a New Instance Profile
20. Adding a New Route or Service
21. Backup and Restore
22. Validated Live Behaviors
23. Operator Retrieval Commands

## 1. Purpose and Operating Model

The Desineuron Ops Control Plane is the persistent Linux-hosted operator surface for AWS infrastructure. It centralizes machine launch, model hydration, workload control, cost estimation, route management, and audit history so the team no longer depends on ad hoc Windows terminals or fragile one-off SSH sessions.

Core planes:

- Linux box: control plane
- S3: canonical asset plane
- AWS GPU nodes: ephemeral compute plane
- `t4g.micro` ingress: stable public edge

Current live endpoint:

- `https://ops.desineuron.in/login`

Current canonical S3 bucket:

- `desineuron-ops-control-plane-819079556187-us-east-1`

## 2. Architecture Map

```text
Team -> ops.desineuron.in -> Linux control plane -> ops-web -> ops-api -> ops-worker -> ops-db -> AWS APIs -> S3 bucket -> ingress route helper -> GPU worker nodes
```

## 3. Linux Control-Plane Stack

- Docker Compose stack under `/opt/desineuron-ops-control-plane`
- PostgreSQL stores machine/session/job/audit state
- API and web share the same FastAPI app
- Worker refreshes markets, machines, and session costs
- systemd keeps the stack persistent after reboot

Primary Linux service:

- `desineuron-ops-control-plane.service`

## 4. AWS Machine Profiles

Initial curated profiles:

- `g6-xlarge`
- `g6-2xlarge`
- `g6-4xlarge`
- `g6-12xlarge`

Each profile contains:

- instance type
- GPU label
- vCPU / RAM
- intended workloads
- launch config: AMI, subnet, SGs, key, role/profile, root volume

## 5. Market Data and Pricing Logic

The control plane collects:

- instance offerings by region
- on-demand pricing from AWS Pricing API
- latest spot price history from EC2
- runtime state of all visible machines

Estimated cost model v1:

- live instance price signal
- gp3 storage cost estimate
- public IPv4/EIP cost estimate

## 6. S3 Asset Model and Bucket Structure

Canonical bucket prefixes:

- `models/`
- `workflows/`
- `references/`
- `outputs/`
- `manifests/`
- `bootstrap/`

Models are defined in the `model_catalog` table and hydrated to AWS NVMe on demand.

## 7. Model Hydration Lifecycle

1. Operator selects machine and model
2. Worker ensures `s5cmd` exists on target
3. Assets copy from S3 to `/opt/dlami/nvme/models/...`
4. Operation result is logged
5. Cache state is stored per machine

Hydration verification:

- if a manifest exists at `manifests/models/.json`, the control plane verifies the expected files are present on the GPU node after copy

## 8. Model Ingest From Linux to S3

The control plane can now ingest a real model directory from the Linux box into S3 without manual bucket prep.

Source of truth on Linux:

- `/mnt/ServerStorage/ai-models/models`

Container mount inside the control plane:

- `/model-library`

Operator flow:

1. enter model key
2. enter human label
3. enter source path relative to the Linux model library root
4. optionally set workload and compatibility tags
5. submit `Upload to S3 + Generate Manifest`

Result:

- every file is uploaded under `models//`
- manifest JSON is written to `manifests/models/.json`
- the model catalog entry is upserted in PostgreSQL
- future hydrations can use that manifest for verification

## 9. Route Management Through the t4g Ingress

The ingress remains the stable public edge.

Managed route flow:

1. control plane writes hostname mapping through `manage_desineuron_routes.py`
2. helper renders managed Caddy snippets
3. Caddy reloads
4. route becomes live

Static Linux-origin routes still flow through the existing tunnel/nginx path.

## 10. Daily Operations Guide

- open `ops.desineuron.in`
- log in with internal ops credentials
- review markets and costs
- launch the required GPU profile
- hydrate the model
- start the workload
- map the route if needed
- export session CSVs for accounting or review

## 11. Launching a GPU Box

Use the Launch form in the GUI:

- choose profile
- choose spot or on-demand
- submit

Result:

- instance launches with Desineuron tags
- session row is created
- runtime and cost begin tracking
- if spot capacity is unavailable, the UI records a failed launch job and shows an operator-facing error instead of crashing
- the launcher automatically tries sibling subnets in the same VPC instead of hard-failing on one overloaded AZ

## 12. Hydrating a Model

Use the Hydrate form:

- choose machine
- choose model

Hydration copies from S3 to the instance NVMe path.

## 13. Starting ComfyUI or Another Workload

Use the workload form:

- choose machine
- choose workload

Current v1 workload profile:

- `comfyui`

## 14. Tracking Session Time and Cost

Each machine session is tracked in the DB and can be exported to CSV.

Cost components:

- compute
- storage
- public IPv4

v1 note:

- machine cost is estimate-based
- instance pricing comes from AWS live data where available
- storage and public IPv4 are blended in as estimated infrastructure cost

## 15. CSV Exports and Reporting

CSV export path:

- `exports/sessions_latest.csv`

Use this for:

- session duration review
- estimated expenditure review
- internal ops accounting

Current export path:

- `/opt/desineuron-ops-control-plane/exports/sessions_latest.csv`

The export is also logged in the database as a `csv_exports` record.

## 16. Failure Recovery Runbooks

If the worker stops:

- restart the systemd unit on Linux

If a GPU node is unhealthy:

- inspect machine state
- inspect workload status
- stop or terminate the node
- relaunch from a clean profile

If route mapping fails:

- inspect the ingress helper
- inspect Caddy reload status
- verify the ops container has SSH access to the ingress node

If redeploy breaks PostgreSQL permissions:

- verify `/opt/desineuron-ops-control-plane/data/postgres` is owned by UID/GID `999:999`
- restart `desineuron-ops-control-plane.service`
- never sync runtime directories from repo into the live stack

## 17. Security Model and Access Control

- app is intended to be private
- secrets stay on Linux, not in repo
- actions are audited
- AWS workers expose only minimal required ports
- operator accounts can be provisioned as email-style usernames for team access

Current protected secrets:

- `/opt/desineuron-ops-control-plane/.env`
- `/opt/desineuron-ops-control-plane/state/desineuron-l4-node.pem`

## 18. Adding a New Model

Preferred method:

1. place the model directory under `/mnt/ServerStorage/ai-models/models`
2. use the `Model Library Ingest` form in the ops console
3. let the control plane upload the files, create the manifest, and upsert the catalog entry

Fallback manual method:

1. upload to S3 canonical bucket
2. add catalog entry
3. define expected prefix and optional manifest/checksum

## 19. Adding a New Instance Profile

1. add curated profile definition
2. set launch config
3. verify market visibility
4. test launch

## 20. Adding a New Route or Service

1. define hostname
2. define target backend
3. add route through GUI or helper
4. reload ingress
5. validate health

If the route is for a new public hostname:

6. create the Cloudflare DNS record pointing to `98.87.120.120`
7. keep the record in `DNS only` mode
8. validate TLS issuance on first public request

## 21. Backup and Restore

Persist:

- Postgres data
- `.env`
- exported CSVs
- state directory
- route helper state on ingress

Restore by:

- recreating the compose stack
- restoring DB data
- restoring config/env
- validating machine, model, and route state

## 22. Validated Live Behaviors

As of the latest implementation pass, the following were validated against the live environment:

- `ops.desineuron.in` login and dashboard render correctly
- `/api/markets/instances`, `/api/markets/pricing`, `/api/sessions`, `/api/costs`, and `/api/exports/csv` return live data
- a `g6.xlarge` on-demand launch was executed through the control plane and then terminated through the same surface
- a `g6.xlarge` spot launch failure was handled cleanly and recorded as `InsufficientInstanceCapacity`
- managed ingress route upsert/delete was executed successfully through the route helper
- session and audit data now persist because API DB writes are committed per request
- a model ingest smoke test uploaded `ops-smoke-model` from the Linux model library into S3 and generated a manifest

## 23. Operator Retrieval Commands

Retrieve the admin password on Linux:

```bash
sudo sed -n 's/^OPS_ADMIN_PASSWORD=//p' /opt/desineuron-ops-control-plane/.env
```

Check stack health:

```bash
sudo systemctl status desineuron-ops-control-plane.service
sudo docker compose -f /opt/desineuron-ops-control-plane/docker-compose.yml ps
```

Inspect recent API logs:

```bash
sudo docker logs --tail 100 desineuron-ops-api
sudo docker logs --tail 100 desineuron-ops-worker
```

Inspect exports:

```bash
ls -lah /opt/desineuron-ops-control-plane/exports
```
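
Check PostgreSQL data directory ownership (the `999:999` check from the failure recovery runbook). This is a minimal sketch, assuming GNU `stat` is available on the Linux box. The `check_owner` function is a hypothetical helper for illustration and is not part of the control plane:

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the control plane):
# compare a path's numeric owner against an expected UID:GID pair.
check_owner() {
  local path="$1" expected="$2" actual
  # GNU stat: %u is the numeric user ID, %g the numeric group ID.
  actual="$(stat -c '%u:%g' "$path")" || return 2
  if [ "$actual" = "$expected" ]; then
    echo "ok: $path is owned by $actual"
  else
    echo "mismatch: $path is owned by $actual, expected $expected"
    return 1
  fi
}
```

After sourcing the function, it could be called as `check_owner /opt/desineuron-ops-control-plane/data/postgres 999:999`. It returns 0 on a match, 1 on a mismatch, and 2 if the path cannot be read, so it can gate a restart of `desineuron-ops-control-plane.service` in a script.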