Desineuron Ops Control Plane Bible
Chapter Index
- Purpose and Operating Model
- Architecture Map
- Linux Control-Plane Stack
- AWS Machine Profiles
- Market Data and Pricing Logic
- S3 Asset Model and Bucket Structure
- Model Hydration Lifecycle
- Model Ingest From Linux to S3
- Route Management Through the t4g Ingress
- Daily Operations Guide
- Launching a GPU Box
- Hydrating a Model
- Starting ComfyUI or Another Workload
- Tracking Session Time and Cost
- CSV Exports and Reporting
- Failure Recovery Runbooks
- Security Model and Access Control
- Adding a New Model
- Adding a New Instance Profile
- Adding a New Route or Service
- Backup and Restore
- Validated Live Behaviors
- Operator Retrieval Commands
1. Purpose and Operating Model
The Desineuron Ops Control Plane is the persistent Linux-hosted operator surface for AWS infrastructure. It centralizes machine launch, model hydration, workload control, cost estimation, route management, and audit history so the team no longer depends on ad hoc Windows terminals or fragile one-off SSH sessions.
Core planes:
- Linux box: control plane
- S3: canonical asset plane
- AWS GPU nodes: ephemeral compute plane
- t4g.micro ingress: stable public edge
Current live endpoint:
https://ops.desineuron.in/login
Current canonical S3 bucket:
desineuron-ops-control-plane-819079556187-us-east-1
2. Architecture Map
Team
-> ops.desineuron.in
-> Linux control plane
-> ops-web
-> ops-api
-> ops-worker
-> ops-db
-> AWS APIs
-> S3 bucket
-> ingress route helper
-> GPU worker nodes
3. Linux Control-Plane Stack
- Docker Compose stack under /opt/desineuron-ops-control-plane
- PostgreSQL stores machine/session/job/audit state
- API and web share the same FastAPI app
- Worker refreshes markets, machines, and session costs
- systemd keeps the stack persistent after reboot
Primary Linux service:
desineuron-ops-control-plane.service
4. AWS Machine Profiles
Initial curated profiles:
- g6-xlarge
- g6-2xlarge
- g6-4xlarge
- g6-12xlarge
Each profile contains:
- instance type
- GPU label
- vCPU / RAM
- intended workloads
- launch config: AMI, subnet, SGs, key, role/profile, root volume
5. Market Data and Pricing Logic
The control plane collects:
- instance offerings by region
- on-demand pricing from AWS Pricing API
- latest spot price history from EC2
- runtime state of all visible machines
Estimated cost model v1:
- live instance price signal
- gp3 storage cost estimate
- public IPv4/EIP cost estimate
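The three components above can be combined into one hourly figure. The following is a minimal sketch, not the control plane's actual code: the class and function names are hypothetical, the 730-hour month is a common approximation, and the storage and IPv4 rates are inputs the caller must look up for the region.

```python
from dataclasses import dataclass

# Assumption: approximate hours in a billing month, used to convert
# a per-GB-month storage rate into an hourly rate.
HOURS_PER_MONTH = 730


@dataclass
class CostInputs:
    instance_price_per_hour: float  # live spot or on-demand price signal (USD)
    gp3_gb: int                     # root volume size in GB
    gp3_price_per_gb_month: float   # caller-supplied regional gp3 rate (USD)
    public_ipv4_per_hour: float     # caller-supplied public IPv4/EIP rate (USD)


def estimate_hourly_cost(c: CostInputs) -> dict:
    """Return the per-hour estimate, split into the three v1 components."""
    storage = c.gp3_gb * c.gp3_price_per_gb_month / HOURS_PER_MONTH
    total = c.instance_price_per_hour + storage + c.public_ipv4_per_hour
    return {
        "compute": c.instance_price_per_hour,
        "storage": storage,
        "public_ipv4": c.public_ipv4_per_hour,
        "total": total,
    }
```

Keeping the components separate in the result makes it easy to show the breakdown in the UI or carry it into an export.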
6. S3 Asset Model and Bucket Structure
Canonical bucket prefixes:
- models/
- workflows/
- references/
- outputs/
- manifests/
- bootstrap/
Models are defined in the model_catalog table and hydrated to AWS NVMe on demand.
7. Model Hydration Lifecycle
- Operator selects machine and model
- Worker ensures s5cmd exists on target
- Assets copy from S3 to /opt/dlami/nvme/models/...
- Operation result is logged
- Cache state is stored per machine
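The copy step can be pictured as building one s5cmd invocation per model. This is a sketch under assumptions: the function names are hypothetical, the destination directory is passed in by the caller because the exact NVMe layout is not spelled out here, and how the worker runs the command on the target (for example over SSH) is left out.

```python
import shutil

# Canonical bucket named earlier in this document.
BUCKET = "desineuron-ops-control-plane-819079556187-us-east-1"


def s5cmd_available() -> bool:
    """True if an s5cmd binary is on PATH of the machine running this check."""
    return shutil.which("s5cmd") is not None


def build_hydration_command(model_key: str, dest_dir: str) -> list[str]:
    """Build an argv list that copies models/<model-key>/ from S3 to dest_dir.

    s5cmd expands the trailing wildcard itself, so the list can be executed
    without a shell.
    """
    source = f"s3://{BUCKET}/models/{model_key}/*"
    dest = dest_dir if dest_dir.endswith("/") else dest_dir + "/"
    return ["s5cmd", "cp", source, dest]
```

Returning an argv list rather than a shell string avoids quoting problems if a model key ever contains unusual characters.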
Hydration verification:
- if a manifest exists at manifests/models/<model-key>.json, the control plane verifies the expected files are present on the GPU node after copy
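A post-copy check of this kind might look like the sketch below. The manifest schema (a "files" list whose entries carry "path" and an optional "size") is an assumption made for illustration, as is the function name; the real manifest format may differ.

```python
from pathlib import Path


def missing_files(manifest: dict, model_dir: Path) -> list[str]:
    """Return manifest paths that are absent (or the wrong size) under model_dir.

    Assumed schema: {"files": [{"path": "relative/path", "size": 123}, ...]}
    """
    problems = []
    for entry in manifest.get("files", []):
        target = model_dir / entry["path"]
        if not target.is_file():
            problems.append(entry["path"])
            continue
        expected_size = entry.get("size")
        # The size comparison is an extra check beyond plain presence.
        if expected_size is not None and target.stat().st_size != expected_size:
            problems.append(entry["path"])
    return problems
```

An empty result means every expected file was found; a non-empty list can be logged with the operation result so the operator sees exactly what is missing.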
8. Model Ingest From Linux to S3
The control plane can now ingest a real model directory from the Linux box into S3 without manual bucket prep.
Source of truth on Linux:
/mnt/ServerStorage/ai-models/models
Container mount inside the control plane:
/model-library
Operator flow:
- enter model key
- enter human label
- enter source path relative to the Linux model library root
- optionally set workload and compatibility tags
- submit Upload to S3 + Generate Manifest
Result:
- every file is uploaded under models/<model-key>/
- manifest JSON is written to manifests/models/<model-key>.json
- the model catalog entry is upserted in PostgreSQL
- future hydrations can use that manifest for verification
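The manifest step can be sketched as a directory walk that records each file's S3 key, size, and checksum. The field names and the use of SHA-256 are assumptions for illustration; only the models/<model-key>/ and manifests/models/<model-key>.json key layout comes from this document.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large model weights are not read into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_key(model_key: str) -> str:
    """S3 key where the manifest for a model is written."""
    return f"manifests/models/{model_key}.json"


def build_manifest(model_key: str, source_dir: Path) -> dict:
    """Describe every file under source_dir (hypothetical schema)."""
    files = []
    for path in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = path.relative_to(source_dir).as_posix()
        files.append({
            "path": rel,
            "s3_key": f"models/{model_key}/{rel}",
            "size": path.stat().st_size,
            "sha256": sha256_of(path),
        })
    return {"model_key": model_key, "files": files}
```

Sorting the paths keeps the manifest stable between runs, which makes diffs between two ingests of the same model meaningful.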
9. Route Management Through the t4g Ingress
The ingress remains the stable public edge.
Managed route flow:
- control plane writes hostname mapping through manage_desineuron_routes.py
- helper renders managed Caddy snippets
- Caddy reloads
- route becomes live
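As a rough illustration of the rendering step, the sketch below turns a hostname-to-backend mapping into Caddyfile site blocks that use the reverse_proxy directive. It is not the code of manage_desineuron_routes.py; the function name, the mapping shape, and the example hostname are assumptions.

```python
def render_caddy_snippet(routes: dict[str, str]) -> str:
    """Render one Caddyfile site block per hostname.

    routes maps a public hostname to an upstream such as "10.0.0.5:8188".
    """
    blocks = []
    for hostname in sorted(routes):
        backend = routes[hostname]
        blocks.append(f"{hostname} {{\n\treverse_proxy {backend}\n}}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    # Hypothetical route used only to show the output shape.
    print(render_caddy_snippet({"comfy.example.com": "10.0.0.5:8188"}))
```

Rendering from the full mapping each time, rather than appending to an existing file, keeps upserts and deletes idempotent.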
Static Linux-origin routes still flow through the existing tunnel/nginx path.
10. Daily Operations Guide
- open ops.desineuron.in
- log in with internal ops credentials
- review markets and costs
- launch the required GPU profile
- hydrate the model
- start the workload
- map the route if needed
- export session CSVs for accounting or review
11. Launching a GPU Box
Use the Launch form in the GUI:
- choose profile
- choose spot or on-demand
- submit
Result:
- instance launches with Desineuron tags
- session row is created
- runtime and cost begin tracking
- if spot capacity is unavailable, the UI records a failed launch job and shows an operator-facing error instead of crashing
- the launcher automatically tries sibling subnets in the same VPC instead of hard-failing on one overloaded AZ
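The sibling-subnet behavior can be sketched as a loop that moves on to the next subnet when a launch fails for capacity reasons. To keep the example self-contained, the actual EC2 call is injected as a callable and the capacity failure is modeled by a hypothetical exception class; the real launcher presumably inspects the AWS error code instead.

```python
from typing import Callable, Sequence


class CapacityError(Exception):
    """Stand-in for an EC2 InsufficientInstanceCapacity failure."""


def launch_with_subnet_fallback(
    subnets: Sequence[str],
    launch: Callable[[str], str],
) -> tuple[str, str]:
    """Try each subnet in order; return (instance_id, subnet_id) on first success.

    Raises CapacityError listing every attempt if all subnets fail, so the
    caller can record a failed launch job with a readable message.
    """
    errors = []
    for subnet_id in subnets:
        try:
            return launch(subnet_id), subnet_id
        except CapacityError as exc:
            errors.append(f"{subnet_id}: {exc}")
    raise CapacityError("no capacity in any subnet: " + "; ".join(errors))
```

Only capacity errors trigger the fallback in this sketch; any other exception propagates immediately, since retrying in another subnet would not fix a bad AMI or a permissions problem.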
12. Hydrating a Model
Use the Hydrate form:
- choose machine
- choose model
Hydration copies from S3 to the instance NVMe path.
13. Starting ComfyUI or Another Workload
Use the workload form:
- choose machine
- choose workload
Current v1 workload profile:
comfyui
14. Tracking Session Time and Cost
Each machine session is tracked in the DB and can be exported to CSV.
Cost components:
- compute
- storage
- public IPv4
v1 note:
- machine cost is estimate-based
- instance pricing comes from AWS live data where available
- storage and public IPv4 are blended in as estimated infrastructure cost
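A session's estimate follows from its elapsed time and the hourly rate of each component. The sketch below assumes timezone-aware timestamps and a caller-supplied dict of hourly rates; the function names and the rate keys are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def session_hours(started_at: datetime,
                  ended_at: Optional[datetime] = None,
                  now: Optional[datetime] = None) -> float:
    """Elapsed hours; a still-running session is measured up to `now`."""
    end = ended_at or now or datetime.now(timezone.utc)
    return max((end - started_at).total_seconds(), 0.0) / 3600.0


def estimated_session_cost(started_at: datetime,
                           hourly_rates: dict,
                           ended_at: Optional[datetime] = None,
                           now: Optional[datetime] = None) -> dict:
    """Multiply each hourly component (compute, storage, public IPv4) by hours."""
    hours = session_hours(started_at, ended_at, now)
    costs = {name: rate * hours for name, rate in hourly_rates.items()}
    return {"hours": hours, **costs, "total": sum(costs.values())}
```

Because the result is derived from timestamps each time it is computed, a worker can refresh a running session's cost without storing intermediate state.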
15. CSV Exports and Reporting
CSV export path, relative to the stack directory:
exports/sessions_latest.csv
Use this for:
- session duration review
- estimated expenditure review
- internal ops accounting
Current export path:
/opt/desineuron-ops-control-plane/exports/sessions_latest.csv
The export is also logged in the database as a csv_exports record.
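Writing the export itself is a small job for the standard csv module. The column set below is an assumption for illustration, not the actual schema of sessions_latest.csv, and the function name is hypothetical.

```python
import csv
from pathlib import Path

# Hypothetical column set; the real export may carry different fields.
FIELDS = [
    "session_id",
    "instance_type",
    "started_at",
    "ended_at",
    "hours",
    "estimated_cost_usd",
]


def write_sessions_csv(rows: list[dict], dest: Path) -> int:
    """Write session rows to dest with a header row; return the row count."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # newline="" lets the csv module control line endings itself.
    with dest.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

The returned row count is a convenient value to store alongside the csv_exports record.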
16. Failure Recovery Runbooks
If the worker stops:
- restart the systemd unit on Linux
If a GPU node is unhealthy:
- inspect machine state
- inspect workload status
- stop or terminate the node
- relaunch from a clean profile
If route mapping fails:
- inspect the ingress helper
- inspect Caddy reload status
- verify the ops container has SSH access to the ingress node
If redeploy breaks PostgreSQL permissions:
- verify /opt/desineuron-ops-control-plane/data/postgres is owned by UID/GID 999:999
- restart desineuron-ops-control-plane.service
- never sync runtime directories from repo into the live stack
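The ownership check can be scripted so it is easy to run after a redeploy. This is a minimal sketch with a hypothetical function name; it only inspects the top-level directory, not every file beneath it.

```python
import os

# Path and expected owner taken from the runbook step above.
POSTGRES_DATA_DIR = "/opt/desineuron-ops-control-plane/data/postgres"


def owned_by(path: str, uid: int = 999, gid: int = 999) -> bool:
    """True if path is owned by the given UID and GID."""
    st = os.stat(path)
    return st.st_uid == uid and st.st_gid == gid


if __name__ == "__main__":
    if owned_by(POSTGRES_DATA_DIR):
        print("ownership OK")
    else:
        print("ownership mismatch: expected 999:999")
```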
17. Security Model and Access Control
- app is intended to be private
- secrets stay on Linux, not in repo
- actions are audited
- AWS workers expose only minimal required ports
- operator accounts can be provisioned as email-style usernames for team access
Current protected secrets:
- /opt/desineuron-ops-control-plane/.env
- /opt/desineuron-ops-control-plane/state/desineuron-l4-node.pem
18. Adding a New Model
Preferred method:
- place the model directory under /mnt/ServerStorage/ai-models/models
- use the Model Library Ingest form in the ops console
- let the control plane upload the files, create the manifest, and upsert the catalog entry
Fallback manual method:
- upload to S3 canonical bucket
- add catalog entry
- define expected prefix and optional manifest/checksum
19. Adding a New Instance Profile
- add curated profile definition
- set launch config
- verify market visibility
- test launch
20. Adding a New Route or Service
- define hostname
- define target backend
- add route through GUI or helper
- reload ingress
- validate health
If the route is for a new public hostname:
- create the Cloudflare DNS record pointing to 98.87.120.120
- keep the record in DNS only mode
- validate TLS issuance on first public request
21. Backup and Restore
Persist:
- Postgres data
- .env
- exported CSVs
- state directory
- route helper state on ingress
Restore by:
- recreating the compose stack
- restoring DB data
- restoring config/env
- validating machine, model, and route state
22. Validated Live Behaviors
As of the latest implementation pass, the following were validated against the live environment:
- ops.desineuron.in login and dashboard render correctly
- /api/markets/instances, /api/markets/pricing, /api/sessions, /api/costs, and /api/exports/csv return live data
- a g6.xlarge on-demand launch was executed through the control plane and then terminated through the same surface
- a g6.xlarge spot launch failure was handled cleanly and recorded as InsufficientInstanceCapacity
- managed ingress route upsert/delete was executed successfully through the route helper
- session and audit data now persist because API DB writes are committed per request
- a model ingest smoke test uploaded ops-smoke-model from the Linux model library into S3 and generated a manifest
23. Operator Retrieval Commands
Retrieve the admin password on Linux:
sudo sed -n 's/^OPS_ADMIN_PASSWORD=//p' /opt/desineuron-ops-control-plane/.env
Check stack health:
sudo systemctl status desineuron-ops-control-plane.service
sudo docker compose -f /opt/desineuron-ops-control-plane/docker-compose.yml ps
Inspect recent API and worker logs:
sudo docker logs --tail 100 desineuron-ops-api
sudo docker logs --tail 100 desineuron-ops-worker
Inspect exports:
ls -lah /opt/desineuron-ops-control-plane/exports