sagnik/Project_Velocity

Sagnik 075ab280ad Built the Sentinel Tab

2026-04-12 02:02:58 +05:30

Desineuron Ops Control Plane Bible

Chapter Index

1. Purpose and Operating Model
2. Architecture Map
3. Linux Control-Plane Stack
4. AWS Machine Profiles
5. Market Data and Pricing Logic
6. S3 Asset Model and Bucket Structure
7. Model Hydration Lifecycle
8. Model Ingest From Linux to S3
9. Route Management Through the t4g Ingress
10. Daily Operations Guide
11. Launching a GPU Box
12. Hydrating a Model
13. Starting ComfyUI or Another Workload
14. Tracking Session Time and Cost
15. CSV Exports and Reporting
16. Failure Recovery Runbooks
17. Security Model and Access Control
18. Adding a New Model
19. Adding a New Instance Profile
20. Adding a New Route or Service
21. Backup and Restore
22. Validated Live Behaviors
23. Operator Retrieval Commands

1. Purpose and Operating Model

The Desineuron Ops Control Plane is the persistent Linux-hosted operator surface for AWS infrastructure. It centralizes machine launch, model hydration, workload control, cost estimation, route management, and audit history so the team no longer depends on ad hoc Windows terminals or fragile one-off SSH sessions.

Core planes:

Linux box: control plane
S3: canonical asset plane
AWS GPU nodes: ephemeral compute plane
t4g.micro ingress: stable public edge

Current live endpoint:

https://ops.desineuron.in/login

Current canonical S3 bucket:

desineuron-ops-control-plane-819079556187-us-east-1

2. Architecture Map

Team
  -> ops.desineuron.in
  -> Linux control plane
     -> ops-web
     -> ops-api
     -> ops-worker
     -> ops-db
  -> AWS APIs
  -> S3 bucket
  -> ingress route helper
  -> GPU worker nodes

3. Linux Control-Plane Stack

Docker Compose stack under /opt/desineuron-ops-control-plane
PostgreSQL stores machine/session/job/audit state
API and web share the same FastAPI app
Worker refreshes markets, machines, and session costs
systemd keeps the stack persistent after reboot

Primary Linux service:

desineuron-ops-control-plane.service

4. AWS Machine Profiles

Initial curated profiles:

g6-xlarge
g6-2xlarge
g6-4xlarge
g6-12xlarge

Each profile contains:

instance type
GPU label
vCPU / RAM
intended workloads
launch config: AMI, subnet, SGs, key, role/profile, root volume
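
The profile fields listed above map naturally onto a small immutable record. The following is a minimal sketch only: the class names, field names, the mapping of the `g6-xlarge` profile key to the `g6.xlarge` instance type, and every placeholder value in the launch config are assumptions for illustration, not the control plane's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaunchConfig:
    """Hypothetical launch settings; all field names are illustrative."""
    ami_id: str
    subnet_id: str
    security_group_ids: tuple[str, ...]
    key_name: str
    instance_profile: str
    root_volume_gb: int


@dataclass(frozen=True)
class MachineProfile:
    """Hypothetical curated profile record."""
    key: str
    instance_type: str
    gpu_label: str
    vcpus: int
    ram_gib: int
    workloads: tuple[str, ...]
    launch: LaunchConfig


# Placeholder values only; the real AMI, subnet, and key names are not
# given in this document.
EXAMPLE_PROFILE = MachineProfile(
    key="g6-xlarge",
    instance_type="g6.xlarge",
    gpu_label="1x NVIDIA L4",
    vcpus=4,
    ram_gib=16,
    workloads=("comfyui",),
    launch=LaunchConfig(
        ami_id="ami-EXAMPLE",
        subnet_id="subnet-EXAMPLE",
        security_group_ids=("sg-EXAMPLE",),
        key_name="example-key",
        instance_profile="example-instance-profile",
        root_volume_gb=200,
    ),
)
```

Freezing the dataclasses keeps a curated profile from being mutated between market refresh and launch.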

5. Market Data and Pricing Logic

The control plane collects:

instance offerings by region
on-demand pricing from AWS Pricing API
latest spot price history from EC2
runtime state of all visible machines
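
As an illustration of the "latest spot price" signal, the sketch below reduces spot price history records to the most recent price per Availability Zone. It assumes records shaped like the entries EC2 returns for spot price history (`AvailabilityZone`, `SpotPrice` as a string, `Timestamp` as a datetime); the function name is hypothetical and the worker's real logic may differ.

```python
from datetime import datetime, timezone


def latest_spot_prices(history: list[dict]) -> dict[str, float]:
    """Return the most recent spot price per Availability Zone."""
    latest: dict[str, dict] = {}
    for record in history:
        zone = record["AvailabilityZone"]
        # Keep the record with the newest timestamp for each zone.
        if zone not in latest or record["Timestamp"] > latest[zone]["Timestamp"]:
            latest[zone] = record
    # SpotPrice arrives as a string, so convert it for arithmetic.
    return {zone: float(rec["SpotPrice"]) for zone, rec in latest.items()}


if __name__ == "__main__":
    sample = [
        {"AvailabilityZone": "us-east-1a", "SpotPrice": "0.30",
         "Timestamp": datetime(2026, 1, 1, tzinfo=timezone.utc)},
        {"AvailabilityZone": "us-east-1a", "SpotPrice": "0.35",
         "Timestamp": datetime(2026, 1, 2, tzinfo=timezone.utc)},
    ]
    print(latest_spot_prices(sample))
```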

Estimated cost model v1:

live instance price signal
gp3 storage cost estimate
public IPv4/EIP cost estimate
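
The v1 model can be read as a sum of three hourly rates. A minimal sketch, assuming a hypothetical function name and treating the default gp3 rate, the public IPv4 rate, and the 730-hour month as illustrative placeholders rather than values taken from the control plane:

```python
HOURS_PER_MONTH = 730  # assumption: average month used to convert monthly rates


def estimate_hourly_cost(
    instance_price_per_hour: float,
    root_volume_gb: int,
    gp3_price_per_gb_month: float = 0.08,  # assumption: placeholder rate
    public_ipv4_per_hour: float = 0.005,   # assumption: placeholder rate
) -> dict:
    """Return the three v1 cost components and their total in USD per hour."""
    # gp3 is billed per GB-month, so spread it across the hours in a month.
    storage = root_volume_gb * gp3_price_per_gb_month / HOURS_PER_MONTH
    total = instance_price_per_hour + storage + public_ipv4_per_hour
    return {
        "compute": instance_price_per_hour,
        "storage": round(storage, 6),
        "public_ipv4": public_ipv4_per_hour,
        "total": round(total, 6),
    }
```

The instance price argument would be whichever live signal applies to the launch: the on-demand price or the latest spot price.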

6. S3 Asset Model and Bucket Structure

Canonical bucket prefixes:

models/
workflows/
references/
outputs/
manifests/
bootstrap/

Models are defined in the model_catalog table and hydrated to AWS NVMe on demand.

7. Model Hydration Lifecycle

Operator selects machine and model
Worker ensures s5cmd exists on target
Assets copy from S3 to /opt/dlami/nvme/models/...
Operation result is logged
Cache state is stored per machine
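
The copy step can be pictured as a single `s5cmd cp` invocation per model. The sketch below only builds the argument list; how the worker actually runs it on the target node is not described here. The function name and the per-model subdirectory under `/opt/dlami/nvme/models` are assumptions, since the source gives only the path prefix.

```python
BUCKET = "desineuron-ops-control-plane-819079556187-us-east-1"
NVME_MODEL_ROOT = "/opt/dlami/nvme/models"


def build_hydration_command(model_key: str, bucket: str = BUCKET) -> list[str]:
    """Build an s5cmd argv that copies one model prefix to local NVMe."""
    if not model_key or "/" in model_key or ".." in model_key:
        raise ValueError(f"unsafe model key: {model_key!r}")
    # The trailing wildcard asks s5cmd to copy every object under the prefix.
    source = f"s3://{bucket}/models/{model_key}/*"
    # Assumption: each model lands in its own subdirectory on the node.
    destination = f"{NVME_MODEL_ROOT}/{model_key}/"
    return ["s5cmd", "cp", source, destination]


if __name__ == "__main__":
    print(" ".join(build_hydration_command("ops-smoke-model")))
```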

Hydration verification:

if a manifest exists at manifests/models/<model-key>.json, the control plane verifies the expected files are present on the GPU node after copy
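
A presence check of this kind could look like the following sketch. The manifest schema is not documented here, so the `files` list with `path` and `size` keys is an assumption, as is the function name.

```python
from pathlib import Path


def find_missing_files(manifest: dict, model_dir: Path) -> list[str]:
    """Return manifest paths that are absent or have an unexpected size."""
    missing = []
    for entry in manifest["files"]:  # assumption: hypothetical schema
        target = model_dir / entry["path"]
        # A file counts as present only if it exists and its size matches.
        if not target.is_file() or target.stat().st_size != entry["size"]:
            missing.append(entry["path"])
    return missing
```

An empty result means every expected file is in place; a non-empty result names exactly what needs to be copied again.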

8. Model Ingest From Linux to S3

The control plane can ingest a model directory from the Linux box directly into S3, with no manual bucket preparation.

Source of truth on Linux:

/mnt/ServerStorage/ai-models/models

Container mount inside the control plane:

/model-library

Operator flow:

enter model key
enter human label
enter source path relative to the Linux model library root
optionally set workload and compatibility tags
submit Upload to S3 + Generate Manifest

Result:

every file is uploaded under models/<model-key>/
manifest JSON is written to manifests/models/<model-key>.json
the model catalog entry is upserted in PostgreSQL
future hydrations can use that manifest for verification
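
The manifest step can be sketched as a directory walk that records each file's relative path, target S3 key, size, and checksum. The key layout follows the prefixes stated above; the manifest field names, the use of SHA-256, and the function name are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def build_manifest(model_key: str, source_dir: Path) -> dict:
    """Describe every file under source_dir as it would appear in S3."""
    files = []
    for path in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = path.relative_to(source_dir).as_posix()
        digest = hashlib.sha256()
        with path.open("rb") as handle:
            # Hash in 1 MiB chunks so large model files are not loaded whole.
            for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                digest.update(chunk)
        files.append({
            "path": rel,
            "s3_key": f"models/{model_key}/{rel}",
            "size": path.stat().st_size,
            "sha256": digest.hexdigest(),  # assumption: checksum choice
        })
    return {
        "model_key": model_key,
        "manifest_key": f"manifests/models/{model_key}.json",
        "files": files,
    }
```

Serializing the result with `json.dumps` would give the body written to the manifest key.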

9. Route Management Through the t4g Ingress

The ingress remains the stable public edge.

Managed route flow:

control plane writes hostname mapping through manage_desineuron_routes.py
helper renders managed Caddy snippets
Caddy reloads
route becomes live
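
To make the rendering step concrete, the sketch below turns a hostname-to-backend mapping into Caddyfile site blocks that use `reverse_proxy`. It is not the code in manage_desineuron_routes.py; the function name, the mapping shape, and the example hostname and backend are assumptions.

```python
def render_caddy_snippet(routes: dict[str, str]) -> str:
    """Render one Caddyfile site block per hostname -> backend mapping."""
    blocks = []
    # Sort hostnames so the output is stable across runs, which keeps
    # reloads and diffs quiet when nothing has changed.
    for hostname in sorted(routes):
        backend = routes[hostname]
        blocks.append(f"{hostname} {{\n    reverse_proxy {backend}\n}}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    # Hypothetical hostname and backend address.
    print(render_caddy_snippet({"comfy.example.com": "10.0.0.5:8188"}))
```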

Static Linux-origin routes still flow through the existing tunnel/nginx path.

10. Daily Operations Guide

open ops.desineuron.in
log in with internal ops credentials
review markets and costs
launch the required GPU profile
hydrate the model
start the workload
map the route if needed
export session CSVs for accounting or review

11. Launching a GPU Box

Use the Launch form in the GUI:

choose profile
choose spot or on-demand
submit

Result:

instance launches with Desineuron tags
session row is created
runtime and cost begin tracking
if spot capacity is unavailable, the UI records a failed launch job and shows an operator-facing error instead of crashing
the launcher automatically tries sibling subnets in the same VPC instead of hard-failing on one overloaded AZ
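
The sibling-subnet behaviour amounts to a try-next-on-capacity-error loop. The sketch below shows that control flow with an injected `launch` callable so it runs without AWS access; `CapacityError` is a stand-in for an EC2 InsufficientInstanceCapacity failure, and all names are hypothetical.

```python
from typing import Callable, Sequence


class CapacityError(Exception):
    """Stand-in for an EC2 InsufficientInstanceCapacity failure."""


def launch_with_subnet_fallback(
    subnet_ids: Sequence[str],
    launch: Callable[[str], str],
) -> tuple[str, list[str]]:
    """Try each subnet in order; return (instance_id, subnets_that_failed)."""
    failed: list[str] = []
    for subnet_id in subnet_ids:
        try:
            return launch(subnet_id), failed
        except CapacityError:
            # Only capacity errors trigger the fallback; anything else
            # propagates so real misconfiguration is not hidden.
            failed.append(subnet_id)
    raise CapacityError(f"no capacity in any subnet: {', '.join(failed)}")
```

Returning the list of failed subnets lets the caller record which zones were skipped alongside the launch job.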

12. Hydrating a Model

Use the Hydrate form:

choose machine
choose model

Hydration copies from S3 to the instance NVMe path.

13. Starting ComfyUI or Another Workload

Use the workload form:

choose machine
choose workload

Current v1 workload profile:

comfyui

14. Tracking Session Time and Cost

Each machine session is tracked in the DB and can be exported to CSV.

Cost components:

compute
storage
public IPv4

v1 note:

machine cost is estimate-based
instance pricing comes from AWS live data where available
storage and public IPv4 are blended in as estimated infrastructure cost
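
Because the figure is an estimate, a session's cost is simply its runtime multiplied by a blended hourly rate. A minimal sketch, assuming a hypothetical function name and that a session still running is measured up to the current time:

```python
from datetime import datetime, timezone
from typing import Optional


def estimate_session_cost(
    started_at: datetime,
    ended_at: Optional[datetime],
    hourly_rate_usd: float,
    now: Optional[datetime] = None,
) -> dict:
    """Estimate runtime hours and cost for a finished or running session."""
    # A running session has no end time yet, so measure up to "now".
    end = ended_at or now or datetime.now(timezone.utc)
    runtime_hours = max((end - started_at).total_seconds(), 0.0) / 3600
    return {
        "runtime_hours": round(runtime_hours, 4),
        "estimated_cost_usd": round(runtime_hours * hourly_rate_usd, 4),
    }
```

The hourly rate passed in would already blend compute, storage, and public IPv4.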

15. CSV Exports and Reporting

CSV export path (relative to the stack root):

exports/sessions_latest.csv

Use this for:

session duration review
estimated expenditure review
internal ops accounting

Absolute export path on the Linux box:

/opt/desineuron-ops-control-plane/exports/sessions_latest.csv

The export is also logged in the database as a csv_exports record.
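
For readers who want a feel for the export format, the sketch below serializes session rows with the standard `csv` module. The column names are assumptions; the actual columns in sessions_latest.csv are not listed in this document.

```python
import csv
import io

# Assumption: hypothetical column set for illustration only.
FIELDS = [
    "session_id",
    "instance_type",
    "started_at",
    "ended_at",
    "runtime_hours",
    "estimated_cost_usd",
]


def sessions_to_csv(sessions: list[dict]) -> str:
    """Serialize session rows to CSV text with a header line."""
    buffer = io.StringIO()
    # extrasaction="ignore" drops keys that are not export columns.
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in sessions:
        writer.writerow(row)
    return buffer.getvalue()
```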

16. Failure Recovery Runbooks

If the worker stops:

restart the systemd unit on Linux

If a GPU node is unhealthy:

inspect machine state
inspect workload status
stop or terminate the node
relaunch from a clean profile

If route mapping fails:

inspect the ingress helper
inspect Caddy reload status
verify the ops container has SSH access to the ingress node

If redeploy breaks PostgreSQL permissions:

verify /opt/desineuron-ops-control-plane/data/postgres is owned by UID/GID 999:999
restart desineuron-ops-control-plane.service
never sync runtime directories from repo into the live stack
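
The ownership check in the first step can be scripted. A minimal sketch using only the standard library; the function name is hypothetical, and the defaults mirror the 999:999 owner stated above.

```python
import os


def ownership_matches(path: str, uid: int = 999, gid: int = 999) -> bool:
    """Return True if path is owned by the expected UID and GID."""
    info = os.stat(path)
    return info.st_uid == uid and info.st_gid == gid


if __name__ == "__main__":
    data_dir = "/opt/desineuron-ops-control-plane/data/postgres"
    if os.path.exists(data_dir):
        print(data_dir, "ownership ok:", ownership_matches(data_dir))
    else:
        print(data_dir, "not found on this host")
```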

17. Security Model and Access Control

app is intended to be private
secrets stay on Linux, not in repo
actions are audited
AWS workers expose only minimal required ports
operator accounts can be provisioned as email-style usernames for team access

Current protected secrets:

/opt/desineuron-ops-control-plane/.env
/opt/desineuron-ops-control-plane/state/desineuron-l4-node.pem

18. Adding a New Model

Preferred method:

place the model directory under /mnt/ServerStorage/ai-models/models
use the Model Library Ingest form in the ops console
let the control plane upload the files, create the manifest, and upsert the catalog entry

Fallback manual method:

upload to S3 canonical bucket
add catalog entry
define expected prefix and optional manifest/checksum

19. Adding a New Instance Profile

add curated profile definition
set launch config
verify market visibility
test launch

20. Adding a New Route or Service

define hostname
define target backend
add route through GUI or helper
reload ingress
validate health

If the route is for a new public hostname:

create the Cloudflare DNS record pointing to 98.87.120.120
keep the record in DNS only mode
validate TLS issuance on first public request

21. Backup and Restore

Persist:

Postgres data
.env
exported CSVs
state directory
route helper state on ingress

Restore by:

recreating the compose stack
restoring DB data
restoring config/env
validating machine, model, and route state

22. Validated Live Behaviors

As of the latest implementation pass, the following were validated against the live environment:

ops.desineuron.in login and dashboard render correctly
/api/markets/instances, /api/markets/pricing, /api/sessions, /api/costs, and /api/exports/csv return live data
a g6.xlarge on-demand launch was executed through the control plane and then terminated through the same surface
a g6.xlarge spot launch failure was handled cleanly and recorded as InsufficientInstanceCapacity
managed ingress route upsert/delete was executed successfully through the route helper
session and audit data now persist because API DB writes are committed per request
a model ingest smoke test uploaded ops-smoke-model from the Linux model library into S3 and generated a manifest

23. Operator Retrieval Commands

Retrieve the admin password on Linux:

sudo sed -n 's/^OPS_ADMIN_PASSWORD=//p' /opt/desineuron-ops-control-plane/.env

Check stack health:

sudo systemctl status desineuron-ops-control-plane.service
sudo docker compose -f /opt/desineuron-ops-control-plane/docker-compose.yml ps

Inspect recent API and worker logs:

sudo docker logs --tail 100 desineuron-ops-api
sudo docker logs --tail 100 desineuron-ops-worker

Inspect exports:

ls -lah /opt/desineuron-ops-control-plane/exports
