Built the Sentinel Tab

Sagnik
2026-04-12 02:02:58 +05:30
parent fb656d1443
commit 075ab280ad
526 changed files with 17646 additions and 70931 deletions


@@ -0,0 +1,382 @@
# Desineuron Ops Control Plane Bible
## Chapter Index
1. Purpose and Operating Model
2. Architecture Map
3. Linux Control-Plane Stack
4. AWS Machine Profiles
5. Market Data and Pricing Logic
6. S3 Asset Model and Bucket Structure
7. Model Hydration Lifecycle
8. Model Ingest From Linux to S3
9. Route Management Through the t4g Ingress
10. Daily Operations Guide
11. Launching a GPU Box
12. Hydrating a Model
13. Starting ComfyUI or Another Workload
14. Tracking Session Time and Cost
15. CSV Exports and Reporting
16. Failure Recovery Runbooks
17. Security Model and Access Control
18. Adding a New Model
19. Adding a New Instance Profile
20. Adding a New Route or Service
21. Backup and Restore
22. Validated Live Behaviors
23. Operator Retrieval Commands
## 1. Purpose and Operating Model
The Desineuron Ops Control Plane is the persistent Linux-hosted operator surface for AWS infrastructure. It centralizes machine launch, model hydration, workload control, cost estimation, route management, and audit history so the team no longer depends on ad hoc Windows terminals or fragile one-off SSH sessions.
Core planes:
- Linux box: control plane
- S3: canonical asset plane
- AWS GPU nodes: ephemeral compute plane
- `t4g.micro` ingress: stable public edge
Current live endpoint:
- `https://ops.desineuron.in/login`
Current canonical S3 bucket:
- `desineuron-ops-control-plane-819079556187-us-east-1`
## 2. Architecture Map
```text
Team
-> ops.desineuron.in
-> Linux control plane
-> ops-web
-> ops-api
-> ops-worker
-> ops-db
-> AWS APIs
-> S3 bucket
-> ingress route helper
-> GPU worker nodes
```
## 3. Linux Control-Plane Stack
- Docker Compose stack under `/opt/desineuron-ops-control-plane`
- PostgreSQL stores machine/session/job/audit state
- API and web share the same FastAPI app
- Worker refreshes markets, machines, and session costs
- systemd keeps the stack persistent after reboot
Primary Linux service:
- `desineuron-ops-control-plane.service`
## 4. AWS Machine Profiles
Initial curated profiles:
- `g6-xlarge`
- `g6-2xlarge`
- `g6-4xlarge`
- `g6-12xlarge`
Each profile contains:
- instance type
- GPU label
- vCPU / RAM
- intended workloads
- launch config: AMI, subnet, SGs, key, role/profile, root volume
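One way to picture the profile fields listed above is as a pair of small records. This is only an illustrative sketch: the class names, field names, the `profile_key_to_instance_type` helper, and every launch value are hypothetical placeholders. The mapping from a profile key such as `g6-xlarge` to the EC2 type `g6.xlarge` is inferred from the naming, not confirmed by the implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaunchConfig:
    """Launch settings for one profile (all example values are placeholders)."""
    ami_id: str
    subnet_id: str
    security_group_ids: tuple[str, ...]
    key_name: str
    instance_profile: str
    root_volume_gb: int


@dataclass(frozen=True)
class MachineProfile:
    key: str                    # curated profile name, e.g. "g6-xlarge"
    instance_type: str          # EC2 instance type, e.g. "g6.xlarge"
    gpu_label: str
    vcpu: int
    ram_gib: int
    workloads: tuple[str, ...]  # intended workloads
    launch: LaunchConfig


def profile_key_to_instance_type(key: str) -> str:
    """Assumed convention: "g6-12xlarge" -> "g6.12xlarge"."""
    family, size = key.split("-", 1)
    return f"{family}.{size}"


# Example profile; the vCPU/RAM/GPU numbers are the published g6.xlarge shape,
# while every identifier in LaunchConfig is a made-up placeholder.
G6_XLARGE = MachineProfile(
    key="g6-xlarge",
    instance_type=profile_key_to_instance_type("g6-xlarge"),
    gpu_label="1x NVIDIA L4",
    vcpu=4,
    ram_gib=16,
    workloads=("comfyui",),
    launch=LaunchConfig(
        ami_id="ami-PLACEHOLDER",
        subnet_id="subnet-PLACEHOLDER",
        security_group_ids=("sg-PLACEHOLDER",),
        key_name="PLACEHOLDER-KEY",
        instance_profile="PLACEHOLDER-PROFILE",
        root_volume_gb=100,
    ),
)
```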
## 5. Market Data and Pricing Logic
The control plane collects:
- instance offerings by region
- on-demand pricing from AWS Pricing API
- latest spot price history from EC2
- runtime state of all visible machines
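As a rough illustration of the spot-price step, the sketch below reduces a list of spot price records to the latest price per availability zone. It assumes records shaped like the `SpotPriceHistory` entries returned by EC2 `DescribeSpotPriceHistory` (where `SpotPrice` is a string). The function name and the sample prices are made up for the example and do not describe the worker's actual code.

```python
from datetime import datetime, timezone


def latest_spot_prices(history: list[dict]) -> dict[str, float]:
    """Return the most recent spot price (USD/hour) per availability zone."""
    latest: dict[str, dict] = {}
    for record in history:
        az = record["AvailabilityZone"]
        # Keep only the newest record seen so far for each zone.
        if az not in latest or record["Timestamp"] > latest[az]["Timestamp"]:
            latest[az] = record
    return {az: float(rec["SpotPrice"]) for az, rec in latest.items()}


# Made-up sample data, not real market prices.
SAMPLE_HISTORY = [
    {"AvailabilityZone": "us-east-1a", "InstanceType": "g6.xlarge",
     "SpotPrice": "0.3000",
     "Timestamp": datetime(2026, 1, 1, 10, 0, tzinfo=timezone.utc)},
    {"AvailabilityZone": "us-east-1a", "InstanceType": "g6.xlarge",
     "SpotPrice": "0.3100",
     "Timestamp": datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)},
    {"AvailabilityZone": "us-east-1b", "InstanceType": "g6.xlarge",
     "SpotPrice": "0.2900",
     "Timestamp": datetime(2026, 1, 1, 11, 0, tzinfo=timezone.utc)},
]

SAMPLE_LATEST = latest_spot_prices(SAMPLE_HISTORY)
```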
Estimated cost model v1:
- live instance price signal
- gp3 storage cost estimate
- public IPv4/EIP cost estimate
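The three components of the v1 estimate can be combined into a single hourly figure roughly as follows. This is a sketch under stated assumptions: the function name is hypothetical, the default gp3 and public IPv4 rates are assumed us-east-1 list prices that should be checked against current AWS pricing, and a 730-hour month is used to convert the monthly storage rate to an hourly one.

```python
HOURS_PER_MONTH = 730  # common approximation for converting monthly rates


def estimate_hourly_cost(
    instance_price_per_hour: float,
    root_volume_gb: int,
    gp3_price_per_gb_month: float = 0.08,  # assumed rate, verify before use
    public_ipv4_per_hour: float = 0.005,   # assumed rate, verify before use
) -> dict[str, float]:
    """Break an estimated hourly cost into compute, storage, and IPv4 parts."""
    storage = root_volume_gb * gp3_price_per_gb_month / HOURS_PER_MONTH
    total = instance_price_per_hour + storage + public_ipv4_per_hour
    return {
        "compute": instance_price_per_hour,
        "storage": storage,
        "public_ipv4": public_ipv4_per_hour,
        "total": total,
    }
```

The `instance_price_per_hour` argument stands in for whichever live price signal applies (spot or on-demand). The other two terms are fixed estimates, which is why the resulting figure is an estimate and not a billing number.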
## 6. S3 Asset Model and Bucket Structure
Canonical bucket prefixes:
- `models/`
- `workflows/`
- `references/`
- `outputs/`
- `manifests/`
- `bootstrap/`
Models are defined in the `model_catalog` table and hydrated on demand to local NVMe storage on the AWS GPU node.
## 7. Model Hydration Lifecycle
1. Operator selects machine and model
2. Worker ensures `s5cmd` exists on target
3. Assets copy from S3 to `/opt/dlami/nvme/models/...`
4. Operation result is logged
5. Cache state is stored per machine
Hydration verification:
- if a manifest exists at `manifests/models/<model-key>.json`, the control plane verifies the expected files are present on the GPU node after copy
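The copy and verification steps might look roughly like the sketch below. It assumes `s5cmd cp` with a wildcard source is used for the copy and that the manifest holds a `files` list whose entries carry a relative `path`. The manifest schema and both function names are assumptions for illustration only. The destination directory is passed in by the caller because the exact per-model NVMe path is not spelled out here.

```python
from pathlib import Path

BUCKET = "desineuron-ops-control-plane-819079556187-us-east-1"


def s5cmd_copy_command(model_key: str, dest_dir: str) -> list[str]:
    """Build an s5cmd invocation that copies one model prefix to dest_dir."""
    source = f"s3://{BUCKET}/models/{model_key}/*"
    # A trailing slash tells s5cmd to treat the destination as a directory.
    return ["s5cmd", "cp", source, dest_dir.rstrip("/") + "/"]


def missing_files(manifest: dict, model_dir: Path) -> list[str]:
    """List manifest paths that are not present under model_dir after the copy."""
    return [
        entry["path"]
        for entry in manifest["files"]  # assumed manifest layout
        if not (model_dir / entry["path"]).is_file()
    ]
```

An empty list from `missing_files` would correspond to a successful verification. Any other result names the files that did not arrive.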
## 8. Model Ingest From Linux to S3
The control plane can ingest a model directory from the Linux box into S3 without manual bucket preparation.
Source of truth on Linux:
- `/mnt/ServerStorage/ai-models/models`
Container mount inside the control plane:
- `/model-library`
Operator flow:
1. enter model key
2. enter human label
3. enter source path relative to the Linux model library root
4. optionally set workload and compatibility tags
5. submit `Upload to S3 + Generate Manifest`
Result:
- every file is uploaded under `models/<model-key>/`
- manifest JSON is written to `manifests/models/<model-key>.json`
- the model catalog entry is upserted in PostgreSQL
- future hydrations can use that manifest for verification
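To make the ingest result concrete, here is a minimal sketch of how a manifest could be derived from a local model directory. The manifest fields (`path`, `s3_key`, `size`, `sha256`) and the function names are assumptions and not the control plane's actual schema. Only the `models/<model-key>/` and `manifests/models/<model-key>.json` key layouts come from the description above. The upload itself is left out.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large model weights are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(model_key: str, source_dir: Path) -> dict:
    """Describe every file under source_dir and the S3 key it would upload to."""
    files = []
    for path in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        relative = path.relative_to(source_dir).as_posix()
        files.append({
            "path": relative,
            "s3_key": f"models/{model_key}/{relative}",
            "size": path.stat().st_size,
            "sha256": sha256_of(path),
        })
    return {"model_key": model_key, "files": files}


def manifest_key(model_key: str) -> str:
    """S3 key where the manifest JSON for this model is written."""
    return f"manifests/models/{model_key}.json"
```

Recording a size and checksum per file is one option. A presence-only manifest would also satisfy the verification described in the hydration lifecycle.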
## 9. Route Management Through the t4g Ingress
The ingress remains the stable public edge.
Managed route flow:
1. control plane writes hostname mapping through `manage_desineuron_routes.py`
2. helper renders managed Caddy snippets
3. Caddy reloads
4. route becomes live
Static Linux-origin routes still flow through the existing tunnel/nginx path.
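Steps 1 and 2 of the managed route flow amount to turning a hostname mapping into Caddy site blocks. The sketch below shows one plausible rendering using Caddy's `reverse_proxy` directive. The function name, the hostname, and the backend address are hypothetical, and the real output format of `manage_desineuron_routes.py` may differ.

```python
def render_caddy_snippet(routes: dict[str, str]) -> str:
    """Render one Caddy site block per hostname -> backend mapping."""
    blocks = []
    for hostname in sorted(routes):  # sorted for stable, diff-friendly output
        target = routes[hostname]
        blocks.append(f"{hostname} {{\n\treverse_proxy {target}\n}}\n")
    return "\n".join(blocks)


# Hypothetical mapping: a hostname routed to a GPU node's private address.
EXAMPLE_SNIPPET = render_caddy_snippet({"comfy.example.internal": "10.0.0.5:8188"})
```

Rendering the full set of managed routes on every change, instead of patching individual entries, keeps the snippet file a pure function of the stored mapping.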
## 10. Daily Operations Guide
- open `ops.desineuron.in`
- log in with internal ops credentials
- review markets and costs
- launch the required GPU profile
- hydrate the model
- start the workload
- map the route if needed
- export session CSVs for accounting or review
## 11. Launching a GPU Box
Use the Launch form in the GUI:
- choose profile
- choose spot or on-demand
- submit
Result:
- instance launches with Desineuron tags
- session row is created
- runtime and cost begin tracking
- if spot capacity is unavailable, the UI records a failed launch job and shows an operator-facing error instead of crashing
- the launcher automatically tries sibling subnets in the same VPC instead of hard-failing on one overloaded AZ
## 12. Hydrating a Model
Use the Hydrate form:
- choose machine
- choose model
Hydration copies from S3 to the instance NVMe path.
## 13. Starting ComfyUI or Another Workload
Use the workload form:
- choose machine
- choose workload
Current v1 workload profile:
- `comfyui`
## 14. Tracking Session Time and Cost
Each machine session is tracked in the DB and can be exported to CSV.
Cost components:
- compute
- storage
- public IPv4
v1 note:
- machine cost is estimate-based
- instance pricing comes from AWS live data where available
- storage and public IPv4 are blended in as estimated infrastructure cost
## 15. CSV Exports and Reporting
CSV export path:
- `/opt/desineuron-ops-control-plane/exports/sessions_latest.csv` (`exports/sessions_latest.csv` relative to the stack root)
Use this for:
- session duration review
- estimated expenditure review
- internal ops accounting
The export is also logged in the database as a `csv_exports` record.
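For orientation, a session export of this kind could be produced along the following lines. The column names and both function names are assumptions. The actual `sessions_latest.csv` layout is whatever the control plane writes.

```python
import csv
import io
from datetime import datetime, timezone

# Assumed column set for illustration only.
FIELDS = ["session_id", "instance_type", "started_at", "ended_at",
          "hours", "estimated_cost_usd"]


def session_row(session_id: str, instance_type: str, started_at: datetime,
                ended_at: datetime, hourly_rate_usd: float) -> dict:
    """Turn one session into a CSV row with duration and estimated cost."""
    hours = (ended_at - started_at).total_seconds() / 3600
    return {
        "session_id": session_id,
        "instance_type": instance_type,
        "started_at": started_at.isoformat(),
        "ended_at": ended_at.isoformat(),
        "hours": round(hours, 4),
        "estimated_cost_usd": round(hours * hourly_rate_usd, 4),
    }


def sessions_to_csv(rows: list[dict]) -> str:
    """Serialize session rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```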
## 16. Failure Recovery Runbooks
If the worker stops:
- restart the systemd unit on Linux
If a GPU node is unhealthy:
- inspect machine state
- inspect workload status
- stop or terminate the node
- relaunch from a clean profile
If route mapping fails:
- inspect the ingress helper
- inspect Caddy reload status
- verify the ops container has SSH access to the ingress node
If redeploy breaks PostgreSQL permissions:
- verify `/opt/desineuron-ops-control-plane/data/postgres` is owned by UID/GID `999:999`
- restart `desineuron-ops-control-plane.service`
- never sync runtime directories from repo into the live stack
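The ownership check in the PostgreSQL runbook can be automated with a few lines. This is a sketch: the function name is hypothetical, and the `999:999` default simply mirrors the UID/GID stated above.

```python
import os


def ownership_matches(path: str, uid: int = 999, gid: int = 999) -> bool:
    """Return True if path is owned by the expected UID and GID."""
    info = os.stat(path)
    return info.st_uid == uid and info.st_gid == gid
```

Calling `ownership_matches("/opt/desineuron-ops-control-plane/data/postgres")` before restarting the service would show whether a redeploy changed the data directory's owner.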
## 17. Security Model and Access Control
- app is intended to be private
- secrets stay on Linux, not in repo
- actions are audited
- AWS workers expose only minimal required ports
- operator accounts can be provisioned with email-style usernames for team access
Current protected secrets:
- `/opt/desineuron-ops-control-plane/.env`
- `/opt/desineuron-ops-control-plane/state/desineuron-l4-node.pem`
## 18. Adding a New Model
Preferred method:
1. place the model directory under `/mnt/ServerStorage/ai-models/models`
2. use the `Model Library Ingest` form in the ops console
3. let the control plane upload the files, create the manifest, and upsert the catalog entry
Fallback manual method:
1. upload to S3 canonical bucket
2. add catalog entry
3. define expected prefix and optional manifest/checksum
## 19. Adding a New Instance Profile
1. add curated profile definition
2. set launch config
3. verify market visibility
4. test launch
## 20. Adding a New Route or Service
1. define hostname
2. define target backend
3. add route through GUI or helper
4. reload ingress
5. validate health
If the route is for a new public hostname:
6. create the Cloudflare DNS record pointing to `98.87.120.120`
7. keep the record in `DNS only` mode
8. validate TLS issuance on first public request
## 21. Backup and Restore
Persist:
- Postgres data
- `.env`
- exported CSVs
- state directory
- route helper state on ingress
Restore by:
- recreating the compose stack
- restoring DB data
- restoring config/env
- validating machine, model, and route state
## 22. Validated Live Behaviors
As of the latest implementation pass, the following were validated against the live environment:
- `ops.desineuron.in` login and dashboard render correctly
- `/api/markets/instances`, `/api/markets/pricing`, `/api/sessions`, `/api/costs`, and `/api/exports/csv` return live data
- a `g6.xlarge` on-demand launch was executed through the control plane and then terminated through the same surface
- a `g6.xlarge` spot launch failure was handled cleanly and recorded as `InsufficientInstanceCapacity`
- managed ingress route upsert/delete was executed successfully through the route helper
- session and audit data now persist because API DB writes are committed per request
- a model ingest smoke test uploaded `ops-smoke-model` from the Linux model library into S3 and generated a manifest
## 23. Operator Retrieval Commands
Retrieve the admin password on Linux:
```bash
sudo sed -n 's/^OPS_ADMIN_PASSWORD=//p' /opt/desineuron-ops-control-plane/.env
```
Check stack health:
```bash
sudo systemctl status desineuron-ops-control-plane.service
sudo docker compose -f /opt/desineuron-ops-control-plane/docker-compose.yml ps
```
Inspect recent API and worker logs:
```bash
sudo docker logs --tail 100 desineuron-ops-api
sudo docker logs --tail 100 desineuron-ops-worker
```
Inspect exports:
```bash
ls -lah /opt/desineuron-ops-control-plane/exports
```