Built the Sentinel Tab
This commit is contained in:
59
infrastructure/desineuron_ingress/Caddyfile
Normal file
59
infrastructure/desineuron_ingress/Caddyfile
Normal file
@@ -0,0 +1,59 @@
|
||||
{
|
||||
email admin@desineuron.in
|
||||
log {
|
||||
output file /var/log/caddy/admin.log
|
||||
format json
|
||||
}
|
||||
}
|
||||
|
||||
office.desineuron.in, git.desineuron.in, cloud.desineuron.in, projects.desineuron.in, talk.desineuron.in, vpn.desineuron.in {
|
||||
tls /etc/caddy/tls/fullchain.pem /etc/caddy/tls/privkey.pem
|
||||
|
||||
log {
|
||||
output file /var/log/caddy/access.log
|
||||
format json
|
||||
}
|
||||
|
||||
reverse_proxy https://127.0.0.1:8443 {
|
||||
header_up Host {host}
|
||||
header_up X-Forwarded-Host {host}
|
||||
header_up X-Forwarded-Proto {scheme}
|
||||
header_up X-Forwarded-For {remote_host}
|
||||
transport http {
|
||||
tls_insecure_skip_verify
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
ops.desineuron.in {
|
||||
log {
|
||||
output file /var/log/caddy/access.log
|
||||
format json
|
||||
}
|
||||
|
||||
reverse_proxy https://127.0.0.1:8443 {
|
||||
header_up Host {host}
|
||||
header_up X-Forwarded-Host {host}
|
||||
header_up X-Forwarded-Proto {scheme}
|
||||
header_up X-Forwarded-For {remote_host}
|
||||
transport http {
|
||||
tls_insecure_skip_verify
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
comfy.desineuron.in {
|
||||
log {
|
||||
output file /var/log/caddy/access.log
|
||||
format json
|
||||
}
|
||||
|
||||
reverse_proxy http://172.31.46.190:8188 {
|
||||
header_up Host {host}
|
||||
header_up X-Forwarded-Host {host}
|
||||
header_up X-Forwarded-Proto {scheme}
|
||||
header_up X-Forwarded-For {remote_host}
|
||||
}
|
||||
}
|
||||
|
||||
import /etc/caddy/managed/*.caddy
|
||||
38
infrastructure/desineuron_ingress/README.md
Normal file
38
infrastructure/desineuron_ingress/README.md
Normal file
@@ -0,0 +1,38 @@
|
||||
# Desineuron Ingress
|
||||
|
||||
This directory contains the reproducible bootstrap artifacts for the
|
||||
`desineuron-ingress-01` EC2 node.
|
||||
|
||||
Architecture:
|
||||
- EC2 `t4g.micro` on-demand in `us-east-1`
|
||||
- Amazon Linux 2023 ARM64
|
||||
- `20 GB` gp3 root volume
|
||||
- `Caddy` as the public HTTPS edge
|
||||
- `rathole` as the reverse TCP relay from the Linux origin box
|
||||
|
||||
Traffic model:
|
||||
- Public DNS stays in Cloudflare
|
||||
- Public HTTPS terminates on EC2
|
||||
- All six public hostnames proxy through EC2 to one local relay socket
|
||||
- Linux origin continues to serve the actual apps on `https://localhost:443`
|
||||
|
||||
Key files:
|
||||
- `user_data.sh`: first-boot provisioning for the EC2 ingress node
|
||||
- `Caddyfile`: public edge routing
|
||||
- `rathole-server.toml`: EC2-side relay config
|
||||
- `rathole-client.toml`: Linux-side relay config template
|
||||
- `install_linux_rathole_client.sh`: Linux-side installer/service script
|
||||
- `sync_ingress_home_ip.py`: detects current home public IP and updates the ingress SSH allowlist rule
|
||||
- `desineuron-ingress-home-ip-sync.service`: systemd oneshot service for the IP sync
|
||||
- `desineuron-ingress-home-ip-sync.timer`: persistent timer that reruns the sync every 5 minutes and on boot
|
||||
- `install_linux_ingress_ip_sync.sh`: Linux-side installer for the IP sync service
|
||||
|
||||
Manual Cloudflare work still required unless API credentials are provided:
|
||||
- set the six hostnames to DNS-only
|
||||
- point them to the ingress Elastic IP
|
||||
- retire the Cloudflare Tunnel routes once public validation passes
|
||||
|
||||
Dynamic home IP handling:
|
||||
- `rathole` control port `2333/tcp` is intentionally open on the ingress so public services do not break when the ISP IP changes
|
||||
- SSH fallback on the ingress remains restricted to the current home public IP on `22/tcp`
|
||||
- the Linux-side IP sync service keeps that SSH fallback rule current after ISP churn or reboot
|
||||
540
infrastructure/desineuron_ingress/TEAM_HANDOFF_2026-04-08.md
Normal file
540
infrastructure/desineuron_ingress/TEAM_HANDOFF_2026-04-08.md
Normal file
@@ -0,0 +1,540 @@
|
||||
## Desineuron Stable Ingress Handoff
|
||||
|
||||
Date: 2026-04-08
|
||||
|
||||
### Chapters
|
||||
|
||||
1. Outcome
|
||||
2. Final Architecture
|
||||
3. AWS Resources
|
||||
4. Linux Origin State
|
||||
5. Migration Changes Applied
|
||||
6. Validation Results
|
||||
7. ComfyUI Recovery and GPU Route
|
||||
8. Files and Config Artifacts
|
||||
9. Dynamic Home IP Sync
|
||||
10. Operational Commands
|
||||
11. Future Service Mapping Runbook
|
||||
12. Security Notes
|
||||
13. Remaining Improvement Ideas
|
||||
14. Rollback
|
||||
15. Team Summary
|
||||
16. Current Status Snapshot - 2026-04-11
|
||||
17. Linux Ops Control Plane
|
||||
|
||||
### Outcome
|
||||
|
||||
The Cloudflare Tunnel dependency for the six public `desineuron.in` services has been replaced with a self-hosted AWS ingress layer:
|
||||
|
||||
- Public edge: AWS EC2 `t4g.micro`
|
||||
- Stable public IP: `98.87.120.120`
|
||||
- TLS termination: `Caddy` on the ingress node
|
||||
- Private backend relay: `rathole`
|
||||
- Origin: Linux box at `192.168.1.4`
|
||||
- DNS: Cloudflare, `DNS only`
|
||||
|
||||
Public hostnames now route through AWS instead of Cloudflare Tunnel:
|
||||
|
||||
- `office.desineuron.in`
|
||||
- `git.desineuron.in`
|
||||
- `cloud.desineuron.in`
|
||||
- `projects.desineuron.in`
|
||||
- `talk.desineuron.in`
|
||||
- `vpn.desineuron.in`
|
||||
- `comfy.desineuron.in` (ingress route created for AWS GPU ComfyUI)
|
||||
- `ops.desineuron.in` (private operator control surface on the Linux box)
|
||||
|
||||
### Final Architecture
|
||||
|
||||
```text
|
||||
Internet
|
||||
-> Cloudflare DNS
|
||||
-> 98.87.120.120
|
||||
-> EC2 ingress: desineuron-ingress-01
|
||||
-> Caddy :443
|
||||
-> rathole server (control on 2333, local relay on 127.0.0.1:8443)
|
||||
-> Linux origin tunnel client
|
||||
-> Linux nginx :443
|
||||
-> per-host upstream routing
|
||||
-> Gitea
|
||||
-> Nextcloud
|
||||
-> Taiga
|
||||
-> OnlyOffice
|
||||
-> NetBird
|
||||
-> comfy.desineuron.in
|
||||
-> EC2 ingress Caddy
|
||||
-> private proxy to AWS GPU box `172.31.46.190:8188`
|
||||
-> ComfyUI endpoints on systemd-managed GPU service
|
||||
```
|
||||
|
||||
### AWS Resources
|
||||
|
||||
- Instance name: `desineuron-ingress-01`
|
||||
- Instance ID: `i-094df09acafb72494`
|
||||
- Type: `t4g.micro`
|
||||
- Region: `us-east-1`
|
||||
- Subnet: `subnet-03d684ed15f327151`
|
||||
- VPC: `vpc-081d2397920aad268`
|
||||
- Root disk: `20 GB gp3`
|
||||
- Elastic IP: `98.87.120.120`
|
||||
- IAM role: `desineuron-ingress-role`
|
||||
- Instance profile: `desineuron-ingress-profile`
|
||||
- Security group: `sg-0721b8b48e12c531d`
|
||||
|
||||
Current GPU worker:
|
||||
|
||||
- Instance ID: `i-0e4eab5fe67cf9abe`
|
||||
- Type: `g6.12xlarge`
|
||||
- Region: `us-east-1`
|
||||
- Private IP: `172.31.46.190`
|
||||
- Current public IP: `18.208.176.121`
|
||||
- Launch time: `2026-04-11T06:14:04Z`
|
||||
|
||||
Open ingress ports:
|
||||
|
||||
- `80/tcp` from internet
|
||||
- `443/tcp` from internet
|
||||
- `22/tcp` restricted to the current home public IP and auto-synced from the Linux origin
|
||||
- `2333/tcp` from internet for `rathole` control and data relay
|
||||
|
||||
GPU node security posture for ComfyUI:
|
||||
|
||||
- public `8118/tcp` removed
|
||||
- public `8188/tcp` removed
|
||||
- `8188/tcp` now allowed only from ingress security group `sg-0721b8b48e12c531d`
|
||||
|
||||
### Linux Origin State
|
||||
|
||||
Services exposed to local nginx:
|
||||
|
||||
- `git.desineuron.in` -> `127.0.0.1:3000` (`gitea`)
|
||||
- `cloud.desineuron.in` -> `127.0.0.1:11000` (`nextcloud_app`)
|
||||
- `talk.desineuron.in` -> `127.0.0.1:11000` (`nextcloud_app`, Talk-focused hostname)
|
||||
- `projects.desineuron.in` -> `127.0.0.1:9100` (`taiga-gateway`)
|
||||
- `office.desineuron.in` -> `127.0.0.1:9980` (`nextcloud_onlyoffice`)
|
||||
- `vpn.desineuron.in` -> `127.0.0.1:8080` / `127.0.0.1:8081` (`netbird`)
|
||||
|
||||
Tunnel state:
|
||||
|
||||
- `rathole-client.service` active on Linux
|
||||
- `rathole-server.service` active on AWS
|
||||
- `cloudflared` inactive on Linux
|
||||
|
||||
### Migration Changes Applied
|
||||
|
||||
#### Cloudflare
|
||||
|
||||
Old CNAME tunnel records were removed for the six public hostnames.
|
||||
|
||||
New records were created:
|
||||
|
||||
- Type: `A`
|
||||
- Value: `98.87.120.120`
|
||||
- Proxy status: `DNS only`
|
||||
- TTL: `300`
|
||||
|
||||
#### AWS Ingress
|
||||
|
||||
Installed and configured:
|
||||
|
||||
- `Caddy`
|
||||
- `rathole`
|
||||
- `amazon-ssm-agent`
|
||||
- Linux-driven SSH allowlist sync for the ingress node
|
||||
|
||||
TLS:
|
||||
|
||||
- Existing valid certificate/key pair from the Linux origin was copied to the ingress node.
|
||||
- Caddy now terminates HTTPS at the edge.
|
||||
|
||||
#### Linux Origin
|
||||
|
||||
nginx was already routing by hostname and remains the origin router.
|
||||
|
||||
Nextcloud was adjusted so `talk.desineuron.in` no longer canonicalizes to `cloud.desineuron.in`:
|
||||
|
||||
- removed `overwritehost` pin
|
||||
- added `talk.desineuron.in` to trusted domains
|
||||
- restarted `nextcloud_app`
|
||||
|
||||
### Validation Results
|
||||
|
||||
Public hostname checks through the new ingress:
|
||||
|
||||
- `office.desineuron.in` -> `200 /welcome/`
|
||||
- `git.desineuron.in` -> `200`
|
||||
- `cloud.desineuron.in` -> `200 /login`
|
||||
- `projects.desineuron.in` -> `200`
|
||||
- `talk.desineuron.in` -> `200 /login` on `talk.desineuron.in`
|
||||
- `vpn.desineuron.in` -> `200`
|
||||
- `ops.desineuron.in/login` -> `200`
|
||||
- `comfy.desineuron.in` -> `502`
|
||||
|
||||
Important note:
|
||||
|
||||
- `talk.desineuron.in` now stays on the `talk` hostname.
|
||||
- It is still backed by the same Nextcloud origin and presents the Nextcloud login flow, which is expected given the current Linux-side app layout.
|
||||
|
||||
### ComfyUI Recovery and GPU Route
|
||||
|
||||
Root cause of the earlier `502`:
|
||||
|
||||
- ingress route and TLS were correct
|
||||
- the GPU spot node had lost the actual `/opt/dlami/nvme/ComfyUI` app tree
|
||||
- nothing was listening on `172.31.46.190:8188`
|
||||
|
||||
Permanent fix applied:
|
||||
|
||||
- restored `/opt/dlami/nvme/ComfyUI` from upstream source control
|
||||
- installed ComfyUI Python requirements on the GPU node
|
||||
- created `systemd` unit `comfyui.service`
|
||||
- enabled `comfyui.service` at boot with automatic restart
|
||||
- kept `comfy.desineuron.in` mapped through ingress Caddy
|
||||
- removed direct public access to `8118` and `8188`
|
||||
- allowed `8188` only from ingress security group
|
||||
|
||||
Current live path:
|
||||
|
||||
- `https://comfy.desineuron.in`
|
||||
-> ingress `98.87.120.120`
|
||||
-> Caddy reverse proxy
|
||||
-> GPU private IP `172.31.46.190:8188`
|
||||
-> `comfyui.service`
|
||||
|
||||
Current public result:
|
||||
|
||||
- `comfy.desineuron.in` currently returns `502 Bad Gateway`
|
||||
- ingress route is present and Caddy is healthy
|
||||
- the current GPU backend is not yet listening on `172.31.46.190:8188`, so this is a backend readiness issue, not a DNS or edge-TLS issue
|
||||
|
||||
Current GPU service:
|
||||
|
||||
- `comfyui.service`
|
||||
- app path: `/opt/dlami/nvme/ComfyUI`
|
||||
- log path: `/var/log/comfyui/service.log`
|
||||
- port: `8188/tcp`
|
||||
|
||||
Current backend state on `2026-04-11`:
|
||||
|
||||
- `comfyui.service` is `activating`
|
||||
- latest log shows ComfyUI startup and `Starting server`
|
||||
- the process is still not binding `8188`, so ingress sees the backend as unavailable
|
||||
|
||||
Expected endpoints:
|
||||
|
||||
- `https://comfy.desineuron.in/`
|
||||
- `https://comfy.desineuron.in/prompt`
|
||||
- `https://comfy.desineuron.in/history/{prompt_id}`
|
||||
- `https://comfy.desineuron.in/queue`
|
||||
- `https://comfy.desineuron.in/upload/image`
|
||||
|
||||
### Files and Config Artifacts
|
||||
|
||||
Infrastructure artifacts in repo:
|
||||
|
||||
- [README.md](/F:/Workin%20In%20Progress/DESINEURON/GITLAB/Project_Velocity/infrastructure/desineuron_ingress/README.md)
|
||||
- [Caddyfile](/F:/Workin%20In%20Progress/DESINEURON/GITLAB/Project_Velocity/infrastructure/desineuron_ingress/Caddyfile)
|
||||
- [rathole-server.toml](/F:/Workin%20In%20Progress/DESINEURON/GITLAB/Project_Velocity/infrastructure/desineuron_ingress/rathole-server.toml)
|
||||
- [rathole-client.toml](/F:/Workin%20In%20Progress/DESINEURON/GITLAB/Project_Velocity/infrastructure/desineuron_ingress/rathole-client.toml)
|
||||
- [install_linux_rathole_client.sh](/F:/Workin%20In%20Progress/DESINEURON/GITLAB/Project_Velocity/infrastructure/desineuron_ingress/install_linux_rathole_client.sh)
|
||||
- [user_data.sh](/F:/Workin%20In%20Progress/DESINEURON/GITLAB/Project_Velocity/infrastructure/desineuron_ingress/user_data.sh)
|
||||
- [install_gpu_comfyui_service.sh](/F:/Workin%20In%20Progress/DESINEURON/GITLAB/Project_Velocity/infrastructure/desineuron_ingress/install_gpu_comfyui_service.sh)
|
||||
- [map_gpu_comfy_security.ps1](/F:/Workin%20In%20Progress/DESINEURON/GITLAB/Project_Velocity/infrastructure/desineuron_ingress/map_gpu_comfy_security.ps1)
|
||||
- [sync_ingress_home_ip.py](/F:/Workin%20In%20Progress/DESINEURON/GITLAB/Project_Velocity/infrastructure/desineuron_ingress/sync_ingress_home_ip.py)
|
||||
- [desineuron-ingress-home-ip-sync.service](/F:/Workin%20In%20Progress/DESINEURON/GITLAB/Project_Velocity/infrastructure/desineuron_ingress/desineuron-ingress-home-ip-sync.service)
|
||||
- [desineuron-ingress-home-ip-sync.timer](/F:/Workin%20In%20Progress/DESINEURON/GITLAB/Project_Velocity/infrastructure/desineuron_ingress/desineuron-ingress-home-ip-sync.timer)
|
||||
- [install_linux_ingress_ip_sync.sh](/F:/Workin%20In%20Progress/DESINEURON/GITLAB/Project_Velocity/infrastructure/desineuron_ingress/install_linux_ingress_ip_sync.sh)
|
||||
- [README.md](/F:/Workin%20In%20Progress/DESINEURON/GITLAB/Project_Velocity/infrastructure/ops_control_plane/README.md)
|
||||
- [Desineuron Ops Control Plane Bibel.md](/F:/Workin%20In%20Progress/DESINEURON/GITLAB/Project_Velocity/.Agent%20Context/Bibels/Desineuron%20Ops%20Control%20Plane%20Bibel.md)
|
||||
|
||||
Linux origin files touched:
|
||||
|
||||
- `/etc/nginx/sites-enabled/desineuron.conf`
|
||||
- `/mnt/ServerStorage/docker_apps/nextcloud/.env`
|
||||
- `/mnt/ServerStorage/docker_apps/nextcloud/data/config/config.php`
|
||||
- `/mnt/ServerStorage/docker_apps/nextcloud/data/config/reverse-proxy.config.php`
|
||||
|
||||
Backups created on Linux:
|
||||
|
||||
- `/mnt/ServerStorage/docker_apps/nextcloud/.env.pre_ingress_backup_2026-04-08`
|
||||
- `/mnt/ServerStorage/docker_apps/nextcloud/data/config/reverse-proxy.config.php.pre_ingress_backup_2026-04-08`
|
||||
|
||||
### Dynamic Home IP Sync
|
||||
|
||||
Purpose:
|
||||
|
||||
- Keep ingress `22/tcp` restricted to the current Airtel public IP even when the ISP changes it
|
||||
- Prevent future manual outages for SSH fallback caused by stale home-IP security-group rules
|
||||
|
||||
Design:
|
||||
|
||||
- Linux origin runs `desineuron-ingress-home-ip-sync.timer`
|
||||
- Timer fires on boot and every 5 minutes
|
||||
- Service resolves the current home public IP via `https://api.ipify.org`
|
||||
- Service updates only the ingress security group `sg-0721b8b48e12c531d`
|
||||
- Only the SSH fallback rule is mutated
|
||||
- `rathole` is no longer dependent on the Airtel IP because `2333/tcp` remains open on the ingress
|
||||
|
||||
Installed Linux paths:
|
||||
|
||||
- `/usr/local/bin/sync_ingress_home_ip.py`
|
||||
- `/etc/systemd/system/desineuron-ingress-home-ip-sync.service`
|
||||
- `/etc/systemd/system/desineuron-ingress-home-ip-sync.timer`
|
||||
- `/etc/desineuron-ingress-home-ip-sync.env`
|
||||
- `/opt/desineuron-ingress-ip-sync/.venv`
|
||||
- `/var/lib/desineuron-ingress-ip-sync/current_ip.txt`
|
||||
|
||||
Current state:
|
||||
|
||||
- Timer: enabled and active
|
||||
- Last recorded home public IP: `223.185.28.89`
|
||||
- Ingress SSH rule CIDR: `223.185.28.89/32`
|
||||
|
||||
### Operational Commands
|
||||
|
||||
Check AWS ingress status:
|
||||
|
||||
```powershell
|
||||
aws ec2 describe-instances --instance-ids i-094df09acafb72494 --region us-east-1
|
||||
aws ec2 describe-addresses --allocation-ids eipalloc-0d54fc0f827450e7b --region us-east-1
|
||||
```
|
||||
|
||||
Check ingress services:
|
||||
|
||||
```powershell
|
||||
aws ssm send-command --region us-east-1 --instance-ids i-094df09acafb72494 --document-name AWS-RunShellScript --parameters commands="sudo systemctl status caddy rathole-server --no-pager"
|
||||
```
|
||||
|
||||
Check GPU ComfyUI service:
|
||||
|
||||
```powershell
|
||||
aws ssm send-command --region us-east-1 --instance-ids i-0e4eab5fe67cf9abe --document-name AWS-RunShellScript --parameters commands="sudo systemctl status comfyui --no-pager","ss -ltnp | grep 8188 || true","tail -n 40 /var/log/comfyui/service.log || true"
|
||||
```
|
||||
|
||||
Check Linux origin services:
|
||||
|
||||
```powershell
|
||||
ssh -i "$env:USERPROFILE\.ssh\id_ed25519_desineuron_lan" desineuron-node-01@192.168.1.4 "echo '***' | sudo -S systemctl status rathole-client nginx"
|
||||
ssh -i "$env:USERPROFILE\.ssh\id_ed25519_desineuron_lan" desineuron-node-01@192.168.1.4 "echo '***' | sudo -S systemctl status desineuron-ingress-home-ip-sync.service desineuron-ingress-home-ip-sync.timer"
|
||||
ssh -i "$env:USERPROFILE\.ssh\id_ed25519_desineuron_lan" desineuron-node-01@192.168.1.4 "echo '***' | sudo -S journalctl -u desineuron-ingress-home-ip-sync -n 50 --no-pager"
|
||||
ssh -i "$env:USERPROFILE\.ssh\id_ed25519_desineuron_lan" desineuron-node-01@192.168.1.4 "echo '***' | sudo -S systemctl status desineuron-ops-control-plane.service --no-pager"
|
||||
ssh -i "$env:USERPROFILE\.ssh\id_ed25519_desineuron_lan" desineuron-node-01@192.168.1.4 "echo '***' | sudo -S docker compose -f /opt/desineuron-ops-control-plane/docker-compose.yml ps"
|
||||
```
|
||||
|
||||
Public endpoint validation:
|
||||
|
||||
```powershell
|
||||
curl.exe -I https://office.desineuron.in
|
||||
curl.exe -I https://git.desineuron.in
|
||||
curl.exe -I https://cloud.desineuron.in
|
||||
curl.exe -I https://projects.desineuron.in
|
||||
curl.exe -I https://talk.desineuron.in
|
||||
curl.exe -I https://vpn.desineuron.in
|
||||
curl.exe -I https://comfy.desineuron.in
|
||||
curl.exe -I https://ops.desineuron.in/login
|
||||
```
|
||||
|
||||
### Future Service Mapping Runbook
|
||||
|
||||
Use this pattern for any future public service behind the stable ingress layer.
|
||||
|
||||
1. Decide the backend location.
|
||||
|
||||
- Linux origin behind `rathole`
|
||||
- AWS GPU/private EC2 node
|
||||
- another private backend later
|
||||
|
||||
2. Decide whether the service should terminate TLS at ingress.
|
||||
|
||||
- default: yes
|
||||
- Caddy on ingress should own the public hostname and certificate
|
||||
|
||||
3. Create the DNS record in Cloudflare.
|
||||
|
||||
- type: `A`
|
||||
- value: `98.87.120.120`
|
||||
- proxy mode: `DNS only`
|
||||
- low TTL during rollout
|
||||
|
||||
4. Add the ingress route in [`Caddyfile`](/F:/Workin%20In%20Progress/DESINEURON/GITLAB/Project_Velocity/infrastructure/desineuron_ingress/Caddyfile).
|
||||
|
||||
Patterns:
|
||||
|
||||
- Linux-origin service:
|
||||
- proxy to `https://127.0.0.1:8443`
|
||||
- preserve `Host`
|
||||
- private AWS backend service:
|
||||
- proxy to `http://<private-ip>:<port>` or `https://<private-ip>:<port>`
|
||||
|
||||
5. Restrict backend network access.
|
||||
|
||||
- never leave backend app ports open to `0.0.0.0/0` unless absolutely necessary
|
||||
- prefer security-group rule allowing traffic only from ingress security group
|
||||
- for home-origin services, keep them private behind `rathole`
|
||||
|
||||
6. Reload ingress.
|
||||
|
||||
```powershell
|
||||
ssh -i "F:\Workin In Progress\DESINEURON\GITLAB\Project_Velocity\desineuron-l4-node.pem" ec2-user@98.87.120.120 "sudo caddy validate --config /etc/caddy/Caddyfile && sudo systemctl reload caddy"
|
||||
```
|
||||
|
||||
7. Validate TLS and app response.
|
||||
|
||||
- check certificate subject matches hostname
|
||||
- check `curl -I https://<host>`
|
||||
- check login page or health endpoint
|
||||
- check browser behavior
|
||||
|
||||
8. If the backend is stateful, create a persistent service.
|
||||
|
||||
- prefer `systemd`
|
||||
- enable restart on failure
|
||||
- log to a stable path
|
||||
- record service name, working directory, ports, and restart policy in this handoff doc
|
||||
|
||||
9. Update team docs immediately.
|
||||
|
||||
- hostname
|
||||
- DNS record type
|
||||
- ingress route target
|
||||
- backend service owner
|
||||
- service name
|
||||
- health check command
|
||||
- rollback step
|
||||
|
||||
### Security Notes
|
||||
|
||||
- Public traffic terminates only at the AWS edge.
|
||||
- The Linux box no longer needs Cloudflare Tunnel for these six routes.
|
||||
- The Linux origin is reached through an outbound tunnel, not by directly exposing the home machine to the public for app traffic.
|
||||
- SSH on the Linux box remains key-only.
|
||||
- The AWS ingress IAM role is limited to SSM core.
|
||||
- ComfyUI is no longer directly exposed on the GPU public IP; only the ingress layer can reach `8188`.
|
||||
- Ingress `22/tcp` stays restricted and is now auto-synced from the Linux origin.
|
||||
- Ingress `2333/tcp` is intentionally open so `rathole` survives Airtel IP changes without operator action.
|
||||
|
||||
### Remaining Improvement Ideas
|
||||
|
||||
- Move the Linux nginx certificate issuance/renewal model to the AWS edge permanently instead of copying an existing certificate.
|
||||
- Clean up nginx warnings about duplicated protocol options.
|
||||
- Separate `talk.desineuron.in` more fully from general Nextcloud if a distinct Talk-only UX is desired.
|
||||
- Add authentication in front of `comfy.desineuron.in`; internet scanners started hitting the route immediately after it went live.
|
||||
- Consider putting Basic Auth or an allowlist in front of `comfy.desineuron.in` before broader team rollout.
|
||||
- Add monitoring and alerting on:
|
||||
- `caddy`
|
||||
- `rathole-server`
|
||||
- `rathole-client`
|
||||
- public HTTPS checks
|
||||
- Add infrastructure-as-code for the EC2 ingress node if this should be reproducible by the team without manual AWS CLI steps.
|
||||
|
||||
### Rollback
|
||||
|
||||
If rollback is needed:
|
||||
|
||||
1. Recreate Cloudflare CNAME/tunnel routes or repoint the DNS records away from `98.87.120.120`.
|
||||
2. Stop `caddy` and `rathole-server` on AWS.
|
||||
3. Stop `rathole-client` on Linux.
|
||||
4. Restore Nextcloud files from:
|
||||
- `.env.pre_ingress_backup_2026-04-08`
|
||||
- `reverse-proxy.config.php.pre_ingress_backup_2026-04-08`
|
||||
5. Restart `nextcloud_app` and nginx.
|
||||
|
||||
### Team Summary
|
||||
|
||||
This migration is complete.
|
||||
|
||||
Cloudflare Tunnel is no longer the production path for the six public service hostnames. The stable production ingress is now the AWS `t4g.micro` node with Elastic IP `98.87.120.120`, and the Linux machine remains the private origin behind `rathole`.
|
||||
|
||||
Additional mapped route:
|
||||
|
||||
- `comfy.desineuron.in` now terminates on the same stable ingress and forwards to the GPU node's private address `172.31.46.190:8188`.
|
||||
- No further DNS change is needed for ComfyUI.
|
||||
- The backend is supervised by `systemd`, but the current worker is not yet binding `8188`, so public access is currently degraded with `502`.
|
||||
- The team can use:
|
||||
- `https://comfy.desineuron.in/prompt`
|
||||
- `https://comfy.desineuron.in/history/{prompt_id}`
|
||||
- `https://comfy.desineuron.in/queue`
|
||||
- `https://comfy.desineuron.in/upload/image`
|
||||
|
||||
### Current Status Snapshot - 2026-04-11
|
||||
|
||||
Live public service state:
|
||||
|
||||
- `office.desineuron.in` -> `200`
|
||||
- `git.desineuron.in` -> `200`
|
||||
- `cloud.desineuron.in` -> `200`
|
||||
- `projects.desineuron.in` -> `200`
|
||||
- `talk.desineuron.in` -> `200`
|
||||
- `vpn.desineuron.in` -> `200`
|
||||
- `ops.desineuron.in/login` -> `200`
|
||||
- `comfy.desineuron.in` -> `502`
|
||||
|
||||
Linux-origin health:
|
||||
|
||||
- `nginx.service` -> `active`
|
||||
- `rathole-client.service` -> `active`
|
||||
- `desineuron-ingress-home-ip-sync.timer` -> `active`
|
||||
- `desineuron-ops-control-plane.service` -> `active`
|
||||
|
||||
Linux ops stack containers:
|
||||
|
||||
- `desineuron-ops-api` -> `Up`
|
||||
- `desineuron-ops-db` -> `Up (healthy)`
|
||||
- `desineuron-ops-worker` -> `Up`
|
||||
|
||||
Ingress health:
|
||||
|
||||
- `caddy` -> `active`
|
||||
- `rathole-server` -> `active`
|
||||
- `comfy.desineuron.in` Caddy route is present in `/etc/caddy/Caddyfile`
|
||||
|
||||
GPU ComfyUI state:
|
||||
|
||||
- `comfyui.service` -> `activating`
|
||||
- latest logs show ComfyUI startup sequence completing toward `Starting server`
|
||||
- no active listener on `8188` yet
|
||||
- ingress cannot connect to `172.31.46.190:8188`, which is why the public result is `502`
|
||||
|
||||
### Linux Ops Control Plane
|
||||
|
||||
The Linux box now also hosts the private AWS control surface for the team.
|
||||
|
||||
Public operator URL:
|
||||
|
||||
- `https://ops.desineuron.in/login`
|
||||
|
||||
Purpose:
|
||||
|
||||
- launch/stop/terminate AWS machines
|
||||
- view spot/on-demand market data
|
||||
- track runtime and estimated cost
|
||||
- ingest model directories from the Linux box into S3
|
||||
- hydrate models from S3 to AWS GPU nodes
|
||||
- manage ingress routes through the `t4g.micro`
|
||||
- export session/cost CSVs
|
||||
|
||||
Linux runtime paths:
|
||||
|
||||
- stack root: `/opt/desineuron-ops-control-plane`
|
||||
- env file: `/opt/desineuron-ops-control-plane/.env`
|
||||
- exports: `/opt/desineuron-ops-control-plane/exports`
|
||||
- state: `/opt/desineuron-ops-control-plane/state`
|
||||
|
||||
Canonical S3 bucket:
|
||||
|
||||
- `desineuron-ops-control-plane-819079556187-us-east-1`
|
||||
|
||||
Model library source on Linux:
|
||||
|
||||
- `/mnt/ServerStorage/ai-models/models`
|
||||
|
||||
Current operator accounts:
|
||||
|
||||
- `sagnik@desineuron.in`
|
||||
- `sayan@desineuron.in`
|
||||
- `sourik@desineuron.in`
|
||||
|
||||
Reference docs:
|
||||
|
||||
- [README.md](/F:/Workin%20In%20Progress/DESINEURON/GITLAB/Project_Velocity/infrastructure/ops_control_plane/README.md)
|
||||
- [Desineuron Ops Control Plane Bibel.md](/F:/Workin%20In%20Progress/DESINEURON/GITLAB/Project_Velocity/.Agent%20Context/Bibels/Desineuron%20Ops%20Control%20Plane%20Bibel.md)
|
||||
@@ -0,0 +1,12 @@
|
||||
[Unit]
|
||||
Description=Update ingress SSH allowlist to current home public IP
|
||||
After=network-online.target
|
||||
Wants=network-online.target
|
||||
|
||||
[Service]
|
||||
Type=oneshot
|
||||
EnvironmentFile=/etc/desineuron-ingress-home-ip-sync.env
|
||||
ExecStart=/opt/desineuron-ingress-ip-sync/.venv/bin/python /usr/local/bin/sync_ingress_home_ip.py
|
||||
WorkingDirectory=/var/lib/desineuron-ingress-ip-sync
|
||||
User=root
|
||||
Group=root
|
||||
@@ -0,0 +1,11 @@
|
||||
[Unit]
|
||||
Description=Run ingress home IP sync on boot and every 5 minutes
|
||||
|
||||
[Timer]
|
||||
OnBootSec=45s
|
||||
OnUnitActiveSec=5min
|
||||
Unit=desineuron-ingress-home-ip-sync.service
|
||||
Persistent=true
|
||||
|
||||
[Install]
|
||||
WantedBy=timers.target
|
||||
@@ -0,0 +1,52 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
COMFY_DIR="/opt/dlami/nvme/ComfyUI"
|
||||
SERVICE_NAME="comfyui"
|
||||
LOG_DIR="/var/log/comfyui"
|
||||
|
||||
if ! command -v git >/dev/null 2>&1; then
|
||||
sudo apt-get update
|
||||
sudo apt-get install -y git
|
||||
fi
|
||||
|
||||
if [ ! -d "${COMFY_DIR}/.git" ]; then
|
||||
sudo mkdir -p /opt/dlami/nvme
|
||||
sudo chown -R ubuntu:ubuntu /opt/dlami/nvme
|
||||
git clone https://github.com/comfyanonymous/ComfyUI.git "${COMFY_DIR}"
|
||||
else
|
||||
git -C "${COMFY_DIR}" pull --ff-only
|
||||
fi
|
||||
|
||||
python3 -m pip install -r "${COMFY_DIR}/requirements.txt"
|
||||
|
||||
sudo mkdir -p "${LOG_DIR}"
|
||||
sudo chown -R ubuntu:ubuntu "${LOG_DIR}"
|
||||
|
||||
sudo tee /etc/systemd/system/${SERVICE_NAME}.service >/dev/null <<'EOF'
|
||||
[Unit]
|
||||
Description=ComfyUI GPU Service
|
||||
After=network-online.target
|
||||
Wants=network-online.target
|
||||
|
||||
[Service]
|
||||
Type=simple
|
||||
User=ubuntu
|
||||
Group=ubuntu
|
||||
WorkingDirectory=/opt/dlami/nvme/ComfyUI
|
||||
Environment=HOME=/home/ubuntu
|
||||
Environment=PYTHONUNBUFFERED=1
|
||||
ExecStart=/usr/bin/python3 /opt/dlami/nvme/ComfyUI/main.py --listen 0.0.0.0 --port 8188 --disable-auto-launch
|
||||
Restart=always
|
||||
RestartSec=5
|
||||
StandardOutput=append:/var/log/comfyui/service.log
|
||||
StandardError=append:/var/log/comfyui/service.log
|
||||
|
||||
[Install]
|
||||
WantedBy=multi-user.target
|
||||
EOF
|
||||
|
||||
sudo systemctl daemon-reload
|
||||
sudo systemctl enable --now "${SERVICE_NAME}.service"
|
||||
sleep 5
|
||||
sudo systemctl --no-pager --full status "${SERVICE_NAME}.service"
|
||||
@@ -0,0 +1,40 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
if [[ $# -ne 2 ]]; then
|
||||
echo "Usage: $0 <aws_access_key_id> <aws_secret_access_key>" >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
AWS_ACCESS_KEY_ID="$1"
|
||||
AWS_SECRET_ACCESS_KEY="$2"
|
||||
INSTALL_ROOT="/opt/desineuron-ingress-ip-sync"
|
||||
VENV_PATH="${INSTALL_ROOT}/.venv"
|
||||
|
||||
sudo apt-get update
|
||||
sudo apt-get install -y python3-venv
|
||||
sudo mkdir -p "${INSTALL_ROOT}"
|
||||
sudo python3 -m venv "${VENV_PATH}"
|
||||
sudo "${VENV_PATH}/bin/pip" install --upgrade pip boto3
|
||||
|
||||
sudo install -m 0755 /tmp/sync_ingress_home_ip.py /usr/local/bin/sync_ingress_home_ip.py
|
||||
sudo install -m 0644 /tmp/desineuron-ingress-home-ip-sync.service /etc/systemd/system/desineuron-ingress-home-ip-sync.service
|
||||
sudo install -m 0644 /tmp/desineuron-ingress-home-ip-sync.timer /etc/systemd/system/desineuron-ingress-home-ip-sync.timer
|
||||
|
||||
sudo mkdir -p /var/lib/desineuron-ingress-ip-sync
|
||||
sudo tee /etc/desineuron-ingress-home-ip-sync.env >/dev/null <<EOF
|
||||
AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
|
||||
AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
|
||||
AWS_REGION=us-east-1
|
||||
INGRESS_SECURITY_GROUP_ID=sg-0721b8b48e12c531d
|
||||
INGRESS_SSH_PORT=22
|
||||
INGRESS_SSH_RULE_DESCRIPTION=SSH fallback from origin network
|
||||
INGRESS_IP_STATE_FILE=/var/lib/desineuron-ingress-ip-sync/current_ip.txt
|
||||
EOF
|
||||
sudo chmod 600 /etc/desineuron-ingress-home-ip-sync.env
|
||||
|
||||
sudo systemctl daemon-reload
|
||||
sudo systemctl enable --now desineuron-ingress-home-ip-sync.timer
|
||||
sudo systemctl start desineuron-ingress-home-ip-sync.service
|
||||
sudo systemctl --no-pager --full status desineuron-ingress-home-ip-sync.service
|
||||
sudo systemctl --no-pager --full status desineuron-ingress-home-ip-sync.timer
|
||||
@@ -0,0 +1,44 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
RATHOLE_VERSION="${RATHOLE_VERSION:-v0.4.3}"
|
||||
RATHOLE_URL="${RATHOLE_URL:-https://github.com/rapiz1/rathole/releases/download/${RATHOLE_VERSION}/rathole-x86_64-unknown-linux-gnu.zip}"
|
||||
CONFIG_SOURCE="${CONFIG_SOURCE:-/tmp/rathole-client.toml}"
|
||||
|
||||
sudo install -d -m 0755 /etc/rathole
|
||||
sudo install -d -m 0755 /opt/rathole
|
||||
|
||||
tmp_dir="$(mktemp -d)"
|
||||
trap 'rm -rf "$tmp_dir"' EXIT
|
||||
|
||||
cd "$tmp_dir"
|
||||
curl -fL "$RATHOLE_URL" -o rathole.zip
|
||||
python3 - <<'PY'
|
||||
import zipfile
|
||||
z = zipfile.ZipFile("rathole.zip")
|
||||
z.extractall(".")
|
||||
PY
|
||||
|
||||
sudo install -m 0755 rathole /usr/local/bin/rathole
|
||||
sudo install -m 0600 "$CONFIG_SOURCE" /etc/rathole/client.toml
|
||||
|
||||
cat <<'EOF' | sudo tee /etc/systemd/system/rathole-client.service >/dev/null
|
||||
[Unit]
|
||||
Description=Desineuron Rathole Client
|
||||
After=network-online.target
|
||||
Wants=network-online.target
|
||||
|
||||
[Service]
|
||||
Type=simple
|
||||
ExecStart=/usr/local/bin/rathole /etc/rathole/client.toml
|
||||
Restart=always
|
||||
RestartSec=5
|
||||
User=root
|
||||
|
||||
[Install]
|
||||
WantedBy=multi-user.target
|
||||
EOF
|
||||
|
||||
sudo systemctl daemon-reload
|
||||
sudo systemctl enable --now rathole-client.service
|
||||
sudo systemctl status --no-pager rathole-client.service || true
|
||||
33
infrastructure/desineuron_ingress/map_gpu_comfy_security.ps1
Normal file
33
infrastructure/desineuron_ingress/map_gpu_comfy_security.ps1
Normal file
@@ -0,0 +1,33 @@
|
||||
$ErrorActionPreference = "Stop"
|
||||
|
||||
$gpuGroups = @(
|
||||
"sg-0b144c17b1b89f4c6",
|
||||
"sg-05e4de3fe94ad6558"
|
||||
)
|
||||
|
||||
$ingressGroup = "sg-0721b8b48e12c531d"
|
||||
|
||||
try {
|
||||
aws ec2 authorize-security-group-ingress `
|
||||
--group-id "sg-0b144c17b1b89f4c6" `
|
||||
--ip-permissions "[{\"IpProtocol\":\"tcp\",\"FromPort\":8188,\"ToPort\":8188,\"UserIdGroupPairs\":[{\"GroupId\":\"$ingressGroup\",\"Description\":\"Allow ComfyUI from ingress\"}]}]" | Out-Null
|
||||
} catch {
|
||||
}
|
||||
|
||||
foreach ($group in $gpuGroups) {
|
||||
foreach ($port in 8118, 8188) {
|
||||
try {
|
||||
aws ec2 revoke-security-group-ingress `
|
||||
--group-id $group `
|
||||
--protocol tcp `
|
||||
--port $port `
|
||||
--cidr 0.0.0.0/0 | Out-Null
|
||||
} catch {
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
aws ec2 describe-security-groups `
|
||||
--group-ids $gpuGroups `
|
||||
--query "SecurityGroups[].{GroupId:GroupId,GroupName:GroupName,Ingress:IpPermissions}" `
|
||||
--output json
|
||||
12
infrastructure/desineuron_ingress/rathole-client.toml
Normal file
12
infrastructure/desineuron_ingress/rathole-client.toml
Normal file
@@ -0,0 +1,12 @@
|
||||
[client]
|
||||
remote_addr = "__INGRESS_HOST__:2333"
|
||||
default_token = "__RATHOLE_TOKEN__"
|
||||
|
||||
[client.transport]
|
||||
type = "noise"
|
||||
|
||||
[client.transport.noise]
|
||||
remote_public_key = "__RATHOLE_SERVER_PUBLIC_KEY__"
|
||||
|
||||
[client.services.https_origin]
|
||||
local_addr = "127.0.0.1:443"
|
||||
12
infrastructure/desineuron_ingress/rathole-server.toml
Normal file
12
infrastructure/desineuron_ingress/rathole-server.toml
Normal file
@@ -0,0 +1,12 @@
|
||||
[server]
|
||||
bind_addr = "0.0.0.0:2333"
|
||||
default_token = "__RATHOLE_TOKEN__"
|
||||
|
||||
[server.transport]
|
||||
type = "noise"
|
||||
|
||||
[server.transport.noise]
|
||||
local_private_key = "__RATHOLE_SERVER_PRIVATE_KEY__"
|
||||
|
||||
[server.services.https_origin]
|
||||
bind_addr = "127.0.0.1:8443"
|
||||
110
infrastructure/desineuron_ingress/sync_ingress_home_ip.py
Normal file
110
infrastructure/desineuron_ingress/sync_ingress_home_ip.py
Normal file
@@ -0,0 +1,110 @@
|
||||
#!/usr/bin/env python3
|
||||
import json
|
||||
import os
|
||||
import sys
|
||||
import urllib.request
|
||||
from pathlib import Path
|
||||
|
||||
import boto3
|
||||
|
||||
|
||||
SECURITY_GROUP_ID = os.environ["INGRESS_SECURITY_GROUP_ID"]
|
||||
RULE_DESCRIPTION = os.environ.get("INGRESS_SSH_RULE_DESCRIPTION", "SSH fallback from origin network")
|
||||
PORT = int(os.environ.get("INGRESS_SSH_PORT", "22"))
|
||||
STATE_FILE = Path(os.environ.get("INGRESS_IP_STATE_FILE", "/var/lib/desineuron-ingress-ip-sync/current_ip.txt"))
|
||||
|
||||
|
||||
def get_public_ip() -> str:
|
||||
with urllib.request.urlopen("https://api.ipify.org", timeout=15) as response:
|
||||
return response.read().decode("utf-8").strip()
|
||||
|
||||
|
||||
def get_security_group():
|
||||
ec2 = boto3.client("ec2", region_name=os.environ.get("AWS_REGION", "us-east-1"))
|
||||
response = ec2.describe_security_groups(GroupIds=[SECURITY_GROUP_ID])
|
||||
return ec2, response["SecurityGroups"][0]
|
||||
|
||||
|
||||
def find_existing_ssh_rules(ip_permissions):
|
||||
matches = []
|
||||
for permission in ip_permissions:
|
||||
if permission.get("IpProtocol") != "tcp":
|
||||
continue
|
||||
if permission.get("FromPort") != PORT or permission.get("ToPort") != PORT:
|
||||
continue
|
||||
for ip_range in permission.get("IpRanges", []):
|
||||
if ip_range.get("Description") == RULE_DESCRIPTION:
|
||||
matches.append(ip_range["CidrIp"])
|
||||
return matches
|
||||
|
||||
|
||||
def revoke_old_rules(ec2, cidrs):
|
||||
for cidr in cidrs:
|
||||
ec2.revoke_security_group_ingress(
|
||||
GroupId=SECURITY_GROUP_ID,
|
||||
IpPermissions=[
|
||||
{
|
||||
"IpProtocol": "tcp",
|
||||
"FromPort": PORT,
|
||||
"ToPort": PORT,
|
||||
"IpRanges": [{"CidrIp": cidr}],
|
||||
}
|
||||
],
|
||||
)
|
||||
|
||||
|
||||
def authorize_new_rule(ec2, cidr):
|
||||
ec2.authorize_security_group_ingress(
|
||||
GroupId=SECURITY_GROUP_ID,
|
||||
IpPermissions=[
|
||||
{
|
||||
"IpProtocol": "tcp",
|
||||
"FromPort": PORT,
|
||||
"ToPort": PORT,
|
||||
"IpRanges": [{"CidrIp": cidr, "Description": RULE_DESCRIPTION}],
|
||||
}
|
||||
],
|
||||
)
|
||||
|
||||
|
||||
def write_state(ip: str):
|
||||
STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
|
||||
STATE_FILE.write_text(ip + "\n", encoding="utf-8")
|
||||
|
||||
|
||||
def main() -> int:
|
||||
public_ip = get_public_ip()
|
||||
desired_cidr = f"{public_ip}/32"
|
||||
|
||||
ec2, group = get_security_group()
|
||||
existing_rules = find_existing_ssh_rules(group["IpPermissions"])
|
||||
|
||||
if existing_rules == [desired_cidr]:
|
||||
write_state(public_ip)
|
||||
print(json.dumps({"status": "noop", "public_ip": public_ip, "cidr": desired_cidr}))
|
||||
return 0
|
||||
|
||||
if existing_rules:
|
||||
revoke_old_rules(ec2, existing_rules)
|
||||
|
||||
authorize_new_rule(ec2, desired_cidr)
|
||||
write_state(public_ip)
|
||||
print(
|
||||
json.dumps(
|
||||
{
|
||||
"status": "updated",
|
||||
"public_ip": public_ip,
|
||||
"cidr": desired_cidr,
|
||||
"replaced": existing_rules,
|
||||
}
|
||||
)
|
||||
)
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
try:
|
||||
raise SystemExit(main())
|
||||
except Exception as exc:
|
||||
print(json.dumps({"status": "error", "error": str(exc)}), file=sys.stderr)
|
||||
raise
|
||||
102
infrastructure/desineuron_ingress/user_data.sh
Normal file
102
infrastructure/desineuron_ingress/user_data.sh
Normal file
@@ -0,0 +1,102 @@
|
||||
#!/bin/bash
|
||||
set -euxo pipefail
|
||||
|
||||
exec > >(tee /var/log/desineuron-ingress-bootstrap.log | logger -t user-data -s 2>/dev/console) 2>&1
|
||||
|
||||
dnf update -y
|
||||
dnf install -y curl tar gzip unzip jq policycoreutils-python-utils
|
||||
|
||||
systemctl enable amazon-ssm-agent
|
||||
systemctl restart amazon-ssm-agent
|
||||
|
||||
useradd --system --home /var/lib/caddy --shell /sbin/nologin caddy || true
|
||||
install -d -o caddy -g caddy -m 0755 /etc/caddy /var/lib/caddy /var/log/caddy
|
||||
install -d -m 0755 /etc/rathole /opt/rathole
|
||||
|
||||
cat >/etc/ssh/sshd_config.d/10-desineuron-hardening.conf <<'EOF'
|
||||
PasswordAuthentication no
|
||||
KbdInteractiveAuthentication no
|
||||
PermitRootLogin no
|
||||
PubkeyAuthentication yes
|
||||
EOF
|
||||
systemctl restart sshd
|
||||
|
||||
CADDY_VERSION="v2.10.2"
|
||||
CADDY_URL="https://github.com/caddyserver/caddy/releases/download/${CADDY_VERSION}/caddy_2.10.2_linux_arm64.tar.gz"
|
||||
RATHOLE_VERSION="v0.4.3"
|
||||
RATHOLE_URL="https://github.com/rapiz1/rathole/releases/download/${RATHOLE_VERSION}/rathole-aarch64-unknown-linux-musl.zip"
|
||||
|
||||
tmp_dir="$(mktemp -d)"
|
||||
cd "$tmp_dir"
|
||||
curl -fL "$CADDY_URL" -o caddy.tar.gz
|
||||
tar -xzf caddy.tar.gz
|
||||
install -m 0755 caddy /usr/local/bin/caddy
|
||||
setcap cap_net_bind_service=+ep /usr/local/bin/caddy || true
|
||||
|
||||
curl -fL "$RATHOLE_URL" -o rathole.zip
|
||||
python3 - <<'PY'
|
||||
import zipfile
|
||||
z = zipfile.ZipFile("rathole.zip")
|
||||
z.extractall(".")
|
||||
PY
|
||||
install -m 0755 rathole /usr/local/bin/rathole
|
||||
rm -rf "$tmp_dir"
|
||||
|
||||
cat >/etc/systemd/system/caddy.service <<'EOF'
|
||||
[Unit]
|
||||
Description=Caddy
|
||||
After=network-online.target
|
||||
Wants=network-online.target
|
||||
|
||||
[Service]
|
||||
User=caddy
|
||||
Group=caddy
|
||||
ExecStart=/usr/local/bin/caddy run --environ --config /etc/caddy/Caddyfile
|
||||
ExecReload=/usr/local/bin/caddy reload --config /etc/caddy/Caddyfile
|
||||
TimeoutStopSec=5s
|
||||
LimitNOFILE=1048576
|
||||
PrivateTmp=true
|
||||
ProtectSystem=full
|
||||
AmbientCapabilities=CAP_NET_BIND_SERVICE
|
||||
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
|
||||
NoNewPrivileges=true
|
||||
Restart=on-failure
|
||||
|
||||
[Install]
|
||||
WantedBy=multi-user.target
|
||||
EOF
|
||||
|
||||
cat >/etc/systemd/system/rathole-server.service <<'EOF'
|
||||
[Unit]
|
||||
Description=Desineuron Rathole Server
|
||||
After=network-online.target
|
||||
Wants=network-online.target
|
||||
|
||||
[Service]
|
||||
Type=simple
|
||||
ExecStart=/usr/local/bin/rathole /etc/rathole/server.toml
|
||||
Restart=always
|
||||
RestartSec=5
|
||||
User=root
|
||||
|
||||
[Install]
|
||||
WantedBy=multi-user.target
|
||||
EOF
|
||||
|
||||
cat >/etc/logrotate.d/caddy <<'EOF'
|
||||
/var/log/caddy/*.log {
|
||||
daily
|
||||
rotate 14
|
||||
compress
|
||||
missingok
|
||||
notifempty
|
||||
copytruncate
|
||||
}
|
||||
EOF
|
||||
|
||||
touch /etc/caddy/Caddyfile
|
||||
touch /etc/rathole/server.toml
|
||||
|
||||
systemctl daemon-reload
|
||||
systemctl enable caddy.service
|
||||
systemctl enable rathole-server.service
|
||||
37
infrastructure/ops_control_plane/.env.example
Normal file
37
infrastructure/ops_control_plane/.env.example
Normal file
@@ -0,0 +1,37 @@
|
||||
OPS_DB_NAME=desineuron_ops
|
||||
OPS_DB_USER=desineuron_ops
|
||||
OPS_DB_PASSWORD=change-me
|
||||
OPS_DATABASE_URL=postgresql+psycopg://desineuron_ops:change-me@ops-db:5432/desineuron_ops
|
||||
OPS_SESSION_SECRET=change-me
|
||||
OPS_ADMIN_USERNAME=sagnik
|
||||
OPS_ADMIN_PASSWORD=change-me
|
||||
OPS_TEAM_USERS_JSON=[]
|
||||
OPS_DEFAULT_REGION=us-east-1
|
||||
OPS_VISIBLE_REGIONS=us-east-1,ap-south-1,eu-west-1
|
||||
OPS_BUCKET_NAME=
|
||||
OPS_BUCKET_REGION=us-east-1
|
||||
OPS_SSH_KEY_PATH=/app/state/desineuron-l4-node.pem
|
||||
OPS_GPU_SSH_USER=ubuntu
|
||||
OPS_INGRESS_SSH_HOST=98.87.120.120
|
||||
OPS_INGRESS_SSH_USER=ec2-user
|
||||
OPS_INGRESS_PRIVATE_IP=172.31.41.26
|
||||
OPS_INGRESS_SSH_PORT=22
|
||||
OPS_LINUX_PUBLIC_BASE_URL=https://ops.desineuron.in
|
||||
OPS_PRICE_EBS_GP3_PER_GB_MONTH=0.08
|
||||
OPS_PRICE_PUBLIC_IPV4_PER_HOUR=0.005
|
||||
OPS_ALLOWED_MACHINE_IDS=i-094df09acafb72494,i-0e4eab5fe67cf9abe
|
||||
OPS_GPU_SUBNET_ID=subnet-03d684ed15f327151
|
||||
OPS_GPU_SECURITY_GROUP_IDS=sg-05e4de3fe94ad6558,sg-0b144c17b1b89f4c6
|
||||
OPS_GPU_KEY_NAME=desineuron-l4-node
|
||||
OPS_GPU_AMI_ID=ami-0016081b488c7376d
|
||||
OPS_GPU_INSTANCE_PROFILE=Synapse-Training-Profile
|
||||
OPS_GPU_ROOT_VOLUME_GB=300
|
||||
OPS_GPU_WORKER_SCRIPT_PATH=/app/ops_control_plane/worker.py
|
||||
OPS_CSV_EXPORT_DIR=/app/exports
|
||||
OPS_LOG_DIR=/app/logs
|
||||
OPS_STATE_DIR=/app/state
|
||||
OPS_MODEL_LIBRARY_HOST_PATH=/mnt/ServerStorage/ai-models/models
|
||||
OPS_MODEL_LIBRARY_ROOT=/model-library
|
||||
OPS_INGRESS_ROUTE_HELPER=/usr/local/bin/manage_desineuron_routes.py
|
||||
OPS_CLOUDFLARE_ZONE_NAME=desineuron.in
|
||||
OPS_CLOUDFLARE_API_TOKEN=
|
||||
78
infrastructure/ops_control_plane/README.md
Normal file
78
infrastructure/ops_control_plane/README.md
Normal file
@@ -0,0 +1,78 @@
|
||||
# Desineuron Ops Control Plane
|
||||
|
||||
Internal Linux-hosted control surface for:
|
||||
|
||||
- AWS machine lifecycle
|
||||
- S3-backed model ingest with generated manifests and checksums
|
||||
- model hydration from S3
|
||||
- runtime and estimated cost tracking
|
||||
- ingress route management
|
||||
- session logging and CSV export
|
||||
|
||||
Main deployment target:
|
||||
|
||||
- Linux box at `192.168.1.4`
|
||||
|
||||
Primary public route:
|
||||
|
||||
- `ops.desineuron.in`
|
||||
|
||||
Canonical S3 bucket:
|
||||
|
||||
- `desineuron-ops-control-plane-819079556187-us-east-1`
|
||||
|
||||
Related AWS nodes:
|
||||
|
||||
- ingress: `i-094df09acafb72494`
|
||||
- current GPU worker: `i-0e4eab5fe67cf9abe`
|
||||
|
||||
Core runtime:
|
||||
|
||||
- FastAPI web + API surface
|
||||
- background worker
|
||||
- PostgreSQL
|
||||
- Docker Compose
|
||||
- systemd wrapper on Linux
|
||||
|
||||
Key files:
|
||||
|
||||
- `docker-compose.yml`
|
||||
- `.env.example`
|
||||
- `app/ops_control_plane/main.py`
|
||||
- `app/ops_control_plane/worker.py`
|
||||
- `app/ops_control_plane/cli.py`
|
||||
- `manage_desineuron_routes.py`
|
||||
- `install_linux_ops_control_plane.sh`
|
||||
|
||||
Runtime paths on Linux:
|
||||
|
||||
- stack root: `/opt/desineuron-ops-control-plane`
|
||||
- env file: `/opt/desineuron-ops-control-plane/.env`
|
||||
- exports: `/opt/desineuron-ops-control-plane/exports`
|
||||
- state: `/opt/desineuron-ops-control-plane/state`
|
||||
|
||||
Access:
|
||||
|
||||
- login route: `https://ops.desineuron.in/login`
|
||||
- operator logins are provisioned as email-style usernames
|
||||
- admin password is stored in the protected `.env` file on Linux and should be retrieved locally rather than copied into repo notes
|
||||
|
||||
Validated live behaviors:
|
||||
|
||||
- market pricing API returns live on-demand and spot views
|
||||
- session and cost tracking persist in PostgreSQL and export to CSV
|
||||
- spot launch failures are recorded cleanly instead of crashing the UI
|
||||
- on-demand GPU launch was validated with a `g6.xlarge` lifecycle test
|
||||
- managed ingress route upsert/delete was validated through the helper on the `t4g.micro` ingress
|
||||
- model ingest from Linux model library to S3 was validated with `ops-smoke-model`, including manifest generation and catalog registration
|
||||
|
||||
Operator retrieval commands:
|
||||
|
||||
- admin password:
|
||||
- `sudo sed -n 's/^OPS_ADMIN_PASSWORD=//p' /opt/desineuron-ops-control-plane/.env`
|
||||
- latest CSV export:
|
||||
- `ls -lah /opt/desineuron-ops-control-plane/exports`
|
||||
|
||||
Installer safety note:
|
||||
|
||||
- `install_linux_ops_control_plane.sh` intentionally excludes runtime directories (`data/`, `exports/`, `logs/`, `state/`, `.env`) from code sync so redeploys do not corrupt Postgres state or overwrite secrets
|
||||
16
infrastructure/ops_control_plane/app/Dockerfile
Normal file
16
infrastructure/ops_control_plane/app/Dockerfile
Normal file
@@ -0,0 +1,16 @@
|
||||
FROM python:3.12-slim
|
||||
|
||||
ENV PYTHONDONTWRITEBYTECODE=1
|
||||
ENV PYTHONUNBUFFERED=1
|
||||
|
||||
WORKDIR /app
|
||||
|
||||
COPY requirements.txt /app/requirements.txt
|
||||
RUN apt-get update \
|
||||
&& apt-get install -y --no-install-recommends openssh-client curl ca-certificates \
|
||||
&& rm -rf /var/lib/apt/lists/*
|
||||
RUN pip install --no-cache-dir -r /app/requirements.txt
|
||||
|
||||
COPY ops_control_plane /app/ops_control_plane
|
||||
|
||||
CMD ["python", "-m", "ops_control_plane.main"]
|
||||
@@ -0,0 +1 @@
|
||||
__all__ = ["main"]
|
||||
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
@@ -0,0 +1,549 @@
|
||||
from __future__ import annotations
|
||||
|
||||
import csv
|
||||
import hashlib
|
||||
import io
|
||||
import json
|
||||
import shlex
|
||||
import subprocess
|
||||
from pathlib import Path
|
||||
from collections.abc import Iterable
|
||||
from datetime import datetime, timezone
|
||||
|
||||
import boto3
|
||||
from botocore.exceptions import ClientError
|
||||
from sqlalchemy import select
|
||||
from sqlalchemy.orm import Session
|
||||
|
||||
from .config import settings
|
||||
from .models import AuditEvent, Machine, MachineModelCache, MachineProfile, MarketSnapshot, ModelCatalog, RouteBinding, Session as RuntimeSession, SessionCost
|
||||
|
||||
|
||||
REGION_LOCATION_MAP = {
|
||||
"us-east-1": "US East (N. Virginia)",
|
||||
"ap-south-1": "Asia Pacific (Mumbai)",
|
||||
"eu-west-1": "EU (Ireland)",
|
||||
}
|
||||
|
||||
ON_DEMAND_PRICE_FALLBACKS = {
|
||||
("us-east-1", "t4g.micro"): 0.0084,
|
||||
}
|
||||
|
||||
|
||||
def utcnow() -> datetime:
|
||||
return datetime.now(timezone.utc)
|
||||
|
||||
|
||||
def ec2_client(region: str):
|
||||
return boto3.client("ec2", region_name=region)
|
||||
|
||||
|
||||
def pricing_client():
|
||||
return boto3.client("pricing", region_name="us-east-1")
|
||||
|
||||
|
||||
def s3_client(region: str | None = None):
|
||||
return boto3.client("s3", region_name=region or settings.bucket_region)
|
||||
|
||||
|
||||
def ensure_bucket(bucket_name: str, region: str) -> None:
|
||||
client = s3_client(region)
|
||||
try:
|
||||
client.head_bucket(Bucket=bucket_name)
|
||||
except ClientError as exc:
|
||||
code = exc.response.get("Error", {}).get("Code", "")
|
||||
if code in {"404", "NoSuchBucket", "NotFound"}:
|
||||
if region == "us-east-1":
|
||||
client.create_bucket(Bucket=bucket_name)
|
||||
else:
|
||||
client.create_bucket(
|
||||
Bucket=bucket_name,
|
||||
CreateBucketConfiguration={"LocationConstraint": region},
|
||||
)
|
||||
elif code not in {"301", "403"}:
|
||||
raise
|
||||
client.put_bucket_versioning(Bucket=bucket_name, VersioningConfiguration={"Status": "Enabled"})
|
||||
client.put_bucket_encryption(
|
||||
Bucket=bucket_name,
|
||||
ServerSideEncryptionConfiguration={
|
||||
"Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
|
||||
},
|
||||
)
|
||||
|
||||
|
||||
def seed_bucket_prefixes(bucket_name: str) -> None:
|
||||
client = s3_client()
|
||||
for prefix in [
|
||||
"models/",
|
||||
"workflows/",
|
||||
"references/",
|
||||
"outputs/",
|
||||
"manifests/",
|
||||
"bootstrap/",
|
||||
]:
|
||||
client.put_object(Bucket=bucket_name, Key=prefix)
|
||||
|
||||
|
||||
def resolve_model_source_dir(source_relative_path: str) -> Path:
|
||||
source = (settings.model_library_root / source_relative_path).resolve()
|
||||
root = settings.model_library_root.resolve()
|
||||
if root not in source.parents and source != root:
|
||||
raise ValueError("Model source path escapes configured model library root")
|
||||
if not source.exists() or not source.is_dir():
|
||||
raise FileNotFoundError(f"Model source directory not found: {source}")
|
||||
return source
|
||||
|
||||
|
||||
def build_model_manifest(source_dir: Path) -> dict:
|
||||
files: list[dict] = []
|
||||
total_size = 0
|
||||
for path in sorted(p for p in source_dir.rglob("*") if p.is_file()):
|
||||
rel = path.relative_to(source_dir).as_posix()
|
||||
sha256 = hashlib.sha256()
|
||||
with path.open("rb") as handle:
|
||||
for chunk in iter(lambda: handle.read(1024 * 1024), b""):
|
||||
sha256.update(chunk)
|
||||
size_bytes = path.stat().st_size
|
||||
total_size += size_bytes
|
||||
files.append({"path": rel, "sha256": sha256.hexdigest(), "size_bytes": size_bytes})
|
||||
return {
|
||||
"generated_at": utcnow().isoformat(),
|
||||
"file_count": len(files),
|
||||
"total_size_bytes": total_size,
|
||||
"files": files,
|
||||
}
|
||||
|
||||
|
||||
def upload_model_directory(bucket_name: str, model_key: str, source_relative_path: str, label: str, workload_tags: list[str] | None = None, compatibility_tags: list[str] | None = None) -> dict:
|
||||
source_dir = resolve_model_source_dir(source_relative_path)
|
||||
manifest = build_model_manifest(source_dir)
|
||||
client = s3_client()
|
||||
s3_prefix = f"models/{model_key}/"
|
||||
for file_entry in manifest["files"]:
|
||||
local_path = source_dir / Path(file_entry["path"])
|
||||
client.upload_file(str(local_path), bucket_name, s3_prefix + file_entry["path"])
|
||||
manifest_key = f"manifests/models/{model_key}.json"
|
||||
client.put_object(
|
||||
Bucket=bucket_name,
|
||||
Key=manifest_key,
|
||||
Body=json.dumps(manifest, indent=2).encode("utf-8"),
|
||||
ContentType="application/json",
|
||||
)
|
||||
return {
|
||||
"model_key": model_key,
|
||||
"label": label,
|
||||
"source_dir": str(source_dir),
|
||||
"s3_prefix": s3_prefix,
|
||||
"manifest_key": manifest_key,
|
||||
"manifest": manifest,
|
||||
"workload_tags": workload_tags or [],
|
||||
"compatibility_tags": compatibility_tags or [],
|
||||
}
|
||||
|
||||
|
||||
def fetch_on_demand_price(region: str, instance_type: str) -> float | None:
|
||||
location = REGION_LOCATION_MAP.get(region)
|
||||
if not location:
|
||||
return None
|
||||
response = pricing_client().get_products(
|
||||
ServiceCode="AmazonEC2",
|
||||
Filters=[
|
||||
{"Type": "TERM_MATCH", "Field": "instanceType", "Value": instance_type},
|
||||
{"Type": "TERM_MATCH", "Field": "location", "Value": location},
|
||||
{"Type": "TERM_MATCH", "Field": "operatingSystem", "Value": "Linux"},
|
||||
{"Type": "TERM_MATCH", "Field": "tenancy", "Value": "Shared"},
|
||||
{"Type": "TERM_MATCH", "Field": "preInstalledSw", "Value": "NA"},
|
||||
{"Type": "TERM_MATCH", "Field": "capacitystatus", "Value": "Used"},
|
||||
],
|
||||
MaxResults=1,
|
||||
)
|
||||
for price_item in response.get("PriceList", []):
|
||||
item = json.loads(price_item)
|
||||
terms = item.get("terms", {}).get("OnDemand", {})
|
||||
for term in terms.values():
|
||||
for dimension in term.get("priceDimensions", {}).values():
|
||||
price = dimension.get("pricePerUnit", {}).get("USD")
|
||||
if price:
|
||||
return float(price)
|
||||
return ON_DEMAND_PRICE_FALLBACKS.get((region, instance_type))
|
||||
|
||||
|
||||
def refresh_market_snapshots(db: Session, regions: Iterable[str], profile_rows: Iterable[MachineProfile]) -> None:
|
||||
seen: set[tuple[str, str]] = set()
|
||||
for profile in profile_rows:
|
||||
seen.add((profile.region, profile.instance_type))
|
||||
|
||||
for region in regions:
|
||||
region_profiles = [p for p in profile_rows if p.region == region]
|
||||
instance_types = {p.instance_type for p in region_profiles}
|
||||
if not instance_types:
|
||||
continue
|
||||
ec2 = ec2_client(region)
|
||||
offerings = ec2.describe_instance_type_offerings(
|
||||
LocationType="region",
|
||||
Filters=[{"Name": "instance-type", "Values": sorted(instance_types)}],
|
||||
)["InstanceTypeOfferings"]
|
||||
available = {item["InstanceType"] for item in offerings}
|
||||
for instance_type in instance_types:
|
||||
on_demand_price = fetch_on_demand_price(region, instance_type)
|
||||
db.add(
|
||||
MarketSnapshot(
|
||||
region=region,
|
||||
instance_type=instance_type,
|
||||
lifecycle="on-demand",
|
||||
offering_available=instance_type in available,
|
||||
hourly_price_usd=on_demand_price,
|
||||
raw_payload={"instance_type": instance_type, "region": region},
|
||||
)
|
||||
)
|
||||
try:
|
||||
spot_history = ec2.describe_spot_price_history(
|
||||
InstanceTypes=[instance_type],
|
||||
ProductDescriptions=["Linux/UNIX"],
|
||||
StartTime=utcnow(),
|
||||
MaxResults=1,
|
||||
)["SpotPriceHistory"]
|
||||
spot_price = float(spot_history[0]["SpotPrice"]) if spot_history else None
|
||||
except ClientError:
|
||||
spot_price = None
|
||||
db.add(
|
||||
MarketSnapshot(
|
||||
region=region,
|
||||
instance_type=instance_type,
|
||||
lifecycle="spot",
|
||||
offering_available=instance_type in available and spot_price is not None,
|
||||
hourly_price_usd=spot_price,
|
||||
raw_payload={"instance_type": instance_type, "region": region},
|
||||
)
|
||||
)
|
||||
|
||||
|
||||
def latest_market_price(db: Session, region: str, instance_type: str, lifecycle: str) -> float:
|
||||
row = db.scalar(
|
||||
select(MarketSnapshot)
|
||||
.where(
|
||||
MarketSnapshot.region == region,
|
||||
MarketSnapshot.instance_type == instance_type,
|
||||
MarketSnapshot.lifecycle == lifecycle,
|
||||
)
|
||||
.order_by(MarketSnapshot.observed_at.desc())
|
||||
)
|
||||
return row.hourly_price_usd if row and row.hourly_price_usd is not None else 0.0
|
||||
|
||||
|
||||
def sync_instances(db: Session, regions: Iterable[str]) -> None:
|
||||
for region in regions:
|
||||
ec2 = ec2_client(region)
|
||||
reservations = ec2.describe_instances()["Reservations"]
|
||||
for reservation in reservations:
|
||||
for instance in reservation["Instances"]:
|
||||
instance_id = instance["InstanceId"]
|
||||
launch_time = instance.get("LaunchTime")
|
||||
if launch_time and launch_time.tzinfo is None:
|
||||
launch_time = launch_time.replace(tzinfo=timezone.utc)
|
||||
public_ip = instance.get("PublicIpAddress")
|
||||
private_ip = instance.get("PrivateIpAddress")
|
||||
state_name = instance["State"]["Name"]
|
||||
volume_size = 0
|
||||
if instance.get("BlockDeviceMappings"):
|
||||
try:
|
||||
volume_ids = [b["Ebs"]["VolumeId"] for b in instance["BlockDeviceMappings"] if "Ebs" in b]
|
||||
if volume_ids:
|
||||
volumes = ec2.describe_volumes(VolumeIds=volume_ids)["Volumes"]
|
||||
volume_size = sum(v.get("Size", 0) for v in volumes)
|
||||
except ClientError:
|
||||
volume_size = 0
|
||||
existing = db.scalar(select(Machine).where(Machine.aws_instance_id == instance_id))
|
||||
tags = {tag["Key"]: tag["Value"] for tag in instance.get("Tags", [])}
|
||||
payload = {
|
||||
"key_name": instance.get("KeyName"),
|
||||
"subnet_id": instance.get("SubnetId"),
|
||||
"security_groups": instance.get("SecurityGroups", []),
|
||||
"image_id": instance.get("ImageId"),
|
||||
"iam_instance_profile": instance.get("IamInstanceProfile", {}).get("Arn"),
|
||||
"availability_zone": instance.get("Placement", {}).get("AvailabilityZone"),
|
||||
"public_dns": instance.get("PublicDnsName"),
|
||||
}
|
||||
if existing:
|
||||
existing.name = tags.get("Name", instance_id)
|
||||
existing.region = region
|
||||
existing.instance_type = instance["InstanceType"]
|
||||
existing.lifecycle = instance.get("InstanceLifecycle", "on-demand")
|
||||
existing.state = state_name
|
||||
existing.public_ip = public_ip
|
||||
existing.private_ip = private_ip
|
||||
existing.launch_time = launch_time
|
||||
existing.volume_gb = volume_size
|
||||
existing.public_ipv4_attached = bool(public_ip)
|
||||
existing.details = payload
|
||||
else:
|
||||
db.add(
|
||||
Machine(
|
||||
aws_instance_id=instance_id,
|
||||
name=tags.get("Name", instance_id),
|
||||
region=region,
|
||||
profile_name=tags.get("DesineuronProfile"),
|
||||
instance_type=instance["InstanceType"],
|
||||
lifecycle=instance.get("InstanceLifecycle", "on-demand"),
|
||||
state=state_name,
|
||||
public_ip=public_ip,
|
||||
private_ip=private_ip,
|
||||
launch_time=launch_time,
|
||||
volume_gb=volume_size,
|
||||
public_ipv4_attached=bool(public_ip),
|
||||
details=payload,
|
||||
)
|
||||
)
|
||||
|
||||
|
||||
def candidate_subnet_ids(region: str, preferred_subnet_id: str) -> list[str]:
|
||||
if not preferred_subnet_id:
|
||||
return []
|
||||
ec2 = ec2_client(region)
|
||||
subnet_response = ec2.describe_subnets(SubnetIds=[preferred_subnet_id])["Subnets"]
|
||||
if not subnet_response:
|
||||
return [preferred_subnet_id]
|
||||
preferred = subnet_response[0]
|
||||
vpc_id = preferred["VpcId"]
|
||||
subnets = ec2.describe_subnets(
|
||||
Filters=[
|
||||
{"Name": "vpc-id", "Values": [vpc_id]},
|
||||
{"Name": "state", "Values": ["available"]},
|
||||
]
|
||||
)["Subnets"]
|
||||
ranked: list[tuple[int, str, str]] = []
|
||||
for subnet in subnets:
|
||||
subnet_id = subnet["SubnetId"]
|
||||
az = subnet.get("AvailabilityZone", "")
|
||||
score = 2
|
||||
if subnet_id == preferred_subnet_id:
|
||||
score = 0
|
||||
elif subnet.get("MapPublicIpOnLaunch"):
|
||||
score = 1
|
||||
ranked.append((score, az, subnet_id))
|
||||
return [subnet_id for _, _, subnet_id in sorted(ranked)]
|
||||
|
||||
|
||||
def calculate_machine_cost(machine: Machine, hourly_rate: float) -> dict:
|
||||
if not machine.launch_time:
|
||||
runtime_hours = 0.0
|
||||
else:
|
||||
runtime_hours = max((utcnow() - machine.launch_time).total_seconds() / 3600.0, 0.0)
|
||||
compute_cost = runtime_hours * hourly_rate
|
||||
storage_hourly = (machine.volume_gb * settings.ebs_gp3_per_gb_month) / 730.0
|
||||
storage_cost = runtime_hours * storage_hourly
|
||||
public_ip_cost = runtime_hours * settings.public_ipv4_per_hour if machine.public_ipv4_attached else 0.0
|
||||
return {
|
||||
"runtime_hours": round(runtime_hours, 3),
|
||||
"compute_cost_usd": round(compute_cost, 4),
|
||||
"storage_cost_usd": round(storage_cost, 4),
|
||||
"public_ip_cost_usd": round(public_ip_cost, 4),
|
||||
"total_cost_usd": round(compute_cost + storage_cost + public_ip_cost, 4),
|
||||
"hourly_price_usd": round(hourly_rate + storage_hourly + (settings.public_ipv4_per_hour if machine.public_ipv4_attached else 0.0), 4),
|
||||
}
|
||||
|
||||
|
||||
def upsert_session_cost(db: Session, session_row: RuntimeSession, machine: Machine) -> None:
|
||||
hourly_rate = latest_market_price(db, machine.region, machine.instance_type, machine.lifecycle or "on-demand")
|
||||
cost_payload = calculate_machine_cost(machine, hourly_rate)
|
||||
record = db.scalar(
|
||||
select(SessionCost).where(SessionCost.session_id == session_row.id).order_by(SessionCost.calculated_at.desc())
|
||||
)
|
||||
if record:
|
||||
record.runtime_hours = cost_payload["runtime_hours"]
|
||||
record.compute_cost_usd = cost_payload["compute_cost_usd"]
|
||||
record.storage_cost_usd = cost_payload["storage_cost_usd"]
|
||||
record.public_ip_cost_usd = cost_payload["public_ip_cost_usd"]
|
||||
record.total_cost_usd = cost_payload["total_cost_usd"]
|
||||
record.calculated_at = utcnow()
|
||||
else:
|
||||
db.add(SessionCost(session_id=session_row.id, **cost_payload))
|
||||
|
||||
|
||||
def create_managed_instance(db: Session, profile: MachineProfile, actor: str, lifecycle: str) -> RuntimeSession:
|
||||
ec2 = ec2_client(profile.region)
|
||||
launch_config = profile.launch_config
|
||||
base_run_args = {
|
||||
"ImageId": launch_config["ami_id"],
|
||||
"InstanceType": profile.instance_type,
|
||||
"SecurityGroupIds": launch_config["security_group_ids"],
|
||||
"KeyName": launch_config["key_name"],
|
||||
"IamInstanceProfile": {"Name": launch_config["instance_profile"]},
|
||||
"MinCount": 1,
|
||||
"MaxCount": 1,
|
||||
"BlockDeviceMappings": [
|
||||
{
|
||||
"DeviceName": "/dev/sda1",
|
||||
"Ebs": {
|
||||
"VolumeSize": int(launch_config.get("root_volume_gb", settings.gpu_root_volume_gb)),
|
||||
"VolumeType": "gp3",
|
||||
"DeleteOnTermination": True,
|
||||
},
|
||||
}
|
||||
],
|
||||
"TagSpecifications": [
|
||||
{
|
||||
"ResourceType": "instance",
|
||||
"Tags": [
|
||||
{"Key": "Name", "Value": f"desineuron-{profile.name}-{int(utcnow().timestamp())}"},
|
||||
{"Key": "ManagedBy", "Value": "DesineuronOps"},
|
||||
{"Key": "DesineuronProfile", "Value": profile.name},
|
||||
],
|
||||
}
|
||||
],
|
||||
}
|
||||
if lifecycle == "spot":
|
||||
base_run_args["InstanceMarketOptions"] = {
|
||||
"MarketType": "spot",
|
||||
"SpotOptions": {"SpotInstanceType": "one-time", "InstanceInterruptionBehavior": "terminate"},
|
||||
}
|
||||
subnet_ids = candidate_subnet_ids(profile.region, launch_config["subnet_id"]) or [launch_config["subnet_id"]]
|
||||
last_exc: Exception | None = None
|
||||
response = None
|
||||
chosen_subnet = launch_config["subnet_id"]
|
||||
for subnet_id in subnet_ids:
|
||||
run_args = dict(base_run_args)
|
||||
run_args["SubnetId"] = subnet_id
|
||||
try:
|
||||
response = ec2.run_instances(**run_args)
|
||||
chosen_subnet = subnet_id
|
||||
break
|
||||
except ClientError as exc:
|
||||
last_exc = exc
|
||||
error_code = exc.response.get("Error", {}).get("Code")
|
||||
if error_code not in {"InsufficientInstanceCapacity", "MaxSpotInstanceCountExceeded", "Unsupported"}:
|
||||
raise
|
||||
continue
|
||||
if response is None:
|
||||
assert last_exc is not None
|
||||
raise last_exc
|
||||
instance = response["Instances"][0]
|
||||
machine = Machine(
|
||||
aws_instance_id=instance["InstanceId"],
|
||||
name=f"desineuron-{profile.name}",
|
||||
region=profile.region,
|
||||
profile_name=profile.name,
|
||||
instance_type=profile.instance_type,
|
||||
lifecycle=lifecycle,
|
||||
state=instance["State"]["Name"],
|
||||
public_ip=instance.get("PublicIpAddress"),
|
||||
private_ip=instance.get("PrivateIpAddress"),
|
||||
launch_time=instance.get("LaunchTime"),
|
||||
volume_gb=int(launch_config.get("root_volume_gb", settings.gpu_root_volume_gb)),
|
||||
public_ipv4_attached=True,
|
||||
details={"launched_by": actor, "chosen_subnet_id": chosen_subnet},
|
||||
)
|
||||
db.add(machine)
|
||||
db.flush()
|
||||
session_row = RuntimeSession(machine_id=machine.id, actor=actor, workload_name=profile.name, status="active")
|
||||
db.add(session_row)
|
||||
db.add(AuditEvent(actor=actor, action="launch_machine", entity_type="machine", entity_id=machine.aws_instance_id, payload={"profile": profile.name, "lifecycle": lifecycle}))
|
||||
return session_row
|
||||
|
||||
|
||||
def stop_machine(db: Session, machine: Machine, actor: str) -> None:
|
||||
ec2 = ec2_client(machine.region)
|
||||
ec2.stop_instances(InstanceIds=[machine.aws_instance_id])
|
||||
machine.state = "stopping"
|
||||
db.add(AuditEvent(actor=actor, action="stop_machine", entity_type="machine", entity_id=machine.aws_instance_id, payload={}))
|
||||
|
||||
|
||||
def terminate_machine(db: Session, machine: Machine, actor: str) -> None:
|
||||
ec2 = ec2_client(machine.region)
|
||||
ec2.terminate_instances(InstanceIds=[machine.aws_instance_id])
|
||||
machine.state = "shutting-down"
|
||||
db.add(AuditEvent(actor=actor, action="terminate_machine", entity_type="machine", entity_id=machine.aws_instance_id, payload={}))
|
||||
|
||||
|
||||
def ssh_run(host: str, user: str, command: str) -> subprocess.CompletedProcess[str]:
|
||||
return subprocess.run(
|
||||
[
|
||||
"ssh",
|
||||
"-o",
|
||||
"StrictHostKeyChecking=no",
|
||||
"-o",
|
||||
"UserKnownHostsFile=NUL",
|
||||
"-i",
|
||||
str(settings.ssh_key_path),
|
||||
f"{user}@{host}",
|
||||
command,
|
||||
],
|
||||
capture_output=True,
|
||||
text=True,
|
||||
check=False,
|
||||
)
|
||||
|
||||
|
||||
def hydrate_model(machine: Machine, model_prefix: str, actor: str, bucket_name: str) -> dict:
|
||||
if not machine.public_ip:
|
||||
raise RuntimeError("Machine has no public IP for hydration")
|
||||
install_cmd = (
|
||||
"command -v s5cmd >/dev/null 2>&1 || "
|
||||
"curl -L https://github.com/peak/s5cmd/releases/download/v2.3.0/s5cmd_2.3.0_Linux-64bit.tar.gz "
|
||||
"| tar -xz -C /tmp && sudo mv /tmp/s5cmd /usr/local/bin/s5cmd"
|
||||
)
|
||||
ssh_run(machine.public_ip, settings.gpu_ssh_user, install_cmd)
|
||||
remote_dir = f"/opt/dlami/nvme/models/{model_prefix.split('/')[-2]}"
|
||||
copy_cmd = (
|
||||
f"mkdir -p {remote_dir} && "
|
||||
f"s5cmd cp 's3://{bucket_name}/{model_prefix}*' '{remote_dir}/'"
|
||||
)
|
||||
result = ssh_run(machine.public_ip, settings.gpu_ssh_user, copy_cmd)
|
||||
verify_result = None
|
||||
manifest_key = f"manifests/models/{model_prefix.rstrip('/').split('/')[-1]}.json"
|
||||
try:
|
||||
manifest_obj = s3_client().get_object(Bucket=bucket_name, Key=manifest_key)
|
||||
manifest = json.loads(manifest_obj["Body"].read().decode("utf-8"))
|
||||
checks = " && ".join(
|
||||
f"test -f {shlex.quote(remote_dir + '/' + entry['path'])}"
|
||||
for entry in manifest.get("files", [])
|
||||
) or "true"
|
||||
verify = ssh_run(machine.public_ip, settings.gpu_ssh_user, checks)
|
||||
verify_result = {"stdout": verify.stdout, "stderr": verify.stderr, "returncode": verify.returncode}
|
||||
except ClientError:
|
||||
verify_result = {"stdout": "", "stderr": "manifest_missing", "returncode": 1}
|
||||
return {
|
||||
"stdout": result.stdout,
|
||||
"stderr": result.stderr,
|
||||
"returncode": result.returncode,
|
||||
"remote_dir": remote_dir,
|
||||
"verify": verify_result,
|
||||
}
|
||||
|
||||
|
||||
def start_service(machine: Machine, service_name: str) -> dict:
|
||||
if not machine.public_ip:
|
||||
raise RuntimeError("Machine has no public IP")
|
||||
result = ssh_run(machine.public_ip, settings.gpu_ssh_user, f"sudo systemctl start {service_name} && sudo systemctl is-active {service_name}")
|
||||
return {"stdout": result.stdout, "stderr": result.stderr, "returncode": result.returncode}
|
||||
|
||||
|
||||
def stop_service(machine: Machine, service_name: str) -> dict:
|
||||
if not machine.public_ip:
|
||||
raise RuntimeError("Machine has no public IP")
|
||||
result = ssh_run(machine.public_ip, settings.gpu_ssh_user, f"sudo systemctl stop {service_name}")
|
||||
return {"stdout": result.stdout, "stderr": result.stderr, "returncode": result.returncode}
|
||||
|
||||
|
||||
def export_sessions_csv(db: Session, target_path: str) -> str:
|
||||
rows = db.execute(
|
||||
select(
|
||||
RuntimeSession.id,
|
||||
RuntimeSession.actor,
|
||||
RuntimeSession.workload_name,
|
||||
RuntimeSession.status,
|
||||
RuntimeSession.started_at,
|
||||
RuntimeSession.ended_at,
|
||||
SessionCost.runtime_hours,
|
||||
SessionCost.compute_cost_usd,
|
||||
SessionCost.storage_cost_usd,
|
||||
SessionCost.public_ip_cost_usd,
|
||||
SessionCost.total_cost_usd,
|
||||
).join(SessionCost, SessionCost.session_id == RuntimeSession.id, isouter=True)
|
||||
)
|
||||
with open(target_path, "w", newline="", encoding="utf-8") as handle:
|
||||
writer = csv.writer(handle)
|
||||
writer.writerow(["session_id", "actor", "workload", "status", "started_at", "ended_at", "runtime_hours", "compute_cost_usd", "storage_cost_usd", "public_ip_cost_usd", "total_cost_usd"])
|
||||
for row in rows:
|
||||
writer.writerow(row)
|
||||
return target_path
|
||||
@@ -0,0 +1,79 @@
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
from pathlib import Path
|
||||
|
||||
import typer
|
||||
from sqlalchemy import select
|
||||
|
||||
from .aws_control import calculate_machine_cost, create_managed_instance, export_sessions_csv, latest_market_price, stop_machine, terminate_machine
|
||||
from .database import Base, engine, session_scope
|
||||
from .models import AuditEvent, Machine, MachineProfile, Session as RuntimeSession
|
||||
|
||||
|
||||
app = typer.Typer(help="Desineuron Ops CLI")
|
||||
|
||||
|
||||
@app.command("machine-list")
|
||||
def machine_list():
|
||||
with session_scope() as db:
|
||||
machines = db.scalars(select(Machine).order_by(Machine.updated_at.desc())).all()
|
||||
for machine in machines:
|
||||
hourly_rate = latest_market_price(db, machine.region, machine.instance_type, machine.lifecycle)
|
||||
cost = calculate_machine_cost(machine, hourly_rate)
|
||||
typer.echo(f"{machine.aws_instance_id} {machine.instance_type} {machine.state} ${cost['total_cost_usd']:.4f}")
|
||||
|
||||
|
||||
@app.command("machine-launch")
|
||||
def machine_launch(profile_name: str, lifecycle: str = "spot", actor: str = "cli"):
|
||||
with session_scope() as db:
|
||||
profile = db.scalar(select(MachineProfile).where(MachineProfile.name == profile_name))
|
||||
if not profile:
|
||||
raise typer.BadParameter(f"Unknown profile: {profile_name}")
|
||||
session_row = create_managed_instance(db, profile, actor, lifecycle)
|
||||
typer.echo(json.dumps({"session_id": session_row.id, "profile": profile_name, "lifecycle": lifecycle}))
|
||||
|
||||
|
||||
@app.command("machine-stop")
|
||||
def machine_stop(machine_id: str, actor: str = "cli"):
|
||||
with session_scope() as db:
|
||||
machine = db.scalar(select(Machine).where(Machine.aws_instance_id == machine_id))
|
||||
if not machine:
|
||||
raise typer.BadParameter(f"Unknown machine: {machine_id}")
|
||||
stop_machine(db, machine, actor)
|
||||
active_session = db.scalar(select(RuntimeSession).where(RuntimeSession.machine_id == machine.id, RuntimeSession.status == "active"))
|
||||
if active_session:
|
||||
active_session.status = "stopped"
|
||||
typer.echo(json.dumps({"machine": machine_id, "status": "stopping"}))
|
||||
|
||||
|
||||
@app.command("machine-terminate")
|
||||
def machine_terminate(machine_id: str, actor: str = "cli"):
|
||||
with session_scope() as db:
|
||||
machine = db.scalar(select(Machine).where(Machine.aws_instance_id == machine_id))
|
||||
if not machine:
|
||||
raise typer.BadParameter(f"Unknown machine: {machine_id}")
|
||||
terminate_machine(db, machine, actor)
|
||||
active_session = db.scalar(select(RuntimeSession).where(RuntimeSession.machine_id == machine.id, RuntimeSession.status == "active"))
|
||||
if active_session:
|
||||
active_session.status = "terminated"
|
||||
typer.echo(json.dumps({"machine": machine_id, "status": "terminating"}))
|
||||
|
||||
|
||||
@app.command("audit-tail")
|
||||
def audit_tail(limit: int = 20):
|
||||
with session_scope() as db:
|
||||
events = db.scalars(select(AuditEvent).order_by(AuditEvent.created_at.desc()).limit(limit)).all()
|
||||
for event in events:
|
||||
typer.echo(json.dumps({"actor": event.actor, "action": event.action, "entity": event.entity_id, "created_at": event.created_at.isoformat()}))
|
||||
|
||||
|
||||
@app.command("export-sessions")
|
||||
def export_sessions(output: Path = Path("/app/exports/sessions_cli.csv")):
|
||||
with session_scope() as db:
|
||||
export_sessions_csv(db, str(output))
|
||||
typer.echo(str(output))
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
app()
|
||||
@@ -0,0 +1,51 @@
|
||||
from __future__ import annotations
|
||||
|
||||
import os
|
||||
import json
|
||||
from dataclasses import field
|
||||
from dataclasses import dataclass
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class Settings:
|
||||
database_url: str = os.environ["OPS_DATABASE_URL"]
|
||||
session_secret: str = os.environ["OPS_SESSION_SECRET"]
|
||||
admin_username: str = os.environ.get("OPS_ADMIN_USERNAME", "sagnik")
|
||||
admin_password: str = os.environ["OPS_ADMIN_PASSWORD"]
|
||||
team_users_json: str = os.environ.get("OPS_TEAM_USERS_JSON", "[]")
|
||||
default_region: str = os.environ.get("OPS_DEFAULT_REGION", "us-east-1")
|
||||
visible_regions: tuple[str, ...] = tuple(
|
||||
region.strip() for region in os.environ.get("OPS_VISIBLE_REGIONS", "us-east-1").split(",") if region.strip()
|
||||
)
|
||||
bucket_name: str = os.environ.get("OPS_BUCKET_NAME", "")
|
||||
bucket_region: str = os.environ.get("OPS_BUCKET_REGION", "us-east-1")
|
||||
ssh_key_path: Path = Path(os.environ.get("OPS_SSH_KEY_PATH", "/app/state/desineuron-l4-node.pem"))
|
||||
gpu_ssh_user: str = os.environ.get("OPS_GPU_SSH_USER", "ubuntu")
|
||||
ingress_ssh_host: str = os.environ.get("OPS_INGRESS_SSH_HOST", "")
|
||||
ingress_ssh_user: str = os.environ.get("OPS_INGRESS_SSH_USER", "ec2-user")
|
||||
ingress_ssh_port: int = int(os.environ.get("OPS_INGRESS_SSH_PORT", "22"))
|
||||
ingress_route_helper: str = os.environ.get("OPS_INGRESS_ROUTE_HELPER", "/usr/local/bin/manage_desineuron_routes.py")
|
||||
public_base_url: str = os.environ.get("OPS_LINUX_PUBLIC_BASE_URL", "https://ops.desineuron.in")
|
||||
ebs_gp3_per_gb_month: float = float(os.environ.get("OPS_PRICE_EBS_GP3_PER_GB_MONTH", "0.08"))
|
||||
public_ipv4_per_hour: float = float(os.environ.get("OPS_PRICE_PUBLIC_IPV4_PER_HOUR", "0.005"))
|
||||
allowed_machine_ids: tuple[str, ...] = tuple(
|
||||
machine.strip() for machine in os.environ.get("OPS_ALLOWED_MACHINE_IDS", "").split(",") if machine.strip()
|
||||
)
|
||||
gpu_subnet_id: str = os.environ.get("OPS_GPU_SUBNET_ID", "")
|
||||
gpu_security_group_ids: tuple[str, ...] = tuple(
|
||||
group.strip() for group in os.environ.get("OPS_GPU_SECURITY_GROUP_IDS", "").split(",") if group.strip()
|
||||
)
|
||||
gpu_key_name: str = os.environ.get("OPS_GPU_KEY_NAME", "")
|
||||
gpu_ami_id: str = os.environ.get("OPS_GPU_AMI_ID", "")
|
||||
gpu_instance_profile: str = os.environ.get("OPS_GPU_INSTANCE_PROFILE", "")
|
||||
gpu_root_volume_gb: int = int(os.environ.get("OPS_GPU_ROOT_VOLUME_GB", "300"))
|
||||
export_dir: Path = Path(os.environ.get("OPS_CSV_EXPORT_DIR", "/app/exports"))
|
||||
log_dir: Path = Path(os.environ.get("OPS_LOG_DIR", "/app/logs"))
|
||||
state_dir: Path = Path(os.environ.get("OPS_STATE_DIR", "/app/state"))
|
||||
model_library_root: Path = Path(os.environ.get("OPS_MODEL_LIBRARY_ROOT", "/model-library"))
|
||||
cloudflare_zone_name: str = os.environ.get("OPS_CLOUDFLARE_ZONE_NAME", "desineuron.in")
|
||||
cloudflare_api_token: str = os.environ.get("OPS_CLOUDFLARE_API_TOKEN", "")
|
||||
|
||||
|
||||
settings = Settings()
|
||||
@@ -0,0 +1,41 @@
|
||||
from __future__ import annotations
|
||||
|
||||
from contextlib import contextmanager
|
||||
|
||||
from sqlalchemy import create_engine
|
||||
from sqlalchemy.orm import DeclarativeBase, Session, sessionmaker
|
||||
|
||||
from .config import settings
|
||||
|
||||
|
||||
engine = create_engine(settings.database_url, pool_pre_ping=True)
|
||||
SessionLocal = sessionmaker(bind=engine, autoflush=False, autocommit=False, expire_on_commit=False)
|
||||
|
||||
|
||||
class Base(DeclarativeBase):
|
||||
pass
|
||||
|
||||
|
||||
def get_db():
|
||||
db = SessionLocal()
|
||||
try:
|
||||
yield db
|
||||
db.commit()
|
||||
except Exception:
|
||||
db.rollback()
|
||||
raise
|
||||
finally:
|
||||
db.close()
|
||||
|
||||
|
||||
@contextmanager
|
||||
def session_scope():
|
||||
session = SessionLocal()
|
||||
try:
|
||||
yield session
|
||||
session.commit()
|
||||
except Exception:
|
||||
session.rollback()
|
||||
raise
|
||||
finally:
|
||||
session.close()
|
||||
598
infrastructure/ops_control_plane/app/ops_control_plane/main.py
Normal file
598
infrastructure/ops_control_plane/app/ops_control_plane/main.py
Normal file
@@ -0,0 +1,598 @@
|
||||
from __future__ import annotations
|
||||
|
||||
import os
|
||||
from datetime import datetime, timedelta, timezone
|
||||
from pathlib import Path
|
||||
|
||||
from botocore.exceptions import ClientError
|
||||
from fastapi import Depends, FastAPI, Form, HTTPException, Request
|
||||
from fastapi.responses import HTMLResponse, JSONResponse, RedirectResponse
|
||||
from fastapi.staticfiles import StaticFiles
|
||||
from fastapi.templating import Jinja2Templates
|
||||
from sqlalchemy import func, select
|
||||
from sqlalchemy.orm import Session
|
||||
from starlette.middleware.sessions import SessionMiddleware
|
||||
|
||||
from .aws_control import calculate_machine_cost, create_managed_instance, ensure_bucket, export_sessions_csv, hydrate_model, latest_market_price, seed_bucket_prefixes, start_service, stop_machine, stop_service, sync_instances, terminate_machine, upload_model_directory
|
||||
from .config import settings
|
||||
from .database import Base, engine, get_db, session_scope
|
||||
from .models import AuditEvent, CsvExport, Job, Machine, MachineProfile, MarketSnapshot, ModelCatalog, RouteBinding, Session as RuntimeSession, SessionCost, User, WorkloadProfile
|
||||
from .route_control import apply_route, remove_route
|
||||
from .seed import seed_defaults
|
||||
from .security import get_current_user, verify_password
|
||||
|
||||
|
||||
app = FastAPI(title="Desineuron Ops Control Plane")
|
||||
app.add_middleware(SessionMiddleware, secret_key=settings.session_secret)
|
||||
template_dir = Path(__file__).parent / "templates"
|
||||
static_dir = Path(__file__).parent / "static"
|
||||
templates = Jinja2Templates(directory=str(template_dir))
|
||||
app.mount("/static", StaticFiles(directory=str(static_dir)), name="static")
|
||||
|
||||
|
||||
def utcnow() -> datetime:
|
||||
return datetime.now(timezone.utc)
|
||||
|
||||
|
||||
def recent_totals(db: Session) -> dict:
|
||||
now = utcnow()
|
||||
day_start = now - timedelta(days=1)
|
||||
month_start = now - timedelta(days=30)
|
||||
day_total = db.scalar(
|
||||
select(func.coalesce(func.sum(SessionCost.total_cost_usd), 0.0))
|
||||
.join(RuntimeSession, RuntimeSession.id == SessionCost.session_id)
|
||||
.where(SessionCost.calculated_at >= day_start)
|
||||
)
|
||||
month_total = db.scalar(
|
||||
select(func.coalesce(func.sum(SessionCost.total_cost_usd), 0.0))
|
||||
.join(RuntimeSession, RuntimeSession.id == SessionCost.session_id)
|
||||
.where(SessionCost.calculated_at >= month_start)
|
||||
)
|
||||
return {
|
||||
"last_24h_usd": round(float(day_total or 0.0), 4),
|
||||
"last_30d_usd": round(float(month_total or 0.0), 4),
|
||||
}
|
||||
|
||||
|
||||
def pop_flash(request: Request) -> dict | None:
|
||||
return request.session.pop("flash", None)
|
||||
|
||||
|
||||
def set_flash(request: Request, level: str, message: str) -> None:
|
||||
request.session["flash"] = {"level": level, "message": message}
|
||||
|
||||
|
||||
def parse_tag_list(raw: str) -> list[str]:
|
||||
return [item.strip() for item in raw.split(",") if item.strip()]
|
||||
|
||||
|
||||
@app.on_event("startup")
|
||||
def startup() -> None:
|
||||
Base.metadata.create_all(bind=engine)
|
||||
settings.export_dir.mkdir(parents=True, exist_ok=True)
|
||||
settings.log_dir.mkdir(parents=True, exist_ok=True)
|
||||
settings.state_dir.mkdir(parents=True, exist_ok=True)
|
||||
with session_scope() as db:
|
||||
seed_defaults(db)
|
||||
if settings.bucket_name:
|
||||
ensure_bucket(settings.bucket_name, settings.bucket_region)
|
||||
seed_bucket_prefixes(settings.bucket_name)
|
||||
|
||||
|
||||
@app.get("/", response_class=HTMLResponse)
|
||||
def root(request: Request):
|
||||
if request.session.get("username"):
|
||||
return RedirectResponse("/dashboard", status_code=302)
|
||||
return RedirectResponse("/login", status_code=302)
|
||||
|
||||
|
||||
@app.get("/login", response_class=HTMLResponse)
|
||||
def login_page(request: Request):
|
||||
return templates.TemplateResponse("login.html", {"request": request, "error": None})
|
||||
|
||||
|
||||
@app.post("/login", response_class=HTMLResponse)
|
||||
def login(request: Request, username: str = Form(...), password: str = Form(...), db: Session = Depends(get_db)):
|
||||
user = db.scalar(select(User).where(User.username == username, User.is_active.is_(True)))
|
||||
if not user or not verify_password(password, user.password_hash):
|
||||
return templates.TemplateResponse("login.html", {"request": request, "error": "Invalid credentials"}, status_code=401)
|
||||
request.session["username"] = user.username
|
||||
return RedirectResponse("/dashboard", status_code=302)
|
||||
|
||||
|
||||
@app.get("/logout")
|
||||
def logout(request: Request):
|
||||
request.session.clear()
|
||||
return RedirectResponse("/login", status_code=302)
|
||||
|
||||
|
||||
@app.get("/dashboard", response_class=HTMLResponse)
|
||||
def dashboard(request: Request, current_user: User = Depends(get_current_user), db: Session = Depends(get_db)):
|
||||
machines = db.scalars(select(Machine).order_by(Machine.updated_at.desc())).all()
|
||||
profiles = db.scalars(select(MachineProfile).order_by(MachineProfile.name)).all()
|
||||
workloads = db.scalars(select(WorkloadProfile).order_by(WorkloadProfile.name)).all()
|
||||
models = db.scalars(select(ModelCatalog).order_by(ModelCatalog.model_key)).all()
|
||||
routes = db.scalars(select(RouteBinding).order_by(RouteBinding.hostname)).all()
|
||||
jobs = db.scalars(select(Job).order_by(Job.created_at.desc()).limit(20)).all()
|
||||
sessions = db.scalars(select(RuntimeSession).order_by(RuntimeSession.started_at.desc()).limit(20)).all()
|
||||
market_rows = db.scalars(select(MarketSnapshot).order_by(MarketSnapshot.observed_at.desc()).limit(100)).all()
|
||||
audits = db.scalars(select(AuditEvent).order_by(AuditEvent.created_at.desc()).limit(20)).all()
|
||||
costs = []
|
||||
total_hourly = 0.0
|
||||
total_estimated = 0.0
|
||||
for machine in machines:
|
||||
hourly_rate = latest_market_price(db, machine.region, machine.instance_type, machine.lifecycle)
|
||||
machine_cost = calculate_machine_cost(machine, hourly_rate)
|
||||
total_hourly += machine_cost["hourly_price_usd"]
|
||||
total_estimated += machine_cost["total_cost_usd"]
|
||||
costs.append((machine.aws_instance_id, machine_cost))
|
||||
summary = {
|
||||
"machine_count": len(machines),
|
||||
"active_sessions": sum(1 for session in sessions if session.status == "active"),
|
||||
"active_jobs": sum(1 for job in jobs if job.status in {"queued", "running"}),
|
||||
"routes_active": sum(1 for route in routes if route.status == "active"),
|
||||
"hourly_burn_usd": round(total_hourly, 4),
|
||||
"fleet_estimated_cost_usd": round(total_estimated, 4),
|
||||
**recent_totals(db),
|
||||
}
|
||||
return templates.TemplateResponse(
|
||||
"index.html",
|
||||
{
|
||||
"request": request,
|
||||
"user": current_user,
|
||||
"machines": machines,
|
||||
"profiles": profiles,
|
||||
"workloads": workloads,
|
||||
"models": models,
|
||||
"routes": routes,
|
||||
"jobs": jobs,
|
||||
"sessions": sessions,
|
||||
"market_rows": market_rows,
|
||||
"audits": audits,
|
||||
"costs": dict(costs),
|
||||
"summary": summary,
|
||||
"flash": pop_flash(request),
|
||||
"bucket_name": settings.bucket_name,
|
||||
"regions": settings.visible_regions,
|
||||
},
|
||||
)
|
||||
|
||||
|
||||
@app.get("/api/markets/instances")
|
||||
def get_markets(current_user: User = Depends(get_current_user), db: Session = Depends(get_db)):
|
||||
profiles = db.scalars(select(MachineProfile).order_by(MachineProfile.name)).all()
|
||||
payload = []
|
||||
for profile in profiles:
|
||||
per_region = {}
|
||||
for region in settings.visible_regions:
|
||||
on_demand = db.scalar(
|
||||
select(MarketSnapshot)
|
||||
.where(MarketSnapshot.region == region, MarketSnapshot.instance_type == profile.instance_type, MarketSnapshot.lifecycle == "on-demand")
|
||||
.order_by(MarketSnapshot.observed_at.desc())
|
||||
)
|
||||
spot = db.scalar(
|
||||
select(MarketSnapshot)
|
||||
.where(MarketSnapshot.region == region, MarketSnapshot.instance_type == profile.instance_type, MarketSnapshot.lifecycle == "spot")
|
||||
.order_by(MarketSnapshot.observed_at.desc())
|
||||
)
|
||||
per_region[region] = {
|
||||
"on_demand": on_demand.hourly_price_usd if on_demand else None,
|
||||
"on_demand_available": bool(on_demand and on_demand.offering_available),
|
||||
"spot": spot.hourly_price_usd if spot else None,
|
||||
"spot_available": bool(spot and spot.offering_available),
|
||||
"last_seen": max(
|
||||
[stamp for stamp in [on_demand.observed_at if on_demand else None, spot.observed_at if spot else None] if stamp],
|
||||
default=None,
|
||||
),
|
||||
}
|
||||
payload.append(
|
||||
{
|
||||
"profile": profile.name,
|
||||
"instance_type": profile.instance_type,
|
||||
"gpu_label": profile.gpu_label,
|
||||
"vcpu": profile.vcpu,
|
||||
"memory_gib": profile.memory_gib,
|
||||
"regions": per_region,
|
||||
}
|
||||
)
|
||||
return payload
|
||||
|
||||
|
||||
@app.get("/api/machines")
|
||||
def get_machines(current_user: User = Depends(get_current_user), db: Session = Depends(get_db)):
|
||||
machines = db.scalars(select(Machine).order_by(Machine.updated_at.desc())).all()
|
||||
payload = []
|
||||
for machine in machines:
|
||||
hourly_rate = latest_market_price(db, machine.region, machine.instance_type, machine.lifecycle)
|
||||
payload.append(
|
||||
{
|
||||
"id": machine.id,
|
||||
"aws_instance_id": machine.aws_instance_id,
|
||||
"name": machine.name,
|
||||
"region": machine.region,
|
||||
"state": machine.state,
|
||||
"instance_type": machine.instance_type,
|
||||
"lifecycle": machine.lifecycle,
|
||||
"public_ip": machine.public_ip,
|
||||
"private_ip": machine.private_ip,
|
||||
"cost": calculate_machine_cost(machine, hourly_rate),
|
||||
}
|
||||
)
|
||||
return payload
|
||||
|
||||
|
||||
@app.post("/api/machines/launch")
|
||||
def launch_machine(request: Request, profile_name: str = Form(...), lifecycle: str = Form(...), db: Session = Depends(get_db), current_user: User = Depends(get_current_user)):
|
||||
profile = db.scalar(select(MachineProfile).where(MachineProfile.name == profile_name))
|
||||
if not profile:
|
||||
raise HTTPException(status_code=404, detail="Profile not found")
|
||||
job = Job(job_type="launch_machine", status="running", actor=current_user.username, payload={"profile_name": profile_name, "lifecycle": lifecycle}, started_at=utcnow())
|
||||
db.add(job)
|
||||
db.flush()
|
||||
try:
|
||||
session_row = create_managed_instance(db, profile, current_user.username, lifecycle)
|
||||
except Exception as exc:
|
||||
error_code = exc.response.get("Error", {}).get("Code") if isinstance(exc, ClientError) else exc.__class__.__name__
|
||||
job.status = "failed"
|
||||
job.finished_at = utcnow()
|
||||
job.result = {"error": str(exc), "code": error_code}
|
||||
db.add(AuditEvent(actor=current_user.username, action="launch_machine_failed", entity_type="profile", entity_id=profile.name, payload=job.result))
|
||||
set_flash(request, "error", f"Launch failed for {profile.name}: {error_code}")
|
||||
return RedirectResponse("/dashboard", status_code=302)
|
||||
job.status = "completed"
|
||||
job.session_id = session_row.id
|
||||
job.finished_at = utcnow()
|
||||
job.result = {"session_id": session_row.id}
|
||||
set_flash(request, "success", f"Launched {profile.name} as {lifecycle}.")
|
||||
return RedirectResponse("/dashboard", status_code=302)
|
||||
|
||||
|
||||
@app.post("/api/machines/{machine_id}/stop")
|
||||
def api_stop_machine(machine_id: int, request: Request, db: Session = Depends(get_db), current_user: User = Depends(get_current_user)):
|
||||
machine = db.get(Machine, machine_id)
|
||||
if not machine:
|
||||
raise HTTPException(status_code=404, detail="Machine not found")
|
||||
job = Job(job_type="stop_machine", status="running", actor=current_user.username, machine_id=machine_id, payload={"aws_instance_id": machine.aws_instance_id}, started_at=utcnow())
|
||||
db.add(job)
|
||||
stop_machine(db, machine, current_user.username)
|
||||
active_session = db.scalar(select(RuntimeSession).where(RuntimeSession.machine_id == machine.id, RuntimeSession.status == "active"))
|
||||
if active_session:
|
||||
active_session.status = "stopped"
|
||||
active_session.ended_at = utcnow()
|
||||
job.status = "completed"
|
||||
job.finished_at = utcnow()
|
||||
job.result = {"status": "stopping"}
|
||||
if "text/html" in request.headers.get("accept", ""):
|
||||
set_flash(request, "success", f"Stop requested for {machine.aws_instance_id}.")
|
||||
return RedirectResponse("/dashboard", status_code=302)
|
||||
return {"status": "stopping"}
|
||||
|
||||
|
||||
@app.post("/api/machines/{machine_id}/terminate")
|
||||
def api_terminate_machine(machine_id: int, request: Request, db: Session = Depends(get_db), current_user: User = Depends(get_current_user)):
|
||||
machine = db.get(Machine, machine_id)
|
||||
if not machine:
|
||||
raise HTTPException(status_code=404, detail="Machine not found")
|
||||
job = Job(job_type="terminate_machine", status="running", actor=current_user.username, machine_id=machine_id, payload={"aws_instance_id": machine.aws_instance_id}, started_at=utcnow())
|
||||
db.add(job)
|
||||
terminate_machine(db, machine, current_user.username)
|
||||
active_session = db.scalar(select(RuntimeSession).where(RuntimeSession.machine_id == machine.id, RuntimeSession.status == "active"))
|
||||
if active_session:
|
||||
active_session.status = "terminated"
|
||||
active_session.ended_at = utcnow()
|
||||
job.status = "completed"
|
||||
job.finished_at = utcnow()
|
||||
job.result = {"status": "terminating"}
|
||||
if "text/html" in request.headers.get("accept", ""):
|
||||
set_flash(request, "success", f"Terminate requested for {machine.aws_instance_id}.")
|
||||
return RedirectResponse("/dashboard", status_code=302)
|
||||
return {"status": "terminating"}
|
||||
|
||||
|
||||
@app.post("/api/models/hydrate")
|
||||
def api_hydrate_model(request: Request, machine_id: int = Form(...), model_key: str = Form(...), db: Session = Depends(get_db), current_user: User = Depends(get_current_user)):
|
||||
machine = db.get(Machine, machine_id)
|
||||
model = db.scalar(select(ModelCatalog).where(ModelCatalog.model_key == model_key))
|
||||
if not machine or not model:
|
||||
raise HTTPException(status_code=404, detail="Machine or model not found")
|
||||
if not settings.bucket_name:
|
||||
raise HTTPException(status_code=400, detail="Bucket is not configured")
|
||||
job = Job(job_type="hydrate_model", status="running", actor=current_user.username, machine_id=machine_id, payload={"model_key": model_key}, started_at=utcnow())
|
||||
db.add(job)
|
||||
result = hydrate_model(machine, model.s3_prefix, current_user.username, settings.bucket_name)
|
||||
db.add(AuditEvent(actor=current_user.username, action="hydrate_model", entity_type="machine", entity_id=machine.aws_instance_id, payload={"model_key": model.model_key, "result": result}))
|
||||
job.status = "completed" if result.get("returncode") == 0 else "failed"
|
||||
job.finished_at = utcnow()
|
||||
job.result = result
|
||||
if "text/html" in request.headers.get("accept", ""):
|
||||
set_flash(request, "success" if result.get("returncode") == 0 else "error", f"Hydration {'completed' if result.get('returncode') == 0 else 'failed'} for {model.label} on {machine.aws_instance_id}.")
|
||||
return RedirectResponse("/dashboard", status_code=302)
|
||||
return JSONResponse(result)
|
||||
|
||||
|
||||
@app.post("/api/models/register")
|
||||
def api_register_model(
|
||||
request: Request,
|
||||
model_key: str = Form(...),
|
||||
label: str = Form(...),
|
||||
source_relative_path: str = Form(...),
|
||||
workload_tags: str = Form(""),
|
||||
compatibility_tags: str = Form(""),
|
||||
db: Session = Depends(get_db),
|
||||
current_user: User = Depends(get_current_user),
|
||||
):
|
||||
if not settings.bucket_name:
|
||||
raise HTTPException(status_code=400, detail="Bucket is not configured")
|
||||
job = Job(
|
||||
job_type="register_model",
|
||||
status="running",
|
||||
actor=current_user.username,
|
||||
payload={
|
||||
"model_key": model_key,
|
||||
"label": label,
|
||||
"source_relative_path": source_relative_path,
|
||||
"workload_tags": workload_tags,
|
||||
"compatibility_tags": compatibility_tags,
|
||||
},
|
||||
started_at=utcnow(),
|
||||
)
|
||||
db.add(job)
|
||||
try:
|
||||
result = upload_model_directory(
|
||||
settings.bucket_name,
|
||||
model_key=model_key,
|
||||
source_relative_path=source_relative_path,
|
||||
label=label,
|
||||
workload_tags=parse_tag_list(workload_tags),
|
||||
compatibility_tags=parse_tag_list(compatibility_tags),
|
||||
)
|
||||
except Exception as exc:
|
||||
job.status = "failed"
|
||||
job.finished_at = utcnow()
|
||||
job.result = {"error": str(exc)}
|
||||
db.add(AuditEvent(actor=current_user.username, action="register_model_failed", entity_type="model", entity_id=model_key, payload=job.result))
|
||||
if "text/html" in request.headers.get("accept", ""):
|
||||
set_flash(request, "error", f"Model ingest failed for {model_key}: {exc}")
|
||||
return RedirectResponse("/dashboard", status_code=302)
|
||||
raise HTTPException(status_code=500, detail=str(exc))
|
||||
|
||||
existing = db.scalar(select(ModelCatalog).where(ModelCatalog.model_key == model_key))
|
||||
if existing:
|
||||
existing.label = label
|
||||
existing.s3_prefix = result["s3_prefix"]
|
||||
existing.expected_manifest = result["manifest"]
|
||||
existing.checksums = {entry["path"]: entry["sha256"] for entry in result["manifest"]["files"]}
|
||||
existing.compatibility_tags = result["compatibility_tags"]
|
||||
existing.workload_tags = result["workload_tags"]
|
||||
existing.size_gb = round(result["manifest"]["total_size_bytes"] / (1024 ** 3), 3)
|
||||
else:
|
||||
db.add(
|
||||
ModelCatalog(
|
||||
model_key=model_key,
|
||||
label=label,
|
||||
s3_prefix=result["s3_prefix"],
|
||||
expected_manifest=result["manifest"],
|
||||
checksums={entry["path"]: entry["sha256"] for entry in result["manifest"]["files"]},
|
||||
compatibility_tags=result["compatibility_tags"],
|
||||
workload_tags=result["workload_tags"],
|
||||
size_gb=round(result["manifest"]["total_size_bytes"] / (1024 ** 3), 3),
|
||||
)
|
||||
)
|
||||
job.status = "completed"
|
||||
job.finished_at = utcnow()
|
||||
job.result = {"manifest_key": result["manifest_key"], "file_count": result["manifest"]["file_count"]}
|
||||
db.add(AuditEvent(actor=current_user.username, action="register_model", entity_type="model", entity_id=model_key, payload=job.result))
|
||||
if "text/html" in request.headers.get("accept", ""):
|
||||
set_flash(request, "success", f"Model {model_key} uploaded to S3 and manifest stored.")
|
||||
return RedirectResponse("/dashboard", status_code=302)
|
||||
return JSONResponse(job.result)
|
||||
|
||||
|
||||
@app.post("/api/workloads/start")
|
||||
def api_start_workload(request: Request, machine_id: int = Form(...), workload_name: str = Form(...), auto_route: bool = Form(False), db: Session = Depends(get_db), current_user: User = Depends(get_current_user)):
|
||||
machine = db.get(Machine, machine_id)
|
||||
workload = db.scalar(select(WorkloadProfile).where(WorkloadProfile.name == workload_name))
|
||||
if not machine or not workload:
|
||||
raise HTTPException(status_code=404, detail="Machine or workload not found")
|
||||
job = Job(job_type="start_workload", status="running", actor=current_user.username, machine_id=machine_id, payload={"workload_name": workload_name, "auto_route": auto_route}, started_at=utcnow())
|
||||
db.add(job)
|
||||
result = start_service(machine, workload.name)
|
||||
route_result = None
|
||||
if result.get("returncode") == 0 and auto_route and workload.route_hostname and workload.default_port and machine.private_ip:
|
||||
route_result = apply_route(workload.route_hostname, "http", machine.private_ip, workload.default_port)
|
||||
existing = db.scalar(select(RouteBinding).where(RouteBinding.hostname == workload.route_hostname))
|
||||
if existing:
|
||||
existing.scheme = "http"
|
||||
existing.target_host = machine.private_ip
|
||||
existing.target_port = workload.default_port
|
||||
existing.status = "active"
|
||||
existing.details = {"managed_by": "ops_control_plane", "machine_id": machine.aws_instance_id}
|
||||
else:
|
||||
db.add(RouteBinding(hostname=workload.route_hostname, target_type="managed", target_host=machine.private_ip, target_port=workload.default_port, scheme="http", status="active", details={"managed_by": "ops_control_plane", "machine_id": machine.aws_instance_id}))
|
||||
db.add(AuditEvent(actor=current_user.username, action="start_workload", entity_type="machine", entity_id=machine.aws_instance_id, payload={"workload": workload.name, "result": result}))
|
||||
job.status = "completed" if result.get("returncode") == 0 else "failed"
|
||||
job.finished_at = utcnow()
|
||||
job.result = {"service": result, "route": route_result}
|
||||
if "text/html" in request.headers.get("accept", ""):
|
||||
set_flash(request, "success" if result.get("returncode") == 0 else "error", f"Start workload {'completed' if result.get('returncode') == 0 else 'failed'} for {workload.name} on {machine.aws_instance_id}.")
|
||||
return RedirectResponse("/dashboard", status_code=302)
|
||||
return JSONResponse({"service": result, "route": route_result})
|
||||
|
||||
|
||||
@app.post("/api/workloads/{machine_id}/stop")
|
||||
def api_stop_workload(machine_id: int, request: Request, workload_name: str = Form(...), db: Session = Depends(get_db), current_user: User = Depends(get_current_user)):
|
||||
machine = db.get(Machine, machine_id)
|
||||
if not machine:
|
||||
raise HTTPException(status_code=404, detail="Machine not found")
|
||||
job = Job(job_type="stop_workload", status="running", actor=current_user.username, machine_id=machine_id, payload={"workload_name": workload_name}, started_at=utcnow())
|
||||
db.add(job)
|
||||
result = stop_service(machine, workload_name)
|
||||
db.add(AuditEvent(actor=current_user.username, action="stop_workload", entity_type="machine", entity_id=machine.aws_instance_id, payload={"workload": workload_name, "result": result}))
|
||||
job.status = "completed" if result.get("returncode") == 0 else "failed"
|
||||
job.finished_at = utcnow()
|
||||
job.result = result
|
||||
if "text/html" in request.headers.get("accept", ""):
|
||||
set_flash(request, "success" if result.get("returncode") == 0 else "error", f"Stop workload {'completed' if result.get('returncode') == 0 else 'failed'} for {workload_name} on {machine.aws_instance_id}.")
|
||||
return RedirectResponse("/dashboard", status_code=302)
|
||||
return JSONResponse(result)
|
||||
|
||||
|
||||
@app.post("/api/routes/map")
|
||||
def api_map_route(request: Request, hostname: str = Form(...), scheme: str = Form(...), target_host: str = Form(...), target_port: int = Form(...), db: Session = Depends(get_db), current_user: User = Depends(get_current_user)):
|
||||
job = Job(job_type="map_route", status="running", actor=current_user.username, payload={"hostname": hostname, "scheme": scheme, "target_host": target_host, "target_port": target_port}, started_at=utcnow())
|
||||
db.add(job)
|
||||
result = apply_route(hostname, scheme, target_host, target_port)
|
||||
existing = db.scalar(select(RouteBinding).where(RouteBinding.hostname == hostname))
|
||||
if existing:
|
||||
existing.scheme = scheme
|
||||
existing.target_host = target_host
|
||||
existing.target_port = target_port
|
||||
existing.status = "active"
|
||||
else:
|
||||
db.add(RouteBinding(hostname=hostname, target_type="managed", target_host=target_host, target_port=target_port, scheme=scheme, status="active"))
|
||||
db.add(AuditEvent(actor=current_user.username, action="map_route", entity_type="route", entity_id=hostname, payload=result))
|
||||
job.status = "completed" if result.get("returncode") == 0 else "failed"
|
||||
job.finished_at = utcnow()
|
||||
job.result = result
|
||||
if "text/html" in request.headers.get("accept", ""):
|
||||
set_flash(request, "success" if result.get("returncode") == 0 else "error", f"Route {'mapped' if result.get('returncode') == 0 else 'map failed'} for {hostname}.")
|
||||
return RedirectResponse("/dashboard", status_code=302)
|
||||
return JSONResponse(result)
|
||||
|
||||
|
||||
@app.post("/api/routes/unmap")
|
||||
def api_unmap_route(request: Request, hostname: str = Form(...), db: Session = Depends(get_db), current_user: User = Depends(get_current_user)):
|
||||
job = Job(job_type="unmap_route", status="running", actor=current_user.username, payload={"hostname": hostname}, started_at=utcnow())
|
||||
db.add(job)
|
||||
result = remove_route(hostname)
|
||||
existing = db.scalar(select(RouteBinding).where(RouteBinding.hostname == hostname))
|
||||
if existing:
|
||||
existing.status = "removed"
|
||||
db.add(AuditEvent(actor=current_user.username, action="unmap_route", entity_type="route", entity_id=hostname, payload=result))
|
||||
job.status = "completed" if result.get("returncode") == 0 else "failed"
|
||||
job.finished_at = utcnow()
|
||||
job.result = result
|
||||
if "text/html" in request.headers.get("accept", ""):
|
||||
set_flash(request, "success" if result.get("returncode") == 0 else "error", f"Route {'removed' if result.get('returncode') == 0 else 'removal failed'} for {hostname}.")
|
||||
return RedirectResponse("/dashboard", status_code=302)
|
||||
return JSONResponse(result)
|
||||
|
||||
|
||||
@app.get("/api/markets/pricing")
|
||||
def get_market_pricing(current_user: User = Depends(get_current_user), db: Session = Depends(get_db)):
|
||||
rows = db.scalars(select(MarketSnapshot).order_by(MarketSnapshot.observed_at.desc()).limit(200)).all()
|
||||
return [
|
||||
{
|
||||
"region": row.region,
|
||||
"instance_type": row.instance_type,
|
||||
"lifecycle": row.lifecycle,
|
||||
"offering_available": row.offering_available,
|
||||
"hourly_price_usd": row.hourly_price_usd,
|
||||
"observed_at": row.observed_at,
|
||||
}
|
||||
for row in rows
|
||||
]
|
||||
|
||||
|
||||
@app.get("/api/sessions")
|
||||
def get_sessions(current_user: User = Depends(get_current_user), db: Session = Depends(get_db)):
|
||||
sessions = db.scalars(select(RuntimeSession).order_by(RuntimeSession.started_at.desc()).limit(200)).all()
|
||||
payload = []
|
||||
for session_row in sessions:
|
||||
machine = db.get(Machine, session_row.machine_id) if session_row.machine_id else None
|
||||
latest_cost = db.scalar(select(SessionCost).where(SessionCost.session_id == session_row.id).order_by(SessionCost.calculated_at.desc()))
|
||||
payload.append(
|
||||
{
|
||||
"id": session_row.id,
|
||||
"actor": session_row.actor,
|
||||
"workload_name": session_row.workload_name,
|
||||
"status": session_row.status,
|
||||
"started_at": session_row.started_at,
|
||||
"ended_at": session_row.ended_at,
|
||||
"notes": session_row.notes,
|
||||
"machine": machine.aws_instance_id if machine else None,
|
||||
"cost": latest_cost.total_cost_usd if latest_cost else None,
|
||||
"runtime_hours": latest_cost.runtime_hours if latest_cost else None,
|
||||
}
|
||||
)
|
||||
return payload
|
||||
|
||||
|
||||
@app.get("/api/costs")
|
||||
def api_costs(current_user: User = Depends(get_current_user), db: Session = Depends(get_db)):
|
||||
machines = db.scalars(select(Machine)).all()
|
||||
total = 0.0
|
||||
items = []
|
||||
for machine in machines:
|
||||
hourly_rate = latest_market_price(db, machine.region, machine.instance_type, machine.lifecycle)
|
||||
cost = calculate_machine_cost(machine, hourly_rate)
|
||||
total += cost["total_cost_usd"]
|
||||
items.append({"machine": machine.aws_instance_id, **cost})
|
||||
return {"machines": items, "total_estimated_cost_usd": round(total, 4), **recent_totals(db)}
|
||||
|
||||
|
||||
@app.get("/api/models")
|
||||
def api_models(current_user: User = Depends(get_current_user), db: Session = Depends(get_db)):
|
||||
models = db.scalars(select(ModelCatalog).order_by(ModelCatalog.model_key)).all()
|
||||
return [
|
||||
{
|
||||
"model_key": model.model_key,
|
||||
"label": model.label,
|
||||
"s3_prefix": model.s3_prefix,
|
||||
"size_gb": model.size_gb,
|
||||
"workload_tags": model.workload_tags,
|
||||
"compatibility_tags": model.compatibility_tags,
|
||||
"file_count": (model.expected_manifest or {}).get("file_count", 0),
|
||||
}
|
||||
for model in models
|
||||
]
|
||||
|
||||
|
||||
@app.get("/api/audit")
|
||||
def api_audit(current_user: User = Depends(get_current_user), db: Session = Depends(get_db)):
|
||||
events = db.scalars(select(AuditEvent).order_by(AuditEvent.created_at.desc()).limit(100)).all()
|
||||
return [
|
||||
{
|
||||
"actor": event.actor,
|
||||
"action": event.action,
|
||||
"entity_type": event.entity_type,
|
||||
"entity_id": event.entity_id,
|
||||
"payload": event.payload,
|
||||
"created_at": event.created_at,
|
||||
}
|
||||
for event in events
|
||||
]
|
||||
|
||||
|
||||
@app.get("/api/jobs")
|
||||
def api_jobs(current_user: User = Depends(get_current_user), db: Session = Depends(get_db)):
|
||||
jobs = db.scalars(select(Job).order_by(Job.created_at.desc()).limit(200)).all()
|
||||
return [
|
||||
{
|
||||
"id": job.id,
|
||||
"job_type": job.job_type,
|
||||
"status": job.status,
|
||||
"actor": job.actor,
|
||||
"machine_id": job.machine_id,
|
||||
"session_id": job.session_id,
|
||||
"payload": job.payload,
|
||||
"result": job.result,
|
||||
"created_at": job.created_at,
|
||||
"finished_at": job.finished_at,
|
||||
}
|
||||
for job in jobs
|
||||
]
|
||||
|
||||
|
||||
@app.get("/api/exports/csv")
|
||||
def api_export_csv(current_user: User = Depends(get_current_user), db: Session = Depends(get_db)):
|
||||
target = settings.export_dir / "sessions_latest.csv"
|
||||
export_sessions_csv(db, str(target))
|
||||
db.add(CsvExport(actor=current_user.username, export_type="sessions", path=str(target), details={"format": "csv"}))
|
||||
return {"path": str(target)}
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
import uvicorn
|
||||
|
||||
uvicorn.run("ops_control_plane.main:app", host="0.0.0.0", port=8080, reload=False)
|
||||
192
infrastructure/ops_control_plane/app/ops_control_plane/models.py
Normal file
192
infrastructure/ops_control_plane/app/ops_control_plane/models.py
Normal file
@@ -0,0 +1,192 @@
|
||||
from __future__ import annotations
|
||||
|
||||
from datetime import datetime, timezone
|
||||
|
||||
from sqlalchemy import Boolean, DateTime, Float, ForeignKey, Integer, JSON, String, Text
|
||||
from sqlalchemy.orm import Mapped, mapped_column, relationship
|
||||
|
||||
from .database import Base
|
||||
|
||||
|
||||
def utcnow() -> datetime:
|
||||
return datetime.now(timezone.utc)
|
||||
|
||||
|
||||
class User(Base):
|
||||
__tablename__ = "users"
|
||||
id: Mapped[int] = mapped_column(Integer, primary_key=True)
|
||||
username: Mapped[str] = mapped_column(String(64), unique=True, index=True)
|
||||
password_hash: Mapped[str] = mapped_column(String(255))
|
||||
role: Mapped[str] = mapped_column(String(32), default="admin")
|
||||
is_active: Mapped[bool] = mapped_column(Boolean, default=True)
|
||||
created_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=utcnow)
|
||||
|
||||
|
||||
class MachineProfile(Base):
|
||||
__tablename__ = "machine_profiles"
|
||||
id: Mapped[int] = mapped_column(Integer, primary_key=True)
|
||||
name: Mapped[str] = mapped_column(String(64), unique=True)
|
||||
region: Mapped[str] = mapped_column(String(32))
|
||||
instance_type: Mapped[str] = mapped_column(String(32))
|
||||
gpu_label: Mapped[str] = mapped_column(String(64))
|
||||
vcpu: Mapped[int] = mapped_column(Integer)
|
||||
memory_gib: Mapped[float] = mapped_column(Float)
|
||||
preferred_lifecycle: Mapped[str] = mapped_column(String(16), default="spot")
|
||||
launch_config: Mapped[dict] = mapped_column(JSON, default=dict)
|
||||
intended_workloads: Mapped[list] = mapped_column(JSON, default=list)
|
||||
created_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=utcnow)
|
||||
|
||||
|
||||
class MarketSnapshot(Base):
|
||||
__tablename__ = "market_snapshots"
|
||||
id: Mapped[int] = mapped_column(Integer, primary_key=True)
|
||||
region: Mapped[str] = mapped_column(String(32), index=True)
|
||||
instance_type: Mapped[str] = mapped_column(String(32), index=True)
|
||||
lifecycle: Mapped[str] = mapped_column(String(16), index=True)
|
||||
offering_available: Mapped[bool] = mapped_column(Boolean, default=False)
|
||||
hourly_price_usd: Mapped[float | None] = mapped_column(Float, nullable=True)
|
||||
source: Mapped[str] = mapped_column(String(32), default="aws")
|
||||
raw_payload: Mapped[dict] = mapped_column(JSON, default=dict)
|
||||
observed_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=utcnow, index=True)
|
||||
|
||||
|
||||
class Machine(Base):
|
||||
__tablename__ = "machines"
|
||||
id: Mapped[int] = mapped_column(Integer, primary_key=True)
|
||||
aws_instance_id: Mapped[str] = mapped_column(String(32), unique=True, index=True)
|
||||
name: Mapped[str] = mapped_column(String(128))
|
||||
region: Mapped[str] = mapped_column(String(32))
|
||||
profile_name: Mapped[str | None] = mapped_column(String(64), nullable=True)
|
||||
instance_type: Mapped[str] = mapped_column(String(32))
|
||||
lifecycle: Mapped[str] = mapped_column(String(16))
|
||||
state: Mapped[str] = mapped_column(String(32))
|
||||
public_ip: Mapped[str | None] = mapped_column(String(64), nullable=True)
|
||||
private_ip: Mapped[str | None] = mapped_column(String(64), nullable=True)
|
||||
launch_time: Mapped[datetime | None] = mapped_column(DateTime(timezone=True), nullable=True)
|
||||
volume_gb: Mapped[int] = mapped_column(Integer, default=0)
|
||||
public_ipv4_attached: Mapped[bool] = mapped_column(Boolean, default=False)
|
||||
details: Mapped[dict] = mapped_column(JSON, default=dict)
|
||||
updated_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=utcnow, onupdate=utcnow)
|
||||
|
||||
|
||||
class WorkloadProfile(Base):
|
||||
__tablename__ = "workload_profiles"
|
||||
id: Mapped[int] = mapped_column(Integer, primary_key=True)
|
||||
name: Mapped[str] = mapped_column(String(64), unique=True)
|
||||
service_type: Mapped[str] = mapped_column(String(32))
|
||||
model_requirements: Mapped[list] = mapped_column(JSON, default=list)
|
||||
default_port: Mapped[int | None] = mapped_column(Integer, nullable=True)
|
||||
start_command: Mapped[str | None] = mapped_column(Text, nullable=True)
|
||||
stop_command: Mapped[str | None] = mapped_column(Text, nullable=True)
|
||||
healthcheck_path: Mapped[str | None] = mapped_column(String(255), nullable=True)
|
||||
route_hostname: Mapped[str | None] = mapped_column(String(255), nullable=True)
|
||||
|
||||
|
||||
class Job(Base):
|
||||
__tablename__ = "jobs"
|
||||
id: Mapped[int] = mapped_column(Integer, primary_key=True)
|
||||
job_type: Mapped[str] = mapped_column(String(32), index=True)
|
||||
status: Mapped[str] = mapped_column(String(32), index=True, default="queued")
|
||||
payload: Mapped[dict] = mapped_column(JSON, default=dict)
|
||||
result: Mapped[dict] = mapped_column(JSON, default=dict)
|
||||
actor: Mapped[str | None] = mapped_column(String(64), nullable=True)
|
||||
machine_id: Mapped[int | None] = mapped_column(ForeignKey("machines.id"), nullable=True)
|
||||
session_id: Mapped[int | None] = mapped_column(ForeignKey("sessions.id"), nullable=True)
|
||||
created_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=utcnow)
|
||||
started_at: Mapped[datetime | None] = mapped_column(DateTime(timezone=True), nullable=True)
|
||||
finished_at: Mapped[datetime | None] = mapped_column(DateTime(timezone=True), nullable=True)
|
||||
|
||||
|
||||
class Session(Base):
|
||||
__tablename__ = "sessions"
|
||||
id: Mapped[int] = mapped_column(Integer, primary_key=True)
|
||||
machine_id: Mapped[int | None] = mapped_column(ForeignKey("machines.id"), nullable=True)
|
||||
actor: Mapped[str | None] = mapped_column(String(64), nullable=True)
|
||||
workload_name: Mapped[str | None] = mapped_column(String(64), nullable=True)
|
||||
status: Mapped[str] = mapped_column(String(32), default="active")
|
||||
started_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=utcnow)
|
||||
ended_at: Mapped[datetime | None] = mapped_column(DateTime(timezone=True), nullable=True)
|
||||
notes: Mapped[str | None] = mapped_column(Text, nullable=True)
|
||||
cost_records: Mapped[list["SessionCost"]] = relationship(back_populates="session")
|
||||
|
||||
|
||||
class SessionCost(Base):
|
||||
__tablename__ = "session_costs"
|
||||
id: Mapped[int] = mapped_column(Integer, primary_key=True)
|
||||
session_id: Mapped[int] = mapped_column(ForeignKey("sessions.id"))
|
||||
runtime_hours: Mapped[float] = mapped_column(Float, default=0.0)
|
||||
compute_cost_usd: Mapped[float] = mapped_column(Float, default=0.0)
|
||||
storage_cost_usd: Mapped[float] = mapped_column(Float, default=0.0)
|
||||
public_ip_cost_usd: Mapped[float] = mapped_column(Float, default=0.0)
|
||||
total_cost_usd: Mapped[float] = mapped_column(Float, default=0.0)
|
||||
calculated_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=utcnow)
|
||||
session: Mapped[Session] = relationship(back_populates="cost_records")
|
||||
|
||||
|
||||
class ModelCatalog(Base):
|
||||
__tablename__ = "model_catalog"
|
||||
id: Mapped[int] = mapped_column(Integer, primary_key=True)
|
||||
model_key: Mapped[str] = mapped_column(String(128), unique=True)
|
||||
label: Mapped[str] = mapped_column(String(255))
|
||||
s3_prefix: Mapped[str] = mapped_column(String(512))
|
||||
expected_manifest: Mapped[dict] = mapped_column(JSON, default=dict)
|
||||
checksums: Mapped[dict] = mapped_column(JSON, default=dict)
|
||||
compatibility_tags: Mapped[list] = mapped_column(JSON, default=list)
|
||||
workload_tags: Mapped[list] = mapped_column(JSON, default=list)
|
||||
size_gb: Mapped[float | None] = mapped_column(Float, nullable=True)
|
||||
created_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=utcnow)
|
||||
|
||||
|
||||
class MachineModelCache(Base):
|
||||
__tablename__ = "machine_model_cache"
|
||||
id: Mapped[int] = mapped_column(Integer, primary_key=True)
|
||||
machine_id: Mapped[int] = mapped_column(ForeignKey("machines.id"))
|
||||
model_key: Mapped[str] = mapped_column(String(128))
|
||||
status: Mapped[str] = mapped_column(String(32), default="pending")
|
||||
path_on_machine: Mapped[str | None] = mapped_column(String(512), nullable=True)
|
||||
hydrated_at: Mapped[datetime | None] = mapped_column(DateTime(timezone=True), nullable=True)
|
||||
details: Mapped[dict] = mapped_column(JSON, default=dict)
|
||||
|
||||
|
||||
class RouteBinding(Base):
|
||||
__tablename__ = "route_bindings"
|
||||
id: Mapped[int] = mapped_column(Integer, primary_key=True)
|
||||
hostname: Mapped[str] = mapped_column(String(255), unique=True)
|
||||
target_type: Mapped[str] = mapped_column(String(32))
|
||||
target_host: Mapped[str] = mapped_column(String(255))
|
||||
target_port: Mapped[int] = mapped_column(Integer)
|
||||
scheme: Mapped[str] = mapped_column(String(16), default="http")
|
||||
status: Mapped[str] = mapped_column(String(32), default="active")
|
||||
details: Mapped[dict] = mapped_column(JSON, default=dict)
|
||||
updated_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=utcnow, onupdate=utcnow)
|
||||
|
||||
|
||||
class ServiceState(Base):
|
||||
__tablename__ = "service_states"
|
||||
id: Mapped[int] = mapped_column(Integer, primary_key=True)
|
||||
machine_id: Mapped[int | None] = mapped_column(ForeignKey("machines.id"), nullable=True)
|
||||
service_name: Mapped[str] = mapped_column(String(64))
|
||||
status: Mapped[str] = mapped_column(String(32))
|
||||
details: Mapped[dict] = mapped_column(JSON, default=dict)
|
||||
updated_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=utcnow, onupdate=utcnow)
|
||||
|
||||
|
||||
class AuditEvent(Base):
|
||||
__tablename__ = "audit_events"
|
||||
id: Mapped[int] = mapped_column(Integer, primary_key=True)
|
||||
actor: Mapped[str | None] = mapped_column(String(64), nullable=True)
|
||||
action: Mapped[str] = mapped_column(String(64))
|
||||
entity_type: Mapped[str] = mapped_column(String(64))
|
||||
entity_id: Mapped[str | None] = mapped_column(String(128), nullable=True)
|
||||
payload: Mapped[dict] = mapped_column(JSON, default=dict)
|
||||
created_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=utcnow)
|
||||
|
||||
|
||||
class CsvExport(Base):
|
||||
__tablename__ = "csv_exports"
|
||||
id: Mapped[int] = mapped_column(Integer, primary_key=True)
|
||||
actor: Mapped[str | None] = mapped_column(String(64), nullable=True)
|
||||
export_type: Mapped[str] = mapped_column(String(64))
|
||||
path: Mapped[str] = mapped_column(String(512))
|
||||
details: Mapped[dict] = mapped_column(JSON, default=dict)
|
||||
created_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=utcnow)
|
||||
@@ -0,0 +1,45 @@
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import subprocess
|
||||
|
||||
from .config import settings
|
||||
|
||||
|
||||
def run_ingress_command(command: str) -> subprocess.CompletedProcess[str]:
|
||||
return subprocess.run(
|
||||
[
|
||||
"ssh",
|
||||
"-o",
|
||||
"StrictHostKeyChecking=no",
|
||||
"-o",
|
||||
"UserKnownHostsFile=NUL",
|
||||
"-i",
|
||||
str(settings.ssh_key_path),
|
||||
"-p",
|
||||
str(settings.ingress_ssh_port),
|
||||
f"{settings.ingress_ssh_user}@{settings.ingress_ssh_host}",
|
||||
command,
|
||||
],
|
||||
capture_output=True,
|
||||
text=True,
|
||||
check=False,
|
||||
)
|
||||
|
||||
|
||||
def apply_route(hostname: str, scheme: str, target_host: str, target_port: int) -> dict:
|
||||
payload = json.dumps(
|
||||
{"hostname": hostname, "scheme": scheme, "target_host": target_host, "target_port": target_port}
|
||||
)
|
||||
result = run_ingress_command(
|
||||
f"sudo {settings.ingress_route_helper} upsert '{payload}' && sudo systemctl reload caddy"
|
||||
)
|
||||
return {"stdout": result.stdout, "stderr": result.stderr, "returncode": result.returncode}
|
||||
|
||||
|
||||
def remove_route(hostname: str) -> dict:
|
||||
result = run_ingress_command(
|
||||
f"sudo {settings.ingress_route_helper} delete {hostname} && sudo systemctl reload caddy"
|
||||
)
|
||||
return {"stdout": result.stdout, "stderr": result.stderr, "returncode": result.returncode}
|
||||
|
||||
@@ -0,0 +1,30 @@
|
||||
from __future__ import annotations
|
||||
|
||||
from fastapi import Depends, HTTPException, Request, status
|
||||
from passlib.context import CryptContext
|
||||
from sqlalchemy import select
|
||||
from sqlalchemy.orm import Session
|
||||
|
||||
from .database import get_db
|
||||
from .models import User
|
||||
|
||||
|
||||
pwd_context = CryptContext(schemes=["pbkdf2_sha256"], deprecated="auto")
|
||||
|
||||
|
||||
def hash_password(password: str) -> str:
|
||||
return pwd_context.hash(password)
|
||||
|
||||
|
||||
def verify_password(password: str, password_hash: str) -> bool:
|
||||
return pwd_context.verify(password, password_hash)
|
||||
|
||||
|
||||
def get_current_user(request: Request, db: Session = Depends(get_db)) -> User:
|
||||
username = request.session.get("username")
|
||||
if not username:
|
||||
raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED)
|
||||
user = db.scalar(select(User).where(User.username == username, User.is_active.is_(True)))
|
||||
if not user:
|
||||
raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED)
|
||||
return user
|
||||
160
infrastructure/ops_control_plane/app/ops_control_plane/seed.py
Normal file
160
infrastructure/ops_control_plane/app/ops_control_plane/seed.py
Normal file
@@ -0,0 +1,160 @@
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
|
||||
from sqlalchemy import select
|
||||
from sqlalchemy.orm import Session
|
||||
|
||||
from .config import settings
|
||||
from .models import MachineProfile, ModelCatalog, User, WorkloadProfile
|
||||
from .security import hash_password
|
||||
|
||||
|
||||
DEFAULT_MACHINE_PROFILES = [
|
||||
{
|
||||
"name": "t4g-micro-ingress",
|
||||
"region": "us-east-1",
|
||||
"instance_type": "t4g.micro",
|
||||
"gpu_label": "Ingress CPU",
|
||||
"vcpu": 2,
|
||||
"memory_gib": 1.0,
|
||||
"preferred_lifecycle": "on-demand",
|
||||
"intended_workloads": ["ingress"],
|
||||
},
|
||||
{
|
||||
"name": "g6-xlarge",
|
||||
"region": "us-east-1",
|
||||
"instance_type": "g6.xlarge",
|
||||
"gpu_label": "1x NVIDIA L4",
|
||||
"vcpu": 4,
|
||||
"memory_gib": 16.0,
|
||||
"preferred_lifecycle": "spot",
|
||||
"intended_workloads": ["light-comfy", "qwen-edit"],
|
||||
},
|
||||
{
|
||||
"name": "g6-2xlarge",
|
||||
"region": "us-east-1",
|
||||
"instance_type": "g6.2xlarge",
|
||||
"gpu_label": "1x NVIDIA L4",
|
||||
"vcpu": 8,
|
||||
"memory_gib": 32.0,
|
||||
"preferred_lifecycle": "spot",
|
||||
"intended_workloads": ["comfyui", "qwen-edit"],
|
||||
},
|
||||
{
|
||||
"name": "g6-4xlarge",
|
||||
"region": "us-east-1",
|
||||
"instance_type": "g6.4xlarge",
|
||||
"gpu_label": "1x NVIDIA L4",
|
||||
"vcpu": 16,
|
||||
"memory_gib": 64.0,
|
||||
"preferred_lifecycle": "spot",
|
||||
"intended_workloads": ["comfyui", "wan-video", "qwen-edit"],
|
||||
},
|
||||
{
|
||||
"name": "g6-12xlarge",
|
||||
"region": "us-east-1",
|
||||
"instance_type": "g6.12xlarge",
|
||||
"gpu_label": "4x NVIDIA L4",
|
||||
"vcpu": 48,
|
||||
"memory_gib": 192.0,
|
||||
"preferred_lifecycle": "spot",
|
||||
"intended_workloads": ["comfyui", "batch-storyboard", "qwen-edit", "multi-gpu"],
|
||||
},
|
||||
]
|
||||
|
||||
|
||||
DEFAULT_WORKLOADS = [
|
||||
{
|
||||
"name": "comfyui",
|
||||
"service_type": "systemd",
|
||||
"model_requirements": [],
|
||||
"default_port": 8188,
|
||||
"start_command": "sudo systemctl start comfyui",
|
||||
"stop_command": "sudo systemctl stop comfyui",
|
||||
"healthcheck_path": "/",
|
||||
"route_hostname": "comfy.desineuron.in",
|
||||
},
|
||||
]
|
||||
|
||||
|
||||
DEFAULT_MODELS = [
|
||||
{
|
||||
"model_key": "qwen-image-edit-2511",
|
||||
"label": "Qwen Image Edit 2511",
|
||||
"s3_prefix": "models/qwen-image-edit-2511/",
|
||||
"compatibility_tags": ["qwen", "image-edit"],
|
||||
"workload_tags": ["comfyui", "qwen-edit"],
|
||||
},
|
||||
{
|
||||
"model_key": "qwen-image-2512",
|
||||
"label": "Qwen Image 2512",
|
||||
"s3_prefix": "models/qwen-image-2512/",
|
||||
"compatibility_tags": ["qwen", "image"],
|
||||
"workload_tags": ["comfyui", "qwen-image"],
|
||||
},
|
||||
]
|
||||
|
||||
|
||||
def seed_defaults(db: Session) -> None:
|
||||
if not db.scalar(select(User).where(User.username == settings.admin_username)):
|
||||
db.add(
|
||||
User(
|
||||
username=settings.admin_username,
|
||||
password_hash=hash_password(settings.admin_password),
|
||||
role="admin",
|
||||
)
|
||||
)
|
||||
|
||||
try:
|
||||
team_users = json.loads(settings.team_users_json)
|
||||
except json.JSONDecodeError:
|
||||
team_users = []
|
||||
for row in team_users:
|
||||
username = row.get("username")
|
||||
password = row.get("password")
|
||||
role = row.get("role", "operator")
|
||||
if not username or not password:
|
||||
continue
|
||||
existing_user = db.scalar(select(User).where(User.username == username))
|
||||
if existing_user:
|
||||
existing_user.role = role
|
||||
existing_user.is_active = True
|
||||
if row.get("reset_password"):
|
||||
existing_user.password_hash = hash_password(password)
|
||||
continue
|
||||
db.add(User(username=username, password_hash=hash_password(password), role=role))
|
||||
|
||||
for profile in DEFAULT_MACHINE_PROFILES:
|
||||
existing = db.scalar(select(MachineProfile).where(MachineProfile.name == profile["name"]))
|
||||
if existing:
|
||||
existing.launch_config = {
|
||||
"ami_id": settings.gpu_ami_id,
|
||||
"subnet_id": settings.gpu_subnet_id,
|
||||
"security_group_ids": list(settings.gpu_security_group_ids),
|
||||
"key_name": settings.gpu_key_name,
|
||||
"instance_profile": settings.gpu_instance_profile,
|
||||
"root_volume_gb": settings.gpu_root_volume_gb,
|
||||
}
|
||||
continue
|
||||
db.add(
|
||||
MachineProfile(
|
||||
**profile,
|
||||
launch_config={
|
||||
"ami_id": settings.gpu_ami_id,
|
||||
"subnet_id": settings.gpu_subnet_id,
|
||||
"security_group_ids": list(settings.gpu_security_group_ids),
|
||||
"key_name": settings.gpu_key_name,
|
||||
"instance_profile": settings.gpu_instance_profile,
|
||||
"root_volume_gb": settings.gpu_root_volume_gb,
|
||||
},
|
||||
)
|
||||
)
|
||||
|
||||
for workload in DEFAULT_WORKLOADS:
|
||||
if not db.scalar(select(WorkloadProfile).where(WorkloadProfile.name == workload["name"])):
|
||||
db.add(WorkloadProfile(**workload))
|
||||
|
||||
for model in DEFAULT_MODELS:
|
||||
if not db.scalar(select(ModelCatalog).where(ModelCatalog.model_key == model["model_key"])):
|
||||
db.add(ModelCatalog(**model))
|
||||
@@ -0,0 +1,209 @@
|
||||
html{color-scheme:dark}
|
||||
body{
|
||||
font-family:Segoe UI,system-ui,sans-serif;
|
||||
background:
|
||||
radial-gradient(circle at top right, rgba(220,38,38,.18), transparent 28%),
|
||||
radial-gradient(circle at left 20%, rgba(239,68,68,.09), transparent 24%),
|
||||
linear-gradient(180deg, #020202 0%, #070707 100%);
|
||||
color:#f5f5f5;
|
||||
margin:0;
|
||||
min-height:100vh;
|
||||
}
|
||||
.hud-grid{
|
||||
position:fixed;
|
||||
inset:0;
|
||||
pointer-events:none;
|
||||
background-image:
|
||||
linear-gradient(rgba(255,255,255,.02) 1px, transparent 1px),
|
||||
linear-gradient(90deg, rgba(255,255,255,.02) 1px, transparent 1px);
|
||||
background-size:32px 32px;
|
||||
mask-image:linear-gradient(180deg, rgba(0,0,0,.35), rgba(0,0,0,.85));
|
||||
}
|
||||
.topbar{
|
||||
position:sticky;
|
||||
top:0;
|
||||
z-index:10;
|
||||
display:flex;
|
||||
justify-content:space-between;
|
||||
align-items:center;
|
||||
padding:22px 30px;
|
||||
background:rgba(10,10,10,.9);
|
||||
backdrop-filter:blur(18px);
|
||||
border-bottom:1px solid rgba(255,255,255,.07);
|
||||
box-shadow:0 10px 40px rgba(0,0,0,.4);
|
||||
}
|
||||
.topbar h1{
|
||||
margin:0;
|
||||
font-size:24px;
|
||||
letter-spacing:.04em;
|
||||
text-transform:uppercase;
|
||||
}
|
||||
.topbar p{
|
||||
margin:5px 0 0;
|
||||
color:#b8b8b8;
|
||||
max-width:760px;
|
||||
}
|
||||
.topbar-actions{
|
||||
display:flex;
|
||||
gap:12px;
|
||||
align-items:center;
|
||||
}
|
||||
.user-chip{
|
||||
display:inline-flex;
|
||||
align-items:center;
|
||||
padding:8px 12px;
|
||||
border:1px solid rgba(248,113,113,.45);
|
||||
border-radius:999px;
|
||||
color:#fca5a5;
|
||||
background:rgba(127,29,29,.22);
|
||||
box-shadow:0 0 24px rgba(220,38,38,.15) inset;
|
||||
}
|
||||
.topbar-actions a,.button,button{
|
||||
display:inline-flex;
|
||||
align-items:center;
|
||||
justify-content:center;
|
||||
gap:8px;
|
||||
background:linear-gradient(180deg, #ef4444 0%, #991b1b 100%);
|
||||
color:#fff;
|
||||
border:1px solid rgba(248,113,113,.5);
|
||||
border-radius:12px;
|
||||
padding:10px 14px;
|
||||
text-decoration:none;
|
||||
cursor:pointer;
|
||||
box-shadow:0 0 24px rgba(220,38,38,.18);
|
||||
}
|
||||
.button.secondary,button.secondary{
|
||||
background:rgba(255,255,255,.04);
|
||||
border-color:rgba(255,255,255,.14);
|
||||
color:#fff;
|
||||
box-shadow:none;
|
||||
}
|
||||
.button.danger,button.danger{
|
||||
background:linear-gradient(180deg, #dc2626 0%, #7f1d1d 100%);
|
||||
}
|
||||
.page{
|
||||
position:relative;
|
||||
padding:26px;
|
||||
}
|
||||
.grid{display:grid;gap:20px}
|
||||
.grid.two{grid-template-columns:repeat(2,minmax(0,1fr))}
|
||||
.grid.three{grid-template-columns:repeat(3,minmax(0,1fr))}
|
||||
.summary-grid{display:grid;grid-template-columns:repeat(4,minmax(0,1fr));gap:20px;margin-bottom:20px}
|
||||
.card{
|
||||
position:relative;
|
||||
overflow:hidden;
|
||||
background:linear-gradient(180deg, rgba(16,16,16,.88) 0%, rgba(8,8,8,.92) 100%);
|
||||
border:1px solid rgba(255,255,255,.08);
|
||||
border-radius:20px;
|
||||
padding:22px;
|
||||
margin-bottom:20px;
|
||||
box-shadow:
|
||||
0 16px 40px rgba(0,0,0,.45),
|
||||
0 0 0 1px rgba(255,255,255,.02) inset;
|
||||
}
|
||||
.card::after{
|
||||
content:"";
|
||||
position:absolute;
|
||||
inset:auto -20% -60% auto;
|
||||
width:180px;
|
||||
height:180px;
|
||||
background:radial-gradient(circle, rgba(220,38,38,.16), transparent 65%);
|
||||
pointer-events:none;
|
||||
}
|
||||
.card h2{
|
||||
margin:0 0 16px;
|
||||
font-size:18px;
|
||||
letter-spacing:.04em;
|
||||
text-transform:uppercase;
|
||||
}
|
||||
.card.narrow{max-width:460px;margin:90px auto}
|
||||
.card.stat strong{
|
||||
display:block;
|
||||
font-size:30px;
|
||||
margin:8px 0;
|
||||
color:#fff;
|
||||
}
|
||||
.eyebrow{
|
||||
color:#f87171;
|
||||
font-size:11px;
|
||||
letter-spacing:.18em;
|
||||
text-transform:uppercase;
|
||||
}
|
||||
.flash{
|
||||
display:flex;
|
||||
gap:12px;
|
||||
align-items:center;
|
||||
}
|
||||
.flash.success{
|
||||
border-color:rgba(248,113,113,.35);
|
||||
background:linear-gradient(180deg, rgba(127,29,29,.25) 0%, rgba(18,18,18,.95) 100%);
|
||||
}
|
||||
.flash.error{
|
||||
border-color:rgba(248,113,113,.6);
|
||||
background:linear-gradient(180deg, rgba(69,10,10,.55) 0%, rgba(18,18,18,.95) 100%);
|
||||
}
|
||||
.stack{display:grid;gap:12px}
|
||||
.action-stack{display:grid;gap:8px}
|
||||
.plain-list{padding-left:18px;margin:0;display:grid;gap:8px;color:#d6d6d6}
|
||||
.kv-list{display:grid;gap:10px}
|
||||
.kv-list div{display:flex;justify-content:space-between;gap:12px}
|
||||
.checkbox-row{
|
||||
display:flex;
|
||||
align-items:center;
|
||||
gap:10px;
|
||||
color:#f5f5f5;
|
||||
}
|
||||
label{display:grid;gap:6px;color:#d0d0d0}
|
||||
input,select{
|
||||
padding:11px 12px;
|
||||
border-radius:12px;
|
||||
border:1px solid rgba(255,255,255,.12);
|
||||
background:rgba(255,255,255,.03);
|
||||
color:#fff;
|
||||
outline:none;
|
||||
}
|
||||
input:focus,select:focus{
|
||||
border-color:rgba(248,113,113,.75);
|
||||
box-shadow:0 0 0 3px rgba(220,38,38,.16);
|
||||
}
|
||||
table{width:100%;border-collapse:collapse}
|
||||
th,td{
|
||||
padding:12px 10px;
|
||||
border-bottom:1px solid rgba(255,255,255,.08);
|
||||
text-align:left;
|
||||
vertical-align:top;
|
||||
}
|
||||
th{
|
||||
color:#fca5a5;
|
||||
font-weight:600;
|
||||
font-size:12px;
|
||||
letter-spacing:.08em;
|
||||
text-transform:uppercase;
|
||||
}
|
||||
.pill{
|
||||
display:inline-block;
|
||||
padding:4px 10px;
|
||||
border-radius:999px;
|
||||
font-size:12px;
|
||||
background:rgba(255,255,255,.06);
|
||||
color:#f3f3f3;
|
||||
}
|
||||
.pill.available{
|
||||
background:rgba(127,29,29,.45);
|
||||
color:#fecaca;
|
||||
border:1px solid rgba(248,113,113,.3);
|
||||
}
|
||||
.pill.unavailable{
|
||||
background:rgba(31,31,31,.9);
|
||||
color:#d4d4d4;
|
||||
}
|
||||
.pill.unknown{
|
||||
background:rgba(55,65,81,.5);
|
||||
color:#e5e7eb;
|
||||
}
|
||||
.muted{color:#a3a3a3;font-size:12px}
|
||||
.error{color:#fca5a5}
|
||||
|
||||
@media (max-width: 1100px){
|
||||
.grid.two,.grid.three,.summary-grid{grid-template-columns:1fr}
|
||||
}
|
||||
@@ -0,0 +1,27 @@
|
||||
<!doctype html>
|
||||
<html lang="en">
|
||||
<head>
|
||||
<meta charset="utf-8">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1">
|
||||
<title>{{ title or "Desineuron Ops" }}</title>
|
||||
<link rel="stylesheet" href="/static/style.css">
|
||||
</head>
|
||||
<body>
|
||||
<div class="hud-grid" aria-hidden="true"></div>
|
||||
<header class="topbar">
|
||||
<div>
|
||||
<h1>Desineuron Ops Control Plane</h1>
|
||||
<p>Linux-hosted AWS control surface for machines, models, routes, and cost</p>
|
||||
</div>
|
||||
{% if user %}
|
||||
<div class="topbar-actions">
|
||||
<span class="user-chip">{{ user.username }}</span>
|
||||
<a href="/logout">Logout</a>
|
||||
</div>
|
||||
{% endif %}
|
||||
</header>
|
||||
<main class="page">
|
||||
{% block content %}{% endblock %}
|
||||
</main>
|
||||
</body>
|
||||
</html>
|
||||
@@ -0,0 +1,355 @@
|
||||
{% extends "base.html" %}
|
||||
{% block content %}
|
||||
{% if flash %}
|
||||
<section class="card flash {{ flash.level }}">
|
||||
<strong>{{ flash.level|capitalize }}</strong>
|
||||
<span>{{ flash.message }}</span>
|
||||
</section>
|
||||
{% endif %}
|
||||
|
||||
<section class="summary-grid">
|
||||
<article class="card stat">
|
||||
<span class="eyebrow">Machines</span>
|
||||
<strong>{{ summary.machine_count }}</strong>
|
||||
<span class="muted">Known AWS nodes</span>
|
||||
</article>
|
||||
<article class="card stat">
|
||||
<span class="eyebrow">Hourly Burn</span>
|
||||
<strong>${{ summary.hourly_burn_usd }}</strong>
|
||||
<span class="muted">Estimated live blended hourly cost</span>
|
||||
</article>
|
||||
<article class="card stat">
|
||||
<span class="eyebrow">24h Cost</span>
|
||||
<strong>${{ summary.last_24h_usd }}</strong>
|
||||
<span class="muted">Rolling 24 hour estimate</span>
|
||||
</article>
|
||||
<article class="card stat">
|
||||
<span class="eyebrow">30d Cost</span>
|
||||
<strong>${{ summary.last_30d_usd }}</strong>
|
||||
<span class="muted">Rolling 30 day estimate</span>
|
||||
</article>
|
||||
</section>
|
||||
|
||||
<div class="grid three">
|
||||
<section class="card">
|
||||
<h2>Control Surface</h2>
|
||||
<div class="kv-list">
|
||||
<div><span>Bucket</span><strong>{{ bucket_name or "not configured" }}</strong></div>
|
||||
<div><span>Visible regions</span><strong>{{ regions|join(", ") }}</strong></div>
|
||||
<div><span>Active sessions</span><strong>{{ summary.active_sessions }}</strong></div>
|
||||
<div><span>Active jobs</span><strong>{{ summary.active_jobs }}</strong></div>
|
||||
<div><span>Active routes</span><strong>{{ summary.routes_active }}</strong></div>
|
||||
<div><span>Fleet est. cost</span><strong>${{ summary.fleet_estimated_cost_usd }}</strong></div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section class="card">
|
||||
<h2>Launch Machine</h2>
|
||||
<form method="post" action="/api/machines/launch" class="stack">
|
||||
<label>Profile
|
||||
<select name="profile_name">
|
||||
{% for profile in profiles %}
|
||||
<option value="{{ profile.name }}">{{ profile.name }} | {{ profile.instance_type }} | {{ profile.gpu_label }}</option>
|
||||
{% endfor %}
|
||||
</select>
|
||||
</label>
|
||||
<label>Lifecycle
|
||||
<select name="lifecycle">
|
||||
<option value="spot">spot</option>
|
||||
<option value="on-demand">on-demand</option>
|
||||
</select>
|
||||
</label>
|
||||
<button type="submit">Launch Selected Machine</button>
|
||||
</form>
|
||||
</section>
|
||||
|
||||
<section class="card">
|
||||
<h2>Runbooks</h2>
|
||||
<ul class="plain-list">
|
||||
<li>1. Launch preferred GPU profile.</li>
|
||||
<li>2. Hydrate required model from S3.</li>
|
||||
<li>3. Start workload and optionally map route.</li>
|
||||
<li>4. Monitor runtime and estimated cost.</li>
|
||||
<li>5. Stop or terminate the node when done.</li>
|
||||
</ul>
|
||||
<a class="button secondary" href="/api/exports/csv">Export Sessions CSV</a>
|
||||
</section>
|
||||
</div>
|
||||
|
||||
<section class="card">
|
||||
<h2>Markets</h2>
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Profile</th>
|
||||
<th>Instance</th>
|
||||
<th>GPU</th>
|
||||
<th>vCPU / RAM</th>
|
||||
<th>Region</th>
|
||||
<th>On-Demand</th>
|
||||
<th>Spot</th>
|
||||
<th>Preferred Use</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
{% for profile in profiles %}
|
||||
{% for region in regions %}
|
||||
{% set ns = namespace(on_demand='-', on_demand_status='unknown', spot='-', spot_status='unknown') %}
|
||||
{% for market in market_rows %}
|
||||
{% if market.region == region and market.instance_type == profile.instance_type and market.lifecycle == 'on-demand' %}
|
||||
{% set ns.on_demand = '$' ~ market.hourly_price_usd if market.hourly_price_usd is not none else '-' %}
|
||||
{% set ns.on_demand_status = 'available' if market.offering_available else 'unavailable' %}
|
||||
{% endif %}
|
||||
{% if market.region == region and market.instance_type == profile.instance_type and market.lifecycle == 'spot' %}
|
||||
{% set ns.spot = '$' ~ market.hourly_price_usd if market.hourly_price_usd is not none else '-' %}
|
||||
{% set ns.spot_status = 'available' if market.offering_available else 'unavailable' %}
|
||||
{% endif %}
|
||||
{% endfor %}
|
||||
<tr>
|
||||
<td>{{ profile.name }}</td>
|
||||
<td>{{ profile.instance_type }}</td>
|
||||
<td>{{ profile.gpu_label }}</td>
|
||||
<td>{{ profile.vcpu }} / {{ profile.memory_gib }} GiB</td>
|
||||
<td>{{ region }}</td>
|
||||
<td><span class="pill {{ ns.on_demand_status }}">{{ ns.on_demand }}</span></td>
|
||||
<td><span class="pill {{ ns.spot_status }}">{{ ns.spot }}</span></td>
|
||||
<td>{{ profile.intended_workloads|join(", ") }}</td>
|
||||
</tr>
|
||||
{% endfor %}
|
||||
{% endfor %}
|
||||
</tbody>
|
||||
</table>
|
||||
</section>
|
||||
|
||||
<section class="card">
|
||||
<h2>Machines</h2>
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Name</th>
|
||||
<th>Type</th>
|
||||
<th>State</th>
|
||||
<th>IPs</th>
|
||||
<th>Runtime</th>
|
||||
<th>Cost</th>
|
||||
<th>Actions</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
{% for machine in machines %}
|
||||
<tr>
|
||||
<td>
|
||||
<strong>{{ machine.name }}</strong>
|
||||
<div class="muted">{{ machine.aws_instance_id }}</div>
|
||||
</td>
|
||||
<td>
|
||||
<div>{{ machine.instance_type }}</div>
|
||||
<div class="muted">{{ machine.lifecycle }} / {{ machine.region }}</div>
|
||||
</td>
|
||||
<td>{{ machine.state }}</td>
|
||||
<td>
|
||||
<div>{{ machine.public_ip or "-" }}</div>
|
||||
<div class="muted">{{ machine.private_ip or "-" }}</div>
|
||||
</td>
|
||||
<td>{{ costs[machine.aws_instance_id].runtime_hours if machine.aws_instance_id in costs else "-" }} h</td>
|
||||
<td>
|
||||
<div>${{ costs[machine.aws_instance_id].total_cost_usd if machine.aws_instance_id in costs else "-" }}</div>
|
||||
<div class="muted">${{ costs[machine.aws_instance_id].hourly_price_usd if machine.aws_instance_id in costs else "-" }}/hr</div>
|
||||
</td>
|
||||
<td>
|
||||
<div class="action-stack">
|
||||
<form method="post" action="/api/machines/{{ machine.id }}/stop">
|
||||
<button type="submit" class="button secondary">Stop</button>
|
||||
</form>
|
||||
<form method="post" action="/api/machines/{{ machine.id }}/terminate">
|
||||
<button type="submit" class="button danger">Terminate</button>
|
||||
</form>
|
||||
</div>
|
||||
</td>
|
||||
</tr>
|
||||
{% endfor %}
|
||||
</tbody>
|
||||
</table>
|
||||
</section>
|
||||
|
||||
<div class="grid two">
|
||||
<section class="card">
|
||||
<h2>Model Library Ingest</h2>
|
||||
<form method="post" action="/api/models/register" class="stack">
|
||||
<label>Model Key <input type="text" name="model_key" placeholder="qwen-image-edit-2511" required></label>
|
||||
<label>Label <input type="text" name="label" placeholder="Qwen Image Edit 2511" required></label>
|
||||
<label>Source Path Under Linux Model Library <input type="text" name="source_relative_path" placeholder="Qwen-Image-Edit-2511" required></label>
|
||||
<label>Workload Tags <input type="text" name="workload_tags" placeholder="comfyui, qwen-edit"></label>
|
||||
<label>Compatibility Tags <input type="text" name="compatibility_tags" placeholder="qwen, image-edit"></label>
|
||||
<button type="submit">Upload to S3 + Generate Manifest</button>
|
||||
</form>
|
||||
</section>
|
||||
|
||||
<section class="card">
|
||||
<h2>Hydrate Model</h2>
|
||||
<form method="post" action="/api/models/hydrate" class="stack">
|
||||
<label>Machine
|
||||
<select name="machine_id">
|
||||
{% for machine in machines %}
|
||||
<option value="{{ machine.id }}">{{ machine.name }} ({{ machine.aws_instance_id }})</option>
|
||||
{% endfor %}
|
||||
</select>
|
||||
</label>
|
||||
<label>Model
|
||||
<select name="model_key">
|
||||
{% for model in models %}
|
||||
<option value="{{ model.model_key }}">{{ model.label }}</option>
|
||||
{% endfor %}
|
||||
</select>
|
||||
</label>
|
||||
<button type="submit">Hydrate from S3</button>
|
||||
</form>
|
||||
</section>
|
||||
|
||||
<section class="card">
|
||||
<h2>Start Workload</h2>
|
||||
<form method="post" action="/api/workloads/start" class="stack">
|
||||
<label>Machine
|
||||
<select name="machine_id">
|
||||
{% for machine in machines %}
|
||||
<option value="{{ machine.id }}">{{ machine.name }}</option>
|
||||
{% endfor %}
|
||||
</select>
|
||||
</label>
|
||||
<label>Workload
|
||||
<select name="workload_name">
|
||||
{% for workload in workloads %}
|
||||
<option value="{{ workload.name }}">{{ workload.name }}</option>
|
||||
{% endfor %}
|
||||
</select>
|
||||
</label>
|
||||
<label class="checkbox-row"><input type="checkbox" name="auto_route" value="true"> Auto-map workload hostname via ingress</label>
|
||||
<button type="submit">Start Workload</button>
|
||||
</form>
|
||||
</section>
|
||||
</div>
|
||||
|
||||
<section class="card">
|
||||
<h2>Registered Models</h2>
|
||||
<table>
|
||||
<thead>
|
||||
<tr><th>Model</th><th>S3 Prefix</th><th>Size</th><th>Files</th><th>Tags</th></tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
{% for model in models %}
|
||||
<tr>
|
||||
<td>
|
||||
<strong>{{ model.label }}</strong>
|
||||
<div class="muted">{{ model.model_key }}</div>
|
||||
</td>
|
||||
<td>{{ model.s3_prefix }}</td>
|
||||
<td>{{ model.size_gb or "-" }} GiB</td>
|
||||
<td>{{ model.expected_manifest.file_count if model.expected_manifest else "-" }}</td>
|
||||
<td>
|
||||
<div>{{ model.workload_tags|join(", ") }}</div>
|
||||
<div class="muted">{{ model.compatibility_tags|join(", ") }}</div>
|
||||
</td>
|
||||
</tr>
|
||||
{% endfor %}
|
||||
</tbody>
|
||||
</table>
|
||||
</section>
|
||||
|
||||
<div class="grid two">
|
||||
<section class="card">
|
||||
<h2>Route Management</h2>
|
||||
<form method="post" action="/api/routes/map" class="stack">
|
||||
<label>Hostname <input type="text" name="hostname" placeholder="gpu-ui.desineuron.in" required></label>
|
||||
<label>Scheme
|
||||
<select name="scheme">
|
||||
<option value="http">http</option>
|
||||
<option value="https">https</option>
|
||||
</select>
|
||||
</label>
|
||||
<label>Target Host <input type="text" name="target_host" placeholder="172.31.x.x" required></label>
|
||||
<label>Target Port <input type="number" name="target_port" value="8188" required></label>
|
||||
<button type="submit">Map Route</button>
|
||||
</form>
|
||||
<table>
|
||||
<thead>
|
||||
<tr><th>Hostname</th><th>Target</th><th>Status</th><th>Action</th></tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
{% for route in routes %}
|
||||
<tr>
|
||||
<td>{{ route.hostname }}</td>
|
||||
<td>{{ route.scheme }}://{{ route.target_host }}:{{ route.target_port }}</td>
|
||||
<td>{{ route.status }}</td>
|
||||
<td>
|
||||
<form method="post" action="/api/routes/unmap">
|
||||
<input type="hidden" name="hostname" value="{{ route.hostname }}">
|
||||
<button type="submit" class="button secondary">Unmap</button>
|
||||
</form>
|
||||
</td>
|
||||
</tr>
|
||||
{% endfor %}
|
||||
</tbody>
|
||||
</table>
|
||||
</section>
|
||||
|
||||
<section class="card">
|
||||
<h2>Recent Sessions</h2>
|
||||
<table>
|
||||
<thead>
|
||||
<tr><th>Actor</th><th>Workload</th><th>Status</th><th>Started</th></tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
{% for session in sessions %}
|
||||
<tr>
|
||||
<td>{{ session.actor }}</td>
|
||||
<td>{{ session.workload_name }}</td>
|
||||
<td>{{ session.status }}</td>
|
||||
<td>{{ session.started_at }}</td>
|
||||
</tr>
|
||||
{% endfor %}
|
||||
</tbody>
|
||||
</table>
|
||||
</section>
|
||||
</div>
|
||||
|
||||
<div class="grid two">
|
||||
<section class="card">
|
||||
<h2>Recent Jobs</h2>
|
||||
<table>
|
||||
<thead>
|
||||
<tr><th>ID</th><th>Type</th><th>Status</th><th>Actor</th><th>Created</th></tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
{% for job in jobs %}
|
||||
<tr>
|
||||
<td>{{ job.id }}</td>
|
||||
<td>{{ job.job_type }}</td>
|
||||
<td>{{ job.status }}</td>
|
||||
<td>{{ job.actor or "-" }}</td>
|
||||
<td>{{ job.created_at }}</td>
|
||||
</tr>
|
||||
{% endfor %}
|
||||
</tbody>
|
||||
</table>
|
||||
</section>
|
||||
|
||||
<section class="card">
|
||||
<h2>Audit</h2>
|
||||
<table>
|
||||
<thead>
|
||||
<tr><th>Actor</th><th>Action</th><th>Entity</th><th>Time</th></tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
{% for event in audits %}
|
||||
<tr>
|
||||
<td>{{ event.actor or "-" }}</td>
|
||||
<td>{{ event.action }}</td>
|
||||
<td>{{ event.entity_type }} / {{ event.entity_id }}</td>
|
||||
<td>{{ event.created_at }}</td>
|
||||
</tr>
|
||||
{% endfor %}
|
||||
</tbody>
|
||||
</table>
|
||||
</section>
|
||||
</div>
|
||||
{% endblock %}
|
||||
@@ -0,0 +1,14 @@
|
||||
{% extends "base.html" %}
|
||||
{% block content %}
|
||||
<section class="card narrow">
|
||||
<p class="eyebrow">Private Surface</p>
|
||||
<h2>Login</h2>
|
||||
<p class="muted">Use your Desineuron operator account.</p>
|
||||
{% if error %}<p class="error">{{ error }}</p>{% endif %}
|
||||
<form method="post" action="/login" class="stack">
|
||||
<label>Email or username <input type="text" name="username" required></label>
|
||||
<label>Password <input type="password" name="password" required></label>
|
||||
<button type="submit">Enter Ops Console</button>
|
||||
</form>
|
||||
</section>
|
||||
{% endblock %}
|
||||
@@ -0,0 +1,50 @@
|
||||
from __future__ import annotations
|
||||
|
||||
import time
|
||||
from datetime import datetime, timedelta, timezone
|
||||
|
||||
from sqlalchemy import select
|
||||
|
||||
from .aws_control import latest_market_price, refresh_market_snapshots, sync_instances, upsert_session_cost
|
||||
from .database import Base, engine, session_scope
|
||||
from .models import Machine, MachineProfile, Session as RuntimeSession
|
||||
from .seed import seed_defaults
|
||||
|
||||
|
||||
def run_worker() -> None:
|
||||
Base.metadata.create_all(bind=engine)
|
||||
last_market_refresh: datetime | None = None
|
||||
while True:
|
||||
with session_scope() as db:
|
||||
seed_defaults(db)
|
||||
profiles = db.scalars(select(MachineProfile)).all()
|
||||
sync_instances(db, {profile.region for profile in profiles})
|
||||
running_machines = db.scalars(select(Machine).where(Machine.state == "running")).all()
|
||||
for machine in running_machines:
|
||||
active_session = db.scalar(
|
||||
select(RuntimeSession).where(RuntimeSession.machine_id == machine.id, RuntimeSession.status == "active")
|
||||
)
|
||||
if not active_session:
|
||||
db.add(
|
||||
RuntimeSession(
|
||||
machine_id=machine.id,
|
||||
actor="system-import",
|
||||
workload_name=machine.profile_name or machine.instance_type,
|
||||
status="active",
|
||||
notes="Imported from existing running machine state",
|
||||
)
|
||||
)
|
||||
if last_market_refresh is None or datetime.now(timezone.utc) - last_market_refresh > timedelta(minutes=15):
|
||||
refresh_market_snapshots(db, {profile.region for profile in profiles}, profiles)
|
||||
last_market_refresh = datetime.now(timezone.utc)
|
||||
sessions = db.scalars(select(RuntimeSession).where(RuntimeSession.status == "active")).all()
|
||||
for session_row in sessions:
|
||||
if session_row.machine_id:
|
||||
machine = db.get(Machine, session_row.machine_id)
|
||||
if machine:
|
||||
upsert_session_cost(db, session_row, machine)
|
||||
time.sleep(60)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
run_worker()
|
||||
13
infrastructure/ops_control_plane/app/requirements.txt
Normal file
13
infrastructure/ops_control_plane/app/requirements.txt
Normal file
@@ -0,0 +1,13 @@
|
||||
fastapi==0.116.1
|
||||
uvicorn[standard]==0.35.0
|
||||
sqlalchemy==2.0.43
|
||||
psycopg[binary]==3.2.10
|
||||
jinja2==3.1.6
|
||||
python-multipart==0.0.20
|
||||
itsdangerous==2.2.0
|
||||
passlib[bcrypt]==1.7.4
|
||||
boto3==1.40.35
|
||||
httpx==0.28.1
|
||||
typer==0.16.1
|
||||
python-dateutil==2.9.0.post0
|
||||
|
||||
58
infrastructure/ops_control_plane/docker-compose.yml
Normal file
58
infrastructure/ops_control_plane/docker-compose.yml
Normal file
@@ -0,0 +1,58 @@
|
||||
services:
|
||||
ops-db:
|
||||
image: postgres:16-alpine
|
||||
container_name: desineuron-ops-db
|
||||
environment:
|
||||
POSTGRES_DB: ${OPS_DB_NAME}
|
||||
POSTGRES_USER: ${OPS_DB_USER}
|
||||
POSTGRES_PASSWORD: ${OPS_DB_PASSWORD}
|
||||
ports:
|
||||
- "5435:5432"
|
||||
volumes:
|
||||
- ./data/postgres:/var/lib/postgresql/data
|
||||
healthcheck:
|
||||
test: ["CMD-SHELL", "pg_isready -U ${OPS_DB_USER} -d ${OPS_DB_NAME}"]
|
||||
interval: 10s
|
||||
timeout: 5s
|
||||
retries: 10
|
||||
restart: unless-stopped
|
||||
|
||||
ops-api:
|
||||
build:
|
||||
context: ./app
|
||||
container_name: desineuron-ops-api
|
||||
command: ["python", "-m", "ops_control_plane.main"]
|
||||
env_file:
|
||||
- .env
|
||||
environment:
|
||||
OPS_ROLE: api
|
||||
ports:
|
||||
- "18765:8080"
|
||||
depends_on:
|
||||
ops-db:
|
||||
condition: service_healthy
|
||||
volumes:
|
||||
- ./exports:/app/exports
|
||||
- ./logs:/app/logs
|
||||
- ./state:/app/state
|
||||
- ${OPS_MODEL_LIBRARY_HOST_PATH:-/mnt/ServerStorage/ai-models/models}:/model-library:ro
|
||||
restart: unless-stopped
|
||||
|
||||
ops-worker:
|
||||
build:
|
||||
context: ./app
|
||||
container_name: desineuron-ops-worker
|
||||
command: ["python", "-m", "ops_control_plane.worker"]
|
||||
env_file:
|
||||
- .env
|
||||
environment:
|
||||
OPS_ROLE: worker
|
||||
depends_on:
|
||||
ops-db:
|
||||
condition: service_healthy
|
||||
volumes:
|
||||
- ./exports:/app/exports
|
||||
- ./logs:/app/logs
|
||||
- ./state:/app/state
|
||||
- ${OPS_MODEL_LIBRARY_HOST_PATH:-/mnt/ServerStorage/ai-models/models}:/model-library:ro
|
||||
restart: unless-stopped
|
||||
@@ -0,0 +1,9 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
sudo mkdir -p /etc/caddy/managed
|
||||
sudo install -m 0755 /tmp/manage_desineuron_routes.py /usr/local/bin/manage_desineuron_routes.py
|
||||
sudo install -m 0644 /tmp/desineuron_ingress_Caddyfile /etc/caddy/Caddyfile
|
||||
sudo python3 /usr/local/bin/manage_desineuron_routes.py list >/dev/null
|
||||
sudo caddy validate --config /etc/caddy/Caddyfile
|
||||
sudo systemctl reload caddy
|
||||
@@ -0,0 +1,52 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
TARGET_ROOT=/opt/desineuron-ops-control-plane
|
||||
SERVICE_FILE=/etc/systemd/system/desineuron-ops-control-plane.service
|
||||
|
||||
sudo mkdir -p "$TARGET_ROOT"
|
||||
sudo mkdir -p "$TARGET_ROOT/data/postgres" "$TARGET_ROOT/exports" "$TARGET_ROOT/logs" "$TARGET_ROOT/state"
|
||||
sudo rsync -a \
|
||||
--exclude '.env' \
|
||||
--exclude 'data/' \
|
||||
--exclude 'exports/' \
|
||||
--exclude 'logs/' \
|
||||
--exclude 'state/' \
|
||||
/tmp/desineuron_ops_control_plane/ "$TARGET_ROOT/"
|
||||
sudo chown -R "$USER:$USER" "$TARGET_ROOT"
|
||||
|
||||
if [[ ! -f "$TARGET_ROOT/.env" ]]; then
|
||||
cp "$TARGET_ROOT/.env.example" "$TARGET_ROOT/.env"
|
||||
fi
|
||||
chmod 600 "$TARGET_ROOT/.env"
|
||||
|
||||
if [[ ! -f "$TARGET_ROOT/state/desineuron-l4-node.pem" ]]; then
|
||||
echo "Missing $TARGET_ROOT/state/desineuron-l4-node.pem" >&2
|
||||
exit 1
|
||||
fi
|
||||
chmod 600 "$TARGET_ROOT/state/desineuron-l4-node.pem"
|
||||
|
||||
mkdir -p "$TARGET_ROOT/data/postgres" "$TARGET_ROOT/exports" "$TARGET_ROOT/logs" "$TARGET_ROOT/state"
|
||||
sudo chown -R 999:999 "$TARGET_ROOT/data/postgres" || true
|
||||
|
||||
sudo tee "$SERVICE_FILE" >/dev/null <<EOF
|
||||
[Unit]
|
||||
Description=Desineuron Ops Control Plane
|
||||
After=docker.service network-online.target
|
||||
Requires=docker.service
|
||||
|
||||
[Service]
|
||||
Type=oneshot
|
||||
RemainAfterExit=yes
|
||||
WorkingDirectory=$TARGET_ROOT
|
||||
ExecStart=/usr/bin/docker compose up -d --build
|
||||
ExecStop=/usr/bin/docker compose down
|
||||
TimeoutStartSec=0
|
||||
|
||||
[Install]
|
||||
WantedBy=multi-user.target
|
||||
EOF
|
||||
|
||||
sudo systemctl daemon-reload
|
||||
sudo systemctl enable --now desineuron-ops-control-plane.service
|
||||
sudo systemctl --no-pager --full status desineuron-ops-control-plane.service
|
||||
@@ -0,0 +1,37 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
TARGET=/etc/nginx/sites-available/desineuron-ops-control-plane.conf
|
||||
LINK=/etc/nginx/sites-enabled/desineuron-ops-control-plane.conf
|
||||
|
||||
sudo tee "$TARGET" >/dev/null <<'EOF'
|
||||
server {
|
||||
listen 443 ssl http2;
|
||||
listen [::]:443 ssl http2;
|
||||
server_name ops.desineuron.in;
|
||||
|
||||
ssl_certificate /etc/letsencrypt/live/desineuron-infra/fullchain.pem;
|
||||
ssl_certificate_key /etc/letsencrypt/live/desineuron-infra/privkey.pem;
|
||||
ssl_protocols TLSv1.2 TLSv1.3;
|
||||
add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
|
||||
|
||||
client_max_body_size 128m;
|
||||
|
||||
location / {
|
||||
proxy_pass http://127.0.0.1:18765;
|
||||
proxy_http_version 1.1;
|
||||
proxy_set_header Host $host;
|
||||
proxy_set_header X-Real-IP $remote_addr;
|
||||
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
|
||||
proxy_set_header X-Forwarded-Proto $scheme;
|
||||
proxy_set_header Upgrade $http_upgrade;
|
||||
proxy_set_header Connection "upgrade";
|
||||
proxy_read_timeout 3600;
|
||||
proxy_send_timeout 3600;
|
||||
}
|
||||
}
|
||||
EOF
|
||||
|
||||
sudo ln -sf "$TARGET" "$LINK"
|
||||
sudo nginx -t
|
||||
sudo systemctl reload nginx
|
||||
76
infrastructure/ops_control_plane/manage_desineuron_routes.py
Normal file
76
infrastructure/ops_control_plane/manage_desineuron_routes.py
Normal file
@@ -0,0 +1,76 @@
|
||||
#!/usr/bin/env python3
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
STATE_FILE = Path("/etc/caddy/managed/desineuron-routes.json")
|
||||
SNIPPET_FILE = Path("/etc/caddy/managed/desineuron-routes.caddy")
|
||||
|
||||
|
||||
def load_routes() -> dict[str, dict]:
|
||||
if STATE_FILE.exists():
|
||||
return json.loads(STATE_FILE.read_text(encoding="utf-8"))
|
||||
return {}
|
||||
|
||||
|
||||
def save_routes(routes: dict[str, dict]) -> None:
|
||||
STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
|
||||
STATE_FILE.write_text(json.dumps(routes, indent=2), encoding="utf-8")
|
||||
|
||||
|
||||
def render_routes(routes: dict[str, dict]) -> None:
|
||||
lines: list[str] = []
|
||||
for hostname, route in sorted(routes.items()):
|
||||
lines.extend(
|
||||
[
|
||||
f"{hostname} {{",
|
||||
"\tlog {",
|
||||
"\t\toutput file /var/log/caddy/access.log",
|
||||
"\t\tformat json",
|
||||
"\t}",
|
||||
f"\treverse_proxy {route['scheme']}://{route['target_host']}:{route['target_port']} {{",
|
||||
"\t\theader_up Host {host}",
|
||||
"\t\theader_up X-Forwarded-Host {host}",
|
||||
"\t\theader_up X-Forwarded-Proto {scheme}",
|
||||
"\t\theader_up X-Forwarded-For {remote_host}",
|
||||
"\t}",
|
||||
"}",
|
||||
"",
|
||||
]
|
||||
)
|
||||
SNIPPET_FILE.write_text("\n".join(lines).rstrip() + "\n", encoding="utf-8")
|
||||
|
||||
|
||||
def main() -> int:
|
||||
if len(sys.argv) < 2:
|
||||
print("usage: manage_desineuron_routes.py <upsert|delete|list> [payload|hostname]")
|
||||
return 1
|
||||
command = sys.argv[1]
|
||||
routes = load_routes()
|
||||
if command == "upsert":
|
||||
payload = json.loads(sys.argv[2])
|
||||
routes[payload["hostname"]] = payload
|
||||
save_routes(routes)
|
||||
render_routes(routes)
|
||||
print(json.dumps({"status": "ok", "action": "upsert", "hostname": payload["hostname"]}))
|
||||
return 0
|
||||
if command == "delete":
|
||||
hostname = sys.argv[2]
|
||||
routes.pop(hostname, None)
|
||||
save_routes(routes)
|
||||
render_routes(routes)
|
||||
print(json.dumps({"status": "ok", "action": "delete", "hostname": hostname}))
|
||||
return 0
|
||||
if command == "list":
|
||||
print(json.dumps(routes, indent=2))
|
||||
return 0
|
||||
print(f"unknown command: {command}")
|
||||
return 1
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
|
||||
|
||||
Reference in New Issue
Block a user