Project_Velocity/infrastructure/desineuron_ingress/TEAM_HANDOFF_2026-04-08.md
2026-04-13 00:51:39 +05:30

Desineuron Stable Ingress Handoff

Date: 2026-04-08

Chapters

  1. Outcome
  2. Final Architecture
  3. AWS Resources
  4. Linux Origin State
  5. Migration Changes Applied
  6. Validation Results
  7. ComfyUI Recovery and GPU Route
  8. Files and Config Artifacts
  9. Dynamic Home IP Sync
  10. Operational Commands
  11. Future Service Mapping Runbook
  12. Security Notes
  13. Remaining Improvement Ideas
  14. Rollback
  15. Team Summary
  16. Current Status Snapshot - 2026-04-12
  17. Linux Ops Control Plane
  18. Velocity Stable API Runbook

Outcome

The Cloudflare Tunnel dependency for the six public desineuron.in services has been replaced with a self-hosted AWS ingress layer:

  • Public edge: AWS EC2 t4g.micro
  • Stable public IP: 98.87.120.120
  • TLS termination: Caddy on the ingress node
  • Private backend relay: rathole
  • Origin: Linux box at 192.168.1.4
  • DNS: Cloudflare, DNS only

Public hostnames now route through AWS instead of Cloudflare Tunnel. The first six are the migrated services; comfy and ops were added on the same ingress afterwards:

  • office.desineuron.in
  • git.desineuron.in
  • cloud.desineuron.in
  • projects.desineuron.in
  • talk.desineuron.in
  • vpn.desineuron.in
  • comfy.desineuron.in (ingress route created for AWS GPU ComfyUI)
  • ops.desineuron.in (private operator control surface on the Linux box)

Final Architecture

Internet
  -> Cloudflare DNS
  -> 98.87.120.120
  -> EC2 ingress: desineuron-ingress-01
     -> Caddy :443
     -> rathole server (control on 2333, local relay on 127.0.0.1:8443)
     -> Linux origin tunnel client
        -> Linux nginx :443
        -> per-host upstream routing
           -> Gitea
           -> Nextcloud
           -> Taiga
           -> OnlyOffice
           -> NetBird
  -> comfy.desineuron.in
     -> EC2 ingress Caddy
     -> private proxy to AWS GPU box `172.31.46.190:8188`
     -> ComfyUI endpoints on systemd-managed GPU service

AWS Resources

  • Instance name: desineuron-ingress-01
  • Instance ID: i-094df09acafb72494
  • Type: t4g.micro
  • Region: us-east-1
  • Subnet: subnet-03d684ed15f327151
  • VPC: vpc-081d2397920aad268
  • Root disk: 20 GB gp3
  • Elastic IP: 98.87.120.120
  • IAM role: desineuron-ingress-role
  • Instance profile: desineuron-ingress-profile
  • Security group: sg-0721b8b48e12c531d

Current GPU worker:

  • Instance ID: i-0e4eab5fe67cf9abe
  • Type: g6.12xlarge
  • Region: us-east-1
  • Private IP: 172.31.46.190
  • Current public IP: 100.31.64.121
  • Launch time: 2026-04-11T06:14:04Z

Open ingress ports:

  • 80/tcp from internet
  • 443/tcp from internet
  • 22/tcp restricted to the current home public IP and auto-synced from the Linux origin
  • 2333/tcp from internet for rathole control and data relay

GPU node security posture for ComfyUI:

  • public 8118/tcp removed
  • public 8188/tcp removed
  • 8188/tcp now allowed only from ingress security group sg-0721b8b48e12c531d
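
The posture above can be audited from the rule set itself. The following is a minimal sketch, assuming the rules are supplied in the `IpPermissions` shape that EC2 `describe-security-groups` returns; the function name `world_open_ports` and the sample rule are illustrative and not part of the repo.

```python
from __future__ import annotations

from typing import Any


def world_open_ports(ip_permissions: list[dict[str, Any]], ports: set[int]) -> set[int]:
    """Return the subset of `ports` that any rule opens to 0.0.0.0/0 or ::/0."""
    exposed: set[int] = set()
    for perm in ip_permissions:
        open_v4 = any(r.get("CidrIp") == "0.0.0.0/0" for r in perm.get("IpRanges", []))
        open_v6 = any(r.get("CidrIpv6") == "::/0" for r in perm.get("Ipv6Ranges", []))
        if not (open_v4 or open_v6):
            continue  # rule is scoped to a CIDR or a security group, not the internet
        protocol = perm.get("IpProtocol")
        if protocol == "-1":
            exposed |= ports  # "all traffic" rule covers every port
        elif protocol == "tcp":
            low, high = perm["FromPort"], perm["ToPort"]
            exposed |= {p for p in ports if low <= p <= high}
    return exposed


# Illustrative rule: 8188/tcp allowed only from the ingress security group.
EXPECTED_GPU_RULES = [
    {
        "IpProtocol": "tcp",
        "FromPort": 8188,
        "ToPort": 8188,
        "IpRanges": [],
        "Ipv6Ranges": [],
        "UserIdGroupPairs": [{"GroupId": "sg-0721b8b48e12c531d"}],
    }
]

if __name__ == "__main__":
    print(world_open_ports(EXPECTED_GPU_RULES, {8118, 8188}))
```

An empty result means neither 8118 nor 8188 is reachable from the internet under the supplied rules.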

Linux Origin State

Services exposed to local nginx:

  • git.desineuron.in -> 127.0.0.1:3000 (gitea)
  • cloud.desineuron.in -> 127.0.0.1:11000 (nextcloud_app)
  • talk.desineuron.in -> 127.0.0.1:11000 (nextcloud_app, Talk-focused hostname)
  • projects.desineuron.in -> 127.0.0.1:9100 (taiga-gateway)
  • office.desineuron.in -> 127.0.0.1:9980 (nextcloud_onlyoffice)
  • vpn.desineuron.in -> 127.0.0.1:8080 / 127.0.0.1:8081 (netbird)
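
The mapping above can be checked on the Linux origin with a plain TCP probe. This is a rough sketch, assuming it runs on the origin itself; the names `UPSTREAMS`, `is_listening`, and `dead_upstreams` are hypothetical helpers, and a successful connect only shows that something is listening, not that the application is healthy.

```python
from __future__ import annotations

import socket

# Hostname -> local upstream listeners, copied from the list above.
UPSTREAMS = {
    "git.desineuron.in": [("127.0.0.1", 3000)],
    "cloud.desineuron.in": [("127.0.0.1", 11000)],
    "talk.desineuron.in": [("127.0.0.1", 11000)],
    "projects.desineuron.in": [("127.0.0.1", 9100)],
    "office.desineuron.in": [("127.0.0.1", 9980)],
    "vpn.desineuron.in": [("127.0.0.1", 8080), ("127.0.0.1", 8081)],
}


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def dead_upstreams(upstreams=UPSTREAMS, probe=is_listening) -> dict:
    """Return {hostname: [unreachable (host, port) targets]} for failing upstreams."""
    dead = {}
    for hostname, targets in upstreams.items():
        missing = [target for target in targets if not probe(*target)]
        if missing:
            dead[hostname] = missing
    return dead


if __name__ == "__main__":
    print(dead_upstreams() or "all upstreams are listening")
```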

Tunnel state:

  • rathole-client.service active on Linux
  • rathole-server.service active on AWS
  • cloudflared inactive on Linux

Migration Changes Applied

Cloudflare

Old CNAME tunnel records were removed for the six public hostnames.

New records were created:

  • Type: A
  • Value: 98.87.120.120
  • Proxy status: DNS only
  • TTL: 300
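
To confirm the new A records are in effect, each hostname should resolve to the Elastic IP and nothing else. A minimal sketch, assuming the local resolver reflects current Cloudflare state (cached answers can lag by up to the TTL); the function names are illustrative.

```python
from __future__ import annotations

import socket

INGRESS_IP = "98.87.120.120"
HOSTNAMES = [
    "office.desineuron.in",
    "git.desineuron.in",
    "cloud.desineuron.in",
    "projects.desineuron.in",
    "talk.desineuron.in",
    "vpn.desineuron.in",
]


def resolve_ipv4(hostname: str) -> set:
    """Return the set of IPv4 addresses the system resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM)
    return {info[4][0] for info in infos}


def misrouted(hostnames, resolver=resolve_ipv4, expected: str = INGRESS_IP) -> dict:
    """Return {hostname: resolved addresses} for names that do not point only at `expected`."""
    bad = {}
    for hostname in hostnames:
        addresses = resolver(hostname)
        if addresses != {expected}:
            bad[hostname] = addresses
    return bad


if __name__ == "__main__":
    print(misrouted(HOSTNAMES) or "all records point at the ingress")
```

A hostname that still resolves to a Cloudflare proxy address would show up here, which usually means the record was left in proxied mode rather than DNS only.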

AWS Ingress

Installed and configured:

  • Caddy
  • rathole
  • amazon-ssm-agent
  • Linux-driven SSH allowlist sync for the ingress node

TLS:

  • Existing valid certificate/key pair from the Linux origin was copied to the ingress node.
  • Caddy now terminates HTTPS at the edge.

Linux Origin

nginx was already routing by hostname and remains the origin router.

Nextcloud was adjusted so talk.desineuron.in no longer canonicalizes to cloud.desineuron.in:

  • removed overwritehost pin
  • added talk.desineuron.in to trusted domains
  • restarted nextcloud_app

Validation Results

Public hostname checks through the new ingress:

  • office.desineuron.in -> 200 /welcome/
  • git.desineuron.in -> 200
  • cloud.desineuron.in -> 200 /login
  • projects.desineuron.in -> 200
  • talk.desineuron.in -> 200 /login on talk.desineuron.in
  • vpn.desineuron.in -> 200
  • ops.desineuron.in/login -> 200
  • comfy.desineuron.in -> 200

Important note:

  • talk.desineuron.in now stays on the talk hostname.
  • It is still backed by the same Nextcloud origin and presents the Nextcloud login flow, which is expected given the current Linux-side app layout.

ComfyUI Recovery and GPU Route

Root cause of the earlier 502:

  • ingress route and TLS were correct
  • the GPU spot node had lost the actual /opt/dlami/nvme/ComfyUI app tree
  • nothing was listening on 172.31.46.190:8188

Permanent fix applied:

  • restored /opt/dlami/nvme/ComfyUI from upstream source control
  • installed ComfyUI Python requirements on the GPU node
  • created systemd unit comfyui.service
  • enabled comfyui.service at boot with automatic restart
  • kept comfy.desineuron.in mapped through ingress Caddy
  • removed direct public access to 8118 and 8188
  • allowed 8188 only from ingress security group

Current live path:

  • https://comfy.desineuron.in -> ingress 98.87.120.120 -> Caddy reverse proxy -> GPU private IP 172.31.46.190:8188 -> comfyui.service

Current public result:

  • comfy.desineuron.in currently returns 200 OK
  • ingress route is now managed dynamically instead of hardcoded to one GPU private IP

Current GPU service:

  • comfyui.service
  • app path: /opt/dlami/nvme/ComfyUI
  • log path: /var/log/comfyui/service.log
  • port: 8188/tcp

Current backend state on 2026-04-12:

  • comfyui.service is active
  • main.py is present under /opt/dlami/nvme/ComfyUI
  • the process is listening on 0.0.0.0:8188
  • the public ingress path is healthy again

Auto-healing fix applied:

  • ComfyUI systemd service now runs an ExecStartPre recovery script at /usr/local/bin/desineuron-ensure-comfyui.sh
  • that script reclones/repairs /opt/dlami/nvme/ComfyUI if the tree is missing or damaged
  • Linux now runs desineuron-comfy-route-sync.timer
  • the timer updates the managed Caddy route for comfy.desineuron.in to the current private IP of the AWS instance tagged DesineuronRole=comfyui
  • this protects the public route from GPU instance IP drift without manual Caddy edits
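
The recovery step can be pictured as follows. The real script is /usr/local/bin/desineuron-ensure-comfyui.sh and its contents are not reproduced in this document, so this Python sketch only illustrates the idea. It assumes that "missing or damaged" means main.py or the .git directory is absent, and that the upstream is the public ComfyUI GitHub repository; both are assumptions.

```python
from __future__ import annotations

import subprocess
import time
from pathlib import Path

APP_DIR = Path("/opt/dlami/nvme/ComfyUI")
# Assumption: the handoff only says "upstream source control".
UPSTREAM_URL = "https://github.com/comfyanonymous/ComfyUI.git"


def tree_is_healthy(app_dir: Path) -> bool:
    """Treat the tree as healthy when main.py and the git metadata both exist."""
    return (app_dir / "main.py").is_file() and (app_dir / ".git").is_dir()


def ensure_tree(app_dir: Path = APP_DIR, upstream: str = UPSTREAM_URL, run=subprocess.run) -> bool:
    """Re-clone the app tree if needed. Returns True when a repair was performed."""
    if tree_is_healthy(app_dir):
        return False
    if app_dir.exists():
        # Move the damaged tree aside instead of deleting it, so it can be inspected later.
        app_dir.rename(app_dir.with_name(f"{app_dir.name}.damaged-{int(time.time())}"))
    run(["git", "clone", "--depth", "1", upstream, str(app_dir)], check=True)
    return True


if __name__ == "__main__":
    print("repaired" if ensure_tree() else "tree already healthy")
```

Because the check runs before every service start, a spot replacement that comes up without the app tree repairs itself on first boot. Python requirements would still need to be installed after a fresh clone, which this sketch does not cover.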

Expected endpoints:

  • https://comfy.desineuron.in/
  • https://comfy.desineuron.in/prompt
  • https://comfy.desineuron.in/history/{prompt_id}
  • https://comfy.desineuron.in/queue
  • https://comfy.desineuron.in/upload/image

Files and Config Artifacts

Infrastructure artifacts in repo:

Linux origin files touched:

  • /etc/nginx/sites-enabled/desineuron.conf
  • /mnt/ServerStorage/docker_apps/nextcloud/.env
  • /mnt/ServerStorage/docker_apps/nextcloud/data/config/config.php
  • /mnt/ServerStorage/docker_apps/nextcloud/data/config/reverse-proxy.config.php

Backups created on Linux:

  • /mnt/ServerStorage/docker_apps/nextcloud/.env.pre_ingress_backup_2026-04-08
  • /mnt/ServerStorage/docker_apps/nextcloud/data/config/reverse-proxy.config.php.pre_ingress_backup_2026-04-08

Dynamic Home IP Sync

Purpose:

  • Keep ingress 22/tcp restricted to the current Airtel public IP even when the ISP changes it
  • Prevent future SSH fallback outages caused by stale home-IP security-group rules, which otherwise need a manual fix

Design:

  • Linux origin runs desineuron-ingress-home-ip-sync.timer
  • Timer fires on boot and every 5 minutes
  • Service resolves the current home public IP via https://api.ipify.org
  • Service updates only the ingress security group sg-0721b8b48e12c531d
  • Only the SSH fallback rule is mutated
  • rathole is no longer dependent on the Airtel IP because 2333/tcp remains open on the ingress
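
The design above reduces to a small diff between the SSH CIDRs currently on the security group and the one that should be there. The sketch below is not the contents of sync_ingress_home_ip.py, which this document does not show. It assumes a boto3 EC2 client is passed in as `ec2`, and the function names are hypothetical.

```python
from __future__ import annotations

import ipaddress
import urllib.request

INGRESS_SG = "sg-0721b8b48e12c531d"


def current_public_ip(url: str = "https://api.ipify.org") -> str:
    """Fetch the home public IP as plain text."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode().strip()


def plan_ssh_rule_update(current_cidrs: set, public_ip: str) -> tuple:
    """Return (cidrs_to_revoke, cidrs_to_authorize) for the 22/tcp rule."""
    wanted = f"{ipaddress.IPv4Address(public_ip)}/32"  # raises on a malformed address
    return current_cidrs - {wanted}, {wanted} - current_cidrs


def apply_plan(ec2, to_revoke: set, to_authorize: set, group_id: str = INGRESS_SG) -> None:
    """Apply the plan with a boto3 EC2 client; only the 22/tcp rule is touched."""
    # Authorize first so there is no window in which SSH is closed to everyone.
    for cidr in sorted(to_authorize):
        ec2.authorize_security_group_ingress(
            GroupId=group_id, IpProtocol="tcp", FromPort=22, ToPort=22, CidrIp=cidr
        )
    for cidr in sorted(to_revoke):
        ec2.revoke_security_group_ingress(
            GroupId=group_id, IpProtocol="tcp", FromPort=22, ToPort=22, CidrIp=cidr
        )
```

When the IP has not changed, both sets are empty and no AWS call is made, which keeps the five-minute timer cheap.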

Installed Linux paths:

  • /usr/local/bin/sync_ingress_home_ip.py
  • /etc/systemd/system/desineuron-ingress-home-ip-sync.service
  • /etc/systemd/system/desineuron-ingress-home-ip-sync.timer
  • /etc/desineuron-ingress-home-ip-sync.env
  • /opt/desineuron-ingress-ip-sync/.venv
  • /var/lib/desineuron-ingress-ip-sync/current_ip.txt

Current state:

  • Timer: enabled and active
  • Last recorded home public IP: 223.185.28.89
  • Ingress SSH rule CIDR: 223.185.28.89/32

Dynamic Comfy Route Sync

Purpose:

  • keep comfy.desineuron.in mapped to the correct AWS GPU private IP even if the GPU instance public/private IP changes
  • remove the need to hand-edit /etc/caddy/Caddyfile for ComfyUI moves

Design:

  • Linux runs desineuron-comfy-route-sync.timer
  • timer fires on boot and every 2 minutes
  • service looks for the newest running EC2 instance tagged DesineuronRole=comfyui
  • service reads its current private IP
  • service connects to the ingress node and updates the managed Caddy route with /usr/local/bin/manage_desineuron_routes.py
  • Caddy is validated and reloaded only after a successful route update
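
The "newest running instance" selection can be sketched as a pure function over an EC2 `describe_instances` response. This is an illustration, not the code in sync_comfy_route.py; the sample response, its instance IDs, and the function name are made up for the example.

```python
from __future__ import annotations

from datetime import datetime, timezone
from typing import Optional

# With boto3, the response could be fetched with filters such as:
#   ec2.describe_instances(Filters=[
#       {"Name": "tag:DesineuronRole", "Values": ["comfyui"]},
#       {"Name": "instance-state-name", "Values": ["running"]},
#   ])


def newest_private_ip(response: dict) -> Optional[str]:
    """Return the private IP of the most recently launched running instance, if any."""
    candidates = [
        instance
        for reservation in response.get("Reservations", [])
        for instance in reservation.get("Instances", [])
        if instance.get("State", {}).get("Name") == "running"
        and instance.get("PrivateIpAddress")
    ]
    if not candidates:
        return None  # leave the existing route untouched rather than point it nowhere
    return max(candidates, key=lambda i: i["LaunchTime"])["PrivateIpAddress"]


# Hypothetical response: an older worker and the current one.
SAMPLE = {
    "Reservations": [
        {"Instances": [{
            "InstanceId": "i-older-example",
            "State": {"Name": "running"},
            "PrivateIpAddress": "172.31.40.10",
            "LaunchTime": datetime(2026, 4, 10, 9, 0, tzinfo=timezone.utc),
        }]},
        {"Instances": [{
            "InstanceId": "i-newer-example",
            "State": {"Name": "running"},
            "PrivateIpAddress": "172.31.46.190",
            "LaunchTime": datetime(2026, 4, 11, 6, 14, 4, tzinfo=timezone.utc),
        }]},
    ]
}

if __name__ == "__main__":
    print(newest_private_ip(SAMPLE))
```

Returning `None` when nothing is running lets the caller skip the Caddy update entirely, so a stopped GPU node never results in a half-written route.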

Installed Linux paths:

  • /usr/local/bin/sync_comfy_route.py
  • /etc/systemd/system/desineuron-comfy-route-sync.service
  • /etc/systemd/system/desineuron-comfy-route-sync.timer
  • /etc/desineuron-comfy-route-sync.env
  • /opt/desineuron-comfy-route-sync/.venv
  • /var/lib/desineuron-comfy-route-sync/current_target.txt

Current state:

  • Timer: enabled and active
  • Current synced target: 172.31.46.190
  • Current target instance tag: DesineuronRole=comfyui

Operational Commands

Check AWS ingress status:

aws ec2 describe-instances --instance-ids i-094df09acafb72494 --region us-east-1
aws ec2 describe-addresses --allocation-ids eipalloc-0d54fc0f827450e7b --region us-east-1

Check ingress services:

aws ssm send-command --region us-east-1 --instance-ids i-094df09acafb72494 --document-name AWS-RunShellScript --parameters commands="sudo systemctl status caddy rathole-server --no-pager"

Check GPU ComfyUI service:

aws ssm send-command --region us-east-1 --instance-ids i-0e4eab5fe67cf9abe --document-name AWS-RunShellScript --parameters commands="sudo systemctl status comfyui --no-pager","ss -ltnp | grep 8188 || true","tail -n 40 /var/log/comfyui/service.log || true"

Check Linux origin services:

ssh -i "$env:USERPROFILE\.ssh\id_ed25519_desineuron_lan" desineuron-node-01@192.168.1.4 "echo '***' | sudo -S systemctl status rathole-client nginx"
ssh -i "$env:USERPROFILE\.ssh\id_ed25519_desineuron_lan" desineuron-node-01@192.168.1.4 "echo '***' | sudo -S systemctl status desineuron-ingress-home-ip-sync.service desineuron-ingress-home-ip-sync.timer"
ssh -i "$env:USERPROFILE\.ssh\id_ed25519_desineuron_lan" desineuron-node-01@192.168.1.4 "echo '***' | sudo -S journalctl -u desineuron-ingress-home-ip-sync -n 50 --no-pager"
ssh -i "$env:USERPROFILE\.ssh\id_ed25519_desineuron_lan" desineuron-node-01@192.168.1.4 "echo '***' | sudo -S systemctl status desineuron-ops-control-plane.service --no-pager"
ssh -i "$env:USERPROFILE\.ssh\id_ed25519_desineuron_lan" desineuron-node-01@192.168.1.4 "echo '***' | sudo -S docker compose -f /opt/desineuron-ops-control-plane/docker-compose.yml ps"
ssh -i "$env:USERPROFILE\.ssh\id_ed25519_desineuron_lan" desineuron-node-01@192.168.1.4 "echo '***' | sudo -S systemctl status desineuron-comfy-route-sync.service desineuron-comfy-route-sync.timer --no-pager"

Public endpoint validation:

curl.exe -I https://office.desineuron.in
curl.exe -I https://git.desineuron.in
curl.exe -I https://cloud.desineuron.in
curl.exe -I https://projects.desineuron.in
curl.exe -I https://talk.desineuron.in
curl.exe -I https://vpn.desineuron.in
curl.exe -I https://comfy.desineuron.in
curl.exe -I https://ops.desineuron.in/login
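
The same check can be scripted so it reports only failures. A minimal sketch using the Python standard library; the function names are illustrative. Unlike `curl -I`, `urllib` follows redirects, so the status reported is that of the final page.

```python
from __future__ import annotations

import urllib.error
import urllib.request

URLS = [
    "https://office.desineuron.in",
    "https://git.desineuron.in",
    "https://cloud.desineuron.in",
    "https://projects.desineuron.in",
    "https://talk.desineuron.in",
    "https://vpn.desineuron.in",
    "https://comfy.desineuron.in",
    "https://ops.desineuron.in/login",
]


def head_status(url: str, timeout: float = 10.0) -> int:
    """Send a HEAD request and return the final HTTP status code."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses still carry a status worth reporting


def failing(urls, fetch=head_status) -> dict:
    """Return {url: status} for every URL whose status is not 200."""
    result = {}
    for url in urls:
        status = fetch(url)
        if status != 200:
            result[url] = status
    return result


if __name__ == "__main__":
    print(failing(URLS) or "all endpoints returned 200")
```

Connection-level failures (DNS, TLS, timeouts) are left to raise, since they point at the ingress itself rather than at one backend.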

Future Service Mapping Runbook

Use this pattern for any future public service behind the stable ingress layer.

  1. Decide the backend location.
    • Linux origin behind rathole
    • AWS GPU/private EC2 node
    • another private backend later
  2. Decide whether the service should terminate TLS at ingress.
    • default: yes
    • Caddy on ingress should own the public hostname and certificate
  3. Create the DNS record in Cloudflare.
    • type: A
    • value: 98.87.120.120
    • proxy mode: DNS only
    • low TTL during rollout
  4. Add the ingress route in Caddyfile. Patterns:
    • Linux-origin service:
      • proxy to https://127.0.0.1:8443
      • preserve Host
    • private AWS backend service:
      • proxy to http://<private-ip>:<port> or https://<private-ip>:<port>
  5. Restrict backend network access.
    • never leave backend app ports open to 0.0.0.0/0 unless absolutely necessary
    • prefer security-group rule allowing traffic only from ingress security group
    • for home-origin services, keep them private behind rathole
  6. Reload ingress.

ssh -i "F:\Workin In Progress\DESINEURON\GITLAB\Project_Velocity\desineuron-l4-node.pem" ec2-user@98.87.120.120 "sudo caddy validate --config /etc/caddy/Caddyfile && sudo systemctl reload caddy"

  7. Validate TLS and app response.
    • check certificate subject matches hostname
    • check curl -I https://<host>
    • check login page or health endpoint
    • check browser behavior
  8. If the backend is stateful, create a persistent service.
    • prefer systemd
    • enable restart on failure
    • log to a stable path
    • record service name, working directory, ports, and restart policy in this handoff doc
  9. Update team docs immediately.
    • hostname
    • DNS record type
    • ingress route target
    • backend service owner
    • service name
    • health check command
    • rollback step
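
The two route patterns in the runbook can be rendered as Caddyfile site blocks. This is a sketch only: the live /etc/caddy/Caddyfile and manage_desineuron_routes.py are not reproduced in this document, `render_site` is a hypothetical helper, and the certificate directive and any upstream TLS verification settings are deliberately left out.

```python
from __future__ import annotations


def render_site(hostname: str, upstream: str, preserve_host: bool = False) -> str:
    """Render one Caddyfile site block that reverse-proxies hostname to upstream."""
    lines = [f"{hostname} {{"]
    if preserve_host:
        # Pass the public hostname through so the origin nginx can route by Host.
        lines.append(f"\treverse_proxy {upstream} {{")
        lines.append("\t\theader_up Host {host}")
        lines.append("\t}")
    else:
        lines.append(f"\treverse_proxy {upstream}")
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Linux-origin service reached through the local rathole relay.
    print(render_site("git.desineuron.in", "https://127.0.0.1:8443", preserve_host=True))
    # Private AWS backend reached directly over the VPC.
    print(render_site("comfy.desineuron.in", "http://172.31.46.190:8188"))
```

Whatever tool writes the route, running `caddy validate` before the reload (as in the reload step) keeps a malformed block from taking down the other hostnames.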

Security Notes

  • Public traffic terminates only at the AWS edge.
  • The Linux box no longer needs Cloudflare Tunnel for these six routes.
  • The Linux origin is reached through an outbound tunnel, not by directly exposing the home machine to the public for app traffic.
  • SSH on the Linux box remains key-only.
  • The AWS ingress IAM role is limited to SSM core.
  • ComfyUI is no longer directly exposed on the GPU public IP; only the ingress layer can reach 8188.
  • Ingress 22/tcp stays restricted and is now auto-synced from the Linux origin.
  • Ingress 2333/tcp is intentionally open so rathole survives Airtel IP changes without operator action.

Remaining Improvement Ideas

  • Move certificate issuance and renewal to the AWS edge permanently instead of copying the existing certificate from the Linux origin.
  • Clean up nginx warnings about duplicated protocol options.
  • Separate talk.desineuron.in more fully from general Nextcloud if a distinct Talk-only UX is desired.
  • Add authentication (for example Basic Auth) or an allowlist in front of comfy.desineuron.in before broader team rollout; internet scanners started hitting the route immediately after it went live.
  • Add monitoring and alerting on:
    • caddy
    • rathole-server
    • rathole-client
    • public HTTPS checks
  • Add infrastructure-as-code for the EC2 ingress node if this should be reproducible by the team without manual AWS CLI steps.

Rollback

If rollback is needed:

  1. Recreate Cloudflare CNAME/tunnel routes or repoint the DNS records away from 98.87.120.120.
  2. Stop caddy and rathole-server on AWS.
  3. Stop rathole-client on Linux.
  4. Restore Nextcloud files from:
    • .env.pre_ingress_backup_2026-04-08
    • reverse-proxy.config.php.pre_ingress_backup_2026-04-08
  5. Restart nextcloud_app and nginx.

Team Summary

This migration is complete.

Cloudflare Tunnel is no longer the production path for the six public service hostnames. The stable production ingress is now the AWS t4g.micro node with Elastic IP 98.87.120.120, and the Linux machine remains the private origin behind rathole.

Additional mapped route:

  • comfy.desineuron.in now terminates on the same stable ingress and forwards to the GPU node's private address 172.31.46.190:8188.
  • No further DNS change is needed for ComfyUI.
  • The backend is supervised by systemd and currently healthy.
  • The route is now auto-synced from Linux based on the tagged AWS ComfyUI worker, so future IP changes do not require manual ingress edits.
  • The team can use:
    • https://comfy.desineuron.in/prompt
    • https://comfy.desineuron.in/history/{prompt_id}
    • https://comfy.desineuron.in/queue
    • https://comfy.desineuron.in/upload/image

Current Status Snapshot - 2026-04-12

Live public service state:

  • office.desineuron.in -> 200
  • git.desineuron.in -> 200
  • cloud.desineuron.in -> 200
  • projects.desineuron.in -> 200
  • talk.desineuron.in -> 200
  • vpn.desineuron.in -> 200
  • ops.desineuron.in/login -> 200
  • comfy.desineuron.in -> 200

Linux-origin health:

  • nginx.service -> active
  • rathole-client.service -> active
  • desineuron-ingress-home-ip-sync.timer -> active
  • desineuron-ops-control-plane.service -> active

Linux ops stack containers:

  • desineuron-ops-api -> Up
  • desineuron-ops-db -> Up (healthy)
  • desineuron-ops-worker -> Up

Ingress health:

  • caddy -> active
  • rathole-server -> active
  • comfy.desineuron.in Caddy route is present in /etc/caddy/Caddyfile

GPU ComfyUI state:

  • comfyui.service -> active
  • main.py present under /opt/dlami/nvme/ComfyUI
  • listener present on 0.0.0.0:8188
  • public ingress path is healthy

Comfy auto-heal state:

  • desineuron-comfy-route-sync.timer -> active
  • synced target file -> /var/lib/desineuron-comfy-route-sync/current_target.txt
  • current synced target -> 172.31.46.190

Linux Ops Control Plane

The Linux box now also hosts the private AWS control surface for the team.

Public operator URL:

  • https://ops.desineuron.in/login

Purpose:

  • launch/stop/terminate AWS machines
  • view spot/on-demand market data
  • track runtime and estimated cost
  • ingest model directories from the Linux box into S3
  • hydrate models from S3 to AWS GPU nodes
  • manage ingress routes through the t4g.micro
  • export session/cost CSVs

Linux runtime paths:

  • stack root: /opt/desineuron-ops-control-plane
  • env file: /opt/desineuron-ops-control-plane/.env
  • exports: /opt/desineuron-ops-control-plane/exports
  • state: /opt/desineuron-ops-control-plane/state

Canonical S3 bucket:

  • desineuron-ops-control-plane-819079556187-us-east-1

Model library source on Linux:

  • /mnt/ServerStorage/ai-models/models

Current operator accounts:

  • sagnik@desineuron.in
  • sayan@desineuron.in
  • sourik@desineuron.in

Reference docs:

Velocity Stable API Runbook

Problem:

  • the Velocity backend was still exposed through an ephemeral AWS instance IP
  • frontend code was hardcoded to https://54.152.236.10
  • EC2 stop/start changed the backend public IP and broke the app
  • the stable ingress already existed, but Velocity had never been mapped through it

Correct production pattern:

  • public API hostname: api.desineuron.in
  • public edge: ingress 98.87.120.120
  • ingress route target: current private IP of the EC2 instance tagged DesineuronRole=velocity-backend
  • Linux box runs the route-sync timer, just like the ComfyUI pattern
  • backend stays private and should only accept 8000/8001 from ingress security group sg-0721b8b48e12c531d

Repo artifacts added for this pattern:

Frontend changes expected by this pattern:

  • app/src/lib/api.ts now points production traffic to https://api.desineuron.in
  • app/vite.config.ts uses VITE_BACKEND_PROXY_TARGET for local dev override
  • Vite proxy errors are no longer tied to one stale EC2 IP

Backend bootstrap note:

  • remote_bootstrap_20260401.sh now includes these origins in CORS_ORIGINS:
    • https://api.desineuron.in
    • https://54.152.236.10
    • https://18.212.122.77

Operator steps still required outside the repo:

  1. Tag the backend EC2 instance:

    • key: DesineuronRole
    • value: velocity-backend
  2. Add Cloudflare DNS:

    • record: api.desineuron.in
    • type: A
    • value: 98.87.120.120
    • proxy: DNS only
  3. Bootstrap the first ingress route once:

    • target host: current backend private IP
    • target port: 8001 unless the backend listener is changed
  4. Lock down backend security group:

    • revoke public 0.0.0.0/0 access to the backend app port
    • allow backend app port only from ingress security group sg-0721b8b48e12c531d
  5. Update backend runtime env and restart:

    • add https://api.desineuron.in to CORS_ORIGINS
    • restart velocity-backend.service
  6. Install the Linux route sync timer:

    • copy infrastructure/desineuron_ingress/*velocity* to Linux temporary staging
    • run install_linux_velocity_route_sync.sh

Expected result after the 6 steps:

  • frontend reaches https://api.desineuron.in
  • ingress forwards to the current backend private IP
  • backend public IP changes stop mattering
  • Linux auto-heals route drift every 2 minutes and on boot
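
For step 5, the CORS_ORIGINS update can be made idempotent so repeated bootstrap runs do not duplicate entries. A minimal sketch, assuming CORS_ORIGINS is a comma-separated string; the backend's actual format is not stated in this document, and `add_origin` is a hypothetical helper.

```python
from __future__ import annotations


def add_origin(cors_origins: str, origin: str) -> str:
    """Append origin to a comma-separated list unless it is already present."""
    existing = [item.strip() for item in cors_origins.split(",") if item.strip()]
    if origin not in existing:
        existing.append(origin)
    return ",".join(existing)


if __name__ == "__main__":
    current = "https://54.152.236.10,https://18.212.122.77"
    print(add_origin(current, "https://api.desineuron.in"))
```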