Missed files (#19)

Co-authored-by: Sagnik <sagnik7896@gmail.com>
Reviewed-on: #19
This commit was merged in pull request #19.
2026-04-12 19:26:20 +05:30
parent 4645ff737b
commit e241ff800c
69 changed files with 4375 additions and 2469 deletions

View File

@@ -42,18 +42,4 @@ ops.desineuron.in {
}
}
comfy.desineuron.in {
    log {
        output file /var/log/caddy/access.log
        format json
    }
    reverse_proxy http://172.31.46.190:8188 {
        header_up Host {host}
        header_up X-Forwarded-Host {host}
        header_up X-Forwarded-Proto {scheme}
        header_up X-Forwarded-For {remote_host}
    }
}
import /etc/caddy/managed/*.caddy
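The hardcoded `comfy.desineuron.in` block in this hunk appears to be superseded by a generated file under `/etc/caddy/managed/`. The real format written by `manage_desineuron_routes.py` is not shown in this commit, so the following is only a sketch of what such a renderer could look like, assuming it mirrors the proxy headers of the hardcoded block. `render_caddy_route` is a hypothetical name.

```python
def render_caddy_route(hostname: str, target_host: str, target_port: int) -> str:
    """Render a Caddy site block that proxies hostname to target_host:target_port.

    Hypothetical helper: mirrors the header_up lines of the hardcoded block,
    not the actual output of manage_desineuron_routes.py.
    """
    return (
        f"{hostname} {{\n"
        f"\treverse_proxy http://{target_host}:{target_port} {{\n"
        # Caddy placeholders such as {host} are emitted literally.
        "\t\theader_up Host {host}\n"
        "\t\theader_up X-Forwarded-Host {host}\n"
        "\t\theader_up X-Forwarded-Proto {scheme}\n"
        "\t\theader_up X-Forwarded-For {remote_host}\n"
        "\t}\n"
        "}\n"
    )


snippet = render_caddy_route("comfy.desineuron.in", "172.31.46.190", 8188)
print(snippet)
```

A file with this content placed under `/etc/caddy/managed/` would be picked up by the `import /etc/caddy/managed/*.caddy` line on the next reload.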

View File

@@ -19,7 +19,7 @@ Date: 2026-04-08
13. Remaining Improvement Ideas
14. Rollback
15. Team Summary
16. Current Status Snapshot - 2026-04-11
16. Current Status Snapshot - 2026-04-12
17. Linux Ops Control Plane
### Outcome
@@ -87,7 +87,7 @@ Current GPU worker:
- Type: `g6.12xlarge`
- Region: `us-east-1`
- Private IP: `172.31.46.190`
- Current public IP: `18.208.176.121`
- Current public IP: `100.31.64.121`
- Launch time: `2026-04-11T06:14:04Z`
Open ingress ports:
@@ -168,7 +168,7 @@ Public hostname checks through the new ingress:
- `talk.desineuron.in` -> `200 /login` on `talk.desineuron.in`
- `vpn.desineuron.in` -> `200`
- `ops.desineuron.in/login` -> `200`
- `comfy.desineuron.in` -> `502`
- `comfy.desineuron.in` -> `200`
Important note:
@@ -203,9 +203,8 @@ Current live path:
Current public result:
- `comfy.desineuron.in` currently returns `502 Bad Gateway`
- ingress route is present and Caddy is healthy
- the current GPU backend is not yet listening on `172.31.46.190:8188`, so this is a backend readiness issue, not a DNS or edge-TLS issue
- `comfy.desineuron.in` currently returns `200 OK`
- ingress route is now managed dynamically instead of hardcoded to one GPU private IP
Current GPU service:
@@ -214,11 +213,20 @@ Current GPU service:
- log path: `/var/log/comfyui/service.log`
- port: `8188/tcp`
Current backend state on `2026-04-11`:
Current backend state on `2026-04-12`:
- `comfyui.service` is `activating`
- latest log shows ComfyUI startup and `Starting server`
- the process is still not binding `8188`, so ingress sees the backend as unavailable
- `comfyui.service` is `active`
- `main.py` is present under `/opt/dlami/nvme/ComfyUI`
- the process is listening on `0.0.0.0:8188`
- the public ingress path is healthy again
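A quick way to confirm the listener state described above is a plain TCP connect check. This is a minimal sketch and not part of the repo; `is_listening` is a hypothetical helper name, and it has to run from a host that can reach the GPU private IP (for example the ingress node).

```python
import socket


def is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


if __name__ == "__main__":
    # Private IP and port taken from this document.
    print(is_listening("172.31.46.190", 8188))
```

A `False` result here while Caddy is healthy corresponds to the earlier `502` situation: the route exists but nothing is bound to `8188`.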
Auto-healing fix applied:
- ComfyUI `systemd` service now runs an `ExecStartPre` recovery script at `/usr/local/bin/desineuron-ensure-comfyui.sh`
- that script reclones/repairs `/opt/dlami/nvme/ComfyUI` if the tree is missing or damaged
- Linux now runs `desineuron-comfy-route-sync.timer`
- the timer updates the managed Caddy route for `comfy.desineuron.in` to the current private IP of the AWS instance tagged `DesineuronRole=comfyui`
- this protects the public route from GPU instance IP drift without manual Caddy edits
Expected endpoints:
@@ -244,6 +252,10 @@ Infrastructure artifacts in repo:
- [desineuron-ingress-home-ip-sync.service](/F:/Workin%20In%20Progress/DESINEURON/GITLAB/Project_Velocity/infrastructure/desineuron_ingress/desineuron-ingress-home-ip-sync.service)
- [desineuron-ingress-home-ip-sync.timer](/F:/Workin%20In%20Progress/DESINEURON/GITLAB/Project_Velocity/infrastructure/desineuron_ingress/desineuron-ingress-home-ip-sync.timer)
- [install_linux_ingress_ip_sync.sh](/F:/Workin%20In%20Progress/DESINEURON/GITLAB/Project_Velocity/infrastructure/desineuron_ingress/install_linux_ingress_ip_sync.sh)
- [sync_comfy_route.py](/F:/Workin%20In%20Progress/DESINEURON/GITLAB/Project_Velocity/infrastructure/desineuron_ingress/sync_comfy_route.py)
- [desineuron-comfy-route-sync.service](/F:/Workin%20In%20Progress/DESINEURON/GITLAB/Project_Velocity/infrastructure/desineuron_ingress/desineuron-comfy-route-sync.service)
- [desineuron-comfy-route-sync.timer](/F:/Workin%20In%20Progress/DESINEURON/GITLAB/Project_Velocity/infrastructure/desineuron_ingress/desineuron-comfy-route-sync.timer)
- [install_linux_comfy_route_sync.sh](/F:/Workin%20In%20Progress/DESINEURON/GITLAB/Project_Velocity/infrastructure/desineuron_ingress/install_linux_comfy_route_sync.sh)
- [README.md](/F:/Workin%20In%20Progress/DESINEURON/GITLAB/Project_Velocity/infrastructure/ops_control_plane/README.md)
- [Desineuron Ops Control Plane Bibel.md](/F:/Workin%20In%20Progress/DESINEURON/GITLAB/Project_Velocity/.Agent%20Context/Bibels/Desineuron%20Ops%20Control%20Plane%20Bibel.md)
@@ -290,6 +302,37 @@ Current state:
- Last recorded home public IP: `223.185.28.89`
- Ingress SSH rule CIDR: `223.185.28.89/32`
### Dynamic Comfy Route Sync
Purpose:
- keep `comfy.desineuron.in` mapped to the correct AWS GPU private IP even if the GPU instance public/private IP changes
- remove the need to hand-edit `/etc/caddy/Caddyfile` for ComfyUI moves
Design:
- Linux runs `desineuron-comfy-route-sync.timer`
- timer fires on boot and every 2 minutes
- service looks for the newest running EC2 instance tagged `DesineuronRole=comfyui`
- service reads its current private IP
- service connects to the ingress node and updates the managed Caddy route with `/usr/local/bin/manage_desineuron_routes.py`
- Caddy is validated and reloaded only after a successful route update
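The "newest running instance" rule in the design above can be illustrated without AWS access. The sketch below works on plain dictionaries shaped like the entries returned by EC2 `describe_instances`; the instance IDs, IPs, and launch times are made-up sample data, and `pick_newest_running` is a hypothetical name (the real script filters by tag and state on the API side before sorting).

```python
from __future__ import annotations

from datetime import datetime, timezone


def pick_newest_running(instances: list[dict]) -> dict | None:
    """Return the most recently launched running instance, or None."""
    running = [i for i in instances if i.get("State", {}).get("Name") == "running"]
    if not running:
        return None
    return max(running, key=lambda i: i["LaunchTime"])


# Made-up sample data for illustration only.
sample = [
    {
        "InstanceId": "i-older",
        "State": {"Name": "running"},
        "LaunchTime": datetime(2026, 4, 10, 6, 0, tzinfo=timezone.utc),
        "PrivateIpAddress": "10.0.0.10",
    },
    {
        "InstanceId": "i-newer",
        "State": {"Name": "running"},
        "LaunchTime": datetime(2026, 4, 11, 6, 0, tzinfo=timezone.utc),
        "PrivateIpAddress": "10.0.0.20",
    },
    {
        "InstanceId": "i-stopped",
        "State": {"Name": "stopped"},
        "LaunchTime": datetime(2026, 4, 12, 6, 0, tzinfo=timezone.utc),
        "PrivateIpAddress": "10.0.0.30",
    },
]

target = pick_newest_running(sample)
print(target["PrivateIpAddress"] if target else "no target")
```

Picking the newest instance means that if two tagged workers overlap during a replacement, the route follows the replacement rather than the instance being retired.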
Installed Linux paths:
- `/usr/local/bin/sync_comfy_route.py`
- `/etc/systemd/system/desineuron-comfy-route-sync.service`
- `/etc/systemd/system/desineuron-comfy-route-sync.timer`
- `/etc/desineuron-comfy-route-sync.env`
- `/opt/desineuron-comfy-route-sync/.venv`
- `/var/lib/desineuron-comfy-route-sync/current_target.txt`
Current state:
- Timer: enabled and active
- Current synced target: `172.31.46.190`
- Current target instance tag: `DesineuronRole=comfyui`
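The state file is what keeps the two-minute timer cheap: when the recorded target already matches the instance's private IP, the sync is a no-op and no SSH connection to the ingress is made. A minimal sketch of that decision follows. The function names are hypothetical, and the write-then-rename step is a swapped-in hardening technique; the script in this commit writes the file directly.

```python
import os
import tempfile
from pathlib import Path


def needs_update(state_file: Path, private_ip: str) -> bool:
    """Return True when the recorded target differs from private_ip."""
    current = state_file.read_text(encoding="utf-8").strip() if state_file.exists() else ""
    return current != private_ip


def record_target(state_file: Path, private_ip: str) -> None:
    """Persist the synced target, replacing the file atomically."""
    state_file.parent.mkdir(parents=True, exist_ok=True)
    tmp = state_file.with_suffix(".tmp")
    tmp.write_text(private_ip, encoding="utf-8")
    os.replace(tmp, state_file)  # atomic rename on the same filesystem


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        state = Path(workdir) / "current_target.txt"
        print(needs_update(state, "172.31.46.190"))  # no state file yet
        record_target(state, "172.31.46.190")
        print(needs_update(state, "172.31.46.190"))  # target unchanged
```

The target should only be recorded after the route update on the ingress has succeeded, so that a failed update is retried on the next timer run.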
### Operational Commands
Check AWS ingress status:
@@ -319,6 +362,7 @@ ssh -i "$env:USERPROFILE\.ssh\id_ed25519_desineuron_lan" desineuron-node-01@192.
ssh -i "$env:USERPROFILE\.ssh\id_ed25519_desineuron_lan" desineuron-node-01@192.168.1.4 "echo '***' | sudo -S journalctl -u desineuron-ingress-home-ip-sync -n 50 --no-pager"
ssh -i "$env:USERPROFILE\.ssh\id_ed25519_desineuron_lan" desineuron-node-01@192.168.1.4 "echo '***' | sudo -S systemctl status desineuron-ops-control-plane.service --no-pager"
ssh -i "$env:USERPROFILE\.ssh\id_ed25519_desineuron_lan" desineuron-node-01@192.168.1.4 "echo '***' | sudo -S docker compose -f /opt/desineuron-ops-control-plane/docker-compose.yml ps"
ssh -i "$env:USERPROFILE\.ssh\id_ed25519_desineuron_lan" desineuron-node-01@192.168.1.4 "echo '***' | sudo -S systemctl status desineuron-comfy-route-sync.service desineuron-comfy-route-sync.timer --no-pager"
```
Public endpoint validation:
@@ -449,14 +493,15 @@ Additional mapped route:
- `comfy.desineuron.in` now terminates on the same stable ingress and forwards to the GPU node's private address `172.31.46.190:8188`.
- No further DNS change is needed for ComfyUI.
- The backend is supervised by `systemd`, but the current worker is not yet binding `8188`, so public access is currently degraded with `502`.
- The backend is supervised by `systemd` and currently healthy.
- The route is now auto-synced from Linux based on the tagged AWS ComfyUI worker, so future IP changes do not require manual ingress edits.
- The team can use:
- `https://comfy.desineuron.in/prompt`
- `https://comfy.desineuron.in/history/{prompt_id}`
- `https://comfy.desineuron.in/queue`
- `https://comfy.desineuron.in/upload/image`
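For team members scripting against these endpoints, a request can be built as sketched below. This assumes the stock ComfyUI HTTP API, where `POST /prompt` takes a JSON body of the form `{"prompt": <workflow>}`; that payload shape is not confirmed anywhere in this document, the workflow dictionary is a placeholder, and any authentication in front of the route is not covered. The function names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://comfy.desineuron.in"


def build_prompt_request(workflow: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST /prompt request for a workflow graph."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def history_url(prompt_id: str, base_url: str = BASE_URL) -> str:
    """Return the history URL for a previously queued prompt."""
    return f"{base_url}/history/{prompt_id}"


if __name__ == "__main__":
    # Placeholder workflow; a real one is exported from the ComfyUI editor.
    request = build_prompt_request({"1": {"class_type": "ExampleNode", "inputs": {}}})
    print(request.get_method(), request.full_url)
    # Sending is left out on purpose: urllib.request.urlopen(request)
```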
### Current Status Snapshot - 2026-04-11
### Current Status Snapshot - 2026-04-12
Live public service state:
@@ -467,7 +512,7 @@ Live public service state:
- `talk.desineuron.in` -> `200`
- `vpn.desineuron.in` -> `200`
- `ops.desineuron.in/login` -> `200`
- `comfy.desineuron.in` -> `502`
- `comfy.desineuron.in` -> `200`
Linux-origin health:
@@ -490,10 +535,16 @@ Ingress health:
GPU ComfyUI state:
- `comfyui.service` -> `activating`
- latest logs show ComfyUI startup sequence completing toward `Starting server`
- no active listener on `8188` yet
- ingress cannot connect to `172.31.46.190:8188`, which is why the public result is `502`
- `comfyui.service` -> `active`
- `main.py` present under `/opt/dlami/nvme/ComfyUI`
- listener present on `0.0.0.0:8188`
- public ingress path is healthy
Comfy auto-heal state:
- `desineuron-comfy-route-sync.timer` -> `active`
- synced target file -> `/var/lib/desineuron-comfy-route-sync/current_target.txt`
- current synced target -> `172.31.46.190`
### Linux Ops Control Plane

View File

@@ -0,0 +1,9 @@
[Unit]
Description=Sync comfy.desineuron.in managed route to current GPU private IP
After=network-online.target
Wants=network-online.target
[Service]
Type=oneshot
EnvironmentFile=/etc/desineuron-comfy-route-sync.env
ExecStart=/opt/desineuron-comfy-route-sync/.venv/bin/python /usr/local/bin/sync_comfy_route.py

View File

@@ -0,0 +1,10 @@
[Unit]
Description=Run comfy route sync on boot and every 2 minutes
[Timer]
OnBootSec=1min
OnUnitActiveSec=2min
Unit=desineuron-comfy-route-sync.service
[Install]
WantedBy=timers.target

View File

@@ -4,22 +4,41 @@ set -euo pipefail
COMFY_DIR="/opt/dlami/nvme/ComfyUI"
SERVICE_NAME="comfyui"
LOG_DIR="/var/log/comfyui"
ENSURE_SCRIPT="/usr/local/bin/desineuron-ensure-comfyui.sh"
if ! command -v git >/dev/null 2>&1; then
    sudo apt-get update
    sudo apt-get install -y git
fi
sudo tee "${ENSURE_SCRIPT}" >/dev/null <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
COMFY_DIR="/opt/dlami/nvme/ComfyUI"
sudo mkdir -p /opt/dlami/nvme
sudo chown -R ubuntu:ubuntu /opt/dlami/nvme
if [ ! -d "${COMFY_DIR}/.git" ]; then
    sudo mkdir -p /opt/dlami/nvme
    sudo chown -R ubuntu:ubuntu /opt/dlami/nvme
    rm -rf "${COMFY_DIR}"
    git clone https://github.com/comfyanonymous/ComfyUI.git "${COMFY_DIR}"
else
    git -C "${COMFY_DIR}" pull --ff-only
    git -C "${COMFY_DIR}" fetch --all --prune
    git -C "${COMFY_DIR}" reset --hard origin/master
fi
python3 -m pip install -r "${COMFY_DIR}/requirements.txt"
if [ ! -f "${COMFY_DIR}/main.py" ]; then
    echo "ComfyUI main.py missing after ensure step" >&2
    exit 1
fi
EOF
sudo chmod 0755 "${ENSURE_SCRIPT}"
sudo -u ubuntu "${ENSURE_SCRIPT}"
sudo mkdir -p "${LOG_DIR}"
sudo chown -R ubuntu:ubuntu "${LOG_DIR}"
@@ -36,6 +55,7 @@ Group=ubuntu
WorkingDirectory=/opt/dlami/nvme/ComfyUI
Environment=HOME=/home/ubuntu
Environment=PYTHONUNBUFFERED=1
ExecStartPre=/usr/local/bin/desineuron-ensure-comfyui.sh
ExecStart=/usr/bin/python3 /opt/dlami/nvme/ComfyUI/main.py --listen 0.0.0.0 --port 8188 --disable-auto-launch
Restart=always
RestartSec=5

View File

@@ -0,0 +1,33 @@
#!/usr/bin/env bash
set -euo pipefail
APP_ROOT=/opt/desineuron-comfy-route-sync
VENV_PATH="$APP_ROOT/.venv"
ENV_FILE=/etc/desineuron-comfy-route-sync.env
SCRIPT_PATH=/usr/local/bin/sync_comfy_route.py
SERVICE_FILE=/etc/systemd/system/desineuron-comfy-route-sync.service
TIMER_FILE=/etc/systemd/system/desineuron-comfy-route-sync.timer
sudo mkdir -p "$APP_ROOT" /var/lib/desineuron-comfy-route-sync
python3 -m venv "$VENV_PATH"
"$VENV_PATH/bin/pip" install --upgrade pip boto3
sudo install -m 0755 /tmp/desineuron_ingress/sync_comfy_route.py "$SCRIPT_PATH"
sudo install -m 0644 /tmp/desineuron_ingress/desineuron-comfy-route-sync.service "$SERVICE_FILE"
sudo install -m 0644 /tmp/desineuron_ingress/desineuron-comfy-route-sync.timer "$TIMER_FILE"
sudo tee "$ENV_FILE" >/dev/null <<EOF
OPS_ENV_FILE=/opt/desineuron-ops-control-plane/.env
COMFY_ROUTE_HOSTNAME=comfy.desineuron.in
COMFY_ROUTE_PORT=8188
COMFY_INSTANCE_TAG_KEY=DesineuronRole
COMFY_INSTANCE_TAG_VALUE=comfyui
COMFY_ROUTE_STATE_FILE=/var/lib/desineuron-comfy-route-sync/current_target.txt
INGRESS_SSH_KEY_PATH=/opt/desineuron-ops-control-plane/state/desineuron-l4-node.pem
EOF
sudo chmod 600 "$ENV_FILE"
sudo systemctl daemon-reload
sudo systemctl enable --now desineuron-comfy-route-sync.timer
sudo systemctl start desineuron-comfy-route-sync.service
sudo systemctl --no-pager --full status desineuron-comfy-route-sync.service desineuron-comfy-route-sync.timer

View File

@@ -0,0 +1,142 @@
#!/usr/bin/env python3
from __future__ import annotations

import json
import os
import subprocess
import sys
from pathlib import Path

import boto3


def load_env_file(path: Path) -> dict[str, str]:
    data: dict[str, str] = {}
    if not path.exists():
        return data
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        data[key.strip()] = value.strip()
    return data


def env(name: str, default: str = "") -> str:
    return os.environ.get(name, default)


def resolve_target_instance(ec2) -> dict | None:
    explicit_instance_id = env("COMFY_INSTANCE_ID")
    if explicit_instance_id:
        reservations = ec2.describe_instances(InstanceIds=[explicit_instance_id])["Reservations"]
        for reservation in reservations:
            for instance in reservation["Instances"]:
                if instance["State"]["Name"] == "running":
                    return instance
        return None

    tag_key = env("COMFY_INSTANCE_TAG_KEY", "DesineuronRole")
    tag_value = env("COMFY_INSTANCE_TAG_VALUE", "comfyui")
    filters = [
        {"Name": "instance-state-name", "Values": ["running"]},
        {"Name": f"tag:{tag_key}", "Values": [tag_value]},
    ]
    reservations = ec2.describe_instances(Filters=filters)["Reservations"]
    instances = [instance for reservation in reservations for instance in reservation["Instances"]]
    if not instances:
        return None
    instances.sort(key=lambda row: row["LaunchTime"], reverse=True)
    return instances[0]


def upsert_route(hostname: str, private_ip: str, port: int) -> subprocess.CompletedProcess[str]:
    ingress_host = env("INGRESS_SSH_HOST")
    ingress_user = env("INGRESS_SSH_USER", "ec2-user")
    ingress_port = env("INGRESS_SSH_PORT", "22")
    ingress_key = env("INGRESS_SSH_KEY_PATH")
    helper = env("INGRESS_ROUTE_HELPER", "/usr/local/bin/manage_desineuron_routes.py")
    payload = json.dumps(
        {
            "hostname": hostname,
            "scheme": "http",
            "target_host": private_ip,
            "target_port": port,
        }
    )
    command = (
        f"sudo {helper} upsert '{payload}'"
        " && sudo caddy validate --config /etc/caddy/Caddyfile"
        " && sudo systemctl reload caddy"
    )
    return subprocess.run(
        [
            "ssh",
            "-o",
            "StrictHostKeyChecking=no",
            "-o",
            "UserKnownHostsFile=/dev/null",
            "-i",
            ingress_key,
            "-p",
            ingress_port,
            f"{ingress_user}@{ingress_host}",
            command,
        ],
        capture_output=True,
        text=True,
        check=False,
    )


def main() -> int:
    ops_env = load_env_file(Path(env("OPS_ENV_FILE", "/opt/desineuron-ops-control-plane/.env")))
    for key in ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION"]:
        if key not in os.environ and key in ops_env:
            os.environ[key] = ops_env[key]
    os.environ.setdefault("AWS_DEFAULT_REGION", ops_env.get("OPS_DEFAULT_REGION", "us-east-1"))
    os.environ.setdefault("INGRESS_SSH_HOST", ops_env.get("OPS_INGRESS_SSH_HOST", ""))
    os.environ.setdefault("INGRESS_SSH_USER", ops_env.get("OPS_INGRESS_SSH_USER", "ec2-user"))
    os.environ.setdefault("INGRESS_SSH_PORT", ops_env.get("OPS_INGRESS_SSH_PORT", "22"))
    normalized_key_path = ops_env.get("OPS_SSH_KEY_PATH", "/opt/desineuron-ops-control-plane/state/desineuron-l4-node.pem")
    if normalized_key_path.startswith("/app/state/"):
        normalized_key_path = normalized_key_path.replace("/app/state/", "/opt/desineuron-ops-control-plane/state/")
    os.environ.setdefault("INGRESS_SSH_KEY_PATH", normalized_key_path)
    os.environ.setdefault("INGRESS_ROUTE_HELPER", ops_env.get("OPS_INGRESS_ROUTE_HELPER", "/usr/local/bin/manage_desineuron_routes.py"))

    region = os.environ["AWS_DEFAULT_REGION"]
    hostname = env("COMFY_ROUTE_HOSTNAME", "comfy.desineuron.in")
    port = int(env("COMFY_ROUTE_PORT", "8188"))
    state_file = Path(env("COMFY_ROUTE_STATE_FILE", "/var/lib/desineuron-comfy-route-sync/current_target.txt"))

    ec2 = boto3.client("ec2", region_name=region)
    instance = resolve_target_instance(ec2)
    if not instance:
        print("No running comfyui target instance found", file=sys.stderr)
        return 1

    private_ip = instance.get("PrivateIpAddress")
    if not private_ip:
        print("Target instance has no private IP", file=sys.stderr)
        return 1

    current = state_file.read_text(encoding="utf-8").strip() if state_file.exists() else ""
    if current == private_ip:
        print(json.dumps({"status": "noop", "hostname": hostname, "target_host": private_ip}))
        return 0

    result = upsert_route(hostname, private_ip, port)
    if result.returncode != 0:
        print(result.stdout)
        print(result.stderr, file=sys.stderr)
        return result.returncode

    state_file.parent.mkdir(parents=True, exist_ok=True)
    state_file.write_text(private_ip, encoding="utf-8")
    print(json.dumps({"status": "updated", "hostname": hostname, "target_host": private_ip}))
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
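Review note on `upsert_route`: the JSON payload is wrapped in literal single quotes before being sent over SSH, which would break if any value ever contained a single quote. One possible hardening, sketched below under the assumption that the helper keeps its `upsert <json>` calling convention, is to quote the arguments with `shlex.quote`. `build_remote_command` is a hypothetical name and is not part of this commit.

```python
import json
import shlex


def build_remote_command(helper: str, payload: dict) -> str:
    """Build the remote shell command with shell-safe quoting of arguments."""
    quoted_payload = shlex.quote(json.dumps(payload))
    return (
        f"sudo {shlex.quote(helper)} upsert {quoted_payload}"
        " && sudo caddy validate --config /etc/caddy/Caddyfile"
        " && sudo systemctl reload caddy"
    )


if __name__ == "__main__":
    print(
        build_remote_command(
            "/usr/local/bin/manage_desineuron_routes.py",
            {
                "hostname": "comfy.desineuron.in",
                "scheme": "http",
                "target_host": "172.31.46.190",
                "target_port": 8188,
            },
        )
    )
```

Separately, the `StrictHostKeyChecking=no` and `UserKnownHostsFile=/dev/null` options disable host key verification for the ingress connection; pinning the ingress host key would be worth considering in a follow-up.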