ComfyUI Setup Truth

Date: 2026-04-15
Purpose: Capture the current ComfyUI operating truth, team access path, model hydration path, and the exact repo and infra artifacts that matter.

1. Current Production Truth

ComfyUI is exposed through the stable ingress, not through the GPU box public IP.

Current live path:

  • public hostname: https://comfy.desineuron.in
  • ingress elastic IP: 98.87.120.120
  • ingress target: 172.31.46.190:8188
  • GPU instance: i-0e4eab5fe67cf9abe
  • GPU type: g6.12xlarge

As of 2026-04-15, the public path is healthy again and returns 200 OK.
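A quick way to confirm the public path is a single HTTP probe against the stable hostname. The sketch below is illustrative only: the helper names `classify_status` and `probe` are hypothetical, and it assumes the ingress answers an unauthenticated GET on the root path.

```python
import urllib.error
import urllib.request

PUBLIC_URL = "https://comfy.desineuron.in"  # public hostname from the live path above


def classify_status(status: int) -> str:
    """Map an HTTP status from the public hostname to a coarse verdict."""
    if status == 200:
        return "healthy"
    if status == 502:
        # Ingress answered, but it could not reach its upstream target.
        return "ingress up, upstream unreachable"
    return f"unexpected status {status}"


def probe(url: str = PUBLIC_URL, timeout: float = 10.0) -> str:
    """Fetch the URL once and return a verdict string (performs a network call)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify_status(resp.status)
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx; the status code is still available.
        return classify_status(err.code)
    except (urllib.error.URLError, TimeoutError) as err:
        return f"unreachable: {err}"
```

Calling `probe()` from any workstation exercises the same path the team uses, which makes it a reasonable first check before logging in to either box.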

2. What Failed

The recent outage was not an ingress TLS problem. The GPU box had lost its ComfyUI working tree, and the recovery script that comfyui.service expects was also missing.

Observed failure state:

  • /opt/dlami/nvme/ComfyUI missing
  • /usr/local/bin/desineuron-ensure-comfyui.sh missing
  • comfyui.service entered restart loops
  • ingress returned 502

3. What Was Restored

The GPU node was restored to the intended service shape:

  • comfyui.service is active
  • /opt/dlami/nvme/ComfyUI exists again
  • ComfyUI is listening on 0.0.0.0:8188
  • ingress can reach 172.31.46.190:8188
  • public https://comfy.desineuron.in returns 200

4. Team Usability Contract

All team members should use the stable hostname only:

  • https://comfy.desineuron.in/
  • https://comfy.desineuron.in/prompt
  • https://comfy.desineuron.in/history/{prompt_id}
  • https://comfy.desineuron.in/queue
  • https://comfy.desineuron.in/upload/image

Do not use the GPU public IP directly.

Do not expose 8188 publicly again.
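As a rough illustration of the contract above, a client can build every URL from the stable hostname and never hard-code an IP or port. This sketch assumes the ingress forwards ComfyUI's stock HTTP API unchanged and requires no extra authentication, neither of which this note confirms. The helper names are hypothetical, and `workflow` is expected to be an API-format graph exported from ComfyUI.

```python
import json
import urllib.request
import uuid

BASE_URL = "https://comfy.desineuron.in"  # stable hostname only, never the GPU IP


def endpoint(path: str, base: str = BASE_URL) -> str:
    """Join a route such as 'prompt' or 'history/<id>' onto the stable hostname."""
    return f"{base.rstrip('/')}/{path.lstrip('/')}"


def build_prompt_request(workflow: dict, client_id: str) -> urllib.request.Request:
    """Build the POST /prompt request without sending it."""
    body = json.dumps({"prompt": workflow, "client_id": client_id}).encode("utf-8")
    return urllib.request.Request(
        endpoint("prompt"),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def queue_prompt(workflow: dict, timeout: float = 30.0) -> str:
    """Submit a workflow and return the prompt_id from the JSON response."""
    req = build_prompt_request(workflow, client_id=str(uuid.uuid4()))
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["prompt_id"]


def get_history(prompt_id: str, timeout: float = 30.0) -> dict:
    """Fetch /history/{prompt_id} for a previously queued prompt."""
    url = endpoint(f"history/{prompt_id}")
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

A typical call sequence would be `pid = queue_prompt(workflow)` followed by polling `get_history(pid)` until the entry for `pid` appears.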

5. Storage Truth

Model and staging work should land on NVMe, not on the root volume.

Canonical GPU storage roots:

  • ComfyUI app: /opt/dlami/nvme/ComfyUI
  • HF cache: /opt/dlami/nvme/hf
  • model staging: /opt/dlami/nvme/model-staging
  • model logs: /opt/dlami/nvme/model-logs
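Scripts that write models can guard against landing on the root volume by checking the target path before doing any work. The following is a minimal sketch with hypothetical helper names. It is a pure path comparison, so it does not resolve symlinks or verify that NVMe is actually mounted.

```python
import posixpath
from pathlib import PurePosixPath

NVME_ROOT = PurePosixPath("/opt/dlami/nvme")

# Canonical roots as listed above.
STORAGE_ROOTS = {
    "comfyui_app": NVME_ROOT / "ComfyUI",
    "hf_cache": NVME_ROOT / "hf",
    "model_staging": NVME_ROOT / "model-staging",
    "model_logs": NVME_ROOT / "model-logs",
}


def is_on_nvme(path: str) -> bool:
    """True if path is the NVMe root or lies beneath it (no filesystem access)."""
    # normpath collapses '..' so '/opt/dlami/nvme/../x' is not accepted.
    p = PurePosixPath(posixpath.normpath(path))
    return p == NVME_ROOT or NVME_ROOT in p.parents


def staging_dir(model_name: str) -> PurePosixPath:
    """Staging directory for a named model under the model-staging root."""
    return STORAGE_ROOTS["model_staging"] / model_name
```

For example, a download script could refuse to start unless `is_on_nvme(str(staging_dir("Wan2.2-Animate-14B")))` is true.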

6. S3 Model Hydration Truth

Existing S3 bucket used for Project Velocity model storage:

  • s3://project-velocity/models/

Model prefixes already existed under this path before this pass, so it is the current working hydration bucket and prefix family.

Wan 2.2 target prefix:

  • s3://project-velocity/models/Wan2.2-Animate-14B/

7. Wan 2.2 Animate 14B Download Path

Tooling installed on the GPU box:

  • hf
  • huggingface_hub with hf_transfer
  • s5cmd

Download is staged to NVMe under:

  • /opt/dlami/nvme/model-staging/Wan2.2-Animate-14B

Support scripts created on the GPU node:

  • /usr/local/bin/desineuron-download-wan22.sh
  • /usr/local/bin/desineuron-sync-wan22-to-s3.sh

The intended flow is:

  1. download from Hugging Face to NVMe
  2. sync from NVMe to s3://project-velocity/models/Wan2.2-Animate-14B/
  3. use S3 as the hydration source for future GPU or Linux-side restoration workflows
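The first two steps of this flow can be sketched in Python as below. This is not the content of the two support scripts, which this note does not reproduce. The Hugging Face repo id is an assumption, the function names are hypothetical, and the s5cmd source form (trailing slash) should be checked against the installed s5cmd version on a small test prefix before relying on the resulting key layout.

```python
import subprocess

# ASSUMPTION: the Hugging Face repo id is not stated in this note.
REPO_ID = "Wan-AI/Wan2.2-Animate-14B"
STAGING_DIR = "/opt/dlami/nvme/model-staging/Wan2.2-Animate-14B"
S3_PREFIX = "s3://project-velocity/models/Wan2.2-Animate-14B/"


def download_to_nvme(repo_id: str = REPO_ID, local_dir: str = STAGING_DIR) -> str:
    """Step 1: pull the model snapshot from Hugging Face onto NVMe."""
    # Imported lazily so this module loads on machines without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


def s5cmd_sync_command(local_dir: str = STAGING_DIR,
                       s3_prefix: str = S3_PREFIX) -> list[str]:
    """Step 2: build the s5cmd invocation that mirrors staging to S3."""
    src = local_dir.rstrip("/") + "/"
    return ["s5cmd", "sync", src, s3_prefix]


def sync_to_s3() -> None:
    """Run the sync; raises CalledProcessError if s5cmd exits non-zero."""
    subprocess.run(s5cmd_sync_command(), check=True)
```

Keeping the command builder separate from the call that runs it makes it easy to print and review the exact s5cmd invocation before anything is uploaded.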

8. Current Wan State

The Wan 2.2 Animate 14B download was started on the GPU box and is writing into the NVMe staging directory.

This is a long-running asset download and should be treated as resumable model hydration work, not a short command.

9. Repo Artifacts That Matter

Relevant repo files:

  • [install_gpu_comfyui_service.sh](F:\Workin In Progress\DESINEURON\GITLAB\Project_Velocity\infrastructure\desineuron_ingress\install_gpu_comfyui_service.sh)
  • [sync_comfy_route.py](F:\Workin In Progress\DESINEURON\GITLAB\Project_Velocity\infrastructure\desineuron_ingress\sync_comfy_route.py)
  • [Caddyfile](F:\Workin In Progress\DESINEURON\GITLAB\Project_Velocity\infrastructure\desineuron_ingress\Caddyfile)
  • [Desineuron Stable Ingress Handoff.md](F:\Workin In Progress\DESINEURON\GITLAB\Project_Velocity\.Agent Context\Sprint 1\Desineuron Stable Ingress Handoff.md)

10. Operational Guidance

If Comfy breaks again, check in this order:

  1. public https://comfy.desineuron.in
  2. ingress managed route target
  3. GPU listener on 8188
  4. existence of /opt/dlami/nvme/ComfyUI
  5. existence of /usr/local/bin/desineuron-ensure-comfyui.sh
  6. comfyui.service journal
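The ordering above can be encoded as a small triage helper that stops at the first failing check. This is a sketch under assumptions: the helper names are hypothetical, the checks are meant to run on the GPU node itself, step 2 (ingress managed route target) is omitted because it has to be inspected on the ingress box, and `systemctl is-active` stands in for reading the journal.

```python
import os
import socket
import subprocess
import urllib.request
from typing import Callable, Optional

Check = tuple[str, Callable[[], bool]]


def first_failure(checks: list[Check]) -> Optional[str]:
    """Run checks in order; return the name of the first that fails, or None."""
    for name, check in checks:
        try:
            ok = check()
        except Exception:
            ok = False  # an exception (e.g. HTTP 502, refused socket) counts as failure
        if not ok:
            return name
    return None


def public_ok() -> bool:
    with urllib.request.urlopen("https://comfy.desineuron.in", timeout=10) as resp:
        return resp.status == 200


def port_open(host: str = "127.0.0.1", port: int = 8188) -> bool:
    with socket.create_connection((host, port), timeout=3.0):
        return True


def service_active(unit: str = "comfyui.service") -> bool:
    result = subprocess.run(["systemctl", "is-active", "--quiet", unit])
    return result.returncode == 0


def gpu_node_checks() -> list[Check]:
    """Checks in the documented order (step 2 is not covered here)."""
    ensure = "/usr/local/bin/desineuron-ensure-comfyui.sh"
    return [
        ("public https://comfy.desineuron.in", public_ok),
        ("GPU listener on 8188", port_open),
        ("/opt/dlami/nvme/ComfyUI exists", lambda: os.path.isdir("/opt/dlami/nvme/ComfyUI")),
        (f"{ensure} exists", lambda: os.path.isfile(ensure)),
        ("comfyui.service active", service_active),
    ]
```

Running `first_failure(gpu_node_checks())` returns the label of the first broken layer, or `None` when every check passes. If it points at the service, the comfyui.service journal is still the place to look for the cause.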

11. Bottom Line

ComfyUI is a stable-ingress service now, not a direct GPU-IP service. Team usage should go through the ingress hostname, model storage should go to NVMe first, and S3 should act as the hydration source of truth for large model recovery and replication.