ComfyUI Setup Truth

Date: 2026-04-15
Purpose: Capture the current ComfyUI operating truth, team access path, model hydration path, and the exact repo and infra artifacts that matter.

1. Current Production Truth

ComfyUI is exposed through the stable ingress, not through the GPU box public IP.

Current live path:

public hostname: https://comfy.desineuron.in
ingress elastic IP: 98.87.120.120
ingress target: 172.31.46.190:8188
GPU instance: i-0e4eab5fe67cf9abe
GPU type: g6.12xlarge

As of 2026-04-15, the public path is healthy again and returns 200 OK.
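A minimal sketch of how the "returns 200 OK" claim could be re-checked from any workstation. The function names are illustrative and not part of the repo. The 502 interpretation mirrors the outage symptom recorded in this document. Any other status is only reported, not diagnosed.

```python
import urllib.error
import urllib.request

PUBLIC_URL = "https://comfy.desineuron.in"  # stable hostname from this document


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a GET, or 0 if no HTTP response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 502.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, TLS failure, or timeout.
        return 0


def describe(status: int) -> str:
    """Map a status code to a short operator-facing summary."""
    if status == 200:
        return "healthy"
    if status == 502:
        return "ingress answered, upstream ComfyUI unreachable"
    if status == 0:
        return "no HTTP response from hostname"
    return f"unexpected status {status}"
```

Calling `describe(http_status(PUBLIC_URL))` should yield `healthy` while the path is in the state described above.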

2. What Failed

The recent outage was not an ingress TLS problem. The GPU box had lost its ComfyUI working tree, and the recovery script that the systemd service expects was missing.

Observed failure state:

/opt/dlami/nvme/ComfyUI missing
/usr/local/bin/desineuron-ensure-comfyui.sh missing
comfyui.service entered restart loops
ingress returned 502

3. What Was Restored

The GPU node was restored to the intended service shape:

comfyui.service is active
/opt/dlami/nvme/ComfyUI exists again
ComfyUI is listening on 0.0.0.0:8188
ingress can reach 172.31.46.190:8188
public https://comfy.desineuron.in returns 200

4. Team Usability Contract

All team members should use the stable hostname only:

https://comfy.desineuron.in/
https://comfy.desineuron.in/prompt
https://comfy.desineuron.in/history/{prompt_id}
https://comfy.desineuron.in/queue
https://comfy.desineuron.in/upload/image

Do not use the GPU public IP directly.

Do not expose 8188 publicly again.
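A minimal client sketch for the endpoints above, using only the stable hostname. It assumes the ingress passes the stock ComfyUI HTTP API through unchanged: `POST /prompt` takes a JSON body with `prompt` and `client_id` and returns a `prompt_id`. Any authentication the ingress may require is not shown. The helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "https://comfy.desineuron.in"  # stable hostname; never the GPU IP


def endpoint(path: str, base_url: str = BASE_URL) -> str:
    """Join a route such as 'prompt' or '/queue' onto the stable hostname."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def build_prompt_request(workflow: dict, client_id: str) -> urllib.request.Request:
    """Build (but do not send) the POST /prompt request for an API-format workflow."""
    body = json.dumps({"prompt": workflow, "client_id": client_id}).encode("utf-8")
    return urllib.request.Request(
        endpoint("prompt"),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(workflow: dict, client_id: str) -> str:
    """Queue a workflow and return its prompt_id (assumes stock ComfyUI response)."""
    with urllib.request.urlopen(build_prompt_request(workflow, client_id), timeout=30) as resp:
        return json.load(resp)["prompt_id"]


def fetch_history(prompt_id: str) -> dict:
    """Fetch /history/{prompt_id}; the result stays empty until the job finishes."""
    with urllib.request.urlopen(endpoint(f"history/{prompt_id}"), timeout=30) as resp:
        return json.load(resp)
```

A typical use is `pid = submit(workflow, "my-client")` followed by polling `fetch_history(pid)` until it contains an entry for `pid`.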

5. Storage Truth

Model and staging work should land on NVMe, not on the root volume.

Canonical GPU storage roots:

ComfyUI app: /opt/dlami/nvme/ComfyUI
HF cache: /opt/dlami/nvme/hf
model staging: /opt/dlami/nvme/model-staging
model logs: /opt/dlami/nvme/model-logs
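A small sketch that could be run on the GPU box to confirm the storage roots exist and to see how much free space each has. The paths come from the list above. The helper names are illustrative, and the "under NVMe" check is purely lexical (it does not resolve symlinks or inspect mounts).

```python
import shutil
from pathlib import Path

NVME_ROOT = Path("/opt/dlami/nvme")

# Canonical roots as listed in this document.
STORAGE_ROOTS = {
    "ComfyUI app": NVME_ROOT / "ComfyUI",
    "HF cache": NVME_ROOT / "hf",
    "model staging": NVME_ROOT / "model-staging",
    "model logs": NVME_ROOT / "model-logs",
}


def is_under(path: Path, root: Path) -> bool:
    """True if path equals root or sits below it (lexical comparison only)."""
    try:
        path.relative_to(root)
        return True
    except ValueError:
        return False


def report(roots: dict) -> dict:
    """Map each label to (exists, free GiB on its filesystem, or None if missing)."""
    result = {}
    for label, path in roots.items():
        if path.is_dir():
            free_gib = shutil.disk_usage(path).free / 2**30
            result[label] = (True, round(free_gib, 1))
        else:
            result[label] = (False, None)
    return result
```

`report(STORAGE_ROOTS)` gives a quick view before starting a large download; `is_under(target, NVME_ROOT)` can guard a script against writing to the root volume by mistake.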

6. S3 Model Hydration Truth

Existing S3 bucket used for Project Velocity model storage:

s3://project-velocity/models/

Model prefixes already existed in this bucket before this pass, so it is the current working hydration bucket and prefix family.

Wan 2.2 target prefix:

s3://project-velocity/models/Wan2.2-Animate-14B/

7. Wan 2.2 Animate 14B Download Path

Tooling installed on the GPU box:

hf
huggingface_hub with hf_transfer
s5cmd

Download is staged to NVMe under:

/opt/dlami/nvme/model-staging/Wan2.2-Animate-14B

Support scripts created on the GPU node:

/usr/local/bin/desineuron-download-wan22.sh
/usr/local/bin/desineuron-sync-wan22-to-s3.sh

The intended flow is:

download from Hugging Face to NVMe
sync from NVMe to s3://project-velocity/models/Wan2.2-Animate-14B/
use S3 as the hydration source for future GPU or Linux-side restoration workflows
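The first two steps of the flow above can be sketched as two commands driven from Python. This is not the content of the `desineuron-*` scripts, which this document does not show. The Hugging Face repo id is an assumption, and the `s5cmd sync` source uses a wildcard so that the contents of the staging directory land directly under the target prefix; the resulting key layout should be confirmed on a small test prefix first.

```python
import subprocess

STAGING_DIR = "/opt/dlami/nvme/model-staging/Wan2.2-Animate-14B"
S3_PREFIX = "s3://project-velocity/models/Wan2.2-Animate-14B/"
HF_REPO_ID = "Wan-AI/Wan2.2-Animate-14B"  # ASSUMPTION: repo id is not stated in this document


def download_cmd(repo_id: str = HF_REPO_ID, local_dir: str = STAGING_DIR) -> list:
    """Step 1: Hugging Face -> NVMe staging, via the `hf` CLI."""
    return ["hf", "download", repo_id, "--local-dir", local_dir]


def sync_cmd(local_dir: str = STAGING_DIR, s3_prefix: str = S3_PREFIX) -> list:
    """Step 2: NVMe staging -> S3. s5cmd expands the wildcard itself."""
    return ["s5cmd", "sync", local_dir.rstrip("/") + "/*", s3_prefix]


def hydrate(run=subprocess.run) -> None:
    """Run download then sync, stopping at the first failing command."""
    for cmd in (download_cmd(), sync_cmd()):
        run(cmd, check=True)
```

`run` is injectable so the sequence can be inspected without executing anything. Because both tools skip work that is already complete, re-running `hydrate()` after an interruption is the expected recovery path.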

8. Current Wan State

The Wan 2.2 Animate 14B download was started on the GPU box and is writing into the NVMe staging directory.

This is a long-running asset download and should be treated as resumable model hydration work, not a short command.

9. Repo Artifacts That Matter

Relevant repo files:

[install_gpu_comfyui_service.sh](F:\Workin In Progress\DESINEURON\GITLAB\Project_Velocity\infrastructure\desineuron_ingress\install_gpu_comfyui_service.sh)
[sync_comfy_route.py](F:\Workin In Progress\DESINEURON\GITLAB\Project_Velocity\infrastructure\desineuron_ingress\sync_comfy_route.py)
[Caddyfile](F:\Workin In Progress\DESINEURON\GITLAB\Project_Velocity\infrastructure\desineuron_ingress\Caddyfile)
[Desineuron Stable Ingress Handoff.md](F:\Workin In Progress\DESINEURON\GITLAB\Project_Velocity.Agent Context\Sprint 1\Desineuron Stable Ingress Handoff.md)

10. Operational Guidance

If Comfy breaks again, check in this order:

public https://comfy.desineuron.in
ingress managed route target
GPU listener on 8188
existence of /opt/dlami/nvme/ComfyUI
existence of /usr/local/bin/desineuron-ensure-comfyui.sh
comfyui.service journal
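The ordered checklist above lends itself to a "first failure wins" helper. The sketch below automates only the three checks that can run locally on the GPU box (listener, working tree, ensure script). The public URL, the ingress route target, and the journal are left to the operator. Function names are illustrative.

```python
import os
import socket
from typing import Optional

COMFY_DIR = "/opt/dlami/nvme/ComfyUI"
ENSURE_SCRIPT = "/usr/local/bin/desineuron-ensure-comfyui.sh"


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def gpu_box_checks() -> list:
    """Checks to run on the GPU box, in the order given in this document."""
    return [
        ("GPU listener on 8188", lambda: port_open("127.0.0.1", 8188)),
        ("existence of " + COMFY_DIR, lambda: os.path.isdir(COMFY_DIR)),
        ("existence of " + ENSURE_SCRIPT, lambda: os.path.isfile(ENSURE_SCRIPT)),
    ]


def first_failure(checks) -> Optional[str]:
    """Return the name of the first failing check, or None if all pass."""
    for name, probe in checks:
        try:
            ok = bool(probe())
        except Exception:
            ok = False  # a probe that raises counts as a failure
        if not ok:
            return name
    return None
```

On the GPU box, `first_failure(gpu_box_checks())` returning `None` means the local service shape looks right and attention should move to the ingress route or the `comfyui.service` journal.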

11. Bottom Line

ComfyUI is a stable-ingress service now, not a direct GPU-IP service. Team usage should go through the ingress hostname, model storage should go to NVMe first, and S3 should act as the hydration source of truth for large model recovery and replication.
