# Animatrix Monolithic SRS - Wan 2.2 Flow Studio

Date: 2026-04-15

Authoring context: This document defines the first production-ready Animatrix system built on top of the existing Desineuron ingress, the current ComfyUI GPU service, and the Wan 2.2 model family.

## 1. Purpose

Animatrix is a focused product for guided character video generation. It is not a general-purpose node editor. It is a constrained, operator-safe application that exposes two production workflows behind one simple frontend:

1. Character Animation and Replacement using `Wan2.2-Animate-14B`
2. Audio-Driven Character Performance using `Wan2.2-S2V-14B`

The frontend interaction model is inspired by the simplicity and compositional feel of Google Flow, but the execution runtime is ComfyUI-backed and Desineuron-hosted.

The objective is to give users a minimal interface:

- prompt box
- ground-truth starting image upload
- optional reference images and pose sheet uploads
- optional audio upload
- simple mode selection
- one-click generation

while the backend handles:

- asset ingestion
- workflow selection
- parameter validation
- ComfyUI prompt orchestration
- queueing
- status tracking
- result persistence
- streaming-ready delivery

## 2. Executive Product Truth

Animatrix v1 must be built around the actual Wan 2.2 model split, not a blended assumption.

Capability mapping:

- `Wan2.2-Animate-14B` is for character animation and character replacement.
- `Wan2.2-S2V-14B` is for audio-driven video generation with dialogue, singing, and performance.
- `Wan2.2 Fun Inp` is the Wan family workflow for strict first-frame and last-frame control.
- `Wan2.2 Fun Control` is the Wan family workflow for stronger control-video inputs such as OpenPose, depth, canny, and trajectory control.

Therefore the first release must not falsely claim that one single model covers all of the following natively:

- character replacement
- motion transfer
- audio lip-sync
- exact first/last-frame constraints

It does not.

The correct v1 product line is:

- Workflow A: `Animate Studio` on `Wan2.2-Animate-14B`
- Workflow B: `Audio Performance Studio` on `Wan2.2-S2V-14B`

The correct v1.1 or v2 expansion is:

- Workflow C: `Start/End Frame Studio` on `Wan2.2 Fun Inp`
- Workflow D: `Pose/Trajectory Control Studio` on `Wan2.2 Fun Control`

This distinction is mandatory because it affects UI truthfulness, node graphs, validation rules, asset requirements, and customer expectations.

## 3. Source Truth and Rationale

This SRS is grounded in the following current sources:

- official Wan 2.2 GitHub repository: `https://github.com/Wan-Video/Wan2.2`
- official Wan 2.2 Animate model page: `https://huggingface.co/Wan-AI/Wan2.2-Animate-14B`
- official ComfyUI Wan 2.2 docs:
  - `https://docs.comfy.org/tutorials/video/wan/wan2_2`
  - `https://docs.comfy.org/tutorials/video/wan/wan2-2-animate`
  - `https://docs.comfy.org/tutorials/video/wan/wan2-2-s2v`
  - `https://docs.comfy.org/tutorials/video/wan/wan2-2-fun-inp`
  - `https://docs.comfy.org/tutorials/video/wan/wan2-2-fun-control`
- current Desineuron infrastructure truth:
  - [comfyui_setup_truth.md](F:\Workin In Progress\DESINEURON\GITLAB\Project_Velocity\.Agent Context\Sprint 1\comfyui_setup_truth.md)
  - [Desineuron Stable Ingress Handoff.md](F:\Workin In Progress\DESINEURON\GITLAB\Project_Velocity\.Agent Context\Sprint 1\Desineuron Stable Ingress Handoff.md)

Critical source-backed facts that drive the design:

- ComfyUI is already exposed safely through `https://comfy.desineuron.in`
- the GPU service already runs behind stable ingress
- `Wan2.2-Animate-14B` supports two operating modes in ComfyUI docs: `Mix` and `Move`
- `Wan2.2-S2V-14B` is the audio-driven workflow with image plus audio inputs
- ComfyUI’s official Animate docs require additional custom nodes for the full direct workflow
- exact start/end-frame control is documented under `Wan2.2 Fun Inp`, not Animate

## 4. Product Vision

Animatrix should behave like a focused video creation surface, not like a research sandbox.

The product promise is:

"Upload a hero frame, optionally attach references, pose guidance, or audio, write a prompt, and generate a directed character video without touching ComfyUI nodes."

The UI must feel lightweight, but the execution system behind it must be opinionated and rigid enough to be supportable.

That means:

- limited number of modes
- strict validation
- controlled presets
- reproducible workflow JSON
- consistent output formats
- no raw-node exposure in the customer-facing frontend

## 5. Scope

### 5.1 In Scope for v1

- one frontend
- one backend API
- two ComfyUI production workflows
- status and result tracking
- stable ingress compatibility
- persistent storage for uploads and outputs
- preview and download experience
- operator-oriented logging and troubleshooting
- support for team usage through the existing Desineuron architecture

### 5.2 Out of Scope for v1

- arbitrary node editing by end users
- live collaborative editing
- in-browser timeline editing
- multi-scene stitching
- automatic sound effects design
- full NLE replacement
- customer-facing batch farms
- fine-tuning or LoRA training

### 5.3 Deferred but Planned

- first/last-frame exact control as a third workflow using `Wan2.2 Fun Inp`
- stronger pose or trajectory control using `Wan2.2 Fun Control`
- style packs and prompt presets
- branded credits or quota system
- user libraries and reusable character packs

## 6. User Personas

### 6.1 Internal Creative Operator

This user understands creative direction but should not need to edit node graphs. They need:

- fast iteration
- predictable inputs
- reliable outputs
- access to previous runs

### 6.2 Sales Demo Operator

This user needs a polished experience that can be shown live. They need:

- simple UX
- low operator error
- dependable queue feedback
- visible result cards

### 6.3 Technical Media Designer

This user understands reference material quality and wants more control without dropping into raw ComfyUI. They need:

- reference images
- pose sheet upload
- clear mode distinctions
- optional advanced settings

## 7. Functional Overview

Animatrix v1 will contain one shell product with two generation modes.

### 7.1 Mode A: Animate Studio

Underlying engine:

- `Wan2.2-Animate-14B`

Primary purpose:

- animate a character from a source image using the motion and expression from a source video
- replace the subject in a video with a new character image

Sub-modes:

- `Move`
- `Mix`

User inputs:

- prompt
- ground-truth character image, required
- source motion video, required
- optional reference images
- optional pose sheet image set
- optional aspect preset
- optional duration target

### 7.2 Mode B: Audio Performance Studio

Underlying engine:

- `Wan2.2-S2V-14B`

Primary purpose:

- generate a character video from a static image and audio input
- support dialogue, singing, and audio-driven performance

User inputs:

- prompt
- ground-truth character image, required
- source audio, required
- optional reference images
- optional pose sheet image set
- optional full-body / half-body framing preset
- optional duration target inferred from audio length

## 8. Frontend Vision

The frontend must preserve the interaction language shown in the reference screenshots:

- one large prompt composer
- image chips at the top-left of the composer
- plus button for additional attachments
- compact right-aligned mode selector
- advanced settings revealed through a controlled panel, not always visible

The frontend should feel immediate, not enterprise-heavy.

### 8.1 Core Layout

Top-level zones:

1. Attachment rail
2. Prompt composer
3. Optional advanced drawer
4. Generate action and mode switch
5. Run history / output gallery below

### 8.2 Attachment Types

Attachment chips in v1:

- `Ground Truth`
- `Reference`
- `Pose Sheet`
- `Audio`
- `Motion Video`

Visibility rules:

- `Ground Truth` always available
- `Motion Video` visible only in Animate Studio
- `Audio` visible only in Audio Performance Studio
- `Pose Sheet` optional in both modes
- `Reference` optional in both modes

### 8.3 Frontend Controls

Base controls:

- prompt text area
- optional keyword helper line
- mode toggle: `Animate` / `Audio`
- output aspect toggle: `9:16`, `16:9`, later `1:1`
- quality profile: `Draft`, `Standard`, `High`
- generate button

Advanced controls:

- Animate sub-mode: `Move` / `Mix`
- target duration
- seed
- negative prompt
- extension segments
- background preservation flag
- relighting flag
- lip-sync intensity or audio adherence preset

### 8.4 UX Rules

- do not expose raw model names to standard users
- use user language like `Animate`, `Replace Character`, `Audio Performance`
- surface warnings before submission if required inputs are missing
- show asset previews as compact rounded chips
- keep advanced panel collapsed by default

## 9. Exact Capability Mapping by Workflow

### 9.1 Workflow A: Animate Studio

Supported in v1:

- character animation from image plus motion video
- character replacement from image plus source video
- prompt conditioning
- optional pose preprocessing
- iterative video extension

Not truly supported by Animate Studio itself:

- direct audio-driven lip sync
- strict start/end-frame guarantees

### 9.2 Workflow B: Audio Performance Studio

Supported in v1:

- image plus audio driven generation
- prompt-conditioned motion/environment
- dialogue and singing style use cases
- long-form generation by extension chunks

Not truly supported by S2V itself:

- guaranteed subject replacement from an existing motion video
- exact last-frame lock

### 9.3 Pose Sheet Truth

The user-requested pose sheet can be supported in two ways:

1. Soft support in v1
   - pose sheet stored as reference asset
   - backend uses it for prompt augmentation and optional preprocessing assistance
   - operator can map selected sheet frames to manual key pose hints
2. Hard support in later release
   - migrate pose guidance to a dedicated `Wan2.2 Fun Control` or equivalent control-video workflow

The v1 document must state this honestly. A static pose sheet is not the same as a control video. It helps guide generation but does not become full deterministic motion control without an additional preprocessing and control pipeline.

## 10. Ground Truth Asset Model

The user’s "ground truth" image is the canonical identity anchor.

In both workflows it must serve as:

- the primary subject identity reference
- the default starting visual state
- the basis for preview thumbnails

Rules:

- exactly one primary ground-truth image per run
- image must pass minimum size and aspect checks
- a clean background is preferred but not mandatory
- user may crop or center the character before submission

Optional extension:

- future support for multiple identity references per character pack

## 11. Workflow Architecture

### 11.1 System Shape

```text
Browser
  -> Animatrix frontend
    -> Animatrix API
      -> job store
      -> asset store
      -> workflow composer
      -> ComfyUI client
        -> https://comfy.desineuron.in
          -> GPU ComfyUI service
            -> Wan2.2 workflow execution
      -> result collector
      -> output persistence
      -> result CDN / static delivery
```

### 11.2 Architectural Rule

The frontend must never submit raw prompts directly to ComfyUI.

The backend must always mediate:

- asset upload
- workflow selection
- workflow JSON parameter binding
- run metadata persistence
- output tracking

This is required for observability, rate control, product safety, and sales-readiness.

## 12. Ingress and Deployment Compatibility

Animatrix must be designed around the current Desineuron ingress truth.

Current infrastructure constraints:

- ComfyUI is already live at `https://comfy.desineuron.in`
- ComfyUI runs behind AWS ingress and stable TLS
- GPU private IP is not a stable application contract
- Linux origin is currently `192.168.1.2`

### 12.1 Mandatory Integration Rule

Animatrix backend must integrate with ComfyUI through the stable hostname or through a controlled internal service abstraction that resolves to the same managed route.

Do not bind Animatrix to:

- the GPU public IP
- direct `8188` public traffic
- hardcoded current private IP

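The binding restrictions above could be enforced when the backend loads its configuration. The following is a minimal sketch, assuming the ComfyUI base URL is read from configuration as a string; the helper name `validate_comfy_base_url` is hypothetical and not part of any existing Desineuron code.

```python
import ipaddress
from urllib.parse import urlsplit


def validate_comfy_base_url(url: str) -> str:
    """Return the URL unchanged if it is a hostname-based route.

    Hypothetical guard: rejects IP literals and direct port 8188 access.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError("ComfyUI base URL must be an http(s) URL")
    host = parts.hostname or ""
    if not host:
        raise ValueError("ComfyUI base URL has no host")
    try:
        ipaddress.ip_address(host)
    except ValueError:
        pass  # not an IP literal, which is the desired case
    else:
        raise ValueError("ComfyUI base URL must not be a hardcoded IP address")
    if parts.port == 8188:
        raise ValueError("ComfyUI base URL must not target port 8188 directly")
    return url
```

With this guard, `https://comfy.desineuron.in` passes, while `http://192.168.1.2:8188` fails at startup rather than at the first job.
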
### 12.2 Recommended Host Layout

Recommended public routing:

- `animatrix.desineuron.in` -> frontend and public product shell
- `api.animatrix.desineuron.in` or `animatrix.desineuron.in/api` -> backend API
- `comfy.desineuron.in` -> internal execution dependency only, not user-facing

If separate subdomains are not created immediately, the fallback deployment pattern may mirror the current Velocity site pattern:

- frontend served from Linux origin through ingress
- backend served from Linux origin through ingress
- backend calls ComfyUI through `https://comfy.desineuron.in`

## 13. Runtime Components

### 13.1 Frontend Application

Responsibilities:

- render simplified generation interface
- manage uploads
- validate user fields before submit
- create job requests
- poll or subscribe to job progress
- render previews and outputs

Suggested stack:

- Next.js or Vite React app
- Tailwind or CSS modules
- upload components with image/audio/video preview

### 13.2 Animatrix Backend API

Responsibilities:

- receive upload metadata
- store files
- generate canonical run record
- choose workflow template
- bind node inputs
- submit prompt payload to ComfyUI
- track prompt ID and history
- collect generated outputs
- persist result artifacts

Suggested stack:

- FastAPI if aligned with existing Python-heavy operations
- or Node/TypeScript only if the team wants one frontend-backend language

Recommendation:

- use Python FastAPI for v1 if reusing current Desineuron operational style and image/media tooling

### 13.3 Workflow Composer

Responsibilities:

- keep frozen template JSON files in version control
- inject prompt text, model selections, size, length, and asset paths
- enforce mode-specific constraints

This component must be deterministic. It is not a prompt improviser.

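The deterministic binding step might look like the sketch below. It assumes templates are stored in ComfyUI's API-format JSON (a mapping of node ID to `class_type` and `inputs`) and that a node map translates parameter names into `(node_id, input_name)` pairs. The function name `bind_workflow`, the node IDs, and the toy template are illustrative assumptions, not the real Animatrix templates.

```python
import copy
from typing import Any, Dict, Tuple

# parameter name -> (node id, input name); loaded from a node map file in practice
NodeMap = Dict[str, Tuple[str, str]]


def bind_workflow(template: Dict[str, Any], node_map: NodeMap,
                  params: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the template with params written into mapped inputs."""
    unknown = set(params) - set(node_map)
    if unknown:
        raise KeyError(f"parameters without a node mapping: {sorted(unknown)}")
    bound = copy.deepcopy(template)  # the frozen template is never mutated
    for name, value in sorted(params.items()):
        node_id, input_name = node_map[name]
        if node_id not in bound:
            raise KeyError(f"node {node_id!r} for {name!r} is missing from template")
        bound[node_id]["inputs"][input_name] = value
    return bound


# Illustrative two-node template; real templates are much larger.
template = {
    "6": {"class_type": "CLIPTextEncode", "inputs": {"text": ""}},
    "3": {"class_type": "KSampler", "inputs": {"seed": 0}},
}
node_map = {"prompt": ("6", "text"), "seed": ("3", "seed")}
bound = bind_workflow(template, node_map, {"prompt": "a dancer", "seed": 42})
```

Because unknown parameters and missing nodes raise immediately, a template change that drifts from its node map fails at bind time instead of producing a silently wrong render.
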
### 13.4 ComfyUI Execution Layer

Responsibilities:

- execute pre-approved workflow JSON
- expose queue, prompt, history, upload endpoints
- return output metadata

### 13.5 Asset Store

Responsibilities:

- raw upload persistence
- normalized derivative generation
- final output video persistence
- preview image generation

Recommended storage split:

- hot local cache on Linux origin
- durable object storage in S3 for long-term retention

## 13A. Current Infrastructure Contract

Animatrix v1 must be compatible with the currently operating Desineuron media stack as it exists today.

Live execution truth:

- public ComfyUI hostname: `https://comfy.desineuron.in`
- ingress elastic IP: `98.87.120.120`
- GPU private target currently managed behind ingress
- Linux origin currently: `192.168.1.2`

Current GPU-side storage truth:

- ComfyUI app root: `/opt/dlami/nvme/ComfyUI`
- HF cache: `/opt/dlami/nvme/hf`
- model staging root: `/opt/dlami/nvme/model-staging`
- model logs: `/opt/dlami/nvme/model-logs`

Current model hydration truth:

- durable bucket family already in use: `s3://project-velocity/models/`
- existing Wan hydration prefix: `s3://project-velocity/models/Wan2.2-Animate-14B/`

Animatrix must not introduce a second contradictory deployment path for ComfyUI. It must reuse this stable route and storage discipline.

## 13B. ComfyUI API Contract

The backend integration layer must be implemented against the current ComfyUI HTTP contract.

Required endpoints:

- `GET /`
- `POST /prompt`
- `GET /history/{prompt_id}`
- `GET /queue`
- `POST /upload/image`

Recommended extension checks:

- health probe against `/`
- prompt submission response validation
- history polling with bounded backoff
- queue introspection for operator dashboards

The backend must wrap these endpoints in a typed client and must not scatter raw HTTP calls throughout business logic.

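A typed wrapper over these endpoints could be sketched as follows. The class name `ComfyClient`, the injectable `transport` callable, and the 30-second timeout are assumptions made for illustration. The `{"prompt": ..., "client_id": ...}` request body and the `prompt_id` field in the reply reflect ComfyUI's commonly used HTTP contract, which should be verified against the deployed ComfyUI version. Multipart upload to `/upload/image` is omitted here.

```python
import json
import urllib.request
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional

# A transport takes (method, url, json_body_or_None) and returns decoded JSON.
Transport = Callable[[str, str, Optional[dict]], Any]


def http_transport(method: str, url: str, body: Optional[dict]) -> Any:
    """Default transport built on the standard library."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(url, data=data, method=method,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


@dataclass(frozen=True)
class ComfyClient:
    base_url: str  # e.g. the stable hostname, never a raw GPU IP
    transport: Transport = http_transport

    def submit_prompt(self, workflow: Dict[str, Any], client_id: str) -> str:
        reply = self.transport("POST", f"{self.base_url}/prompt",
                               {"prompt": workflow, "client_id": client_id})
        prompt_id = reply.get("prompt_id")
        if not prompt_id:  # prompt submission response validation
            raise RuntimeError(f"ComfyUI did not return a prompt_id: {reply}")
        return prompt_id

    def history(self, prompt_id: str) -> Dict[str, Any]:
        return self.transport("GET", f"{self.base_url}/history/{prompt_id}", None)

    def queue(self) -> Dict[str, Any]:
        return self.transport("GET", f"{self.base_url}/queue", None)
```

Injecting the transport keeps business logic free of raw HTTP calls and lets tests replace the network with a recording fake.
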
## 13C. Model and Node Manifest

### Workflow A: Animate Studio Required Assets

Required model family:

- `Wan2.2-Animate-14B`
- `clip_vision_h.safetensors`
- `wan_2.1_vae.safetensors`
- `umt5_xxl_fp8_e4m3fn_scaled.safetensors`

Required custom nodes:

- `ComfyUI-KJNodes`
- `ComfyUI-comfyui_controlnet_aux`

Suggested placement contract:

- diffusion model files under `ComfyUI/models/diffusion_models/`
- text encoder under `ComfyUI/models/text_encoders/`
- VAE under `ComfyUI/models/vae/`
- CLIP Vision under `ComfyUI/models/clip_vision/`

### Workflow B: Audio Performance Studio Required Assets

Required model family:

- `wan2.2_s2v_14B_fp8_scaled.safetensors` or `wan2.2_s2v_14B_bf16.safetensors`
- `wav2vec2_large_english_fp16.safetensors`
- `wan_2.1_vae.safetensors`
- `umt5_xxl_fp8_e4m3fn_scaled.safetensors`

Suggested placement contract:

- diffusion model under `ComfyUI/models/diffusion_models/`
- text encoder under `ComfyUI/models/text_encoders/`
- audio encoder under `ComfyUI/models/audio_encoders/`
- VAE under `ComfyUI/models/vae/`

### Deferred Workflow Assets

For future strict start/end-frame control:

- `Wan2.2 Fun Inp` models and optional associated LoRAs

For future stronger pose control:

- `Wan2.2 Fun Control`

The frontend and API must be written so these workflows can be added later without reworking the entire product shell.

## 14. File and Repository Blueprint

Animatrix should be structured as an application repository or top-level product directory with explicit separation between app, API, and workflow assets.

Recommended layout:

```text
Animatrix/
  docs/
    Animatrix Monolithic SRS - Wan 2.2 Flow Studio.md
  frontend/
    src/
      app/
      components/
      features/
      lib/
      styles/
  backend/
    app/
      api/
      services/
      models/
      repositories/
      workers/
  workflows/
    animate/
      wan22_animate_mix.json
      wan22_animate_move.json
    s2v/
      wan22_s2v_base.json
    shared/
      prompt_profiles/
      node_maps/
  scripts/
    deploy/
    media/
    sync/
  infra/
    systemd/
    nginx/
    caddy/
  tests/
    api/
    workflows/
    ui/
```

## 15. Workflow A Detailed Design: Animate Studio

### 15.1 Objective

Deliver a workflow that supports:

- character replacement from a source video
- character animation from a performer video
- prompt-guided visual refinement

### 15.2 Input Contract

Required:

- `prompt`
- `ground_truth_image`
- `motion_video`
- `mode`: `move` or `mix`

Optional:

- `reference_images[]`
- `pose_sheet_images[]`
- `negative_prompt`
- `duration_override_seconds`
- `aspect_ratio`
- `quality_profile`
- `seed`

### 15.3 Output Contract

Primary outputs:

- `video_mp4`
- `poster_frame_jpg`
- `job_manifest.json`
- `debug_metadata.json`

Secondary outputs:

- pose preview if preprocessing is enabled
- first-frame snapshot

### 15.4 Internal Workflow Stages

1. Ingest image and video
2. Normalize formats and dimensions
3. Extract first frame and thumbnail
4. Run optional DWPose or auxiliary preprocessing
5. Bind workflow JSON for `move` or `mix`
6. Upload normalized assets to ComfyUI
7. Submit workflow
8. Poll queue and history
9. Collect result paths
10. Persist final outputs and metadata

### 15.5 ComfyUI Notes

The official Animate workflow requires:

- `clip_vision_h.safetensors`
- `wan_2.1_vae.safetensors`
- `umt5_xxl_fp8_e4m3fn_scaled.safetensors`
- Animate diffusion model
- optional Lightning LoRA
- custom nodes:
  - `ComfyUI-KJNodes`
  - `ComfyUI-comfyui_controlnet_aux`

### 15.6 Product-Level Rule

Animatrix v1 must hide these internals from the standard UI, but the backend and operator docs must track them exactly.

## 16. Workflow B Detailed Design: Audio Performance Studio

### 16.1 Objective

Deliver a workflow that supports:

- talking-head and half-body performance
- singing and dialogue use cases
- audio-driven facial and motion synthesis

### 16.2 Input Contract

Required:

- `prompt`
- `ground_truth_image`
- `audio_file`

Optional:

- `reference_images[]`
- `pose_sheet_images[]`
- `negative_prompt`
- `framing_mode`: `portrait`, `half_body`, `full_body`
- `quality_profile`
- `seed`

### 16.3 Output Contract

Primary outputs:

- `video_mp4`
- `poster_frame_jpg`
- `job_manifest.json`
- `debug_metadata.json`

### 16.4 Internal Workflow Stages

1. Ingest image and audio
2. Normalize sample rate and file format
3. Infer required frame count from audio duration
4. Determine required S2V extension chunks
5. Bind workflow JSON
6. Upload image and audio to ComfyUI
7. Submit workflow
8. Poll queue and history
9. Collect output video
10. Persist artifacts

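Stages 3 and 4 above reduce to simple arithmetic once a frame rate and a per-chunk frame budget are fixed. The sketch below assumes 16 fps and 77 frames per chunk as placeholder defaults; both values are assumptions that must be read from the actual S2V workflow template, and the name `plan_s2v_length` is hypothetical.

```python
import math
from typing import NamedTuple


class S2VPlan(NamedTuple):
    total_frames: int
    chunks: int
    extension_segments: int  # chunks beyond the first one


def plan_s2v_length(audio_seconds: float, fps: int = 16,
                    chunk_frames: int = 77) -> S2VPlan:
    """Derive frame count and chunk count from audio duration.

    fps and chunk_frames are assumed defaults, not confirmed template values.
    """
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    total_frames = math.ceil(audio_seconds * fps)
    chunks = math.ceil(total_frames / chunk_frames)
    return S2VPlan(total_frames, chunks, chunks - 1)
```

Under these assumed defaults, a 10-second clip needs 160 frames, which rounds up to 3 chunks, so 2 extension segments.
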
### 16.5 ComfyUI Notes

The official S2V workflow requires:

- `wan2.2_s2v_14B_fp8_scaled.safetensors` or bf16 variant
- `wav2vec2_large_english_fp16.safetensors`
- `wan_2.1_vae.safetensors`
- `umt5_xxl_fp8_e4m3fn_scaled.safetensors`

The ComfyUI docs note that:

- fp8 uses less VRAM
- bf16 may reduce quality degradation
- Lightning LoRA can reduce generation time but can also significantly reduce quality and dynamics

Therefore Animatrix must default to:

- `Standard`: fp8 without aggressive LoRA by default for customer-facing quality stability
- `Draft`: fp8 with acceleration options
- `High`: bf16 where hardware allows

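These defaults can be captured as a small lookup table with a hardware fallback. This is a sketch only: the `S2VProfile` type and `resolve_profile` function are hypothetical, and treating "acceleration options" in `Draft` as the Lightning LoRA is an interpretation rather than something the source states.

```python
from dataclasses import dataclass

FP8_MODEL = "wan2.2_s2v_14B_fp8_scaled.safetensors"
BF16_MODEL = "wan2.2_s2v_14B_bf16.safetensors"


@dataclass(frozen=True)
class S2VProfile:
    model_file: str
    use_lightning_lora: bool  # assumed meaning of "acceleration options"


PROFILES = {
    "draft": S2VProfile(FP8_MODEL, True),
    "standard": S2VProfile(FP8_MODEL, False),
    "high": S2VProfile(BF16_MODEL, False),
}


def resolve_profile(name: str, bf16_available: bool) -> S2VProfile:
    """Pick the model variant for a profile, falling back to fp8 if needed."""
    profile = PROFILES[name.lower()]
    if profile.model_file == BF16_MODEL and not bf16_available:
        # "bf16 where hardware allows": otherwise keep the LoRA choice, use fp8
        return S2VProfile(FP8_MODEL, profile.use_lightning_lora)
    return profile
```
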
## 17. UI-to-Workflow Mapping

The UI must map cleanly to backend request objects.

### 17.1 Shared Fields

- `mode`
- `prompt`
- `negative_prompt`
- `ground_truth_asset_id`
- `reference_asset_ids[]`
- `pose_sheet_asset_ids[]`
- `aspect_ratio`
- `quality_profile`
- `seed`

### 17.2 Animate-Specific Fields

- `motion_video_asset_id`
- `animate_submode`
- `background_preservation`
- `relighting`
- `extension_segments`

### 17.3 Audio-Specific Fields

- `audio_asset_id`
- `framing_mode`
- `audio_adherence_profile`
- `extension_segments`

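The field lists above could be expressed as backend request objects along the following lines. The field names come from this section; the class names, default values, and the validation in `__post_init__` are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SharedJobFields:
    mode: str
    prompt: str
    ground_truth_asset_id: str
    negative_prompt: str = ""
    reference_asset_ids: List[str] = field(default_factory=list)
    pose_sheet_asset_ids: List[str] = field(default_factory=list)
    aspect_ratio: str = "16:9"          # assumed default
    quality_profile: str = "standard"   # assumed default
    seed: Optional[int] = None


@dataclass
class AnimateJobRequest(SharedJobFields):
    motion_video_asset_id: str = ""
    animate_submode: str = "move"
    background_preservation: bool = False
    relighting: bool = False
    extension_segments: int = 0

    def __post_init__(self) -> None:
        if not self.motion_video_asset_id:
            raise ValueError("motion_video_asset_id is required")
        if self.animate_submode not in ("move", "mix"):
            raise ValueError("animate_submode must be 'move' or 'mix'")


@dataclass
class AudioJobRequest(SharedJobFields):
    audio_asset_id: str = ""
    framing_mode: str = "half_body"
    audio_adherence_profile: str = "default"  # assumed preset name
    extension_segments: int = 0

    def __post_init__(self) -> None:
        if not self.audio_asset_id:
            raise ValueError("audio_asset_id is required")
        if self.framing_mode not in ("portrait", "half_body", "full_body"):
            raise ValueError("unsupported framing_mode")
```

Keeping the shared fields in one base class means a later Workflow C or D only adds its own subclass rather than changing the existing request shapes.
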
## 18. Suggested Backend API

### 18.1 Asset Endpoints

- `POST /api/assets/image`
- `POST /api/assets/video`
- `POST /api/assets/audio`
- `GET /api/assets/{asset_id}`

### 18.2 Job Endpoints

- `POST /api/jobs/animate`
- `POST /api/jobs/audio-performance`
- `GET /api/jobs/{job_id}`
- `GET /api/jobs/{job_id}/events`
- `GET /api/jobs/{job_id}/outputs`
- `POST /api/jobs/{job_id}/cancel`

### 18.3 Admin Endpoints

- `GET /api/admin/workflows`
- `GET /api/admin/health`
- `GET /api/admin/queue`
- `POST /api/admin/retry/{job_id}`

### 18.4 Websocket or SSE Progress Channel

Recommended:

- `GET /api/jobs/{job_id}/stream`

This should emit:

- accepted
- uploaded
- queued
- executing
- collecting_outputs
- completed
- failed

The frontend should use this channel if available and fall back to polling if the connection drops.

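The emitted states can be modeled as an enumeration with a formatter for Server-Sent Events frames. This is a sketch under assumptions: the SSE event name `status`, the payload shape, and the strictly forward transition order are illustrative choices, not requirements stated in this document.

```python
import json
from enum import Enum


class JobStatus(str, Enum):
    ACCEPTED = "accepted"
    UPLOADED = "uploaded"
    QUEUED = "queued"
    EXECUTING = "executing"
    COLLECTING_OUTPUTS = "collecting_outputs"
    COMPLETED = "completed"
    FAILED = "failed"


TERMINAL = {JobStatus.COMPLETED, JobStatus.FAILED}
_ORDER = [JobStatus.ACCEPTED, JobStatus.UPLOADED, JobStatus.QUEUED,
          JobStatus.EXECUTING, JobStatus.COLLECTING_OUTPUTS, JobStatus.COMPLETED]


def can_transition(current: JobStatus, new: JobStatus) -> bool:
    """Assumed rule: move one step forward, or fail from any live state."""
    if current in TERMINAL:
        return False
    if new is JobStatus.FAILED:
        return True
    return _ORDER.index(new) == _ORDER.index(current) + 1


def format_sse(job_id: str, status: JobStatus) -> str:
    """Render one Server-Sent Events frame for a status change."""
    payload = json.dumps({"job_id": job_id, "status": status.value}, sort_keys=True)
    return f"event: status\ndata: {payload}\n\n"
```

A client that sees a terminal status can close the stream; a client that loses the connection can fall back to polling `GET /api/jobs/{job_id}` and compare against the same status values.
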
## 19. Data Model

### 19.1 Asset

Fields:

- `asset_id`
- `asset_type`
- `mime_type`
- `original_filename`
- `storage_url`
- `thumbnail_url`
- `width`
- `height`
- `duration_seconds`
- `size_bytes`
- `created_at`

### 19.2 Job

Fields:

- `job_id`
- `mode`
- `workflow_template`
- `status`
- `submitted_by`
- `prompt`
- `negative_prompt`
- `settings_json`
- `comfy_prompt_id`
- `created_at`
- `updated_at`

### 19.3 JobOutput

Fields:

- `output_id`
- `job_id`
- `video_url`
- `poster_url`
- `manifest_url`
- `duration_seconds`
- `resolution`
- `fps`
- `created_at`

## 20. Workflow Template Governance

Workflow JSON must be treated as versioned product assets.

Rules:

- each production workflow JSON must have an immutable version identifier
- node IDs must be mapped in a dedicated config file
- backend parameter injection must never depend on informal manual node lookup
- each workflow change must pass snapshot regression checks

Required metadata for every workflow:

- `workflow_name`
- `workflow_version`
- `model_family`
- `required_assets`
- `required_models`
- `custom_nodes`
- `compatible_backend_version`

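Two of these rules lend themselves to automated checks: required metadata can be verified per manifest, and a content fingerprint can back both the immutable version identifier and snapshot regression checks. The sketch below is one possible approach; the function names and the use of SHA-256 over canonical JSON are assumptions, not an existing Animatrix convention.

```python
import hashlib
import json
from typing import Any, Dict, List

REQUIRED_METADATA = (
    "workflow_name", "workflow_version", "model_family", "required_assets",
    "required_models", "custom_nodes", "compatible_backend_version",
)


def missing_metadata(manifest: Dict[str, Any]) -> List[str]:
    """Return the required metadata keys that the manifest lacks."""
    return [key for key in REQUIRED_METADATA if key not in manifest]


def workflow_fingerprint(workflow: Dict[str, Any]) -> str:
    """Hash the workflow JSON in canonical form (sorted keys, no whitespace)."""
    canonical = json.dumps(workflow, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

A snapshot test can store the fingerprint next to each `workflow_version`; any edit to the JSON that is not accompanied by a version bump then fails the check, while pure key reordering does not.
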
## 21. Storage and Delivery Design

### 21.1 Inputs

Store raw uploads in durable storage with stable references.

Recommended:

- object storage in S3
- local temporary cache for preprocessing

### 21.2 Outputs

Store:

- mp4 output
- poster image
- optional animated preview
- manifest json

### 21.3 Delivery

Outputs must be streamable from a public HTTPS origin via ingress.

If using Linux origin:

- serve final assets through nginx under Animatrix public domain

If using S3-backed storage:

- use signed or public-read delivery depending on account mode

## 22. Quality Profiles

Animatrix must expose productized quality profiles rather than raw step counts to users.

### 22.1 Draft

Purpose:

- internal ideation
- faster previews

Behavior:

- lower resolution
- lower steps
- acceleration LoRA allowed

### 22.2 Standard

Purpose:

- most normal production runs

Behavior:

- balanced speed and quality
- conservative defaults
- no quality-destructive shortcuts unless explicitly enabled

### 22.3 High

Purpose:

- demo and delivery quality

Behavior:

- higher quality model variant when available
- larger resolution
- longer runtime accepted

## 23. Error Handling

Failure classes:

- missing asset
- invalid asset format
- unsupported aspect ratio
- workflow binding failure
- ComfyUI upload failure
- ComfyUI queue failure
- generation timeout
- result collection failure

User-facing errors must be simplified.

Operator-facing logs must preserve exact failure cause.

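The split between simplified user messages and exact operator logs can be sketched with an enumeration of the failure classes above. The message wording, the logger name, and the `report_failure` helper are hypothetical.

```python
import logging
from enum import Enum

log = logging.getLogger("animatrix.errors")  # assumed logger name


class FailureClass(Enum):
    MISSING_ASSET = "missing_asset"
    INVALID_ASSET_FORMAT = "invalid_asset_format"
    UNSUPPORTED_ASPECT_RATIO = "unsupported_aspect_ratio"
    WORKFLOW_BINDING_FAILURE = "workflow_binding_failure"
    COMFYUI_UPLOAD_FAILURE = "comfyui_upload_failure"
    COMFYUI_QUEUE_FAILURE = "comfyui_queue_failure"
    GENERATION_TIMEOUT = "generation_timeout"
    RESULT_COLLECTION_FAILURE = "result_collection_failure"


_INPUT_PROBLEMS = {
    FailureClass.MISSING_ASSET,
    FailureClass.INVALID_ASSET_FORMAT,
    FailureClass.UNSUPPORTED_ASPECT_RATIO,
}


def report_failure(job_id: str, failure: FailureClass, detail: str) -> str:
    """Log the exact cause for operators and return a simplified user message."""
    log.error("job=%s failure=%s detail=%s", job_id, failure.value, detail)
    if failure in _INPUT_PROBLEMS:
        return "One of your uploads could not be used. Please check the file and try again."
    if failure is FailureClass.GENERATION_TIMEOUT:
        return "Generation took too long. Please try again."
    return "Generation failed. Please try again or contact an operator."
```
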
## 23A. Validation Rules

### Shared Validation

- reject empty prompt if prompt is required by the selected workflow profile
- reject missing ground-truth image
- reject unsupported file extensions
- reject files above configured upload limit

### Animate Studio Validation

- reject missing motion video
- reject unsupported source video codecs that cannot be normalized
- reject conflicting `move` and `mix` settings

### Audio Performance Studio Validation

- reject missing audio
- reject audio longer than configured maximum duration for the selected profile
- normalize sample rate before workflow submission

### Pose Sheet Validation

- accept only supported image formats
- cap pose sheet image count in v1
- mark pose sheet as "soft guidance" in job metadata unless a later hard-control pipeline is introduced

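As an illustration, the Audio Performance Studio rules together with the shared and pose sheet rules might be checked as below. The accepted extensions and the pose sheet cap are placeholder assumptions, since this document leaves the concrete values to configuration; the function names are hypothetical. Upload size and codec checks are omitted for brevity.

```python
from pathlib import PurePath
from typing import List, Optional, Sequence, Set

# Assumed values; the real lists and limits belong in configuration.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}
AUDIO_EXTS = {".wav", ".mp3", ".flac"}
MAX_POSE_SHEET_IMAGES = 8


def _check_ext(label: str, filename: Optional[str], allowed: Set[str]) -> List[str]:
    if not filename:
        return [f"missing {label}"]
    if PurePath(filename).suffix.lower() not in allowed:
        return [f"unsupported file extension for {label}"]
    return []


def validate_audio_job(prompt: str, ground_truth: Optional[str],
                       audio: Optional[str], audio_seconds: float,
                       max_audio_seconds: float,
                       pose_sheets: Sequence[str] = ()) -> List[str]:
    """Return every validation error found; an empty list means accept."""
    errors: List[str] = []
    if not prompt.strip():
        errors.append("empty prompt")
    errors += _check_ext("ground-truth image", ground_truth, IMAGE_EXTS)
    errors += _check_ext("audio", audio, AUDIO_EXTS)
    if audio and audio_seconds > max_audio_seconds:
        errors.append("audio longer than configured maximum")
    if len(pose_sheets) > MAX_POSE_SHEET_IMAGES:
        errors.append("too many pose sheet images")
    for name in pose_sheets:
        errors += _check_ext("pose sheet image", name, IMAGE_EXTS)
    return errors
```

Returning all errors at once, rather than raising on the first, lets the frontend surface every missing input before submission.
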
## 24. Observability
|
||
|
||
Minimum operational telemetry:
|
||
|
||
- job creation rate
|
||
- queue depth
|
||
- mean wait time
|
||
- mean generation time by workflow
|
||
- failure rate by workflow version
|
||
- storage growth
|
||
- top asset sizes
|
||
|
||
Required correlation identifiers:
|
||
|
||
- `job_id`
|
||
- `asset_id`
|
||
- `comfy_prompt_id`
|
||
|
||
## 25. Security and Access Control
|
||
|
||
Rules:
|
||
|
||
- do not expose raw ComfyUI publicly to end users as the product surface
|
||
- backend owns ComfyUI credentials and workflow orchestration
|
||
- validate file size and MIME type on upload
|
||
- strip executable uploads
|
||
- limit accepted formats
|
||
- preserve audit trail for every run
|
||
|
||
## 26. Team and Operator UX
|
||
|
||
The system must support:
|
||
|
||
- internal team usage through the stable ingress
|
||
- supportable operator triage
|
||
- easy workflow version rollback
|
||
- safe demo usage during sales calls
|
||
|
||
Operators need:
|
||
|
||
- admin queue view
|
||
- job replay
|
||
- access to input and output manifests
|
||
- workflow version annotation
|
||
|
||
## 27. Non-Functional Requirements

### 27.1 Reliability

- no direct dependency on an ephemeral GPU public IP
- graceful retry around ComfyUI upload and history polling
- job state persisted outside memory

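Graceful retry around upload and history polling could take the form of a small wrapper with exponential backoff. This is a sketch under assumptions: the attempt count, base delay, and the choice to retry only on `ConnectionError` and `TimeoutError` are illustrative, and `call` stands in for whatever function talks to ComfyUI.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retry(call: Callable[[], T],
               attempts: int = 4,
               base_delay_s: float = 0.5,
               sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `call`, retrying transient network errors with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the job state machine
            sleep(base_delay_s * 2 ** attempt)
    raise RuntimeError("unreachable")
```

The `sleep` parameter is injectable so the backoff schedule can be tested without waiting.
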
### 27.2 Performance

- fast upload validation
- async polling and result collection
- cached thumbnails

### 27.3 Scalability

- workflow templates are stateless
- API is horizontally scalable
- storage is externalized

### 27.4 Maintainability

- one source-of-truth workflow config per mode
- explicit model manifest
- no hidden hand-edited production JSON

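One possible guard against hand-edited production JSON is to record a digest of each frozen workflow and refuse to submit a workflow whose digest has drifted. The sketch below assumes a manifest that maps each mode to a SHA-256 digest; the manifest shape and function names are hypothetical.

```python
import hashlib
import json


def workflow_digest(workflow: dict) -> str:
    """Hash a canonical serialization so key order and whitespace do not matter."""
    canonical = json.dumps(workflow, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_workflow(mode: str, workflow: dict, manifest: dict) -> None:
    """Raise if the workflow for `mode` differs from the registered digest."""
    expected = manifest.get(mode)
    if expected is None:
        raise KeyError(f"no registered workflow for mode: {mode}")
    if workflow_digest(workflow) != expected:
        raise ValueError(f"workflow for mode '{mode}' does not match the manifest")
```

Running this check at backend startup and before each submission would make an unreviewed edit fail loudly instead of silently changing output.
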
### 27.5 Sales Readiness

- stable hostname
- reliable queue messaging
- polished success and failure states
- deterministic demo inputs

## 27A. Demo and Commercial Readiness Requirements

Animatrix will be used in live demos and pre-sales conversations. That raises the bar for stability and polish.

Required product behavior:

- first meaningful UI paint fast enough for live sales use
- one-click sample project loading for demo mode
- clear progress messaging during long generations
- shareable output URL or operator download path
- no raw ComfyUI terminology in the customer-facing layer unless explicitly in admin mode

Required operator support behavior:

- known-good demo assets packaged and versioned
- visible warning when GPU queue is saturated
- ability to retry a failed job without recreating all metadata manually

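Retrying without re-entering metadata can be sketched as cloning the failed job record under a new ID while keeping its inputs and workflow version. The `JobRecord` fields, the status strings, and the `retry_of` back-reference are assumptions, not a schema defined by this document.

```python
import uuid
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class JobRecord:
    """Hypothetical persisted job record."""
    job_id: str
    mode: str
    workflow_version: str
    input_manifest: dict
    status: str = "queued"
    retry_of: Optional[str] = None


def retry_job(failed: JobRecord) -> JobRecord:
    """Create a new queued job that reuses the failed job's inputs."""
    if failed.status != "failed":
        raise ValueError("only failed jobs can be retried")
    return replace(
        failed,
        job_id=uuid.uuid4().hex,
        status="queued",
        retry_of=failed.job_id,  # keeps the audit trail linked to the original run
    )
```
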
## 28. MVP Acceptance Criteria

Animatrix v1 is only considered complete when all of the following are true:

1. A user can upload a ground-truth image, type a prompt, attach a motion video, select `Move` or `Mix`, and receive a finished video output.
2. A user can upload a ground-truth image, type a prompt, attach audio, and receive an audio-driven character video.
3. Both flows work through the stable Desineuron ingress model and do not depend on hardcoded GPU IPs.
4. Every run produces a persisted job record and output manifest.
5. Generated videos are streamable over HTTPS.
6. Operators can inspect job state and correlate product job ID to ComfyUI prompt ID.
7. The UI remains simple enough for a non-technical demo operator.

## 29. Explicit Product Decisions

### 29.1 What v1 Must Say No To

Animatrix v1 must not claim:

- perfect deterministic pose-sheet control
- exact first and last frame locking
- full timeline editing
- full audio mastering

### 29.2 What v1 Must Say Yes To

Animatrix v1 can truthfully claim:

- guided character animation
- guided character replacement
- audio-driven talking or performance video
- reference-assisted generation
- production-safe simplified UI on top of ComfyUI

## 30. Recommended Delivery Phases

### Phase 1

- backend skeleton
- asset model
- one frozen Animate workflow
- one frozen S2V workflow
- barebones frontend

### Phase 2

- quality profiles
- operator dashboard
- output gallery
- S3 persistence

### Phase 3

- first/last-frame workflow
- stronger pose control
- reusable character libraries

## 31. Final Architecture Recommendation

Build Animatrix as a thin product layer over stable infrastructure that already exists:

- keep ComfyUI where it is
- keep ingress where it is
- add a dedicated Animatrix backend
- keep the frontend intentionally minimal
- treat workflow JSON as versioned software artifacts

Do not begin by building a large generic creative suite.

Build the narrowest saleable product first:

- `Animate Studio`
- `Audio Performance Studio`

Then expand to:

- `Start/End Frame Studio`
- `Pose Control Studio`

## 32. Bottom Line

Animatrix v1 should be a Flow-like creative surface backed by two real Wan 2.2 workflows, not one imaginary super-workflow.

The correct implementation target is:

- one frontend
- one orchestration backend
- two workflow families
- one stable ingress-compatible execution path
- one durable output system

If the team follows this document strictly, the result will be productizable, supportable, and compatible with the current Desineuron infrastructure, without overstating model capabilities.