forked from sagnik/Project_Astral
Project Brief: Astral AI Creative Suite
|
|
1. Project Overview
|
|
Objective: To build an internal web application that automates the production of high-fidelity commercial video shorts. The system converts raw actor data (iPhone Pro sensor/LiDAR) and static product assets into dynamic, branded lifestyle videos using a headless ComfyUI backend.
|
|
2. Technical Stack
|
|
Frontend: React.js / Next.js (Framework), Tailwind CSS (Styling & Animation), Framer Motion (UI transitions).
|
|
Backend: ComfyUI API (Headless Image/Video Generation), FastAPI (Middleware/Orchestration).
|
|
Models: LTX-2 (Video Generation), SDXL or Flux (Image Generation), ControlNet (Pose/Depth via LiDAR).
|
|
Storage: Local Synology/TrueNAS via SMB/NFS mount.
|
|
3. Data Acquisition & Input Pipeline
|
|
A. The "Actor Digital Twin" (iPhone Input)
|
|
Input Type: Multi-angle 48MP HEIF images + .OBJ or .USDZ files from the iPhone Pro LiDAR sensor.
|
|
Processing:
|
|
The frontend extracts EXIF/Sensor data to determine focal length.
|
|
LiDAR depth maps are converted into grayscale depth buffers to serve as ControlNet inputs, ensuring the 3D space of the actor is respected.
|
|
Dataset: Images are automatically tagged and sent to a private vector database for LoRA-style "In-Context" retrieval.
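The depth conversion step described under "Processing" can be sketched as follows. This is a minimal illustration, assuming the LiDAR depth arrives as a 2D grid of distances in metres with missing samples stored as None or 0; the function name depth_to_grayscale and its parameters are hypothetical and not part of any iPhone or ComfyUI API. Depth ControlNets commonly expect near surfaces to be bright, so the sketch inverts the range by default.

```python
from typing import List, Optional, Sequence


def depth_to_grayscale(
    depth_m: Sequence[Sequence[Optional[float]]],
    near: Optional[float] = None,
    far: Optional[float] = None,
    invert: bool = True,
) -> List[List[int]]:
    """Map a 2D grid of depths (metres) to 8-bit grayscale values (0-255).

    Hypothetical helper: missing samples (None or <= 0) become 0 (black).
    With invert=True, the nearest valid sample maps to 255 (white).
    """
    valid = [d for row in depth_m for d in row if d is not None and d > 0]
    if not valid:
        raise ValueError("depth map contains no valid samples")

    lo = min(valid) if near is None else near
    hi = max(valid) if far is None else far
    span = hi - lo

    out: List[List[int]] = []
    for row in depth_m:
        out_row: List[int] = []
        for d in row:
            if d is None or d <= 0:
                out_row.append(0)  # no LiDAR return for this pixel
                continue
            # Clamp into [lo, hi], then normalise to [0, 1].
            t = 0.5 if span <= 0 else (min(max(d, lo), hi) - lo) / span
            if invert:
                t = 1.0 - t
            out_row.append(int(round(t * 255)))
        out.append(out_row)
    return out


if __name__ == "__main__":
    sample = [[1.0, 2.0], [3.0, 0.0]]
    for gray_row in depth_to_grayscale(sample):
        print(gray_row)
```

Fixing near and far across all images of one actor set, rather than normalising per image, would keep the grayscale scale comparable between angles; which behaviour the pipeline needs is a design decision the brief leaves open.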
|
|
B. Product Integration
|
|
Assets: Transparent PNGs of products (watches, clothing, cars).
|
|
Logic: The system uses IP-Adapter (Image Prompt Adapter) to inject the product's visual identity into generation as image-prompt conditioning, with the goal of preserving fine detail (e.g., watch dial text or car reflections).
|
|
4. The "Hidden" ComfyUI Workflow Logic
|
|
The backend executes a multi-stage pipeline triggered by a single API call:
|
|
Stage 1: Multi-Modal Fusion (Image Gen)
|
|
Combines the Actor LoRA (Identity), LiDAR Depth Map (Pose/Space), and Product IP-Adapter (Object).
|
|
Result: A high-resolution static frame of the actor interacting with the product.
|
|
Stage 2: Temporal Expansion (LTX-2 Video Gen)
|
|
The Stage 1 image is used as the Initial Frame for the LTX-2 DiT model.
|
|
In-Context LoRAs are applied to keep the product temporally consistent across frames (preventing the model from hallucinating different car wheels or watch faces).
|
|
Stage 3: Refinement & Delivery
|
|
FaceDetailer nodes detect faces in each frame and re-render them at higher resolution to restore facial detail.
|
|
The video is encoded to H.265 and saved directly to the NAS path.
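The "single API call" that triggers the stages above can be sketched as a request to ComfyUI's POST /prompt endpoint, which accepts a JSON body containing the workflow graph under "prompt" and an optional "client_id". The template below is a trimmed, hypothetical fragment: the node IDs, file names, and checkpoint name are placeholders, and a real exported API-format workflow for this pipeline would contain many more nodes (ControlNet, IP-Adapter, LTX-2, FaceDetailer, encoder). The helper names are assumptions for illustration.

```python
import copy
import json
import urllib.request
import uuid
from typing import Any, Dict, Optional

# Hypothetical fragment of an API-format workflow: node id -> node definition.
WORKFLOW_TEMPLATE: Dict[str, Dict[str, Any]] = {
    "1": {"class_type": "LoadImage", "inputs": {"image": "DEPTH_PLACEHOLDER.png"}},
    "2": {"class_type": "LoadImage", "inputs": {"image": "PRODUCT_PLACEHOLDER.png"}},
    "3": {"class_type": "CLIPTextEncode", "inputs": {"text": "", "clip": ["4", 1]}},
    "4": {
        "class_type": "CheckpointLoaderSimple",
        "inputs": {"ckpt_name": "hypothetical_model.safetensors"},
    },
}


def build_prompt_request(
    template: Dict[str, Dict[str, Any]],
    depth_image: str,
    product_image: str,
    prompt_text: str,
    client_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Fill the per-job inputs into a copy of the template and wrap it for /prompt."""
    graph = copy.deepcopy(template)  # never mutate the shared template
    graph["1"]["inputs"]["image"] = depth_image
    graph["2"]["inputs"]["image"] = product_image
    graph["3"]["inputs"]["text"] = prompt_text
    return {"prompt": graph, "client_id": client_id or uuid.uuid4().hex}


def submit(payload: Dict[str, Any], base_url: str = "http://127.0.0.1:8188") -> Dict[str, Any]:
    """POST the payload to ComfyUI; base_url is an assumption (8188 is the default port)."""
    request = urllib.request.Request(
        f"{base_url}/prompt",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)  # includes a prompt_id for tracking the job


if __name__ == "__main__":
    job = build_prompt_request(
        WORKFLOW_TEMPLATE, "actor_depth.png", "watch.png", "Professional car commercial in rain"
    )
    print(json.dumps(job, indent=2))
```

Keeping the workflow as a server-side template and patching only a few named inputs is one way to keep the graph "hidden" from the frontend; the FastAPI middleware would be a natural place for this logic.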
|
|
5. Functional Requirements (Frontend)
|
|
Drag-and-Drop Interface: Separate zones for "Actor Set" (multiple files) and "Product Asset."
|
|
System Prompt Orchestrator: A simple text box where the user enters a short prompt, e.g., "Professional car commercial in rain." The frontend then wraps it in a "Hidden System Prompt" of style keywords (e.g., 8k, cinematic lighting, shot on Arri Alexa, highly detailed) before submission.
|
|
Real-time Progress: A Tailwind-styled progress bar driven by WebSockets connecting to the ComfyUI queue status.
|
|
Local Gallery: A view linked to the NAS to preview previous generations.
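Two of the requirements above, prompt wrapping and progress reporting, reduce to small pure functions. The sketch below is written in Python for consistency with the FastAPI middleware, although the brief places this logic in the frontend; the function names are hypothetical. It assumes the hidden prompt is appended as a comma-separated suffix, and that progress arrives as ComfyUI WebSocket text messages of type "progress" carrying "value" and "max" fields.

```python
import json
from typing import Optional

# Style keywords taken from the example in the brief.
HIDDEN_STYLE_SUFFIX = "8k, cinematic lighting, shot on Arri Alexa, highly detailed"


def wrap_prompt(user_prompt: str, style_suffix: str = HIDDEN_STYLE_SUFFIX) -> str:
    """Append the hidden style keywords to the user's short prompt."""
    # Collapse whitespace and drop a trailing full stop before appending.
    cleaned = " ".join(user_prompt.split()).rstrip(".")
    if not cleaned:
        raise ValueError("prompt is empty")
    return f"{cleaned}, {style_suffix}"


def progress_percent(raw_message) -> Optional[float]:
    """Return 0-100 for a ComfyUI 'progress' message, or None for anything else."""
    try:
        message = json.loads(raw_message)
    except (TypeError, ValueError):
        return None  # binary preview frames and malformed text are ignored
    if not isinstance(message, dict) or message.get("type") != "progress":
        return None
    data = message.get("data") or {}
    value, maximum = data.get("value"), data.get("max")
    if not isinstance(value, (int, float)) or not isinstance(maximum, (int, float)):
        return None
    if maximum <= 0:
        return None
    return max(0.0, min(100.0, 100.0 * value / maximum))


if __name__ == "__main__":
    print(wrap_prompt("Professional car commercial in rain."))
    print(progress_percent('{"type": "progress", "data": {"value": 5, "max": 20}}'))
```

Note that a "progress" message reports the step count of the node currently executing, so a multi-stage workflow will restart from zero at each sampler; a single smooth bar would need per-stage weighting on top of this.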
|
|
6. Infrastructure & Deployment
|
|
Local Server: Minimum 2x NVIDIA RTX 4090 (24GB VRAM each): one GPU for the image generation stage and one for LTX-2 inference.
|
|
Networking: 10GbE connection between the Workstation and the NAS so that high-bitrate video transfers do not become a bottleneck.
|
|
Security: Air-gapped or VPN-restricted; since the dataset is private, no data leaves the local network.
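As a rough sanity check on the networking requirement, the arithmetic below estimates how long one finished clip takes to cross the link. The 50 Mbit/s encode bitrate and the 70% link efficiency are illustrative assumptions, not figures from this brief, and intermediate artifacts such as uncompressed frame sequences would be considerably larger than the final H.265 file.

```python
def clip_size_bytes(duration_s: float, bitrate_mbps: float) -> float:
    """Size of an encoded clip, given its duration and average bitrate in Mbit/s."""
    return duration_s * bitrate_mbps * 1_000_000 / 8


def transfer_seconds(size_bytes: float, link_gbps: float = 10.0, efficiency: float = 0.7) -> float:
    """Time to move size_bytes over a link, derated by an assumed protocol efficiency."""
    usable_bits_per_second = link_gbps * 1_000_000_000 * efficiency
    return size_bytes * 8 / usable_bits_per_second


if __name__ == "__main__":
    size = clip_size_bytes(duration_s=5, bitrate_mbps=50)  # assumed bitrate
    print(f"clip size: {size / 1_000_000:.2f} MB")
    print(f"10GbE: {transfer_seconds(size):.3f} s")
    print(f"1GbE:  {transfer_seconds(size, link_gbps=1.0):.3f} s")
```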
|
|
7. Project Phases
|
|
Phase 1 (Setup): Configure ComfyUI with LTX-2 nodes and verify API connectivity.
|
|
Phase 2 (Workflow): Build the "Hidden" .json workflow that accepts LiDAR depth and product images.
|
|
Phase 3 (App Dev): Develop the Next.js frontend and integrate the /prompt API endpoint.
|
|
Phase 4 (Storage): Configure automated file-moving scripts to the NAS.
|
|
Phase 5 (Testing): Benchmark generation speed (Target: < 2 mins per 5-second clip).
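The Phase 5 benchmark can be sketched as a small timing harness. The benchmark function and its generate argument are hypothetical: generate stands in for whatever callable submits a job and blocks until the 5-second clip is written to the NAS. The sketch judges the target against the slowest run rather than the mean, which is one possible reading of "< 2 mins per 5-second clip".

```python
import statistics
import time
from typing import Callable, Dict, Union

TARGET_SECONDS = 120.0  # "< 2 mins per 5-second clip"


def benchmark(
    generate: Callable[[], object],
    runs: int = 3,
    target_s: float = TARGET_SECONDS,
) -> Dict[str, Union[int, float, bool]]:
    """Time several end-to-end generations and compare the slowest to the target."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        generate()  # hypothetical: submit the job and wait for the finished clip
        durations.append(time.perf_counter() - start)
    worst = max(durations)
    return {
        "runs": runs,
        "mean_s": statistics.mean(durations),
        "max_s": worst,
        "meets_target": worst < target_s,
    }


if __name__ == "__main__":
    # Stand-in workload; replace with a real generation call.
    print(benchmark(lambda: time.sleep(0.01), runs=2))
```

The first run after a server start typically includes model loading, so it may be worth reporting cold and warm runs separately.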