Project_Astral/.agent Context/Project Brief.txt

Project Brief: Astral AI Creative Suite
1. Project Overview
Objective: To build an internal web application that automates the production of high-fidelity commercial video shorts. The system converts raw actor data (iPhone Pro sensor/LiDAR) and static product assets into dynamic, branded lifestyle videos using a headless ComfyUI backend.
2. Technical Stack
Frontend: React.js / Next.js (Framework), Tailwind CSS (Styling & Animation), Framer Motion (UI transitions).
Backend: ComfyUI API (Headless Image/Video Generation), FastAPI (Middleware/Orchestration).
Models: LTX-2 (Video Generation), SDXL or Flux (Image Generation), ControlNet (Pose/Depth via LiDAR).
Storage: Local Synology/TrueNAS via SMB/NFS mount.
3. Data Acquisition & Input Pipeline
A. The "Actor Digital Twin" (iPhone Input)
Input Type: Multi-angle 48MP HEIF images + .OBJ or .USDZ files from the iPhone Pro LiDAR sensor.
Processing:
The frontend extracts EXIF/Sensor data to determine focal length.
LiDAR depth maps are converted into grayscale depth buffers to serve as ControlNet inputs, ensuring the 3D space of the actor is respected.
Dataset: Images are automatically tagged and stored in a private vector database, so that they can be retrieved as "In-Context" identity references (LoRA-style conditioning) at generation time.
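The depth conversion described under Processing could be sketched as follows. This is a minimal illustration, not the project's implementation: the function name, the `near`/`far` clipping range (0.25 m to 5.0 m), and the "near is bright" polarity are assumptions chosen to match a common convention for depth ControlNet inputs, and the input is assumed to be a grid of depth samples in meters.

```python
from typing import List, Optional


def depth_to_grayscale(
    depth_m: List[List[Optional[float]]],
    near: float = 0.25,  # assumption: closest distance of interest, in meters
    far: float = 5.0,    # assumption: farthest distance of interest, in meters
) -> List[List[int]]:
    """Map metric depth samples to 8-bit grayscale (near = bright, far = dark)."""
    if far <= near:
        raise ValueError("far must be greater than near")
    image = []
    for row in depth_m:
        out_row = []
        for d in row:
            if d is None or d != d:  # missing sample or NaN
                out_row.append(0)
                continue
            clamped = min(max(d, near), far)
            # Invert so that closer surfaces are brighter.
            t = (far - clamped) / (far - near)
            out_row.append(round(t * 255))
        image.append(out_row)
    return image
```

Samples outside the range are clamped rather than discarded, so the actor's silhouette stays intact even when the background exceeds the sensor's useful range.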
B. Product Integration
Assets: Transparent PNGs of products (watches, clothing, cars).
Logic: The system uses IP-Adapter (Image Prompt Adapter) to "inject" the product's visual identity into the diffusion process as image-prompt conditioning, with the aim of preserving fine detail (e.g., watch dial text or car reflections).
4. The "Hidden" ComfyUI Workflow Logic
The backend executes a multi-stage pipeline triggered by a single API call:
Stage 1: Multi-Modal Fusion (Image Gen)
Combines the Actor LoRA (Identity), LiDAR Depth Map (Pose/Space), and Product IP-Adapter (Object).
Result: A high-resolution static frame of the actor interacting with the product.
Stage 2: Temporal Expansion (LTX-2 Video Gen)
The Stage 1 image is used as the Initial Frame for the LTX-2 DiT model.
In-Context LoRAs are applied to maintain temporal consistency of the product (preventing "hallucinating" different car wheels or watch faces).
Stage 3: Refinement & Delivery
FaceDetailer nodes detect facial regions and regenerate them at higher resolution to restore fine detail.
The video is encoded to H.265 and saved directly to the NAS path.
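The "single API call" that triggers this pipeline could look roughly like the sketch below, which the FastAPI middleware might use to queue a workflow on ComfyUI. It assumes the workflow has been exported in ComfyUI's API format (a dictionary of node IDs, each with `class_type` and `inputs`) and that ComfyUI listens on its default local address. The helper names and the node ID in the usage note are hypothetical.

```python
import copy
import json
import urllib.request
from typing import Any, Dict

COMFYUI_URL = "http://127.0.0.1:8188"  # assumption: default local ComfyUI address


def set_node_input(
    workflow: Dict[str, Any], node_id: str, input_name: str, value: Any
) -> Dict[str, Any]:
    """Return a copy of the workflow with one node input overridden."""
    updated = copy.deepcopy(workflow)
    updated[node_id]["inputs"][input_name] = value
    return updated


def build_prompt_payload(workflow: Dict[str, Any], client_id: str) -> bytes:
    """Wrap an API-format workflow in the JSON body sent to POST /prompt."""
    return json.dumps({"prompt": workflow, "client_id": client_id}).encode("utf-8")


def queue_workflow(workflow: Dict[str, Any], client_id: str) -> str:
    """Submit the workflow and return the prompt_id assigned by ComfyUI."""
    request = urllib.request.Request(
        f"{COMFYUI_URL}/prompt",
        data=build_prompt_payload(workflow, client_id),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read())["prompt_id"]
```

A caller would typically load the stored workflow template, override the text and image inputs with `set_node_input` (for example, the `text` input of a `CLIPTextEncode` node whose ID is taken from the exported file), and then pass the result to `queue_workflow`. Working on a copy keeps the template reusable across requests.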
5. Functional Requirements (Frontend)
Drag-and-Drop Interface: Separate zones for "Actor Set" (multiple files) and "Product Asset."
System Prompt Orchestrator: A simple text box where the user enters a short brief such as "Professional car commercial in rain." The frontend then wraps this in a "Hidden System Prompt" by appending style keywords (e.g., 8k, cinematic lighting, shot on Arri Alexa, highly detailed).
Real-time Progress: A Tailwind-styled progress bar driven by WebSockets connecting to the ComfyUI queue status.
Local Gallery: A view linked to the NAS to preview previous generations.
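Two of the requirements above, prompt wrapping and progress reporting, reduce to small pure functions. The sketch below is written in Python for consistency with the FastAPI middleware, although the brief assigns both to the frontend, where the same logic would apply. The function names are hypothetical, and the style keywords are simply the ones given as an example in this brief. The progress parser assumes ComfyUI's WebSocket `progress` messages, which carry `value` and `max` fields.

```python
import json
from typing import Optional

# Assumption: style keywords copied from the example in this brief.
HIDDEN_STYLE_TAGS = ["8k", "cinematic lighting", "shot on Arri Alexa", "highly detailed"]


def wrap_prompt(user_text: str) -> str:
    """Append the hidden style keywords to the user's short brief."""
    cleaned = user_text.strip().rstrip(".")
    if not cleaned:
        raise ValueError("prompt must not be empty")
    return ", ".join([cleaned, *HIDDEN_STYLE_TAGS])


def progress_percent(raw_message: str, prompt_id: str) -> Optional[int]:
    """Return 0-100 for a ComfyUI 'progress' message, or None if it does not apply."""
    message = json.loads(raw_message)
    if message.get("type") != "progress":
        return None
    data = message.get("data", {})
    # Some versions include the prompt_id; skip messages for other jobs.
    if data.get("prompt_id") not in (None, prompt_id):
        return None
    maximum = data.get("max", 0)
    if maximum <= 0:
        return None
    return min(100, int(100 * data.get("value", 0) / maximum))
```

Note that a `progress` message reports the step count of the node currently running (typically a sampler), not the whole workflow, so a bar covering all three stages would need to weight each stage separately.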
6. Infrastructure & Deployment
Local Server: Minimum 2x NVIDIA RTX 4090 (24GB VRAM each). One for the Image Gen stage, one for LTX-2 inference.
Networking: 10GbE connection between the Workstation and the NAS, so that high-bitrate video transfers do not become a bottleneck.
Security: Air-gapped or VPN-restricted. Because the dataset is private, no data may leave the local network.
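To put the 10GbE requirement in perspective, the following back-of-the-envelope sketch estimates how long a finished clip takes to reach the NAS. The 100 Mb/s bitrate in the usage note and the 80% link efficiency are illustrative assumptions, not measured values, and the function names are hypothetical.

```python
def clip_bytes(duration_s: float, bitrate_mbps: float) -> int:
    """Approximate file size of an encoded clip from its duration and bitrate."""
    return int(duration_s * bitrate_mbps * 1e6 / 8)


def transfer_seconds(
    file_bytes: int,
    link_gbps: float = 10.0,
    efficiency: float = 0.8,  # assumption: protocol overhead on SMB/NFS
) -> float:
    """Estimate transfer time over a link, ignoring disk speed on either end."""
    usable_bytes_per_second = link_gbps * 1e9 / 8 * efficiency
    return file_bytes / usable_bytes_per_second
```

Under these assumptions, a 5-second clip at 100 Mb/s is about 62.5 MB and transfers in well under a second, so for final deliverables the link is unlikely to be the limiting factor. The headroom matters more for intermediate frames and source assets, which are far larger.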
7. Project Phases
Phase 1 (Setup): Configure ComfyUI with LTX-2 nodes and verify API connectivity.
Phase 2 (Workflow): Build the "Hidden" .json workflow that accepts LiDAR depth and product images.
Phase 3 (App Dev): Develop the Next.js frontend and integrate the /prompt API endpoint.
Phase 4 (Storage): Configure automated file-moving scripts to the NAS.
Phase 5 (Testing): Benchmark generation speed (Target: under 2 minutes per 5-second clip).
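The Phase 5 target can be checked with a small timing harness such as the sketch below. The `generate` callable stands in for whatever function runs one full end-to-end generation and is hypothetical, as are the function name and the number of runs. The 120-second and 5-second constants come from the target stated above.

```python
import time
from typing import Callable, Dict, Union

TARGET_SECONDS = 120.0  # target from this brief: under 2 minutes per clip
CLIP_SECONDS = 5.0      # clip length from this brief


def benchmark(generate: Callable[[], None], runs: int = 3) -> Dict[str, Union[float, bool]]:
    """Time several end-to-end generations and compare the slowest to the target."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        generate()  # hypothetical: blocks until the clip is written
        durations.append(time.perf_counter() - start)
    worst = max(durations)
    return {
        "mean_s": sum(durations) / len(durations),
        "worst_s": worst,
        # How many seconds of compute each second of video costs.
        "slowdown_vs_realtime": worst / CLIP_SECONDS,
        "meets_target": worst < TARGET_SECONDS,
    }
```

Judging against the slowest run rather than the mean is a deliberate choice here: the first run often includes model loading, and that cold-start cost is what a user would actually experience. At the target limit, the pipeline spends 24 seconds of compute per second of video.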