Generative Media Pipeline Patterns

Reusable multi-model chains: draft to finish, generate to edit, image to video, character-consistency rigs, 3D and audio-driven pipelines. This is the creative twin of the agent-patterns dataset and a template library for the studio: each pattern is an ordered chain of model/node steps. Browse it below, or take the raw files.

Snapshot

2026-07-04

Coverage

66 pipelines · 9 families · 191 chain steps

About this dataset

Unit

One row = one reusable multi-model pipeline pattern, with its archetype family, input-to-output modality, why it needs more than one model, the ordered chain of model/node steps, the control levers, an example stack, use cases, the common pitfall, and a reference.

Steps as nodes

Each step carries a modelType drawn from the workflow node vocabulary (text-to-image, image-to-video, upscale, reference-adapter, image-to-3d, and so on), so a pattern maps directly onto workflow nodes and can seed a template.
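As a rough illustration of how a row and its steps could be held in memory, here is a minimal sketch. The class and field names (`Step`, `Pattern`, `model_type`, `node_chain`) are hypothetical and are not the dataset's actual schema; only the values come from the catalogue below.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    name: str          # step label, e.g. "Draft"
    models: List[str]  # example models listed for the step
    model_type: str    # modelType from the workflow node vocabulary


@dataclass
class Pattern:
    title: str
    family: str
    modality: str
    steps: List[Step] = field(default_factory=list)

    def node_chain(self) -> List[str]:
        """Return the ordered node types, one per step."""
        return [step.model_type for step in self.steps]


# Values copied from the "Base + Refiner Draft-to-Finish" entry.
base_refiner = Pattern(
    title="Base + Refiner Draft-to-Finish",
    family="Draft to finish",
    modality="text -> image",
    steps=[
        Step("Draft", ["SDXL base", "FLUX.1 schnell"], "text-to-image"),
        Step("Refine", ["SDXL refiner", "FLUX.1 dev"], "image-to-image"),
        Step("Upscale", ["Real-ESRGAN 4x"], "upscale"),
    ],
)

print(" -> ".join(base_refiner.node_chain()))
```

Because each step exposes a node type, the chain can be read off directly as the skeleton of a workflow template.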

Scope

Only genuine multi-model chains (two or more distinct generative models). A single unified model, a prompt trick, or a one-off workflow is out of scope.
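The counting half of this rule can be sketched as a simple check on a chosen stack. The function name `is_multi_model` is hypothetical, and the check only counts distinct model names; whether each model is genuinely generative remains an editorial judgment that code like this cannot make.

```python
from typing import List


def is_multi_model(stack: List[str]) -> bool:
    """True when the stack names at least two distinct models."""
    # Normalise case and whitespace so "SDXL base" and "sdxl base " match.
    distinct = {name.strip().lower() for name in stack}
    return len(distinct) >= 2


print(is_multi_model(["SDXL base", "SDXL refiner", "Real-ESRGAN 4x"]))  # True
print(is_multi_model(["FLUX.1 dev"]))  # False
```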

Sources

arXiv papers, model documentation, and published ComfyUI / fal / Replicate workflows. References travel with the dataset.

License

Free to use, distribute, and reproduce with attribution to Hermosa Labs LLC (CC BY 4.0).

Cite

Hermosa Labs LLC, “Generative Media Pipeline Patterns,” 2026-07-04.

The catalogue

Draft to finish (2)

A rough pass, then a high-fidelity finishing pass.

Base + Refiner Draft-to-Finish

text -> image

Draft
SDXL base, FLUX.1 schnell
text-to-image
fast low-step base for composition and layout
Refine
SDXL refiner, FLUX.1 dev
image-to-image
high-fidelity refiner pass for detail and artifact cleanup
Upscale
Real-ESRGAN 4x
upscale
optional 2-4x detail-preserving upscaler

A fast base model lays down composition and structure, then a high-fidelity refiner pass adds detail and fixes artifacts, optionally followed by an upscaler.

Stack, controls & more

Example stack: SDXL base -> SDXL refiner -> 4x ESRGAN upscaler (or FLUX.1 schnell draft -> FLUX.1 dev refine).
Controls: Shared seed across passes; denoise strength on the refine step; base/refiner step-split ratio; optional control net on the draft.
Why multi-model: One model rarely optimizes both global composition and fine detail at once; splitting draft from refine lets a cheap fast pass explore many options and an expensive pass finish only the keepers.
Use cases: Hero art and key visuals, Batch concepting then finishing keepers, Print-resolution stills
Pitfalls: Over-denoising the refine step drifts away from the draft composition; a mismatched seed breaks continuity between passes.
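One way to read the base/refiner step-split ratio is as the fraction of a single shared denoising schedule that the base model runs before handing off to the refiner. The sketch below assumes that reading; the function name `split_steps` and the 40-step, 0.8 example are illustrative and do not come from the dataset.

```python
from typing import Tuple


def split_steps(total_steps: int, base_fraction: float) -> Tuple[int, int]:
    """Split one denoising schedule into (base_steps, refiner_steps)."""
    if total_steps < 2:
        raise ValueError("need at least 2 steps to split between two models")
    if not 0.0 < base_fraction < 1.0:
        raise ValueError("base_fraction must be strictly between 0 and 1")
    base_steps = round(total_steps * base_fraction)
    # Keep at least one step for each model.
    base_steps = min(max(base_steps, 1), total_steps - 1)
    return base_steps, total_steps - base_steps


base_steps, refiner_steps = split_steps(40, 0.8)
print(base_steps, refiner_steps)  # 32 8
```

Under this reading, a higher base fraction leaves the refiner fewer steps, which keeps it closer to the draft composition.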

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis (arXiv:2307.01952)

ComfyTextures Unreal create-refine-upscale-edit texture chain

render capture -> image -> image -> image -> image

draft
SDXL Base 1.0, LCM-LoRA, ControlNet depth/canny
text-to-image
Create mode: fast SDXL + LCM-LoRA low-res texture per selected actor, ControlNet-guided from viewport depth/canny
refine
SDXL Base 1.0, SDXL Refiner 1.0
image-to-image
Refine mode: slower SDXL base+refiner pass improves the low-res Create output
edit
SDXL inpaint
inpaint
Edit mode: inpainting workflow with From-Texture or From-Object targeting for local corrections
finish
4x-UltraSharp
upscale
final resolution enhancement pass

The Unreal Engine editor plugin ComfyTextures captures viewport render passes of selected actors and runs them through a ComfyUI-orchestrated chain: a fast SDXL+LCM-LoRA pass creates a low-res draft texture, a slower SDXL base+refiner pass refines it, an inpainting workflow allows targeted edits, and a 4x-UltraSharp upscale finishes the result, which is applied back as a new material/texture on the actor.

Stack, controls & more

Example stack: UE viewport capture -> depth/canny ControlNet -> SDXL+LCM Create -> SDXL Base+Refiner Refine -> SDXL inpaint Edit -> 4x-UltraSharp Upscale -> applied as UE material
Controls: Actor/viewpoint selection in the UE editor, single or multi point-of-view projection, ControlNet weight (canny/depth), edit target (texture region vs whole object), workflow JSON is user-editable in ComfyUI
Why multi-model: Distinct from baseline base-refiner-draft-to-finish (generic T2I draft/refine/upscale) and from StableGen because it operates purely on existing Unreal actor geometry via viewport-camera depth/canny ControlNet projection (no mesh-generation stage), and chains four separately-named modes (Create/Refine/Edit/Upscale) each backed by a distinct model or LoRA variant, wired directly into the UE material graph.
Use cases: rapid environment art texturing directly inside the UE editor, iterative look-dev without leaving Unreal, targeted texture fixes on shipped game assets
Pitfalls: multi-viewpoint projection is explicitly marked work-in-progress in the plugin, so single-viewpoint consistency is the reliable path; 16GB+ VRAM is recommended for SDXL workflows

ComfyTextures GitHub README

Generate to edit (8)

Generate, then a targeted edit, inpaint, or variation.

Bria native GenAI node chain in Nuke

text -> image -> image -> image

draft
Bria Fibo
text-to-image
Fibo Generate node: text-to-image with aspect ratio/guidance/seed control
edit/inpaint
Bria GenFill
inpaint
GenFill node: prompt-driven inpaint fed the prior node's output at full res
finish
Bria Upscale, Bria Enhance
upscale
Upscale/Enhance node: dedicated super-resolution model, 2x/4x or target-megapixel enhance

Bria ships 11 GenAI nodes (including Fibo Generate, Fibo Edit, RMBG, Erase, GenFill, Expand, Enhance, Upscale, and Sequence Output) as native Nuke Group nodes that chain directly in the compositing node graph: text-to-image draft, then prompt-based inpaint/edit, then dedicated enhance/upscale, all at full resolution without leaving Nuke.

Stack, controls & more

Example stack: Fibo Generate -> GenFill -> Upscale, all native Nuke Group nodes in one .nk script
Controls: Node-graph wiring only (standard Nuke Group nodes); prompt text, mask input on GenFill/Erase, aspect ratio and guidance on Fibo Generate; Sequence Output node batches across frame ranges
Why multi-model: Fibo Generate (text-to-image foundation model), GenFill (inpainting/ControlNet-based edit model), and Upscale/Enhance (dedicated super-resolution and Keymix quality models) are distinct Bria model families chained node-to-node; this is not one model with multiple presets.
Use cases: VFX plate generation without leaving the compositor, in-comp set extension/inpaint, batch frame-range AI touch-ups
Pitfalls: Multi-viewpoint/3D-aware consistency is not addressed (2D compositing only); each node runs full-resolution inference per frame which is costly for long sequences via Sequence Output

Bria: Bringing Generative AI into Nuke Natively

Generate-then-Instruct-Edit Chain

text -> image -> image

Generate base image
FLUX.1-dev, SDXL
text-to-image
Base T2I model (FLUX.1-dev, SDXL) generates the initial image from a text prompt at full denoise
Instruct edit / inpaint
FLUX.1-Fill-dev, SDXL-Inpaint
inpaint
Dedicated inpaint model (FLUX.1-Fill, SDXL-Inpaint) receives the base image, a binary mask, and a text instruction; regenerates the masked region while preserving context

A text-to-image model generates a base image from a prompt, then a separate instruction-following inpaint model takes that image plus a mask and a text instruction and applies targeted changes to specific regions while leaving the rest intact.

Stack, controls & more

Example stack: FLUX.1-dev (generate) -> FLUX.1-Fill-dev (inpaint masked region); or SDXL base -> SDXL-Inpaint; or Stable Diffusion 1.5 -> SD-Inpaint.
Controls: Mask precision (tight vs loose); inpaint denoise strength; prompt describing only the desired change; blending mode at mask boundary; VAE tiling for memory management.
Why multi-model: The generation and editing objectives require different conditioning: the base model maximizes prompt fidelity from noise; the edit model must preserve unchanged regions, understand natural-language deltas, and apply localized changes. Using the same model and prompt for both leads to either image drift (high denoise) or insufficient change (low denoise). FLUX.1-dev and FLUX.1-Fill are distinct model weights with different training objectives (velocity prediction on pure noise vs. masked inpainting), making this a genuine two-model chain. This is distinct from base-refiner which uses low-denoise image-to-image on the whole image for detail finishing, not semantic region editing.
Use cases: Generating a hero image then swapping background, Creating a base composition then iteratively adjusting individual objects, Product photography with post-generation surface or label edits
Pitfalls: Mask bleeding (inpainted region texture mismatches surroundings); inconsistent lighting between generated base and edited region; over-large masks cause the edit model to ignore the surrounding context.

ComfyUI Official Docs: Flux.1 Fill dev Example

IDM-VTON Virtual Try-On

person image + garment image -> image

Parse & warp
IDM-VTON
image-to-image
segment the person and warp the garment to body pose
Try-on inpaint
CatVTON, OOTDiffusion
inpaint
diffusion composites the warped garment realistically
Upscale
Real-ESRGAN 2x
upscale
optional detail pass

Realistically transfer a garment from a product photo onto a person image, warping the cloth to body geometry and inpainting it in with a diffusion try-on model.

Stack, controls & more

Example stack: IDM-VTON (or CatVTON/OOTDiffusion) -> 2x upscale.
Controls: Person and garment inputs; garment category; agnostic masks (pose, parse); try-on guidance scale.
Why multi-model: Try-on couples geometric garment warping with a diffusion inpaint conditioned on the warped garment; neither a pure generator nor a plain inpainter handles the two together.
Use cases: E-commerce product previews, Lookbook generation, Catalog at scale
Pitfalls: Extreme poses and occlusions warp the garment incorrectly; hair and hands crossing the garment region cause artifacts.

Improving Diffusion Models for Authentic Virtual Try-on in the Wild, IDM-VTON (arXiv:2403.05139)

Inpaint Anything: Segment-Remove-Replace

image + click -> mask -> removed or replaced region -> composited output

Click-to-mask
Segment Anything (SAM)
image-to-image
Segment the clicked object into a precise mask
Remove
LaMa
inpaint
Fill the masked hole with plausible background content, no text needed
Replace (alt path)
Stable Diffusion Inpainting
inpaint
Text-guided fill of the same mask with new object/background content

A user clicks an object, Segment Anything produces its mask, and depending on intent the hole is either filled by a fast non-diffusion inpainter (object removal) or filled by a text-guided diffusion model (object/background replacement).

Stack, controls & more

Example stack: SAM -> LaMa (remove) or Stable Diffusion Inpainting (replace)
Controls: click/point prompt, mask dilation, text prompt (replace path only), inpainting strength, seed
Why multi-model: Segmentation and inpainting are different problem classes: SAM has no generative capability, LaMa cannot follow text prompts for novel content, and Stable Diffusion alone can't reliably target an arbitrary user-clicked region without an upstream mask.
Use cases: one-click object removal from photos, background replacement via text prompt, e-commerce photo cleanup
Pitfalls: LaMa struggles with large masks needing coherent new structure (shadows, reflections); Stable Diffusion replace path can hallucinate seams without careful mask feathering and colour matching
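The mask-dilation control and the remove/replace routing can be sketched in plain Python. This is a toy version on nested lists under stated assumptions: `dilate` and `choose_inpainter` are hypothetical names, the square neighbourhood is one of several possible dilation shapes, and a real pipeline would use an image library on the SAM mask.

```python
from typing import List, Optional

Mask = List[List[int]]


def dilate(mask: Mask, radius: int = 1) -> Mask:
    """Grow a binary mask by `radius` pixels using a square neighbourhood."""
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if mask[y][x]:
                # Mark every pixel within `radius` of a masked pixel.
                for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                    for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                        out[ny][nx] = 1
    return out


def choose_inpainter(text_prompt: Optional[str]) -> str:
    """Route to the remove path without a prompt, the replace path with one."""
    if text_prompt and text_prompt.strip():
        return "Stable Diffusion Inpainting"
    return "LaMa"


mask = [[0] * 5 for _ in range(5)]
mask[2][2] = 1  # a single clicked-object pixel in the centre
grown = dilate(mask, radius=1)
print(sum(map(sum, grown)))                 # 9
print(choose_inpainter(None))               # LaMa
print(choose_inpainter("a wooden bench"))   # Stable Diffusion Inpainting
```

Dilating the mask slightly before inpainting gives the fill model some margin around the object, so leftover edge pixels are less likely to survive.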

Inpaint Anything: Segment Anything Meets Image Inpainting, arXiv:2304.06790

InstructPix2Pix Instruction Edit

image + instruction -> image

Generate (optional)
SDXL, FLUX
text-to-image
produce a source image, or start from a user upload
Instruct-edit
InstructPix2Pix, FLUX Kontext
image-to-image
apply the natural-language edit

Edit an existing image from a natural-language instruction ("make it sunset", "add sunglasses") without a mask, by conditioning the model jointly on the source image and the command.

Stack, controls & more

Example stack: InstructPix2Pix (SD) or FLUX edit -> upscale.
Controls: Image-conditioning strength (how much of the original to keep); text-conditioning strength; instruction phrasing.
Why multi-model: Mask-free, instruction-driven editing needs a model trained to follow commands over an image condition, distinct from both pure generation and masked inpainting.
Use cases: Iterative art direction, No-mask quick edits, Batch style tweaks
Pitfalls: Over-strong image conditioning ignores the instruction; weak conditioning hallucinates changes the user did not ask for.
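The two conditioning strengths above correspond to the two guidance scales that the InstructPix2Pix paper applies in its classifier-free guidance over the image and text conditions. The sketch below shows that combination on plain lists of floats standing in for noise predictions; the function and argument names are hypothetical, and a real sampler would operate on tensors.

```python
from typing import List


def ip2p_guidance(
    eps_uncond: List[float],  # prediction with neither image nor text
    eps_image: List[float],   # prediction with image condition only
    eps_full: List[float],    # prediction with image and text conditions
    image_scale: float,
    text_scale: float,
) -> List[float]:
    """Combine three noise predictions with separate image/text scales."""
    return [
        u + image_scale * (i - u) + text_scale * (f - i)
        for u, i, f in zip(eps_uncond, eps_image, eps_full)
    ]


# Toy one-element predictions to show how the scales weight each term.
print(ip2p_guidance([0.0], [1.0], [3.0], image_scale=1.5, text_scale=7.5))  # [16.5]
```

Raising `image_scale` pulls the result toward the source image, while raising `text_scale` pushes it toward the instruction, which is the trade-off the pitfall describes.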

InstructPix2Pix: Learning to Follow Image Editing Instructions (arXiv:2211.09800)

Product Photo Scene Generation & Relight

image -> mask -> generated background -> relit composite

Cutout
SAM, BiRefNet
image-to-image
Extract product from its original background
Scene generation
Stable Diffusion XL, FLUX.1
text-to-image
Generate a new lifestyle/studio background from a text prompt
Relight
IC-Light, IC-Light v2
image-to-image
Relight the composited product so lighting direction/colour matches the generated background

A product cutout is segmented from its original photo, dropped into a newly generated lifestyle background, then IC-Light relights the product to match the new scene's lighting so the composite reads as a single photograph.

Stack, controls & more

Example stack: BiRefNet -> SDXL (background) -> IC-Light v2 (relight)
Controls: segmentation mask, background text prompt, IC-Light lighting-direction/text conditioning, denoise strength to preserve product detail, seed
Why multi-model: Segmentation, background generation, and physically-plausible relighting are three distinct capabilities: a single text-to-image model can generate a scene but cannot preserve the exact product geometry/texture, and it cannot relight an already-composited foreground to match a background it didn't originate.
Use cases: e-commerce product photography at scale, advertising creative variants, catalogue localisation with different scene moods
Pitfalls: IC-Light can drift product colour/texture at high denoise; frequency separation or detail-preserving masks are needed to keep label text and material sharpness intact

Product Photo Relight ComfyUI workflows (risunobushi, OpenArt)

Product-into-Scene Composite and Relight

product cutout image + background scene image -> lit, composited ad-ready image

Product isolation
RMBG-2.0
image-to-image
Remove background from the raw product photo to get a clean cutout/mask
Composite and relight
ByteDance Seedream 4.5
image-to-image
Blend the product into the chosen background scene and adjust product lighting/shadows to match the new environment
Upscale finish
Topaz Gigapixel
upscale
Upscale composited result for print/4K ad delivery

A studio product photo and a separate background/scene image are composited by an image-editing model that also relights the product to match the new scene's light direction and color temperature, producing an advertising-ready packshot without a physical reshoot.

Stack, controls & more

Example stack: RMBG-2.0 -> ByteDance Seedream 4.5 (composite + relight) -> Topaz Gigapixel upscale
Controls: background/scene reference image, lighting preset (golden hour, studio, dramatic directional), product placement/scale, upscale factor
Why multi-model: Compositing two independent images and physically-plausible relighting are distinct operations; the workflow depends on prior product segmentation/cutout plus a generative compositing-and-relighting model, and often a separate upscale pass for print/4K delivery, so no single call handles isolation, blend, and light-matching in one pass reliably.
Use cases: e-commerce catalog imagery at scale, seasonal ad creative reusing one product shoot across many scenes, swapping backgrounds for regional/localized campaigns
Pitfalls: Relighting can shift product color fidelity (brand color accuracy), and shadow direction mismatches between product and scene are a common tell if the compositing step skips explicit light-direction analysis.

Composite your Product + Scene and Relight (official ComfyUI workflow template)

SAM 2 to MatAnyone Rotoscoping Composite

video + point/text prompt -> propagated mask -> alpha matte -> composite

Prompted segmentation
SAM 2, Grounding DINO
image-to-image
Segment the subject in the first frame from a click or text prompt
Matte propagation
MatAnyone 2
video-to-video
Propagate and refine the mask into a full alpha matte across all frames
Composite
MatAnyone 2 (alpha output)
image-to-image
Composite the matted foreground onto a new or generated background

A subject is segmented in one frame via a promptable segmentation model, the mask is propagated across the clip, then a dedicated video matting model refines the rough mask into a per-pixel alpha matte for clean compositing onto a new background.

Stack, controls & more

Example stack: SAM 2 (first-frame mask) -> MatAnyone 2 (alpha propagation) -> compositing tool (background swap)
Controls: click/box/text prompt for initial mask, mask correction clicks, matte quality evaluator threshold, background choice
Why multi-model: Promptable segmentation models produce binary, sometimes coarse masks that are not designed for fine alpha detail (hair, motion blur, semi-transparency); a separate matting model specialized in alpha propagation is needed to get broadcast-quality composites.
Use cases: greenscreen-free background replacement for UGC video, VFX rotoscoping for indie production, compositing AI-generated characters onto filmed plates
Pitfalls: SAM 2 wins on speed but loses edge fidelity on hair/motion blur versus a manual rotoscoper; the recommended fix is pairing with a dedicated matting model rather than shipping the raw SAM 2 mask
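The difference between a binary mask and an alpha matte shows up in the standard compositing formula, out = alpha * foreground + (1 - alpha) * background. The sketch below applies it to a single row of grayscale pixels; the function name and the pixel values are illustrative only.

```python
from typing import List


def composite(fg: List[float], bg: List[float], alpha: List[float]) -> List[float]:
    """Per-pixel alpha composite of a foreground over a background."""
    return [a * f + (1.0 - a) * b for f, b, a in zip(fg, bg, alpha)]


foreground = [200.0, 200.0, 200.0]
background = [0.0, 0.0, 0.0]

binary_mask = [1.0, 1.0, 0.0]  # hard edge, as from a raw segmentation mask
alpha_matte = [1.0, 0.5, 0.0]  # soft edge, as from a matting model

print(composite(foreground, background, binary_mask))  # [200.0, 200.0, 0.0]
print(composite(foreground, background, alpha_matte))  # [200.0, 100.0, 0.0]
```

The fractional alpha on the middle pixel is what lets hair, motion blur, and semi-transparent edges blend into the new background instead of cutting off abruptly.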

MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator (arXiv:2512.11782, CVPR 2026)

Style & control (6)

Condition structure or aesthetic with references and control nets.

ControlNet Structure-Lock Chain

image + text -> image

Extract control map
MiDaS, OpenPose, Canny
depth
Annotator model (canny, MiDaS depth, OpenPose, HED, M-LSD, ADE20K segment) converts reference image to a structured feature map
Conditioned generation
ControlNet-depth, SDXL, FLUX.1 ControlNet
controlnet
ControlNet adapter injects the control signal into the frozen diffusion backbone via skip-connections; text prompt guides content

A preprocessor (canny, depth, or pose annotator) extracts a structural control map from a reference image; that map is then injected into a conditioned diffusion model to generate a new image that preserves the source structure.

Stack, controls & more

Example stack: MiDaS depth -> ControlNet-v1.1-depth + SDXL; or OpenPose -> ControlNet-OpenPose + SD 1.5; or Canny -> FLUX.1 ControlNet (Jasper canny).
Controls: Annotator type (canny vs depth vs pose); ControlNet conditioning scale; text prompt; optional second ControlNet for combined conditions (e.g. depth + canny simultaneously).
Why multi-model: Two genuinely distinct model types run in sequence: an annotator/extractor model (e.g. HED edge detector, MiDaS depth, OpenPose keypoint) produces a dense control signal, and a separate ControlNet-augmented diffusion model consumes that signal during generation. Neither step can substitute for the other.
Use cases: Pose-controlled character generation, Preserving scene layout across style variations, Sketch / lineart to rendered image
Pitfalls: Over-conditioning (scale too high) makes the output look traced with no creative latitude; mismatched annotator and ControlNet model type produces incoherent outputs; depth annotator struggles with textureless or reflective surfaces.
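The annotator/ControlNet mismatch pitfall can be guarded against with a small pre-flight check. The sketch below is a hypothetical helper, not part of any ControlNet library: the lookup table covers only the three annotators named in the example stack, and the signal labels are illustrative.

```python
# Which kind of control map each annotator produces (illustrative labels).
ANNOTATOR_SIGNAL = {
    "MiDaS": "depth",
    "OpenPose": "pose",
    "Canny": "canny",
}


def check_pairing(annotator: str, controlnet_signal: str) -> bool:
    """Raise if the annotator's output does not match the ControlNet's input."""
    produced = ANNOTATOR_SIGNAL.get(annotator)
    if produced is None:
        raise ValueError(f"unknown annotator: {annotator}")
    if produced != controlnet_signal:
        raise ValueError(
            f"{annotator} produces a {produced} map, "
            f"but the ControlNet expects {controlnet_signal}"
        )
    return True


print(check_pairing("MiDaS", "depth"))  # True
try:
    check_pairing("OpenPose", "depth")
except ValueError as err:
    print(err)
```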

Adding Conditional Control to Text-to-Image Diffusion Models (arXiv:2302.05543)

Depth/Pose-Guided Video-to-Video Restyle

video -> control signal (depth/canny/pose) -> video

Extract control signal
Lotus Depth, OpenPose
controlnet
preprocessor (Lotus Depth, Canny, or OpenPose) extracts depth map or skeleton from the source video
Generate restyled video
LTX-2 ControlNet, Wan 2.1 Fun
video-to-video
video diffusion model (LTX-2 ControlNet, Wan 2.1 Fun) generates the output video conditioned on the control signal and a style/content prompt

A source video is passed through a preprocessor that extracts a structural control signal (depth map, Canny edges, or OpenPose skeleton). That signal is fed as conditioning to a video diffusion model which generates new video that preserves the original motion and geometry while applying a new appearance, style, or character. The result is a restyled video with motion inherited from the reference footage.

Stack, controls & more

Example stack: Lotus Depth preprocessor -> LTX-2 ControlNet (guided video generation). OR OpenPose extractor -> Wan 2.1 Fun ControlNet.
Controls: Control signal type (depth vs. canny vs. pose); control strength; style and content prompts; sampler schedule (LTXVScheduler); frame dimensions.
Why multi-model: Video diffusion models cannot natively decompose appearance from structure; a separate ControlNet preprocessor model is required to extract the structural signal, which then constrains the generator. Neither the preprocessor nor the generator can do the other's job alone.
Use cases: Motion retargeting (human to character), Scene relighting and restyle, Product placement in existing footage, Previz from rough reference video
Pitfalls: Strong depth control preserves layout but can suppress texture creativity; OpenPose fails on non-standard human shapes; motion blur in source degrades extracted control quality.

RunComfy: LTX-2 ControlNet in ComfyUI - Depth-Controlled Video Workflow

InstantID Zero-Shot Identity-Style

face image + style text -> stylized portrait

Encode identity
InstantID
reference-adapter
face encoder plus IdentityNet from one photo
Generate
SDXL
text-to-image
identity-conditioned stylized generation
Upscale
Real-ESRGAN 2x
upscale
optional

Zero-shot identity-preserving generation: with only one reference face and a style prompt, generate that person in any style or scene, combining a face encoder, an IdentityNet, and a lightweight adapter.

Stack, controls & more

Example stack: InstantID (SDXL) -> 2x upscale.
Controls: Reference face; style and prompt text; identity strength; ControlNet pose or edge.
Why multi-model: Locking a specific identity while changing style needs a dedicated identity encoder and IdentityNet, separate from the text-to-image backbone.
Use cases: Stylized profile portraits, Themed personal art, Brand persona variants
Pitfalls: Identity versus style is a trade-off set by the strength knob; non-face references give no signal.

InstantID: Zero-shot Identity-Preserving Generation in Seconds (arXiv:2401.07519)

Reference-Driven Line Art Video Colorization

line-art video + reference character image -> tracked correspondences -> colorized animation

Correspondence extraction
CoTracker2
depth
Track pixel-level correspondences between the reference and each line-art frame (tracking pre-processor; approximate node type)
Sketch conditioning
AniDoc sketch extractor
controlnet
Extract per-frame sketch/line-art control signal from the input video
Colorized video generation
Stable Video Diffusion (AniDoc fine-tune)
video-to-video
Diffuse color and shading onto the line-art sequence conditioned on reference image + correspondence tracks

A raw line-art/sketch video sequence and a single colored reference image of the character are fed into a video diffusion model that uses point-tracking correspondences to transfer color and shading from the reference onto every sketch frame, producing a temporally consistent colored animation.

Stack, controls & more

Example stack: CoTracker2 -> AniDoc sketch extractor -> Stable Video Diffusion (AniDoc checkpoint)
Controls: reference character image (color/design sheet), point-tracking correspondence maps, per-frame sketch/line-art control, frame count (14-frame native window)
Why multi-model: A single video diffusion model cannot both (a) infer temporally-stable pixel correspondences between a static reference pose and a moving sketch character and (b) generate coherent color; a dedicated point-tracker supplies explicit correspondence guidance the diffusion U-Net conditions on, and an SVD-based backbone handles temporal coherence and generation separately.
Use cases: 2D anime in-betweening automation, colorizing rough animatics from a character color sheet, reducing manual cel-coloring labor in production pipelines
Pitfalls: Large pose deviations between reference and sketch frames can cause color bleeding; native window is 14 frames so longer shots need chunking/stitching, which can introduce seams.
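Chunking a longer shot into 14-frame windows can be sketched as follows. The overlap between neighbouring windows is an assumption added here as one possible way to soften seams; it is not something the AniDoc reference is stated to prescribe, and `chunk_windows` is a hypothetical helper.

```python
from typing import List, Tuple


def chunk_windows(n_frames: int, window: int = 14, overlap: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, covering the shot."""
    if window <= overlap:
        raise ValueError("window must be larger than overlap")
    if n_frames <= 0:
        return []
    stride = window - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window, n_frames)
        windows.append((start, end))
        if end == n_frames:
            break
        start += stride
    return windows


print(chunk_windows(30))  # [(0, 14), (12, 26), (24, 30)]
```

The overlapping frames would be generated twice, so a stitching step still has to pick or blend one version of each; the last window may also be shorter than the native length.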

AniDoc: Animation Creation Made Easier (CVPR 2025, arXiv:2412.14173)

Restyled First-Frame Propagation

source video + style prompt -> restyled first frame -> structure-guided full-video restyle

First-frame restyle
Flux Kontext
image-to-image
Restyle only the first frame of the source video into the target look so the exact style is locked as a single edited image
Full-video propagation
Wan 2.2 (Fun model)
video-to-video
Regenerate the full clip from the styled first frame; depth/canny/pose maps from the original footage are fed in only as ancillary geometry guardrails

An image-editing model restyles only the first frame of a source video into a target look; a video model then regenerates the entire clip by propagating that one already-styled frame forward through time, using lightweight structural maps only as an ancillary guardrail against geometry drift rather than as the source of style.

Stack, controls & more

Example stack: Flux Kontext (first-frame restyle) -> Wan 2.2 Fun (video-to-video, first-frame + ControlNet conditioned)
Controls: style prompt / reference frame for the first-frame edit, choice of ancillary ControlNet preprocessor (depth/canny/pose), ControlNet conditioning strength, output resolution
Why multi-model: The style decision is made once, by an image-editing model, on a single frame where it can be art-directed cheaply; a text-to-video or per-frame ControlNet pass cannot reliably apply that same specific edited look to every frame without drifting, so a second model only propagates an already-finished still.
Use cases: Stylizing existing footage into a new visual genre while keeping the original motion, Converting live-action reference into anime or painterly renders, Brand-consistent restyle of stock/user video
Pitfalls: Style fidelity depends entirely on how well the first frame was restyled; a first frame that conflicts with the ancillary structure maps produces visible flicker or geometry fighting once propagated across the full clip.

Wan 2.2 Video Restyle: First Frame Style Transfer ComfyUI Workflow

StreamDiffusionTD live multi-ControlNet + StreamV2V temporal-consistency chain

live video/control signal -> image (real-time loop) -> temporally-consistent video

Live generate
SDXL-Turbo, TensorRT ControlNet
image-to-image
TouchDesigner composites camera/procedural TOP feeds into conditioning images that drive real-time multi-ControlNet generation (the control-signal prep is TD-native tooling, not a separate model)
Temporal consistency
StreamV2V
video-to-video
StreamV2V cached-attention pass removes flicker across the live frame stream

StreamDiffusionTD, a TouchDesigner .tox component, runs SDXL-Turbo in real time driven by any TouchDesigner-generated control signal (camera feed, procedural TOP, depth/canny/normal maps built in TD) through multiple simultaneous TensorRT-accelerated ControlNets, then applies StreamV2V cached-attention video-to-video processing so consecutive live frames stay temporally coherent instead of flickering independently.

Stack, controls & more

Example stack: TouchDesigner TOP control -> SDXL-Turbo + multi-ControlNet (StreamDiffusionTD) -> StreamV2V temporal pass
Controls: ControlNet on/off toggle mid-stream and per-unit weight, multiple simultaneous ControlNets, TensorRT engine build per model for real-time speed, any TD-authored conditioning image (audio-reactive, camera, procedural) wired into the second input
Why multi-model: Distinct from baseline depth-pose-guided-v2v-restyle (offline video-to-video restyle via LTX/Wan Fun) because this runs live/interactively at TouchDesigner frame rates with a different model pair, SDXL-Turbo for per-frame generation plus StreamV2V for cross-frame temporal consistency, driven by arbitrary TD-authored control signals rather than a pre-extracted depth/pose track on a fixed video file.
Use cases: live VJ/installation real-time AI restyle of camera feed, interactive projection-mapped generative visuals reacting to audio or sensors, real-time previz restyle driven by procedural TD signals
Pitfalls: requires per-model TensorRT engine builds (CUDA/TRT setup) before streaming; ControlNet must be enabled at stream start to be usable mid-session; real-time frame rate depends heavily on GPU (documented benchmarks are for high-end cards)

DotSimulate StreamDiffusionTD docs

Restore & upscale (3)

Restoration, upscaling, and detail recovery passes.

Face-Restore + Background-Upscale Chain

image -> image

Detect and restore faces
GFPGAN v1.4, CodeFormer
face-restore
GFPGAN or CodeFormer detects face regions, crops them, applies the face GAN prior, and blends restored faces back into the full image
Upscale background / non-face regions
Real-ESRGAN x4plus
upscale
Real-ESRGAN 4x or similar applies to the non-face background to match resolution and recover texture detail

A face-restoration model (GFPGAN, CodeFormer) recovers degraded facial detail in the generated or photographed image, while a separate general-purpose upscaler (Real-ESRGAN) handles background and non-face regions.

Stack, controls & more

Example stack: GFPGAN v1.4 (face regions) + Real-ESRGAN x4plus (background); or CodeFormer + Real-ESRGAN via A1111 Extra tab or ComfyUI FaceRestoreCF node.
Controls: CodeFormer fidelity weight (0=quality, 1=identity fidelity); Real-ESRGAN model variant (x4plus, x4plus-anime); face detection confidence threshold; blend mask feathering.
Why multi-model: Face restoration and general upscaling are fundamentally different tasks requiring specialized models: face models exploit facial priors (identity, geometry) that general upscalers lack, and general upscalers handle textures and backgrounds that face models cannot attend to. Base-refiner pipelines add a detail-finishing pass on the whole image; this pattern uses region-split specialist models where each model operates on a distinct pixel domain.
Use cases: Restoring small or degraded faces in generated images, Old photo revival (scan to high-res), Upscaling AI portraits without over-smoothing faces
Pitfalls: Face detection failures on unusual angles or heavy occlusion leave faces unrestored; high CodeFormer fidelity may preserve blur from the original; seam artifacts at the face/background boundary if blend mask is not feathered.
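Blend-mask feathering can be illustrated with a simple box blur over a binary face mask. This is a toy sketch on nested lists: `feather` is a hypothetical helper, a box blur is only one of several possible feathering filters, and real face-restore nodes apply their own blending internally.

```python
from typing import List


def feather(mask: List[List[float]], radius: int = 1) -> List[List[float]]:
    """Soften a binary mask by averaging each pixel with its neighbours."""
    height, width = len(mask), len(mask[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighbourhood, clipped at the image border.
            values = [
                mask[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            row.append(sum(values) / len(values))
        out.append(row)
    return out


# One row crossing the face/background boundary: 0 = background, 1 = face.
soft = feather([[0.0, 0.0, 1.0, 1.0]], radius=1)
print([round(v, 3) for v in soft[0]])  # [0.0, 0.333, 0.667, 1.0]
```

The resulting values can serve as per-pixel weights, so the restored face fades into the upscaled background over a few pixels instead of ending at a hard edge.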

GFPGAN: Towards Real-World Blind Face Restoration with Generative Facial Prior (TencentARC/GFPGAN, GitHub README)

Flux Kontext Restore then Kling-Animate Vintage Photo

image -> image -> video

restoration
FLUX Kontext
image-to-image
colorizes and enhances quality of an old/damaged photo
animation
Kling Video AI
image-to-video
animates the restored photo with natural movement

An old/vintage photograph is first colorized and quality-enhanced with Flux Kontext, then the restored image is animated into a short video with Kling.

Stack, controls & more

Example stack: n8n + imgbb hosting + FLUX Kontext (via API) + Kling Video AI
Controls: colorization strength, animation motion prompt, clip length (~5s)
Why multi-model: Two distinct generative-media models in sequence: Flux Kontext performs image-to-image restoration (colorization/enhancement) of a real historical photo, and Kling then performs image-to-video animation of the restored result. This differs from the baseline's fresh-generation keyframe-to-i2v pattern and from the baseline's ESRGAN+RIFE finishing chain because the source asset is a real degraded photograph being restored, not a freshly generated or already-clean keyframe, and no upscaler/interpolator is involved.
Use cases: bringing family archive photos to life for social media, memorial/nostalgia video content, heritage/history content creation
Pitfalls: colorization can introduce historically inaccurate colors; animation motion is generic and not guided by the actual scene depth/pose

n8n workflow template: Transform old photos into animated videos with FLUX & Kling AI for social media
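
As a rough illustration of how one catalogue row maps onto an ordered chain of typed steps, the restore-then-animate pattern above can be written as plain data. The class and field names (Step, Pattern, model_type, primary_models) are hypothetical and are not the dataset's actual schema; the scope check is only an approximation of the "two or more distinct generative models" rule.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    name: str
    models: tuple[str, ...]   # first entry is treated as the primary model
    model_type: str           # workflow node vocabulary, e.g. "image-to-video"

@dataclass(frozen=True)
class Pattern:
    title: str
    modality: str
    steps: tuple[Step, ...]

    def primary_models(self) -> set[str]:
        return {step.models[0] for step in self.steps}

    def in_scope(self) -> bool:
        # Approximation: at least two distinct primary models in the chain.
        return len(self.primary_models()) >= 2

restore_then_animate = Pattern(
    title="Flux Kontext Restore then Kling-Animate Vintage Photo",
    modality="image -> image -> video",
    steps=(
        Step("restoration", ("FLUX Kontext",), "image-to-image"),
        Step("animation", ("Kling Video AI",), "image-to-video"),
    ),
)
```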

Generated-Video Upscale + Frame-Interpolation Finishing

video -> upscaled video -> high-fps video

Load generated video
Wan 2.1
upscale
LoadVideo + GetVideoComponents breaks the clip into frames
Spatial upscale
Real-ESRGAN 4x
upscale
Real-ESRGAN (or CR Upscale / Topaz Video AI Proteus) upscales each frame 2-4x via ImageUpscaleWithModel
Temporal interpolation
RIFE VFI
interpolate
RIFE VFI inserts synthesised in-between frames, doubling or quadrupling FPS (e.g. 15 fps -> 30 or 60 fps)
Reassemble
FFmpeg
interpolate
frames are reassembled into a video and optionally re-encoded via FFmpeg

AI-generated video is typically low resolution (480p-720p) and low frame rate (12-24fps). A two-stage finishing chain first passes every frame through a spatial upscaler (Real-ESRGAN or similar GAN) to reach 2-4x resolution, then applies a temporal frame interpolation model (RIFE) to double or quadruple the frame rate, producing a smooth, high-resolution output suitable for delivery.

Stack, controls & more

Example stack: Wan 2.1 (generated 480p 15fps video) -> Real-ESRGAN 4x upscale per frame -> RIFE VFI x2 interpolation -> 30fps output, resized down to 1080p for delivery (a 4x pass on 480p overshoots 1080p).
Controls: Upscale factor (2x, 4x); ESRGAN model variant (for film vs. animation); RIFE multiplier (x2, x4); FPS target; scale parameter in RIFE for VRAM-limited runs.
Why multi-model: Spatial upscaling and temporal interpolation are fundamentally different tasks: upscalers work frame-by-frame on spatial frequency, while interpolators predict optical flow between frames to synthesise new ones. No single model handles both, and applying interpolation before upscaling degrades flow estimation accuracy.
Use cases: Delivery-ready upres of AI short films, Slow-motion effects on generated clips, Social platform output requiring 1080p 30fps minimum
Pitfalls: RIFE struggles with large fast motions or scene cuts (ghosting artifacts); upscaling before interpolation is safer but requires more VRAM; temporal consistency breaks if source frames are already low-quality.

Real-ESRGAN: Practical Algorithms for General Image/Video Restoration (GitHub: xinntao/Real-ESRGAN)
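
The resolution and frame-rate arithmetic behind this chain can be checked with a few lines. This is a back-of-the-envelope sketch; the 832x480 source size is an assumed example of a 480p clip, and the helper names are illustrative.

```python
def finished_spec(width, height, fps, upscale, rife_multiplier):
    """Return (width, height, fps) after per-frame upscaling and frame interpolation."""
    return width * upscale, height * upscale, fps * rife_multiplier

def scale_needed(source_height, target_height):
    """Upscale factor required to land exactly on the target height."""
    return target_height / source_height

print(finished_spec(832, 480, 15, upscale=4, rife_multiplier=2))  # (3328, 1920, 30)
print(scale_needed(480, 1080))                                    # 2.25
```

A 4x pass on a 480-line source produces 1920-line frames, so hitting 1080p means either a final downscale or a smaller effective factor of about 2.25x.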

Character consistency (9)

Identity locked across generations.

Diffusion-Transformer Garment-Preserving Video Try-On

person video + garment image -> garment-preserved try-on video

Garment cue disentanglement
MagicTryOn garment-preservation encoder
reference-adapter
Decompose the garment image into texture/print priors separate from pose, so garment identity survives body motion
Pose-aware video try-on
MagicTryOn (Wan 2.1-based DiT)
video-to-video
Inject garment priors into the denoising process of a video diffusion transformer conditioned on the person video's pose via spatiotemporal RoPE
Fast sampling
MagicTryOn distilled sampler
video-to-video
Distribution-matching distillation compresses sampling to ~4 steps for practical turnaround

A video of a person and a flat garment reference image are fed into a diffusion-transformer model that disentangles garment texture/print from pose and injects it frame-by-frame, producing a video of the person wearing the new garment with consistent fabric detail and motion.

Stack, controls & more

Example stack: MagicTryOn garment encoder -> MagicTryOn DiT (Wan 2.1 backbone) -> distilled 4-step sampler
Controls: garment reference image, person video, mask-aware loss region (garment area), sampling steps (distilled to 4)
Why multi-model: Garment appearance (fine texture/print) and body pose/motion are different signal types that must be extracted, disentangled and separately conditioned on; a single frame-level try-on model cannot maintain garment fidelity across motion without an explicit garment-preservation encoder plus a video-native temporal backbone with distillation for practical speed.
Use cases: e-commerce garment videos showing fabric drape/motion instead of static try-on stills, fashion social ads without reshoot per SKU, size/fit preview videos
Pitfalls: Fast motion or self-occlusion (arms crossing torso) causes garment texture drift across frames; front-back consistency during turns remains a known failure mode versus single-frame try-on.

MagicTryOn: Harnessing Diffusion Transformer for Garment-Preserving Video Virtual Try-on (arXiv:2505.21325)

HunyuanCustom Multi-Subject Video Customization

reference image(s) + text (+ optional driving video) -> subject-consistent customized video

Text-image fusion
LLaVA
reference-adapter
LLaVA-based module fuses the subject reference image with the text prompt for richer multimodal conditioning
Identity-reinforced video generation
HunyuanVideo (HunyuanCustom)
text-to-video
Image ID enhancement module temporally concatenates identity features across frames inside the HunyuanVideo DiT to keep single- or multi-subject identity stable
Subject replacement (optional)
HunyuanCustom video-driven injection module
video-to-video
A video-driven injection module patchifies and feature-aligns a compressed input video so a reference subject can replace an existing on-screen subject

Tencent's HunyuanCustom fuses a LLaVA-based text-image understanding module with the HunyuanVideo DiT and an image-ID enhancement module (temporal concatenation) to keep one or more reference subjects visually consistent throughout a generated or subject-replaced video.

Stack, controls & more

Example stack: LLaVA text-image fusion -> HunyuanVideo DiT with image-ID enhancement -> video-driven injection module (subject replacement)
Controls: one or more subject reference images, text prompt, optional driving/source video for replacement mode, dual-subject vs single-subject workflow selection
Why multi-model: The multimodal conditioning explicitly combines a separate vision-language model (LLaVA) for text-image fusion with the HunyuanVideo generation backbone, plus dedicated injection sub-networks (video-driven injection module for subject-replacement mode); no single model handles both the semantic fusion and the temporal identity reinforcement.
Use cases: virtual product/human advertisements with a consistent subject dropped into new video scenes, dual-subject consistent video generation, replacing a subject in existing footage with a reference identity while keeping motion
Pitfalls: dual/multi-subject workflows are more failure-prone than single-subject (identity bleed between subjects); the video-driven injection (replacement) path is sensitive to the compression/alignment of the source video and can introduce warping at subject boundaries

HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation (arXiv:2505.04512)

InfiniteYou Recraft to Animated Portrait

single face photo + text -> identity-recrafted photo -> animated clip

Identity encode + inject
InfiniteYou (InfuseNet)
reference-adapter
Face identity embedding is injected into the DiT base model via residual connections (InfuseNet), avoiding face copy-paste artifacts common to naive ID adapters
Recraft generation
FLUX.1 dev / FLUX.1 schnell
text-to-image
Base DiT generates the new photo (pose/style/scene from text) with injected identity; supports swapping in FLUX.1-schnell for faster 4-step generation
Animate
Wan 2.2 image-to-video
image-to-video
The identity-consistent still becomes the reference frame for image-to-video generation

ByteDance InfiniteYou injects identity features through InfuseNet residual connections into a FLUX DiT to recraft a person's photo under a new prompt/pose/style while preserving likeness, then a downstream image-to-video model animates the recrafted still.

Stack, controls & more

Example stack: InfiniteYou (InfuseNet) + FLUX.1 dev -> Wan 2.2 image-to-video
Controls: identity reference image, text prompt for pose/scene/style, base model swap (dev vs schnell) for quality/speed tradeoff, SFT stage controlling text-image alignment strength
Why multi-model: InfuseNet's identity injection is specific to still-image DiT generation and carries no temporal component; a distinct video diffusion backbone is required to add motion, and InfiniteYou's own docs describe it as plug-and-play specifically for composing with other methods/backbones.
Use cases: personalized avatar photo then short animated intro clip, recrafting a single reference photo into varied scenes before turning into a talking/moving asset, identity-locked marketing creative across image and short-video formats
Pitfalls: identity similarity vs. text-prompt alignment is a direct tradeoff; pushing pose/style far from the reference photo degrades likeness, and any facial artifact in the recrafted still propagates and amplifies once animated

InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity (arXiv:2503.16418)

IP-Adapter Identity-Lock

image + text -> image

Encode reference
IP-Adapter
reference-adapter
CLIP image encoder turns the reference portrait into a visual prompt
Generate
SDXL / FLUX
text-to-image
base diffusion conditioned on text plus injected image features
Upscale
Real-ESRGAN 4x
upscale
optional detail pass

A lightweight image-prompt adapter injects a reference image as a decoupled visual condition, so a single subject stays recognizable across many re-prompts and styles without any fine-tuning.

Stack, controls & more

Example stack: SDXL + IP-Adapter (or FLUX + IP-Adapter) -> 4x upscale.
Controls: Reference image and weight; prompt text and CFG; scale of the image-prompt contribution; compositional masks to keep layout text-driven.
Why multi-model: The base diffusion model has no notion of a specific identity; the adapter supplies a separate image-encoder pathway that the base denoiser (U-Net or DiT) cannot reproduce from text alone.
Use cases: Brand mascots and recurring characters, Portrait stylization, Product-on-model with locked identity
Pitfalls: Too-high adapter weight overfits the reference and kills prompt adherence; low-fidelity reference images degrade identity rather than stabilize it.

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models (arXiv:2308.06721)
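
The "decoupled visual condition" in this pattern refers to the IP-Adapter paper's decoupled cross-attention: image features get their own key/value pathway, and its output is added to the text cross-attention output with a scale. The sketch below is a toy single-head version in NumPy with no learned projections; the function names, token counts, and dimensions are arbitrary assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention for a single head."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def decoupled_cross_attention(q, text_kv, image_kv, image_scale=1.0):
    """Text attention plus a separately computed, scaled image attention."""
    text_out = attention(q, *text_kv)
    image_out = attention(q, *image_kv)
    return text_out + image_scale * image_out

rng = np.random.default_rng(0)
q = rng.normal(size=(16, 8))                                    # latent queries
text_kv = (rng.normal(size=(77, 8)), rng.normal(size=(77, 8)))  # text tokens
image_kv = (rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))   # image tokens
out = decoupled_cross_attention(q, text_kv, image_kv, image_scale=0.6)
```

Setting image_scale to 0 recovers plain text conditioning, which is why the adapter weight acts as the main identity-versus-prompt lever listed under Controls.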

PhotoMaker Custom-Identity Generation

images + text -> image

Fuse identity
PhotoMaker
reference-adapter
merge several reference photos into one identity embedding
Generate
SDXL
text-to-image
condition generation on the fused identity plus text
Upscale
Real-ESRGAN
upscale
optional

Stack a few reference photos of a person into a unified identity embedding, then generate that exact person in any scene, style, or action from text alone, with no per-subject training.

Stack, controls & more

Example stack: PhotoMaker (SDXL) -> upscaler.
Controls: Number and variety of reference photos; identity strength; text prompt and style tags.
Why multi-model: Identity fusion across multiple reference images needs a dedicated encoder and a merged embedding that a vanilla text-to-image model cannot represent.
Use cases: Personalized avatars, Same person across marketing scenes, Fast persona prototyping
Pitfalls: Too few or too similar references collapse identity diversity; strong style tags can overwhelm the identity signal.

PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding (arXiv:2312.04461)

Reference-Locked Character Consistency

reference image + text -> image

Encode identity
InstantID / IP-Adapter FaceID
reference-adapter
IP-Adapter / InstantID / character LoRA encodes the reference identity
Generate
SDXL / FLUX
text-to-image
base model generates the new scene conditioned on the identity embedding
Restore
GFPGAN / CodeFormer
face-restore
optional face restoration to sharpen identity in small faces

Lock a character's identity across many generations by conditioning a base image model on a reference through an identity adapter, so new poses and scenes keep the same face and character.

Stack, controls & more

Example stack: InstantID or IP-Adapter FaceID -> SDXL or FLUX -> GFPGAN/CodeFormer restore.
Controls: Identity-adapter weight and scale; reference image quality; pose or control-net guidance; LoRA strength.
Why multi-model: A base text-to-image model has no persistent identity; an identity adapter or trained LoRA injects the reference's features so the character stays consistent across prompts in a way a single model cannot.
Use cases: Consistent characters across a storyboard, Brand mascots, Comic and graphic-novel panels
Pitfalls: Too-high adapter weight copies the reference pose and lighting (overfit); too low loses identity; identity drifts under extreme pose changes.

InstantID: Zero-shot Identity-Preserving Generation in Seconds (arXiv:2401.07519)

Soul ID Locked Character to Veo 3 Talking Video

text/image -> image -> identity-lock -> video+voice -> polish

character still generation
Soul (Higgsfield image model) / Nano Banana
text-to-image
Generate or upload the base character portrait
identity training / lock
Soul ID
reference-adapter
Train a reusable digital-twin identity from the reference photo so the same face persists across future generations without re-upload
talking video and voice generation
Veo 3
image-to-video
Animate the locked identity into a video with native synchronized dialogue/voice from a text prompt
polish
Higgsfield VFX / Higgsfield Camera Motion
video-to-video
Apply VFX and camera-motion finishing passes and choose export aspect ratio

Generate or upload a character image (Soul or Nano Banana), then train a portable digital-twin identity on it with Soul ID. Feed that locked identity into Veo 3 to generate a talking video with native synchronized dialogue and voice, and finish with VFX and camera-motion polish nodes. Everything runs inside Higgsfield's node-based Canvas, without leaving the app or re-uploading references per shot.

Stack, controls & more

Example stack: Soul or Nano Banana (still) -> Soul ID (identity lock) -> Veo 3 (talking video + voice) -> Higgsfield VFX/Camera Motion (polish)
Controls: Soul ID reference photo set for identity training; text prompt describing motion, setting and dialogue for Veo 3; VFX and camera-motion node parameters; export aspect ratio (9:16 or 16:9)
Why multi-model: Chains at least three distinct generative systems: an image generator (Soul or Nano Banana) for the base character still, Soul ID as a trained identity-lock layer that persists the face across generations without re-uploading a reference each time, and Veo 3 as a separate video-and-voice model that produces native lip-synced dialogue directly (no separate TTS or lip-sync model bolted on afterward). This differs from reference-locked-character-consistency (InstantID/IP-Adapter face lock on still images only, no video stage) and from voice-clone-to-talking-head / tts-video-diffusion-lipsync-chain (both of which pair a separate zero-shot TTS clone with a separate lip-sync/talking-head model rather than a video model with native audio generation).
Use cases: recurring branded spokesperson videos across campaign variations, UGC-style talking avatar ads without re-shooting per script, consistent narrator/host across a multi-episode video series
Pitfalls: Requires a Higgsfield subscription tier with Veo 3 access; identity lock quality depends on the base reference image; native Veo 3 audio/dialogue means separate TTS or lip-sync tooling is not used, so voice style is constrained to what Veo 3 can render from the prompt rather than a cloned voice.

Higgsfield: Idea to Talking Character. All in One Place.

Stand-In Lightweight Identity Video Control

single face photo + text (+ optional pose video / style LoRA) -> identity-locked video

Identity embed
antelopev2
reference-adapter
antelopev2 face recognition model extracts an identity embedding from a single reference photo
ID-conditioned video synthesis
Stand-In adapter / Wan 2.1-14B
text-to-video
A lightweight adapter (153M params on the 14B model) injects the identity embedding via restricted self-attention with conditional position mapping into the video DiT
Optional motion/style compose
VACE (pose control) / community Wan LoRAs
controlnet
Pose-guided motion via VACE, or stylization via community LoRAs, applied alongside the identity adapter without retraining

Stand-In adds a tiny conditional-image branch (~1% extra parameters) onto Wan 2.1/2.2 that restricts self-attention with conditional position mapping, locking a face identity across an entire generated video clip while composing with pose-driven control or community style LoRAs.

Stack, controls & more

Example stack: antelopev2 face encoder -> Stand-In adapter + Wan 2.1-14B text-to-video -> VACE pose control (optional)
Controls: single reference face image, text prompt, optional driving pose video (VACE), optional LoRA for stylization, denoising-strength adjustment for the experimental face-swap mode
Why multi-model: Identity extraction and video synthesis are handled by two separate networks: a dedicated face-recognition encoder produces the identity embedding, and a large pretrained video DiT (untouched in its core weights) consumes it; optional third-party pose control (VACE) or LoRA further compose in as independent modules.
Use cases: identity-locked talking/acting clips from one photo, face-swap-in-video variants, stylized (cartoon/anime) identity-consistent video via community LoRA composition
Pitfalls: trained only on real-person data, so cartoon/object generalization is a secondary, less-tested mode; combining pose control (VACE) with identity locking simultaneously can still trade off motion naturalness against strict facial consistency

Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation (arXiv:2508.07901)

UNO Multi-Subject Compose to Wan Animate

reference images (subject + object) + text -> composed still -> animated video

Reference compose
ByteDance UNO / FLUX.1 dev
image-to-image
In-context conditioning on 1+ subject/object reference images plus text produces one consistent composed image (progressive cross-modal alignment + universal RoPE)
Vision re-encode
CLIP Vision
reference-adapter
CLIP Vision encodes the composed still to capture identity cues (face, hair, clothing) for the video model's reference conditioning
Animate
Wan 2.2
image-to-video
Image-to-video diffusion generates motion while the vision-encoded reference anchors identity/outfit across frames

ByteDance UNO composes a character (and optional object/outfit reference) into a single consistent still via in-context multi-image conditioning on FLUX.1-dev, then a CLIP-Vision-encoded image-to-video model animates that still into a moving clip with the same identity and outfit.

Stack, controls & more

Example stack: ByteDance UNO (FLUX.1 dev) -> CLIP Vision encode -> Wan 2.2 image-to-video
Controls: number/order of reference images, text prompt for scene, UnoPE position mapping (subject vs. object disambiguation), CLIP Vision reference averaging across multiple stills to stabilize identity, negative reference image to suppress unwanted traits
Why multi-model: UNO's DiT adapter only produces a single still frame from reference images; it has no temporal/motion prior. A separate video diffusion model with its own vision encoder is required to turn the consistent still into motion without re-drifting identity.
Use cases: product/character mockups that need to move, outfit-consistent social clips from a single compose, game/animation asset previsualization
Pitfalls: UNO can still show attribute confusion between subject and object on complex multi-reference prompts; if the composed still has inconsistent lighting/crop, the video model's identity embedding destabilizes and drifts within a few seconds of generated motion

UNO: Less-to-More Generalization (ByteDance, ICCV 2025)

Image to video (5)

Stills driven into motion.

Depth-Guided Video Relighting Chain

video -> matte + depth -> per-frame relight -> temporally-stabilized video

Matte
Robust Video Matting
image-to-image
Isolate subject per frame from background
Depth
Depth Anything V2
depth
Depth map per frame to condition relighting geometry
Relight
IC-Light ControlNet / Stable Diffusion 1.5
controlnet
Apply IC-Light ControlNet driven by a light map onto SD1.5 base
Temporal stabilize
AnimateDiff
video-to-video
Motion module to reduce flicker across relit frames

A character is matted out of source footage, depth-mapped, then relit frame-by-frame with an IC-Light ControlNet driven by a hand-painted light map, with AnimateDiff-based motion modules used to stabilize the relit sequence.

Stack, controls & more

Example stack: Robust Video Matting -> Depth Anything V2 -> IC-Light ControlNet (SD1.5) -> AnimateDiff
Controls: light map image, ControlNet strength/end-percent, IP-Adapter reference for lighting style, AnimateDiff motion LoRA, seed
Why multi-model: IC-Light is an image relighting model with no native temporal awareness; matting isolates the subject from background contamination, depth conditioning keeps the relighting geometrically consistent, and a motion module is needed to prevent frame-to-frame lighting flicker.
Use cases: re-lighting talking-head footage for a new set, matching actor lighting to a virtual background, day-for-night conversions
Pitfalls: frame-independent relighting causes flicker without a temporal/motion module; matte edge errors bleed background light onto the subject

ComfyUI IC-Light Workflow for Video Relighting (RunComfy)

First-Last-Frame Interpolation (FLF2V)

image + image -> video

Generate start frame
FLUX.1 dev
text-to-image
t2i model (FLUX or SDXL) produces the opening keyframe
Generate end frame
FLUX.1 dev
text-to-image
same or variant prompt produces the closing keyframe; color/lighting grading applied for consistency
Interpolate
Wan 2.2 FLF2V / RIFE
image-to-video
Wan 2.2 FLF2V fills intermediate frames between the two anchors, producing a 720p+ clip

Two keyframe images, a start frame and an end frame, are generated (or sourced) independently; a specialised first-last-frame-to-video model then fills in all intermediate frames, producing a coherent motion arc between the two endpoints. Chained passes can drive longer transformations by treating the last frame of one pass as the first frame of the next.

Stack, controls & more

Example stack: FLUX.1 dev (start frame) + FLUX.1 dev (end frame) -> Wan 2.2 FLF2V interpolation -> optional RIFE 60fps pass.
Controls: Positive/negative prompts to describe the in-between motion; video size and length; denoising schedule for early vs. late timesteps; chained FLF passes for longer sequences.
Why multi-model: Conventional i2v only anchors the start; FLF2V anchors both endpoints, so the transformation arc is determinate. The two keyframes are typically generated by a separate t2i model (FLUX, SDXL), making this a t2i -> FLF2V two-stage chain where each model handles only what it excels at.
Use cases: Transformation and morph effects, Matched cuts and transitions, Character pose changes
Pitfalls: Extreme pose or scene changes can cause incoherent mid-motion; chained passes must carefully manage extracted last-frame quality. FLF2V clips are short (up to ~5 s per pass).

Apatero: WAN 2.2 FLF First-Last Frame Video Guide - ComfyUI 2025
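
The chained-pass idea (the last frame of one pass becomes the first frame of the next) can be sketched independently of any specific model. Here flf2v is a hypothetical callable standing in for a first-last-frame model such as Wan 2.2 FLF2V; frames are represented as plain strings purely for illustration.

```python
from typing import Callable, List, Sequence

Frame = str  # stand-in; a real pipeline would pass image tensors or file paths

def chain_flf_passes(
    keyframes: Sequence[Frame],
    flf2v: Callable[[Frame, Frame], List[Frame]],
) -> List[Frame]:
    """Run one FLF pass per consecutive keyframe pair and join the clips."""
    if len(keyframes) < 2:
        raise ValueError("need at least a start and an end keyframe")
    video: List[Frame] = []
    current = keyframes[0]
    for target in keyframes[1:]:
        clip = flf2v(current, target)
        if not clip:
            raise ValueError("flf2v returned an empty clip")
        # Skip the first frame of later clips: it duplicates the previous clip's last frame.
        video.extend(clip if not video else clip[1:])
        current = clip[-1]  # the extracted last frame seeds the next pass
    return video

def fake_flf2v(first: Frame, last: Frame) -> List[Frame]:
    """Toy model: start frame, one in-between frame, end frame."""
    return [first, f"{first}->{last}", last]

frames = chain_flf_passes(["A", "B", "C"], fake_flf2v)
```

Seeding each pass from the clip's own last frame, rather than from the original keyframe, is what makes the quality of the extracted frame matter, as the pitfall above notes.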

Keyframe-Drive Image-to-Video

text -> image -> video

Prompt keyframe
Character LoRA / IP-Adapter
text-to-image
generate the anchor still with precise composition and character identity (e.g. via KSampler + Character LoRA + IPAdapter)
Drive to video
Kling
image-to-video
i2v model (Kling, Runway, Wan i2v) animates the still into a motion clip via the video API

A text-to-image model generates one or more keyframe stills that lock composition, character, and lighting; those stills are then fed into an image-to-video model to produce motion. The i2v model inherits the keyframe's identity rather than hallucinating from text alone, giving tighter control over subject appearance and scene layout.

Stack, controls & more

Example stack: ComfyUI KSampler + Character LoRA + IPAdapter (keyframe still) -> Kling API image-to-video (motion clip).
Controls: Seed consistency between t2i and i2v passes; motion prompt describing camera move or action; motion scale / CFG; clip duration.
Why multi-model: A single text-to-video model must simultaneously handle composition, identity, and motion; splitting the job lets a high-quality t2i model nail the still, then a separate i2v model concentrate on motion, preserving the still's identity in a way end-to-end t2v cannot.
Use cases: AI influencer and character video content, Product reveal shots, Character animation from concept art, Social media motion graphics
Pitfalls: i2v models can drift from the keyframe under strong motion prompts; over-long clips degrade identity. Short clips (2-6 s) per shot are recommended. Source image quality matters more than generation settings.

Apatero: AI Influencer Image to Video - Kling AI ComfyUI Workflow 2025

Light-A-Video Progressive Light Fusion

video -> per-frame relit candidates -> temporally fused relit video

Per-frame relight
IC-Light
image-to-image
Base image relighting model adapted with cross-frame attention
Consistent Light Attention
Light-A-Video (CLA module)
video-to-video
Cross-frame self-attention coupling to stabilize the lighting source across frames
Progressive Light Fusion
Light-A-Video (PLF strategy)
video-to-video
Linear blend of source and relit video appearance for smooth temporal transitions

A training-free video relighting pipeline that repurposes an existing image relighting model with a Consistent Light Attention module across frames, then linearly blends source and relit appearance over time (Progressive Light Fusion) to remove flicker without any video-specific training.

Stack, controls & more

Example stack: IC-Light (per-frame) -> Light-A-Video CLA (cross-frame attention) -> Light-A-Video PLF (temporal blend)
Controls: light source/text prompt for target illumination, blending ratio in PLF, base video diffusion backbone choice (e.g. VideoCrafter2, AnimateDiff)
Why multi-model: A still-image relighting model alone produces per-frame lighting-source drift when run independently on each frame; a second temporal-fusion mechanism operating over the video diffusion's attention and output space is required to enforce cross-frame consistency.
Use cases: relighting UGC or archival video without retraining, time-of-day changes for existing video clips, consistent studio-light simulation on handheld footage
Pitfalls: training-free temporal fusion can under-correct for large lighting-direction changes, causing residual low-frequency flicker on fast camera motion

Light-A-Video: Training-free Video Relighting via Progressive Light Fusion, arXiv:2502.08590 (ICCV 2025)
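
The linear blend at the heart of Progressive Light Fusion can be illustrated in a heavily simplified form. The sketch assumes source and relit videos are float arrays of identical shape and uses an evenly spaced ramp of blend ratios; both the ramp and the point in the pipeline where the blend is applied are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def fusion_ratios(steps: int) -> np.ndarray:
    """Evenly spaced blend ratios from 0 (all source) to 1 (all relit)."""
    return np.linspace(0.0, 1.0, steps)

def progressive_light_fusion(source: np.ndarray, relit: np.ndarray, steps: int = 5):
    """Return one blended video per step: (1 - r) * source + r * relit."""
    return [(1.0 - r) * source + r * relit for r in fusion_ratios(steps)]

source = np.zeros((2, 4, 4, 3))  # (frames, height, width, channels)
relit = np.ones((2, 4, 4, 3))
stages = progressive_light_fusion(source, relit, steps=5)
```

The blending ratio listed under Controls corresponds to r here: small values keep the source appearance, large values commit to the relit look.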

Motion-Brush Trajectory-Controlled Image Animation

static image + hand-drawn motion paths -> per-element trajectory video

Region masking
Kling Motion Brush region selector
controlnet
Brush-select up to six regions/elements to constrain which pixels move (masking UI; approximate node type)
Trajectory encoding
Kling Motion Control path encoder
controlnet
Convert user-drawn paths (direction, curvature, depth) per region into a motion-field conditioning signal
Trajectory-conditioned video generation
Kling V2.6 Motion Control / Runway Motion Brush (Gen-4)
image-to-video
Generate video where each masked element follows its assigned path while unselected areas stay static

A still image is segmented into up to six user-selected regions, each assigned its own hand-painted motion trajectory (direction, curve, speed), and a commercial video diffusion model animates each region along its path while holding the rest of the frame static.

Stack, controls & more

Example stack: Kling Motion Brush region selector -> Kling Motion Control path encoder -> Kling V2.6 Motion Control
Controls: brush-selected motion regions (up to 6), hand-drawn trajectory curves per element, static-lock brush for background, optional text prompt for action description
Why multi-model: Region selection/masking and trajectory-path encoding are handled by a separate interactive control layer distinct from the underlying video generator; the diffusion backbone alone has no mechanism to accept per-object drawn paths without an intermediate trajectory-to-motion-field encoder.
Use cases: product shots with a single element (e.g. bottle cap, fabric) animated in place, cinemagraph-style social ads, explainer graphics where only an icon/arrow should move
Pitfalls: Long or sharply curved trajectories can produce warping/limb-breaking artifacts; overlapping trajectories for adjacent regions often bleed into each other.

Kling V2.6 Motion Control (Replicate model docs)

Multi-shot (7)

Storyboard to per-shot animation to a stitched sequence.

Character Sheet to Lettered Comic Panel Sequence

text prompt -> character reference sheet -> panel images -> paneled page + lettering

Character sheet generation
SDXL
text-to-image
Generate a multi-angle reference sheet (front/side/three-quarter) for one character to lock design
Identity + pose-locked panel generation
IPAdapter FaceID / SDXL + ControlNet (OpenPose)
image-to-image
Reuse the character sheet via a face/identity adapter and a pose ControlNet to generate each story panel consistently
Panel layout composite
ComfyUI Panels (bmad4ever)
image-to-image
Arrange generated panel images into a nested hierarchical comic-page cut layout with gutters/borders
Lettering
ComfyUI PanelForge speech-bubble nodes
inpaint
Add speech bubbles and text lettering onto the finished page as a human-assisted or automated post-production pass

A consistent character is first generated as a multi-view reference sheet (front/side/angled), then reused via identity+pose adapters to populate a sequence of panel images, which are finally arranged into a hierarchical comic-page layout with speech-bubble lettering added as a separate post-process.

Stack, controls & more

Example stack: SDXL (character sheet) -> IPAdapter FaceID + ControlNet OpenPose (per-panel generation) -> ComfyUI Panels layout -> PanelForge lettering
Controls: character reference sheet (multi-view), IPAdapter identity weight, ControlNet pose skeleton per panel, panel-cut hierarchy (direction/count/angle), bubble placement/text
Why multi-model: Identity consistency across many panels, per-panel pose/composition control, and page-layout/lettering are three separate concerns: an identity adapter locks the face/design, a pose ControlNet governs per-panel staging, and a dedicated panel-layout/lettering engine (not a diffusion model) composites the final publishable page.
Use cases: webcomic/manga production pipelines needing consistent recurring characters, storyboard-to-comic adaptation, rapid pitch-deck comic mockups
Pitfalls: Identity drift creeps in over long panel sequences despite IPAdapter locking, especially with extreme pose or expression changes; automated lettering still typically needs a human copyediting/placement pass before publishing.

comfyui_panels: Comics/Manga like panel layouts (GitHub)

LLM-Planned Multi-Scene POV Video Cascade (fal Live-in-Scene)

text -> text (x6 prompts) -> image (x6) -> video (x6 w/ audio) -> merged video

generate scene keyframes
fal-ai/nano-banana-pro
text-to-image
render one keyframe image per planned POV scene (x6)
animate each scene with audio
fal-ai/bytedance/seedance/v1.5/pro/image-to-video
image-to-video
animate each keyframe into a 4-second clip with generated ambient/environmental audio (x6)

fal.ai's 'Live-in-Scene' workflow template takes a film name, has an LLM scene planner break it into 6 first-person POV scene prompts, generates a keyframe image for each with Nano Banana Pro, animates each into a 4-second clip with generated ambient audio via Seedance 1.5 Pro image-to-video, then merges all six clips into one ~24-second video.

Stack, controls & more

Example stack: OpenRouter Gemini 2.5 Flash (scene planner LLM) -> Nano Banana Pro (per-scene text-to-image, x6) -> Seedance 1.5 Pro i2v (per-scene animate+audio, x6) -> fal ffmpeg-api merge-videos
Controls: film name input; LLM scene planner enforces one clear action per scene and ambient-only audio (no dialogue), sequencing scenes toward a narrative climax; ffmpeg merge stitches final order
Why multi-model: An LLM planning model drives per-shot prompt generation, a separate text-to-image model renders each keyframe, a separate image-to-video model (with native audio generation) animates each shot, and an ffmpeg merge step assembles the final cut, all as one chained endpoint.
Use cases: generating short immersive POV trailers/homages to a film's setting, rapid multi-scene sizzle reels from a single text seed, prototyping episodic short-form video from narrative prompts
Pitfalls: LLM-authored per-scene prompts can drift in visual continuity between the 6 independently generated keyframes, since there is no shared identity/style-lock adapter across scenes. Distinct from multi-shot-narrative-stitch (Qwen Image Edit + Wan, per-shot keyframe -> i2v -> concatenate) and storyboard-grid-to-multishot-video (a single Nano Banana 2 storyboard grid, then Kling 3.0 multi-shot animate): here an LLM planner generates independent per-scene prompts, and each scene is a separately generated keyframe and clip with native generated audio, not a single grid split into shots.

fal-ai-community/skills WORKFLOWS.md reference (fal.ai Workflow Templates, Live-in-Scene)
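
The fan-out in the Live-in-Scene cascade above can be sketched as a flat list of calls. This is a minimal illustration, not fal's implementation: the `Call` record, the `plan_cascade` helper, and the `"llm"` and `"ffmpeg-merge"` labels are hypothetical names introduced here, and only the scene count (6) and clip length (4 s) come from the pattern description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    stage: str       # label for the chain stage (hypothetical naming)
    model_type: str  # node vocabulary term, e.g. "text-to-image"
    scene: int       # 1-based scene index, 0 for whole-video stages


def plan_cascade(n_scenes: int = 6, clip_seconds: int = 4) -> tuple[list[Call], int]:
    """Expand planner -> keyframes -> clips -> merge into individual calls."""
    calls = [Call("plan scenes", "llm", 0)]
    for i in range(1, n_scenes + 1):
        calls.append(Call("keyframe", "text-to-image", i))
    for i in range(1, n_scenes + 1):
        calls.append(Call("animate", "image-to-video", i))
    calls.append(Call("merge", "ffmpeg-merge", 0))
    # Merged length is simply the sum of the equal-length clips.
    return calls, n_scenes * clip_seconds


calls, total_seconds = plan_cascade()
print(len(calls), total_seconds)  # 14 calls, 24 seconds
```

Each keyframe and clip is an independent call, which is why nothing in the chain enforces continuity between scenes.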

Multi-Shot Narrative Stitch

text -> image (per shot) -> video (per shot) -> stitched video

Plan shots
Qwen Image Edit
text-to-image
LLM or manual prompt produces per-shot descriptions with subject, camera, and motion directives
Generate keyframes
Qwen Image Edit
text-to-image
reference-conditioned image model (e.g. Qwen Image Edit, Seedance storyboard, GPT Image) generates each shot's opening still; prior scene image conditions subsequent shots for style alignment
Animate per shot
Wan 2.2 i2v
image-to-video
i2v model (Wan 2.2, Seedance 2.0, Runway) animates each keyframe into a 2-6 s clip
Concatenate
FFmpeg
interpolate
clips are stitched in order; optional cross-dissolve or RIFE bridging between shots

A narrative or storyboard is broken into individual shots; a reference-conditioned image model generates a consistent keyframe per shot, an image-to-video model animates each keyframe into a short clip, and all clips are concatenated in order to form a full multi-shot sequence. Character and style consistency is maintained by conditioning each shot on a shared reference.

Stack, controls & more

Example stack: Qwen Image Edit (scene stills, reference-conditioned) -> Wan 2.2 i2v (per shot) -> concatenate into final cut.
Controls: Per-shot motion and camera prompts; shared character reference weight; clip duration per shot; concatenation order and transition type.
Why multi-model: No single t2v model can reliably maintain character identity and narrative continuity across multiple shots. Decomposing into per-shot i2v passes with a shared reference adapter is currently the most practical way to hold consistency while exercising per-shot camera and motion control.
Use cases: Short-form narrative films, Brand story ads, Game cinematics, Automated social video
Pitfalls: Identity drift accumulates across shots without a strong reference signal; stitching transitions are abrupt if not bridged; inconsistent lighting between shots breaks cohesion.

RunComfy: Create Coherent Scenes - Qwen Image Edit & Wan 2.2 ComfyUI Workflow
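
The concatenate step of the Multi-Shot Narrative Stitch pattern above can be sketched with FFmpeg's concat demuxer. This is one possible approach, assuming all clips share codec, resolution, and frame rate so they can be stream-copied; the file names and helper names are illustrative, and the referenced workflow may stitch clips differently.

```python
def concat_list(paths: list[str]) -> str:
    """Text of an FFmpeg concat-demuxer list, one clip per line, in shot order."""
    lines = []
    for p in paths:
        # Inside single quotes, a literal quote is written as: close, escaped quote, reopen.
        escaped = p.replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


def concat_command(list_path: str, out_path: str) -> list[str]:
    """argv for a stream-copy concat (hard cuts only, no re-encode)."""
    return ["ffmpeg", "-f", "concat", "-safe", "0", "-i", list_path, "-c", "copy", out_path]


shots = ["shot_01.mp4", "shot_02.mp4", "shot_03.mp4"]  # hypothetical clip names
print(concat_list(shots), end="")
print(" ".join(concat_command("shots.txt", "final_cut.mp4")))
```

Stream copy produces hard cuts. A cross-dissolve or an interpolated bridge between shots would need a re-encoding step that this sketch does not cover.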

Storyboard Grid to Multi-Shot Video

text concept -> storyboard image grid -> multi-shot video

Storyboard grid
Nano Banana 2 (Gemini 3.1 Flash Image)
text-to-image
Generate all shots in a single multi-panel grid in one pass so lighting, staging, and character silhouette stay shared
Detail fix
Nano Banana 2
image-to-image
Repair detail lost in the grid render (faces, product fidelity) using reference photos before the video stage
Multi-shot animate
Kling 3.0
image-to-video
Feed the storyboard panels as timestamped reference images so the video model maps each panel to its shot and animates camera + motion in one continuous multi-shot clip

A dedicated image model renders the full shot sequence as consistent staged panels (sometimes a single grid image) so composition and character silhouette are locked before any motion exists; a video model then animates the panels together, using them as reference images plus a timestamped multi-shot prompt to decide camera work and motion per shot.

Stack, controls & more

Example stack: Nano Banana 2 (storyboard grid + detail fix) -> Kling 3.0 (multi-shot, omni reference)
Controls: storyboard panel order/count, per-shot text description, reference-image weighting (omni reference), timestamped multi-shot prompts, anchor-frame selection
Why multi-model: Image models hold composition, lighting, and character silhouette across many panels in one pass; video models are better at motion and camera dynamics but drift on composition if asked to invent both from text alone. Splitting what it looks like from how it moves across two model families is what stabilizes multi-shot output.
Use cases: Ad and trailer previz, Narrative shorts with consistent characters across scenes, Product commercials with multiple staged shots
Pitfalls: Consistency comes from a strong anchor frame, not the grid itself; a weak or ambiguous panel drifts its downstream shot, and product/face detail lost during grid rendering silently carries into the animated shot unless fixed first.

Nano Banana 2 and Kling 3.0: Cinematic AI Ad Workflow (2026)
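
To feed storyboard panels to the video stage as timestamped references, the grid first has to be cut into panels and each panel assigned a time window. The sketch below assumes a uniform grid with no gutters and equal-length shots; both are simplifying assumptions, the helper names are hypothetical, and the actual prompt format expected by the video model is not shown.

```python
def panel_boxes(width: int, height: int, rows: int, cols: int) -> list[tuple[int, int, int, int]]:
    """Crop boxes (left, top, right, bottom) in reading order for a uniform grid."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            top, bottom = r * height // rows, (r + 1) * height // rows
            boxes.append((left, top, right, bottom))
    return boxes


def shot_windows(n_shots: int, total_seconds: float) -> list[tuple[float, float]]:
    """Equal-length (start, end) windows, one per panel."""
    step = total_seconds / n_shots
    return [(i * step, (i + 1) * step) for i in range(n_shots)]


boxes = panel_boxes(1536, 1024, rows=2, cols=3)
windows = shot_windows(len(boxes), total_seconds=12.0)
for box, (start, end) in zip(boxes, windows):
    print(box, f"{start:.1f}-{end:.1f}s")
```

The boxes use the same (left, top, right, bottom) convention as common image libraries, so each one can be passed to a crop call before the detail-fix step.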

StoryDiffusion Consistent Storyboard

text -> image sequence (+ video)

Storyboard
StoryDiffusion / SDXL
text-to-image
batch generation with consistent self-attention locking characters
Animate
Sora / Kling
image-to-video
optional frame-to-frame motion between panels

Generate a multi-panel comic or storyboard where the same characters stay consistent across shots using consistent self-attention across a batch, then optionally animate the transitions between frames.

Stack, controls & more

Example stack: StoryDiffusion (SDXL) -> Sora/Kling image-to-video for transitions.
Controls: Character reference prompts; number of panels; shared-attention on/off; motion strength for transitions.
Why multi-model: Cross-shot consistency needs attention shared across generations, plus a motion model to bridge frames; a single image pass cannot enforce identity across panels.
Use cases: Comic and graphic-novel pages, Ad storyboards, Children's book illustration
Pitfalls: Shared attention can homogenize distinct characters; long sequences drift as the consistency window is bounded.

StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation (arXiv:2405.01434)

Suno Song to Flux-Runway Generative Music Video

text -> audio -> image -> video

song generation
Suno API
text-to-audio
lyrics + music generated from a text prompt
scene artwork
Flux
text-to-image
per-scene stills generated from image prompts
scene animation
Runway ML (Gen video)
image-to-video
each still animated into a short clip
assembly
Creatomate
render
clips synced/merged with the generated audio track into a final video

A text prompt drives Suno to write and generate a full song, then per-scene image stills are generated with Flux and animated into short clips with Runway, which are composited against the finished track into a music video.

Stack, controls & more

Example stack: n8n + Suno API + Flux (BlackForest Labs/RapidAPI) + Runway ML + Creatomate
Controls: song/style prompt, per-scene image prompts, clip duration
Why multi-model: Chains three distinct generative-media models across two modalities: Suno generates the music track itself (not just narration), Flux generates the scene artwork, and Runway generates video motion from those stills; a rendering step composites them. This differs from the baseline's beat-synced music-video pattern (which pairs a generated song with a beat-synced *editing* tool over existing footage) because here every visual asset, not just the audio, is freshly generated per scene.
Use cases: AI music video generation, social lyric videos, artist promo clips from a single prompt
Pitfalls: per-scene image/video costs compound quickly across a full song length; visual continuity between scenes is not enforced by any identity/consistency model

n8n workflow template: Generate AI songs + music videos using Suno API, Flux, Runway and Creatomate
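
The cost pitfall above is easy to quantify: the number of scenes grows with song length, and each scene pays for one still plus one animated clip. The sketch below uses placeholder prices in cents, not real vendor rates, and hypothetical helper names.

```python
import math


def scenes_needed(song_seconds: int, clip_seconds: int) -> int:
    """Number of clips required to cover the whole track."""
    return math.ceil(song_seconds / clip_seconds)


def visual_cost_cents(song_seconds: int, clip_seconds: int,
                      image_cents: int, video_cents: int) -> int:
    """Total visual cost: one still plus one animated clip per scene."""
    return scenes_needed(song_seconds, clip_seconds) * (image_cents + video_cents)


# Placeholder unit prices (assumptions for illustration only).
print(scenes_needed(180, 5))             # 36
print(visual_cost_cents(180, 5, 4, 50))  # 1944
```

A three-minute song at 5 s per clip already needs 36 image calls and 36 video calls, so clip duration is the main lever on total spend.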

Text-to-Narrated Documentary Video (fal Documentary)

text -> text (script/shots) -> image (x6) -> video (x12) -> speech -> merged narrated video

generate scene images
fal-ai/bytedance/dreamina/v3.1/text-to-image
text-to-image
render 6 scene images from the LLM-drafted shot script
animate scenes
fal-ai/bytedance/seedance/v1/pro/image-to-video
image-to-video
animate each of the 6 scene images into video clips (12 total video calls across shots)
synthesize narration
fal-ai/elevenlabs/tts/eleven-v3
text-to-audio
generate spoken narration track from the script text

fal.ai's 'Documentary' workflow template turns a text topic into a narrated short documentary: an LLM drafts the story/shot script, Dreamina v3.1 generates six scene images, Seedance v1 pro animates them into video clips, ElevenLabs eleven-v3 synthesizes voiceover narration, and ffmpeg merges audio with video and concatenates all shots into one final cut.

Stack, controls & more

Example stack: LLM (fal-ai/any-llm, script + shot breakdown) -> Dreamina v3.1 text-to-image (x6 scenes) -> Seedance v1 pro image-to-video (per-scene clips) -> ElevenLabs eleven-v3 TTS narration -> fal ffmpeg-api merge-audio-video + merge-videos
Controls: text topic/prompt input drives the LLM script; narration voice via ElevenLabs eleven-v3; ffmpeg merge-audio-video then merge-videos assembles the final concatenated cut
Why multi-model: Combines an LLM scriptwriter, a text-to-image model, an image-to-video model, and a text-to-speech model, each a distinct hosted model, glued together by ffmpeg merge/concat utilities into a single narrated-video pipeline.
Use cases: turning a topic/prompt into a narrated explainer or mini-documentary video, auto-generating voiced short-form video essays, rapid narrated video drafts for social/education content
Pitfalls: Narration timing must be aligned to clip lengths via the merge-audio-video step; mismatched pacing between the 6 generated shots and the continuous voiceover can produce dead air or overlap. Distinct from text-to-song-beat-synced-music-video (Suno/Udio song-driven, beat-synced cuts, no spoken narration) and llm-orchestrated-talking-scene-assembly (Qwen-Image/Edit + Wan2.2 + InfiniteTalk + MiniMax, built around a talking on-camera subject): this chain is narration-over-B-roll documentary style with ElevenLabs TTS and Dreamina/Seedance visuals, with no talking-head lipsync and no music generation.

fal.ai Workflow Templates - Documentary
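
The pacing pitfall above can be checked before the merge step by comparing narration length with total clip length. This is a minimal sketch with hypothetical helper names, illustrative durations, and an arbitrary 0.25 s tolerance.

```python
def pacing_gap(narration_seconds: float, clip_seconds: list[float]) -> float:
    """Narration length minus total video length.

    Positive: narration outlasts the visuals.
    Negative: video keeps running after the narration ends (dead air).
    """
    return narration_seconds - sum(clip_seconds)


def describe(gap: float, tolerance: float = 0.25) -> str:
    """Classify the gap; the tolerance is an arbitrary illustrative threshold."""
    if abs(gap) <= tolerance:
        return "aligned"
    return "narration overruns video" if gap > 0 else "dead air after narration"


clips = [5.0] * 6  # six equal clips; durations are illustrative
print(describe(pacing_gap(30.0, clips)))  # aligned
print(describe(pacing_gap(24.0, clips)))  # dead air after narration
print(describe(pacing_gap(36.0, clips)))  # narration overruns video
```

A per-shot version of the same check, comparing each script segment with its clip, would catch drift inside the video that the totals hide.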

Audio driven (13)

Audio or speech driving visuals (lip-sync, talking head).

Chain-of-Thought Reasoning Foley Editing

video -> MLLM reasoning trace -> object-targeted audio -> edited audio

Scene reasoning
ThinkSound (MLLM reasoning stage)
video-to-video
MLLM decomposes the video into objects/actions/acoustics and emits a CoT plan (reasoning over video to a plan; approximate node type)
Foundational foley generation
ThinkSound audio foundation model
text-to-audio
CoT-guided audio foundation model generates semantically coherent soundscape
Object-centric / instruction editing
ThinkSound audio foundation model
text-to-audio
Interactive refinement of specific sound objects via follow-up natural-language instructions

A multimodal LLM first reasons step-by-step about a video's objects, actions, and acoustic environment, producing a structured chain-of-thought plan that then steers a separate audio foundation model to generate foley, and the same MLLM can re-reason to guide targeted natural-language edits of specific sound elements.

Stack, controls & more

Example stack: ThinkSound MLLM reasoning stage -> AudioCoT-guided audio foundation model -> instruction-guided re-edit pass
Controls: object-click targeting for interactive refinement, natural-language edit instructions, reasoning trace granularity
Why multi-model: The audio synthesis model has no native reasoning capacity to decide which objects should sound like what or when; a multimodal LLM is needed to decompose the scene into an interpretable plan (AudioCoT), and a distinct audio diffusion/foundation model executes that plan into actual waveforms, with the two iterating for object-level and instruction-level refinement.
Use cases: Iteratively directed foley for film post-production, Targeted SFX fixes without regenerating the whole soundtrack, Sound design review tools with explainable reasoning traces
Pitfalls: Reasoning quality bottlenecks final audio quality; a wrong CoT plan (misidentified object or action) propagates directly into an incorrect sound choice.

ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing (arXiv:2506.21448)

DALL-E Generated Character to ElevenLabs-Hedra Talking Podcast

text -> image -> audio -> video

script
OpenAI GPT-4
text-to-image
GPT-4 writes the short script (LLM, not counted as the media-gen step)
character generation
DALL-E
text-to-image
generates a novel photorealistic character portrait from text
voice synthesis
ElevenLabs
text-to-audio
synthesizes a voice reading the generated script
talking animation
Hedra
audio-to-video
animates the generated portrait's face in sync with the audio

A script is written by GPT-4, a wholly new photorealistic character portrait is generated from scratch with DALL-E, ElevenLabs synthesizes a voice reading the script, and Hedra animates the generated portrait's face to lip-sync and express in time with that audio.

Stack, controls & more

Example stack: n8n + OpenAI GPT-4 + DALL-E + ElevenLabs + Hedra
Controls: character description prompt, script topic, voice selection, expression intensity
Why multi-model: Chains a script-writing LLM with three distinct generative-media models: DALL-E (text-to-image) creates the character portrait itself, ElevenLabs (text-to-audio) synthesizes the voice, and Hedra (audio-to-video) animates the face. This is structurally distinct from the baseline's voice-clone-to-talking-head pattern, which assumes an existing photo and a cloned voice from a reference sample; here both the character image and the voice are generated from nothing (no reference photo, no voice clone), making the character-creation step itself a generative-media stage rather than an input.
Use cases: novelty talking-character podcast clips, generated-mascot social content, no-reference-photo talking avatar content
Pitfalls: DALL-E-generated faces can drift between regenerations with no identity lock across episodes; Hedra's expression range is limited on non-frontal generated portraits

n8n workflow template: Create Animated Baby Podcast Videos with GPT, DALL-E, ElevenLabs and Hedra
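
The note that the script LLM is "not counted as the media-gen step" can be made concrete with a small scope check. The sketch below is an illustration, not the dataset's schema: the `Step` record and the `counts_as_media_model` flag are hypothetical, while the step values and the two-or-more-models threshold are taken from this catalogue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    model: str
    model_type: str
    counts_as_media_model: bool = True  # False for LLM planning/scripting steps


def media_model_count(chain: list[Step]) -> int:
    """Distinct generative-media models in the chain, ignoring planning steps."""
    return len({s.model for s in chain if s.counts_as_media_model})


podcast = [
    Step("script", "OpenAI GPT-4", "text-to-image", counts_as_media_model=False),
    Step("character generation", "DALL-E", "text-to-image"),
    Step("voice synthesis", "ElevenLabs", "text-to-audio"),
    Step("talking animation", "Hedra", "audio-to-video"),
]

count = media_model_count(podcast)
print(count, "in scope" if count >= 2 else "out of scope")  # 3 in scope
```

Counting distinct model names rather than steps means a chain that runs one model twice does not qualify as multi-model on that basis alone.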

Image to Scripted Voice-Clone UGC Video

image -> text (script) -> audio (cloned voice) -> video

Script + scene planning
Gemini
text-to-image
Gemini reads the uploaded image and writes a speech script with expression tags plus a separate scene description; this is a multimodal LLM planning step, not a media-generative model in the vocab, and is discounted from the multi-model count (no clean vocab fit)
Voice synthesis
ElevenLabs Instant Voice Cloning / ElevenLabs Text to Speech
text-to-audio
ElevenLabs synthesizes the script as narration using a preset voice or instant voice cloning from a reference sample
Video generation with native lipsync
LTX-2.3
image-to-video
LTX-2.3 takes the source image, scene description, and generated audio to produce a talking-style video with built-in lip sync, no separate lipsync model needed

A single product/character image is turned into a short lip-synced UGC-style video: an LLM writes a performance script and scene description from the image, ElevenLabs clones or selects a voice to synthesize the narration, then LTX-2.3 generates the video with native lip sync driven by that audio.

Stack, controls & more

Example stack: Gemini (script) -> ElevenLabs Instant Voice Clone (narration) -> LTX-2.3 (video + native lipsync)
Controls: source image, script/expression tags from the LLM stage, voice selection or cloned voice sample, scene description prompt, LTX-2.3 duration and lipsync strength
Why multi-model: No single model writes a grounded performance script from an image, clones or synthesizes a matching voice, and renders a lip-synced video; the chain needs an image-conditioned script generator, a separate voice-cloning TTS model, and a distinct audio-conditioned video generator.
Use cases: product explainer videos from a single photo, testimonial-style UGC ads, social media talking posts without filming
Pitfalls: LTX-2.3's native lipsync can drift on longer scripts or fast speech; cloned voice quality depends on the reference sample length and clarity, and mismatched scene description vs. script tone produces awkward framing

ComfyUI official template: Generate UGC Video With Voice Clone

LLM-Orchestrated Talking-Scene Assembly

character brief -> keyframes -> video + talking-head -> voiced final cut

Keyframe generation
Qwen-Image
text-to-image
Generate the initial scene frame from the LLM-authored per-shot prompt
Keyframe edit
Qwen-Image-Edit
image-to-image
Edit successive frames for scene transitions while keeping character and set consistent
Shot animation
Wan 2.2 I2V
image-to-video
Convert each keyframe into a video shot, stitched via last-frame continuation between scenes
Talking-head performance
InfiniteTalk
audio-to-video
Add lip-synced talking-head animation and face consistency on top of the animated shot
Voice
MiniMax voice synthesis
text-to-audio
Synthesize dialogue audio from the cast voice profile; the audio duration is measured so each video segment can be timed to an exact frame count

A general-purpose LLM first drafts the dialogue, per-shot prompts, and voice casting from a character brief; an image model then generates and edits successive scene keyframes, a video model animates each keyframe into a shot stitched via last-frame continuation, a talking-head model adds lip-synced performance and face consistency, and a voice model supplies the matching speech track.

Stack, controls & more

Example stack: Qwen-Image -> Qwen-Image-Edit -> Wan 2.2 I2V -> InfiniteTalk -> MiniMax voice synthesis
Controls: character brief/persona, per-shot LLM-authored prompts, LoRA selection, last-frame continuation between segments, voice profile, audio-duration-driven video timing
Why multi-model: No single video model renders and edits a sequence of scene keyframes, animates those into shots, drives lip-synced facial performance, and synthesizes a timed voice track; each is a distinct trained model family chained by an LLM-authored shot plan that keeps them in sync via shared scene/character state.
Use cases: AI-hosted talking-scene shorts with dialogue, Synthetic interview/skit production, Multi-scene narrative shorts with lip-synced characters
Pitfalls: Because timing is driven by measured audio duration feeding back into video segment length, a mismatch anywhere in the LLM-authored shot plan propagates through every downstream stage and only surfaces as broken sync in the final assembly.

lovisdotio/ComfyUI-Workflow-Sora2Alike-Full-loop-video
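
Audio-duration-driven timing, as described above, amounts to converting the measured length of each dialogue line into a frame count for the video model. The sketch below assumes a 16 fps model that only accepts frame counts of the form 4k + 1; both values are assumptions to verify against the video model actually used, and `frames_for_audio` is a hypothetical helper, not part of the referenced workflow.

```python
import math


def frames_for_audio(audio_seconds: float, fps: int = 16, block: int = 4) -> int:
    """Smallest frame count of the form block*k + 1 that covers the audio."""
    needed = math.ceil(audio_seconds * fps)        # frames required to span the audio
    k = max(0, math.ceil((needed - 1) / block))    # round up to the next valid count
    return block * k + 1


for seconds in (3.0, 5.0, 2.4):
    print(seconds, frames_for_audio(seconds))
# 3.0 -> 49, 5.0 -> 81, 2.4 -> 41
```

Rounding up leaves a few frames of slack after the line ends; rounding down would clip the last syllable, which is the kind of sync break the pitfall describes.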

LTX-2 Audio-Conditioned Two-Stage Generation

audio + reference image + prompt -> base audio-video -> upscaled audio-video

Base generation
LTX-2.3 22B (ltx-2.3-22b-dev)
audio-to-video
Jointly denoise video and audio latents conditioned on an input audio track, a face/reference image, and a prompt
Spatial upscale
LTX-2.3 Spatial Upscaler x2
upscale
Second-stage spatial upscaler sharpens the base clip resolution as a distinct chained checkpoint

LTX-2's dedicated audio-to-video pipeline generates a synchronized base clip directly conditioned on an input audio file (not text), jointly denoising audio and video in one diffusion pass, then a separate spatial-upscaler checkpoint sharpens the result in a second chained pass.

Stack, controls & more

Example stack: ltx-2.3-22b-dev (A2VidPipelineTwoStage) -> ltx-2.3-spatial-upscaler-x2
Controls: input audio track (dialogue/music pacing drives motion energy), reference/face image, upscaler factor (x1.5 or x2), prompt for scene description
Why multi-model: The base diffusion transformer that jointly denoises audio+video latents is trained for temporal/audio coherence, not resolution; production output requires a distinct spatial-upscaler checkpoint chained afterward. These are two separately downloadable model weights, not one model run twice.
Use cases: Podcast/avatar audio-driven scenes, Voice-driven talking clips, Music-synced motion generation
Pitfalls: Because audio and video are denoised jointly rather than muxed afterward, mismatched or noisy input audio directly corrupts motion quality (not just lip sync), and re-running only the upscale stage cannot fix a bad base generation.

Lightricks/LTX-2 official repository (A2VidPipelineTwoStage, spatial upscaler checkpoints)

MultiTalk Multi-Person Dialogue Video

reference image + multi-stream audio + text prompt -> multi-person conversational video

Speech synth + encode
Kokoro-82M (TTS) / Chinese-Wav2Vec2 (audio encoder)
text-to-audio
Kokoro-82M synthesizes each speaker line; Chinese-Wav2Vec2 encodes the audio into per-speaker features (encoder, not a generative model)
Audio-conditioned multi-person video generation
MultiTalk (built on Wan 2.1-I2V-14B)
audio-to-video
L-RoPE binds each audio stream to its reference face; diffuses full conversational video

Given a reference image containing multiple people, separate audio streams per speaker, and a scene prompt, the pipeline generates a video where each character's lips and turn-taking match their own audio stream, driven by a large video diffusion backbone conditioned on per-person audio embeddings.

Stack, controls & more

Example stack: Kokoro-82M (TTS for each character line) -> Chinese-Wav2Vec2 (audio embedding) -> MultiTalk / Wan 2.1-I2V-14B (multi-person audio-driven video)
Controls: per-speaker audio-person binding (L-RoPE), resolution (480p/720p), streaming vs clip mode, TeaCache acceleration, LoRA for style
Why multi-model: A single-speaker lip-sync model cannot bind multiple simultaneous audio streams to the correct face in a shared frame; MultiTalk needs a speech encoder to extract per-speaker embeddings, a Label Rotary Position Embedding scheme to bind each audio stream to its person, and a large video diffusion model to render coherent multi-person motion and interaction.
Use cases: Multi-character animated dialogue scenes, Podcast-to-video with two animated hosts, Cartoon character conversation generation
Pitfalls: Audio-person binding can swap identities when reference image faces are visually similar or poorly separated in frame; long streaming generations can drift in lip accuracy.

Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation (arXiv:2505.22647)

OmniHuman Audio-to-Talking-Human

image/video + audio -> human-animation video

Synthesize speech (optional)
ElevenLabs / Tortoise TTS
text-to-audio
TTS voice clone for the dialogue
Animate
OmniHuman
audio-to-video
one-stage audio-conditioned human animation
Upscale
Real-ESRGAN 2x
upscale
optional video upscale

A one-stage model animates a full human (not just the face) from audio, scaled up by conditioning on portrait, pose, and body signals together so motion and lip-sync emerge from a single network.

Stack, controls & more

Example stack: TTS voice clone -> OmniHuman -> 2x video upscale.
Controls: Source portrait or video; audio; body and pose conditioning; motion intensity.
Why multi-model: Full-body co-speech motion, gestures, and lip-sync run on different timescales; OmniHuman unifies them but still pairs with speech synthesis and upscaling in a real pipeline.
Use cases: Presenter and spokesperson videos, Multilingual dubbing with motion, Social-content avatars
Pitfalls: Hands and fine gestures can smear under fast motion; low-resolution inputs upscale with artifacts.

OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models (arXiv:2502.01061)

SadTalker Audio-to-Portrait

portrait + audio -> talking-head video

Animate
SadTalker
audio-to-video
audio to 3D face motion coefficients, rendered onto the portrait
Upscale
GFPGAN
upscale
optional

Drive a single portrait photo with speech by estimating realistic 3D motion coefficients (head pose and expression) from the audio, then rendering a lip-synced talking-head clip.

Stack, controls & more

Example stack: SadTalker (or a commercial talking-head API) -> upscale.
Controls: Source portrait; audio clip; expression and pose amplification; stillness of background.
Why multi-model: Speech-to-motion mapping and photo-real rendering are separate problems; a dedicated motion-coefficient model supplies the renderer with head-pose and expression motion that an image generator alone cannot infer from audio.
Use cases: Narrator and explainer avatars, Localization and dubbing, Talking-photo gifts
Pitfalls: Extreme head turns break the single-image 3D assumption; fast speech outruns the expression model.

SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation (arXiv:2211.12194)

Text-to-Song Beat-Synced Music Video

lyrics + style prompt -> generated song -> beat/lyric analysis -> synced visual cuts

Song generation
Suno / Udio
text-to-audio
Generate full song audio from lyrics + style/genre prompt
Beat-synced visual assembly
Freebeat / One More Shot AI
image-to-video
Detect tempo/beats internally, then generate scenes and cut/transition them on the beat with karaoke captions

Lyrics and a style prompt are turned into a full song via a text-to-music model, then a separate analysis+generation stage detects tempo, beats, and lyric timing from the audio to drive cut points, transitions, and karaoke-style captions over AI-generated or stock visuals.

Stack, controls & more

Example stack: Suno (lyrics+prompt -> song) -> Freebeat / One More Shot AI (beat detection + beat-synced scene generation)
Controls: beat-detection sensitivity, cut density per BPM, karaoke caption styling, aspect ratio export presets (16:9/9:16/1:1)
Why multi-model: The music generator produces audio only and has no visual or editing capability; a distinct beat/onset-detection and lyric-alignment stage must analyze the finished waveform, and a video assembly/generation stage must consume that timing data to place cuts and captions in sync.
Use cases: Independent artists generating music videos from a single song, Social-platform lyric videos with karaoke captions, Rapid pitch/demo music videos for unreleased tracks
Pitfalls: Beat detection can misfire on syncopated or tempo-shifting tracks, causing cuts that feel arbitrary; lyric alignment drifts on ad-libbed or heavily processed vocals.

One More Shot AI, beat-synced music-video generator (product docs)
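
For a track with a steady tempo, beat-synced cut points follow directly from the BPM: one beat lasts 60 / BPM seconds. The sketch below assumes a fixed tempo and a known BPM, which is exactly what breaks on the syncopated or tempo-shifting tracks named in the pitfall; `cut_times` is a hypothetical helper, not an API of the listed products.

```python
def cut_times(bpm: float, duration_seconds: float, beats_per_cut: int = 4) -> list[float]:
    """Cut timestamps every `beats_per_cut` beats, excluding 0 and the track end."""
    cuts = []
    k = 1
    while True:
        t = k * beats_per_cut * 60.0 / bpm  # time of the k-th cut in seconds
        if t >= duration_seconds:
            break
        cuts.append(t)
        k += 1
    return cuts


print(cut_times(bpm=120, duration_seconds=10, beats_per_cut=4))
# [2.0, 4.0, 6.0, 8.0]
```

Lowering `beats_per_cut` raises cut density, which corresponds to the "cut density per BPM" control listed above.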

Text/Pose-to-Video Avatar to Real-Time Lip-Sync

text/image/pose prompt -> generated avatar video -> real-time audio-driven lip-sync

Avatar/base video generation
MuseV
text-to-video
Generate the avatar's motion and background scene from text, image, or pose input
Real-time lip-sync inpainting
MuseTalk
audio-to-video
Latent-space inpainting of the mouth region synced to streaming audio at 30+ FPS

A text-to-video (or image-to-video / pose-to-video) model first generates the avatar's body motion and scene, and a separate real-time latent-space lip-sync model then re-renders just the mouth region at 30+ FPS to match streaming audio, decoupling body/scene generation from low-latency mouth synchronization.

Stack, controls & more

Example stack: MuseV (text/image/pose -> base avatar video) -> MuseTalk (streaming audio -> real-time lip-synced output)
Controls: generation mode (text/image/pose-to-video) for the base clip, streaming audio chunk size, target FPS, latent inpainting region size
Why multi-model: The upstream video generator is optimized for motion and scene diversity, not phoneme-accurate low-latency lip sync, while the lip-sync model is optimized for real-time inpainting of the mouth region only and cannot generate body motion or backgrounds itself; combining them lets each specialize.
Use cases: Live streaming virtual presenters/VTubers, Real-time conversational avatar assistants, Low-latency dubbing of generated avatar video into a live audio feed
Pitfalls: Base video's head pose/angle range must stay within what the lip-sync inpainting model was trained on, or mouth-region inpainting artifacts appear during large head turns.

MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting

TTS-Driven Video-Diffusion Talking Character

text script -> synthesized speech -> generated base video -> lip-synced final video

Speech synthesis
Kokoro TTS
text-to-audio
Generate spoken audio from a text script
Base video generation
LTX-Video / Wan 2.1
image-to-video
Animate a still reference image into a moving clip (no accurate lip sync yet)
Lip-sync re-pass
Sonic / LatentSync
audio-to-video
Re-synchronize mouth motion in the generated video to the TTS audio

A text script is converted to speech by a local TTS engine, an image-to-video model animates a character from a still reference into a base video clip, and a lip-sync model then re-syncs that generated video's mouth movements to the TTS audio, producing a fully text-driven talking character without any recorded voice or footage.

Stack, controls & more

Example stack: Kokoro TTS (text -> speech) -> LTX-Video (image -> base video) -> Sonic or LatentSync (audio -> lip-synced final video)
Controls: voice selection/language in TTS, video model choice for quality vs speed tradeoff, lip-sync model choice (Sonic quality vs LatentSync speed), lips_expression intensity
Why multi-model: No single model both generates a moving video from a still image AND produces accurate phoneme-level lip motion; the pipeline needs a TTS model for speech, an image-to-video diffusion model to create believable body/head motion, and a dedicated audio-conditioned lip-sync model as a final pass because the video model's own lip movements are not driven by the actual phonemes.
Use cases: Fully synthetic talking-character shorts with no source footage or recorded voice, Localizing generated character videos into multiple languages/voices, Rapid iteration on scripted avatar content
Pitfalls: Compounding errors across three models: TTS prosody, base-video head motion, and lip-sync fidelity errors stack, and swapping the middle video-generation stage changes how well the final lip-sync pass locks on.

Geeky Kokoro TTS and Sonic Lipsync maker workflow (Civitai)

Video-to-Audio Foley Synthesis

video -> text prompt -> synced audio

Generate/source video
Wan 2.1 / HunyuanVideo
text-to-video
Silent video clip from any generator or real footage
Video+text to audio
MMAudio
text-to-audio
Joint multimodal model conditions on visual + text features and synthesizes synced audio

A silent generated (or real) video plus an optional text caption is fed into a joint video-audio-text diffusion model that synthesizes semantically matched, temporally synchronized sound effects and ambience directly from visual motion.

Stack, controls & more

Example stack: Wan 2.1 (text-to-video) -> MMAudio (video+text -> audio) -> mux audio onto video
Controls: text caption strength, classifier-free guidance scale, audio duration alignment to video FPS, synchronization module weight
Why multi-model: The video generator has no concept of sound; a dedicated video-to-audio model trained on paired video-audio-text data is required to infer materials, impacts, and timing from pixels and produce synchronized waveforms, and it in turn depends on a video encoder distinct from the audio decoder.
Use cases: Auto-Foley for silent AI-generated b-roll, Sound design for product demo videos, Rapid SFX drafts for game trailers
Pitfalls: Struggles with style continuity across cuts (no reference-audio injection pathway) and can hallucinate plausible-but-wrong sound sources for ambiguous visual motion.

MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis (arXiv:2412.15322)
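
The "audio duration alignment to video FPS" control above comes down to simple arithmetic: the generated audio has to span the same wall-clock duration as the clip. The sketch below shows that calculation only. The function names and the 44.1 kHz default are assumptions for illustration, not MMAudio parameters.

```python
def audio_samples_for_clip(frame_count: int, fps: float, sample_rate: int = 44_100) -> int:
    """Number of audio samples covering the same duration as the video clip."""
    if frame_count < 0 or fps <= 0 or sample_rate <= 0:
        raise ValueError("frame_count must be >= 0; fps and sample_rate must be > 0")
    duration_s = frame_count / fps          # clip length in seconds
    return round(duration_s * sample_rate)  # nearest whole sample


def duration_mismatch_ms(frame_count: int, fps: float, n_samples: int,
                         sample_rate: int = 44_100) -> float:
    """Signed difference (audio minus video) in milliseconds, useful before muxing."""
    video_s = frame_count / fps
    audio_s = n_samples / sample_rate
    return (audio_s - video_s) * 1000.0
```

A check like `duration_mismatch_ms` can run just before the mux step, so a track that is too long or too short is trimmed or padded rather than silently drifting.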

Voice-Clone TTS to Talking-Head Lip Sync

text + reference audio + reference face -> audio -> talking-head video

Voice cloning / TTS
Tortoise TTS, ElevenLabs
text-to-audio
zero-shot TTS model (e.g. Tortoise TTS) synthesises speech matching a reference speaker's voice from a text script
Talking-head / lip-sync
Wav2Lip, HeyGen Avatar IV
audio-to-video
lip-sync model (e.g. Wav2Lip, LatentSync) drives mouth and facial motion on a reference face image or video from the generated audio
Enhancement (optional)
GFPGAN
face-restore
super-resolution or face-restore pass to sharpen face details degraded by the video synthesis

A zero-shot voice-cloning TTS model synthesises speech from a text script in the voice of a reference speaker; that generated audio is then fed into a separate lip-sync or talking-head model which drives realistic facial animation on a reference image or video, producing a complete avatar video without studio recording.

Stack, controls & more

Example stack: Tortoise TTS (voice clone) -> Wav2Lip (lip sync) -> GFPGAN (face restore). Commercial equivalent: ElevenLabs voice clone -> HeyGen Avatar IV talking head.
Controls: TTS: speaker reference audio length and quality, speaking rate, emotional guidance. Lip-sync: face-crop padding, sync confidence threshold, video fps. Enhancement: denoise strength.
Why multi-model: Voice cloning (modelling prosody and speaker timbre from a reference clip) and talking-head animation (modelling facial dynamics from audio signals) are separate learned problems with different architectures and training data; chaining specialist models gives independent control over voice and face, and allows either to be swapped without retraining the other.
Use cases: Personalised AI avatar video from scripts, Video dubbing in a target speaker's voice, Automated marketing or training video production, Accessibility narration with a familiar voice
Pitfalls: Temporal jitter when generated audio phoneme boundaries mismatch the lip-sync model's training distribution; unnatural eye-blink and head-pose when the reference face is a still image; voice cloning requires clean reference audio (background noise degrades timbre matching).

A Lightweight Pipeline for Noisy Speech Voice Cloning and Accurate Lip Sync Synthesis (arXiv:2509.12831)
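
The "swap either stage without retraining the other" point above can be sketched as an ordered chain of steps that pass one payload along. Everything here is hypothetical scaffolding: the `Step` class, the payload keys, and the stub functions only stand in for real model calls and produce placeholder strings.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Payload = Dict[str, str]


@dataclass(frozen=True)
class Step:
    model: str                         # e.g. "Tortoise TTS"
    model_type: str                    # node vocabulary value, e.g. "text-to-audio"
    run: Callable[[Payload], Payload]  # stub standing in for the real model call


def run_chain(steps: Tuple[Step, ...], payload: Payload) -> Payload:
    """Run the steps in order, feeding each step's output to the next."""
    for step in steps:
        payload = step.run(payload)
    return payload


def swap_step(steps: Tuple[Step, ...], model_type: str, new_step: Step) -> Tuple[Step, ...]:
    """Return a new chain with every step of the given type replaced."""
    return tuple(new_step if s.model_type == model_type else s for s in steps)


# Stub models (assumptions): they only record what they would have consumed.
def stub_tts(p: Payload) -> Payload:
    return {**p, "audio": "speech(" + p["script"] + ")"}


def stub_lipsync(p: Payload) -> Payload:
    return {**p, "video": "lipsync(" + p["face"] + ", " + p["audio"] + ")"}


def stub_avatar(p: Payload) -> Payload:
    return {**p, "video": "avatar(" + p["face"] + ", " + p["audio"] + ")"}


chain = (
    Step("Tortoise TTS", "text-to-audio", stub_tts),
    Step("Wav2Lip", "audio-to-video", stub_lipsync),
)
result = run_chain(chain, {"script": "hello", "face": "face.png"})

# Swap only the talking-head stage; the TTS step is reused untouched.
swapped = swap_step(chain, "audio-to-video",
                    Step("HeyGen Avatar IV", "audio-to-video", stub_avatar))
```

Because each step only reads and writes payload keys, the voice stage and the face stage stay independent, which is the property the pattern relies on.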

3D (13)

Image or text to 3D mesh, texture, and multi-view chains.

Concept Image to Rigged, Motion-Ready Character (Scenario Uthana Chain)

text->image->3D mesh->rigged 3D->animated 3D

generate character concept
GPT Image 2
text-to-image
create the character concept art
convert to 3D mesh
Hunyuan3D 3.1
image-to-3d
reconstruct a 3D mesh from the concept image

Scenario's platform workflow chain generates a character concept image with GPT Image 2, converts it to a 3D mesh with Hunyuan3D 3.1, runs the mesh through Uthana Character Rigging to auto-place a bipedal skeleton and skin weights in under 30 seconds, then animates the rigged character with Uthana Video-to-Motion (from a reference clip) or Uthana Text-to-Motion (from a text description).

Stack, controls & more

Example stack: GPT Image 2 (text-to-image concept) -> Hunyuan3D 3.1 (image-to-3d mesh) -> Uthana Character Rigging (auto-rig: skeleton + skin weights, no valid modelType slot) -> Uthana Video-to-Motion or Text-to-Motion (animate, no valid modelType slot)
Controls: T-pose/A-pose mesh required (OBJ/GLB/FBX) for the auto-rigger to read shoulder line, hips, and spine cleanly; downstream motion source is either a reference video clip (Video-to-Motion) or a text description (Text-to-Motion)
Why multi-model: Chains a text-to-image model and a separate image-to-3D mesh model as the two genuinely distinct generative-media stages; the downstream Uthana auto-rigger and motion-application steps are utility/retrieval operations (skeleton fitting, library motion application), not additional generative models, but are included because they differentiate the full-chain vendor stack from the existing Tripo + Mixamo pattern.
Use cases: concept-to-animated game character without manual rigging, rapid prototyping of animation-ready NPCs from a single prompt, reusing one auto-generated rig across multiple motion passes
Pitfalls: auto-rigging requires a clean humanoid mesh in a T/A-pose with feet on the ground; non-bipedal or non-canonical-pose meshes from the upstream 3D step can fail to rig correctly. The closest existing pattern is mesh-to-rig-to-animation-chain (Tripo Studio mesh + Mixamo auto-rig + animation library), but this chain uses a different vendor stack throughout (GPT Image 2 concept generation, Hunyuan3D 3.1 reconstruction, and Uthana's auto-rigger plus its own video/text-to-motion generators instead of a Mixamo animation library) and adds an upstream text-to-image concept-generation stage that the Tripo+Mixamo chain does not include.

Scenario blog - Uthana Character Rigging is now on Scenario

Generate-then-Decompose PBR Material (CHORD)

text -> image (flat-lit texture) -> PBR map set (base color, normal, height, roughness, metalness)

Texture generation
Z-Image-Turbo, SDXL
text-to-image
a diffusion model generates a seamless tileable 2D texture from a text prompt (or reference lineart/height input)
PBR decomposition
Ubisoft La Forge CHORD
image-to-image
CHORD estimates SVBRDF maps (base color, normal, roughness, metalness) from the single generated texture using chained decomposition and single-step diffusion inference. This estimation role has no exact vocab match; image-to-image is the closest fit, but CHORD is a distinct generative/estimation model from the texture generator
Height derivation
ChordNormalToHeight
image-to-image
ChordNormalToHeight converts the predicted normal map into a displacement-ready height map

A text prompt is first turned into a seamless, tileable flat-lit texture by an image diffusion model, then Ubisoft La Forge's CHORD model decomposes that single texture into a full physically-based-rendering material map set via SVBRDF estimation.

Stack, controls & more

Example stack: Z-Image-Turbo (text to tileable texture) -> Ubisoft CHORD (SVBRDF decomposition) -> ChordNormalToHeight (height map)
Controls: text prompt for the base texture, optional lineart/height conditioning input, group bypass toggles to run texture-gen and CHORD stages independently or feed in a user-provided texture directly to CHORD
Why multi-model: Generating a plausible texture and estimating physically accurate SVBRDF channels from it (base color, normal, roughness, metalness, plus a height map derived from the predicted normals) are different learned tasks; CHORD is a dedicated decomposition/estimation model chained after a separate text-to-image generator rather than one model doing both.
Use cases: rapid PBR material authoring for game engines and DCC tools, converting AI-generated textures into production-ready material sets, iterating on material ideas without manual SVBRDF capture
Pitfalls: CHORD is released under a research-only license limiting production use; decomposition quality depends on the flatness/lighting of the generated source texture, so poorly lit or shaded input textures produce inaccurate roughness/metalness estimates

ComfyUI official blog: Ubisoft Open-Sources the CHORD Model and ComfyUI Nodes for End-to-End PBR Material Generation (example workflow: chord_zimage_turbo_t2i_image_to_material.json)
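
A height map can be derived from a normal map because each normal encodes the local surface slope, and integrating slopes recovers height up to a constant offset. The 1D sketch below illustrates that idea with a generic cumulative sum. It is not the ChordNormalToHeight algorithm, which the source does not describe. The convention that a normal `(nx, nz)` is proportional to `(-dh/dx, 1)` is an assumption.

```python
from typing import List, Tuple


def heights_from_normals(normals: List[Tuple[float, float]], step: float = 1.0) -> List[float]:
    """Integrate a row of 2D normals (nx, nz) into a height profile.

    Assumed convention: the normal is proportional to (-dh/dx, 1), so the
    slope is -nx / nz. The first height is pinned to 0.0 because integration
    only recovers height up to a constant.
    """
    heights = [0.0]
    for nx, nz in normals:
        if nz <= 0:
            raise ValueError("normals must face upward (nz > 0)")
        slope = -nx / nz                     # normal length cancels in the ratio
        heights.append(heights[-1] + slope * step)
    return heights
```

A full 2D map needs an integration scheme that reconciles the x and y slopes, which is where a dedicated node earns its place in the chain.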

Generated Mesh to Rigged, Animated Character

text/image -> 3D character mesh -> auto-rigged skeleton + skin weights -> retargeted animation

Mesh generation
Tripo Studio
image-to-3d
Text or image prompt produces a static 3D character mesh in T-pose, exported and cleaned for rigging
Auto-rigging
Mixamo
render
Analyzes mesh topology, builds a skeleton, and computes skin weights automatically (learned rigging utility; approximate node type)
Animation application
Mixamo animation library
render
Applies clips from the animation library onto the rigged skeleton (retargeting utility; approximate node type)

Turns a prompt or reference image into a game-ready animated character by generating a T-posed base mesh with one model, automatically detecting a skeleton and computing skin weights with a dedicated auto-rigging service, then applying that service's motion-capture library onto the resulting rig.

Stack, controls & more

Example stack: Tripo Studio (T-pose mesh) -> Blender cleanup -> Mixamo (auto-rig + skin weights) -> Mixamo animation library -> Unity/Unreal
Controls: T-pose generation toggle, mesh cleanup (orientation, vertex merging), skin-weight smoothing, animation clip selection
Why multi-model: Image/text-to-3D generators produce static geometry only, with no joints or deformation; auto-rigging services are trained specifically to infer skeletal topology and per-vertex skin weights from arbitrary mesh geometry, and separately hold the animation clip library. Shape generation and rigging/animation are distinct systems chained through a mesh export/cleanup step.
Use cases: Indie game NPC pipelines from concept art to playable character, Rapid prototyping of animated avatars, Batch-rigging libraries of AI-generated characters
Pitfalls: Auto-rigging struggles with non-humanoid or heavily stylized meshes and can produce broken skin weights at joints; meshes that skip cleanup (non-manifold, wrong facing) fail upload; library clips often need per-character proportion adjustment.

How to Rig an AI-Generated Character for Mixamo: Auto-Rigging Guide (Tripo AI)

Geometry-Aware PBR Material Distillation

untextured 3D mesh + text -> lit multi-view renders -> decomposed PBR material maps

Base mesh
Hunyuan3D 2.1, TRELLIS, Tripo
image-to-3d
Any existing untextured/bare mesh generator supplies the input geometry
Geometry+light-conditioned shading
DreamMat geometry- and light-aware diffusion model
image-to-image
Diffusion model fine-tuned to render the mesh under a specified environment light, avoiding generic unconditioned RGB textures
PBR distillation
DreamMat inverse-rendering distillation
render
Inverse-rendering optimization decomposes the shaded multi-view renders into albedo, roughness, and metallic maps consistent across views

Takes an existing bare mesh and generates physically-based material maps (albedo, roughness, metallic) matched to its geometry, using a diffusion model conditioned on the mesh's own geometry and a chosen lighting environment, then distilling the shaded outputs into decomposed BRDF parameters via inverse rendering.

Stack, controls & more

Example stack: Hunyuan3D 2.1 (bare mesh) -> DreamMat diffusion -> DreamMat PBR distillation -> Blender/Unreal import
Controls: environment light choice/rotation during conditioning, number of viewpoints rendered, per-material text prompt, distillation iteration count, mesh UV parameterization
Why multi-model: An image-to-3D model only produces baked-in RGB texture or bare geometry; it cannot separate real material properties from shading. DreamMat needs a geometry- and light-aware diffusion model to render plausible shaded views under controlled illumination, plus a separate inverse-rendering distillation to decompose those views into albedo/roughness/metallic maps free of baked-in lighting.
Use cases: Relightable game/film asset texturing, Converting scanned or generated bare meshes into engine-ready PBR assets, Batch material re-skinning of an existing mesh library
Pitfalls: Without light-aware conditioning, shading bakes into albedo and looks wrong under new lighting; multi-view inconsistency causes seams at UV boundaries; distillation is compute-heavy per asset (minutes, not seconds).

DreamMat: High-quality PBR Material Generation with Geometry- and Light-aware Diffusion Models (SIGGRAPH 2024, arXiv:2405.17176)

Geometry-Then-Texture 3D Asset Pipeline

image -> 3D mesh -> textured 3D asset

Image conditioning
SDXL
text-to-image
optional text-to-image step to produce a reference image when starting from a prompt
Geometry generation
Step1X-3D, Hunyuan3D-DiT
image-to-3d
VAE-DiT or flow-based diffusion model produces untextured mesh / TSDF from the reference image
Texture synthesis
Step1X-3D Paint, Hunyuan3D-Paint
render
multi-view diffusion model (fine-tuned from SDXL) synthesises PBR texture maps conditioned on the geometry and input image, then bakes them onto the mesh

A first model generates untextured 3D geometry (mesh / TSDF) from a single image or text prompt; a second, separate diffusion model synthesises high-resolution texture maps conditioned on that geometry, yielding a fully textured, PBR-ready 3D asset.

Stack, controls & more

Example stack: Step1X-3D geometry stage (VAE-DiT TSDF) -> Step1X-3D texture stage (SD-XL-fine-tuned multi-view generator) -> mesh export. Alternative: Hunyuan3D-DiT shape -> Hunyuan3D-Paint texture -> GLB.
Controls: Input image quality and field of view; geometry resolution cap; texture diffusion guidance scale; number of synthesis views; UV unwrap quality; PBR channel selection (albedo, metallic, roughness, normal).
Why multi-model: Geometry generation and texture synthesis have conflicting objectives: the shape model optimises for accurate 3D form while the texture model optimises for view-consistent, photorealistic appearance conditioned on that form. Splitting into two specialist models lets each solve its own problem; the texture model explicitly conditions on the finished geometry rather than trying to hallucinate both simultaneously.
Use cases: Game asset prototyping from concept art, Product visualisation from a single photo, 3D printing prep from reference images, Rapid 3D scene population for VFX
Pitfalls: Texture seams and occlusion holes where multi-view coverage is sparse; UV inpainting quality degrades for highly concave surfaces; geometry errors propagate into texture conditioning (garbage-in, garbage-out); PBR decomposition may bake lighting into albedo.

Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets (arXiv:2505.07747)

Hunyuan3D-2 Image-to-Mesh

image -> textured mesh

Shape
Hunyuan3D-2
image-to-3d
image to 3D shape via diffusion
Texture
Hunyuan3D-2 Paint
image-to-image
PBR texture synthesis onto the mesh
Retopo/finish
Remesh
image-to-image
optional cleanup and remesh

A two-stage shape-then-texture pipeline: a diffusion model produces a clean 3D shape from a single image, then a second model bakes PBR textures onto it for a production-ready mesh.

Stack, controls & more

Example stack: Hunyuan3D-2 (shape + texture) -> remesh.
Controls: Input image; view conditioning; texture resolution; seed.
Why multi-model: Shape and texture have different distributions and resolutions; splitting them into a shape model and a texture model yields far cleaner meshes than a single joint generator.
Use cases: Printable miniatures, Game props, Product 3D shots
Pitfalls: Back and underside faces are guessed from a single view; complex topology retopologizes poorly.

Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets (arXiv:2501.12202)

Multi-Angle Turnaround to Rigged Game Asset (fal Game Assets)

text->image->image(xN)->3D mesh

generate base object
fal-ai/bytedance/seedream/v4
text-to-image
create the object concept image
generate turnaround views
fal-ai/bytedance/seedream/v4/edit
image-to-image
3 additional angle views (left, right, back) generated from the concept image via instruction edit; with the original front view they form the 4-view set
reconstruct 3D mesh
fal-ai/hyper3d/rodin/v2
image-to-3d
feed the multi-view image set into Rodin's multi-view concat mode for accurate mesh reconstruction

fal.ai's 'Game Assets' workflow template generates an object image with Seedream, then uses Seedream Edit to produce left/right/back angle views of that same object, then feeds the multi-view set into Hyper3D Rodin v2 to reconstruct a textured 3D mesh in one API call.

Stack, controls & more

Example stack: Seedream v4 (text-to-image) -> Seedream v4 Edit (multi-view turnaround, 3x) -> Hyper3D Rodin v2 (multi-view image-to-3d)
Controls: prompt for object concept; edit prompts specify angle (left side/right side/back view) and background/texture cleanup; Rodin multi-view concat mode input
Why multi-model: Chains a text-to-image generator, an instruction-based image editor (for consistent multi-view turnaround), and a dedicated multi-view-conditioned 3D reconstruction model, each a separate hosted model exposed as one workflow endpoint.
Use cases: game-ready prop/asset generation from a single concept prompt, turning a product concept into a 3D asset without a photo shoot, rapid prototyping of 3D collectibles/NPCs
Pitfalls: edited turnaround views can drift in proportions/details between angles since Seedream Edit has no hard 3D consistency guarantee, which can degrade Rodin's reconstruction versus true multi-view-diffusion methods (e.g. Zero123/SyncDreamer) that are trained explicitly for view consistency; closest existing pattern is multiview-diffusion-to-3d-reconstruction (Zero123/SyncDreamer -> NeuS/InstantMesh) but this chain substitutes a general-purpose text-to-image + instruction-edit pair for the multi-view-diffusion step and Rodin v2 for the NeuS/InstantMesh reconstruction stage, a materially different model family and generation goal (single-prompt asset creation, not reconstruction from an existing photo)

fal.ai Workflow Templates - Game Assets
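
The chain above (one concept image, three angle edits, one multi-view reconstruction) can be written down as an ordered plan before any model is called. The sketch below only builds that plan as plain dictionaries and makes no network requests. The model identifiers come from the steps listed above; the dictionary keys, prompt wording, and `input_steps` indexing are hypothetical and not fal.ai request fields.

```python
from typing import Any, Dict, List

ANGLES = ("left side", "right side", "back")


def build_turnaround_plan(concept_prompt: str) -> List[Dict[str, Any]]:
    """Return the ordered steps for a concept -> turnaround -> 3D mesh chain."""
    plan: List[Dict[str, Any]] = [{
        "model": "fal-ai/bytedance/seedream/v4",
        "modelType": "text-to-image",
        "prompt": concept_prompt,
        "input_steps": [],                 # no upstream dependency
    }]
    for angle in ANGLES:
        plan.append({
            "model": "fal-ai/bytedance/seedream/v4/edit",
            "modelType": "image-to-image",
            "prompt": f"Show the same object from the {angle}, plain background",
            "input_steps": [0],            # every edit starts from the concept image
        })
    plan.append({
        "model": "fal-ai/hyper3d/rodin/v2",
        "modelType": "image-to-3d",
        "prompt": None,
        # the front view (step 0) plus the three edited views
        "input_steps": list(range(len(ANGLES) + 1)),
    })
    return plan
```

Deriving every edit from step 0, rather than chaining edit to edit, keeps drift from accumulating across angles, although it does not remove the consistency risk noted in the pitfalls.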

Multi-View Diffusion to 3D Reconstruction

single image -> multi-view images -> 3D mesh

View-consistent image synthesis
SyncDreamer, Zero123++
image-to-3d
multi-view diffusion model (e.g. SyncDreamer, Zero123++) generates 4-16 geometrically consistent views from a single input image via a synchronised reverse diffusion process
Neural 3D reconstruction
NeuS, InstantMesh
image-to-3d
reconstruction algorithm (NeuS, NeRF, or InstantMesh) fits a 3D representation from the synthesised views without SDS loss
Mesh export / texture bake (optional)
Marching Cubes
render
optional marching-cubes mesh extraction and UV texture bake from the NeRF or SDF volume

A multi-view diffusion model generates a set of geometrically consistent novel-view images of an object from a single reference image; those synthetic views are then passed to a neural reconstruction algorithm (NeuS, NeRF, or an instant-reconstruction model) to recover the full 3D mesh without needing real multi-view capture. Note: no 'multi-view' family enum value exists; draft-to-finish is the closest fit for this two-stage view-synthesis-then-reconstruction chain.

Stack, controls & more

Example stack: SyncDreamer (16-view generation) -> NeuS mesh reconstruction -> marching cubes export. Alternative: Zero123++ views -> InstantMesh reconstruction.
Controls: Number of synthesised views; elevation and azimuth sampling angles; diffusion guidance scale; reconstruction iteration count; NeuS surface threshold; optional mask for background removal before reconstruction.
Why multi-model: A single-image-to-3D model must simultaneously understand 2D appearance and 3D structure, which is under-constrained. Splitting into a view-synthesis model (which inherits rich 2D priors from large diffusion training) and a reconstruction model (which specialises in 3D geometry from image sets) lets each focus on its strength, and the reconstruction model receives denser, consistent multi-view supervision.
Use cases: Single-photo 3D object digitisation for e-commerce, Game asset creation from concept art, 3D model generation for AR placement, Museum / heritage object 3D capture from photographs
Pitfalls: Multi-view consistency is imperfect for textureless or symmetric objects; NeuS reconstruction quality degrades when generated views have geometric drift; thin structures (wires, hair) are not recovered well by implicit surface methods.

SyncDreamer: Generating Multiview-consistent Images from a Single-view Image (project page)
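
The "elevation and azimuth sampling angles" control above determines where the synthesised views sit around the object. As a minimal geometric sketch, assuming evenly spaced azimuths at one fixed elevation and a z-up coordinate frame (both assumptions, not settings taken from SyncDreamer or Zero123++), the camera positions can be computed as follows.

```python
import math
from typing import List, Tuple


def orbit_camera_positions(n_views: int, elevation_deg: float,
                           radius: float = 1.0) -> List[Tuple[float, float, float]]:
    """Camera centres on a ring around the origin, all looking inward.

    Assumes a z-up frame: azimuth rotates about z, elevation lifts the
    camera above the xy-plane.
    """
    if n_views < 1:
        raise ValueError("n_views must be at least 1")
    elev = math.radians(elevation_deg)
    positions = []
    for i in range(n_views):
        azim = 2.0 * math.pi * i / n_views        # evenly spaced around the object
        x = radius * math.cos(elev) * math.cos(azim)
        y = radius * math.cos(elev) * math.sin(azim)
        z = radius * math.sin(elev)
        positions.append((x, y, z))
    return positions
```

The same poses have to be handed to the reconstruction stage, since it fits geometry under the assumption that each generated image was seen from its nominal viewpoint.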

Photogrammetry to Surface-Aligned Mesh

multi-view photos/video -> camera poses -> 3D Gaussian splats -> textured surface mesh

Pose recovery
COLMAP
depth
Structure-from-Motion estimates camera intrinsics/extrinsics and a sparse point cloud from overlapping images
Radiance capture
3D Gaussian Splatting
render
3D Gaussian Splatting trains per-scene on the COLMAP poses to produce a photoreal, view-consistent splat cloud
Surface extraction
SuGaR
image-to-3d
Surface-alignment regularization plus Poisson reconstruction pulls a coherent mesh out of the splats, then bakes a UV texture

Reconstructs a real-world object from ordinary photos or a video walkthrough: Structure-from-Motion recovers camera poses, Gaussian Splatting trains a photoreal radiance field on those poses, then a surface-extraction pass converts the volumetric splats into a clean, editable, UV-textured mesh.

Stack, controls & more

Example stack: COLMAP -> 3D Gaussian Splatting -> SuGaR -> Blender/Unreal/Unity import
Controls: capture density/overlap, COLMAP matcher settings, 3DGS training iterations and densification thresholds, SuGaR regularization type, target mesh face count, Poisson depth
Why multi-model: COLMAP solves camera geometry but produces no renderable surface; 3D Gaussian Splatting renders photoreal views but is a volumetric point representation with no faces, UVs, or watertight geometry; SuGaR is a distinct surface-alignment and Poisson-extraction stage needed to pull an editable mesh out of the splats. Each stage solves a problem the others cannot.
Use cases: Photogrammetry asset capture from a phone video, Location scanning for virtual production, Cultural heritage digitization with editable, engine-ready output, AR/VR asset creation from real objects
Pitfalls: Poor image overlap or motion blur makes COLMAP fail pose estimation, propagating floaters and fractured geometry; reflective/transparent surfaces break both SfM matching and Gaussian convergence; mesh extraction can lose fine detail present in the raw splats.

SuGaR: Surface-Aligned Gaussian Splatting for Efficient 3D Mesh Reconstruction (CVPR 2024)

Segment-Then-Reconstruct Scene Kitbashing

single cluttered photo -> per-object segmentation masks -> individually reconstructed posed 3D objects

Object segmentation
SAM 3
image-to-image
Segments every distinct object instance into individual masks, handling occlusion and clutter
Per-object 3D reconstruction
SAM 3D Objects
image-to-3d
Each masked object is independently lifted into a full 3D shape, texture, and pose estimate

Extracts every distinct object from one ordinary photo of a cluttered scene and reconstructs each as its own posed, textured 3D asset, by first running a general-purpose segmentation model to isolate object masks, then feeding each mask into a dedicated single-image-to-3D reconstruction model.

Stack, controls & more

Example stack: SAM 3 (segmentation) -> SAM 3D Objects (per-object 3D reconstruction) -> scene assembly in Blender/Unity
Controls: mask selection/refinement (click or box prompts), per-object occlusion handling, output pose alignment to the original camera view, which detected objects to reconstruct
Why multi-model: A 2D segmentation model identifies what and where objects are but has no concept of 3D geometry; a single-image 3D reconstruction model lifts one already-isolated object into 3D but cannot first find and separate multiple objects in a cluttered scene. SAM 3D Objects reconstructs one selected object at a time and depends on an upstream segmentation step.
Use cases: Rapid kitbashing of game props from a single reference photo, AR product placement extracted from lifestyle photography, Building a 3D asset library from real-world photo references
Pitfalls: Reconstructs objects one at a time with no reasoning about physical interactions, so assembled scenes can show interpenetration or floating objects; heavy occlusion degrades reconstruction; segmentation errors propagate into missing or fused assets.

Introducing SAM 3D: 3D Reconstruction for Physical World Images (Meta AI)
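
Because each object is reconstructed independently, the scene-assembly step benefits from a cheap sanity check for the interpenetration pitfall noted above. The sketch below flags pairs of objects whose axis-aligned bounding boxes overlap. It is a hypothetical post-check, not part of SAM 3 or SAM 3D Objects, and the box format is an assumption.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]
Box = Tuple[Vec3, Vec3]  # (min_xyz, max_xyz) in the shared scene frame


def boxes_overlap(a: Box, b: Box) -> bool:
    """True when two axis-aligned boxes share interior volume on all three axes."""
    (amin, amax), (bmin, bmax) = a, b
    return all(amin[i] < bmax[i] and bmin[i] < amax[i] for i in range(3))


def find_interpenetrations(objects: Dict[str, Box]) -> List[Tuple[str, str]]:
    """Return name pairs (sorted by name) whose bounding boxes overlap."""
    items = sorted(objects.items())
    return [(n1, n2) for (n1, b1), (n2, b2) in combinations(items, 2)
            if boxes_overlap(b1, b2)]
```

Bounding-box overlap is conservative: it can flag pairs whose meshes do not actually touch, so a hit is a prompt to inspect or re-pose the objects, not proof of a collision.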

StableGen TRELLIS mesh-gen to multi-ControlNet texture to PBR decomposition (Blender)

text/image -> 3d mesh -> image (x N views) -> PBR maps

mesh generation
TRELLIS.2
image-to-3d
TRELLIS.2 generates a mesh from a text prompt or single image (optional stage; can also texture an existing mesh)
geometry control
ControlNet depth/canny/normal
controlnet
depth/canny/normal maps rendered per-camera-view feed multiple simultaneous ControlNet units
multi-view texture synthesis
SDXL, FLUX.1 dev, Qwen Image Edit
image-to-image
SDXL, FLUX.1-dev, or Qwen Image Edit generates textures per viewpoint with IPAdapter reference-style guidance, projected and blended onto the mesh
PBR decomposition
Marigold, StableDelight
image-to-image
optional post-pass decomposes the baked albedo into roughness/metallic/normal/height/AO/emission maps

Blender addon StableGen chains TRELLIS.2 (text/image-to-3D mesh generation) into a ComfyUI-driven multi-viewpoint diffusion texturing pass using SDXL/FLUX/Qwen-Image-Edit with simultaneous depth+canny+normal ControlNet and IPAdapter style guidance, then optionally decomposes the generated texture into full PBR material maps via Marigold/StableDelight, all inside one Blender workflow.

Stack, controls & more

Example stack: TRELLIS.2 mesh -> multi-view depth+canny ControlNet + IPAdapter SDXL texturing -> Marigold/StableDelight PBR decomposition, driven from the StableGen Blender panel via a local ComfyUI server
Controls: ControlNet weight per unit (depth/canny/normal), IPAdapter reference image and weight, generation strategy (sequential/grid/separate per view), per-region local-edit masks, PBR map toggles
Why multi-model: Distinct from baseline's geometry-then-texture-3d and geometry-aware-pbr-material-distillation because it locks simultaneous multi-ControlNet + IPAdapter conditioning across many camera viewpoints in a single texturing pass (not a single depth pass), and unifies mesh generation, multi-view texturing, and in-addon PBR decomposition as one chain rather than separate tools; TRELLIS.2, SDXL/FLUX/Qwen diffusion, ControlNet, IPAdapter, and Marigold/StableDelight are five distinct model families invoked in sequence.
Use cases: game-ready PBR-textured props from a prompt, re-texturing existing game/VFX meshes with geometry-locked multi-view AI textures, local editing of specific texture regions without a full re-generation
Pitfalls: Multi-view seams/blending are projection-based and can show artifacts at UV boundaries; PBR decomposition is a heuristic estimate, not measured material data; requires a running local ComfyUI server plus VRAM for SDXL/FLUX

StableGen GitHub README

Text-to-Image-Depth Room Mesh Fusion

text -> per-view 2D renders -> per-view depth -> fused, inpainted textured room mesh

Initial view synthesis
Stable Diffusion
text-to-image
Diffusion model renders the first room view from the text prompt
Depth lifting
MiDaS-family depth estimator
depth
Monocular depth estimation converts each 2D frame into 3D geometry aligned with prior frames
Gap inpainting
Stable Diffusion inpainting
inpaint
Text-conditioned inpainting fills regions disoccluded by camera movement before fusing into the mesh
Mesh fusion
Text2Room mesh fusion
render
Continuous alignment iteratively fuses each new frame with the existing mesh, selecting next viewpoints to maximize coverage

Generates a room-scale 3D environment purely from a text prompt by iteratively rendering images at chosen camera poses, lifting each into 3D with monocular depth, and fusing/inpainting the growing mesh so each new view integrates seamlessly with previously generated geometry.

Stack, controls & more

Example stack: Stable Diffusion -> monocular depth -> Stable Diffusion inpainting -> Text2Room fusion -> textured room mesh (OBJ)
Controls: text prompt per region/object, camera viewpoint trajectory/selection strategy, depth alignment tolerance, inpainting mask region, mesh simplification thresholds
Why multi-model: A text-to-image model produces flat 2D pixels with no camera-consistent geometry; a monocular depth model lifts a single image to a partial point cloud but cannot generate novel content or fill disocclusions; a text-conditioned inpainting model fills the gaps exposed as the camera moves, and a mesh-fusion algorithm stitches each frame into one consistent mesh.
Use cases: Rapid environment blockouts for game levels from text briefs, Virtual production background generation, Synthetic training environments for embodied AI, Architectural concept walkthroughs
Pitfalls: Depth misalignment across frames causes seams, ghosting, or duplicated geometry; viewpoint selection can miss occluded regions leaving holes; style can drift between frames; scales to room interiors but not large open scenes.

Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models (ICCV 2023, arXiv:2303.11989)
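
The depth-lifting step above turns each pixel plus its estimated depth into a 3D point. A minimal sketch of that unprojection under a standard pinhole camera model is shown below. The function name and the intrinsics `fx`, `fy`, `cx`, `cy` are illustrative assumptions, and the sketch omits the depth alignment and mesh fusion that the pipeline performs afterwards.

```python
from typing import Tuple


def unproject_pixel(u: float, v: float, depth: float,
                    fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float, float]:
    """Lift pixel (u, v) with metric depth into camera-space coordinates.

    Pinhole model: x = (u - cx) * depth / fx, y = (v - cy) * depth / fy, z = depth.
    """
    if depth <= 0 or fx <= 0 or fy <= 0:
        raise ValueError("depth and focal lengths must be positive")
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)
```

Monocular depth estimates are generally consistent only up to scale and shift, which is why the depth of each new frame has to be aligned against existing geometry before its points are fused.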

TRELLIS Structured-3D-Latents Pipeline

text/image -> 3D (mesh, radiance field, 3DGS)

Latent generate
TRELLIS, TRELLIS.2
image-to-3d
image or text to structured 3D latent
Decode
TRELLIS decoder
render
decode to mesh / volume / 3DGS representation
Texture/finish
Retopo
image-to-image
optional texture refinement or retopo

Generate a 3D asset in a structured latent space, then decode it to whichever representation the pipeline needs (mesh, volume, or Gaussian splats) from a single forward pass.

Stack, controls & more

Example stack: TRELLIS (or TRELLIS.2) -> texture refine -> retopo.
Controls: Input image or text; output format (mesh/3DGS/volume); guidance strength; seed.
Why multi-model: A single representation cannot serve every downstream use; the structured-latent decoder lets one model emit interchangeable mesh, voxel, and 3DGS outputs that each need different post-processing.
Use cases: Game and VFX assets, AR product previews, Rapid concept modeling
Pitfalls: Thin structures and interiors are under-represented; topology needs retopology before production use.

TRELLIS: Structured 3D Latents for Scalable and Versatile 3D Generation (arXiv:2412.01506)