Concept Image to Rigged, Motion-Ready Character (Scenario Uthana Chain)
text -> image -> 3D mesh -> rigged 3D -> animated 3D
Scenario's platform workflow chain generates a character concept image with GPT Image 2, converts it to a 3D mesh with Hunyuan3D 3.1, runs the mesh through Uthana Character Rigging to auto-place a bipedal skeleton and skin weights in under 30 seconds, then animates the rigged character with Uthana Video-to-Motion (from a reference clip) or Uthana Text-to-Motion (from a text description).
- Example stack
- GPT Image 2 (text-to-image concept) -> Hunyuan3D 3.1 (image-to-3d mesh) -> Uthana Character Rigging (auto-rig: skeleton + skin weights, no valid modelType slot) -> Uthana Video-to-Motion or Text-to-Motion (animate, no valid modelType slot)
- Controls
- T-pose/A-pose mesh required (OBJ/GLB/FBX) for the auto-rigger to read shoulder line, hips, and spine cleanly; downstream motion source is either a reference video clip (Video-to-Motion) or a text description (Text-to-Motion)
- Why multi-model
- Chains a text-to-image model and a separate image-to-3D mesh model as the two genuinely distinct generative-media stages; the downstream Uthana auto-rigger and motion-application steps are utility/retrieval operations (skeleton fitting, library motion application), not additional generative models, but are included because they differentiate the full-chain vendor stack from the existing Tripo + Mixamo pattern.
- Use cases
- concept-to-animated game character without manual rigging, rapid prototyping of animation-ready NPCs from a single prompt, reusing one auto-generated rig across multiple motion passes
- Pitfalls
- auto-rigging requires a clean humanoid mesh in a T- or A-pose with feet on the ground; non-bipedal or non-canonical-pose meshes from the upstream 3D step can fail to rig correctly; the closest existing pattern is mesh-to-rig-to-animation-chain (Tripo Studio mesh + Mixamo auto-rig + animation library), but this chain uses a different vendor stack throughout (GPT Image 2 concept generation, Hunyuan3D 3.1 reconstruction, and Uthana's auto-rigger plus Uthana's own video/text-to-motion generators instead of a Mixamo animation library) and adds an upstream text-to-image concept-generation stage that the Tripo + Mixamo chain does not include
Scenario blog - Uthana Character Rigging is now on Scenario
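The T/A-pose and feet-on-ground requirement from the pitfalls above can be checked before a mesh is sent to the auto-rigger. The following is a minimal sketch, not part of Scenario's or Uthana's API: the function name `preflight_rig_check`, the Y-up axis convention, and both thresholds are assumptions chosen for illustration.

```python
from typing import Dict, Sequence, Tuple

Vertex = Tuple[float, float, float]  # (x, y, z); Y is assumed to be the up axis


def preflight_rig_check(
    vertices: Sequence[Vertex],
    ground_tol: float = 0.02,      # hypothetical: allowed foot offset, as a fraction of height
    min_span_ratio: float = 0.6,   # hypothetical: minimum width/height for spread arms
) -> Dict[str, bool]:
    """Heuristic pose and grounding checks for a humanoid mesh before auto-rigging."""
    xs = [v[0] for v in vertices]
    ys = [v[1] for v in vertices]
    height = max(ys) - min(ys)
    width = max(xs) - min(xs)
    if height <= 0:
        raise ValueError("mesh has no vertical extent")
    return {
        # Lowest vertex should sit close to y = 0.
        "feet_on_ground": abs(min(ys)) <= ground_tol * height,
        # Arms held out to the sides make the mesh roughly as wide as it is tall.
        "arms_spread": width / height >= min_span_ratio,
    }


# A T-posed figure standing on the ground passes both checks.
t_pose = [(-0.9, 1.4, 0.0), (0.9, 1.4, 0.0), (0.0, 0.0, 0.0), (0.0, 1.8, 0.0)]
report = preflight_rig_check(t_pose)
```

A bounding-box test like this only catches gross problems (arms at the sides, a floating mesh); it cannot confirm that the mesh is actually bipedal.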
Generate-then-Decompose PBR Material (CHORD)
text -> image (flat-lit texture) -> PBR map set (base color, normal, height, roughness, metalness)
A text prompt is first turned into a seamless, tileable flat-lit texture by an image diffusion model, then Ubisoft La Forge's CHORD model decomposes that single texture into a full physically-based-rendering material map set via SVBRDF estimation.
- Example stack
- Z-Image-Turbo (text to tileable texture) -> Ubisoft CHORD (SVBRDF decomposition) -> ChordNormalToHeight (height map)
- Controls
- text prompt for the base texture, optional lineart/height conditioning input, group bypass toggles to run texture-gen and CHORD stages independently or feed in a user-provided texture directly to CHORD
- Why multi-model
- Generating a plausible texture and estimating physically accurate SVBRDF channels (base color, normal, height, roughness, metalness) from it are different learned tasks; CHORD is a dedicated decomposition/estimation model chained after a separate text-to-image generator rather than one model doing both.
- Use cases
- rapid PBR material authoring for game engines and DCC tools, converting AI-generated textures into production-ready material sets, iterating on material ideas without manual SVBRDF capture
- Pitfalls
- CHORD is released under a research-only license limiting production use; decomposition quality depends on the flatness/lighting of the generated source texture, so poorly lit or shaded input textures produce inaccurate roughness/metalness estimates
ComfyUI official blog: Ubisoft Open-Sources the CHORD Model and ComfyUI Nodes for End-to-End PBR Material Generation (example workflow: chord_zimage_turbo_t2i_image_to_material.json)
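Because the decomposition stage assumes a seamless, tileable source texture, it can help to sanity-check the generated image before passing it on. The sketch below is an illustrative heuristic only and is not part of CHORD or its ComfyUI nodes: `seam_ratio` is a hypothetical helper that compares the wrap-around difference between the last and first pixel columns with the average difference between neighbouring interior columns, on a grayscale image given as nested lists.

```python
import math
from typing import List


def seam_ratio(img: List[List[float]]) -> float:
    """Ratio of the wrap-around column step to the mean interior column step.

    Values near 1 suggest the left/right edges tile like the interior;
    large values suggest a visible vertical seam when the texture repeats.
    """
    h, w = len(img), len(img[0])
    if w < 2:
        raise ValueError("need at least two columns")

    def col_diff(a: int, b: int) -> float:
        # Mean absolute difference between two pixel columns.
        return sum(abs(img[y][a] - img[y][b]) for y in range(h)) / h

    interior = sum(col_diff(x, x + 1) for x in range(w - 1)) / (w - 1)
    seam = col_diff(w - 1, 0)
    if interior == 0:
        return 0.0 if seam == 0 else math.inf
    return seam / interior


# A pattern that repeats every 4 pixels tiles cleanly; a left-to-right ramp does not.
periodic = [[0.0, 1.0, 0.0, -1.0] * 4 for _ in range(3)]
ramp = [[x / 15 for x in range(16)] for _ in range(3)]
```

The same check applies to the top/bottom edge by transposing the image first.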
Generated Mesh to Rigged, Animated Character
text/image -> 3D character mesh -> auto-rigged skeleton + skin weights -> retargeted animation
Turns a prompt or reference image into a game-ready animated character by generating a T-posed base mesh with one model, automatically detecting a skeleton and computing skin weights with a dedicated auto-rigging service, then applying that service's motion-capture library to the resulting rig.
- Example stack
- Tripo Studio (T-pose mesh) -> Blender cleanup -> Mixamo (auto-rig + skin weights) -> Mixamo animation library -> Unity/Unreal
- Controls
- T-pose generation toggle, mesh cleanup (orientation, vertex merging), skin-weight smoothing, animation clip selection
- Why multi-model
- Image/text-to-3D generators produce static geometry only, with no joints or deformation; auto-rigging services are trained specifically to infer skeletal topology and per-vertex skin weights from arbitrary mesh geometry, and separately hold the animation clip library. Shape generation and rigging/animation are distinct systems chained through a mesh export/cleanup step.
- Use cases
- Indie game NPC pipelines from concept art to playable character, Rapid prototyping of animated avatars, Batch-rigging libraries of AI-generated characters
- Pitfalls
- Auto-rigging struggles with non-humanoid or heavily stylized meshes and can produce broken skin weights at joints; meshes that skip cleanup (non-manifold, wrong facing) fail upload; library clips often need per-character proportion adjustment.
How to Rig an AI-Generated Character for Mixamo: Auto-Rigging Guide (Tripo AI)
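Broken skin weights at joints are often repaired after export by pruning weak influences and renormalising each vertex so its weights sum to 1. The sketch below shows that generic cleanup step; it is not Mixamo's algorithm, and the function name, the joint names, and the default cap of four influences per vertex (a common limit in real-time engines) are assumptions for illustration.

```python
from typing import Dict


def normalize_skin_weights(
    weights: Dict[str, float], max_influences: int = 4
) -> Dict[str, float]:
    """Keep the strongest joint influences for one vertex and rescale them to sum to 1."""
    positive = {joint: w for joint, w in weights.items() if w > 0}
    if not positive:
        raise ValueError("vertex has no positive skin weights")
    # Strongest influences first; ties keep their original order (stable sort).
    strongest = sorted(positive.items(), key=lambda kv: kv[1], reverse=True)
    kept = strongest[:max_influences]
    total = sum(w for _, w in kept)
    return {joint: w / total for joint, w in kept}


# Hypothetical hip-area vertex with five influences that sum to 1.0 before pruning.
raw = {"hip": 0.5, "spine": 0.3, "thigh_l": 0.1, "thigh_r": 0.05, "knee_l": 0.05}
clean = normalize_skin_weights(raw)
```

After pruning, the four remaining weights are rescaled so the vertex still deforms with full weight.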
Geometry-Aware PBR Material Distillation
untextured 3D mesh + text -> lit multi-view renders -> decomposed PBR material maps
Takes an existing bare mesh and generates physically-based material maps (albedo, roughness, metallic) matched to its geometry, using a diffusion model conditioned on the mesh's own geometry and a chosen lighting environment, then distilling the shaded outputs into decomposed BRDF parameters via inverse rendering.
- Example stack
- Hunyuan3D 2.1 (bare mesh) -> DreamMat diffusion -> DreamMat PBR distillation -> Blender/Unreal import
- Controls
- environment light choice/rotation during conditioning, number of viewpoints rendered, per-material text prompt, distillation iteration count, mesh UV parameterization
- Why multi-model
- An image-to-3D model only produces baked-in RGB texture or bare geometry; it cannot separate real material properties from shading. DreamMat needs a geometry- and light-aware diffusion model to render plausible shaded views under controlled illumination, plus a separate inverse-rendering distillation to decompose those views into albedo/roughness/metallic maps free of baked-in lighting.
- Use cases
- Relightable game/film asset texturing, Converting scanned or generated bare meshes into engine-ready PBR assets, Batch material re-skinning of an existing mesh library
- Pitfalls
- Without light-aware conditioning, shading bakes into albedo and looks wrong under new lighting; multi-view inconsistency causes seams at UV boundaries; distillation is compute-heavy per asset (minutes, not seconds).
DreamMat: High-quality PBR Material Generation with Geometry- and Light-aware Diffusion Models (SIGGRAPH 2024, arXiv:2405.17176)
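The "shading bakes into albedo" pitfall can be shown with a few lines of arithmetic. This is a toy diffuse-only model (observed colour = albedo x light colour per channel), not DreamMat's renderer; the colour values and the `shade` helper are made up for illustration.

```python
from typing import Tuple

RGB = Tuple[float, float, float]


def shade(albedo: RGB, light: RGB) -> RGB:
    """Toy diffuse shading: per-channel product of surface albedo and light colour."""
    return tuple(a * l for a, l in zip(albedo, light))


true_albedo: RGB = (0.8, 0.2, 0.2)   # hypothetical red material
warm_light: RGB = (1.0, 0.8, 0.6)    # lighting present when the texture was generated
cool_light: RGB = (0.6, 0.8, 1.0)    # new lighting in the target scene

# A texture generated under warm light and mistaken for albedo already contains that light.
baked_texture = shade(true_albedo, warm_light)

correct = shade(true_albedo, cool_light)    # relit from the true albedo
wrong = shade(baked_texture, cool_light)    # relit from the baked texture: light applied twice
```

In the `wrong` result the warm light is still multiplied into every channel, so the blue channel comes out darker than it should under the cool light. Separating albedo from illumination is what the light-aware conditioning and the distillation stage are for.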
Geometry-Then-Texture 3D Asset Pipeline
image -> 3D mesh -> textured 3D asset
A first model generates untextured 3D geometry (mesh / TSDF) from a single image or text prompt; a second, separate diffusion model synthesises high-resolution texture maps conditioned on the produced geometry, yielding a fully textured, PBR-ready 3D asset.
- Example stack
- Step1X-3D geometry stage (VAE-DiT TSDF) -> Step1X-3D texture stage (SD-XL-fine-tuned multi-view generator) -> mesh export. Alternative: Hunyuan3D-DiT shape -> Hunyuan3D-Paint texture -> GLB.
- Controls
- Input image quality and field of view; geometry resolution cap; texture diffusion guidance scale; number of synthesis views; UV unwrap quality; PBR channel selection (albedo, metallic, roughness, normal).
- Why multi-model
- Geometry generation and texture synthesis have conflicting objectives: the shape model optimises for accurate 3D form while the texture model optimises for view-consistent, photorealistic appearance conditioned on that form. Splitting into two specialist models lets each solve its own problem; the texture model explicitly conditions on the finished geometry rather than trying to hallucinate both simultaneously.
- Use cases
- Game asset prototyping from concept art, Product visualisation from a single photo, 3D printing prep from reference images, Rapid 3D scene population for VFX
- Pitfalls
- Texture seams and occlusion holes where multi-view coverage is sparse; UV inpainting quality degrades for highly concave surfaces; geometry errors propagate into texture conditioning (garbage-in, garbage-out); PBR decomposition may bake lighting into albedo.
Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets (arXiv:2505.07747)
Hunyuan3D-2 Image-to-Mesh
image -> textured mesh
A two-stage shape-then-texture pipeline: a diffusion model produces a clean 3D shape from a single image, then a second model bakes PBR textures onto it for a production-ready mesh.
- Example stack
- Hunyuan3D-2 (shape + texture) -> remesh.
- Controls
- Input image; view conditioning; texture resolution; seed.
- Why multi-model
- Shape and texture have different distributions and resolutions; splitting them into a shape model and a texture model yields far cleaner meshes than a single joint generator.
- Use cases
- Printable miniatures, Game props, Product 3D shots
- Pitfalls
- Back and underside faces are guessed from a single view; complex topology retopologizes poorly.
Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets (arXiv:2501.12202)
Multi-Angle Turnaround to Rigged Game Asset (fal Game Assets)
text -> image -> image (x N) -> 3D mesh
fal.ai's 'Game Assets' workflow template generates an object image with Seedream, then uses Seedream Edit to produce left/right/back angle views of that same object, then feeds the multi-view set into Hyper3D Rodin v2 to reconstruct a textured 3D mesh in one API call.
- Example stack
- Seedream v4 (text-to-image) -> Seedream v4 Edit (multi-view turnaround, 3x) -> Hyper3D Rodin v2 (multi-view image-to-3d)
- Controls
- prompt for object concept; edit prompts specify angle (left side/right side/back view) and background/texture cleanup; Rodin multi-view concat mode input
- Why multi-model
- Chains a text-to-image generator, an instruction-based image editor (for consistent multi-view turnaround), and a dedicated multi-view-conditioned 3D reconstruction model, each a separate hosted model exposed as one workflow endpoint.
- Use cases
- game-ready prop/asset generation from a single concept prompt, turning a product concept into a 3D asset without a photo shoot, rapid prototyping of 3D collectibles/NPCs
- Pitfalls
- edited turnaround views can drift in proportions/details between angles because Seedream Edit has no hard 3D consistency guarantee; this can degrade Rodin's reconstruction compared with true multi-view-diffusion methods (e.g. Zero123/SyncDreamer) that are trained explicitly for view consistency; the closest existing pattern is multiview-diffusion-to-3d-reconstruction (Zero123/SyncDreamer -> NeuS/InstantMesh), but this chain substitutes a general-purpose text-to-image + instruction-edit pair for the multi-view-diffusion step and Rodin v2 for the NeuS/InstantMesh reconstruction stage, which is a materially different model family and generation goal (single-prompt asset creation, not reconstruction from an existing photo)
fal.ai Workflow Templates - Game Assets
Multi-View Diffusion to 3D Reconstruction
single image -> multi-view images -> 3D mesh
A multi-view diffusion model generates a set of geometrically consistent novel-view images of an object from a single reference image; those synthetic views are then passed to a neural reconstruction algorithm (NeuS, NeRF, or an instant-reconstruction model) to recover the full 3D mesh without needing real multi-view capture.
Note: no 'multi-view' family enum value exists; draft-to-finish is the closest fit for this two-stage view-synthesis-then-reconstruction chain.
- Example stack
- SyncDreamer (16-view generation) -> NeuS mesh reconstruction -> marching cubes export. Alternative: Zero123++ views -> InstantMesh reconstruction.
- Controls
- Number of synthesised views; elevation and azimuth sampling angles; diffusion guidance scale; reconstruction iteration count; NeuS surface threshold; optional mask for background removal before reconstruction.
- Why multi-model
- A single-image-to-3D model must simultaneously understand 2D appearance and 3D structure, which is under-constrained. Splitting into a view-synthesis model (which inherits rich 2D priors from large diffusion training) and a reconstruction model (which specialises in 3D geometry from image sets) lets each focus on its strength, and the reconstruction model receives denser, consistent multi-view supervision.
- Use cases
- Single-photo 3D object digitisation for e-commerce, Game asset creation from concept art, 3D model generation for AR placement, Museum / heritage object 3D capture from photographs
- Pitfalls
- Multi-view consistency is imperfect for textureless or symmetric objects; NeuS reconstruction quality degrades when generated views have geometric drift; thin structures (wires, hair) are not recovered well by implicit surface methods.
SyncDreamer: Generating Multiview-consistent Images from a Single-view Image (project page)
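The "elevation and azimuth sampling angles" control determines where the synthetic cameras sit around the object. The sketch below places cameras on a ring at evenly spaced azimuths and a fixed elevation. The 16-view count follows the example stack above; the 30-degree elevation, the radius, the Z-up convention, and the function name are assumptions for illustration, not values taken from any particular model's code.

```python
import math
from typing import List, Tuple


def ring_cameras(
    n_views: int = 16,
    elevation_deg: float = 30.0,  # assumed fixed elevation
    radius: float = 1.5,          # assumed camera distance from the object centre
) -> List[Tuple[float, Tuple[float, float, float]]]:
    """Return (azimuth_deg, (x, y, z)) for cameras on a ring looking at the origin (Z up)."""
    el = math.radians(elevation_deg)
    cameras = []
    for i in range(n_views):
        az = 2.0 * math.pi * i / n_views
        position = (
            radius * math.cos(el) * math.cos(az),
            radius * math.cos(el) * math.sin(az),
            radius * math.sin(el),
        )
        cameras.append((math.degrees(az), position))
    return cameras


cams = ring_cameras()
```

Every camera lies at the same distance and height, which is the evenly sampled turntable that a downstream reconstruction stage can treat as known poses.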
Photogrammetry to Surface-Aligned Mesh
multi-view photos/video -> camera poses -> 3D Gaussian splats -> textured surface mesh
Reconstructs a real-world object from ordinary photos or a video walkthrough: Structure-from-Motion recovers camera poses, Gaussian Splatting trains a photoreal radiance field on those poses, then a surface-extraction pass converts the volumetric splats into a clean, editable, UV-textured mesh.
- Example stack
- COLMAP -> 3D Gaussian Splatting -> SuGaR -> Blender/Unreal/Unity import
- Controls
- capture density/overlap, COLMAP matcher settings, 3DGS training iterations and densification thresholds, SuGaR regularization type, target mesh face count, Poisson depth
- Why multi-model
- COLMAP solves camera geometry but produces no renderable surface; 3D Gaussian Splatting renders photoreal views but is a volumetric point representation with no faces, UVs, or watertight geometry; SuGaR is a distinct surface-alignment and Poisson-extraction stage needed to pull an editable mesh out of the splats. Each stage solves a problem the others cannot.
- Use cases
- Photogrammetry asset capture from a phone video, Location scanning for virtual production, Cultural heritage digitization with editable, engine-ready output, AR/VR asset creation from real objects
- Pitfalls
- Poor image overlap or motion blur makes COLMAP fail pose estimation, propagating floaters and fractured geometry; reflective/transparent surfaces break both SfM matching and Gaussian convergence; mesh extraction can lose fine detail present in the raw splats.
SuGaR: Surface-Aligned Gaussian Splatting for Efficient 3D Mesh Reconstruction (CVPR 2024)
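The first stage of the stack above, COLMAP pose recovery, is usually run as three command-line steps: feature extraction, matching, and mapping. The sketch below only builds the argument lists for those steps and does not execute them. The helper name, the workspace layout (`database.db`, `sparse/`), and the `video` switch are assumptions; the later Gaussian Splatting and SuGaR stages are left out because their invocation depends on the repository version in use.

```python
from pathlib import Path
from typing import List


def colmap_sfm_commands(image_dir: str, workspace: str, video: bool = False) -> List[List[str]]:
    """Build COLMAP CLI argument lists for a basic Structure-from-Motion run."""
    ws = Path(workspace)
    database = str(ws / "database.db")
    sparse_dir = str(ws / "sparse")  # must exist before the mapper runs
    # Sequential matching suits ordered video frames; exhaustive matching suits small photo sets.
    matcher = "sequential_matcher" if video else "exhaustive_matcher"
    return [
        ["colmap", "feature_extractor", "--database_path", database, "--image_path", image_dir],
        ["colmap", matcher, "--database_path", database],
        ["colmap", "mapper", "--database_path", database,
         "--image_path", image_dir, "--output_path", sparse_dir],
    ]


commands = colmap_sfm_commands("photos", "work")
```

Each list can be passed to `subprocess.run(cmd, check=True)` once COLMAP is installed. A failure at the mapper step is the pose-estimation failure described in the pitfalls.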
Segment-Then-Reconstruct Scene Kitbashing
single cluttered photo -> per-object segmentation masks -> individually reconstructed posed 3D objects
Extracts every distinct object from one ordinary photo of a cluttered scene and reconstructs each as its own posed, textured 3D asset, by first running a general-purpose segmentation model to isolate object masks, then feeding each mask into a dedicated single-image-to-3D reconstruction model.
- Example stack
- SAM 3 (segmentation) -> SAM 3D Objects (per-object 3D reconstruction) -> scene assembly in Blender/Unity
- Controls
- mask selection/refinement (click or box prompts), per-object occlusion handling, output pose alignment to the original camera view, which detected objects to reconstruct
- Why multi-model
- A 2D segmentation model identifies what and where objects are but has no concept of 3D geometry; a single-image 3D reconstruction model lifts one already-isolated object into 3D but cannot first find and separate multiple objects in a cluttered scene. SAM 3D Objects reconstructs one selected object at a time and depends on an upstream segmentation step.
- Use cases
- Rapid kitbashing of game props from a single reference photo, AR product placement extracted from lifestyle photography, Building a 3D asset library from real-world photo references
- Pitfalls
- Reconstructs objects one at a time with no reasoning about physical interactions, so assembled scenes can show interpenetration or floating objects; heavy occlusion degrades reconstruction; segmentation errors propagate into missing or fused assets.
Introducing SAM 3D: 3D Reconstruction for Physical World Images (Meta AI)
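The hand-off between the two stages is a per-object loop: each mask isolates one object, which is then passed on for reconstruction. The sketch below shows only that isolation step on nested-list images. It is not the SAM 3 or SAM 3D Objects API; the helper names, the zero background fill, and the tiny test image are assumptions for illustration.

```python
from typing import List, Optional, Tuple

Grid = List[List[int]]
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive


def mask_bbox(mask: Grid) -> Optional[Box]:
    """Tight bounding box of the non-zero pixels in a binary mask, or None if empty."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def crop_objects(image: Grid, masks: List[Grid]) -> List[Tuple[Box, Grid]]:
    """Cut one masked crop per object; pixels outside the mask are set to 0."""
    crops = []
    for mask in masks:
        box = mask_bbox(mask)
        if box is None:
            continue  # empty mask: nothing to reconstruct
        top, left, bottom, right = box
        crop = [
            [image[y][x] if mask[y][x] else 0 for x in range(left, right)]
            for y in range(top, bottom)
        ]
        crops.append((box, crop))
    return crops


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
mask_a = [[1, 1, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
mask_empty = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
mask_b = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1]]
crops = crop_objects(image, [mask_a, mask_empty, mask_b])
```

Keeping the bounding box alongside each crop preserves where the object sat in the original photo, which the scene-assembly step needs for pose alignment.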
StableGen TRELLIS mesh-gen to multi-ControlNet texture to PBR decomposition (Blender)
text/image -> 3D mesh -> image (x N views) -> PBR maps
Blender addon StableGen chains TRELLIS.2 (text/image-to-3D mesh generation) into a ComfyUI-driven multi-viewpoint diffusion texturing pass using SDXL/FLUX/Qwen-Image-Edit with simultaneous depth+canny+normal ControlNet and IPAdapter style guidance, then optionally decomposes the generated texture into full PBR material maps via Marigold/StableDelight, all inside one Blender workflow.
- Example stack
- TRELLIS.2 mesh -> multi-view depth+canny ControlNet + IPAdapter SDXL texturing -> Marigold/StableDelight PBR decomposition, driven from the StableGen Blender panel via a local ComfyUI server
- Controls
- ControlNet weight per unit (depth/canny/normal), IPAdapter reference image and weight, generation strategy (sequential/grid/separate per view), per-region local-edit masks, PBR map toggles
- Why multi-model
- Distinct from baseline's geometry-then-texture-3d and geometry-aware-pbr-material-distillation because it locks simultaneous multi-ControlNet + IPAdapter conditioning across many camera viewpoints in a single texturing pass (not a single depth pass), and unifies mesh generation, multi-view texturing, and in-addon PBR decomposition as one chain rather than separate tools; TRELLIS.2, SDXL/FLUX/Qwen diffusion, ControlNet, IPAdapter, and Marigold/StableDelight are five distinct model families invoked in sequence.
- Use cases
- game-ready PBR-textured props from a prompt, re-texturing existing game/VFX meshes with geometry-locked multi-view AI textures, local editing of specific texture regions without a full re-generation
- Pitfalls
- Multi-view seams/blending are projection-based and can show artifacts at UV boundaries; PBR decomposition is a heuristic estimate, not measured material data; requires a running local ComfyUI server plus VRAM for SDXL/FLUX
StableGen GitHub README
Text-to-Image-Depth Room Mesh Fusion
text -> per-view 2D renders -> per-view depth -> fused, inpainted textured room mesh
Generates a room-scale 3D environment purely from a text prompt by iteratively rendering images at chosen camera poses, lifting each into 3D with monocular depth, and fusing/inpainting the growing mesh so each new view integrates seamlessly with previously generated geometry.
- Example stack
- Stable Diffusion -> monocular depth -> Stable Diffusion inpainting -> Text2Room fusion -> textured room mesh (OBJ)
- Controls
- text prompt per region/object, camera viewpoint trajectory/selection strategy, depth alignment tolerance, inpainting mask region, mesh simplification thresholds
- Why multi-model
- A text-to-image model produces flat 2D pixels with no camera-consistent geometry; a monocular depth model lifts a single image to a partial point cloud but cannot generate novel content or fill disocclusions; a text-conditioned inpainting model fills the gaps exposed as the camera moves, and a mesh-fusion algorithm stitches each frame into one consistent mesh.
- Use cases
- Rapid environment blockouts for game levels from text briefs, Virtual production background generation, Synthetic training environments for embodied AI, Architectural concept walkthroughs
- Pitfalls
- Depth misalignment across frames causes seams, ghosting, or duplicated geometry; viewpoint selection can miss occluded regions leaving holes; style can drift between frames; scales to room interiors but not large open scenes.
Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models (ICCV 2023, arXiv:2303.11989)
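"Lifting each image into 3D with monocular depth" comes down to unprojecting every pixel through a pinhole camera model. The sketch below shows that step in camera space only. It is a generic formulation, not Text2Room's code; the function name and the intrinsics `fx`, `fy`, `cx`, `cy` are illustrative parameters, and pixels with non-positive depth are treated as invalid by assumption.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def unproject_depth(
    depth: List[List[float]], fx: float, fy: float, cx: float, cy: float
) -> List[Point]:
    """Convert a depth map to camera-space 3D points with a pinhole model.

    For pixel (u, v) with depth z: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # skip invalid or missing depth
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


# 2x2 depth map with one invalid pixel; intrinsics chosen for easy arithmetic.
depth = [[2.0, 0.0], [2.0, 4.0]]
points = unproject_depth(depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

Each new view's points must then be transformed by that view's camera pose and aligned with the existing mesh, which is where the depth-misalignment seams described in the pitfalls come from.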
TRELLIS Structured-3D-Latents Pipeline
text/image -> 3D (mesh, radiance field, 3DGS)
Generates a 3D asset in a structured latent space, then decodes that one latent to whichever representation the pipeline needs (mesh, radiance field, or Gaussian splats).
- Example stack
- TRELLIS (or TRELLIS.2) -> texture refine -> retopo.
- Controls
- Input image or text; output format (mesh/3DGS/radiance field); guidance strength; seed.
- Why multi-model
- A single representation cannot serve every downstream use; the structured-latent decoder lets one model emit interchangeable mesh, radiance-field, and 3DGS outputs that each need different post-processing.
- Use cases
- Game and VFX assets, AR product previews, Rapid concept modeling
- Pitfalls
- Thin structures and interiors are under-represented; topology needs retopology before production use.
TRELLIS: Structured 3D Latents for Scalable and Versatile 3D Generation (arXiv:2412.01506)