| source_url | ingested | sha256 |
|---|---|---|
| https://arxiv.org/abs/2606.13655 | 2026-06-13 | flex4dhuman-raw-v1 |
Flex4DHuman: Flexible Multi-view Video Diffusion for 4D Human Reconstruction
arXiv: 2606.13655
Authors: Jen-Hao Cheng (UW), Yipeng Wang (World Labs), Hao Zhang (World Labs), Gengshan Yang (World Labs), Jenq-Neng Hwang (UW)
Categories: cs.CV, cs.GR
Published: 2026-06-11
Abstract
We present Flex4DHuman, a multi-view video diffusion model that transforms a monocular or sparse multi-view video of a dynamic subject into synchronized dense multi-view videos using only relative camera-pose conditioning. Unlike prior human-centric methods that rely on skeletons, depth maps, normals, or rendered target-view geometry, Flex4DHuman requires no explicit geometry priors and instead conditions generation through relative camera-pose positional encoding. The generated videos can be directly ingested by downstream reconstruction pipelines to create dynamic 4D Gaussian splats. Built on Wan 2.1's 1.3B text-to-video model, Flex4DHuman preserves the backbone architecture and encodes camera and view information through a five-axis positional encoding that extends spatio-temporal RoPE with view indices and continuous SE(3) relative camera geometry. A three-stage curriculum progressively trains the model for pose following, flexible reference-to-target view generation, and temporal rollout. To support temporal rollout, we train with clean historical target-view tokens. We also add multi-view captions to enable test-time text control. Combined with an off-the-shelf 4D Gaussian Splatting stage, our framework lifts monocular static-camera videos into dynamic 4D Gaussian splats.
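To make the five-axis idea concrete, the following is a minimal sketch of how each latent token could be tagged with (time, view, relative pose, h, w). It is not the paper's implementation: the function names, the camera-to-world convention, the choice of the first view as the reference frame, and the view-major token order are all assumptions made for illustration, and the actual rotary encoding of these coordinates is omitted.

```python
from itertools import product


def invert_rigid(T):
    """Invert a 4x4 rigid transform [R | t; 0 0 0 1] stored as nested lists."""
    R = [row[:3] for row in T[:3]]
    t = [row[3] for row in T[:3]]
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]  # R transposed
    t_inv = [-sum(Rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [Rt[i] + [t_inv[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def matmul4(A, B):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def relative_pose(T_ref, T_tgt):
    """Pose of the target camera expressed in the reference camera's frame.

    Assumes both inputs are camera-to-world matrices (an assumption here).
    """
    return matmul4(invert_rigid(T_ref), T_tgt)


def five_axis_tokens(cam_to_world, num_frames, height, width):
    """Return one (time, view, rel_pose, h, w) tuple per latent token.

    Hypothetical layout: view 0 is the reference frame, tokens are view-major.
    """
    rel = [relative_pose(cam_to_world[0], T) for T in cam_to_world]
    views = range(len(cam_to_world))
    return [(t, v, rel[v], h, w)
            for v, t, h, w in product(views, range(num_frames),
                                      range(height), range(width))]
```

Because only poses relative to a reference camera enter the token coordinates, shifting or rotating every camera by the same world transform leaves the result unchanged, which is one reason relative conditioning is attractive when no shared world frame is available.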
Key Contributions
- Multi-view video diffusion without explicit geometry priors — Adapts Wan 2.1 using only relative camera-pose positional encoding
- Flexible synchronized generation — Supports monocular and variable sparse-view inputs, arbitrary target viewpoints, and temporal rollout
- Monocular video to 4D Gaussian splats — Generated multi-view videos feed into FreeTimeGS for dynamic reconstruction
Key Concepts
- five-axis-positional-encoding: (time, view, SE(3), h, w) RoPE extension
- se3-relative-camera-encoding: Continuous SE(3) camera geometry via PRoPE
- clean-conditioning-mask: Binary mask distinguishing reference vs target tokens
- three-stage-curriculum-training: Stage 1 pose following → Stage 2 flexible reference-to-target view generation → Stage 3 temporal rollout
- temporal-rollout: Chunked inference with teacher-forced history overlap
- multi-view-captioning: Appearance captions generated by Gemini 3 Flash
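The temporal-rollout and clean-conditioning-mask entries above can be illustrated with a small scheduling sketch. It splits a long sequence into overlapping chunks and marks, per chunk, which target-view frames are carried over as clean history (1) and which are to be generated (0). The function name, the `chunk_len` and `overlap` parameters, and the use of a single per-frame mask for history are hypothetical; the source does not state chunk sizes or how the mask is laid out.

```python
def rollout_chunks(num_frames, chunk_len, overlap):
    """Plan chunked inference with history overlap.

    Returns a list of (start, end, clean_mask) tuples, where frames
    [start, end) form one chunk and clean_mask[i] == 1 means frame
    start + i is reused from the previous chunk as clean conditioning.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if not 0 <= overlap < chunk_len:
        raise ValueError("need 0 <= overlap < chunk_len")

    chunks = []
    start = 0
    while True:
        end = min(start + chunk_len, num_frames)
        # The first chunk has no history; later chunks reuse `overlap` frames.
        n_hist = 0 if start == 0 else overlap
        mask = [1] * n_hist + [0] * (end - start - n_hist)
        chunks.append((start, end, mask))
        if end >= num_frames:
            break
        start = end - overlap  # step back so the next chunk sees history
    return chunks


if __name__ == "__main__":
    for start, end, mask in rollout_chunks(num_frames=10, chunk_len=4, overlap=1):
        print(start, end, mask)
```

In this sketch a larger `overlap` gives each chunk more history to condition on, at the cost of generating fewer new frames per chunk; reference-view tokens, which are always clean, would be handled separately from this target-view mask.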
Results
- DNA-Rendering: +1.21 dB PSNR over Diffuman4D-GT-skeleton (25.44 dB)
- Zero-shot ActorsHQ: +3.35 dB PSNR over Diffuman4D-mono-skeleton (21.32 dB)
- Generalizes to animals (DFA) after fine-tuning
- Robust to reference view azimuth (<1 dB variation)
- Monotonically improves with more reference views