---
source_url: https://arxiv.org/abs/2606.13655
ingested: 2026-06-13
sha256: flex4dhuman-raw-v1
---

# Flex4DHuman: Flexible Multi-view Video Diffusion for 4D Human Reconstruction

**arXiv:** 2606.13655
**Authors:** Jen-Hao Cheng (UW), Yipeng Wang (World Labs), Hao Zhang (World Labs), Gengshan Yang (World Labs), Jenq-Neng Hwang (UW)
**Categories:** cs.CV, cs.GR
**Published:** 2026-06-11

## Abstract

We present Flex4DHuman, a multi-view video diffusion model that transforms a monocular or sparse multi-view video of a dynamic subject into synchronized dense multi-view videos using only relative camera-pose conditioning. Unlike prior human-centric methods that rely on skeletons, depth maps, normals, or rendered target-view geometry, Flex4DHuman requires no explicit geometry priors and instead conditions generation through relative camera-pose positional encoding. The generated videos can be directly ingested by downstream reconstruction pipelines to create dynamic 4D Gaussian splats.

Built on Wan 2.1's 1.3B text-to-video model, Flex4DHuman preserves the backbone architecture and encodes camera and view information through a five-axis positional encoding that extends spatio-temporal RoPE with view indices and continuous SE(3) relative camera geometry. A three-stage curriculum progressively trains the model for pose following, flexible reference-to-target view generation, and temporal rollout. To support temporal rollout, we train with clean historical target-view tokens. We also add multi-view captions to enable test-time text control. Combined with an off-the-shelf 4D Gaussian Splatting stage, our framework lifts monocular static-camera videos into dynamic 4D Gaussian splats.

## Key Contributions

1. **Multi-view video diffusion without explicit geometry priors**: Adapts Wan 2.1 using only relative camera-pose positional encoding
2. **Flexible synchronized generation**: Supports monocular and variable sparse-view inputs, arbitrary target viewpoints, and temporal rollout
3. **Monocular video to 4D Gaussian splats**: Generated multi-view videos feed into FreeTimeGS for dynamic reconstruction

## Key Concepts

- [[five-axis-positional-encoding]]: (time, view, SE(3), h, w) RoPE extension
- [[se3-relative-camera-encoding]]: Continuous SE(3) camera geometry via PRoPE
- [[clean-conditioning-mask]]: Binary mask distinguishing reference vs target tokens
- [[three-stage-curriculum-training]]: Stage 1 pose following → Stage 2 dynamic refs → Stage 3 temporal rollout
- [[temporal-rollout]]: Chunked inference with teacher-forced history overlap
- [[multi-view-captioning]]: Gemini 3 Flash generated appearance captions

## Results

- DNA-Rendering: +1.21 dB PSNR over Diffuman4D-GT-skeleton (25.44 dB)
- Zero-shot ActorsHQ: +3.35 dB PSNR over Diffuman4D-mono-skeleton (21.32 dB)
- Generalizes to animals (DFA) after fine-tuning
- Robust to reference view azimuth (<1 dB variation)
- Monotonically improves with more reference views
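
To help read the PSNR gains listed in the results, the sketch below uses the standard PSNR definition, 10·log10(MAX² / MSE), to convert a dB gain into the factor by which mean squared error shrinks. The function names `psnr` and `mse_ratio_for_gain` are illustrative and not from the paper. The conversion holds for a single image pair with a fixed peak value, so it is only a rough guide for benchmark numbers averaged over many frames.

```python
import math


def psnr(mse: float, max_val: float = 1.0) -> float:
    """PSNR in dB for a given mean squared error and peak signal value."""
    if mse <= 0:
        raise ValueError("mse must be positive; PSNR is unbounded for identical images")
    return 10.0 * math.log10((max_val ** 2) / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which MSE is multiplied when PSNR rises by gain_db (same peak value)."""
    return 10.0 ** (-gain_db / 10.0)


if __name__ == "__main__":
    # The two gains reported in the Results list above.
    for gain in (1.21, 3.35):
        print(f"+{gain} dB -> MSE multiplied by {mse_ratio_for_gain(gain):.3f}")
```

Under this per-image reading, a gain of about 3 dB corresponds to roughly halving the squared error.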
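
The abstract describes conditioning on relative camera pose rather than explicit geometry. As a minimal sketch of what a relative SE(3) pose is, the code below expresses a target camera in a reference camera's frame. It assumes 4x4 camera-to-world matrices. The paper's actual convention and the way PRoPE consumes the result are not specified in this note, and `invert_rigid`, `matmul4`, and `relative_pose` are hypothetical helper names.

```python
from typing import List

Matrix = List[List[float]]


def invert_rigid(T: Matrix) -> Matrix:
    """Invert a 4x4 rigid transform [R | t; 0 0 0 1] as [R^T | -R^T t]."""
    R_t = [[T[j][i] for j in range(3)] for i in range(3)]  # transpose of R
    t = [T[i][3] for i in range(3)]
    t_inv = [-sum(R_t[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [R_t[i] + [t_inv[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def matmul4(A: Matrix, B: Matrix) -> Matrix:
    """Plain 4x4 matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def relative_pose(ref_c2w: Matrix, tgt_c2w: Matrix) -> Matrix:
    """Pose of the target camera expressed in the reference camera's frame.

    Assumption: both inputs are camera-to-world transforms.
    """
    return matmul4(invert_rigid(ref_c2w), tgt_c2w)


if __name__ == "__main__":
    ref = [[1.0, 0.0, 0.0, 1.0],
           [0.0, 1.0, 0.0, 2.0],
           [0.0, 0.0, 1.0, 3.0],
           [0.0, 0.0, 0.0, 1.0]]
    tgt = [[1.0, 0.0, 0.0, 2.0],
           [0.0, 1.0, 0.0, 2.0],
           [0.0, 0.0, 1.0, 3.0],
           [0.0, 0.0, 0.0, 1.0]]
    # Target sits one unit along the reference camera's x axis.
    print(relative_pose(ref, tgt))
```

Because the result depends only on the two cameras relative to each other, it is unchanged if both poses are moved by the same world-space transform, which is one reason relative poses are a convenient conditioning signal.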
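
The temporal-rollout concept above is summarized as chunked inference with teacher-forced history overlap. The sketch below shows one plausible way to plan such chunks: each chunk after the first reuses its leading frames from the previous chunk as clean history and generates only the remainder. The chunk length, overlap size, and the function name `plan_rollout_chunks` are assumptions made for illustration. The note does not give the paper's actual schedule.

```python
from typing import List, Tuple


def plan_rollout_chunks(num_frames: int, chunk_len: int,
                        overlap: int) -> List[Tuple[int, int, int]]:
    """Plan overlapping chunks as (start, end, num_history); end is exclusive.

    The first num_history frames of a chunk are already-generated frames kept
    clean as conditioning; the remaining frames are newly generated.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_len")

    chunks: List[Tuple[int, int, int]] = []
    start = 0
    while True:
        end = min(start + chunk_len, num_frames)
        history = 0 if start == 0 else overlap  # first chunk has no history
        chunks.append((start, end, history))
        if end >= num_frames:
            break
        start = end - overlap  # next chunk re-reads the last `overlap` frames
    return chunks


if __name__ == "__main__":
    for start, end, history in plan_rollout_chunks(10, 4, 1):
        print(f"frames [{start}, {end}): {history} history, "
              f"{end - start - history} new")
```

A larger overlap gives each chunk more context at the cost of generating fewer new frames per pass.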