---
source_url: https://arxiv.org/abs/2606.13655
ingested: 2026-06-13
sha256: flex4dhuman-raw-v1
---

# Flex4DHuman: Flexible Multi-view Video Diffusion for 4D Human Reconstruction

**arXiv:** 2606.13655
**Authors:** Jen-Hao Cheng (UW), Yipeng Wang (World Labs), Hao Zhang (World Labs), Gengshan Yang (World Labs), Jenq-Neng Hwang (UW)
**Categories:** cs.CV, cs.GR
**Published:** 2026-06-11

## Abstract
We present Flex4DHuman, a multi-view video diffusion model that transforms a monocular or sparse multi-view video of a dynamic subject into synchronized dense multi-view videos using only relative camera-pose conditioning. Unlike prior human-centric methods that rely on skeletons, depth maps, normals, or rendered target-view geometry, Flex4DHuman requires no explicit geometry priors and instead conditions generation through relative camera-pose positional encoding. The generated videos can be directly ingested by downstream reconstruction pipelines to create dynamic 4D Gaussian splats. Built on Wan 2.1's 1.3B text-to-video model, Flex4DHuman preserves the backbone architecture and encodes camera and view information through a five-axis positional encoding that extends spatio-temporal RoPE with view indices and continuous SE(3) relative camera geometry. A three-stage curriculum progressively trains the model for pose following, flexible reference-to-target view generation, and temporal rollout. To support temporal rollout, we train with clean historical target-view tokens. We also add multi-view captions to enable test-time text control. Combined with an off-the-shelf 4D Gaussian Splatting stage, our framework lifts monocular static-camera videos into dynamic 4D Gaussian splats.
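To make the idea of relative camera-pose conditioning concrete, here is a minimal sketch of computing a relative SE(3) transform between a reference camera and a target camera. It assumes camera-to-world 4x4 matrices and the convention `T_rel = T_tgt^-1 @ T_ref` (reference-camera coordinates mapped into target-camera coordinates). Neither this convention nor the helper names (`yaw_pose`, `relative_pose`) come from the paper, and how the transform enters the positional encoding is not shown.

```python
import math

def matmul(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def invert_rigid(t):
    """Invert a rigid transform [R | p] as [R^T | -R^T p]."""
    rot_t = [[t[j][i] for j in range(3)] for i in range(3)]
    trans = [-sum(rot_t[i][k] * t[k][3] for k in range(3)) for i in range(3)]
    return [rot_t[i] + [trans[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def yaw_pose(angle_deg, position):
    """Hypothetical helper: camera-to-world pose rotated about the z-axis."""
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    x, y, z = position
    return [[c, -s, 0.0, x],
            [s, c, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]

def relative_pose(cam_to_world_ref, cam_to_world_tgt):
    """Assumed convention: maps reference-camera coords to target-camera coords."""
    return matmul(invert_rigid(cam_to_world_tgt), cam_to_world_ref)

ref = yaw_pose(0.0, (2.0, 0.0, 1.0))
tgt = yaw_pose(90.0, (0.0, 2.0, 1.0))
rel = relative_pose(ref, tgt)
```

A useful property of this formulation is that `rel` is unchanged when both cameras are moved by the same world transform, so the conditioning signal does not depend on the choice of world frame.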
## Key Contributions
1. **Multi-view video diffusion without explicit geometry priors** — Adapts Wan 2.1 using only relative camera-pose positional encoding
2. **Flexible synchronized generation** — Supports monocular and variable sparse-view inputs, arbitrary target viewpoints, and temporal rollout
3. **Monocular video to 4D Gaussian splats** — Generated multi-view videos feed into FreeTimeGS for dynamic reconstruction
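Temporal rollout can be pictured as chunked generation in which each new chunk is conditioned on the last few already-generated frames, kept clean as history. The sketch below only plans frame indices. The chunk length, the overlap size, and the function name `plan_rollout` are illustrative assumptions, not values or APIs from the paper.

```python
def plan_rollout(num_frames, chunk_len, overlap):
    """Split frames into chunks whose leading `overlap` frames are clean history.

    Returns a list of (history_frames, new_frames) index lists. The first chunk
    has no history; later chunks reuse the tail of what was already generated.
    """
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_len")
    chunks = []
    start = 0  # first frame that still has to be generated
    while start < num_frames:
        hist = overlap if chunks else 0
        history = list(range(start - hist, start))
        new_end = min(start + chunk_len - hist, num_frames)
        chunks.append((history, list(range(start, new_end))))
        start = new_end
    return chunks

for history, new in plan_rollout(num_frames=10, chunk_len=4, overlap=1):
    print("history:", history, "generate:", new)
```

With these toy settings the plan is three chunks: frames 0 to 3 with no history, then frames 4 to 6 conditioned on frame 3, then frames 7 to 9 conditioned on frame 6.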
## Key Concepts

- [[five-axis-positional-encoding]]: (time, view, SE(3), h, w) RoPE extension
- [[se3-relative-camera-encoding]]: Continuous SE(3) camera geometry via PRoPE
- [[clean-conditioning-mask]]: Binary mask distinguishing reference vs target tokens
- [[three-stage-curriculum-training]]: Stage 1 pose following → Stage 2 flexible reference-to-target view generation → Stage 3 temporal rollout
- [[temporal-rollout]]: Chunked inference with teacher-forced history overlap
- [[multi-view-captioning]]: Appearance captions generated with Gemini 3 Flash
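As a rough illustration of the token bookkeeping behind the positional encoding and the clean-conditioning mask, the sketch below enumerates the tokens of a small multi-view clip with integer (time, view, h, w) indices and a binary mask. The continuous SE(3) axis is left out because its exact encoding is not described in this note. The token ordering, the function name `build_tokens`, the use of 1 for "clean", and the choice to also mark historical target-view frames as clean are all assumptions made for illustration.

```python
from itertools import product

def build_tokens(num_frames, num_views, height, width,
                 reference_views, history_frames=frozenset()):
    """Enumerate (t, view, h, w) indices and a clean mask for every token.

    A token is marked clean (1) if it belongs to a reference view or to a
    frame treated as already-generated history; otherwise it is a noisy
    target token (0).
    """
    positions, clean_mask = [], []
    for t, v, h, w in product(range(num_frames), range(num_views),
                              range(height), range(width)):
        positions.append((t, v, h, w))
        is_clean = v in reference_views or t in history_frames
        clean_mask.append(1 if is_clean else 0)
    return positions, clean_mask

positions, clean_mask = build_tokens(
    num_frames=3, num_views=3, height=2, width=2,
    reference_views={0}, history_frames={0},
)
print(len(positions), sum(clean_mask))  # prints: 36 20
```

Here 12 of the clean tokens come from the single reference view (3 frames of 2x2 tokens) and 8 from the two target views in the history frame.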
## Results

- DNA-Rendering: +1.21 dB PSNR over Diffuman4D-GT-skeleton (25.44 dB)
- Zero-shot ActorsHQ: +3.35 dB PSNR over Diffuman4D-mono-skeleton (21.32 dB)
- Generalizes to animals (DFA) after fine-tuning
- Robust to reference view azimuth (<1 dB variation)
- Monotonically improves with more reference views
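To put the dB gaps above in perspective: for images scaled to [0, 1], PSNR is `10 * log10(1 / MSE)`, so a gain of Δ dB corresponds to the mean squared error shrinking by a factor of `10 ** (Δ / 10)`. The sketch below applies that identity to the two quoted gains. It is plain arithmetic on the listed numbers, not a re-evaluation of the method, and the function names are illustrative.

```python
import math

def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)

for name, gain in [("DNA-Rendering", 1.21), ("ActorsHQ (zero-shot)", 3.35)]:
    ratio = mse_ratio_for_gain(gain)
    print(f"{name}: +{gain} dB -> MSE lower by a factor of {ratio:.2f}")
```

Under this reading, the two gains correspond to roughly 1.3x and 2.2x lower MSE, respectively.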