---
source_url: https://arxiv.org/abs/2606.13655
ingested: 2026-06-13
sha256: flex4dhuman-raw-v1
---

# Flex4DHuman: Flexible Multi-view Video Diffusion for 4D Human Reconstruction

**arXiv:** 2606.13655
**Authors:** Jen-Hao Cheng (UW), Yipeng Wang (World Labs), Hao Zhang (World Labs), Gengshan Yang (World Labs), Jenq-Neng Hwang (UW)
**Categories:** cs.CV, cs.GR
**Published:** 2026-06-11

## Abstract

We present Flex4DHuman, a multi-view video diffusion model that transforms a monocular or sparse multi-view video of a dynamic subject into synchronized dense multi-view videos using only relative camera-pose conditioning. Unlike prior human-centric methods that rely on skeletons, depth maps, normals, or rendered target-view geometry, Flex4DHuman requires no explicit geometry priors and instead conditions generation through relative camera-pose positional encoding. The generated videos can be directly ingested by downstream reconstruction pipelines to create dynamic 4D Gaussian splats. Built on Wan 2.1's 1.3B text-to-video model, Flex4DHuman preserves the backbone architecture and encodes camera and view information through a five-axis positional encoding that extends spatio-temporal RoPE with view indices and continuous SE(3) relative camera geometry. A three-stage curriculum progressively trains the model for pose following, flexible reference-to-target view generation, and temporal rollout. To support temporal rollout, we train with clean historical target-view tokens. We also add multi-view captions to enable test-time text control. Combined with an off-the-shelf 4D Gaussian Splatting stage, our framework lifts monocular static-camera videos into dynamic 4D Gaussian splats.
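
The relative camera-pose conditioning described above can be illustrated with a minimal sketch. It assumes each camera is given as a 4x4 world-to-camera matrix; the helper names `relative_pose` and `make_w2c` are hypothetical and are not taken from the paper. The sketch only shows why a relative SE(3) transform is a convenient conditioning signal: it does not change when the whole scene is re-expressed in a different world frame.

```python
import numpy as np

def relative_pose(w2c_ref: np.ndarray, w2c_tgt: np.ndarray) -> np.ndarray:
    """Return the 4x4 transform that maps reference-camera coordinates
    to target-camera coordinates (an element of SE(3))."""
    return w2c_tgt @ np.linalg.inv(w2c_ref)

def make_w2c(yaw_rad: float, t: np.ndarray) -> np.ndarray:
    """Hypothetical helper: build a world-to-camera matrix from a rotation
    about the y axis and a translation vector."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    m = np.eye(4)
    m[:3, :3] = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    m[:3, 3] = t
    return m

ref = make_w2c(0.3, np.array([0.1, 0.0, 2.0]))
tgt = make_w2c(-1.2, np.array([-0.5, 0.2, 2.5]))
rel = relative_pose(ref, tgt)

# Re-expressing both cameras in another world frame leaves `rel` unchanged.
world_change = make_w2c(0.7, np.array([1.0, -2.0, 0.5]))
rel_moved = relative_pose(ref @ np.linalg.inv(world_change),
                          tgt @ np.linalg.inv(world_change))
```

How the model turns such a transform into a positional encoding (the PRoPE component listed under Key Concepts) is not shown here.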

## Key Contributions

1. **Multi-view video diffusion without explicit geometry priors** — Adapts Wan 2.1 using only relative camera-pose positional encoding
2. **Flexible synchronized generation** — Supports monocular and variable sparse-view inputs, arbitrary target viewpoints, and temporal rollout
3. **Monocular video to 4D Gaussian splats** — Generated multi-view videos feed into FreeTimeGS for dynamic reconstruction
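
Temporal rollout in contribution 2 can be pictured as chunked generation in which each new chunk reuses a few already-generated frames as clean history. The scheduling sketch below is an assumption for illustration only: the function name `rollout_chunks` and the chunk length and overlap values are hypothetical, and the source does not state the actual settings.

```python
def rollout_chunks(num_frames: int, chunk_len: int, overlap: int):
    """Split a video into chunks of at most `chunk_len` frames.

    Each chunk is a pair (history, new): `history` holds up to `overlap`
    frame indices generated by earlier chunks and fed back as clean
    conditioning; `new` holds the frame indices generated in this chunk.
    """
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_len")
    chunks = []
    start = 0
    while start < num_frames:
        history = list(range(max(0, start - overlap), start))
        end = min(num_frames, start + chunk_len - len(history))
        new = list(range(start, end))
        chunks.append((history, new))
        start = end
    return chunks

schedule = rollout_chunks(num_frames=10, chunk_len=4, overlap=1)
# schedule == [([], [0, 1, 2, 3]), ([3], [4, 5, 6]), ([6], [7, 8, 9])]
```

A larger overlap gives each chunk more history at the cost of fewer new frames per chunk.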

## Key Concepts

- [[five-axis-positional-encoding]]: (time, view, SE(3), h, w) RoPE extension
- [[se3-relative-camera-encoding]]: Continuous SE(3) camera geometry via PRoPE
- [[clean-conditioning-mask]]: Binary mask distinguishing reference vs target tokens
- [[three-stage-curriculum-training]]: Stage 1 pose following → Stage 2 dynamic refs → Stage 3 temporal rollout
- [[temporal-rollout]]: Chunked inference with teacher-forced history overlap
- [[multi-view-captioning]]: Appearance captions generated by Gemini 3 Flash
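
A rough sketch of how the discrete position axes and the clean-conditioning mask from the list above might be laid out per token. Everything here is an assumption for illustration: the function name `token_layout`, the token ordering, and the mask convention (1 for clean reference tokens, 0 for target tokens) are hypothetical. The continuous SE(3) axis is deliberately left out, since it is not an integer index.

```python
import numpy as np

def token_layout(num_views: int, num_frames: int, height: int, width: int,
                 ref_views):
    """Return per-token indices (time, view, h, w) and a binary clean mask.

    pos   : int array of shape (num_tokens, 4), columns = (time, view, h, w)
    clean : int array of shape (num_tokens,), 1 where the token belongs to a
            reference view, 0 where it belongs to a target view
    """
    v, t, h, w = np.meshgrid(np.arange(num_views), np.arange(num_frames),
                             np.arange(height), np.arange(width),
                             indexing="ij")
    pos = np.stack([t, v, h, w], axis=-1).reshape(-1, 4)
    clean = np.isin(pos[:, 1], list(ref_views)).astype(np.int64)
    return pos, clean

# Three views, two latent frames, a 2x2 latent grid; view 0 is the reference.
pos, clean = token_layout(3, 2, 2, 2, ref_views=[0])
```

Changing `ref_views` changes only the mask, not the position indices, which is one way to picture variable reference-to-target configurations.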

## Results

- DNA-Rendering: +1.21 dB PSNR over Diffuman4D-GT-skeleton (25.44 dB)
- Zero-shot ActorsHQ: +3.35 dB PSNR over Diffuman4D-mono-skeleton (21.32 dB)
- Generalizes to animals (DFA) after fine-tuning
- Robust to reference-view azimuth (<1 dB variation)
- Improves monotonically as more reference views are added
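
For readers interpreting the dB figures above, here is a minimal sketch of the standard PSNR definition. It assumes images scaled to [0, 1]; the paper's exact evaluation protocol (masking, cropping, per-frame averaging) is not given in this note, so treat this as the textbook formula rather than the authors' evaluation code. The helper `mse_ratio_for_gain` is hypothetical.

```python
import numpy as np

def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = float(np.mean(diff ** 2))
    if mse == 0.0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def mse_ratio_for_gain(gain_db: float) -> float:
    """MSE ratio (new / old) implied by a PSNR gain of `gain_db` decibels."""
    return 10.0 ** (-gain_db / 10.0)

target = np.zeros((4, 4, 3))
pred = np.full((4, 4, 3), 0.1)   # constant error of 0.1, so MSE = 0.01
value = psnr(pred, target)       # 10 * log10(1 / 0.01) = 20 dB
```

Under this definition, and only when PSNR is computed from a single pooled MSE, a gain of 1.21 dB corresponds to roughly a quarter lower MSE and a gain of 3.35 dB to roughly half.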