source_url	ingested	sha256
https://arxiv.org/abs/2606.13655	2026-06-13	flex4dhuman-raw-v1

Flex4DHuman: Flexible Multi-view Video Diffusion for 4D Human Reconstruction

arXiv: 2606.13655
Authors: Jen-Hao Cheng (UW), Yipeng Wang (World Labs), Hao Zhang (World Labs), Gengshan Yang (World Labs), Jenq-Neng Hwang (UW)
Categories: cs.CV, cs.GR
Published: 2026-06-11

Abstract

We present Flex4DHuman, a multi-view video diffusion model that transforms a monocular or sparse multi-view video of a dynamic subject into synchronized dense multi-view videos using only relative camera-pose conditioning. Unlike prior human-centric methods that rely on skeletons, depth maps, normals, or rendered target-view geometry, Flex4DHuman requires no explicit geometry priors and instead conditions generation through relative camera-pose positional encoding. The generated videos can be directly ingested by downstream reconstruction pipelines to create dynamic 4D Gaussian splats. Built on Wan 2.1's 1.3B text-to-video model, Flex4DHuman preserves the backbone architecture and encodes camera and view information through a five-axis positional encoding that extends spatio-temporal RoPE with view indices and continuous SE(3) relative camera geometry. A three-stage curriculum progressively trains the model for pose following, flexible reference-to-target view generation, and temporal rollout. To support temporal rollout, we train with clean historical target-view tokens. We also add multi-view captions to enable test-time text control. Combined with an off-the-shelf 4D Gaussian Splatting stage, our framework lifts monocular static-camera videos into dynamic 4D Gaussian splats.
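To make "relative camera-pose conditioning" concrete, here is a minimal sketch of how a relative SE(3) pose between a reference and a target camera could be computed. It assumes 4x4 camera-to-world matrices and expresses the target camera in the reference camera's frame; the paper's actual convention and its positional encoding are not reproduced here, and the helper names (`invert_rigid`, `matmul4`, `relative_pose`, `rigid_z`) are hypothetical.

```python
import math


def rigid_z(angle, t):
    """Build a 4x4 rigid transform: rotation about z by `angle`, then translation `t`."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        [c, -s, 0.0, t[0]],
        [s, c, 0.0, t[1]],
        [0.0, 0.0, 1.0, t[2]],
        [0.0, 0.0, 0.0, 1.0],
    ]


def invert_rigid(T):
    """Invert a rigid transform [R | t] using R^T and -R^T t."""
    Rt = [[T[j][i] for j in range(3)] for i in range(3)]  # transpose of R
    t = [T[i][3] for i in range(3)]
    t_inv = [-sum(Rt[i][j] * t[j] for j in range(3)) for i in range(3)]
    return [Rt[i] + [t_inv[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def matmul4(A, B):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def relative_pose(ref_c2w, tgt_c2w):
    """Pose of the target camera expressed in the reference camera's frame.

    Assumed convention: both inputs are camera-to-world matrices.
    """
    return matmul4(invert_rigid(ref_c2w), tgt_c2w)
```

The useful property of a relative pose is that it does not depend on where the world origin sits: moving both cameras by the same rigid transform leaves `relative_pose` unchanged, which is one plausible reason to condition on relative rather than absolute poses.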

Key Contributions

Multi-view video diffusion without explicit geometry priors — Adapts Wan 2.1 using only relative camera-pose positional encoding
Flexible synchronized generation — Supports monocular and variable sparse-view inputs, arbitrary target viewpoints, and temporal rollout
Monocular video to 4D Gaussian splats — Generated multi-view videos feed into FreeTimeGS for dynamic reconstruction
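The temporal rollout mentioned above can be pictured as a sliding window over frames. The sketch below only plans the windows; it does not run any model. The function name `plan_rollout` and the parameters `chunk_len` and `overlap` are hypothetical, and the actual chunk length and overlap used by Flex4DHuman are not stated in these notes.

```python
def plan_rollout(total_frames, chunk_len, overlap):
    """Return (start, end, n_history) windows covering frames [0, total_frames).

    Every window after the first starts `overlap` frames before the previous
    window ended; those frames act as clean history, and only the remaining
    frames in the window are newly generated.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_len")

    windows = []
    start = 0
    while True:
        end = min(start + chunk_len, total_frames)
        n_history = 0 if start == 0 else overlap
        windows.append((start, end, n_history))
        if end >= total_frames:
            break
        start = end - overlap  # next window re-reads the last `overlap` frames
    return windows


# Example: 10 frames, windows of 4 frames, 1 frame of history per later window.
for start, end, n_history in plan_rollout(10, 4, 1):
    print(f"frames {start}..{end - 1}: {n_history} history, "
          f"{end - start - n_history} new")
```

Because each later window contributes only its non-history frames, the newly generated frames across all windows add up to exactly `total_frames`.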

Key Concepts

five-axis-positional-encoding: (time, view, SE(3), h, w) RoPE extension
se3-relative-camera-encoding: Continuous SE(3) camera geometry via PRoPE
clean-conditioning-mask: Binary mask distinguishing reference vs target tokens
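three-stage-curriculum-training: Stage 1 pose following → Stage 2 flexible reference-to-target view generation → Stage 3 temporal rollout
temporal-rollout: Chunked inference in which each chunk overlaps previously generated frames; training uses clean (teacher-forced) historical target-view tokens
multi-view-captioning: Appearance captions generated by Gemini 3 Flash, enabling test-time text control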
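As a rough illustration of the multi-axis idea behind the positional encoding listed above, the sketch below applies standard rotary position embedding (RoPE) independently to separate slices of a feature vector, one slice per scalar axis (for example time, view, h, w). This is an assumption-laden toy: the SE(3) axis is deliberately left out because it is not a scalar position (these notes attribute it to PRoPE), and the per-axis dimension split and the function names `rope_1d` and `rope_multi_axis` are hypothetical.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles pos * base**(-i / d)."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def rope_multi_axis(vec, positions, axis_dims):
    """Apply rope_1d to one contiguous slice of `vec` per axis.

    positions: one scalar position per axis, e.g. (t, view, h, w).
    axis_dims: number of feature dimensions assigned to each axis.
    """
    if len(positions) != len(axis_dims) or sum(axis_dims) != len(vec):
        raise ValueError("positions and axis_dims must match and cover vec")
    out, offset = [], 0
    for pos, dim in zip(positions, axis_dims):
        out += rope_1d(vec[offset:offset + dim], pos)
        offset += dim
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))
```

The property that makes RoPE attractive for attention is that the query-key dot product depends only on the per-axis position differences: shifting the positions of both tokens by the same offset on any axis leaves `dot(rope_multi_axis(q, ...), rope_multi_axis(k, ...))` unchanged.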

Results

DNA-Rendering: +1.21 dB PSNR over Diffuman4D-GT-skeleton (25.44 dB)
Zero-shot ActorsHQ: +3.35 dB PSNR over Diffuman4D-mono-skeleton (21.32 dB)
Generalizes to animals (DFA) after fine-tuning
Robust to reference view azimuth (<1 dB variation)
Monotonically improves with more reference views
