Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models

arXiv: 2606.25041
Published: 2026-06-23
Authors: Lianghua Huang, Zhifan Wu, Wei Wang, Yupeng Shi, Mengyang Feng, Junjie He, Chenwei Xie, Yu Liu, Jingren Zhou, Ang Wang, Bang Zhang, Baole Ai, Chen Liang, Cheng Yu, Chongyang Zhong, Jinwei Qi, Kai Zhu, Pandeng Li, Peng Zhang, Wenyuan Zhang, Xinhua Cheng, Yitong Huang, Yun Zheng, Zoubin Bi (Wan Team, Alibaba Group)
Categories: cs.CV, cs.AI, cs.GR, cs.SD
Website: https://wan-streamer.com
Source: https://arxiv.org/abs/2606.25041

Abstract

Wan-Streamer is a native-streaming, end-to-end interactive foundation model for real-time, low-latency, full-duplex audio-visual interaction. It models language, audio, and video as both input and output within a single Transformer using block-causal attention for incremental streaming. Unlike cascaded systems relying on separate VAD, ASR, language, TTS, audio-driven animation, or video-generation modules, Wan-Streamer jointly learns perception, reasoning, generation, response timing, turn management, and cross-modal synchronization within one unified model, reducing pipeline latency and error accumulation. Streaming units are as short as 160 ms at 25 fps, with ~200 ms model-side response latency and ~550 ms total interaction latency.
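The figures in the abstract can be checked with a little arithmetic. The sketch below only restates the quoted numbers: a 160 ms unit at 25 fps spans 4 video frames. The split of the ~550 ms total into model-side and non-model parts assumes the two simply add up, which the abstract does not state; the constant and function names are illustrative.

```python
from fractions import Fraction

# Figures quoted in the abstract.
UNIT_MS = 160           # shortest streaming unit, in milliseconds
FPS = 25                # video frame rate
MODEL_LATENCY_MS = 200  # approximate model-side response latency
TOTAL_LATENCY_MS = 550  # approximate total interaction latency


def frames_per_unit(unit_ms: int, fps: int) -> Fraction:
    """Return how many video frames one streaming unit spans."""
    return Fraction(unit_ms * fps, 1000)


frames = frames_per_unit(UNIT_MS, FPS)
units_per_second = Fraction(1000, UNIT_MS)

# Assumption (not stated in the source): total latency is the model-side
# latency plus everything else, so the remainder is a simple difference.
non_model_ms = TOTAL_LATENCY_MS - MODEL_LATENCY_MS

print(f"frames per unit: {frames}")
print(f"units per second: {float(units_per_second)}")
print(f"latency outside the model (assumed additive): ~{non_model_ms} ms")
```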

Key Contributions

End-to-end multimodal interactive foundation model: language, audio, and video as both input and output in one Transformer
Fully causal multimodal architecture: causal audio/video VAEs, causal encoders/decoders, block-causal attention, full-history autoregressive streaming
Thinker-performer inference pipeline with KV-cache exchange, with ~200 ms model-side response latency and ~550 ms total interaction latency
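To make the block-causal attention named in the contributions concrete, here is a minimal sketch of the general masking pattern: tokens attend freely within their own block and to all earlier blocks, but never to later ones. This illustrates the generic technique, not Wan-Streamer's implementation; the function name, the fixed block size, and the flat token layout are assumptions made for illustration.

```python
from typing import List


def block_causal_mask(num_tokens: int, block_size: int) -> List[List[bool]]:
    """Build a boolean attention mask, indexed as mask[query][key].

    An entry is True when the query token may attend to the key token,
    i.e. when the key's block is the same as or earlier than the query's.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        [(key // block_size) <= (query // block_size) for key in range(num_tokens)]
        for query in range(num_tokens)
    ]


# Hypothetical toy setup: 6 tokens grouped into streaming blocks of 2.
mask = block_causal_mask(num_tokens=6, block_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With a block size of 1 this reduces to an ordinary causal mask. Larger blocks allow bidirectional attention inside a streaming unit while keeping generation incremental across units.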
