title: MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model
created: 2026-06-20
source: arXiv:2606.17800
authors: Lichen Bai, Tianhao Zhang, Shitong Shao, Dingwei Tan, Qiyu Zhong, Zhengpeng Xie, Haopeng Li, Qinghao Huang, Dandan Shen, Tengjiao Ji, Wei Wang, Peicheng Wu, Yuxuan Zhao, Xiangyu Zhu, Welly Luo, Shurui Yang, Zeke Xie
venue: arXiv preprint (cs.CV)
date: 2026-06-16
project: https://mainecoon.tech/
type: paper

Catnip AI Team · arXiv:2606.17800 · 32 pages, 13 figures, 3 tables

Abstract

As an increasing majority of global video content is consumed on social platforms for interactive social purposes, video generation models built for social worlds are important but largely overlooked. We define the position of social world models and build MaineCoon as the first step — a 22B real-time audio-visual autoregressive model capable of streaming generation and sub-second interaction at up to 47.5 FPS on a single GPU.

Key innovations:

Self-resampling: exposes the model to its own degraded history during training
Cross-modal representation alignment: token relation distillation with a V-JEPA 2 teacher
Domain-aware preference optimization: multi-domain LoRA DPO experts
Reinforced online-policy distillation (ROPD): consolidates the domain experts into one deployable policy
Agentic streaming inference: a training-free framework with a planner/observer, a cache manager, and a buffer controller

MaineCoon supports thousand-second-scale generation while mitigating drift, and achieves state-of-the-art results on the new SocialVideo Bench (9 evaluation metrics).
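The self-resampling idea listed above can be illustrated with a toy history builder. This is a minimal sketch, not the paper's procedure: `build_training_history`, `toy_predict`, and `resample_prob` are hypothetical names, and it assumes that each past chunk is replaced by the model's own prediction with some probability, so that training sees the kind of imperfect context that appears at inference time.

```python
import random

def build_training_history(clean_chunks, model_predict, resample_prob, rng):
    """Return the history a causal model is conditioned on for one training sample.

    Each past chunk is either the clean ground truth or, with probability
    `resample_prob`, the model's own prediction given the history built so far.
    """
    history = []
    for clean in clean_chunks:
        if history and rng.random() < resample_prob:
            # Self-generated chunk: carries the model's own errors forward.
            history.append(model_predict(history))
        else:
            history.append(clean)
    return history

def toy_predict(history):
    # Toy stand-in for a generator (pure assumption): repeats the last chunk with a small bias.
    return history[-1] + 0.1

rng = random.Random(0)
clean = [0.0, 1.0, 2.0, 3.0]
always_clean = build_training_history(clean, toy_predict, resample_prob=0.0, rng=rng)
always_self = build_training_history(clean, toy_predict, resample_prob=1.0, rng=rng)
```

With `resample_prob=0.0` the history is the clean data (ordinary teacher forcing); with `resample_prob=1.0` every chunk after the first is self-generated and the bias accumulates, which is the drift the model would have to learn to tolerate.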

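Core Problem

Most video worldwide is consumed on social platforms, yet existing video generation models (such as DiT-based diffusion models) have three major limitations:

Offline, non-streaming: bidirectional temporal attention prevents real-time output
Audio is ignored: speech, lip sync, and emotional resonance are central to social video
Poor long-horizon stability: content drifts during minute-scale autoregressive generation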
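The first limitation comes down to the attention mask. As a rough illustration (not code from the paper), a chunk-wise causal mask lets each token attend to its own chunk and to earlier chunks only, so a chunk can be emitted before later ones exist; a bidirectional mask would be all `True`. The function name and the chunk sizes below are assumptions.

```python
def block_causal_mask(num_chunks, tokens_per_chunk):
    """Boolean mask: entry [i][j] is True when token i may attend to token j."""
    # Chunk index of every token position, e.g. [0, 0, 1, 1, 2, 2].
    chunk_of = [c for c in range(num_chunks) for _ in range(tokens_per_chunk)]
    n = len(chunk_of)
    # A token sees every token in its own chunk and in all earlier chunks.
    return [[chunk_of[i] >= chunk_of[j] for j in range(n)] for i in range(n)]

mask = block_causal_mask(num_chunks=3, tokens_per_chunk=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```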

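Methodology

Training Pipeline (Section 3)

Native Streaming AR Training (3.1): causal chunk-wise autoregressive training; self-resampling lets the model adapt to the degraded history it produces itself
Cross-modal Representation Alignment (3.2): token relation distillation from a jepa teacher (V-JEPA 2) to speed up training
Post-training (3.3): domain-aware-preference-optimization trains domain experts, and reinforced-online-policy-distillation merges the experts into a single policy
Step Distillation: DMD-based four-step distillation for fast inference with almost no quality loss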
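One common reading of token relation distillation is to match pairwise token similarities rather than the features themselves. The sketch below follows that reading; the cosine similarity, the mean squared error, and both function names are assumptions, and the paper's actual loss may differ.

```python
import math

def cosine_relations(tokens):
    """Pairwise cosine similarity between token vectors (list of lists of floats)."""
    norms = [math.sqrt(sum(x * x for x in t)) or 1.0 for t in tokens]
    return [
        [sum(a * b for a, b in zip(ti, tj)) / (ni * nj) for tj, nj in zip(tokens, norms)]
        for ti, ni in zip(tokens, norms)
    ]

def relation_distillation_loss(student_tokens, teacher_tokens):
    """Mean squared error between student and teacher token-relation matrices."""
    s = cosine_relations(student_tokens)
    t = cosine_relations(teacher_tokens)
    n = len(s)
    return sum((s[i][j] - t[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)

# Student and teacher may use different feature widths; only the N x N relations are compared.
teacher = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
aligned_student = [[1.0, 0.0], [0.0, 1.0]]
collapsed_student = [[1.0, 0.0], [1.0, 0.0]]
loss_aligned = relation_distillation_loss(aligned_student, teacher)
loss_collapsed = relation_distillation_loss(collapsed_student, teacher)
```

Because only the relation matrices are compared, this form of the loss needs no projection layer between student and teacher feature spaces.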

Agentic Streaming Inference (Section 4)

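A training-free inference framework in which three controllers wrap the frozen generator:

agentic-streaming-inference (Planner & Observer): a Gemma 4 26B agent writes the prompt stream and monitors generation quality
agentic-cache-manager: manages the KV-cache keep-set and drift control
look-ahead-buffer-controller: controls the lead between generation and playback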
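To make the look-ahead buffer idea concrete, here is a hedged sketch of a controller that tracks how far generation runs ahead of playback. The class, its fields, and the target lead are hypothetical; the note gives only the controller's purpose, not its policy.

```python
from dataclasses import dataclass

@dataclass
class LookAheadBuffer:
    target_lead_s: float       # hypothetical: desired seconds of generated-but-unplayed video
    generated_s: float = 0.0
    played_s: float = 0.0

    @property
    def lead_s(self):
        return self.generated_s - self.played_s

    def should_generate(self):
        # Generate only while the buffer is below target, so new prompts take effect quickly.
        return self.lead_s < self.target_lead_s

    def on_chunk_generated(self, chunk_s):
        self.generated_s += chunk_s

    def on_playback(self, elapsed_s):
        # Playback cannot run past what has been generated; hitting the cap means a stall.
        self.played_s = min(self.played_s + elapsed_s, self.generated_s)

buf = LookAheadBuffer(target_lead_s=1.0)
while buf.should_generate():
    buf.on_chunk_generated(0.5)
lead_after_fill = buf.lead_s
buf.on_playback(0.5)
```

The trade-off in this sketch is that a small lead keeps interaction latency low but leaves little slack for a slow chunk, while a large lead does the opposite.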

Data Pipeline (Section 2)

Synthetic data via LTX-2.3 teacher + director-style LM scenario planning (225 scenes × 15 styles × 12 shots)
Real social video curation: SCRFD face detection → SyncNet lip-sync verification → quality filtering
Daily throughput: on the order of 100,000 videos
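The two data sources above can be sketched in a few lines. The first part counts planning slots, assuming the 225 × 15 × 12 grid is a full cross product (the note does not say whether every combination is generated). The second part is a generic staged filter in the order face detection, lip-sync verification, quality filtering; it does not call SCRFD or SyncNet, and the clip fields and thresholds are hypothetical precomputed scores.

```python
NUM_SCENES, NUM_STYLES, NUM_SHOTS = 225, 15, 12
total_slots = NUM_SCENES * NUM_STYLES * NUM_SHOTS  # upper bound under the cross-product assumption

def curate(clips, stages):
    """Run clips through ordered (name, predicate) stages; drop a clip at its first failure."""
    kept = []
    dropped = {name: 0 for name, _ in stages}
    for clip in clips:
        for name, passes in stages:
            if not passes(clip):
                dropped[name] += 1
                break
        else:
            kept.append(clip)
    return kept, dropped

# Hypothetical thresholds on hypothetical precomputed scores.
stages = [
    ("face", lambda c: c["faces"] >= 1),
    ("lip_sync", lambda c: c["sync"] >= 0.5),
    ("quality", lambda c: c["quality"] >= 0.5),
]
clips = [
    {"faces": 1, "sync": 0.9, "quality": 0.9},
    {"faces": 0, "sync": 0.9, "quality": 0.9},
    {"faces": 1, "sync": 0.1, "quality": 0.9},
]
kept, dropped = curate(clips, stages)
```

Ordering the stages from cheapest to most expensive would let most rejected clips exit early, which matters at a scale of roughly 100,000 videos per day.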

Key Results

47.5 FPS on a single H100 GPU
Generation cost below $0.001 per second
45 minutes of continuous streaming without measurable degradation
State-of-the-art on SocialVideo Bench across 9 metrics against 7 open-source baselines
Training efficiency: <10K GPU hours, <1M clips
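A quick back-of-the-envelope check ties these figures together. It assumes the cost bound applies per second of generated video and takes 47.5 FPS as the peak rate; neither assumption is spelled out in the note.

```python
FPS = 47.5                   # peak generation rate from the note
STREAM_MINUTES = 45
COST_PER_SECOND_USD = 0.001  # upper bound from the note

frame_budget_ms = 1000 / FPS                 # time available per frame at peak rate
stream_seconds = STREAM_MINUTES * 60         # length of the longest reported stream
max_stream_cost_usd = COST_PER_SECOND_USD * stream_seconds

print(f"per-frame budget: {frame_budget_ms:.2f} ms")
print(f"stream length: {stream_seconds} s")
print(f"cost bound for the full stream: under ${max_stream_cost_usd:.2f}")
```

Under these assumptions each frame has about 21 ms of compute, the 45-minute run is 2,700 seconds (consistent with "thousand-second-scale" generation), and the whole stream would cost less than $2.70.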
