title: MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model
created: 2026-06-20
source: arXiv:2606.17800
authors: Lichen Bai, Tianhao Zhang, Shitong Shao, Dingwei Tan, Qiyu Zhong, Zhengpeng Xie, Haopeng Li, Qinghao Huang, Dandan Shen, Tengjiao Ji, Wei Wang, Peicheng Wu, Yuxuan Zhao, Xiangyu Zhu, Welly Luo, Shurui Yang, Zeke Xie
venue: arXiv preprint (cs.CV)
date: 2026-06-16
project: https://mainecoon.tech/
type: paper

MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model

Catnip AI Team · arXiv:2606.17800 · 32 pages, 13 figures, 3 tables

Abstract

As a growing majority of global video content is consumed on social platforms for interactive social purposes, video generation models built for social worlds are important but largely overlooked. We define the position of social world models and build MaineCoon as the first step: a 22B real-time audio-visual autoregressive model capable of streaming generation and sub-second interaction at up to 47.5 FPS on a single GPU.

Key innovations:

  • Self-resampling: exposes the model to its own degraded history during training
  • Cross-modal representation alignment: token relation distillation with V-JEPA 2
  • Domain-aware preference optimization: multi-domain LoRA DPO experts
  • Reinforced online-policy distillation (ROPD): consolidates domain experts into one deployable policy
  • Agentic streaming inference: training-free framework with planner/observer, cache manager, buffer controller
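The domain experts in the list above are described as LoRA DPO experts. As a reminder of what a DPO objective computes, here is a minimal sketch of the standard DPO loss for a single preference pair. The function name, the `beta` default, and the example log-probabilities are illustrative assumptions, and the paper's domain-aware variant may differ from this plain form.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the trainable policy and
    under the frozen reference model. beta=0.1 is an assumed value.
    """
    # Implicit reward margin: how much more strongly the policy prefers the
    # chosen sample over the rejected one, relative to the reference model.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    x = beta * margin
    # -log(sigmoid(x)), evaluated in a numerically stable way.
    if x > 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

# Hypothetical log-probabilities, for illustration only.
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)   # policy equals reference
improved = dpo_loss(-5.0, -15.0, -10.0, -10.0)   # policy prefers the chosen sample
worsened = dpo_loss(-15.0, -5.0, -10.0, -10.0)   # policy prefers the rejected sample
```

When the policy matches the reference, the loss is log 2; it drops below that as the policy shifts probability toward the preferred sample and rises above it in the opposite case.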

MaineCoon supports thousand-second-scale generation while mitigating drift, and sets SOTA on the new SocialVideo Bench (9 evaluation metrics).

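Core Problem

Most video worldwide is consumed on social platforms, but existing video generation models (such as DiT diffusion models) have three major limitations:

  1. Offline and non-streaming: bidirectional temporal attention prevents real-time output
  2. Audio is ignored: speech, lip sync, and emotional resonance are essential to social video
  3. No long-horizon stability: content drifts during minute-scale autoregressive generation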

Methodology

Training Pipeline (Section 3)

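  • Native Streaming AR Training (3.1): causal chunk-by-chunk autoregressive training; self-resampling lets the model adapt to the degraded history it produces itself
  • Cross-modal Representation Alignment (3.2): token relation distillation from a V-JEPA 2 teacher, used to accelerate training
  • Post-training (3.3): domain-aware preference optimization trains domain experts; reinforced online-policy distillation (ROPD) merges the experts into a single policy
  • Step Distillation: DMD-based four-step distillation for fast inference with near-lossless quality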
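One common reading of "token relation distillation" is that the student matches the pairwise similarity structure among the teacher's tokens rather than the teacher's raw features. The sketch below illustrates that reading only. The choice of cosine similarity, the mean-squared-error penalty, the function names, and the toy tokens are all assumptions; the notes do not say how the paper defines the relation or the loss.

```python
import math

def cosine_relations(tokens):
    """Return the N x N matrix of pairwise cosine similarities.

    Assumes every token vector is non-zero.
    """
    normed = []
    for t in tokens:
        norm = math.sqrt(sum(v * v for v in t))
        normed.append([v / norm for v in t])
    return [[sum(a * b for a, b in zip(u, w)) for w in normed] for u in normed]

def relation_distillation_loss(student_tokens, teacher_tokens):
    """Mean squared difference between student and teacher relation matrices.

    Only the number of tokens must match; feature widths may differ.
    """
    s = cosine_relations(student_tokens)
    t = cosine_relations(teacher_tokens)
    n = len(s)
    return sum((s[i][j] - t[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)

# Toy example: 3 tokens; the student has 2-D features, the teacher 3-D features.
student = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
teacher_same_geometry = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [1.0, 1.0, 0.0]]
teacher_other_geometry = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

loss_same = relation_distillation_loss(student, teacher_same_geometry)
loss_other = relation_distillation_loss(student, teacher_other_geometry)
```

Because the loss compares relation matrices, the student and teacher do not need a shared feature dimension or a learned projection between them, which is one reason this style of alignment is convenient with a frozen external teacher.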

Agentic Streaming Inference (Section 4)

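A training-free inference framework in which three controllers (planner/observer, cache manager, buffer controller) wrap the frozen generator.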
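To make the controller arrangement concrete, here is a toy sketch of three controllers wrapped around a frozen generator. The responsibilities given to each controller, every class and method name, and the stub generator are assumptions made for illustration; the notes name the controllers but do not describe their interfaces.

```python
from collections import deque

def frozen_generator(prompt, cached_chunks):
    """Stand-in for the frozen model: returns a label for the next chunk."""
    return f"chunk[{prompt}|ctx={len(cached_chunks)}]"

class PlannerObserver:
    """Hypothetical role: choose the prompt for the next chunk from user events."""
    def __init__(self, initial_prompt):
        self.prompt = initial_prompt

    def step(self, user_event):
        if user_event is not None:
            self.prompt = user_event
        return self.prompt

class CacheManager:
    """Hypothetical role: keep only the most recent chunks as context."""
    def __init__(self, max_chunks):
        self.chunks = deque(maxlen=max_chunks)

    def context(self):
        return list(self.chunks)

    def add(self, chunk):
        self.chunks.append(chunk)

class BufferController:
    """Hypothetical role: hold generated chunks until playback consumes them."""
    def __init__(self, target_depth):
        self.target_depth = target_depth
        self.queue = deque()

    def needs_more(self):
        return len(self.queue) < self.target_depth

    def push(self, chunk):
        self.queue.append(chunk)

    def pop(self):
        return self.queue.popleft() if self.queue else None

def stream(user_events, max_cache=2, target_depth=1):
    """Run one playback tick per user event; the generator is never updated."""
    planner = PlannerObserver("greet")
    cache = CacheManager(max_cache)
    buffer = BufferController(target_depth)
    played = []
    for event in user_events:
        prompt = planner.step(event)
        while buffer.needs_more():
            chunk = frozen_generator(prompt, cache.context())
            cache.add(chunk)
            buffer.push(chunk)
        played.append(buffer.pop())
    return played

played = stream([None, None, "wave", None])
```

In this sketch the context length stops growing once the cache is full, which is the kind of bounded-history behavior a cache manager would need for long sessions, and a user event changes the prompt from the next generated chunk onward.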

Data Pipeline (Section 2)

  • Synthetic data via LTX-2.3 teacher + director-style LM scenario planning (225 scenes × 15 styles × 12 shots)
  • Real social video curation: SCRFD face detection → SyncNet lip-sync verification → quality filtering
  • Daily processing capacity: on the order of 100,000 videos
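The two data sources above can be sketched as a grid-size calculation plus a staged filter. The grid product assumes the scenes, styles, and shots are fully crossed, which the notes do not state. The clip records, field names, and thresholds are hypothetical stand-ins for the outputs of the real SCRFD and SyncNet stages, whose APIs are not reproduced here.

```python
# Size of the synthetic scenario grid quoted in the notes,
# assuming every scene is paired with every style and shot.
SCENES, STYLES, SHOTS = 225, 15, 12
grid_size = SCENES * STYLES * SHOTS  # 40,500 combinations

# Hypothetical per-clip records; in practice the scores would come from
# the face detector, the lip-sync verifier, and a quality model.
clips = [
    {"id": "a", "num_faces": 1, "sync_confidence": 0.9, "quality": 0.8},
    {"id": "b", "num_faces": 0, "sync_confidence": 0.9, "quality": 0.9},
    {"id": "c", "num_faces": 1, "sync_confidence": 0.2, "quality": 0.9},
    {"id": "d", "num_faces": 2, "sync_confidence": 0.7, "quality": 0.3},
]

def has_face(clip):
    return clip["num_faces"] >= 1

def lips_in_sync(clip, threshold=0.5):  # threshold is an assumed value
    return clip["sync_confidence"] >= threshold

def good_quality(clip, threshold=0.6):  # threshold is an assumed value
    return clip["quality"] >= threshold

def curate(candidates):
    """Apply the stages in the order listed: face, lip sync, quality."""
    kept = candidates
    for stage in (has_face, lips_in_sync, good_quality):
        kept = [c for c in kept if stage(c)]
    return kept

kept_ids = [c["id"] for c in curate(clips)]
```

Running the cheapest rejection stage first keeps the later, more expensive checks off clips that would be dropped anyway, which matters at a throughput of roughly 100,000 videos per day.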

Key Results

  • 47.5 FPS on single H100 GPU
  • <$0.001 per second generation cost
  • 45 minutes continuous streaming without measurable degradation
  • SOTA on SocialVideo Bench across 9 metrics vs. 7 open-source baselines
  • Training efficiency: <10K GPU hours, <1M clips
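A few derived quantities help put the headline numbers in context. This is back-of-the-envelope arithmetic, not reported results: it assumes the peak 47.5 FPS is sustained for the full 45 minutes and that the cost bound applies per second of generated video, neither of which the notes confirm.

```python
FPS = 47.5                   # reported peak frame rate
STREAM_MINUTES = 45          # reported continuous streaming duration
COST_PER_SECOND_USD = 0.001  # reported upper bound on cost per second

# Time available to produce each frame at the peak rate.
frame_budget_ms = 1000.0 / FPS

# Length of the streaming run in seconds and in frames.
stream_seconds = STREAM_MINUTES * 60
frames_in_stream = FPS * stream_seconds

# Upper bound on the cost of the whole run under the assumptions above.
cost_upper_bound_usd = COST_PER_SECOND_USD * stream_seconds

print(f"{frame_budget_ms:.2f} ms per frame")
print(f"{stream_seconds} s, {frames_in_stream:.0f} frames")
print(f"under ${cost_upper_bound_usd:.2f} for the run")
```

The 45-minute run works out to 2,700 seconds, which is consistent with the "thousand-second-scale" claim in the abstract.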

Related Concepts