20260625: lots of new content

2026-06-25 14:08:47 +08:00
parent 91fac5b6fc
commit 6021dea160
375 changed files with 19263 additions and 251 deletions
--- a/concepts/audio-visual-representation-alignment.md
+++ b/concepts/audio-visual-representation-alignment.md
@@ -0,0 +1,57 @@
+---
+title: "Audio-Visual Representation Alignment"
+created: 2026-06-20
+updated: 2026-06-20
+type: concept
+tags: ["representation", "alignment", "audio-visual", "training", "jepa"]
+sources: ["https://arxiv.org/abs/2606.17800"]
+---
+
+# Audio-Visual Representation Alignment
+
+**Audio-Visual Representation Alignment** is the technique in [[maineCoon|MaineCoon]] that accelerates streaming audio-visual training through **token relation distillation** from a [[jepa|V-JEPA 2]] teacher.
+
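+> Note: this concept is different from [[representation-alignment|representation alignment]] in LLMs (embedding invariance in TST). Here it refers specifically to aligning intermediate-layer features in an audio-visual diffusion model.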
+
+## Motivation: Streaming Training Acquires Visual Semantics Slowly
+
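+When a large-scale audio-visual DiT is trained from scratch, the [[flow-matching|Flow Matching]] loss supervises only low-level reconstruction and puts only weak pressure on semantic structure. Coherent motion and audio-visual correspondence emerge only late in training.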
+
+## Token Relation Distillation
+
+MaineCoon adopts the **relational alignment** strategy of VideoREPA:
+
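+### 1. Teacher Feature Extraction
+- Teacher: a frozen V-JEPA 2 encoder
+- Sample frames from the training clip and resize them so that the teacher's patch grid matches the visual latent grid
+- Output a feature volume `Y ∈ R^{F×S×d_tea}` that is in one-to-one correspondence with the visual latent tokens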
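The grid matching above can be sketched as simple shape arithmetic. This is only an illustration: the helper names, the patch size of 16, the 30×52 latent grid, and the reading of `S` as the number of spatial tokens per frame are all assumptions, not values from the MaineCoon paper. Temporal tubelets are ignored.

```python
def teacher_frame_size(latent_h, latent_w, patch_size):
    """Pixel size a frame is resized to so that the teacher's patch grid
    has exactly latent_h x latent_w cells (hypothetical helper)."""
    return latent_h * patch_size, latent_w * patch_size


def feature_volume_shape(num_frames, latent_h, latent_w, d_tea):
    """Shape (F, S, d_tea) of the teacher feature volume Y, assuming
    S counts the spatial tokens of one frame."""
    return num_frames, latent_h * latent_w, d_tea


# Hypothetical numbers chosen only for illustration.
height, width = teacher_frame_size(30, 52, patch_size=16)
shape = feature_volume_shape(8, 30, 52, d_tea=1024)
print(height, width, shape)
```

With these made-up numbers, each sampled frame would be resized to 480×832 pixels before it is passed to the frozen teacher.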
+
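+### 2. Relation Matrix Matching
+At selected intermediate layers, the hidden states of the noisy visual target are projected into the teacher space, and the **pairwise token relation matrices** are then matched: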
+```
+R(a)_{mn} = a_m^T a_n / (‖a_m‖₂ ‖a_n‖₂)
+```
+Aligning relations rather than absolute feature values lets the generator keep its own representation basis.
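A minimal pure-Python sketch of the relation matrix `R(a)`, assuming every token is a non-zero vector. The function names and toy tokens are illustrative only. The sketch also checks the point made above: rotating all tokens by the same angle (a change of basis) leaves `R` unchanged, so matching relations does not pin the student to the teacher's coordinates.

```python
import math


def relation_matrix(tokens):
    """Pairwise cosine similarities R[m][n] between token vectors.
    Assumes every token has non-zero norm."""
    norms = [math.sqrt(sum(x * x for x in t)) for t in tokens]
    return [
        [sum(a * b for a, b in zip(tm, tn)) / (norms[m] * norms[n])
         for n, tn in enumerate(tokens)]
        for m, tm in enumerate(tokens)
    ]


def rotate_2d(tokens, angle):
    """Rotate every 2-D token by the same angle (an orthogonal change of basis)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c * x - s * y, s * x + c * y] for x, y in tokens]


tokens = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]  # toy data
R = relation_matrix(tokens)
R_rot = relation_matrix(rotate_2d(tokens, 0.7))

# Largest entry-wise change in R caused by the rotation.
max_diff = max(abs(R[m][n] - R_rot[m][n]) for m in range(3) for n in range(3))
print(max_diff < 1e-9)
```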
+
+### 3. Hinge-Margin Loss
+```
+L_TRD = (1/N²) Σ ReLU(R(Ŷ)_{mn} - R(Y)_{mn} - γ)
+```
+The margin γ ignores small relational differences, which makes the objective more stable.
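The loss can be sketched directly from the formula as written, assuming `Ŷ` denotes the projected student tokens, `Y` the teacher tokens, and the sum runs over all `N²` token pairs `(m, n)`. Note that, taken literally, the expression is one-sided: a pair contributes only when the student's similarity exceeds the teacher's by more than γ. The function names and toy inputs are illustrative.

```python
import math


def relation_matrix(tokens):
    """Pairwise cosine similarities between non-zero token vectors."""
    norms = [math.sqrt(sum(x * x for x in t)) for t in tokens]
    return [
        [sum(a * b for a, b in zip(tm, tn)) / (norms[m] * norms[n])
         for n, tn in enumerate(tokens)]
        for m, tm in enumerate(tokens)
    ]


def trd_loss(student_tokens, teacher_tokens, gamma):
    """(1/N^2) * sum over (m, n) of ReLU(R(student) - R(teacher) - gamma)."""
    r_s = relation_matrix(student_tokens)
    r_t = relation_matrix(teacher_tokens)
    n_tokens = len(student_tokens)
    total = 0.0
    for m in range(n_tokens):
        for n in range(n_tokens):
            total += max(0.0, r_s[m][n] - r_t[m][n] - gamma)
    return total / (n_tokens * n_tokens)


teacher = [[1.0, 0.0], [0.0, 1.0]]  # orthogonal tokens: off-diagonal similarity 0
student = [[1.0, 0.0], [1.0, 1.0]]  # off-diagonal similarity sqrt(0.5)

loss_small_margin = trd_loss(student, teacher, gamma=0.1)
loss_large_margin = trd_loss(student, teacher, gamma=0.8)  # gap falls inside the margin
print(loss_small_margin, loss_large_margin)
```

In the toy case, the two off-diagonal pairs each exceed the teacher similarity by `sqrt(0.5)`, so a margin of 0.1 leaves a positive loss while a margin of 0.8 absorbs the difference entirely.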
+
+## Integration with Native Streaming Training
+
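+- The alignment loss is added as an auxiliary objective
+- It is computed only on the visual target half (the audio stream is left unconstrained)
+- It is enabled only on the main gradient forward pass (disabled during self-resampling rollout)
+- The teacher is frozen and its features are precomputed, so training needs no extra teacher forward pass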
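The rules in this list can be sketched as a gating function. Everything named here is hypothetical: the boolean mask `is_visual_target`, the weight of 0.5, and the stand-in `loss_fn` are illustration only, and the actual MaineCoon training loop may be organized differently.

```python
def alignment_term(hidden_tokens, is_visual_target, teacher_feats, loss_fn):
    """Apply the alignment loss to visual-target tokens only; audio and
    other tokens are filtered out before the loss is evaluated."""
    student = [h for h, keep in zip(hidden_tokens, is_visual_target) if keep]
    return loss_fn(student, teacher_feats)


def step_loss(fm_loss, align_loss, in_rollout, weight=0.5):
    """Flow-matching loss plus the auxiliary term, which is skipped
    during a self-resampling rollout pass. The weight is a made-up value."""
    if in_rollout:
        return fm_loss
    return fm_loss + weight * align_loss


def count_mismatch(student, teacher):
    """Stand-in loss used only to exercise the gating logic."""
    return float(sum(1 for s, t in zip(student, teacher) if s != t))


hidden = ["v0", "a0", "v1", "a1"]   # toy tokens: visual / audio interleaved
mask = [True, False, True, False]   # True marks a visual-target token
cached_teacher = ["v0", "vX"]       # precomputed teacher features (toy)

align = alignment_term(hidden, mask, cached_teacher, count_mismatch)
main_pass = step_loss(1.0, align, in_rollout=False)
rollout_pass = step_loss(1.0, align, in_rollout=True)
print(align, main_pass, rollout_pass)
```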
+
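+## Results
+- Substantially reduces the number of training steps needed to reach coherent motion and AV correspondence
+- Improves final generation quality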
+
+## References
+- [[maineCoon|MaineCoon paper]] Section 3.2
+- [[jepa|V-JEPA 2]]
+- [[representation-alignment|LLM Representation Alignment]] (different meaning)
+- VideoREPA (Zhang et al.)