20260601

2026-06-01 10:46:01 +08:00
parent 2faf4bb002
commit e96b955fda
221 changed files with 10219 additions and 332 deletions
--- a/concepts/input-superposition.md
+++ b/concepts/input-superposition.md
@@ -0,0 +1,42 @@
+---
+title: "Input Superposition"
+created: 2026-05-29
+updated: 2026-05-29
+type: concept
+tags: ["pre-training", "embedding", "tokenization"]
+sources: ["https://arxiv.org/abs/2605.06546"]
+---
+
+# Input Superposition
+
+**Input Superposition** is the input-side operation in [[token-superposition-training|TST]]: the embeddings of s consecutive tokens are averaged into a single latent "s-token". It was studied systematically by Peng, Gigant & Quesnelle (2026) as part of TST.
+
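+## Operation
+
+Let the token sequence be $t_1, t_2, \dots, t_L$ with bag size s:
+1. Group the tokens into bags: $\{t_1, \dots, t_s\}, \{t_{s+1}, \dots, t_{2s}\}, \dots$
+2. For each bag $j$, compute the mean embedding: $e'_j = \frac{1}{s} \sum_{k=1}^s e(t_{(j-1)s+k})$
+3. The LLM then runs on the resulting sequence, which is s× shorter.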
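+
+As a small worked illustration of steps 1 and 2 (the values $L = 6$ and $s = 2$ are chosen here for illustration and are not taken from the paper), the six tokens form three bags, so the model sees three positions instead of six:
+
+$$
+e'_1 = \tfrac{1}{2}\bigl(e(t_1) + e(t_2)\bigr), \quad
+e'_2 = \tfrac{1}{2}\bigl(e(t_3) + e(t_4)\bigr), \quad
+e'_3 = \tfrac{1}{2}\bigl(e(t_5) + e(t_6)\bigr)
+$$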
+
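+## Effects
+
+- Sequence length goes from L to L/s. FLOPs per training step stay **unchanged**, provided the number of positions per step is held fixed: each s-token has the same representation dimension as an ordinary token, so a position costs the same compute while covering s source tokens.
+- At **equal FLOPs**, the model consumes s× more data tokens.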
+
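+## Source of the gains (open question)
+
+The paper proposes two explanations:
+1. **Pre-pre-training hypothesis**: coarse-grained tokens preserve the local statistical structure of text (topic, co-occurrence), so the model learns this coarse structure first.
+2. **Embedding regularization hypothesis**: averaging over random s-grams in embedding space implicitly regularizes the embedding geometry.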
+
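+## Cross-modal connections
+
+The **coarse-to-fine granularity schedule** ([[coarse-to-fine-granularity]]) that input superposition embodies also has precedents in other modalities:
+- Coarse-to-fine patch size scheduling in ViT (Anagnostidis et al.)
+- Byte-level → subword recovery training (Minixhofer et al.)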
+
+## Related
+
+- [[token-superposition-training]]: the full method
+- [[multi-hot-cross-entropy]]: the loss function paired with it on the output side
+- [[coarse-to-fine-granularity]]: the underlying design principle