20260601

2026-06-01 10:46:01 +08:00
parent 2faf4bb002
commit e96b955fda
221 changed files with 10219 additions and 332 deletions
--- a/concepts/throughput-hypothesis.md
+++ b/concepts/throughput-hypothesis.md
@@ -0,0 +1,41 @@
+---
+title: "Throughput Hypothesis"
+created: 2026-05-29
+updated: 2026-05-29
+type: concept
+tags: ["pre-training", "efficiency", "tokenization", "hypothesis"]
+sources: ["https://arxiv.org/abs/2605.06546"]
+---
+
+# Throughput Hypothesis
+
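+The **Throughput Hypothesis** was proposed by Gigant et al. (2025) and further validated by Peng et al. (2026) in [[token-superposition-training|TST]]: **the performance advantage of subword-level models over byte-level models comes mainly from the higher training-sample throughput that coarser tokens provide, not from a difference in representation quality itself.**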
+
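+## Core claim
+
+Under equal-FLOPs training:
+- Coarser tokenization (e.g., BPE) → each step processes more raw characters → higher "effective data throughput"
+- This throughput gap is enough to explain most of the performance gap between subword and byte-level models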
+
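+## Validation by TST
+
+TST takes the hypothesis in a new direction:
+- It uses **token superposition** to artificially create coarser granularity at training time
+- It finds that, even with the tokenizer unchanged, raising training-time throughput alone yields significant gains
+- This indicates that the throughput hypothesis applies not only to tokenizer choice but also to the dynamic scheduling of training-time representations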
+
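+## Implied corollaries
+
+1. Training-efficiency optimization should focus on **data throughput per FLOP** rather than information density per token
+2. Token granularity at inference can be chosen independently of granularity at training (a key advantage of TST)
+3. In compute-bound settings, trading training-time representation precision for throughput is Pareto-efficient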
+
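+## Limitations
+
+The hypothesis rests on the premise that LLM pre-training is **compute-bound** rather than **data-bound**. Kim et al. (2026) predict a possible future shift to data-bound training, in which case output-only superposition may be more advantageous because it does not increase data consumption.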
+
+## Related
+
+- [[token-superposition-training]]: the TST method
+- [[coarse-to-fine-granularity]]: a concrete implementation pattern for the throughput hypothesis
+- [[peng-tst-2026]]: the source paper