20260601
49
concepts/token-superposition-training.md
Normal file
@@ -0,0 +1,49 @@
---
title: "Token Superposition Training (TST)"
created: 2026-05-29
updated: 2026-05-29
type: concept
tags: ["pre-training", "efficiency", "LLM"]
sources: ["https://arxiv.org/abs/2605.06546"]
---

# Token Superposition Training (TST)

**Token Superposition Training** (TST) is a two-phase method for accelerating LLM pre-training, proposed by Peng, Gigant & Quesnelle (Nous Research, 2026). The core idea: early in training, use **coarse-grained token superposition** to raise data throughput, then return to standard training for the remainder of the run.
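## Mechanism

TST does not modify the model architecture, tokenizer, optimizer, or parallelism strategy; it is a pure drop-in method:

### Phase 1: Superposition

- **Average** the embeddings of s consecutive tokens to form one [[s-token]] (a "bag")
- Use the [[multi-hot-cross-entropy|MCE]] loss to predict all tokens of the next bag
- Effect: sequence length shrinks by a factor of s → s× more data is consumed at equal FLOPs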
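The two superposition-phase operations above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the function names `superpose` and `multi_hot_cross_entropy` are hypothetical, dropping a trailing partial bag is an assumption, and the loss assumes MCE is cross-entropy against a uniform target over the tokens of the next bag, with duplicate tokens counted by multiplicity.

```python
import math


def superpose(embeddings, s):
    """Average each run of s consecutive token embeddings into one bag embedding.

    Assumption: trailing tokens that do not fill a whole bag are dropped.
    """
    n_bags = len(embeddings) // s
    dim = len(embeddings[0]) if embeddings else 0
    bags = []
    for b in range(n_bags):
        chunk = embeddings[b * s:(b + 1) * s]
        # Element-wise mean over the s vectors in this chunk.
        bags.append([sum(vec[d] for vec in chunk) / s for d in range(dim)])
    return bags


def multi_hot_cross_entropy(logits, bag_token_ids):
    """Assumed MCE: mean negative log-probability of the tokens in the next bag."""
    # Numerically stable log-softmax over the vocabulary.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    return -sum(log_probs[t] for t in bag_token_ids) / len(bag_token_ids)


# Five 2-dimensional token embeddings, bag size 2: the fifth token is dropped.
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0], [9.0, 9.0]]
bag_embeddings = superpose(token_embeddings, 2)

# Uniform logits over a 4-token vocabulary; the next bag holds tokens 1 and 2.
loss = multi_hot_cross_entropy([0.0, 0.0, 0.0, 0.0], [1, 2])
```

With `s = 1` both functions reduce to the ordinary case under these assumptions: each bag is a single token embedding, and the loss is standard cross-entropy on one target token.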
### Phase 2: Recovery

- Return to standard causal next-token prediction
- The embedding and LM head are **not re-initialized**
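The relationship between the two phases can be illustrated by how inputs and targets are paired. The sketch below assumes non-overlapping bags and uses a hypothetical helper `build_targets`; it only shows the data side and says nothing about how the paper's training code is organized.

```python
def build_targets(token_ids, s):
    """Pair each bag of s token ids with the bag that follows it.

    With s == 1 this yields ordinary next-token (input, target) pairs.
    Assumption: bags do not overlap and a trailing partial bag is dropped.
    """
    n_bags = len(token_ids) // s
    bags = [token_ids[i * s:(i + 1) * s] for i in range(n_bags)]
    # Bag i is the input, bag i + 1 is the prediction target.
    return list(zip(bags[:-1], bags[1:]))


tokens = [10, 11, 12, 13, 14, 15]
phase_one_pairs = build_targets(tokens, 3)  # superposition phase, s = 3
phase_two_pairs = build_targets(tokens, 1)  # recovery phase, plain next-token
```

Because only the bag size changes between the two calls, the same embedding table and LM head can serve both phases, which is consistent with the note that neither is re-initialized.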
## Key parameters

| Parameter | Meaning | Recommended range |
|-----------|---------|-------------------|
| s (bag size) | Number of tokens per bag | 4–8 |
| r (step ratio) | Fraction of total training steps spent in the superposition phase | 0.2–0.4 |
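How s and r interact can be sketched as a simple step schedule. The helper names `bag_size_at` and `relative_tokens_seen` are hypothetical, and the throughput estimate assumes every step processes the same number of sequence positions, so a superposition step consumes s times as many tokens as a standard step; any overhead from averaging embeddings is ignored.

```python
def bag_size_at(step, total_steps, r, s):
    """Return the bag size for a 0-indexed step: s in phase 1, 1 in phase 2."""
    superposition_steps = round(total_steps * r)
    return s if step < superposition_steps else 1


def relative_tokens_seen(r, s):
    """Tokens consumed over the whole run, relative to standard training.

    Assumption: equal positions per step in both phases, so a fraction r of
    the steps sees s tokens per position and the rest sees one.
    """
    return r * s + (1.0 - r)


# Example: 1000 steps, a quarter of them in the superposition phase, s = 4.
first_phase_bag = bag_size_at(0, 1000, 0.25, 4)
second_phase_bag = bag_size_at(250, 1000, 0.25, 4)
data_multiplier = relative_tokens_seen(0.25, 4)
```

Under these assumptions, `r = 0.25` and `s = 4` give a data multiplier of 1.75, meaning the run sees 75% more tokens than a standard run with the same number of steps.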
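## Performance

- 10B A1B MoE: **2.5× reduction in training time** at equal loss
- 3B Dense: lower final loss at equal FLOPs, with downstream task performance on par or better

## Why it works

1. **Coarse-to-fine granularity schedule** ([[coarse-to-fine-granularity]]): learn coarse statistical structure first, then refine
2. **Representation alignment** ([[representation-alignment]]): sharing the embedding across both phases is key
3. **Throughput hypothesis** ([[throughput-hypothesis]]): coarser tokens → higher data throughput → better performance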
## Related

- [[peng-tst-2026]]: the original paper
- [[multi-hot-cross-entropy]]: the core loss function
- [[two-phase-pretraining]]: the two-phase training paradigm