20260625: lots of new content

2026-06-25 14:08:47 +08:00
parent 91fac5b6fc
commit 6021dea160
375 changed files with 19263 additions and 251 deletions
--- a/concepts/mamba-2.md
+++ b/concepts/mamba-2.md
@@ -0,0 +1,48 @@
+---
+title: "Mamba-2"
+created: 2026-06-18
+updated: 2026-06-18
+type: concept
+tags: [ssm, architecture, mamba, efficiency]
+sources:
+  - dao-transformers-are-ssms-2024
+---
+
+# Mamba-2
+
+Mamba-2 is a new architecture designed by Dao & Gu (2024) on top of the [[structured-state-space-duality|SSD framework]]. Its core layer is a refinement of the selective SSM in [[mamba-ssm|Mamba]] and runs **2-8x faster**.
+
+## Improvements over Mamba
+
+### Architecture
+| Component | Mamba (2023) | Mamba-2 (2024) |
+|------|:---:|:---:|
+| A matrix | Diagonal | Scalar × identity |
+| Head dimension P | 1 | 64/128 |
+| Head structure | Multi-input SSM (MIS) | Grouped-value attention (GVA) |
+| Parallelism | Poorly suited to tensor parallelism (TP) | Native tensor parallelism |
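
The "scalar × identity" row can be made concrete with a small sketch. The function below is illustrative only (the name `ssm_head_scalar_a` and the shapes are assumptions, not taken from the paper's code): with `A_t = a_t * I`, one scalar decay per step acts on the whole `(N, P)` state, so all `P` channels of a head share the same recurrence dynamics.

```python
import numpy as np

def ssm_head_scalar_a(x, a, B, C):
    """One SSM head with A_t = a_t * I (scalar times identity).

    x: (T, P) inputs, a: (T,) scalar decays, B, C: (T, N) projections.
    Returns y: (T, P).
    """
    T, P = x.shape
    N = B.shape[1]
    h = np.zeros((N, P))          # state shared by all P channels of the head
    y = np.empty((T, P))
    for t in range(T):
        h = a[t] * h + np.outer(B[t], x[t])   # scalar decay plus rank-1 update
        y[t] = C[t] @ h                        # read out all P channels at once
    return y
```

With a general diagonal `A_t` the decay would be a length-`N` vector per step instead of a single scalar; restricting it to a scalar is what lets the head dimension `P` grow without changing the recurrence.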
+
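+### Efficiency
+- **SSD algorithm**: exploits a block decomposition of [[semiseparable-matrices|semiseparable matrices]]; part of the work is done as a recurrence (O(T) in sequence length) and the rest as matrix multiplications, which are GPU-optimized
+- **2-8x** faster than Mamba's selective scan
+- Supports **8x** larger state sizes (N) with almost no slowdown
+- **6x** faster than FlashAttention-2 at sequence length 16K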
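
A minimal NumPy sketch of the block-decomposition idea, assuming the scalar-decay recurrence `h_t = a_t * h_{t-1} + B_t x_t^T`, `y_t = C_t h_t` with `a_t > 0`. The names `ssm_recurrence` and `ssd_chunked` are hypothetical, and the Python loop over chunks stands in for what an optimized implementation would batch; this is not the paper's kernel. Inside each chunk the output comes from matrix multiplications against a masked decay matrix, while only one `(N, P)` state is carried from chunk to chunk.

```python
import numpy as np

def ssm_recurrence(x, a, B, C):
    """Reference: plain step-by-step recurrence. x: (T, P), a: (T,), B, C: (T, N)."""
    T, P = x.shape
    h = np.zeros((B.shape[1], P))
    y = np.empty((T, P))
    for t in range(T):
        h = a[t] * h + np.outer(B[t], x[t])
        y[t] = C[t] @ h
    return y

def ssd_chunked(x, a, B, C, chunk=4):
    """Same output, computed chunk by chunk (assumes T % chunk == 0 and a > 0)."""
    T, P = x.shape
    assert T % chunk == 0
    mask = np.tril(np.ones((chunk, chunk), dtype=bool))
    h = np.zeros((B.shape[1], P))              # state carried between chunks
    y = np.empty((T, P))
    for s in range(0, T, chunk):
        sl = slice(s, s + chunk)
        cs = np.cumsum(np.log(a[sl]))          # cumulative log-decay within the chunk
        # L[i, j] = a_{j+1} * ... * a_i for i >= j, and 0 above the diagonal
        L = np.exp(np.where(mask, cs[:, None] - cs[None, :], -np.inf))
        # 1) intra-chunk part: attention-like masked matmul
        y[sl] = (L * (C[sl] @ B[sl].T)) @ x[sl]
        # 2) contribution of the state entering the chunk
        y[sl] += np.exp(cs)[:, None] * (C[sl] @ h)
        # 3) recurrent state update, once per chunk instead of once per token
        decay_to_end = np.exp(cs[-1] - cs)
        h = np.exp(cs[-1]) * h + (B[sl] * decay_to_end[:, None]).T @ x[sl]
    return y
```

The sequential dependency now has length `T / chunk` rather than `T`, and the per-chunk work is dominated by dense matmuls, which is the trade-off the bullet list describes.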
+
+## Chinchilla scaling laws
+
+Under the Chinchilla setup on the Pile dataset, Mamba-2 **Pareto-dominates** Mamba and Transformer++:
+- Downstream evaluation: a 2.7B-parameter model trained on 300B tokens outperforms Pythia-2.8B and Pythia-6.9B
+
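+## Key design decisions
+
+1. **Tensor-parallel friendly**: all data-dependent projections are moved to the start of the block and computed in parallel, which reduces synchronization points
+2. **GVA head structure**: grouped-value attention, which sits between MHA and MQA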
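+3. **Variable-length sequences**: no padding tokens are needed; this is handled by controlling how the recurrent state is passed across sequence boundaries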
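
One way to picture the variable-length point is the sketch below, under the assumption that the layer follows the recurrence `h_t = a_t * h_{t-1} + B_t x_t^T`. The helper `pack_sequences` is hypothetical: it concatenates sequences of different lengths into one long sequence and sets the decay to zero at the first token of each sequence, so no state leaks across the boundary and no padding is required.

```python
import numpy as np

def ssm_recurrence(x, a, B, C):
    """Plain recurrence. x: (T, P), a: (T,), B, C: (T, N) -> y: (T, P)."""
    T, P = x.shape
    h = np.zeros((B.shape[1], P))
    y = np.empty((T, P))
    for t in range(T):
        h = a[t] * h + np.outer(B[t], x[t])
        y[t] = C[t] @ h
    return y

def pack_sequences(seqs):
    """Concatenate (x, a, B, C) tuples into one sequence without padding.

    The decay at the first token of every sequence is set to 0, which resets
    the carried state there (an illustrative convention for this sketch).
    """
    xs, decays, Bs, Cs = [], [], [], []
    for x, a, B, C in seqs:
        a = a.copy()
        a[0] = 0.0                 # block state from the previous sequence
        xs.append(x)
        decays.append(a)
        Bs.append(B)
        Cs.append(C)
    return tuple(np.concatenate(parts) for parts in (xs, decays, Bs, Cs))
```

Running `ssm_recurrence(*pack_sequences(seqs))` should then match running each sequence on its own and concatenating the outputs, since every sequence starts from a zero state either way.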
+
+## References
+
+- [[structured-state-space-duality|SSD]]
+- [[ssd-algorithm|SSD algorithm]]
+- [[mamba-ssm|Mamba]]
+- [[head-structure-ssm|SSM multi-head structure]]
+- [[dao-transformers-are-ssms-2024|Paper]]