20260625: lots of new content
90
papers/dao-transformers-are-ssms-2024.md
Normal file
@@ -0,0 +1,90 @@
---
title: "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality"
created: 2026-06-18
updated: 2026-06-18
type: paper
authors:
  - Tri Dao (Princeton University)
  - Albert Gu (Carnegie Mellon University)
source: arXiv
source_id: 2405.21060
published: 2024-05-31
venue: ICML 2024
categories:
  - cs.LG
---

# Transformers are SSMs

> Dao & Gu (2024), arXiv:2405.21060, **ICML 2024**

## Core thesis

**Transformers and SSMs are, at their core, dual forms of the same class of models.** Using [[semiseparable-matrices|semiseparable matrices]] as the mathematical bridge, Dao & Gu build a unified framework: [[structured-state-space-duality|structured state space duality (SSD)]].

## The SSD framework: three views

```
SSM (linear/recurrent)   ────→  semiseparable matrix  ←────  Attention (quadratic/parallel)
O(T) training                   M_ij structure               O(T²) training
constant-state inference                                     GPU Tensor Core
```

Two complementary mathematical viewpoints:

1. **Matrix transformation view**: an SSM is a parameterized matrix multiplication Y = M·X
2. **[[tensor-contraction-duality|Tensor contraction view]]**: yields the SSM ↔ Attention duality
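
For reference, the matrix M in the first view can be written out explicitly by unrolling the SSM recurrence. The block below is a sketch in the usual selective-SSM notation; the symbols A_t, B_t, C_t and h_t are assumptions carried over from that notation and are not defined elsewhere in this note.

```latex
h_t = A_t h_{t-1} + B_t x_t, \qquad y_t = C_t^\top h_t
\quad\Longrightarrow\quad
y_i = \sum_{j \le i} M_{ij}\, x_j,
\qquad
M_{ij} =
\begin{cases}
C_i^\top A_i A_{i-1} \cdots A_{j+1} B_j, & i \ge j \\
0, & i < j
\end{cases}
```

Lower-triangular matrices of this form are the semiseparable matrices that sit between the SSM and attention views.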
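
## Two ways to compute the SSD layer

### Recurrent form (linear complexity)

- A simplification of the [[selective-state-space-models|selective SSM]]: A is restricted from a diagonal matrix to a scalar (times the identity)
- Head dimension P = 64 or 128 (similar to Transformers)

### Dual form (quadratic complexity)

```
Y = (L ∘ QK^T) · V
L_ij = a_i × ... × a_{j+1}    (i ≥ j; 0 otherwise)
```

- Drops the softmax and adds a **data-dependent positional mask** L
- L replaces heuristic positional encodings: a_t approaches 0 at information-dense positions, which acts as a reset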
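
The equivalence of the two forms can be checked numerically on a toy example. The following is a minimal sketch, assuming a single head with head dimension P = 1 and a scalar decay a_t per step, and identifying Q = C, K = B, V = x; the function names are hypothetical and this is not the reference implementation from `state-spaces/mamba`.

```python
import random


def ssm_recurrent(a, B, C, x):
    """Linear-time form: h_t = a_t * h_{t-1} + B_t * x_t, y_t = <C_t, h_t>."""
    h = [0.0] * len(B[0])
    y = []
    for a_t, B_t, C_t, x_t in zip(a, B, C, x):
        h = [a_t * h_k + b_k * x_t for h_k, b_k in zip(h, B_t)]
        y.append(sum(c_k * h_k for c_k, h_k in zip(C_t, h)))
    return y


def decay_mask(a):
    """L[i][j] = a_i * ... * a_{j+1} for i >= j (empty product = 1), else 0."""
    T = len(a)
    L = [[0.0] * T for _ in range(T)]
    for j in range(T):
        prod = 1.0
        for i in range(j, T):
            if i > j:
                prod *= a[i]
            L[i][j] = prod
    return L


def ssm_dual(a, B, C, x):
    """Quadratic form: Y = (L ∘ Q K^T) V with Q = C, K = B, V = x, no softmax."""
    T = len(a)
    L = decay_mask(a)
    y = []
    for i in range(T):
        acc = 0.0
        for j in range(i + 1):
            score = sum(q * k for q, k in zip(C[i], B[j]))
            acc += L[i][j] * score * x[j]
        y.append(acc)
    return y


random.seed(0)
T, N = 8, 4  # sequence length and state dimension (toy sizes)
a = [random.uniform(0.0, 1.0) for _ in range(T)]
B = [[random.gauss(0.0, 1.0) for _ in range(N)] for _ in range(T)]
C = [[random.gauss(0.0, 1.0) for _ in range(N)] for _ in range(T)]
x = [random.gauss(0.0, 1.0) for _ in range(T)]

y_rec = ssm_recurrent(a, B, C, x)
y_dual = ssm_dual(a, B, C, x)
max_err = max(abs(p - q) for p, q in zip(y_rec, y_dual))
print(f"max |recurrent - dual| = {max_err:.2e}")
```

The two outputs should agree up to floating-point rounding. The recurrent loop does O(T) work with a fixed-size state, while the dual form materializes a T × T mask, which is the trade-off the two headings above describe.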
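
## Core contribution: the [[ssd-algorithm|SSD algorithm]]

A **block decomposition** of the semiseparable matrix gives the best trade-off between the two forms:
- **Within blocks**: matrix multiplications (optimized for GPU Tensor Cores)
- **Across blocks**: a recurrence that passes state between blocks (preserving linear complexity)

| Metric | vs. Mamba | vs. FlashAttention-2 |
|------|:--:|:--:|
| Speed | **2-8x** faster | **6x** faster at 16K sequence length |
| State size | supports **8x** larger | n/a |
| Crossover point | n/a | 2K sequence length |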
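
The block decomposition described in this section can be illustrated on the same kind of toy scalar-decay SSM. This is a hedged sketch, not the paper's kernel: it assumes head dimension P = 1, uses plain Python loops where a real implementation would use batched matrix multiplications, and the names `ssd_chunked`, `prod_range` and the `chunk` parameter are hypothetical.

```python
import random


def ssm_recurrent(a, B, C, x):
    """Reference: h_t = a_t * h_{t-1} + B_t * x_t, y_t = <C_t, h_t>."""
    h = [0.0] * len(B[0])
    y = []
    for a_t, B_t, C_t, x_t in zip(a, B, C, x):
        h = [a_t * h_k + b_k * x_t for h_k, b_k in zip(h, B_t)]
        y.append(sum(c_k * h_k for c_k, h_k in zip(C_t, h)))
    return y


def prod_range(a, lo, hi):
    """Product a[lo] * ... * a[hi - 1]; the empty product is 1."""
    p = 1.0
    for k in range(lo, hi):
        p *= a[k]
    return p


def ssd_chunked(a, B, C, x, chunk=4):
    """Block-decomposed evaluation: quadratic inside a chunk, recurrent across chunks."""
    h = [0.0] * len(B[0])  # state at the end of the previous chunk
    y = []
    for s in range(0, len(a), chunk):
        e = min(s + chunk, len(a))
        for i in range(s, e):
            # Across blocks: the incoming state, decayed by a_s * ... * a_i.
            inter = prod_range(a, s, i + 1) * sum(c * h_k for c, h_k in zip(C[i], h))
            # Within the block: masked attention over positions s..i only.
            intra = sum(
                prod_range(a, j + 1, i + 1)
                * sum(q * k for q, k in zip(C[i], B[j]))
                * x[j]
                for j in range(s, i + 1)
            )
            y.append(inter + intra)
        # State passing: decay the old state over the whole chunk, then add each
        # input of this chunk, decayed from its position to the chunk end.
        new_h = [prod_range(a, s, e) * h_k for h_k in h]
        for j in range(s, e):
            w = prod_range(a, j + 1, e) * x[j]
            new_h = [h_k + w * b_k for h_k, b_k in zip(new_h, B[j])]
        h = new_h
    return y


random.seed(0)
T, N = 10, 3  # T is deliberately not a multiple of the chunk size
a = [random.uniform(0.0, 1.0) for _ in range(T)]
B = [[random.gauss(0.0, 1.0) for _ in range(N)] for _ in range(T)]
C = [[random.gauss(0.0, 1.0) for _ in range(N)] for _ in range(T)]
x = [random.gauss(0.0, 1.0) for _ in range(T)]

y_ref = ssm_recurrent(a, B, C, x)
y_chunked = ssd_chunked(a, B, C, x, chunk=4)
max_err = max(abs(p - q) for p, q in zip(y_ref, y_chunked))
print(f"max |recurrent - chunked| = {max_err:.2e}")
```

With `chunk=1` the sketch reduces to the pure recurrence, and with `chunk=T` it reduces to the pure quadratic form; intermediate chunk sizes are where the within-block work can be mapped onto matrix multiplications while the number of sequential steps stays proportional to T / chunk.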
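
## [[mamba-2|Mamba-2 architecture]]

A new architecture designed around SSD principles:
- [[head-structure-ssm|GVA head structure]]: grouped-value attention, sitting between MHA and MQA
- **Native tensor parallelism support**: half as many synchronization points
- **Variable-length sequence training**: no padding required
- **Chinchilla scaling**: at 2.7B parameters, outperforms Pythia-2.8B and Pythia-6.9B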

## Concept network

```
state-space-models ──→ selective-state-space-models ──→ mamba-ssm
        ↓                          ↓                        ↓
semiseparable-matrices ←── structured-state-space-duality ──→ mamba-2
        ↓                          ↓                        ↓
structured-masked-attention   tensor-contraction-duality   ssd-algorithm
        ↓                          ↓                        ↓
linear-attention              matrix-transformation        head-structure-ssm
                                                           (GVA/MIS/MVA)
```

## Impact

This is a **milestone work** (ICML 2024) connecting the two major paradigms of SSMs and attention. It not only unifies the two in theory but also shows a direct path from theory to engineering: the SSD algorithm lets SSMs use the hardware optimizations accumulated in the Transformer ecosystem (Tensor Cores, TP, FlashAttention-style patterns), which is what gives Mamba-2 its 2-8x speedup.

## Sources

[arXiv:2405.21060](https://arxiv.org/abs/2405.21060) | [Code: state-spaces/mamba](https://github.com/state-spaces/mamba) | [Raw archive](raw/papers/dao-transformers-are-ssms-2024.md)