20260625: lots of new content

2026-06-25 14:08:47 +08:00
parent 91fac5b6fc
commit 6021dea160
375 changed files with 19263 additions and 251 deletions
--- a/concepts/structured-masked-attention.md
+++ b/concepts/structured-masked-attention.md
@@ -0,0 +1,49 @@
+---
+title: "Structured Masked Attention (SMA)"
+created: 2026-06-18
+updated: 2026-06-18
+type: concept
+tags: [attention, ssm, linear-attention, mask]
+sources:
+  - dao-transformers-are-ssms-2024
+---
+
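+# Structured Masked Attention (SMA)
+
+SMA is Dao & Gu's (2024) generalization of [[linear-attention|linear attention]]: it applies a **data-dependent structured mask L** to the attention matrix in place of the plain causal mask.
+
+## Formal Definition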
+
+```
+Y = (L ○ QK^T) · V
+```
+
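+Here ○ is the elementwise (Hadamard) product and L is a lower-triangular matrix such that:
+- L is parameterized by data-dependent scalars a_t ∈ [0,1]
+- L_ij = a_i × a_{i-1} × ... × a_{j+1} for i ≥ j (the empty product gives L_ii = 1), and L_ij = 0 for i < j
+- a_t controls how information decays or is retained along the time dimension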
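
The definition above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the paper's implementation: the function names `build_mask` and `sma` are hypothetical, indices are 0-based (so `a[0]` never enters L), and the quadratic form is computed naively with nested lists.

```python
def build_mask(a):
    """Return the T x T mask L with L[i][j] = a[i] * a[i-1] * ... * a[j+1] for i >= j."""
    T = len(a)
    L = [[0.0] * T for _ in range(T)]  # entries above the diagonal stay 0
    for j in range(T):
        prod = 1.0  # empty product on the diagonal (i == j)
        for i in range(j, T):
            if i > j:
                prod *= a[i]  # extend the product by one more decay factor
            L[i][j] = prod
    return L


def sma(Q, K, V, a):
    """Quadratic form Y = (L ○ QK^T) V for list-of-lists Q, K, V."""
    T = len(a)
    d_v = len(V[0])
    L = build_mask(a)
    Y = [[0.0] * d_v for _ in range(T)]
    for i in range(T):
        for j in range(i + 1):  # L is lower triangular, so only j <= i contributes
            score = L[i][j] * sum(q * k for q, k in zip(Q[i], K[j]))
            for c in range(d_v):
                Y[i][c] += score * V[j][c]
    return Y
```

With every a_t equal to 1, `build_mask` returns the all-ones lower-triangular matrix, so `sma` reduces to causal linear attention.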
+
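+## Differences from Softmax Attention
+
+| | Softmax Attention | SMA (SSD dual form) |
+|---|---|---|
+| Attention weights | Softmax(QK^T) | L ○ QK^T |
+| Positional information | Positional encoding (heuristic) | Data-dependent decay mask L |
+| Complexity | O(T²) | O(T²) (but convertible to an O(T) SSM recurrence) |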
+
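+## Why It Matters
+
+1. **No softmax**: without the softmax's sum-to-one normalization, SMA may avoid the "attention sink" phenomenon
+2. **Data-dependent positional mask**: L replaces heuristic positional encodings. Intuitively, a_t is close to 0 (reset) where the input is information-dense and close to 1 (retain) where it is steady
+3. **Duality**: the SMA ⇔ SSM duality means that SMA with this mask also has a fast recurrent algorithm that is O(T) in sequence length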
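
The duality in point 3 can be checked numerically. The sketch below assumes the recurrence h_t = a_t · h_{t-1} + k_t v_tᵀ with readout y_t = q_tᵀ h_t, and compares it with the quadratic form (L ○ QK^T) V. The function names `sma_quadratic` and `sma_recurrent` are hypothetical, and the code uses plain lists for clarity rather than speed.

```python
def sma_quadratic(Q, K, V, a):
    """O(T^2) form: Y[i] = sum over j <= i of L[i][j] * (Q[i] . K[j]) * V[j]."""
    T, d_v = len(a), len(V[0])
    Y = [[0.0] * d_v for _ in range(T)]
    for i in range(T):
        for j in range(i + 1):
            decay = 1.0  # L[i][j] = a[j+1] * ... * a[i], empty product when i == j
            for t in range(j + 1, i + 1):
                decay *= a[t]
            score = decay * sum(q * k for q, k in zip(Q[i], K[j]))
            for c in range(d_v):
                Y[i][c] += score * V[j][c]
    return Y


def sma_recurrent(Q, K, V, a):
    """Linear-in-T form: carry a d_k x d_v state h and update it once per step."""
    T, d_k, d_v = len(a), len(K[0]), len(V[0])
    h = [[0.0] * d_v for _ in range(d_k)]
    Y = []
    for t in range(T):
        for r in range(d_k):
            for c in range(d_v):
                # decay the old state, then add the outer product k_t v_t^T
                h[r][c] = a[t] * h[r][c] + K[t][r] * V[t][c]
        # read out y_t = q_t^T h_t
        Y.append([sum(Q[t][r] * h[r][c] for r in range(d_k)) for c in range(d_v)])
    return Y
```

Both functions return the same Y up to floating-point error. The recurrent version performs T state updates of fixed cost, which is where the O(T) figure comes from.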
+
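+## Fast Recurrence Implies an SSM
+
+Dao & Gu prove that any kernel attention method with a fast recurrent form **must be** an SSM. SMA is the broadest framework connecting the two.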
+
+## References
+
+- [[linear-attention|Linear attention]]
+- [[structured-state-space-duality|SSD]]
+- [[semiseparable-matrices|Semiseparable matrices]]
+- [[dao-transformers-are-ssms-2024|Paper]]