20260625: lots of new content
57
concepts/wkv-time-mixing.md
Normal file
@@ -0,0 +1,57 @@
---
title: "WKV Time Mixing"
created: 2026-06-18
updated: 2026-06-18
type: concept
tags: ["rwkv", "attention", "linear-complexity", "time-mixing"]
sources: ["https://arxiv.org/abs/2503.14456"]
---

# WKV Time Mixing

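## Definition

WKV (Weighted Key Value) Time Mixing is the core time-mixing operator of the RWKV architecture and can be viewed as an RNN formulation of linear attention. It blends historical information with the current token through a weighted update, and it is how RWKV replaces the O(n²) cost of standard attention with a cost that is O(n) in sequence length.
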
## Core Form

The general pattern of WKV:

```
w_t = f_w(x_t)           # input-dependent decay weights
k_t = W_k · x_t          # key projection
v_t = W_v · x_t          # value projection
r_t = W_r · x_t          # receptance (gate)
state_t = w_t ⊙ state_{t-1} + v_t^T · k_t
y_t = r_t · state_t^T    # read the state out with the receptance
output = W_o · y_t       # output projection
```

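The recurrence above can be sketched in plain Python to make the shapes concrete. This is a minimal illustration rather than RWKV's actual kernel: it assumes the inputs are already projected (the `W_k`, `W_v`, `W_r`, and `W_o` matrices are omitted), that the decay acts on each key channel as `state · diag(w)`, and that the readout is `y_t = r_t · state_t^T`. The names `wkv_step` and `run_wkv` are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def wkv_step(state: Matrix, w: Vector, k: Vector, v: Vector,
             r: Vector) -> Tuple[Matrix, Vector]:
    """One recurrent WKV step on a (d_v x d_k) state.

    state_t = state_{t-1} . diag(w) + v^T k   (decay each key channel, add outer product)
    y_t     = r . state_t^T                   (assumed readout, see lead-in)
    """
    d_v, d_k = len(v), len(k)
    new_state = [[state[i][j] * w[j] + v[i] * k[j] for j in range(d_k)]
                 for i in range(d_v)]
    y = [sum(new_state[i][j] * r[j] for j in range(d_k)) for i in range(d_v)]
    return new_state, y


def run_wkv(steps: List[Tuple[Vector, Vector, Vector, Vector]],
            d_v: int, d_k: int) -> Tuple[Matrix, List[Vector]]:
    """Scan over (w, k, v, r) tuples; the state never grows with sequence length."""
    state = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for w, k, v, r in steps:
        state, y = wkv_step(state, w, k, v, r)
        outputs.append(y)
    return state, outputs


if __name__ == "__main__":
    toy = [
        ([0.5, 0.5], [1.0, 0.0], [2.0, 3.0], [1.0, 0.0]),
        ([0.5, 0.5], [0.0, 1.0], [1.0, 1.0], [1.0, 1.0]),
    ]
    final_state, ys = run_wkv(toy, d_v=2, d_k=2)
    print(final_state)  # stays 2 x 2 however many steps are scanned
    print(ys)
```

Each step costs a fixed amount of work on a fixed-size state, which is where the O(n) total cost comes from.
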
## Evolution from RWKV-4 to RWKV-7

| Version | WKV form | State |
|---------|----------|-------|
| RWKV-4 | `state_t = e^{-w} · state_{t-1} + e^{k_t} · v_t` | Vector |
| RWKV-5/6 | `S_t = S_{t-1} · diag(w_t) + v_t^T · k_t` | Matrix |
| **RWKV-7** | `S_t = S_{t-1} · (diag(w_t) - κ̂_t^T · (a_t ⊙ κ̂_t)) + v_t^T · k_t` | Matrix + delta rule |

Key trend: WKV moves from simple exponential decay (RWKV-4), to per-channel decay on a matrix-valued state (RWKV-5/6, input-dependent from RWKV-6 onward), to **gradient-descent-style selective updates (RWKV-7)**.

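To see what the extra `κ̂` term in the RWKV-7 row buys, the sketch below implements that single update on a toy 2 x 2 state. It is an illustration under simplifying assumptions, not the paper's implementation: the decay is fixed at `w = 1`, the write key `k` is set equal to the unit-norm removal key `κ̂`, and the readout `S · r^T` is assumed. The function names are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def rwkv7_step(S: Matrix, w: Vector, kappa_hat: Vector, a: Vector,
               k: Vector, v: Vector) -> Matrix:
    """S_t = S_{t-1} . (diag(w) - kappa_hat^T (a * kappa_hat)) + v^T k, vectors as rows."""
    n = len(k)
    # Transition matrix: diagonal decay minus a rank-1 "erase" term.
    G = [[(w[i] if i == j else 0.0) - kappa_hat[i] * a[j] * kappa_hat[j]
          for j in range(n)] for i in range(n)]
    SG = [[sum(row[m] * G[m][j] for m in range(n)) for j in range(n)]
          for row in S]
    return [[SG[i][j] + v[i] * k[j] for j in range(n)] for i in range(len(v))]


def read_state(S: Matrix, r: Vector) -> Vector:
    """Assumed readout: y = r . S^T."""
    return [sum(row[j] * r[j] for j in range(len(r))) for row in S]


def write_twice(a: Vector) -> Vector:
    """Write two different values under the same key, then read that key back."""
    kappa = [0.6, 0.8]          # unit-norm key (0.36 + 0.64 = 1)
    ones = [1.0, 1.0]           # no decay, to isolate the delta term
    S = [[0.0, 0.0], [0.0, 0.0]]
    S = rwkv7_step(S, ones, kappa, a, kappa, [1.0, 2.0])
    S = rwkv7_step(S, ones, kappa, a, kappa, [5.0, -3.0])
    return read_state(S, kappa)


if __name__ == "__main__":
    print(write_twice([1.0, 1.0]))  # delta rule on: the old value is replaced
    print(write_twice([0.0, 0.0]))  # a = 0: reduces to the RWKV-5/6 form, values add up
```

With `a = 1` the second write overwrites the first, so reading the key returns approximately `[5, -3]`. With `a = 0` the update collapses to the RWKV-5/6 row and the two values accumulate to approximately `[6, -1]`.
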
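## Relationship to Attention

| Operator | Mechanism | Complexity | State |
|----------|-----------|------------|-------|
| Softmax Attention | All-pairs QK^T interaction | O(n²) | KV cache grows linearly with sequence length |
| WKV (RWKV) | Recurrent weighted accumulation | O(n) | Fixed-size state |

WKV can be understood as replacing attention's "query every past token" with "compress the history into a state, then query the state".
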
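The two views can be compared directly in the simplest setting. The sketch below assumes unnormalised linear attention (no softmax) and no decay, which is the only case where the two computations agree exactly; softmax attention itself cannot be folded into a fixed-size state this way. Both function names are hypothetical.

```python
from typing import List

Vector = List[float]


def attend_over_history(keys: List[Vector], values: List[Vector],
                        q: Vector) -> Vector:
    """Query every stored token: sum_i (q . k_i) * v_i. Needs the full KV history."""
    out = [0.0] * len(values[0])
    for k_i, v_i in zip(keys, values):
        score = sum(qj * kj for qj, kj in zip(q, k_i))
        for c in range(len(out)):
            out[c] += score * v_i[c]
    return out


def attend_via_state(keys: List[Vector], values: List[Vector],
                     q: Vector) -> Vector:
    """Fold the history into a fixed (d_v x d_k) state, then query the state once."""
    d_v, d_k = len(values[0]), len(keys[0])
    state = [[0.0] * d_k for _ in range(d_v)]
    for k_i, v_i in zip(keys, values):
        for i in range(d_v):
            for j in range(d_k):
                state[i][j] += v_i[i] * k_i[j]   # accumulate v^T k
    return [sum(state[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    q = [2.0, 1.0]
    print(attend_over_history(keys, values, q))
    print(attend_via_state(keys, values, q))
```

Both calls return the same vector, but the first must keep all keys and values around while the second only carries the `d_v x d_k` state forward.
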
## Related Concepts

- [[token-shift]]: injects local temporal context into WKV
- [[rwkv]]: the architecture family that WKV belongs to
- [[linear-attention-methods]]: other approaches to linear attention
- [[generalized-delta-rule]]: the WKV upgrade introduced in RWKV-7
- [[peng-rwkv7|RWKV-7 paper]]

## References

- [[peng-rwkv7|RWKV-7 "Goose"]] (Peng et al., 2025)