20260625: lots of new content
39
concepts/constant-kv-cache.md
Normal file
@@ -0,0 +1,39 @@
---
title: "Constant KV Cache"
created: 2026-06-24
updated: 2026-06-24
type: concept
tags: ["kv-cache", "efficient-inference", "attention-mechanism"]
sources:
- "[[unlimited-ocr-works-2026]]"
---

# Constant KV Cache

Constant KV Cache is the core property of the R-SWA attention mechanism: throughout decoding, the KV cache size stays bounded by the constant $L_m + n$ and does not grow with the output length $T$.

## Definition

$$C_{R\text{-}SWA}(T) = L_m + \min(n, T) \leq L_m + n$$

where $L_m$ is the number of prefix tokens (fixed), $n$ is the sliding-window width (default 128), and $T$ is the number of tokens decoded so far.
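
The bound can be checked with a small sketch. Only the closed form $L_m + \min(n, T)$ comes from the definition above. The names `rswa_cache_size` and `simulate_cache`, the deque-based eviction, and the prefix length of 1000 are illustrative assumptions, not the implementation described in the source.

```python
from collections import deque


def rswa_cache_size(prefix_len: int, window: int, decoded: int) -> int:
    """Closed form from the definition: L_m + min(n, T)."""
    return prefix_len + min(window, decoded)


def simulate_cache(prefix_len: int, window: int, steps: int) -> list[int]:
    """Toy simulation (hypothetical): all prefix entries are kept, and decoded
    tokens live in a fixed-width sliding window."""
    prefix = list(range(prefix_len))   # prefix entries are never evicted
    recent = deque(maxlen=window)      # the oldest decoded entry drops out when full
    sizes = []
    for t in range(1, steps + 1):
        recent.append(prefix_len + t)  # stand-in for the new token's K/V pair
        sizes.append(len(prefix) + len(recent))
    return sizes


sizes = simulate_cache(prefix_len=1000, window=128, steps=500)
# The simulated size matches the closed form at every decoding step.
assert all(s == rswa_cache_size(1000, 128, t) for t, s in enumerate(sizes, start=1))
print(sizes[0], sizes[127], sizes[-1])  # 1001 1128 1128
```

The cache grows for the first $n$ decoded tokens and then stays at $L_m + n$ for the rest of decoding.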

## Comparison with standard MHA

| Mechanism | KV cache growth | As T → ∞ |
|-----------|-----------------|----------|
| MHA | O(T), linear | ∞ |
| R-SWA | O(1), constant | $L_m + n$ |

Cache compression ratio (for $T \geq n$): $\rho(T) = \frac{L_m + n}{L_m + T} \to 0$ as $T \to \infty$
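
To give a feel for how quickly the ratio falls, here is a minimal sketch. The function name `compression_ratio` and the values $L_m = 1000$, $n = 128$ are assumptions chosen for illustration. The numerator uses $L_m + \min(n, T)$ from the definition, which coincides with the formula above once $T \geq n$.

```python
def compression_ratio(prefix_len: int, window: int, decoded: int) -> float:
    """R-SWA cache size divided by the full (MHA) cache size at the same step."""
    rswa = prefix_len + min(window, decoded)  # bounded cache: L_m + min(n, T)
    full = prefix_len + decoded               # full cache: L_m + T
    return rswa / full


# Hypothetical setting: 1000 prefix tokens, window width 128.
for T in (128, 1_000, 10_000, 100_000):
    print(T, round(compression_ratio(1000, 128, T), 4))
```

With these assumed values the ratio is 1.0 at $T = 128$, 0.564 at $T = 1000$, and keeps shrinking as $T$ grows.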

## Engineering significance

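- GPU memory usage stays constant and does not grow with the output length
- Inference speed (TPS) stays constant (Flash Attention v3 kernel latency is stable)
- Makes it possible to parse dozens of pages in a single forward pass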
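
The memory point can be made concrete with the standard KV cache size estimate (two tensors, K and V, per layer). This is a rough sketch under stated assumptions: the model shape (24 layers, 8 KV heads, head dimension 128, 2-byte elements) and $L_m = 1000$ are hypothetical and not taken from the source.

```python
def kv_cache_bytes(tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """K and V tensors per layer, each of shape tokens x num_kv_heads x head_dim."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * tokens


# Hypothetical model shape and cache parameters (not from the source).
LAYERS, KV_HEADS, HEAD_DIM = 24, 8, 128
L_M, N = 1000, 128

for T in (1_000, 10_000, 100_000):
    full = kv_cache_bytes(L_M + T, LAYERS, KV_HEADS, HEAD_DIM)
    rswa = kv_cache_bytes(L_M + min(N, T), LAYERS, KV_HEADS, HEAD_DIM)
    print(f"T={T:>7}: full={full / 2**20:9.1f} MiB, R-SWA={rswa / 2**20:6.1f} MiB")
```

Under these assumptions the full cache grows linearly with $T$, while the R-SWA figure is the same on every row.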

## References

- [[unlimited-ocr-works-2026]]
- [[reference-sliding-window-attention]]
- [[kv-cache]]