KV Cache

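Definition

KV Cache (Key-Value Cache) stores the Key and Value vectors of previously processed tokens while a Transformer decoder generates autoregressively, so that each decoding step does not have to recompute the keys and values of every earlier token. The cache's memory footprint grows linearly with sequence length, which makes it a core bottleneck for long-context inference with large models.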

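Core Mechanism

In causal attention, the query q_k at decoding step k only attends to the keys and values at positions j <= k. Once a pair (k_j, v_j) has been computed it is stored in the cache, and later steps read it directly. This avoids recomputing the O(n) key/value projections of the history at every step.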
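
The mechanism above can be sketched in a few lines. This is a minimal single-head, single-layer illustration in pure Python, not an excerpt from any library: the names `KVCache`, `project`, `attend`, `decode_with_cache`, and `decode_without_cache` are assumptions made for this example, and the projection matrices are toy values. It shows that appending one (k_j, v_j) pair per step yields the same outputs as recomputing every key and value from scratch.

```python
import math


def project(x, w):
    """Toy linear projection y = W x, with W given as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over the given keys/values."""
    scale = math.sqrt(len(q))
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


class KVCache:
    """Append-only store of per-position key and value vectors."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)


def decode_with_cache(xs, wq, wk, wv):
    """Each step projects only the newest token and reads the rest from cache."""
    cache = KVCache()
    outputs = []
    for x in xs:
        cache.append(project(x, wk), project(x, wv))
        outputs.append(attend(project(x, wq), cache.keys, cache.values))
    return outputs


def decode_without_cache(xs, wq, wk, wv):
    """Baseline: every step re-projects all positions j <= t."""
    outputs = []
    for t in range(len(xs)):
        keys = [project(x, wk) for x in xs[: t + 1]]
        values = [project(x, wv) for x in xs[: t + 1]]
        outputs.append(attend(project(xs[t], wq), keys, values))
    return outputs


# Toy inputs: three 2-dimensional token embeddings and 2x2 projections.
xs = [[1.0, 0.0], [0.5, -1.0], [2.0, 0.3]]
wq = [[1.0, 0.0], [0.0, 1.0]]
wk = [[0.5, 0.1], [-0.2, 0.7]]
wv = [[1.0, 2.0], [0.0, -1.0]]

cached = decode_with_cache(xs, wq, wk, wv)
baseline = decode_without_cache(xs, wq, wk, wv)
```

Both functions produce the same outputs; the cached version performs one key projection and one value projection per step, while the baseline performs t + 1 of each at step t.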

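Memory Analysis

For a model with L layers, H key/value heads per layer, and head dimension d_h, the KV Cache for a single sequence of length T occupies:

Memory = 2 * L * H * d_h * T * sizeof(dtype)

The factor 2 accounts for storing both keys and values. For Llama-3.1-8B (L=32, d_h=128, FP16), plugging in H=32 (one KV head per query head) gives about 64 GiB at 128K tokens. Llama-3.1-8B uses grouped-query attention with 8 KV heads, so with H=8 its actual cache at 128K tokens is about 16 GiB.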
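
The formula above is easy to check numerically. The sketch below assumes a per-sequence cache (batch size 1), 2 bytes per element for FP16, and 128K = 131072 tokens; the function name `kv_cache_bytes` is made up for this example. It evaluates the formula both with 32 KV heads (one per query head) and with the 8 KV heads that Llama-3.1-8B uses under grouped-query attention.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """KV cache size in bytes: keys plus values, for every layer and KV head."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)


GIB = 1024 ** 3
TOKENS_128K = 128 * 1024

# One KV head per query head (multi-head attention assumption).
mha_gib = kv_cache_bytes(32, 32, 128, TOKENS_128K) / GIB

# 8 KV heads shared across 32 query heads (grouped-query attention).
gqa_gib = kv_cache_bytes(32, 8, 128, TOKENS_128K) / GIB

print(f"H=32: {mha_gib:.0f} GiB, H=8: {gqa_gib:.0f} GiB")  # H=32: 64 GiB, H=8: 16 GiB
```

Because the size is linear in every factor, halving the element width (for example an 8-bit cache) or the number of KV heads halves the footprint.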

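Thinker-Performer KV-Cache Exchange

The thinker-performer-pipeline in wan-streamer uses the KV-cache as the state exchange protocol between the Thinker and the Performer: at each step the Thinker computes the current KV-cache slice and sends it to the Performer, which appends the slice to its own full-history cache and runs the flow-matching solve. This design lets perception updates and latent generation overlap in a pipeline across different GPUs while maintaining a single, consistent causal interaction state.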
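
The exchange pattern described above can be illustrated with a producer/consumer sketch. Everything here is hypothetical: `KVSlice`, `Performer`, `thinker`, and `run_pipeline` are names invented for this example and are not taken from wan-streamer, the two GPUs are stood in for by two threads joined by a bounded queue, and the flow-matching solve is replaced by a placeholder that only records how much history it would see.

```python
import queue
import threading
from dataclasses import dataclass, field


@dataclass
class KVSlice:
    """Keys and values produced for the tokens of one step (hypothetical)."""
    step: int
    keys: list
    values: list


@dataclass
class Performer:
    """Holds the full-history cache and consumes one slice per step."""
    keys: list = field(default_factory=list)
    values: list = field(default_factory=list)
    history_lengths: list = field(default_factory=list)

    def consume(self, kv):
        # Append the incoming slice to the full-history cache.
        self.keys.extend(kv.keys)
        self.values.extend(kv.values)
        # Placeholder for the flow-matching solve, which would condition
        # on the whole cache; here we only record the history length.
        self.history_lengths.append(len(self.keys))


def thinker(num_steps, tokens_per_step, channel):
    """Emits one KV slice per step, then a None sentinel to signal the end."""
    for step in range(num_steps):
        keys = [(step, i, "k") for i in range(tokens_per_step)]
        values = [(step, i, "v") for i in range(tokens_per_step)]
        channel.put(KVSlice(step, keys, values))
    channel.put(None)


def run_pipeline(num_steps=4, tokens_per_step=3):
    # A bounded queue lets the producer run ahead by at most two slices,
    # so both sides can work concurrently without unbounded buffering.
    channel = queue.Queue(maxsize=2)
    performer = Performer()
    producer = threading.Thread(
        target=thinker, args=(num_steps, tokens_per_step, channel))
    producer.start()
    while (kv := channel.get()) is not None:
        performer.consume(kv)
    producer.join()
    return performer


result = run_pipeline()
```

Only the per-step slice crosses the channel; the consumer's cache grows by `tokens_per_step` entries each step and always contains the slices in causal order.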

References

Original attention paper: Vaswani et al., 2017
StreamingLLM (Xiao et al., 2024): identified the attention sink phenomenon
tang-lukv: KV Cache eviction based on global combinatorial optimization
