SidneyZhang/myWiki

Sidney Zhang dd8345a6ea

20260420:first commit

2026-04-20 11:42:41 +08:00

title: KVCache Transfer and Optimization
created: 2026-04-19
updated: 2026-04-19
type: concept
tags: inference, system-design, performance
sources: raw/papers/qin-prfaas-cross-datacenter-2026.md

KVCache Transfer and Optimization

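Definition

KVCache is the Key-Value state cached during LLM inference so that the attention keys and values of earlier tokens are not recomputed. KVCache transfer is the process, in a disaggregated inference architecture, of moving the KVCache produced during the prefill stage to the decode nodes.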

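Transfer Bottlenecks

Large volume: for dense-attention models, KVCache size grows linearly with sequence length and scales with model size (number of layers, number of KV heads, and head dimension)
Bandwidth requirements: conventional architectures rely on low-latency, high-bandwidth networks such as RDMA
Latency sensitivity: transfer latency directly affects TTFT (Time to First Token)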
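
To make the volume and latency points concrete, here is a minimal back-of-the-envelope sketch. It assumes a standard dense-attention layout (one key and one value tensor per layer) and an idealized link with no protocol overhead. The model shape and link speeds are hypothetical values chosen only for illustration; they are not taken from the source paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size in bytes of the KVCache for one sequence in a dense-attention model."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def transfer_seconds(num_bytes, bandwidth_gbps):
    """Idealized transfer time over a link of `bandwidth_gbps` gigabits per second."""
    return num_bytes * 8 / (bandwidth_gbps * 1e9)


# Hypothetical model shape (assumption, for illustration only), FP16 elements.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32_768)
print(f"KVCache size: {size / 2**30:.2f} GiB")

# Compare an RDMA-class link with slower, cross-datacenter-class links.
for gbps in (400, 100, 10):
    print(f"{gbps:>4} Gbit/s -> {transfer_seconds(size, gbps):.3f} s")
```

Under these assumptions, the same cache that moves in well under a second on a fast intra-cluster link takes seconds on a slower link, and that time is added directly to TTFT.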

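Optimization Directions

Model Side

Hybrid attention architectures: reduce KVCache size by using structured state-space or linear attention
KVCache compression: quantization, sparsification, or distillation techniques
Prefix cache sharing: multiple requests share the KVCache of a common prefix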
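
Prefix cache sharing in the list above can be sketched with block-level chained hashing. This is one common way to implement the idea, not necessarily the method used in the source paper. The `PrefixCache` class, the `block_keys` helper, and the block size of 4 tokens are all hypothetical, and the KV blocks are placeholder strings rather than real tensors.

```python
import hashlib

BLOCK_TOKENS = 4  # hypothetical block size; real systems typically use larger blocks


def block_keys(token_ids, block_tokens=BLOCK_TOKENS):
    """Return one chained hash per full block, so each key identifies a whole prefix."""
    keys, prev = [], b""
    full = len(token_ids) - len(token_ids) % block_tokens
    for start in range(0, full, block_tokens):
        block = token_ids[start:start + block_tokens]
        # Chaining in the previous digest makes the key depend on all earlier tokens.
        digest = hashlib.sha256(prev + repr(block).encode()).digest()
        keys.append(digest)
        prev = digest
    return keys


class PrefixCache:
    def __init__(self):
        self._blocks = {}  # chained hash -> KV block (placeholder objects here)

    def insert(self, token_ids, kv_blocks):
        """Store KV blocks for a sequence, keeping any blocks that already exist."""
        for key, kv in zip(block_keys(token_ids), kv_blocks):
            self._blocks.setdefault(key, kv)

    def match(self, token_ids):
        """Return the cached KV blocks covering the longest cached prefix."""
        hits = []
        for key in block_keys(token_ids):
            if key not in self._blocks:
                break
            hits.append(self._blocks[key])
        return hits


cache = PrefixCache()
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]
request_a = system_prompt + [9, 10, 11, 12]
request_b = system_prompt + [20, 21, 22, 23]

cache.insert(request_a, ["kv0", "kv1", "kv2"])
reused = cache.match(request_b)
print(len(reused) * BLOCK_TOKENS, "tokens reused")  # prints "8 tokens reused"
```

Only the blocks after the shared prefix need to be computed and transferred for the second request, which reduces both prefill work and transfer volume.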

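System Side

Selective transfer: transfer only the KVCache layers or tokens that are needed
Bandwidth-aware scheduling: adjust the transfer strategy dynamically according to network conditions
PrfaaS architecture: combines model efficiency with system scheduling to enable cross-datacenter transfer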
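
As a rough illustration of how selective transfer and bandwidth-aware scheduling might interact, the sketch below picks the transfer strategy with the lowest estimated latency for the currently measured bandwidth. The strategy names, transfer fractions, and compute costs are hypothetical numbers invented for this example. It is not the PrfaaS scheduler, whose details are not given here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    transfer_fraction: float  # share of the full KVCache sent over the network
    extra_compute_s: float    # added compute time (e.g. dequantize or recompute)


def estimated_latency(strategy, kv_bytes, bandwidth_gbps):
    """Idealized latency: network time for the transferred share plus extra compute."""
    transfer_s = strategy.transfer_fraction * kv_bytes * 8 / (bandwidth_gbps * 1e9)
    return transfer_s + strategy.extra_compute_s


def choose_strategy(strategies, kv_bytes, bandwidth_gbps):
    """Pick the strategy with the lowest estimated latency at this bandwidth."""
    return min(strategies, key=lambda s: estimated_latency(s, kv_bytes, bandwidth_gbps))


# Hypothetical strategies and costs (assumptions, for illustration only).
STRATEGIES = [
    Strategy("full", transfer_fraction=1.0, extra_compute_s=0.0),
    Strategy("int8-quantized", transfer_fraction=0.5, extra_compute_s=0.05),
    # Send a quarter of the cache and recompute the rest on the decode side.
    Strategy("selective", transfer_fraction=0.25, extra_compute_s=0.40),
]

kv_bytes = 4 * 2**30  # a 4 GiB KVCache
for gbps in (400, 50, 5):
    print(f"{gbps:>4} Gbit/s -> {choose_strategy(STRATEGIES, kv_bytes, gbps).name}")
```

With these made-up costs, the full transfer wins on a fast link, while compressed or selective transfer becomes preferable as bandwidth drops. A real scheduler would also need measured bandwidth, queueing effects, and accuracy constraints, which this sketch ignores.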

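Related Concepts

prefill-as-a-service: KVCache transfer in the PrfaaS architecture
prefill-decode-disaggregation: the PD (prefill-decode) disaggregated architecture
inference-optimization: inference optimization techniques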