---
title: Multi-Query Attention (MQA)
created: 2025-04-15
updated: 2026-05-01
type: concept
tags: []
sources: []
---
# Multi-Query Attention (MQA)
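**Multi-query attention**, proposed by Shazeer in 2019, has all query (Q) heads share a single key/value (KV) head.
## Definition
MQA is the most aggressive simplification of [[multi-head-attention|MHA]]: it keeps multiple Q heads to preserve expressiveness, but every head shares the same single pair of K and V projections. The KV cache shrinks to 1/h of the MHA cache, where h is the number of attention heads.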
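The sharing described above can be sketched in NumPy. This is a minimal illustration rather than a reference implementation: the function name, the weight shapes, the causal mask, and the omission of an output projection are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(x, w_q, w_k, w_v, num_heads):
    """Hypothetical single-sequence MQA forward pass (no output projection).

    x:   (seq_len, d_model)
    w_q: (d_model, num_heads * d_head)  one query projection per head
    w_k: (d_model, d_head)              a single shared key projection
    w_v: (d_model, d_head)              a single shared value projection
    """
    seq_len = x.shape[0]
    d_head = w_k.shape[1]
    # Queries keep a head axis: (num_heads, seq_len, d_head)
    q = (x @ w_q).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    # Keys and values have no head axis; these are what a KV cache would store
    k = x @ w_k  # (seq_len, d_head)
    v = x @ w_v  # (seq_len, d_head)
    # Broadcasting reuses the same k for every query head: (num_heads, seq_len, seq_len)
    scores = q @ k.T / np.sqrt(d_head)
    # Causal mask (an assumption here): position i may not attend to j > i
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = softmax(scores, axis=-1)
    out = weights @ v  # (num_heads, seq_len, d_head)
    # Concatenate heads back to (seq_len, num_heads * d_head)
    return out.transpose(1, 0, 2).reshape(seq_len, num_heads * d_head), k, v

rng = np.random.default_rng(0)
seq_len, d_model, num_heads, d_head = 5, 32, 4, 8
x = rng.standard_normal((seq_len, d_model))
w_q = rng.standard_normal((d_model, num_heads * d_head))
w_k = rng.standard_normal((d_model, d_head))
w_v = rng.standard_normal((d_model, d_head))

out, k, v = multi_query_attention(x, w_q, w_k, w_v, num_heads)
print(out.shape, k.shape, v.shape)  # (5, 32) (5, 8) (5, 8)
```

Note that `k` and `v` carry no head dimension, so the tensors that would be cached are `num_heads` times smaller than in an MHA layer with the same `d_head`.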
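## Quality trade-offs
- **Advantages**: a very small KV cache, which greatly reduces inference memory
- **Disadvantages**: reduced expressiveness and less stable training, which calls for extra tuning
- **Adoption**: PaLM uses MQA, but most later models moved to [[grouped-query-attention|GQA]]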
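To make the memory side of this trade-off concrete, here is a small back-of-the-envelope calculation. The model dimensions are hypothetical and not taken from any particular model, and the helper `kv_cache_bytes` is an illustrative name; it assumes 2-byte (fp16/bf16) cache entries.

```python
def kv_cache_bytes(num_layers, num_kv_heads, d_head, seq_len,
                   batch_size=1, bytes_per_elem=2):
    # Leading factor 2: one K tensor plus one V tensor per layer
    return (2 * num_layers * num_kv_heads * d_head
            * seq_len * batch_size * bytes_per_elem)

# Hypothetical configuration, chosen only for round numbers
num_layers, num_heads, d_head, seq_len = 32, 32, 128, 4096

mha = kv_cache_bytes(num_layers, num_heads, d_head, seq_len)  # h KV heads
gqa = kv_cache_bytes(num_layers, 8, d_head, seq_len)          # 8 KV groups (assumed)
mqa = kv_cache_bytes(num_layers, 1, d_head, seq_len)          # 1 KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**20:.0f} MiB")
# MHA: 2048 MiB
# GQA: 512 MiB
# MQA: 64 MiB
```

With h = 32 heads, the MQA cache is 1/32 of the MHA cache, and GQA sits between the two depending on the number of KV groups.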
## Related concepts
- [[multi-head-attention]]: the MHA baseline
- [[grouped-query-attention]]: GQA, the compromise between MHA and MQA
- [[kv-cache-bottleneck]]: the KV cache bottleneck
- [[llm-attention-survey-2026]]: survey reference