Grouped-Query Attention (GQA)

Number of groups G: G = h → MHA; G = 1 → MQA; 1 < G < h → GQA
Cache reduction: the KV cache shrinks to G/h of the MHA size; for example, G = 8 with h = 64 query heads cuts the cache by 87.5%
Quality: with G = 4 to 8, quality stays close to MHA
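
The G/h ratio can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names, the 2-byte element size (fp16/bf16), and the example model shape are assumptions, not values from any specific model.

```python
def kv_cache_bytes(num_kv_heads: int, head_dim: int, seq_len: int,
                   num_layers: int, batch: int = 1,
                   bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size in bytes (hypothetical helper).

    The leading factor 2 counts one K tensor and one V tensor per layer.
    bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return (2 * batch * num_layers * seq_len
            * num_kv_heads * head_dim * bytes_per_elem)


def kv_cache_reduction(h: int, G: int) -> float:
    """Fraction of the MHA cache saved when h query heads share G KV heads."""
    return 1 - G / h


# Assumed example shape: 64 query heads, head_dim 128, 80 layers, 4096 tokens.
mha = kv_cache_bytes(num_kv_heads=64, head_dim=128, seq_len=4096, num_layers=80)
gqa = kv_cache_bytes(num_kv_heads=8, head_dim=128, seq_len=4096, num_layers=80)

print(gqa / mha)                   # ratio G/h
print(kv_cache_reduction(64, 8))   # saving for G = 8, h = 64
print(kv_cache_reduction(32, 8))   # same G, fewer query heads: smaller saving
```

The saving depends on both G and h: the same G = 8 saves less when the model has only 32 query heads.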

Grouped-query attention is a compromise between MHA and MQA, proposed by Ainslie et al. in 2023.
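
To make the grouping concrete, here is a minimal sketch of how h query heads might be assigned to G shared key/value heads. The contiguous assignment (consecutive query heads share one KV head) and the function names are assumptions for illustration; an implementation is free to group heads differently.

```python
def kv_head_for_query_head(q_head: int, h: int, G: int) -> int:
    """Index of the KV head used by query head q_head (hypothetical helper).

    Assumes h is divisible by G and that each group holds h // G
    consecutive query heads.
    """
    if h % G != 0:
        raise ValueError("h must be divisible by G")
    return q_head // (h // G)


def grouping(h: int, G: int) -> list[int]:
    """KV head index for every query head 0..h-1."""
    return [kv_head_for_query_head(i, h, G) for i in range(h)]


print(grouping(8, 8))  # G = h: every query head has its own KV head (MHA)
print(grouping(8, 1))  # G = 1: all query heads share one KV head (MQA)
print(grouping(8, 2))  # 1 < G < h: two groups of four query heads (GQA)
```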