20260514: Add new content

2026-05-14 13:54:52 +08:00
parent 56c4d3ef7c
commit b116710e4c
294 changed files with 10682 additions and 255 deletions

@@ -0,0 +1,33 @@
---
arxiv: "2309.17453"
title: "Efficient Streaming Language Models with Attention Sinks"
authors: ["Guangxuan Xiao", "Yuandong Tian", "Beidi Chen", "Song Han", "Mike Lewis"]
venue: "ICLR 2024"
affiliations: ["MIT", "Meta AI", "CMU", "NVIDIA"]
year: 2024
url: "https://arxiv.org/abs/2309.17453"
code: "https://github.com/mit-han-lab/streaming-llm"
type: paper
---
# Efficient Streaming Language Models with Attention Sinks
## Abstract
Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length. Window attention, where only the most recent KVs are cached, is a natural approach — but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely **attention sink**, that keeping the KV of initial tokens will largely recover the performance of window attention. We first demonstrate that the emergence of attention sink is due to the strong attention scores towards initial tokens as a "sink" even if they are not semantically important. Based on the above analysis, we introduce **StreamingLLM**, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence length without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2× speedup.
## Key Contributions
1. **Attention Sink Discovery**: Initial tokens receive disproportionately high attention scores across nearly all layers and heads (in the paper's Llama-2-7B attention maps, only the bottom two layers attend mostly to recent tokens). The cause is their absolute position rather than their semantics: they serve as "sinks" for the excess attention that the SoftMax function forces to be allocated somewhere.
2. **StreamingLLM Framework**: A simple, training-free method that keeps attention sink tokens' KV (just 4 initial tokens suffice) together with a sliding window of recent tokens, enabling infinite-length streaming inference.
3. **Sink Token Pre-training**: Demonstrates that pre-training with a dedicated learnable sink token allows models to use a single token as the attention sink, eliminating the need for multiple initial tokens.
4. **Broad Validation**: Tested on Llama-2 (7B/13B/70B), MPT (7B/30B), Falcon (7B/40B), and Pythia (2.9B/6.9B/12B), covering both RoPE and ALiBi position encodings, with stable perplexity on texts of up to 4M tokens.
## Core Mechanism
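The SoftMax function in attention forces each query's attention weights over the visible tokens to sum to 1. When the current query has no strong semantic match among the preceding tokens, the model must still place the leftover attention mass somewhere. Because autoregressive masking makes the initial tokens visible to every subsequent token, they are the easiest targets to train as attention sinks. Evicting their KV, as plain window attention does, removes a large share of the SoftMax denominator and shifts the remaining attention distribution, which is why performance collapses once the text length exceeds the cache size.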
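
The sum-to-1 constraint can be illustrated with a small numeric sketch. The logits below are made-up values chosen for illustration, not measurements from any model, and `softmax` is a plain-Python helper written for this note. The sketch compares one query's attention weights with a sink token present and after that token is evicted.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of attention logits."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one query over 6 cached tokens (illustrative values only).
# Token 0 plays the role of an attention sink: a large logit with no semantic match.
logits = [4.0, 0.0, -0.1, 0.05, 0.0, 0.1]

with_sink = softmax(logits)
without_sink = softmax(logits[1:])  # window attention after evicting token 0

print(f"sum of weights: {sum(with_sink):.6f}")
print(f"weight on sink token: {with_sink[0]:.3f}")
print(f"weight on token 1 with sink: {with_sink[1]:.3f}")
print(f"weight on token 1 after eviction: {without_sink[0]:.3f}")
```

With the sink present, most of the attention mass lands on token 0 and the remaining tokens receive small weights. Once token 0 is dropped, the same logits renormalize to much larger weights, which is the kind of distribution shift the eviction of initial tokens introduces.
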
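StreamingLLM's KV cache has two components: (1) **Attention Sinks** (the KV of the 4 initial tokens), which stabilize the attention computation, and (2) a **Rolling KV Cache** (the KV of the most recent tokens), which carries the information needed for language modeling. Positions are assigned by index within the cache rather than by index in the original text, which is crucial for performance. For example, if the cache holds tokens [0, 1, 2, 3, 6, 7, 8] while the 9th token is being decoded, the positions used are [0, 1, 2, 3, 4, 5, 6, 7] rather than the original text positions.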
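
The two-part cache and the in-cache position assignment can be sketched as follows. This is a minimal toy, not the code from the official repository: `SinkKVCache`, its methods, and the window size of 3 are hypothetical names and values chosen for illustration, and token ids stand in for the key/value tensors a real cache would hold.

```python
from collections import deque

class SinkKVCache:
    """Toy cache that stores token ids where a real cache would store KV tensors."""

    def __init__(self, num_sinks=4, window=3):
        self.num_sinks = num_sinks
        self.sinks = []                     # first tokens, never evicted
        self.recent = deque(maxlen=window)  # rolling window of recent tokens

    def append(self, token_id):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(token_id)
        else:
            # A full deque with maxlen drops its oldest entry automatically.
            self.recent.append(token_id)

    def entries(self):
        """Original text indices of the tokens currently cached."""
        return self.sinks + list(self.recent)

    def positions(self):
        """Positions assigned by slot in the cache, not by original text index."""
        return list(range(len(self.sinks) + len(self.recent)))

cache = SinkKVCache(num_sinks=4, window=3)
for token_id in range(9):
    cache.append(token_id)

print(cache.entries())    # [0, 1, 2, 3, 6, 7, 8]
print(cache.positions())  # [0, 1, 2, 3, 4, 5, 6]
```

After nine tokens the cache keeps the four sinks plus the three most recent tokens, and the positions stay contiguous even though tokens 4 and 5 were evicted. A real implementation would also need to apply the position encoding at decode time using these in-cache positions (for RoPE, by caching keys before the rotary transformation), which this toy does not model.
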