20260617: currently 914 pages
25
raw/papers/niu-stem-causal-sparse-attention-2026.md
Normal file
@@ -0,0 +1,25 @@
# Stem: Rethinking Causal Information Flow in Sparse Attention

**Authors:** Lin Niu\*, Xin Luo\*, Linchuan Xie, Yifu Sun, Guanghua Yu, Jianchen Zhu, S Kevin Zhou
**Affiliations:** Tencent, University of Science and Technology of China (USTC)
**arXiv:** [2603.06274](https://arxiv.org/abs/2603.06274) (v1, March 2026)
**Venue:** cs.LG / cs.AI
**Implementation:** Triton-based Block Sparse Attention kernel (open-source)

---
## Abstract
The quadratic computational complexity of self-attention remains a fundamental bottleneck for scaling LLMs to long contexts, particularly during the **pre-filling phase**. In this paper, we rethink the causal attention mechanism from the perspective of **information flow**. Due to causal constraints, tokens at initial positions participate in the aggregation of every subsequent token. However, existing sparse methods typically apply a **uniform top-k selection** across all token positions within a layer, ignoring the cumulative dependency of token information inherent in causal architectures. To address this, we propose **Stem**, a novel, plug-and-play sparsity module aligned with information flow:

1. **Token Position-Decay (TPD)**: position-dependent top-k within each layer, with a larger budget for initial tokens and aggressive sparsification for later tokens
2. **Output-Aware Metric (OAM)**: prioritizes high-impact tokens based on approximate output magnitude (incorporating Value information), not just attention scores

Stem is **training-free** and can also be integrated into training-based sparse models (DeepSeek-V3.2, MiniCPM-4.1) to further compress the sparse budget. Evaluated on RULER and LongBench with Llama3.1-8B and Qwen3-8B, Stem achieves superior accuracy with reduced pre-filling latency.
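The two components named in the abstract can be illustrated with a small, dense sketch. This is not the paper's Triton kernel, and the note does not give the exact formulas, so everything below is an assumption made for illustration: the linear keep-ratio schedule, the parameters `r_start` and `r_end`, the function names, and the choice of "attention probability times value norm" as the output-aware score. It also computes full attention rows to rank keys, which a real sparse kernel would avoid by using block-level approximations.

```python
import math


def keep_ratio(pos, n, r_start=1.0, r_end=0.25):
    """Hypothetical TPD schedule: the kept fraction decays linearly
    from r_start at the first query position to r_end at the last."""
    if n <= 1:
        return r_start
    return r_start + (r_end - r_start) * pos / (n - 1)


def budget(pos, n, r_start=1.0, r_end=0.25):
    """Number of keys kept for the query at `pos` (at least one).
    Under a causal mask the query sees pos + 1 keys."""
    visible = pos + 1
    return max(1, math.ceil(keep_ratio(pos, n, r_start, r_end) * visible))


def softmax_row(q, keys):
    """Scaled dot-product attention probabilities for one query."""
    scale = math.sqrt(len(q))
    logits = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def output_aware_scores(q, keys, values):
    """Assumed OAM proxy: key j contributes a_ij * v_j to the output,
    so it is ranked by a_ij * ||v_j|| instead of by a_ij alone."""
    probs = softmax_row(q, keys)
    return [p * math.sqrt(sum(x * x for x in v)) for p, v in zip(probs, values)]


def select_keys(Q, K, V, r_start=1.0, r_end=0.25):
    """For each query position, keep the top-budget causal keys
    ranked by the output-aware score. Returns sorted index lists."""
    n = len(Q)
    selected = []
    for i in range(n):
        scores = output_aware_scores(Q[i], K[: i + 1], V[: i + 1])
        k_i = budget(i, n, r_start, r_end)
        top = sorted(range(i + 1), key=lambda j: scores[j], reverse=True)[:k_i]
        selected.append(sorted(top))
    return selected
```

Calling `select_keys(Q, K, V)` on lists of equal-length float vectors returns, for each query position, the indices of the keys that would be kept. With the default schedule, early positions keep all of their visible keys while the last position keeps roughly a quarter of them.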
## Key Concepts
- [[stem-sparse-attention]] — the Stem framework
- [[causal-information-flow]] — the theoretical perspective
- [[token-position-decay]] — position-dependent sparse budget allocation
- [[output-aware-metric]] — value-aware token selection