# Stem: Rethinking Causal Information Flow in Sparse Attention

**Authors:** Lin Niu\*, Xin Luo\*, Linchuan Xie, Yifu Sun, Guanghua Yu, Jianchen Zhu, S Kevin Zhou
**Affiliations:** Tencent, University of Science and Technology of China (USTC)
**arXiv:** [2603.06274](https://arxiv.org/abs/2603.06274) (v1, March 2026)
**Venue:** cs.LG / cs.AI
**Implementation:** Triton-based Block Sparse Attention kernel (open-source)

---

## Abstract

The quadratic computational complexity of self-attention remains a fundamental bottleneck for scaling LLMs to long contexts, particularly during the **pre-filling phase**. In this paper, we rethink the causal attention mechanism from the perspective of **information flow**. Due to causal constraints, tokens at initial positions participate in the aggregation of every subsequent token. However, existing sparse methods typically apply a **uniform top-k selection** across all token positions within a layer, ignoring the cumulative dependency of token information inherent in causal architectures. To address this, we propose **Stem**, a novel, plug-and-play sparsity module aligned with information flow:

1. **Token Position-Decay (TPD)**: position-dependent top-k within each layer, with a larger budget for initial tokens and aggressive sparsification for later tokens
2. **Output-Aware Metric (OAM)**: prioritizes high-impact tokens based on approximate output magnitude (incorporating Value information), not just attention scores

Stem is **training-free** and can also be integrated into training-based sparse models (DeepSeek-V3.2, MiniCPM-4.1) to further compress the sparse budget. Evaluated on RULER and LongBench with Llama3.1-8B and Qwen3-8B, Stem achieves superior accuracy with reduced pre-filling latency.
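To make the two components concrete, here is a minimal pure-Python sketch of how a position-dependent budget and a value-aware score could combine when selecting keys for one query. It is an illustration under stated assumptions, not the paper's implementation. The linear decay schedule, the choice to index the budget by query position, the score `softmax weight × ‖v‖`, and all function and parameter names (`position_decay_budget`, `output_aware_scores`, `select_keys`, `k_max`, `k_min`) are hypothetical; the abstract does not specify any of them.

```python
import math


def position_decay_budget(pos, seq_len, k_max, k_min):
    """Assumed linear schedule: k_max at position 0, decaying to k_min at the last position."""
    if seq_len <= 1:
        return k_max
    # Integer arithmetic keeps the schedule monotone and avoids rounding surprises.
    return k_max - (pos * (k_max - k_min)) // (seq_len - 1)


def output_aware_scores(query, keys, values):
    """Assumed metric: softmax attention weight scaled by the L2 norm of each value vector."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - top) for logit in logits]
    total = sum(weights)
    return [
        (w / total) * math.sqrt(sum(v * v for v in value))
        for w, value in zip(weights, values)
    ]


def select_keys(pos, queries, keys, values, k_max, k_min):
    """Return the sorted key indices kept for the query at `pos` under a causal mask."""
    visible = pos + 1  # causal constraint: only positions 0..pos are visible
    budget = min(visible, position_decay_budget(pos, len(queries), k_max, k_min))
    scores = output_aware_scores(queries[pos], keys[:visible], values[:visible])
    ranked = sorted(range(visible), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:budget])


# Toy usage: identical queries and keys, so only the value norms differ.
queries = [[1.0, 0.0]] * 5
keys = [[1.0, 0.0]] * 5
values = [[1.0, 0.0], [5.0, 0.0], [2.0, 0.0], [4.0, 0.0], [3.0, 0.0]]
kept_for_last_query = select_keys(4, queries, keys, values, k_max=4, k_min=2)
```

In this toy setup every key receives the same attention weight, so a score based on attention alone could not separate them; the value norm breaks the tie. Since the note describes the released implementation as a block sparse kernel, selection there presumably operates on blocks rather than on single tokens as in this sketch.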

## Key Concepts

- [[stem-sparse-attention]] — the Stem framework
- [[causal-information-flow]] — the theoretical perspective
- [[token-position-decay]] — position-dependent sparse budget allocation
- [[output-aware-metric]] — value-aware token selection