20260617: currently 914 pages

2026-06-17 15:02:40 +08:00
parent e96b955fda
commit 91fac5b6fc
423 changed files with 20687 additions and 34 deletions

@@ -0,0 +1,25 @@
# Stem: Rethinking Causal Information Flow in Sparse Attention
**Authors:** Lin Niu\*, Xin Luo\*, Linchuan Xie, Yifu Sun, Guanghua Yu, Jianchen Zhu, S. Kevin Zhou
**Affiliations:** Tencent, University of Science and Technology of China (USTC)
**arXiv:** [2603.06274](https://arxiv.org/abs/2603.06274) (v1, March 2026)
**arXiv categories:** cs.LG / cs.AI
**Implementation:** Triton-based Block Sparse Attention kernel (open-source)

---
## Abstract
The quadratic computational complexity of self-attention remains a fundamental bottleneck for scaling LLMs to long contexts, particularly during the **pre-filling phase**. In this paper, we rethink the causal attention mechanism from the perspective of **information flow**. Due to causal constraints, tokens at initial positions participate in the aggregation of every subsequent token. However, existing sparse methods typically apply a **uniform top-k selection** across all token positions within a layer, ignoring the cumulative dependency of token information inherent in causal architectures. To address this, we propose **Stem**, a novel, plug-and-play sparsity module aligned with information flow:
1. **Token Position-Decay (TPD)**: position-dependent top-k within each layer — larger budget for initial tokens, aggressive sparsification for later tokens
2. **Output-Aware Metric (OAM)**: prioritizes high-impact tokens based on approximate output magnitude (incorporating Value information), not just attention scores

Stem is **training-free** and can also be integrated into training-based sparse models (DeepSeek-V3.2, MiniCPM-4.1) to further compress the sparse budget. Evaluated on RULER and LongBench with Llama3.1-8B and Qwen3-8B, Stem achieves superior accuracy with reduced pre-filling latency.
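
To make the Token Position-Decay idea concrete, here is a minimal sketch of a position-dependent top-k budget. It reads "larger budget for initial tokens" as a per-query keep ratio that shrinks with query position. The linear schedule, the function names, and the parameters `start_ratio`, `end_ratio`, and `min_keep` are all illustrative assumptions; the abstract does not specify the paper's actual decay schedule.

```python
import math


def position_decay_budgets(seq_len, start_ratio=1.0, end_ratio=0.1, min_keep=1):
    """Per-query top-k budgets under a hypothetical linear keep-ratio decay.

    Query i may attend to keys 0..i (causal mask). Its budget is a fraction
    of those visible keys, and the fraction shrinks from start_ratio at the
    first position to end_ratio at the last one. This schedule is an
    assumption for illustration, not the schedule used by Stem.
    """
    if not 0.0 < end_ratio <= start_ratio <= 1.0:
        raise ValueError("expected 0 < end_ratio <= start_ratio <= 1")
    budgets = []
    for i in range(seq_len):
        # Relative position in the sequence: 0.0 for the first token.
        frac = i / (seq_len - 1) if seq_len > 1 else 0.0
        ratio = start_ratio + (end_ratio - start_ratio) * frac
        visible = i + 1  # number of keys the causal mask exposes to query i
        k = max(min_keep, math.ceil(ratio * visible))
        budgets.append(min(k, visible))
    return budgets


def uniform_budgets(seq_len, k):
    """Baseline: the same top-k for every query, capped by the causal mask."""
    return [min(k, i + 1) for i in range(seq_len)]
```

For a 9-token sequence, `position_decay_budgets(9, start_ratio=1.0, end_ratio=0.25)` gives `[1, 2, 3, 3, 4, 4, 4, 3, 3]`, while `uniform_budgets(9, 3)` gives `[1, 2, 3, 3, 3, 3, 3, 3, 3]`. Early queries keep nearly all of their visible keys, and late queries keep only about a quarter.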
## Key Concepts
- [[stem-sparse-attention]] — the Stem framework
- [[causal-information-flow]] — the theoretical perspective
- [[token-position-decay]] — position-dependent sparse budget allocation
- [[output-aware-metric]] — value-aware token selection
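
The difference between score-only and value-aware selection can be illustrated with a small token-level sketch. It assumes the impact of key `j` is approximated by its attention weight times the L2 norm of its value vector. That formula and the helper names are assumptions made for illustration; the abstract only says that the metric incorporates Value information, and the actual kernel works on blocks rather than individual tokens.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def top_k_indices(scores, k):
    """Indices of the k largest scores, returned in ascending index order."""
    order = sorted(range(len(scores)), key=lambda j: (-scores[j], j))
    return sorted(order[:k])


def select_by_attention(logits, k):
    """Score-only baseline: keep the keys with the largest attention weights."""
    return top_k_indices(softmax(logits), k)


def select_output_aware(logits, values, k):
    """Hypothetical output-aware rule: rank keys by weight * ||value||."""
    weights = softmax(logits)
    impact = [w * l2_norm(v) for w, v in zip(weights, values)]
    return top_k_indices(impact, k)


# One query, four keys. Key 0 has the largest logit but a near-zero value.
logits = [2.0, 1.0, 0.0, 1.5]
values = [[0.01, 0.0], [3.0, 4.0], [6.0, 8.0], [0.3, 0.4]]
```

With `k = 2`, `select_by_attention(logits, 2)` keeps keys `[0, 3]`, whereas `select_output_aware(logits, values, 2)` keeps keys `[1, 2]`. Key 0 has the highest attention weight, but its value vector is so small that dropping it barely changes the output.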