20260617: currently 914 pages

2026-06-17 15:02:40 +08:00
parent e96b955fda
commit 91fac5b6fc
423 changed files with 20687 additions and 34 deletions
--- a/raw/papers/niu-stem-causal-sparse-attention-2026.md
+++ b/raw/papers/niu-stem-causal-sparse-attention-2026.md
@@ -0,0 +1,25 @@
+# Stem: Rethinking Causal Information Flow in Sparse Attention
+
+**Authors:** Lin Niu\*, Xin Luo\*, Linchuan Xie, Yifu Sun, Guanghua Yu, Jianchen Zhu, S. Kevin Zhou  
+**Affiliations:** Tencent, University of Science and Technology of China (USTC)  
+**arXiv:** [2603.06274](https://arxiv.org/abs/2603.06274) (v1, March 2026)  
+**Venue:** arXiv preprint (categories cs.LG / cs.AI)  
+**Implementation:** Triton-based Block Sparse Attention kernel (open-source)
+
+---
+
+## Abstract
+
+The quadratic computational complexity of self-attention remains a fundamental bottleneck for scaling LLMs to long contexts, particularly during the **pre-filling phase**. In this paper, we rethink the causal attention mechanism from the perspective of **information flow**. Due to causal constraints, tokens at initial positions participate in the aggregation of every subsequent token. However, existing sparse methods typically apply a **uniform top-k selection** across all token positions within a layer, ignoring the cumulative dependency of token information inherent in causal architectures. To address this, we propose **Stem**, a novel, plug-and-play sparsity module aligned with information flow:
+
+1. **Token Position-Decay (TPD)**: position-dependent top-k within each layer — larger budget for initial tokens, aggressive sparsification for later tokens
+2. **Output-Aware Metric (OAM)**: prioritizes high-impact tokens based on approximate output magnitude (incorporating Value information), not just attention scores
+
+Stem is **training-free** and can also be integrated into training-based sparse models (DeepSeek-V3.2, MiniCPM-4.1) to further compress the sparse budget. Evaluated on RULER and LongBench with Llama3.1-8B and Qwen3-8B, Stem achieves superior accuracy with reduced pre-filling latency.
+
+## Key Concepts
+
+- [[stem-sparse-attention]] — the Stem framework
+- [[causal-information-flow]] — the theoretical perspective
+- [[token-position-decay]] — position-dependent sparse budget allocation
+- [[output-aware-metric]] — value-aware token selection
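
The two components named in the abstract above (TPD and OAM) can be illustrated with a small plain-Python sketch. This is a minimal illustration under stated assumptions, not the paper's method or its Triton kernel. The linear budget schedule, the reading of the budget as a per-query key count, the score "attention weight times value norm", and all function names are hypothetical choices made here for clarity.

```python
import math


def position_decay_budget(pos, seq_len, k_max, k_min):
    """Hypothetical linear schedule (an assumption, not from the paper):
    k_max keys at position 0, decaying to k_min at the last position."""
    if seq_len <= 1:
        return k_max
    frac = pos / (seq_len - 1)
    return round(k_max + (k_min - k_max) * frac)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def l2_norm(vec):
    """Euclidean norm of a value vector."""
    return math.sqrt(sum(x * x for x in vec))


def select_keys(query_pos, logits, values, budget):
    """Pick up to `budget` causal keys for one query.

    Keys are ranked by attention weight * ||value||, an assumed proxy for
    each key's contribution to the output magnitude.
    """
    visible = query_pos + 1  # causal mask: keys 0..query_pos
    weights = softmax(logits[:visible])
    scores = [w * l2_norm(values[j]) for j, w in enumerate(weights)]
    ranked = sorted(range(visible), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:min(budget, visible)])
```

For example, with `logits = [2.0, 0.0, 1.0]` and `values = [[0.1, 0.0], [3.0, 4.0], [1.0, 0.0]]`, ranking by attention weight alone would keep key 0, whereas `select_keys(2, logits, values, 1)` keeps key 1, because its value vector has a much larger norm. In this sketch, the budget passed to `select_keys` would come from `position_decay_budget` evaluated at the query's position.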