
# Stem: Rethinking Causal Information Flow in Sparse Attention

**Authors:** Lin Niu\*, Xin Luo\*, Linchuan Xie, Yifu Sun, Guanghua Yu, Jianchen Zhu, S. Kevin Zhou
**Affiliations:** Tencent, University of Science and Technology of China (USTC)
**arXiv:** [2603.06274](https://arxiv.org/abs/2603.06274) (v1, March 2026)
**Venue:** cs.LG / cs.AI
**Implementation:** Triton-based Block Sparse Attention kernel (open-source)

---

## Abstract

The quadratic computational complexity of self-attention remains a fundamental bottleneck for scaling LLMs to long contexts, particularly during the **pre-filling phase**. In this paper, we rethink the causal attention mechanism from the perspective of **information flow**. Due to causal constraints, tokens at initial positions participate in the aggregation of every subsequent token. However, existing sparse methods typically apply a **uniform top-k selection** across all token positions within a layer, ignoring the cumulative dependency of token information inherent in causal architectures. To address this, we propose **Stem**, a novel, plug-and-play sparsity module aligned with information flow:

1. **Token Position-Decay (TPD)**: a position-dependent top-k budget within each layer, with a larger budget for initial tokens and aggressive sparsification for later tokens
2. **Output-Aware Metric (OAM)**: prioritizes high-impact tokens based on approximate output magnitude (incorporating Value information), not just attention scores
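
The two components above can be illustrated with a minimal token-level sketch. Everything in it is an assumption made for illustration rather than the paper's actual formulation: the linear decay schedule, the parameter names (`ratio_start`, `ratio_end`, `min_keep`), and the choice of "attention probability times value norm" as the output-aware score are all hypothetical. The sketch also computes exact attention scores, so it shows only the selection logic and saves no computation; a practical kernel would work on blocks and use cheap score estimates.

```python
import math


def decayed_budget(pos, seq_len, ratio_start=1.0, ratio_end=0.25, min_keep=1):
    """Number of keys the query at `pos` may keep.

    Hypothetical schedule: the keep ratio falls linearly from `ratio_start`
    at the first position to `ratio_end` at the last position.
    """
    visible = pos + 1  # causal mask: query `pos` sees keys 0..pos
    frac = pos / max(seq_len - 1, 1)
    ratio = ratio_start + (ratio_end - ratio_start) * frac
    return min(visible, max(min_keep, math.ceil(ratio * visible)))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def output_aware_scores(q, keys, values):
    """Assumed output-aware score: attention probability times value norm."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    probs = softmax(logits)
    norms = [math.sqrt(sum(x * x for x in v)) for v in values]
    return [p * n for p, n in zip(probs, norms)]


def select_keys(queries, keys, values, **schedule):
    """For each query position, return the sorted indices of the kept keys."""
    seq_len = len(queries)
    kept = []
    for pos, q in enumerate(queries):
        # Score only the causally visible keys, then keep the top `k` of them.
        scores = output_aware_scores(q, keys[: pos + 1], values[: pos + 1])
        k = decayed_budget(pos, seq_len, **schedule)
        top = sorted(range(pos + 1), key=lambda j: scores[j], reverse=True)[:k]
        kept.append(sorted(top))
    return kept
```

Under this sketch, two keys with identical attention probability are ranked by the norm of their value vectors, which is what separates an output-aware score from a purely attention-based one. The budget for each query shrinks as a fraction of its visible context as the position grows.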

Stem is **training-free** and can also be integrated into training-based sparse models (DeepSeek-V3.2, MiniCPM-4.1) to further compress the sparse budget. Evaluated on RULER and LongBench with Llama3.1-8B and Qwen3-8B, Stem achieves superior accuracy with reduced pre-filling latency.

## Key Concepts

- [[stem-sparse-attention]] — the Stem framework
- [[causal-information-flow]] — the theoretical perspective
- [[token-position-decay]] — position-dependent sparse budget allocation
- [[output-aware-metric]] — value-aware token selection