20260601

2026-06-01 10:46:01 +08:00
parent 2faf4bb002
commit e96b955fda
221 changed files with 10219 additions and 332 deletions
--- a/raw/papers/agarwal-bayesian-attention-geometry-2026.md
+++ b/raw/papers/agarwal-bayesian-attention-geometry-2026.md
@@ -0,0 +1,61 @@
+---
+title: "The Bayesian Geometry of Transformer Attention"
+authors: "Naman Agarwal, Siddhartha R. Dalal, Vishal Misra"
+arxiv: "2512.22471"
+venue: "arXiv (cs.LG)"
+date: "2026-05"
+type: "paper"
+series: "Bayesian Attention Trilogy, Paper I"
+---
+
+# The Bayesian Geometry of Transformer Attention
+
+**Paper I of the Bayesian Attention Trilogy**
+
+**Authors**: Naman Agarwal (Dream Sports → Google DeepMind), Siddhartha R. Dalal (Columbia), Vishal Misra (Columbia)
+
+## TL;DR
+
+Small transformers match the analytic Bayesian posterior to within 10⁻³–10⁻⁴ bits in **Bayesian wind tunnels**: controlled environments where the true posterior is known in closed form and memorization is provably impossible. MLPs fail by orders of magnitude.
+
+## Core Framework: Bayesian Wind Tunnels
+
+Controlled prediction tasks where:
+1. The analytic posterior is known exactly at each step
+2. The hypothesis space is too large for memorization
+3. In-context prediction requires genuine probabilistic inference
+
+This converts "does it do Bayes?" into a quantitative test: **does the model's predictive entropy match the analytic posterior entropy?**
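
The entropy-matching test can be illustrated with a toy example. The sketch below is not the paper's setup: it assumes a hypothetical task with three candidate coin biases and a uniform prior, and the function names (`analytic_predictives`, `entropy_gap_bits`) are invented for illustration. It computes the closed-form predictive at each step and measures how far a candidate predictor's entropy is from it, in bits.

```python
import numpy as np

def binary_entropy_bits(p):
    """Entropy in bits of a Bernoulli(p) predictive distribution."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def analytic_predictives(thetas, prior, observations):
    """Closed-form P(next = 1 | prefix) for each prefix of the sequence.

    thetas: candidate coin biases (the hypothesis space, assumed finite here).
    prior:  prior probability of each hypothesis.
    """
    thetas = np.asarray(thetas, dtype=float)
    posterior = np.asarray(prior, dtype=float).copy()
    preds = []
    for x in observations:
        # Predict before seeing x, then update the posterior with x.
        preds.append(float(posterior @ thetas))
        likelihood = thetas if x == 1 else 1.0 - thetas
        posterior = posterior * likelihood
        posterior /= posterior.sum()
    return np.array(preds)

def entropy_gap_bits(model_preds, analytic_preds):
    """Mean absolute difference between model and analytic predictive entropy."""
    diff = binary_entropy_bits(model_preds) - binary_entropy_bits(analytic_preds)
    return float(np.mean(np.abs(diff)))

# Hypothetical toy task: three possible biases, uniform prior.
thetas = [0.2, 0.5, 0.8]
prior = [1 / 3, 1 / 3, 1 / 3]
observations = [1, 1, 0, 1]

exact = analytic_predictives(thetas, prior, observations)
gap_exact = entropy_gap_bits(exact, exact)                           # a perfect Bayesian predictor
gap_constant = entropy_gap_bits(np.full(len(exact), 0.5), exact)     # a predictor that ignores context
```

In this toy setting a predictor that ignores the context shows a nonzero gap, while one that tracks the posterior shows none. A real evaluation would substitute a trained model's per-step predictive probabilities for `model_preds`.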
+
+## Three Inference Primitives
+
+| Primitive | Definition | Required for |
+|-----------|-----------|-------------|
+| Belief Accumulation | Integrating evidence into running posterior | Bijection learning, HMM |
+| Belief Transport | Propagating beliefs through stochastic dynamics | HMM filtering |
+| Random-Access Binding | Retrieving by content, not position | Associative recall |
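
As a rough illustration of the three primitives in the table above, the sketch below uses a standard HMM forward-filter step (transport, then accumulation) and a softmax dot-product lookup (binding). The matrices, the temperature value, and the function names are assumptions chosen for the example. They are not taken from the paper.

```python
import numpy as np

def hmm_filter_step(belief, transition, emission, obs):
    """One forward-filter step over hidden states.

    transition[i, j] = P(next state = j | current state = i)
    emission[i, k]   = P(observation = k | state = i)
    """
    predicted = belief @ transition           # belief transport through the dynamics
    updated = predicted * emission[:, obs]    # belief accumulation: reweight by likelihood
    return updated / updated.sum()

def content_lookup(query, keys, values, temperature=0.01):
    """Retrieve by content: softmax over key-query similarity, then mix values."""
    scores = keys @ query / temperature
    weights = np.exp(scores - scores.max())   # subtract max for numerical stability
    weights /= weights.sum()
    return weights @ values

# Hypothetical 2-state HMM with 2 observation symbols.
transition = np.array([[0.9, 0.1],
                       [0.2, 0.8]])
emission = np.array([[0.8, 0.2],
                     [0.3, 0.7]])
belief = hmm_filter_step(np.array([0.5, 0.5]), transition, emission, obs=0)

# Binding with orthogonal keys: querying with key 1 returns value 1,
# regardless of where that pair sits in the sequence.
keys = np.eye(3)
values = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
retrieved = content_lookup(keys[1], keys, values)
```

The low temperature makes the lookup nearly one-hot, so the retrieved vector is close to the stored value for the matching key. With orthogonal keys, as in this example, non-matching entries contribute almost nothing.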
+
+## Architectural Realizability
+
+| Architecture | Accumulation | Transport | Binding | Status |
+|-------------|:---:|:---:|:---:|--------|
+| Transformer | ✅ | ✅ | ✅ | Full primitive completeness |
+| Mamba (SSM) | ✅ | ✅ | ❌ | SOTA on HMM filtering; fails binding |
+| LSTM | ✅ | ❌ | ❌ | Only static sufficient statistics |
+| MLP | ❌ | ❌ | ❌ | Fails uniformly |
+
+## Key Geometric Findings
+
+- **Orthogonal key bases** in attention heads
+- **Low-dimensional value manifold** parameterized by posterior entropy
+- Mamba's final layer organizes into **5 clusters** — one per HMM hidden state (corner geometry of belief simplex)
+
+## Structural Theorem
+
+> The dominance of transformers in reasoning tasks arises not from scale alone, but from **primitive completeness**: they are the minimal architecture realizing the full set of inference primitives.
+
+## Trilogy Context
+
+- **Paper I** (this): Existence + internal geometry of exact Bayesian inference in transformers
+- **Paper II**: Bayesian geometry arises generically from gradient dynamics under cross-entropy
+- **Paper III**: How primitives compose in partially observed settings (closer to natural language)