---
title: "The Bayesian Geometry of Transformer Attention"
authors: "Naman Agarwal, Siddhartha R. Dalal, Vishal Misra"
arxiv: "2512.22471"
venue: "arXiv (cs.LG)"
date: "2026-05"
type: "paper"
series: "Bayesian Attention Trilogy, Paper I"
---

# The Bayesian Geometry of Transformer Attention

**Paper I of the Bayesian Attention Trilogy**

**Authors**: Naman Agarwal (Dream Sports → Google DeepMind), Siddhartha R. Dalal (Columbia), Vishal Misra (Columbia)

## TL;DR

Small transformers achieve exact Bayesian posteriors (10⁻³–10⁻⁴ bit accuracy) in **Bayesian wind tunnels**: controlled environments where the true posterior is known in closed form and memorization is provably impossible. MLPs fail by orders of magnitude.

## Core Framework: Bayesian Wind Tunnels

Controlled prediction tasks where:

1. Analytic posterior is known exactly at each step
2. Hypothesis space is too large for memorization
3. In-context prediction requires genuine probabilistic inference

Converts "does it do Bayes?" into a quantitative test: **does the model's predictive entropy match the analytic posterior entropy?**

## Three Inference Primitives

| Primitive | Definition | Required for |
|-----------|-----------|-------------|
| Belief Accumulation | Integrating evidence into running posterior | Bijection learning, HMM |
| Belief Transport | Propagating beliefs through stochastic dynamics | HMM filtering |
| Random-Access Binding | Retrieving by content, not position | Associative recall |

## Architectural Realizability

| Architecture | Accumulation | Transport | Binding | Status |
|-------------|:---:|:---:|:---:|--------|
| Transformer | ✅ | ✅ | ✅ | Full primitive completeness |
| Mamba (SSM) | ✅ | ✅ | ❌ | SOTA on HMM filtering; fails binding |
| LSTM | ✅ | ❌ | ❌ | Only static sufficient statistics |
| MLP | ❌ | ❌ | ❌ | Fails uniformly |

## Key Geometric Findings

- **Orthogonal key bases** in attention heads
- **Low-dimensional value manifold** parameterized by posterior entropy
- Mamba's final layer organizes into **5 clusters**, one per HMM hidden state (corner geometry of belief simplex)

## Structural Theorem

> The dominance of transformers in reasoning tasks arises not from scale alone, but from **primitive completeness**: they are the minimal architecture realizing the full set of inference primitives.

## Trilogy Context

- **Paper I** (this): Existence + internal geometry of exact Bayesian inference in transformers
- **Paper II**: Bayesian geometry arises generically from gradient dynamics under cross-entropy
- **Paper III**: How primitives compose in partially observed settings (closer to natural language)
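
## Illustrative Sketch: Entropy Matching on a Toy HMM

The quantitative test from the Core Framework section (does the model's predictive entropy match the analytic one?) can be illustrated on a small HMM. The sketch below is not the paper's code or benchmark. The two-state transition matrix `T`, emission matrix `E`, prior, observation sequence, and all function names are hypothetical, chosen only for illustration. Mapping the filter's update step to "belief accumulation" and its predict step to "belief transport" is an interpretive reading of the primitives table, and the mean absolute entropy gap in bits is an assumed stand-in for whatever metric the paper uses.

```python
import math

def entropy_bits(p):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0.0)

def accumulate(belief, E, obs):
    """Update step: reweight the belief by the likelihood E[state][obs], then renormalize."""
    w = [b * E[s][obs] for s, b in enumerate(belief)]
    z = sum(w)
    return [x / z for x in w]

def transport(belief, T):
    """Predict step: push the belief through T[i][j] = P(next=j | current=i)."""
    n = len(T)
    return [sum(belief[i] * T[i][j] for i in range(n)) for j in range(n)]

def predictive(belief, E):
    """Distribution over the next observation, given a belief over the hidden state."""
    return [sum(belief[s] * E[s][o] for s in range(len(belief)))
            for o in range(len(E[0]))]

def analytic_predictives(obs_seq, prior, T, E):
    """Exact predictive distribution before each observation is revealed."""
    belief, out = list(prior), []
    for obs in obs_seq:
        out.append(predictive(belief, E))
        belief = transport(accumulate(belief, E, obs), T)
    return out

def entropy_gap(model_probs, analytic_probs):
    """Mean absolute gap (bits) between model and analytic predictive entropies."""
    gaps = [abs(entropy_bits(m) - entropy_bits(a))
            for m, a in zip(model_probs, analytic_probs)]
    return sum(gaps) / len(gaps)

# Hypothetical toy HMM (illustrative values only, not from the paper).
T = [[0.9, 0.1], [0.2, 0.8]]      # hidden-state transitions
E = [[0.8, 0.2], [0.3, 0.7]]      # emission probabilities per hidden state
PRIOR = [0.5, 0.5]
OBS = [0, 0, 1, 0]

oracle = analytic_predictives(OBS, PRIOR, T, E)
uniform = [[0.5, 0.5] for _ in OBS]           # a "model" that ignores context

gap_oracle = entropy_gap(oracle, oracle)      # exact filter vs. itself
gap_uniform = entropy_gap(uniform, oracle)    # context-blind baseline
print(gap_oracle, gap_uniform)
```

In practice `model_probs` would hold a trained network's next-token distributions at each position. Here the oracle and a context-blind uniform predictor stand in to show the two ends of the scale: the oracle's gap is zero by construction, while the uniform predictor's gap is strictly positive.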