| title | authors | arxiv | venue | date | type | series |
|---|---|---|---|---|---|---|
| The Bayesian Geometry of Transformer Attention | Naman Agarwal, Siddhartha R. Dalal, Vishal Misra | 2512.22471 | arXiv (cs.LG) | 2026-05 | paper | Bayesian Attention Trilogy, Paper I |
The Bayesian Geometry of Transformer Attention
Paper I of the Bayesian Attention Trilogy
Authors: Naman Agarwal (Dream Sports → Google DeepMind), Siddhartha R. Dalal (Columbia), Vishal Misra (Columbia)
TL;DR
Small transformers match the analytic Bayesian posterior to within 10⁻³–10⁻⁴ bits in Bayesian wind tunnels: controlled environments where the true posterior is known in closed form and memorization is provably impossible. MLPs miss by orders of magnitude.
Core Framework: Bayesian Wind Tunnels
Controlled prediction tasks where:
- Analytic posterior is known exactly at each step
- Hypothesis space is too large for memorization
- In-context prediction requires genuine probabilistic inference
This converts the question "does it do Bayes?" into a quantitative test: does the model's predictive entropy match the analytic posterior entropy?
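A minimal sketch of such an entropy test, under stated assumptions: the task is taken to be a uniformly random bijection over `num_symbols` symbols, queried on an unseen input after `k` distinct pairs have been revealed, so the analytic predictive distribution is uniform over the remaining symbols. The function names and the mean-absolute-gap metric are illustrative choices, not the paper's exact protocol.

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def analytic_posterior_entropy(num_symbols, num_observed):
    """Exact predictive entropy for an unseen input of a uniformly random
    bijection (assumed task): uniform over the symbols not yet used."""
    remaining = num_symbols - num_observed
    return math.log2(remaining)

def mean_entropy_gap(model_predictions, num_symbols):
    """Mean absolute gap (bits) between model and analytic entropy.
    model_predictions[k] is the model's distribution after k observed pairs."""
    gaps = [
        abs(entropy_bits(p) - analytic_posterior_entropy(num_symbols, k))
        for k, p in enumerate(model_predictions)
    ]
    return sum(gaps) / len(gaps)

# Hypothetical model outputs for a 4-symbol task (not real measurements).
ideal = [
    [0.25, 0.25, 0.25, 0.25],  # nothing observed yet
    [0.0, 1 / 3, 1 / 3, 1 / 3],  # one output ruled out
    [0.0, 0.0, 0.5, 0.5],        # two outputs ruled out
    [0.0, 0.0, 0.0, 1.0],        # only one output left
]
never_updates = [[0.25] * 4 for _ in range(4)]

print(mean_entropy_gap(ideal, 4))
print(mean_entropy_gap(never_updates, 4))
```

A predictor that tracks the posterior yields a gap near zero, while one that ignores the evidence accumulates a gap that grows with the number of observed pairs.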
Three Inference Primitives
| Primitive | Definition | Required for |
|---|---|---|
| Belief Accumulation | Integrating evidence into running posterior | Bijection learning, HMM |
| Belief Transport | Propagating beliefs through stochastic dynamics | HMM filtering |
| Random-Access Binding | Retrieving by content, not position | Associative recall |
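The three primitives in the table can be sketched as small reference computations. This is an illustrative sketch, not the paper's implementation: `transport` and `accumulate` are the predict and update steps of the standard HMM forward filter, and `bind` is a softmax lookup by dot-product similarity. All function names, the `sharpness` parameter, and the toy matrices are assumptions made for the example.

```python
import math

def transport(belief, transition):
    """Belief transport: push the belief through the stochastic dynamics.
    transition[i][j] = P(next state = j | current state = i)."""
    n = len(belief)
    return [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]

def accumulate(belief, likelihood):
    """Belief accumulation: reweight by the evidence likelihood, renormalize."""
    unnorm = [b * l for b, l in zip(belief, likelihood)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def hmm_filter(prior, transition, emission, observations):
    """Forward filtering. emission[i][o] = P(obs = o | state = i).
    Returns the filtered belief after each observation."""
    belief = list(prior)
    history = []
    for t, obs in enumerate(observations):
        if t > 0:
            belief = transport(belief, transition)
        belief = accumulate(belief, [row[obs] for row in emission])
        history.append(belief)
    return history

def bind(query, keys, values, sharpness=20.0):
    """Random-access binding: retrieve a value by key content, not position."""
    scores = [sharpness * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    dim = len(values[0])
    return [sum(w / norm * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Toy two-state HMM (hypothetical parameters).
beliefs = hmm_filter(
    prior=[0.5, 0.5],
    transition=[[0.9, 0.1], [0.1, 0.9]],
    emission=[[0.8, 0.2], [0.2, 0.8]],
    observations=[0, 0],
)
print(beliefs)

# Lookup that depends only on which key matches the query.
print(bind([0.0, 1.0], keys=[[1.0, 0.0], [0.0, 1.0]], values=[[1.0], [2.0]]))
```

Reordering the key/value pairs passed to `bind` leaves the result unchanged, which is the sense in which retrieval is by content rather than position.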
Architectural Realizability
| Architecture | Accumulation | Transport | Binding | Status |
|---|---|---|---|---|
| Transformer | ✅ | ✅ | ✅ | Full primitive completeness |
| Mamba (SSM) | ✅ | ✅ | ❌ | SOTA on HMM filtering; fails binding |
| LSTM | ✅ | ❌ | ❌ | Only static sufficient statistics |
| MLP | ❌ | ❌ | ❌ | Fails uniformly |
Key Geometric Findings
- Orthogonal key bases in attention heads
- Low-dimensional value manifold parameterized by posterior entropy
- Mamba's final layer organizes into 5 clusters — one per HMM hidden state (corner geometry of belief simplex)
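One simple way to quantify how close a set of key vectors is to an orthogonal basis is the largest absolute cosine similarity between distinct keys. The sketch below is a generic diagnostic under that assumption; it is not claimed to be the measurement used in the paper, and the key vectors are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def max_offdiag_cosine(keys):
    """Largest |cosine| over distinct key pairs; 0 means mutually orthogonal."""
    return max(
        abs(cosine(keys[i], keys[j]))
        for i in range(len(keys))
        for j in range(i + 1, len(keys))
    )

# Hypothetical key vectors, not extracted from any trained model.
orthogonal_keys = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
entangled_keys = [[1.0, 0.0], [1.0, 1.0]]

print(max_offdiag_cosine(orthogonal_keys))
print(max_offdiag_cosine(entangled_keys))
```

Because cosine similarity ignores vector length, the score depends only on the directions of the keys, so keys of different norms can still count as an orthogonal basis.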
Structural Theorem
The dominance of transformers in reasoning tasks arises not from scale alone, but from primitive completeness: they are the minimal architecture realizing the full set of inference primitives.
Trilogy Context
- Paper I (this): Existence + internal geometry of exact Bayesian inference in transformers
- Paper II: Bayesian geometry arises generically from gradient dynamics under cross-entropy
- Paper III: How primitives compose in partially observed settings (closer to natural language)