---
title: "The Bayesian Geometry of Transformer Attention"
authors: "Naman Agarwal, Siddhartha R. Dalal, Vishal Misra"
arxiv: "2512.22471"
venue: "arXiv (cs.LG)"
date: "2026-05"
type: "paper"
series: "Bayesian Attention Trilogy, Paper I"
---
# The Bayesian Geometry of Transformer Attention

**Paper I of the Bayesian Attention Trilogy**

**Authors**: Naman Agarwal (Dream Sports → Google DeepMind), Siddhartha R. Dalal (Columbia), Vishal Misra (Columbia)
## TL;DR

Small transformers reproduce the exact Bayesian posterior to within 10⁻³–10⁻⁴ bits in **Bayesian wind tunnels**: controlled environments where the true posterior is known in closed form and memorization is provably impossible. MLPs miss the posterior by orders of magnitude.
## Core Framework: Bayesian Wind Tunnels

Bayesian wind tunnels are controlled prediction tasks where:

1. The analytic posterior is known exactly at each step
2. The hypothesis space is too large for memorization
3. In-context prediction requires genuine probabilistic inference

This converts "does it do Bayes?" into a quantitative test: **does the model's predictive entropy match the analytic posterior entropy?**
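The entropy test can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the three-hypothesis coin task, the names `analytic_predictive` and `entropy_gap_bits`, and the hard-coded `model_output` that stands in for a trained network's prediction. The toy hypothesis space is also far too small to rule out memorization, so it only shows the shape of the comparison.

```python
import math


def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def bayes_update(belief, likelihoods):
    """Multiply the belief by per-hypothesis likelihoods and renormalize."""
    unnorm = [b * l for b, l in zip(belief, likelihoods)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def analytic_predictive(biases, flips):
    """Closed-form predictive distribution [P(tails), P(heads)] for the next flip.

    Starts from a uniform prior over the candidate coin biases (toy assumption).
    """
    belief = [1.0 / len(biases)] * len(biases)
    for x in flips:
        belief = bayes_update(belief, [b if x == 1 else 1.0 - b for b in biases])
    p_heads = sum(w * b for w, b in zip(belief, biases))
    return [1.0 - p_heads, p_heads]


def entropy_gap_bits(model_probs, analytic_probs):
    """Absolute difference between model and analytic predictive entropy."""
    return abs(entropy_bits(model_probs) - entropy_bits(analytic_probs))


BIASES = [0.1, 0.5, 0.9]   # hypothetical hypothesis space
FLIPS = [1, 1, 0, 1]       # observed context: 1 = heads, 0 = tails

target = analytic_predictive(BIASES, FLIPS)
model_output = [0.29, 0.71]  # stand-in for a trained model's prediction
gap = entropy_gap_bits(model_output, target)
print(f"analytic P(heads) = {target[1]:.4f}, entropy gap = {gap:.2e} bits")
```

In a real wind tunnel the gap would be averaged over many sequences and positions, and `model_output` would come from the network under test.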
## Three Inference Primitives

| Primitive | Definition | Required for |
|-----------|------------|--------------|
| Belief Accumulation | Integrating evidence into running posterior | Bijection learning, HMM |
| Belief Transport | Propagating beliefs through stochastic dynamics | HMM filtering |
| Random-Access Binding | Retrieving by content, not position | Associative recall |
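The first two primitives correspond to the two halves of a textbook HMM forward-filter step, which the following sketch spells out. The two-state HMM, its matrices, and the function names `transport`, `accumulate`, and `hmm_filter` are hypothetical and chosen only for illustration; the convention assumed here is predict first, then update on the new observation.

```python
def transport(belief, transition):
    """Belief transport: push the belief through the dynamics.

    new_belief[j] = sum_i belief[i] * transition[i][j]
    """
    n = len(belief)
    return [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]


def accumulate(belief, likelihood):
    """Belief accumulation: fold one observation's likelihood into the belief."""
    unnorm = [b * l for b, l in zip(belief, likelihood)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def hmm_filter(prior, transition, emission, observations):
    """Forward filtering: alternate transport and accumulation per observation."""
    belief = list(prior)
    for obs in observations:
        belief = transport(belief, transition)
        # emission[state][symbol] is P(symbol | state)
        belief = accumulate(belief, [row[obs] for row in emission])
    return belief


# Hypothetical two-state HMM with two output symbols.
TRANSITION = [[0.9, 0.1],
              [0.2, 0.8]]
EMISSION = [[0.8, 0.2],
            [0.3, 0.7]]

final_belief = hmm_filter([0.5, 0.5], TRANSITION, EMISSION, [0, 0, 1])
print("filtered belief over hidden states:", final_belief)
```

A task with static hypotheses, such as bijection learning, needs only `accumulate`; the `transport` step matters once the hidden state itself moves between observations.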
## Architectural Realizability

| Architecture | Accumulation | Transport | Binding | Status |
|--------------|:---:|:---:|:---:|--------|
| Transformer | ✅ | ✅ | ✅ | Full primitive completeness |
| Mamba (SSM) | ✅ | ✅ | ❌ | SOTA on HMM filtering; fails binding |
| LSTM | ✅ | ❌ | ❌ | Only static sufficient statistics |
| MLP | ❌ | ❌ | ❌ | Fails uniformly |
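Binding is the column that separates the transformer row from the rest of the table. As a rough illustration of what "retrieving by content, not position" means, the sketch below runs plain dot-product attention over a handful of key-value pairs and shows that shuffling their order leaves the retrieved value unchanged. The keys, values, `scale` setting, and function names are hypothetical; this is not the paper's experimental setup, and real attention layers also use learned projections and positional information.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def attend(query, keys, values, scale=1.0):
    """Dot-product attention: weight each value by query-key similarity."""
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights


# Hypothetical associative memory: three orthogonal keys, each bound to a value.
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = keys[1]  # ask for whatever is bound to the second key

out, weights = attend(query, keys, values, scale=10.0)

# Shuffle the storage order; retrieval depends on content, so the result holds.
order = [2, 0, 1]
out_shuffled, _ = attend(query,
                         [keys[i] for i in order],
                         [values[i] for i in order],
                         scale=10.0)
print("retrieved:", out, "after shuffle:", out_shuffled)
```

The large `scale` makes the softmax nearly one-hot, which is what lets the lookup return almost exactly the bound value.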
## Key Geometric Findings

- **Orthogonal key bases** in attention heads
- **Low-dimensional value manifold** parameterized by posterior entropy
- Mamba's final layer organizes into **5 clusters**, one per HMM hidden state (corner geometry of belief simplex)
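One simple way to quantify how close a set of key vectors is to an orthogonal basis is the mean absolute cosine similarity over all distinct pairs, which is 0 for a perfectly orthogonal set. The sketch below is an assumed diagnostic, not necessarily the measurement used in the paper, and the `example_keys` vectors are made up rather than extracted from a trained head.

```python
import math


def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_offdiag_abs_cosine(keys):
    """Average |cosine| over all distinct key pairs (0.0 means orthogonal)."""
    total = 0.0
    pairs = 0
    for i in range(len(keys)):
        for j in range(i + 1, len(keys)):
            total += abs(cosine(keys[i], keys[j]))
            pairs += 1
    return total / pairs


# Hypothetical key vectors: nearly orthogonal, with small off-axis leakage.
example_keys = [
    [1.0, 0.05, 0.0],
    [0.0, 1.0, 0.05],
    [0.05, 0.0, 1.0],
]
print("mean |cos| between keys:", mean_offdiag_abs_cosine(example_keys))
```

Values near 0 indicate near-orthogonal keys, while values near 1 indicate keys that mostly point along the same direction.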
## Structural Theorem

> The dominance of transformers in reasoning tasks arises not from scale alone, but from **primitive completeness**: they are the minimal architecture realizing the full set of inference primitives.
## Trilogy Context

- **Paper I** (this): Existence + internal geometry of exact Bayesian inference in transformers
- **Paper II**: Bayesian geometry arises generically from gradient dynamics under cross-entropy
- **Paper III**: How primitives compose in partially observed settings (closer to natural language)