This commit is contained in:
2026-06-01 10:46:01 +08:00
parent 2faf4bb002
commit e96b955fda
221 changed files with 10219 additions and 332 deletions

@@ -0,0 +1,61 @@
---
title: "The Bayesian Geometry of Transformer Attention"
authors: "Naman Agarwal, Siddhartha R. Dalal, Vishal Misra"
arxiv: "2512.22471"
venue: "arXiv (cs.LG)"
date: "2026-05"
type: "paper"
series: "Bayesian Attention Trilogy, Paper I"
---
# The Bayesian Geometry of Transformer Attention
**Paper I of the Bayesian Attention Trilogy**
**Authors**: Naman Agarwal (Dream Sports → Google DeepMind), Siddhartha R. Dalal (Columbia), Vishal Misra (Columbia)
## TL;DR
Small transformers match the analytic Bayesian posterior to 10⁻³–10⁻⁴ bit accuracy in **Bayesian wind tunnels**: controlled environments where the true posterior is known in closed form and memorization is provably impossible. MLPs fail by orders of magnitude.
## Core Framework: Bayesian Wind Tunnels
Controlled prediction tasks where:
1. Analytic posterior is known exactly at each step
2. Hypothesis space is too large for memorization
3. In-context prediction requires genuine probabilistic inference

This setup converts "does it do Bayes?" into a quantitative test: **does the model's predictive entropy match the analytic posterior entropy?**
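
The entropy-matching test can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code. It assumes a toy bijection task with a uniform prior over bijections, so the analytic predictive for an unseen input is uniform over the outputs not yet used. The function names (`entropy_bits`, `bijection_predictive`, `entropy_gap_bits`) and the mean-absolute-gap score are hypothetical choices made for this sketch.

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def bijection_predictive(vocab_size, used_outputs):
    """Analytic predictive for an unseen input (assumed toy task):
    under a uniform prior over bijections, every unused output is equally likely."""
    used = set(used_outputs)
    free = vocab_size - len(used)
    return [0.0 if y in used else 1.0 / free for y in range(vocab_size)]

def entropy_gap_bits(model_predictives, analytic_predictives):
    """Mean absolute gap between model and analytic predictive entropy, in bits."""
    gaps = [abs(entropy_bits(m) - entropy_bits(a))
            for m, a in zip(model_predictives, analytic_predictives)]
    return sum(gaps) / len(gaps)

vocab_size = 8
# Step k: outputs 0..k-1 have already been observed in context.
analytic = [bijection_predictive(vocab_size, range(k)) for k in range(vocab_size)]
# A stand-in "model" that never updates: always uniform over the full vocabulary.
uniform_model = [[1.0 / vocab_size] * vocab_size for _ in range(vocab_size)]

print(entropy_gap_bits(analytic, analytic))       # a perfect Bayesian predictor has zero gap
print(entropy_gap_bits(uniform_model, analytic))  # a non-updating predictor has a large gap
```

In a real evaluation, `uniform_model` would be replaced by the trained network's softmax outputs at each context position.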
## Three Inference Primitives
| Primitive | Definition | Required for |
|-----------|-----------|-------------|
| Belief Accumulation | Integrating evidence into running posterior | Bijection learning, HMM |
| Belief Transport | Propagating beliefs through stochastic dynamics | HMM filtering |
| Random-Access Binding | Retrieving by content, not position | Associative recall |
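
The first two primitives map onto the two steps of the standard HMM forward filter, sketched below in plain Python. The mapping of "transport" to the predict step and "accumulation" to the update step is this note's reading of the table, and the function names and the two-state example parameters are illustrative assumptions rather than values from the paper.

```python
def transport(belief, transition):
    """Belief transport (predict step): push the belief through the dynamics.
    transition[i][j] = P(next state j | current state i)."""
    n = len(belief)
    return [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]

def accumulate(belief, likelihood):
    """Belief accumulation (update step): reweight by the evidence and renormalize."""
    unnorm = [b * l for b, l in zip(belief, likelihood)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def forward_filter(prior, transition, emission, observations):
    """Return the filtered posterior over hidden states after each observation.
    emission[s][o] = P(observation o | state s)."""
    belief = list(prior)
    history = []
    for t, obs in enumerate(observations):
        if t > 0:
            belief = transport(belief, transition)
        belief = accumulate(belief, [row[obs] for row in emission])
        history.append(belief)
    return history

# Hypothetical two-state HMM used only to exercise the functions.
prior = [0.5, 0.5]
transition = [[0.9, 0.1], [0.2, 0.8]]
emission = [[0.8, 0.2], [0.3, 0.7]]
for step, belief in enumerate(forward_filter(prior, transition, emission, [0, 0, 1])):
    print(step, [round(b, 4) for b in belief])
```

With an identity transition matrix, `transport` leaves the belief unchanged and the filter reduces to pure accumulation, which is one way to see why a task with static hypotheses needs only the first primitive.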
## Architectural Realizability
| Architecture | Accumulation | Transport | Binding | Status |
|-------------|:---:|:---:|:---:|--------|
| Transformer | ✅ | ✅ | ✅ | Full primitive completeness |
| Mamba (SSM) | ✅ | ✅ | ❌ | SOTA on HMM filtering; fails binding |
| LSTM | ✅ | ❌ | ❌ | Only static sufficient statistics |
| MLP | ❌ | ❌ | ❌ | Fails uniformly |
## Key Geometric Findings
- **Orthogonal key bases** in attention heads
- **Low-dimensional value manifold** parameterized by posterior entropy
- Mamba's final layer organizes into **5 clusters** — one per HMM hidden state (corner geometry of belief simplex)
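
To see why orthogonal keys are convenient for random-access binding, consider the toy softmax-attention lookup below. It is a sketch under stated assumptions, not the mechanism measured in the paper: the keys are hand-picked one-hot vectors, and `sharpness` is a hypothetical scale factor standing in for whatever logit scale a trained head would learn.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attend(query, keys, values, sharpness=20.0):
    """Single-query dot-product attention: a weighted average of the values."""
    scores = [sharpness * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Orthonormal keys (assumed): each stored item gets its own axis.
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# Querying with the second key retrieves (almost exactly) the second value,
# regardless of where that item sits in the sequence.
retrieved = attend(keys[1], keys, values)
print([round(x, 6) for x in retrieved])
```

Because the keys are orthogonal, the query has zero overlap with every non-matching key, so the only interference comes from the softmax tail, which shrinks exponentially as `sharpness` grows.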
## Structural Theorem
> The dominance of transformers in reasoning tasks arises not from scale alone, but from **primitive completeness**: they are the minimal architecture realizing the full set of inference primitives.
## Trilogy Context
- **Paper I** (this): Existence + internal geometry of exact Bayesian inference in transformers
- **Paper II**: Bayesian geometry arises generically from gradient dynamics under cross-entropy
- **Paper III**: How primitives compose in partially observed settings (closer to natural language)