20260601
61
raw/papers/agarwal-bayesian-attention-geometry-2026.md
Normal file
@@ -0,0 +1,61 @@
---
title: "The Bayesian Geometry of Transformer Attention"
authors: "Naman Agarwal, Siddhartha R. Dalal, Vishal Misra"
arxiv: "2512.22471"
venue: "arXiv (cs.LG)"
date: "2026-05"
type: "paper"
series: "Bayesian Attention Trilogy, Paper I"
---

# The Bayesian Geometry of Transformer Attention

**Paper I of the Bayesian Attention Trilogy**

**Authors**: Naman Agarwal (Dream Sports → Google DeepMind), Siddhartha R. Dalal (Columbia), Vishal Misra (Columbia)

## TL;DR

Small transformers reproduce the analytic Bayesian posterior to within 10⁻³–10⁻⁴ bits in **Bayesian wind tunnels**: controlled environments where the true posterior is known in closed form and memorization is provably impossible. MLPs miss the posterior by orders of magnitude.

## Core Framework: Bayesian Wind Tunnels

Controlled prediction tasks where:
1. The analytic posterior is known exactly at each step
2. The hypothesis space is too large for memorization
3. In-context prediction requires genuine probabilistic inference

This turns "does it do Bayes?" into a quantitative test: **does the model's predictive entropy match the analytic posterior entropy?**

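
The entropy test above can be sketched on a toy task. The sketch below assumes a bijection-learning setup with a uniform prior over all bijections on `n_symbols` symbols, so the analytic predictive for an unseen input is uniform over the outputs not yet used. The function names and the gap metric (mean absolute entropy difference, in bits) are illustrative assumptions, not the paper's code.

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def analytic_predictive(n_symbols, observed, query):
    """Exact posterior predictive over outputs for `query`.

    Assumption: uniform prior over all bijections on n_symbols symbols.
    `observed` maps inputs already seen in context to their outputs.
    """
    if query in observed:
        # A seen input is fully determined by the context.
        return [1.0 if y == observed[query] else 0.0 for y in range(n_symbols)]
    used = set(observed.values())
    free = n_symbols - len(used)
    # An unseen input is uniform over the outputs that remain unused.
    return [0.0 if y in used else 1.0 / free for y in range(n_symbols)]

def entropy_gap_bits(model_preds, analytic_preds):
    """Mean absolute gap between model and analytic predictive entropy."""
    gaps = [
        abs(entropy_bits(m) - entropy_bits(a))
        for m, a in zip(model_preds, analytic_preds)
    ]
    return sum(gaps) / len(gaps)
```

In use, `model_preds` would hold a trained model's predictive distributions at each context position; a gap near zero means the model's uncertainty tracks the analytic posterior step by step.
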
## Three Inference Primitives

| Primitive | Definition | Required for |
|-----------|-----------|-------------|
| Belief Accumulation | Integrating evidence into running posterior | Bijection learning, HMM |
| Belief Transport | Propagating beliefs through stochastic dynamics | HMM filtering |
| Random-Access Binding | Retrieving by content, not position | Associative recall |

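
To make the first two primitives concrete, here is a minimal sketch of the standard HMM forward filter, split so that each step maps onto one primitive: `transport` pushes the belief through the transition matrix, and `accumulate` applies the Bayes update for a new observation. The function names and the list-of-lists matrix layout are assumptions made for illustration; this is the textbook recursion, not a description of what any trained network computes internally.

```python
def transport(belief, transition):
    """Belief transport: propagate the belief one step through the dynamics.

    transition[i][j] is P(next state = j | current state = i).
    """
    n = len(belief)
    return [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]

def accumulate(belief, likelihood):
    """Belief accumulation: Bayes update with the new observation's likelihood."""
    unnorm = [b * l for b, l in zip(belief, likelihood)]
    z = sum(unnorm)
    if z == 0.0:
        raise ValueError("observation has zero probability under the belief")
    return [u / z for u in unnorm]

def hmm_filter(prior, transition, emission, observations):
    """Return P(state_t | observations so far) after the last observation.

    emission[i][o] is P(observation = o | state = i).
    """
    belief = list(prior)
    for t, obs in enumerate(observations):
        if t > 0:
            belief = transport(belief, transition)
        belief = accumulate(belief, [row[obs] for row in emission])
    return belief
```

With an identity transition matrix the transport step does nothing and the filter reduces to pure accumulation, which is why static tasks need only the first primitive.
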
## Architectural Realizability

| Architecture | Accumulation | Transport | Binding | Status |
|-------------|:---:|:---:|:---:|--------|
| Transformer | ✅ | ✅ | ✅ | Full primitive completeness |
| Mamba (SSM) | ✅ | ✅ | ❌ | SOTA on HMM filtering; fails binding |
| LSTM | ✅ | ❌ | ❌ | Only static sufficient statistics |
| MLP | ❌ | ❌ | ❌ | Fails uniformly |

## Key Geometric Findings

- **Orthogonal key bases** in attention heads
- **Low-dimensional value manifold** parameterized by posterior entropy
- Mamba's final layer organizes into **5 clusters**, one per HMM hidden state (corner geometry of belief simplex)

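
One way to see why orthogonal keys support content-based retrieval is a toy softmax-attention lookup. The sketch below is generic dot-product attention over an idealized orthonormal key set; the `scale` parameter, the function names, and the example keys and values are assumptions chosen for illustration and are not taken from the paper's trained models.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values, scale=1.0):
    """Single-query dot-product attention: weight values by key similarity."""
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Idealized case: keys form an orthonormal basis, so a query equal to one
# key has zero overlap with every other key.
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
retrieved = attend(keys[2], keys, values, scale=50.0)
```

Because the query overlaps only with its matching key, a large `scale` concentrates almost all attention weight on that slot, and `retrieved` is close to `values[2]` regardless of where the slot sits in the sequence.
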
## Structural Theorem

> The dominance of transformers in reasoning tasks arises not from scale alone, but from **primitive completeness**: they are the minimal architecture realizing the full set of inference primitives.

## Trilogy Context

- **Paper I** (this): Existence + internal geometry of exact Bayesian inference in transformers
- **Paper II**: Bayesian geometry arises generically from gradient dynamics under cross-entropy
- **Paper III**: How primitives compose in partially observed settings (closer to natural language)