---
title: "The Bayesian Geometry of Transformer Attention"
authors: "Naman Agarwal, Siddhartha R. Dalal, Vishal Misra"
arxiv: "2512.22471"
venue: "arXiv (cs.LG)"
date: "2026-05"
type: "paper"
series: "Bayesian Attention Trilogy, Paper I"
---
# The Bayesian Geometry of Transformer Attention
**Paper I of the Bayesian Attention Trilogy**
**Authors**: Naman Agarwal (Dream Sports → Google DeepMind), Siddhartha R. Dalal (Columbia), Vishal Misra (Columbia)
## TL;DR
Small transformers reproduce exact Bayesian posteriors to within 10⁻³ to 10⁻⁴ bits in **Bayesian wind tunnels**: controlled environments where the true posterior is known in closed form and memorization is provably impossible. MLPs miss the posterior by orders of magnitude.
## Core Framework: Bayesian Wind Tunnels
Controlled prediction tasks where:
1. Analytic posterior is known exactly at each step
2. Hypothesis space is too large for memorization
3. In-context prediction requires genuine probabilistic inference

This setup converts "does it do Bayes?" into a quantitative test: **does the model's predictive entropy match the analytic posterior entropy?**
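
The entropy test can be sketched for a bijection-learning task. This is a minimal illustration, not the paper's evaluation code: it assumes a uniform prior over all bijections on a small vocabulary, and the names `bijection_posterior_predictive` and `entropy_gap_bits` are hypothetical. The `model` distribution is a stand-in for a trained network's softmax output.

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def bijection_posterior_predictive(vocab_size, observed_pairs, query):
    """Analytic posterior predictive over outputs for `query`.

    Assumes a uniform prior over all bijections on range(vocab_size).
    `observed_pairs` maps inputs seen so far to their outputs.
    """
    if query in observed_pairs:
        # Seen input: the output is known with certainty.
        known = observed_pairs[query]
        return [1.0 if y == known else 0.0 for y in range(vocab_size)]
    used = set(observed_pairs.values())
    remaining = vocab_size - len(used)
    # Unseen input: uniform over the outputs not yet claimed.
    return [0.0 if y in used else 1.0 / remaining for y in range(vocab_size)]

def entropy_gap_bits(model_probs, analytic_probs):
    """Absolute gap between model and analytic predictive entropy, in bits."""
    return abs(entropy_bits(model_probs) - entropy_bits(analytic_probs))

# Two of eight mappings revealed, so six outputs remain for an unseen input.
analytic = bijection_posterior_predictive(8, {0: 3, 1: 5}, query=2)
model = list(analytic)  # placeholder for a real model's predictive distribution
print(entropy_bits(analytic))           # analytic entropy, log2(6) bits
print(entropy_gap_bits(model, analytic))
```

In a real evaluation, `model` would come from the network at each context position, and the gap would be tracked across the sequence as evidence accumulates.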
## Three Inference Primitives
| Primitive | Definition | Required for |
|-----------|-----------|-------------|
| Belief Accumulation | Integrating evidence into running posterior | Bijection learning, HMM |
| Belief Transport | Propagating beliefs through stochastic dynamics | HMM filtering |
| Random-Access Binding | Retrieving by content, not position | Associative recall |
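
To make the three primitives concrete, the sketch below writes each one as a small function. It uses the textbook HMM forward filter for transport and accumulation and a plain dictionary lookup for binding; the function names and the 2-state HMM parameters are illustrative assumptions, not taken from the paper.

```python
def transport(belief, transition):
    """Belief transport: push the belief through the stochastic dynamics.

    transition[i][j] = P(next state = j | current state = i).
    """
    n = len(belief)
    return [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]

def accumulate(belief, likelihood):
    """Belief accumulation: Bayes update of the running posterior."""
    unnorm = [b * l for b, l in zip(belief, likelihood)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def filter_step(belief, transition, emission, obs):
    """One HMM filtering step: transport, then accumulate the observation."""
    predicted = transport(belief, transition)
    likelihood = [emission[s][obs] for s in range(len(belief))]
    return accumulate(predicted, likelihood)

def bind(memory, key):
    """Random-access binding: retrieve by content (the key), not by position."""
    return memory[key]

# Hypothetical 2-state HMM used only for illustration.
transition = [[0.9, 0.1], [0.2, 0.8]]
emission = [[0.8, 0.2], [0.3, 0.7]]  # emission[state][symbol]
posterior = filter_step([0.5, 0.5], transition, emission, obs=0)
print(posterior)

# Insertion order does not matter when retrieval is keyed by content.
print(bind({"b": 2, "a": 1}, "a"))
```

HMM filtering needs both `transport` and `accumulate` at every step, which matches the table's listing of HMM tasks under both primitives.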
## Architectural Realizability
| Architecture | Accumulation | Transport | Binding | Status |
|-------------|:---:|:---:|:---:|--------|
| Transformer | ✅ | ✅ | ✅ | Full primitive completeness |
| Mamba (SSM) | ✅ | ✅ | ❌ | SOTA on HMM filtering; fails binding |
| LSTM | ✅ | ❌ | ❌ | Only static sufficient statistics |
| MLP | ❌ | ❌ | ❌ | Fails uniformly |
## Key Geometric Findings
- **Orthogonal key bases** in attention heads
- **Low-dimensional value manifold** parameterized by posterior entropy
- Mamba's final layer organizes into **5 clusters** — one per HMM hidden state (corner geometry of belief simplex)
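
One way to see why orthogonal key bases matter for content-based retrieval is a toy dot-product attention read. This is an illustrative sketch rather than the paper's analysis: the one-hot keys, the scalar values, and the `scale` setting are all assumptions chosen to make the effect visible.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def attend(query, keys, values, scale=1.0):
    """Single-query dot-product attention; returns (output, weights)."""
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights

# Orthonormal keys: each stored item occupies its own direction.
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
values = [[10.0], [20.0], [30.0]]

# Querying with the second key scores zero against every other key,
# so a sharp softmax puts nearly all weight on the matching value.
output, weights = attend(keys[1], keys, values, scale=20.0)
print(output, weights)
```

With non-orthogonal keys the off-target scores would be nonzero and the read would blend several values, which is one intuition for why orthogonality supports clean binding.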
## Structural Theorem
> The dominance of transformers in reasoning tasks arises not from scale alone, but from **primitive completeness**: they are the minimal architecture realizing the full set of inference primitives.
## Trilogy Context
- **Paper I** (this): Existence + internal geometry of exact Bayesian inference in transformers
- **Paper II**: Bayesian geometry arises generically from gradient dynamics under cross-entropy
- **Paper III**: How primitives compose in partially observed settings (closer to natural language)