---
title: "The Bayesian Geometry of Transformer Attention"
authors: "Naman Agarwal, Siddhartha R. Dalal, Vishal Misra"
arxiv: "2512.22471"
venue: "arXiv (cs.LG)"
date: "2026-05"
type: "paper"
series: "Bayesian Attention Trilogy, Paper I"
---

# The Bayesian Geometry of Transformer Attention

**Paper I of the Bayesian Attention Trilogy**

**Authors**: Naman Agarwal (Dream Sports → Google DeepMind), Siddhartha R. Dalal (Columbia), Vishal Misra (Columbia)

## TL;DR

Small transformers match the exact Bayesian posterior to 10⁻³–10⁻⁴ bit accuracy in **Bayesian wind tunnels**: controlled environments where the true posterior is known in closed form and memorization is provably impossible. MLPs miss the posterior by orders of magnitude.

## Core Framework: Bayesian Wind Tunnels

A Bayesian wind tunnel is a controlled prediction task in which:
1. The analytic posterior is known exactly at each step.
2. The hypothesis space is too large to memorize.
3. In-context prediction requires genuine probabilistic inference.

This setup converts the question "does it do Bayes?" into a quantitative test: **does the model's predictive entropy match the analytic posterior entropy?**
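The entropy test can be sketched in a few lines. The example below assumes a bijection-learning task with a uniform prior over bijections on `V` symbols, so that after `k` distinct input-output pairs the answer for an unseen input is uniform over the `V - k` remaining outputs. The function names and the toy "ideal" predictor are hypothetical stand-ins for illustration, not the paper's evaluation code.

```python
import math

def analytic_entropy_bits(vocab_size, num_observed):
    """Posterior predictive entropy (bits) for an unseen query input.

    Assumes a uniform prior over bijections on `vocab_size` symbols and
    `num_observed` distinct pairs already in context, so the answer is
    uniform over the remaining outputs.
    """
    remaining = vocab_size - num_observed
    if remaining < 1:
        raise ValueError("no unobserved outputs remain")
    return math.log2(remaining)

def entropy_bits(probs):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def mean_abs_entropy_error(model_dists, vocab_size):
    """Average |H_model - H_analytic| over context sizes k = 0, 1, ..."""
    errors = [
        abs(entropy_bits(dist) - analytic_entropy_bits(vocab_size, k))
        for k, dist in enumerate(model_dists)
    ]
    return sum(errors) / len(errors)

# Hypothetical stand-in for model outputs: a predictor that is exactly Bayesian.
V = 8
ideal = [[1.0 / (V - k)] * (V - k) + [0.0] * k for k in range(V)]

# A predictor that ignores the context and always outputs the uniform prior.
uniform = [[1.0 / V] * V for _ in range(V)]

print("ideal predictor error (bits):  ", mean_abs_entropy_error(ideal, V))
print("uniform predictor error (bits):", mean_abs_entropy_error(uniform, V))
```

In a real experiment, `model_dists` would hold the model's softmax outputs at each context position. The context-ignoring predictor shows what a large calibration gap looks like.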

## Three Inference Primitives

| Primitive | Definition | Required for |
|-----------|-----------|-------------|
| Belief Accumulation | Integrating evidence into a running posterior | Bijection learning, HMM |
| Belief Transport | Propagating beliefs through stochastic dynamics | HMM filtering |
| Random-Access Binding | Retrieving by content, not position | Associative recall |
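The first two primitives correspond to the two steps of textbook HMM forward filtering, which the following sketch makes concrete. It is a standard forward filter, not the paper's code, and the 2-state HMM parameters are made up for illustration. The `transport` step pushes the belief through the transition matrix, and the `accumulate` step applies Bayes' rule to one observation.

```python
def transport(belief, transition):
    """Belief transport: propagate the belief through the dynamics.

    transition[i][j] = P(next state = j | current state = i).
    """
    n = len(belief)
    return [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]

def accumulate(belief, likelihood):
    """Belief accumulation: Bayes update with one observation's likelihood."""
    unnorm = [b * l for b, l in zip(belief, likelihood)]
    z = sum(unnorm)
    if z == 0.0:
        raise ValueError("observation has zero probability under the belief")
    return [u / z for u in unnorm]

def forward_filter(prior, transition, emission, observations):
    """Return P(state_t | obs_1..t) for each step t.

    emission[i][o] = P(observation = o | state = i).
    """
    belief = list(prior)
    history = []
    for t, obs in enumerate(observations):
        if t > 0:
            belief = transport(belief, transition)
        belief = accumulate(belief, [row[obs] for row in emission])
        history.append(belief)
    return history

# Toy 2-state HMM with made-up parameters (illustrative only).
prior = [0.5, 0.5]
transition = [[0.9, 0.1], [0.2, 0.8]]
emission = [[0.8, 0.2], [0.3, 0.7]]

beliefs = forward_filter(prior, transition, emission, [0, 0, 1])
for step, belief in enumerate(beliefs):
    print(step, [round(p, 4) for p in belief])
```

A task with static hypotheses needs only `accumulate`. Filtering needs both steps because the hidden state moves between observations.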

## Architectural Realizability

| Architecture | Accumulation | Transport | Binding | Status |
|-------------|:---:|:---:|:---:|--------|
| Transformer | ✅ | ✅ | ✅ | Full primitive completeness |
| Mamba (SSM) | ✅ | ✅ | ❌ | SOTA on HMM filtering; fails binding |
| LSTM | ✅ | ❌ | ❌ | Only static sufficient statistics |
| MLP | ❌ | ❌ | ❌ | Fails uniformly |

## Key Geometric Findings

- **Orthogonal key bases** in attention heads
- **Low-dimensional value manifold** parameterized by posterior entropy
- Mamba's final layer organizes into **5 clusters**, one per HMM hidden state (the corners of the belief simplex)
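One way to see why orthogonal keys suit content-based retrieval is the toy dot-product attention sketch below. It illustrates the general mechanism under assumed one-hot keys and a hand-picked `scale`. It is not a reconstruction of the paper's analysis or of any trained model.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attend(query, keys, values, scale=1.0):
    """Single-head dot-product attention over lists of key/value vectors."""
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, weights

# Hypothetical toy memory: three items with one-hot (orthonormal) keys.
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
values = [[10.0], [20.0], [30.0]]

# Query by content: ask for the item whose key equals the second key.
query = keys[1]
out, weights = attend(query, keys, values, scale=20.0)

# Storing the same items in reverse order does not change the answer,
# because the match depends on key content rather than position.
out_reversed, _ = attend(query, keys[::-1], values[::-1], scale=20.0)

print("weights:", [round(w, 6) for w in weights])
print("retrieved:", out, "reversed order:", out_reversed)
```

Because the keys are orthogonal, the query has zero overlap with every non-matching key. Nearly all of the attention weight therefore lands on the matching item once the scores are scaled up.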

## Structural Theorem

> The dominance of transformers in reasoning tasks arises not from scale alone, but from **primitive completeness**: they are the minimal architecture realizing the full set of inference primitives.

## Trilogy Context

- **Paper I** (this): Existence + internal geometry of exact Bayesian inference in transformers
- **Paper II**: Bayesian geometry arises generically from gradient dynamics under cross-entropy
- **Paper III**: How primitives compose in partially observed settings (closer to natural language)