title	authors	arxiv	venue	date	type	series
The Bayesian Geometry of Transformer Attention	Naman Agarwal, Siddhartha R. Dalal, Vishal Misra	2512.22471	arXiv (cs.LG)	2026-05	paper	Bayesian Attention Trilogy, Paper I

The Bayesian Geometry of Transformer Attention

Paper I of the Bayesian Attention Trilogy

Authors: Naman Agarwal (Dream Sports → Google DeepMind), Siddhartha R. Dalal (Columbia), Vishal Misra (Columbia)

TL;DR

Small transformers match the exact Bayesian posterior to within 10⁻³–10⁻⁴ bits in Bayesian wind tunnels: controlled environments where the true posterior is known in closed form and memorization is provably impossible. MLPs miss by orders of magnitude.

Core Framework: Bayesian Wind Tunnels

Controlled prediction tasks where:

Analytic posterior is known exactly at each step
Hypothesis space is too large for memorization
In-context prediction requires genuine probabilistic inference

This setup converts the question "does it do Bayes?" into a quantitative test: does the model's predictive entropy match the analytic posterior entropy?
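The entropy-matching test can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: it assumes a bijection-learning task with a uniform prior over bijections, and the names `analytic_bijection_entropy`, `entropy_gap_bits`, and the stand-in `model_probs` are hypothetical.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def analytic_bijection_entropy(vocab_size, num_pairs_seen, query_already_seen):
    """Posterior predictive entropy for a hypothetical bijection task.

    Assumption: uniform prior over all bijections on `vocab_size` symbols,
    with `num_pairs_seen` distinct input-output pairs already in context.
    """
    if query_already_seen:
        return 0.0  # the output is fully determined by the context
    remaining = vocab_size - num_pairs_seen
    return math.log2(remaining)  # uniform over the outputs not yet used


def entropy_gap_bits(model_probs, analytic_entropy):
    """Absolute gap, in bits, between model and analytic entropy."""
    return abs(entropy_bits(model_probs) - analytic_entropy)


# Stand-in for a model's predictive distribution: 4 outputs are already
# used, and the remaining 16 are predicted uniformly.
vocab_size, num_pairs_seen = 20, 4
model_probs = [0.0] * 4 + [1.0 / 16] * 16
target = analytic_bijection_entropy(vocab_size, num_pairs_seen, False)
gap = entropy_gap_bits(model_probs, target)
print(f"entropy gap: {gap:.6f} bits")
```

In a real evaluation, `model_probs` would come from the trained model's output at each context position, and the gap would be averaged over positions and sequences.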

Three Inference Primitives

Primitive	Definition	Required for
Belief Accumulation	Integrating evidence into running posterior	Bijection learning, HMM
Belief Transport	Propagating beliefs through stochastic dynamics	HMM filtering
Random-Access Binding	Retrieving by content, not position	Associative recall
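The first two primitives correspond to the two halves of a standard HMM forward-filter step: a prediction through the transition matrix (transport) followed by a Bayes update on the new observation (accumulation). The sketch below uses the textbook recursion with made-up matrices; the function name `hmm_filter_step` and all numbers are illustrative assumptions, not taken from the paper.

```python
def hmm_filter_step(belief, transition, emission, observation):
    """One forward-filter step for a discrete HMM.

    belief[i]        = P(state = i | observations so far)
    transition[i][j] = P(next state = j | current state = i)
    emission[j][o]   = P(observation = o | state = j)
    """
    num_states = len(belief)
    # Belief transport: push the belief through the stochastic dynamics.
    predicted = [
        sum(belief[i] * transition[i][j] for i in range(num_states))
        for j in range(num_states)
    ]
    # Belief accumulation: fold the new evidence into the posterior.
    unnormalized = [
        predicted[j] * emission[j][observation] for j in range(num_states)
    ]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Illustrative 2-state HMM (values chosen arbitrarily).
transition = [[0.9, 0.1], [0.2, 0.8]]
emission = [[0.7, 0.3], [0.1, 0.9]]
belief = [0.5, 0.5]
for obs in [0, 0, 1]:
    belief = hmm_filter_step(belief, transition, emission, obs)
print(belief)
```

A task with static hypotheses, such as bijection learning, only needs the accumulation half; HMM filtering needs both.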

Architectural Realizability

Architecture	Accumulation	Transport	Binding	Status
Transformer	✅	✅	✅	Full primitive completeness
Mamba (SSM)	✅	✅	❌	SOTA on HMM filtering; fails binding
LSTM	✅	❌	❌	Only static sufficient statistics
MLP	❌	❌	❌	Fails uniformly

Key Geometric Findings

Orthogonal key bases in attention heads
Low-dimensional value manifold parameterized by posterior entropy
Mamba's final-layer representations organize into 5 clusters, one per HMM hidden state (the corners of the belief simplex)
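One way to see why orthogonal keys are compatible with random-access binding is a toy softmax-attention lookup: when keys are mutually orthogonal, a query aligned with one key puts almost all weight on that key's value, regardless of position. This is an illustrative sketch only, not the paper's analysis; the `attend` helper and the `scale` value are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, scale=1.0):
    """Single-query dot-product attention; returns (output, weights)."""
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights


# Orthogonal keys: each stored item occupies its own direction.
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
values = [[10.0], [20.0], [30.0]]

# A query matching the second key retrieves the second value by content.
output, weights = attend([0.0, 1.0, 0.0], keys, values, scale=20.0)
print(output, weights)
```

Shuffling the order of `keys` and `values` together leaves the retrieved value unchanged, which is the sense in which the lookup is by content rather than by position.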

Structural Theorem

The dominance of transformers in reasoning tasks arises not from scale alone, but from primitive completeness: they are the minimal architecture realizing the full set of inference primitives.

Trilogy Context

Paper I (this paper): Existence and internal geometry of exact Bayesian inference in transformers
Paper II: Bayesian geometry arises generically from gradient dynamics under cross-entropy
Paper III: How primitives compose in partially observed settings (closer to natural language)
