myWiki/raw/papers/agarwal-bayesian-attention-geometry-2026.md
2026-06-01 10:46:01 +08:00

| title | authors | arxiv | venue | date | type | series |
| --- | --- | --- | --- | --- | --- | --- |
| The Bayesian Geometry of Transformer Attention | Naman Agarwal, Siddhartha R. Dalal, Vishal Misra | 2512.22471 | arXiv (cs.LG) | 2026-05 | paper | Bayesian Attention Trilogy, Paper I |

The Bayesian Geometry of Transformer Attention

Paper I of the Bayesian Attention Trilogy

Authors: Naman Agarwal (Dream Sports → Google DeepMind), Siddhartha R. Dalal (Columbia), Vishal Misra (Columbia)

TL;DR

Small transformers reproduce the exact Bayesian posterior to within 10⁻³ to 10⁻⁴ bits in Bayesian wind tunnels: controlled environments where the true posterior is known in closed form and memorization is provably impossible. MLPs, by contrast, fail by orders of magnitude.

Core Framework: Bayesian Wind Tunnels

Controlled prediction tasks where:

  1. Analytic posterior is known exactly at each step
  2. Hypothesis space is too large for memorization
  3. In-context prediction requires genuine probabilistic inference

This setup converts the question "does it do Bayes?" into a quantitative test: does the model's predictive entropy match the analytic posterior entropy at each step?
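
The entropy-matching test can be sketched for one concrete case. The snippet below is a minimal illustration, not the paper's evaluation code. It assumes a bijection-learning task with a uniform prior over bijections on `vocab_size` symbols, so that after `num_seen` distinct input-output pairs the prediction for a new input is uniform over the remaining outputs. The function names and the stand-in "model" distributions are hypothetical.

```python
import math


def entropy_bits(probs):
    """Shannon entropy, in bits, of a probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def analytic_bijection_entropy(vocab_size, num_seen):
    """Analytic predictive entropy for an unseen input (assumed task setup).

    Under a uniform prior over bijections, the vocab_size - num_seen outputs
    not yet observed are equally likely, so the entropy is log2 of that count.
    """
    return math.log2(vocab_size - num_seen)


def entropy_gap_bits(model_predictions, vocab_size):
    """Mean absolute gap (bits) between model and analytic entropy.

    model_predictions[k] is the model's predictive distribution after
    k distinct pairs have been observed in context.
    """
    gaps = [
        abs(entropy_bits(probs) - analytic_bijection_entropy(vocab_size, k))
        for k, probs in enumerate(model_predictions)
    ]
    return sum(gaps) / len(gaps)


V = 4
# Hypothetical "ideal" model: uniform over the V - k outputs still possible.
ideal = [[1 / (V - k)] * (V - k) + [0.0] * k for k in range(V)]
# Hypothetical model that ignores context and always predicts uniformly.
ignores_context = [[1 / V] * V for _ in range(V)]

print("ideal gap (bits):", entropy_gap_bits(ideal, V))
print("context-blind gap (bits):", entropy_gap_bits(ignores_context, V))
```

A model that tracks the posterior drives this gap toward zero, while one that ignores the context leaves a gap that grows as evidence accumulates.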

Three Inference Primitives

| Primitive | Definition | Required for |
| --- | --- | --- |
| Belief Accumulation | Integrating evidence into running posterior | Bijection learning, HMM |
| Belief Transport | Propagating beliefs through stochastic dynamics | HMM filtering |
| Random-Access Binding | Retrieving by content, not position | Associative recall |
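
As a rough illustration of what each primitive computes, the sketch below writes them as plain functions and composes the first two into a standard HMM forward filter. This is a textbook formulation chosen for clarity, not a description of how any architecture implements the primitives internally. All function names and the toy HMM parameters are assumptions.

```python
def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def accumulate(belief, likelihood):
    """Belief accumulation: Bayes update, posterior proportional to prior * likelihood."""
    return normalize([b * l for b, l in zip(belief, likelihood)])


def transport(belief, transition):
    """Belief transport: push a belief through stochastic dynamics.

    transition[i][j] = P(next state j | current state i).
    """
    n = len(belief)
    return [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]


def bind(memory, query_key):
    """Random-access binding: look a value up by key content, not by position."""
    for key, value in memory:
        if key == query_key:
            return value
    return None


def hmm_filter(prior, transition, emission, observations):
    """HMM forward filter: alternate transport and accumulation.

    emission[s][o] = P(observation o | hidden state s).
    """
    belief = list(prior)
    for t, obs in enumerate(observations):
        if t > 0:
            belief = transport(belief, transition)
        belief = accumulate(belief, [emission[s][obs] for s in range(len(belief))])
    return belief


# Toy two-state HMM (illustrative numbers only).
prior = [0.5, 0.5]
transition = [[0.9, 0.1], [0.1, 0.9]]
emission = [[0.8, 0.2], [0.2, 0.8]]
posterior = hmm_filter(prior, transition, emission, [0, 0])
print("filtered belief:", posterior)
print("recall:", bind([("a", 1), ("b", 2)], "b"))
```

HMM filtering needs both `accumulate` and `transport`, which matches the "Required for" column above, while associative recall needs only `bind`.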

Architectural Realizability

| Architecture | Accumulation | Transport | Binding | Status |
| --- | --- | --- | --- | --- |
| Transformer | | | | Full primitive completeness |
| Mamba (SSM) | | | | SOTA on HMM filtering; fails binding |
| LSTM | | | | Only static sufficient statistics |
| MLP | | | | Fails uniformly |

Key Geometric Findings

  • Orthogonal key bases in attention heads
  • Low-dimensional value manifold parameterized by posterior entropy
  • Mamba's final layer organizes into 5 clusters — one per HMM hidden state (corner geometry of belief simplex)
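
One simple way to probe the first finding is to measure how far a set of key vectors is from mutually orthogonal. The sketch below is a hypothetical diagnostic, not the paper's analysis procedure: it reports the largest absolute cosine similarity between any two distinct keys, which is 0 for a perfectly orthogonal set. The example vectors are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def max_off_diagonal_cosine(keys):
    """Largest |cosine| over all pairs of distinct key vectors."""
    worst = 0.0
    for i in range(len(keys)):
        for j in range(i + 1, len(keys)):
            worst = max(worst, abs(cosine(keys[i], keys[j])))
    return worst


# Made-up key vectors: one orthogonal set, one overlapping set.
orthogonal_keys = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
overlapping_keys = [[1.0, 0.0], [1.0, 1.0]]

print("orthogonal set:", max_off_diagonal_cosine(orthogonal_keys))
print("overlapping set:", max_off_diagonal_cosine(overlapping_keys))
```

In practice the keys would come from a trained head's key projections; the threshold for calling a set "approximately orthogonal" is a judgment call left open here.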

Structural Theorem

The dominance of transformers in reasoning tasks arises not from scale alone, but from primitive completeness: they are the minimal architecture realizing the full set of inference primitives.

Trilogy Context

  • Paper I (this): Existence + internal geometry of exact Bayesian inference in transformers
  • Paper II: Bayesian geometry arises generically from gradient dynamics under cross-entropy
  • Paper III: How primitives compose in partially observed settings (closer to natural language)