title: Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
source: arXiv
source_id: 2405.21060
authors: Tri Dao (Princeton University), Albert Gu (Carnegie Mellon University)
published: 2024-05-31
venue: ICML 2024
categories: cs.LG

Transformers are SSMs

Abstract

While Transformers dominate language modeling, state-space models (SSMs) such as Mamba have matched or outperformed them at small to medium scale. This paper shows that the two model families are closely related: SSMs and variants of attention are connected through decompositions of structured semiseparable matrices, a relationship the authors call structured state space duality (SSD). The SSD framework leads to Mamba-2, whose core layer refines Mamba's selective SSM and is 2-8x faster, while remaining competitive with Transformers on language modeling.

Core Contributions

  1. SSD Framework: Equivalence between SSMs and semiseparable matrices → connects SSM recurrence with attention-like quadratic forms
  2. Structured Masked Attention (SMA): Generalizes linear attention with data-dependent position masks
  3. SSD Algorithm: Block decomposition of semiseparable matrices, leveraging both linear (recurrent) and quadratic (attention-like) forms
  4. Mamba-2 Architecture: Multi-head SSM design with tensor parallelism support
  5. Systems Optimizations: Tensor parallelism (TP), sequence parallelism, and variable-length training
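
The duality in contribution 1 can be checked numerically on a toy case. The following is a minimal NumPy sketch, assuming a single input channel and a scalar per-step decay a_t (the scalar SSM setting); the function names are illustrative and not taken from the paper's code. It computes the same output once with the linear recurrence and once as multiplication by a masked, attention-like matrix.

```python
import numpy as np

def ssm_recurrent(a, B, C, x):
    """Linear form: h_t = a_t * h_{t-1} + B_t * x_t,  y_t = C_t . h_t."""
    T, N = B.shape
    h = np.zeros(N)
    y = np.empty(T)
    for t in range(T):
        h = a[t] * h + B[t] * x[t]
        y[t] = C[t] @ h
    return y

def decay_mask(a):
    """Lower-triangular mask with L[i, j] = a[j+1] * ... * a[i] for i >= j."""
    T = len(a)
    L = np.zeros((T, T))
    for i in range(T):
        for j in range(i + 1):
            # The empty product (i == j) is 1.
            L[i, j] = np.prod(a[j + 1 : i + 1])
    return L

def ssm_quadratic(a, B, C, x):
    """Quadratic form: y = (L * (C B^T)) x, with C, B, x in the roles of Q, K, V."""
    return (decay_mask(a) * (C @ B.T)) @ x

# Toy inputs (assumed shapes: T time steps, state size N).
rng = np.random.default_rng(0)
T, N = 8, 4
a = rng.uniform(0.5, 1.0, size=T)
B = rng.normal(size=(T, N))
C = rng.normal(size=(T, N))
x = rng.normal(size=T)

assert np.allclose(ssm_recurrent(a, B, C, x), ssm_quadratic(a, B, C, x))
```

The recurrent form costs time linear in T, while the quadratic form materializes a T x T matrix; the mask L here plays the role of the data-dependent position mask in SMA.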

Key Concepts

  • Structured State Space Duality (SSD), Semiseparable Matrices
  • Structured Masked Attention (SMA), Linear Attention
  • Selective SSMs, Scalar SSM, Head Structure for SSMs (MIS/MVA/GVA)
  • SSD Algorithm, Block Decomposition, Tensor Contraction Duality
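
The block decomposition idea can be illustrated with a simplified chunked computation. This is a sketch under stated assumptions, not the paper's implementation: it uses a single channel, scalar nonzero decays a_t, a plain Python loop over chunks, and direct ratios of cumulative products (a numerically careful version would work in log space). The name `ssd_chunked` and the `chunk` parameter are hypothetical. Inside each chunk the quadratic (attention-like) form is used; between chunks only a state vector is passed along, as in the recurrent form.

```python
import numpy as np

def ssd_chunked(a, B, C, x, chunk=4):
    """Compute y for h_t = a_t * h_{t-1} + B_t * x_t, y_t = C_t . h_t, chunk by chunk."""
    T, N = B.shape
    assert T % chunk == 0, "sketch assumes T is a multiple of the chunk size"
    h = np.zeros(N)          # state carried across chunk boundaries
    y = np.empty(T)
    for s in range(0, T, chunk):
        ac, Bc, Cc, xc = a[s:s + chunk], B[s:s + chunk], C[s:s + chunk], x[s:s + chunk]
        cum = np.cumprod(ac)  # cum[i] = a_s * ... * a_{s+i}
        # Diagonal block: quadratic form with mask L[i, j] = cum[i] / cum[j] for i >= j.
        L = np.tril(cum[:, None] / cum[None, :])
        y_diag = (L * (Cc @ Bc.T)) @ xc
        # Off-diagonal blocks: effect of the incoming state on this chunk's outputs.
        y_off = cum * (Cc @ h)
        y[s:s + chunk] = y_diag + y_off
        # State at the end of the chunk, passed to the next chunk.
        h = cum[-1] * h + ((cum[-1] / cum) * xc) @ Bc
    return y

# Toy inputs (assumed shapes: T time steps, state size N).
rng = np.random.default_rng(0)
T, N = 8, 4
a = rng.uniform(0.5, 1.0, size=T)
B = rng.normal(size=(T, N))
C = rng.normal(size=(T, N))
x = rng.normal(size=T)

# chunk=1 reduces to the pure recurrence, chunk=T to a single quadratic form.
assert np.allclose(ssd_chunked(a, B, C, x, chunk=1), ssd_chunked(a, B, C, x, chunk=T))
assert np.allclose(ssd_chunked(a, B, C, x, chunk=1), ssd_chunked(a, B, C, x, chunk=4))
```

The chunk size trades off the two forms: larger chunks do more work in dense matrix products, while smaller chunks rely more on the sequential state update.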

URL

https://arxiv.org/abs/2405.21060