
title: Efficient Pre-Training with Token Superposition
created: 2026-05-29
type: paper-raw
arxiv: 2605.06546
authors: Bowen Peng, Théo Gigant, Jeffrey Quesnelle
venue: arXiv preprint (cs.CL), v2, May 2026
affiliation: Nous Research
tags: pre-training, efficiency, token-superposition, LLM

Efficient Pre-Training with Token Superposition

Authors: Bowen Peng*, Théo Gigant*, Jeffrey Quesnelle (* equal contribution)
Affiliation: Nous Research
arXiv: 2605.06546 (v2, 19 May 2026)
Category: cs.CL (Computation and Language)

Abstract

Pre-training of Large Language Models is often prohibitively expensive and inefficient at scale, requiring complex and invasive modifications in order to achieve high data throughput. In this work, we present Token-Superposition Training (TST), a simple drop-in method that significantly improves the data throughput per FLOPs during pre-training without modifying the parallelism, optimizer, tokenizer, data, or model architecture. TST is done in two phases: (i) A highly efficient superposition phase where we combine many contiguous tokens into one bag and train using a multi-hot cross-entropy (MCE) objective, and (ii) a recovery phase where we revert back to standard training. We extensively evaluate TST on the scale of 270M and 600M parameters and validate on 3B and a 10B A1B mixture of experts model, demonstrating that it is highly robust in different settings. Ultimately, TST consistently outperforms baseline loss and downstream evaluations, and under equal-loss settings, TST yields up to a 2.5x reduction in total pre-training time at the 10B A1B scale.
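
As a rough illustration of the two phases described in the abstract, the sketch below groups a token stream into contiguous bags and decides which phase a training step belongs to. It is not the authors' code. The names `bag_tokens`, `phase_for_step`, and `superposition_fraction` are hypothetical, calling the bag size `s` is an assumption, and so is the choice to drop a trailing partial bag.

```python
def bag_tokens(token_ids, s):
    """Group a token sequence into contiguous bags of s tokens each.

    Assumption: trailing tokens that do not fill a whole bag are dropped.
    """
    if s < 1:
        raise ValueError("s must be a positive integer")
    usable = len(token_ids) - len(token_ids) % s
    return [token_ids[i:i + s] for i in range(0, usable, s)]


def phase_for_step(step, total_steps, superposition_fraction):
    """Return the TST phase that a training step falls in.

    superposition_fraction is a hypothetical knob: the share of steps spent
    in the superposition phase before reverting to standard training.
    """
    switch_step = int(total_steps * superposition_fraction)
    return "superposition" if step < switch_step else "recovery"


tokens = [11, 42, 42, 7, 3, 19, 8, 5, 23]
bags = bag_tokens(tokens, s=4)
print(bags)
print(phase_for_step(100, total_steps=1000, superposition_fraction=0.3))
print(phase_for_step(900, total_steps=1000, superposition_fraction=0.3))
```

With `s=4`, the nine example tokens yield two full bags, so eight tokens occupy two sequence positions and the ninth is dropped. Step 100 falls in the superposition phase and step 900 in the recovery phase.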

Key Contributions

  1. Token-Superposition Training (TST): A two-phase drop-in method that increases token throughput per FLOP by s× without modifying the parallelism, optimizer, tokenizer, data, or model architecture
  2. Multi-hot Cross-Entropy (MCE): Novel loss function for predicting bags of tokens simultaneously
  3. Representation Alignment Hypothesis: Shared embeddings across phases are critical; re-initializing them destroys the gains
  4. Extensive scaling validation: 270M→600M→3B→10B A1B MoE, with up to a 2.5× reduction in total pre-training time at the 10B A1B scale under equal-loss settings
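
One plausible reading of the multi-hot cross-entropy objective in item 2 is a cross-entropy against a target that spreads its mass uniformly over the tokens of a bag. The sketch below implements that reading in plain Python. The uniform 1/s weighting, the handling of repeated tokens (a token appearing twice gets weight 2/s), and the function names are assumptions made for illustration; the paper's exact formulation may differ.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def multi_hot_cross_entropy(logits, bag):
    """Cross-entropy against a uniform target over the token ids in `bag`.

    Assumption: each of the s entries in the bag carries weight 1/s.
    """
    if not bag:
        raise ValueError("bag must contain at least one token id")
    log_probs = log_softmax(logits)
    return -sum(log_probs[t] for t in bag) / len(bag)


# Uniform logits over a 4-token vocabulary: every token has probability 1/4,
# so the loss equals log(4) regardless of which tokens are in the bag.
uniform_loss = multi_hot_cross_entropy([0.0, 0.0, 0.0, 0.0], [1, 3])
print(uniform_loss, math.log(4))

# A bag of size one reduces to the standard next-token cross-entropy.
logits = [2.0, -1.0, 0.5, 0.0]
print(multi_hot_cross_entropy(logits, [2]), -log_softmax(logits)[2])
```

Under this weighting, a bag of size one recovers ordinary cross-entropy, which is consistent with the recovery phase being plain standard training.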