Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training
Brian Bartoldson
Siddarth Venkatraman
James Diffenderfer
Moksh Jain
Tal Ben-Nun
Seanie Lee
Minsu Kim
Johan Obando-Ceron
Yoshua Bengio
Bhavya Kailkhura
2025
2503.18929
NeurIPS 2025
Lawrence Livermore National Laboratory
Mila
Université de Montréal
KAIST
CIFAR
paper
2026-05-12
reinforcement-learning
llm-post-training
gflownet
asynchronous-rl
off-policy
Trajectory Balance with Asynchrony
Authors: Brian Bartoldson, Siddarth Venkatraman, James Diffenderfer, Moksh Jain, Tal Ben-Nun, Seanie Lee, Minsu Kim, Johan Obando-Ceron, Yoshua Bengio, Bhavya Kailkhura
Venue: NeurIPS 2025
arXiv: 2503.18929
Code: https://github.com/bbartoldson/TBA
Abstract
Trajectory Balance with Asynchrony (TBA) is an asynchronous RL framework for LLM post-training built on the off-policy Trajectory Balance (TB) objective from GFlowNets. It decouples Searcher nodes (exploration) from Trainer nodes (policy updates) and connects them through a replay buffer. TBA achieves 4×–50× speedups while matching or exceeding on-policy baselines (PPO, GRPO, RLOO, Online DPO).
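The Searcher/Trainer split can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the authors' implementation: the class names, the `sync_every` parameter, and the random stand-ins for LLM sampling and reward scoring are all hypothetical, and real Searcher and Trainer nodes would run concurrently on separate devices rather than taking turns in one loop.

```python
import random
from collections import deque


class ReplayBuffer:
    """Shared store of (prompt, response, reward, policy_version) records."""

    def __init__(self, capacity=1000):
        self.items = deque(maxlen=capacity)

    def add(self, record):
        self.items.append(record)

    def sample(self, k, rng):
        # Uniform sampling keeps this sketch simple.
        return rng.sample(list(self.items), min(k, len(self.items)))


class Searcher:
    """Explores with a local, possibly stale, copy of the policy."""

    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.weights_version = 0

    def sync(self, trainer_version):
        self.weights_version = trainer_version

    def explore(self, prompt):
        response = f"{prompt}-resp{self.rng.randint(0, 9)}"  # stand-in for LLM sampling
        reward = self.rng.random()  # stand-in for a reward model
        return (prompt, response, reward, self.weights_version)


class Trainer:
    """Consumes buffered data; each update bumps the policy version."""

    def __init__(self):
        self.version = 0

    def update(self, batch):
        # Stand-in for one gradient step on an off-policy objective.
        self.version += 1


def run(num_searchers=3, steps=20, sync_every=5, seed=0):
    rng = random.Random(seed)
    buffer = ReplayBuffer()
    trainer = Trainer()
    searchers = [Searcher(seed + i + 1) for i in range(num_searchers)]
    max_staleness = 0
    for step in range(steps):
        # Searchers never wait for the trainer; they write to the buffer.
        for s in searchers:
            buffer.add(s.explore(f"prompt{step}"))
        batch = buffer.sample(4, rng)
        # Staleness: how many updates behind the generating policy was.
        staleness = max(trainer.version - rec[3] for rec in batch)
        max_staleness = max(max_staleness, staleness)
        trainer.update(batch)
        # Weights flow back to searchers only occasionally.
        if (step + 1) % sync_every == 0:
            for s in searchers:
                s.sync(trainer.version)
    return trainer.version, len(buffer.items), max_staleness
```

Calling `run()` returns the final trainer version, the buffer size, and the largest staleness seen in any batch. The staleness is nonzero because the trainer keeps updating between weight syncs, which is why the learning objective has to tolerate off-policy data.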
Key Findings
The TB objective enables principled off-policy learning that tolerates the staleness of asynchronously generated data
Sampling the replay buffer by a mix of recency and reward balances exploration and exploitation
TBA creates new Pareto frontiers: KL vs. win-rate (preference tuning), diversity vs. toxicity (red-teaming)
On MATH with Qwen 2.5 7B, TBA′ outperforms Dr. GRPO in highly off-policy settings
Scaling the number of searchers improves red-teaming performance (more attacks, and more diverse ones)
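The first two findings can be made concrete with a short sketch. It assumes one common form of the TB loss for KL-regularized post-training, in which the log-partition term is estimated as a batch mean over several responses to the same prompt. The function names, the `beta` temperature, the `p_recent` mixing probability, and the exponential reward weighting are illustrative assumptions, not details confirmed by this summary.

```python
import math
import random


def tb_loss(logp_policy, logp_ref, rewards, beta=1.0):
    """Trajectory-balance-style loss for K responses to one prompt.

    logp_policy[i]: log-probability of response i under the current policy
    logp_ref[i]:    log-probability of response i under the reference model
    rewards[i]:     scalar reward of response i
    """
    # Log of the unnormalized target: reference likelihood tilted by reward.
    targets = [lr + r / beta for lr, r in zip(logp_ref, rewards)]
    # Assumption: log Z is estimated as the batch mean of (target - logp_policy).
    log_z = sum(t - lp for t, lp in zip(targets, logp_policy)) / len(targets)
    residuals = [log_z + lp - t for lp, t in zip(logp_policy, targets)]
    return sum(res * res for res in residuals) / len(residuals)


def sample_batch(buffer, k, rng, p_recent=0.5):
    """Draw k records from buffer, a list of (response, reward), oldest first.

    With probability p_recent a draw comes from the k newest records
    (recency); otherwise it is weighted by exp(reward) over the whole buffer.
    """
    batch = []
    for _ in range(k):
        if rng.random() < p_recent:
            batch.append(rng.choice(buffer[-k:]))
        else:
            weights = [math.exp(r) for _, r in buffer]
            batch.append(rng.choices(buffer, weights=weights, k=1)[0])
    return batch
```

The loss contains no term for the policy that generated each response, so it can be evaluated on buffered data from older policy versions. In `sample_batch`, setting `p_recent=1.0` gives near on-policy batches, while `p_recent=0.0` replays mostly high-reward responses.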
Key Concepts
tba — Trajectory Balance with Asynchrony framework