title: Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training
authors: Brian Bartoldson, Siddarth Venkatraman, James Diffenderfer, Moksh Jain, Tal Ben-Nun, Seanie Lee, Minsu Kim, Johan Obando-Ceron, Yoshua Bengio, Bhavya Kailkhura
year: 2025
arxiv: 2503.18929
venue: NeurIPS 2025
institutions: Lawrence Livermore National Laboratory, Mila, Université de Montréal, KAIST, CIFAR
type: paper
created: 2026-05-12
tags: reinforcement-learning, llm-post-training, gflownet, asynchronous-rl, off-policy

Trajectory Balance with Asynchrony

Authors: Brian Bartoldson, Siddarth Venkatraman, James Diffenderfer, Moksh Jain, Tal Ben-Nun, Seanie Lee, Minsu Kim, Johan Obando-Ceron, Yoshua Bengio, Bhavya Kailkhura
Venue: NeurIPS 2025
arXiv: 2503.18929
Code: https://github.com/bbartoldson/TBA

Abstract

TBA introduces an asynchronous RL framework for LLM post-training built on the off-policy Trajectory Balance (TB) objective from GFlowNets. It decouples Searcher nodes (exploration) from Trainer nodes (policy updates) and connects them through a replay buffer. TBA achieves 4× to 50× speedups while matching or exceeding on-policy baselines (PPO, GRPO, RLOO, Online DPO).
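To make the TB objective concrete, here is a minimal sketch of a trajectory-balance-style loss for the K responses sampled for one prompt. It is an illustration under stated assumptions, not the authors' implementation: it assumes the target distribution is proportional to `pi_ref(y|x) * exp(r / beta)` and that `log Z` is estimated as the batch mean of the residuals. The function name `tb_loss` and the parameter `beta` are hypothetical.

```python
def tb_loss(logp_policy, logp_ref, rewards, beta=1.0):
    """Mean squared trajectory-balance residual over K responses to one prompt.

    Assumption (not taken from the note): the target is proportional to
    pi_ref(y|x) * exp(r / beta), and log Z is estimated from the batch.
    """
    # delta_i = log pi_ref(y_i|x) + r_i / beta - log pi_theta(y_i|x)
    deltas = [
        lr + r / beta - lp
        for lp, lr, r in zip(logp_policy, logp_ref, rewards)
    ]
    # Batch-mean estimate of log Z (assumed estimator).
    log_z = sum(deltas) / len(deltas)
    # TB residual: log Z + log pi_theta - log pi_ref - r / beta = log_z - delta_i
    return sum((log_z - d) ** 2 for d in deltas) / len(deltas)


# The log-probabilities are plain numbers, so they can come from any policy,
# including a stale one whose samples sit in a replay buffer.
loss = tb_loss(
    logp_policy=[-2.0, -3.0],
    logp_ref=[-2.0, -3.0],
    rewards=[1.0, 0.0],
)
```

The loss is zero exactly when the policy's log-probabilities match the target up to a shared constant, which is why the objective does not require the samples to come from the current policy.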

Key Findings

  • TB objective enables principled off-policy learning, resistant to the staleness of asynchronous data
  • Recency + reward sampling from replay buffer balances exploration and exploitation
  • TBA creates new Pareto frontiers: KL vs. win-rate (preference tuning), diversity vs. toxicity (red-teaming)
  • On MATH with Qwen 2.5 7B, TBA outperforms Dr. GRPO in highly off-policy settings
  • Scaling searchers improves red-teaming performance (more attacks + more diverse)
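
The recency + reward sampling idea in the list above can be sketched as a small replay buffer. This is a hypothetical illustration, not the TBA codebase: the class `ReplayBuffer`, the parameter `recency_prob`, and the rule "with probability `recency_prob` draw from the most recent step, otherwise draw proportionally to reward" are all assumptions chosen to show the exploration/exploitation trade-off.

```python
import random


class ReplayBuffer:
    """Toy buffer that Searchers write to and a Trainer samples from."""

    def __init__(self):
        self.items = []  # tuples of (step_added, reward, trajectory)

    def add(self, step, reward, trajectory):
        self.items.append((step, reward, trajectory))

    def sample(self, k, recency_prob=0.5, rng=random):
        if not self.items:
            return []
        latest = max(step for step, _, _ in self.items)
        recent = [it for it in self.items if it[0] == latest]
        # Small floor keeps zero-reward items sampleable (assumed choice).
        weights = [max(it[1], 0.0) + 1e-6 for it in self.items]
        batch = []
        for _ in range(k):
            if rng.random() < recency_prob:
                # Recency branch: closest to on-policy data.
                batch.append(rng.choice(recent))
            else:
                # Reward branch: favors high-reward trajectories of any age.
                batch.append(rng.choices(self.items, weights=weights, k=1)[0])
        return batch


buf = ReplayBuffer()
buf.add(step=0, reward=0.0, trajectory="old-low")
buf.add(step=0, reward=1.0, trajectory="old-high")
buf.add(step=1, reward=0.2, trajectory="new")
batch = buf.sample(4, recency_prob=0.5, rng=random.Random(0))
```

Raising `recency_prob` pushes training toward fresher data, while lowering it leans on high-reward trajectories regardless of age.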

Key Concepts