Trajectory Balance with Asynchrony

Authors: Brian Bartoldson, Siddarth Venkatraman, James Diffenderfer, Moksh Jain, Tal Ben-Nun, Seanie Lee, Minsu Kim, Johan Obando-Ceron, Yoshua Bengio, Bhavya Kailkhura
Venue: NeurIPS 2025
arXiv: 2503.18929
Code: https://github.com/bbartoldson/TBA

Abstract

TBA introduces an asynchronous RL framework for LLM post-training using the off-policy Trajectory Balance (TB) objective from GFlowNets. By decoupling Searcher nodes (exploration) from Trainer nodes (policy updates) and using a replay buffer, TBA achieves 4×–50× speedups while matching or exceeding on-policy baselines (PPO, GRPO, RLOO, Online DPO).
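
The Searcher/Trainer decoupling described above can be illustrated with a toy, single-process simulation. This is a minimal sketch under stated assumptions, not the TBA codebase: `Trajectory`, `ReplayBuffer`, `searcher_step`, `run`, and the `sync_every` parameter are hypothetical names, the policy is reduced to an integer version counter, and real Searcher and Trainer nodes would run concurrently on separate hardware.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Trajectory:
    response: str
    reward: float
    policy_version: int  # version of the weights that generated this sample


class ReplayBuffer:
    """Bounded FIFO buffer shared by the searchers and the trainer."""

    def __init__(self, capacity: int):
        self.items = deque(maxlen=capacity)

    def add(self, traj: Trajectory) -> None:
        self.items.append(traj)

    def sample(self, k: int, rng: random.Random) -> list:
        # Uniform sampling without replacement, kept simple for illustration.
        return rng.sample(list(self.items), min(k, len(self.items)))


def searcher_step(searcher_id: int, local_version: int, rng: random.Random) -> Trajectory:
    # Stand-in for LLM generation followed by reward scoring.
    return Trajectory(f"s{searcher_id}-v{local_version}", rng.random(), local_version)


def run(num_searchers=3, trainer_steps=10, sync_every=4, seed=0):
    rng = random.Random(seed)
    buffer = ReplayBuffer(capacity=64)
    trainer_version = 0
    searcher_versions = [0] * num_searchers
    max_staleness = 0
    for step in range(1, trainer_steps + 1):
        # Searchers generate with whatever (possibly stale) weights they hold.
        for sid in range(num_searchers):
            buffer.add(searcher_step(sid, searcher_versions[sid], rng))
        # The trainer updates on buffered data instead of waiting for fresh rollouts.
        batch = buffer.sample(4, rng)
        staleness = max(trainer_version - t.policy_version for t in batch)
        max_staleness = max(max_staleness, staleness)
        trainer_version += 1  # stand-in for one gradient step
        # Weights are pushed to the searchers only periodically.
        if step % sync_every == 0:
            searcher_versions = [trainer_version] * num_searchers
    return trainer_version, max_staleness, len(buffer.items)
```

Calling `run()` returns the final trainer version, the largest policy-version gap seen in any sampled batch, and the buffer size. The version gap is the staleness that an off-policy objective has to tolerate.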

Key Findings

The TB objective enables principled off-policy learning and is robust to the staleness of asynchronously generated data
Sampling the replay buffer by both recency and reward balances exploration and exploitation
TBA creates new Pareto frontiers: KL vs. win-rate (preference tuning), diversity vs. toxicity (red-teaming)
On MATH with Qwen 2.5 7B, TBA′ outperforms Dr. GRPO in highly off-policy settings
Scaling the number of searchers improves red-teaming performance (more attacks, and more diverse ones)
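
To make the first two findings concrete, the sketch below shows a sequence-level TB loss and a mixed recency/reward sampler. It assumes the standard GFlowNet form of trajectory balance for autoregressive generation, where each partial sequence has a single parent so the backward-policy term drops out, leaving the squared residual (log Z + log π_θ(y|x) − log R(x, y))². How log Z is obtained, how the reward is defined, and the exact sampling rule used in the paper are not stated in this summary; `tb_loss`, `sample_batch`, `p_recent`, and `window` are hypothetical names chosen for illustration.

```python
import math
import random


def tb_loss(log_z: float, log_pi: float, log_reward: float) -> float:
    """Squared trajectory-balance residual for one response.

    log_pi is the summed token log-probability under the current policy.
    The response itself may come from any policy, which is what makes
    the objective usable on stale, off-policy data.
    """
    return (log_z + log_pi - log_reward) ** 2


def sample_batch(buffer, k, p_recent, rng, window=4):
    """Draw k items (with replacement) from a list of (response, reward)
    pairs ordered oldest to newest.

    Assumed rule: with probability p_recent pick uniformly among the
    `window` newest items, otherwise pick with probability that grows
    with reward (softmax weights, so any real-valued reward works).
    """
    max_r = max(r for _, r in buffer)
    weights = [math.exp(r - max_r) for _, r in buffer]  # always positive
    recent = buffer[-window:]
    batch = []
    for _ in range(k):
        if rng.random() < p_recent:
            batch.append(rng.choice(recent))  # favour fresh data
        else:
            batch.append(rng.choices(buffer, weights=weights, k=1)[0])  # favour high reward
    return batch


# The residual vanishes when log Z + log pi equals log R.
print(tb_loss(log_z=1.0, log_pi=-3.0, log_reward=-2.0))  # 0.0
```

Setting `p_recent` close to 1 keeps training near on-policy data, while values close to 0 replay high-reward samples more often. The right balance is an empirical choice.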

Key Concepts

tba — Trajectory Balance with Asynchrony framework
trajectory-balance-objective — Off-policy TB objective from GFlowNets
asynchronous-rl-llm — Decoupling exploration from learning
off-policy-llm-post-training — Training on off-policy data
gflownet-fine-tuning — GFlowNets for LLM fine-tuning
replay-buffer-rl-llm — Replay buffer in LLM RL
searcher-trainer-decoupling — Architecture pattern
reward-recency-sampling — TBA's sampling strategy
grpo — On-policy baseline
rlvr-unified-framework — RLVR training paradigm
