---
title: "Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training"
authors: ["Brian Bartoldson", "Siddarth Venkatraman", "James Diffenderfer", "Moksh Jain", "Tal Ben-Nun", "Seanie Lee", "Minsu Kim", "Johan Obando-Ceron", "Yoshua Bengio", "Bhavya Kailkhura"]
year: 2025
arxiv: "2503.18929"
venue: "NeurIPS 2025"
institutions: ["Lawrence Livermore National Laboratory", "Mila", "Université de Montréal", "KAIST", "CIFAR"]
type: "paper"
created: 2026-05-12
tags: ["reinforcement-learning", "llm-post-training", "gflownet", "asynchronous-rl", "off-policy"]
---

# Trajectory Balance with Asynchrony

**Authors**: Brian Bartoldson, Siddarth Venkatraman, James Diffenderfer, Moksh Jain, Tal Ben-Nun, Seanie Lee, Minsu Kim, Johan Obando-Ceron, Yoshua Bengio, Bhavya Kailkhura
**Venue**: NeurIPS 2025
**arXiv**: [2503.18929](https://arxiv.org/abs/2503.18929)
**Code**: https://github.com/bbartoldson/TBA

## Abstract

TBA introduces an asynchronous RL framework for LLM post-training using the off-policy Trajectory Balance (TB) objective from GFlowNets. By decoupling Searcher nodes (exploration) from Trainer nodes (policy updates) and using a replay buffer, TBA achieves 4×–50× speedups while matching or exceeding on-policy baselines (PPO, GRPO, RLOO, Online DPO).

## Key Findings

- TB objective enables principled off-policy learning, resistant to the staleness of asynchronous data
- Recency + reward sampling from replay buffer balances exploration and exploitation
- TBA creates new Pareto frontiers: KL vs. win-rate (preference tuning), diversity vs. toxicity (red-teaming)
- On MATH with Qwen 2.5 7B, TBA′ outperforms Dr. GRPO in highly off-policy settings
- Scaling searchers improves red-teaming performance (more attacks + more diverse)

## Key Concepts

- [[tba|TBA]]: Trajectory Balance with Asynchrony framework
- [[trajectory-balance-objective]]: Off-policy TB objective from GFlowNets
- [[asynchronous-rl-llm]]: Decoupling exploration from learning
- [[off-policy-llm-post-training]]: Training on off-policy data
- [[gflownet-fine-tuning]]: GFlowNets for LLM fine-tuning
- [[replay-buffer-rl-llm]]: Replay buffer in LLM RL
- [[searcher-trainer-decoupling]]: Architecture pattern
- [[reward-recency-sampling]]: TBA's sampling strategy
- [[grpo]]: On-policy baseline
- [[rlvr-unified-framework]]: RLVR training paradigm
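
## Illustrative Sketch

The two mechanisms named in the findings above, the TB objective and reward + recency sampling from a replay buffer, can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Trajectory` class, the function names, the `recency_prob` and `temperature` parameters, and the batch-mean estimate of log Z are all hypothetical choices made for this note. The loss uses the generic trajectory-balance residual for autoregressive generation, where the backward-policy term drops out because each sequence has a single construction path.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Trajectory:
    logp: float        # log-probability of the full response under the policy being trained
    log_reward: float  # log R(y; x), however the reward is defined for the task
    step_added: int    # insertion index in the buffer, used as the recency signal


def estimate_log_z(batch):
    """Batch-mean estimate of log Z for one prompt (an assumed estimator)."""
    return sum(t.log_reward - t.logp for t in batch) / len(batch)


def tb_loss(batch, log_z=None):
    """Mean squared trajectory-balance residual over a batch for one prompt."""
    if log_z is None:
        log_z = estimate_log_z(batch)
    return sum((log_z + t.logp - t.log_reward) ** 2 for t in batch) / len(batch)


def sample_from_buffer(buffer, k, recency_prob=0.5, temperature=1.0, rng=random):
    """Draw k trajectories, mixing recency-based and reward-based picks."""
    # Recency pool: the k most recently added trajectories.
    newest_first = sorted(buffer, key=lambda t: t.step_added, reverse=True)
    recent_pool = newest_first[: max(1, k)]
    # Reward weights: softmax over log-rewards, shifted by the max for stability.
    max_lr = max(t.log_reward for t in buffer)
    weights = [math.exp((t.log_reward - max_lr) / temperature) for t in buffer]
    picks = []
    for _ in range(k):
        if rng.random() < recency_prob:
            picks.append(rng.choice(recent_pool))
        else:
            picks.append(rng.choices(buffer, weights=weights, k=1)[0])
    return picks
```

In a setup like the one the abstract describes, Searcher nodes would append generated trajectories to the buffer while a Trainer node repeatedly calls something like `sample_from_buffer` and minimizes `tb_loss` on the result. In real training, `logp` would be recomputed under the current policy with gradients enabled rather than stored as a float; the stored value here only keeps the sketch self-contained.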