---
title: "Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training"
authors: ["Brian Bartoldson", "Siddarth Venkatraman", "James Diffenderfer", "Moksh Jain", "Tal Ben-Nun", "Seanie Lee", "Minsu Kim", "Johan Obando-Ceron", "Yoshua Bengio", "Bhavya Kailkhura"]
year: 2025
arxiv: "2503.18929"
venue: "NeurIPS 2025"
institutions: ["Lawrence Livermore National Laboratory", "Mila", "Université de Montréal", "KAIST", "CIFAR"]
type: "paper"
created: 2026-05-12
tags: ["reinforcement-learning", "llm-post-training", "gflownet", "asynchronous-rl", "off-policy"]
---

# Trajectory Balance with Asynchrony

**Authors**: Brian Bartoldson, Siddarth Venkatraman, James Diffenderfer, Moksh Jain, Tal Ben-Nun, Seanie Lee, Minsu Kim, Johan Obando-Ceron, Yoshua Bengio, Bhavya Kailkhura
**Venue**: NeurIPS 2025
**arXiv**: [2503.18929](https://arxiv.org/abs/2503.18929)
**Code**: https://github.com/bbartoldson/TBA

## Abstract

Trajectory Balance with Asynchrony (TBA) is an asynchronous RL framework for LLM post-training built on the off-policy Trajectory Balance (TB) objective from GFlowNets. It decouples Searcher nodes (exploration) from Trainer nodes (policy updates) and connects them through a replay buffer, achieving 4×–50× speedups while matching or exceeding on-policy baselines (PPO, GRPO, RLOO, Online DPO).
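
The decoupling described above can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions, not the paper's implementation: there is no real parallelism or model, and every name here (`run_async_sketch`, `sync_every`, the `version` counter, the buffer size) is hypothetical. It shows how a Searcher that syncs weights only periodically fills the buffer with data that is stale relative to the Trainer.

```python
import random
from collections import deque


def run_async_sketch(num_rounds=5, sync_every=2, seed=0):
    """Sequential stand-in for Searcher/Trainer decoupling (illustrative only)."""
    rng = random.Random(seed)
    buffer = deque(maxlen=64)   # shared replay buffer (size is an assumption)
    trainer_version = 0         # number of Trainer updates so far
    searcher_version = 0        # weights the Searcher last synced
    staleness = []              # trainer_version - data version, per sampled item

    for step in range(num_rounds):
        # Searcher: generate with possibly stale weights and store the results.
        for _ in range(4):
            buffer.append({"version": searcher_version, "reward": rng.random()})

        # Trainer: sample off-policy data from the buffer and take one update.
        batch = rng.sample(list(buffer), k=4)
        staleness.extend(trainer_version - item["version"] for item in batch)
        trainer_version += 1

        # Periodic weight sync from Trainer to Searcher.
        if (step + 1) % sync_every == 0:
            searcher_version = trainer_version

    return trainer_version, staleness


if __name__ == "__main__":
    version, staleness = run_async_sketch()
    print("trainer updates:", version, "max staleness:", max(staleness))
```

A larger `sync_every` makes the sampled data more off-policy, which is the regime an off-policy objective such as TB is meant to tolerate.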

## Key Findings

- The TB objective enables principled off-policy learning and is robust to the staleness of asynchronously generated data
- Sampling from the replay buffer by recency and by reward balances exploration and exploitation
- TBA creates new Pareto frontiers: KL vs. win-rate (preference tuning), diversity vs. toxicity (red-teaming)
- On MATH with Qwen 2.5 7B, TBA′ outperforms Dr. GRPO in highly off-policy settings
- Scaling the number of Searchers improves red-teaming performance (more attacks and more diverse attacks)
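
The first two findings can be made concrete with a minimal sketch. It assumes the standard GFlowNet TB residual for autoregressive generation, `log Z + log P_F - log R`, with `log Z` estimated from the batch (a VarGrad-style choice assumed here). The sampling rule, the `p_recent` mixing probability, and the exponential reward weighting are hypothetical illustrations; the exact scheme in the paper may differ.

```python
import math
import random


def tb_losses(log_pf, log_reward):
    """Squared TB residuals for K responses to one prompt.

    log_pf[i]     : log-probability of response i under the current policy
    log_reward[i] : log of the unnormalized reward for response i
    log Z is estimated as the batch mean of (log_reward - log_pf), an assumption.
    """
    log_z = sum(r - p for p, r in zip(log_pf, log_reward)) / len(log_pf)
    return [(log_z + p - r) ** 2 for p, r in zip(log_pf, log_reward)]


def sample_batch(buffer, k, p_recent=0.5, rng=random):
    """Draw k items: a recent item with probability p_recent, else a reward-weighted one.

    `buffer` is a list of dicts with a "reward" key, ordered oldest to newest.
    """
    recent = buffer[-k:]  # newest entries sit at the end of the list
    weights = [math.exp(item["reward"]) for item in buffer]
    batch = []
    for _ in range(k):
        if rng.random() < p_recent:
            batch.append(rng.choice(recent))
        else:
            batch.append(rng.choices(buffer, weights=weights, k=1)[0])
    return batch


if __name__ == "__main__":
    print(tb_losses([-1.0, -1.0], [0.0, 2.0]))
    toy_buffer = [{"id": i, "reward": i / 10} for i in range(10)]
    print(sample_batch(toy_buffer, k=4, rng=random.Random(0)))
```

The residual depends only on log-probabilities and rewards of the sampled responses, not on which policy generated them, which is why stale buffer data can still be used for training.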

## Key Concepts

- [[tba|TBA]] — Trajectory Balance with Asynchrony framework
- [[trajectory-balance-objective]] — Off-policy TB objective from GFlowNets
- [[asynchronous-rl-llm]] — Decoupling exploration from learning
- [[off-policy-llm-post-training]] — Training on off-policy data
- [[gflownet-fine-tuning]] — GFlowNets for LLM fine-tuning
- [[replay-buffer-rl-llm]] — Replay buffer in LLM RL
- [[searcher-trainer-decoupling]] — Architecture pattern
- [[reward-recency-sampling]] — TBA's sampling strategy
- [[grpo]] — On-policy baseline
- [[rlvr-unified-framework]] — RLVR training paradigm