---
title: "Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training"
authors: ["Brian Bartoldson", "Siddarth Venkatraman", "James Diffenderfer", "Moksh Jain", "Tal Ben-Nun", "Seanie Lee", "Minsu Kim", "Johan Obando-Ceron", "Yoshua Bengio", "Bhavya Kailkhura"]
year: 2025
arxiv: "2503.18929"
venue: "NeurIPS 2025"
institutions: ["Lawrence Livermore National Laboratory", "Mila", "Université de Montréal", "KAIST", "CIFAR"]
type: "paper"
created: 2026-05-12
tags: ["reinforcement-learning", "llm-post-training", "gflownet", "asynchronous-rl", "off-policy"]
---

# Trajectory Balance with Asynchrony

**Authors**: Brian Bartoldson, Siddarth Venkatraman, James Diffenderfer, Moksh Jain, Tal Ben-Nun, Seanie Lee, Minsu Kim, Johan Obando-Ceron, Yoshua Bengio, Bhavya Kailkhura
**Venue**: NeurIPS 2025
**arXiv**: [2503.18929](https://arxiv.org/abs/2503.18929)
**Code**: https://github.com/bbartoldson/TBA

## Abstract

TBA is an asynchronous RL framework for LLM post-training built on the off-policy Trajectory Balance (TB) objective from GFlowNets. It decouples Searcher nodes (exploration) from Trainer nodes (policy updates) through a shared replay buffer, achieving 4×–50× wall-clock training speedups while matching or exceeding on-policy baselines (PPO, GRPO, RLOO, Online DPO).
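The Searcher/Trainer split described above can be illustrated with a toy sketch. Everything in it is a stand-in rather than the paper's implementation: threads play the role of nodes, a random number replaces generation and reward scoring, a counter replaces weight synchronization, and the names `ReplayBuffer`, `searcher`, `trainer`, `run`, and `sync_every` are hypothetical.

```python
import random
import threading
import time
from collections import deque


class ReplayBuffer:
    """Thread-safe bounded buffer shared by searchers and the trainer."""

    def __init__(self, capacity):
        self._items = deque(maxlen=capacity)  # oldest entries are evicted first
        self._lock = threading.Lock()

    def add(self, item):
        with self._lock:
            self._items.append(item)

    def sample(self, k, rng):
        # Copy under the lock, then draw up to k items uniformly without replacement.
        with self._lock:
            items = list(self._items)
        return rng.sample(items, min(k, len(items)))

    def __len__(self):
        with self._lock:
            return len(self._items)


def searcher(buffer, policy_version, stop, seed):
    """Keep generating samples with whatever policy version is currently visible."""
    rng = random.Random(seed)
    while not stop.is_set():
        version = policy_version[0]  # may already be stale when the trainer reads it
        reward = rng.random()        # stand-in for LLM generation plus reward scoring
        buffer.add({"policy_version": version, "reward": reward})
        time.sleep(0.001)


def trainer(buffer, policy_version, steps, batch_size, sync_every):
    """Consume batches without waiting for searchers to finish a rollout round."""
    rng = random.Random(0)
    trained = 0
    while trained < steps:
        batch = buffer.sample(batch_size, rng)
        if not batch:
            time.sleep(0.001)  # buffer still empty, let the searchers fill it
            continue
        trained += 1  # stand-in for one gradient step on an off-policy loss
        if trained % sync_every == 0:
            policy_version[0] += 1  # stand-in for publishing new weights
    return trained


def run(num_searchers=2, steps=20, batch_size=4, sync_every=5):
    buffer = ReplayBuffer(capacity=1000)
    policy_version = [0]  # one-element list so all threads share the same counter
    stop = threading.Event()
    threads = [
        threading.Thread(target=searcher, args=(buffer, policy_version, stop, i))
        for i in range(num_searchers)
    ]
    for t in threads:
        t.start()
    try:
        trained = trainer(buffer, policy_version, steps, batch_size, sync_every)
    finally:
        stop.set()
        for t in threads:
            t.join()
    return trained, policy_version[0], len(buffer)
```

The point of the sketch is that the trainer never blocks on generation: it trains on whatever the buffer holds, so the data it sees can lag behind the current policy, which is why an off-policy objective is needed.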

## Key Findings

- The TB objective enables principled off-policy learning and is robust to the staleness of asynchronously generated data
- Sampling from the replay buffer by recency and by reward balances exploration and exploitation
- TBA creates new Pareto frontiers: KL vs. win rate (preference tuning) and diversity vs. toxicity (red-teaming)
- On MATH with Qwen 2.5 7B, TBA′ outperforms Dr. GRPO in highly off-policy settings
- Scaling the number of searchers improves red-teaming performance (more attacks, and more diverse ones)
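To make the first finding concrete, here is a minimal sketch of a trajectory-balance loss for autoregressive responses, where the TB residual for a response y to prompt x is log Z(x) + log π_θ(y|x) − log R(x, y). Two things are assumptions not stated in this note: the target is taken to be the KL-regularized form log R = log π_ref(y|x) + r(x, y)/β, and log Z(x) is estimated as a batch mean over responses to the same prompt (a VarGrad-style estimate). The function name `tb_loss` and its parameters are hypothetical.

```python
def tb_loss(logp_policy, logp_ref, rewards, beta):
    """Mean squared trajectory-balance residual for K responses to one prompt.

    logp_policy: log pi_theta(y_k | x) for each response (sequence log-probs)
    logp_ref:    log pi_ref(y_k | x) for each response (assumed reference model)
    rewards:     scalar reward r(x, y_k) for each response
    beta:        temperature of the assumed KL-regularized target
    """
    # Log of the unnormalized target: log pi_ref(y|x) + r(x, y) / beta.
    log_target = [lr + r / beta for lr, r in zip(logp_ref, rewards)]
    # Batch estimate of log Z(x): mean of (log target - log policy).
    gaps = [lt - lp for lt, lp in zip(log_target, logp_policy)]
    log_z = sum(gaps) / len(gaps)
    # TB residual per response: log Z + log pi_theta(y|x) - log target.
    residuals = [log_z + lp - lt for lp, lt in zip(logp_policy, log_target)]
    return sum(d * d for d in residuals) / len(residuals)
```

Nothing in the loss requires the responses to have been sampled from the current policy; it only needs their log-probabilities under it, which is what lets stale buffer data be reused. In real training `logp_policy` would be a differentiable tensor rather than a list of floats.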

## Key Concepts

- [[tba|TBA]] — Trajectory Balance with Asynchrony framework
- [[trajectory-balance-objective]] — Off-policy TB objective from GFlowNets
- [[asynchronous-rl-llm]] — Decoupling exploration from learning
- [[off-policy-llm-post-training]] — Training on off-policy data
- [[gflownet-fine-tuning]] — GFlowNets for LLM fine-tuning
- [[replay-buffer-rl-llm]] — Replay buffer in LLM RL
- [[searcher-trainer-decoupling]] — Architecture pattern
- [[reward-recency-sampling]] — TBA's sampling strategy
- [[grpo]] — On-policy baseline
- [[rlvr-unified-framework]] — RLVR training paradigm
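The reward and recency sampling listed above could look roughly like the following sketch. It shows one plausible reading, not the paper's exact rule: with probability `p_recent` the batch is the most recently generated entries, otherwise entries are drawn with softmax weights over their rewards. The `Entry` fields, `p_recent`, and `temperature` are hypothetical names introduced for illustration.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Entry:
    response: str
    reward: float
    step: int  # hypothetical: trainer step at which the sample was generated


def sample_batch(buffer, batch_size, p_recent=0.5, temperature=1.0, rng=random):
    """Pick a batch either by recency or by reward-weighted sampling."""
    if not buffer:
        return []
    k = min(batch_size, len(buffer))
    if rng.random() < p_recent:
        # Recency branch: the k most recently generated entries (closest to on-policy).
        return sorted(buffer, key=lambda e: e.step, reverse=True)[:k]
    # Reward branch: softmax over rewards (max-subtracted for numerical stability),
    # sampled with replacement.
    top = max(e.reward for e in buffer)
    weights = [math.exp((e.reward - top) / temperature) for e in buffer]
    return rng.choices(buffer, weights=weights, k=k)
```

Raising `p_recent` keeps training closer to on-policy, while lowering `temperature` concentrates the reward branch on the highest-reward entries in the buffer.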