---
title: "Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training"
authors: ["Brian Bartoldson", "Siddarth Venkatraman", "James Diffenderfer", "Moksh Jain", "Tal Ben-Nun", "Seanie Lee", "Minsu Kim", "Johan Obando-Ceron", "Yoshua Bengio", "Bhavya Kailkhura"]
year: 2025
arxiv: "2503.18929"
venue: "NeurIPS 2025"
institutions: ["Lawrence Livermore National Laboratory", "Mila", "Université de Montréal", "KAIST", "CIFAR"]
type: "paper"
created: 2026-05-12
tags: ["reinforcement-learning", "llm-post-training", "gflownet", "asynchronous-rl", "off-policy"]
---
# Trajectory Balance with Asynchrony
**Authors**: Brian Bartoldson, Siddarth Venkatraman, James Diffenderfer, Moksh Jain, Tal Ben-Nun, Seanie Lee, Minsu Kim, Johan Obando-Ceron, Yoshua Bengio, Bhavya Kailkhura
**Venue**: NeurIPS 2025
**arXiv**: [2503.18929](https://arxiv.org/abs/2503.18929)
**Code**: https://github.com/bbartoldson/TBA
## Abstract
TBA introduces an asynchronous RL framework for LLM post-training using the off-policy Trajectory Balance (TB) objective from GFlowNets. By decoupling Searcher nodes (exploration) from Trainer nodes (policy updates) and using a replay buffer, TBA achieves 4× to 50× speedups while matching or exceeding on-policy baselines (PPO, GRPO, RLOO, Online DPO).
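The Searcher/Trainer decoupling can be illustrated with a toy, single-process simulation. This is a sketch under stated assumptions, not the paper's implementation: the names `Trajectory`, `ReplayBuffer`, `recency_prob`, and `simulate` are hypothetical, the "generation" is a random stand-in for LLM sampling and reward scoring, and the rule of picking the newest item with probability `recency_prob` and otherwise sampling in proportion to reward is one plausible way to mix recency and reward.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Trajectory:
    text: str
    reward: float
    policy_version: int  # version of the weights that generated this sample


class ReplayBuffer:
    """Bounded buffer shared by Searchers (writers) and the Trainer (reader)."""

    def __init__(self, capacity, recency_prob, seed=0):
        self.items = deque(maxlen=capacity)  # oldest entries are evicted first
        self.recency_prob = recency_prob     # assumed mixing knob, not from the paper
        self.rng = random.Random(seed)

    def add(self, traj):
        self.items.append(traj)

    def sample(self):
        if not self.items:
            raise ValueError("buffer is empty")
        # Recency branch: return the most recently added trajectory.
        if self.rng.random() < self.recency_prob:
            return self.items[-1]
        # Reward branch: sample proportionally to (non-negative) reward.
        weights = [max(t.reward, 0.0) + 1e-6 for t in self.items]
        return self.rng.choices(list(self.items), weights=weights, k=1)[0]


def simulate(train_steps=6, sync_every=3, seed=0):
    """Sequential stand-in for concurrent Searcher and Trainer processes."""
    rng = random.Random(seed)
    buffer = ReplayBuffer(capacity=8, recency_prob=0.5, seed=seed)
    trainer_version = 0
    searcher_version = 0
    staleness = []
    for step in range(train_steps):
        # Searcher: generate with its local, possibly stale, weights.
        buffer.add(Trajectory(f"gen-{step}", rng.random(), searcher_version))
        # Trainer: draw from the buffer and take one (omitted) update step.
        sampled = buffer.sample()
        staleness.append(trainer_version - sampled.policy_version)
        trainer_version += 1
        # Weights are pushed to the Searcher only every `sync_every` updates.
        if trainer_version % sync_every == 0:
            searcher_version = trainer_version
    return staleness
```

Calling `simulate()` returns, for each training step, how many updates behind the sampled trajectory's generating policy was. Nonzero entries are the off-policy data that the Trainer has to tolerate.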
## Key Findings
- TB objective enables principled off-policy learning, resistant to the staleness of asynchronous data
- Sampling the replay buffer by a mix of recency and reward balances exploration and exploitation
- TBA creates new Pareto frontiers: KL vs. win-rate (preference tuning), diversity vs. toxicity (red-teaming)
- On MATH with Qwen 2.5 7B, TBA outperforms Dr. GRPO in highly off-policy settings
- Scaling the number of Searchers improves red-teaming performance (more attacks, and more diverse ones)
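
To make the first finding concrete, here is a minimal sketch of the generic Trajectory Balance loss for one autoregressive sequence. It assumes the standard GFlowNet form, in which the backward-policy term drops out because each sequence has exactly one generation path, and it treats `log_z` as a given scalar. How the paper estimates the partition function and defines the reward is not stated in this note, so the function name `tb_loss` and its arguments are hypothetical.

```python
import math


def tb_loss(log_z, token_logprobs, log_reward):
    """Squared trajectory-balance residual for a single sampled sequence.

    log_z          -- estimate of the log partition function (assumed given)
    token_logprobs -- per-token log-probabilities under the current policy
    log_reward     -- log of the (unnormalized) reward for the sequence
    """
    log_pf = sum(token_logprobs)              # log-probability of the whole sequence
    residual = log_z + log_pf - log_reward    # zero when policy prob equals reward / Z
    return residual ** 2


# Example: reward 2.0, Z = 8.0, policy probability 0.5 * 0.5 = 0.25 = 2.0 / 8.0,
# so the residual is zero up to floating-point error.
balanced = tb_loss(math.log(8.0), [math.log(0.5), math.log(0.5)], math.log(2.0))
```

The residual depends only on the sequence and the current model, not on which policy produced the sequence, and it contains no importance weights. That is the sense in which the objective can consume stale replay-buffer data.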
## Key Concepts
- [[tba|TBA]] — Trajectory Balance with Asynchrony framework
- [[trajectory-balance-objective]] — Off-policy TB objective from GFlowNets
- [[asynchronous-rl-llm]] — Decoupling exploration from learning
- [[off-policy-llm-post-training]] — Training on off-policy data
- [[gflownet-fine-tuning]] — GFlowNets for LLM fine-tuning
- [[replay-buffer-rl-llm]] — Replay buffer in LLM RL
- [[searcher-trainer-decoupling]] — Architecture pattern
- [[reward-recency-sampling]] — TBA's sampling strategy
- [[grpo]] — On-policy baseline
- [[rlvr-unified-framework]] — RLVR training paradigm