20260514: add new content

2026-05-14 13:54:52 +08:00
parent 56c4d3ef7c
commit b116710e4c
294 changed files with 10682 additions and 255 deletions
--- a/papers/bartoldson-tba-2025.md
+++ b/papers/bartoldson-tba-2025.md
@@ -0,0 +1,100 @@
+---
+title: "TBA (Trajectory Balance with Asynchrony): Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training"
+authors: ["Brian Bartoldson", "Siddarth Venkatraman", "James Diffenderfer", "Moksh Jain", "Tal Ben-Nun", "Seanie Lee", "Minsu Kim", "Johan Obando-Ceron", "Yoshua Bengio", "Bhavya Kailkhura"]
+year: 2025
+arxiv: "2503.18929"
+venue: "NeurIPS 2025"
+type: "paper"
+created: 2026-05-12
+tags: ["reinforcement-learning", "llm-post-training", "gflownet", "asynchronous-rl"]
+sources: ["https://arxiv.org/abs/2503.18929", "https://github.com/bbartoldson/TBA"]
+---
+
+# TBA: Trajectory Balance with Asynchrony (Decoupling Exploration and Learning)
+
+> **"Decoupling Exploration and Learning"**: uses GFlowNet's off-policy objective to reach 4×–50× training speedups.
+
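+## Core problem
+
+Standard on-policy RL methods (PPO, [[grpo|GRPO]], RLOO) have a **sequential bottleneck**: data generation and policy updates must run one after the other, so GPU utilization is low.
+
+Asynchronous RL can decouple the two, but training on off-policy data hurts performance: existing methods (Async DPO, Proximal RLOO) degrade markedly as the gap between the generating policy and the trained policy grows.
+
+## The TBA framework
+
+[[tba|TBA]] integrates the [[trajectory-balance-objective|Trajectory Balance (TB)]] objective from [[gflownet-fine-tuning|GFlowNet]] into an [[asynchronous-rl-llm|asynchronous distributed RL]] framework: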
+
+```
+┌────────────────────────────────────────────────────────┐
+│  SEARCHER nodes (N)                TRAINER node        │
+│  ┌──────────────────┐              ┌────────────────┐  │
+│  │ vLLM inference   │─────◇──────▶ │ Replay         │  │
+│  │ local policy πθ' │ trajectories │ Buffer         │  │
+│  │ reward scoring   │              │ (D_global)     │  │
+│  └──────────────────┘              │       ↓        │  │
+│       ↑ sync every k steps         │ TB loss update │  │
+│       └────────────────────────────┤ policy weights │  │
+│                                    └────────────────┘  │
+└────────────────────────────────────────────────────────┘
+```
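+
+The data flow in the diagram can be illustrated with a small bookkeeping simulation. This is a sequential sketch, not the released implementation: it tracks only policy version numbers (no model, no rewards), and the names `run_tba_loop`, `searcher_versions`, and `max_staleness` are hypothetical. It assumes each Searcher adds one trajectory per Trainer step, which a real asynchronous system would not guarantee.
+
+```python
+def run_tba_loop(num_searchers, trainer_steps, k):
+    """Simulate Searcher/Trainer version drift with a sync every k steps.
+
+    Returns the buffer of (searcher_id, policy_version_used) records and the
+    largest gap observed between the Trainer's version and a Searcher's copy.
+    """
+    trainer_version = 0
+    searcher_versions = [0] * num_searchers  # each Searcher's local copy
+    buffer = []                              # stands in for D_global
+    max_staleness = 0
+
+    for step in range(1, trainer_steps + 1):
+        # Searchers generate with whatever weights they currently hold.
+        for sid in range(num_searchers):
+            buffer.append((sid, searcher_versions[sid]))
+            staleness = trainer_version - searcher_versions[sid]
+            max_staleness = max(max_staleness, staleness)
+
+        # Trainer takes one update step on buffered (off-policy) data.
+        trainer_version += 1
+
+        # Every k Trainer steps, Searchers pull the latest weights.
+        if step % k == 0:
+            searcher_versions = [trainer_version] * num_searchers
+
+    return buffer, max_staleness
+```
+
+Under these assumptions the generating policy is never more than `k - 1` updates behind the Trainer, which is the off-policy gap the TB objective has to tolerate.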
+
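+### Key design choices
+
+**1. Searcher-Trainer decoupling**: Searchers generate responses continuously (without waiting for training) and the Trainer trains continuously (without waiting for generation); the two synchronize only once every k steps.
+
+**2. [[replay-buffer-rl-llm|Global Replay Buffer]]**: stores all historical trajectories (x, y, r); the Trainer samples from it for off-policy training.
+
+**3. [[reward-recency-sampling|Dual sampling strategy]]**: with probability m, sample the most recent data (recency), which approximates on-policy training; with probability 1−m, use reward-prioritized sampling, which explores high-reward regions.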
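+
+The dual sampling strategy in item 3 might look roughly like the sketch below. Several details are assumptions made for illustration, not taken from the paper: the coin flip is applied per draw, "recent" means the last `recent_window` entries, and reward prioritization uses softmax weights with a `temperature`. The function name and those two parameters are hypothetical.
+
+```python
+import math
+import random
+
+
+def sample_trajectory(buffer, m, rng, recent_window=4, temperature=1.0):
+    """Draw one (x, y, r) tuple from a non-empty buffer (oldest first).
+
+    With probability m: uniform over the most recent entries (recency).
+    Otherwise: probability proportional to exp(r / temperature)
+    (an assumed form of reward-prioritized sampling).
+    """
+    if rng.random() < m:
+        return rng.choice(buffer[-recent_window:])
+
+    # Subtract the max reward before exponentiating for numerical stability.
+    max_r = max(r for _, _, r in buffer)
+    weights = [math.exp((r - max_r) / temperature) for _, _, r in buffer]
+    return rng.choices(buffer, weights=weights, k=1)[0]
+```
+
+Setting `m = 1.0` reduces this to near-on-policy training on fresh data, while `m = 0.0` trains only on reward-prioritized draws from the whole history.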
+
+### TB objective
+
+$$L_{TB}(y, x; \theta) = \left(\log \frac{Z(x)\pi_\theta(y|x)}{R(y; x)}\right)^2$$
+
+where $R(y; x) = \pi_{ref}(y|x) \exp(\beta^{-1} r_\phi(y; x))$, and $Z(x)$ is not learned but replaced by a K-sample batch estimate (VarGrad).
+
+**Key property**: TB is **off-policy compatible**: during training, $y$ can be sampled from any distribution.
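+
+As a minimal sketch of the objective above, the following computes the TB loss for K responses to one prompt from plain log-probabilities. The batch estimate of $\log Z(x)$ is assumed here to be the mean of $\log R(y_k; x) - \log \pi_\theta(y_k|x)$ over the K samples, which is one common reading of the VarGrad estimate; the function names are hypothetical and no gradient computation is shown.
+
+```python
+def log_reward(logp_ref, reward, beta):
+    """log R(y; x) = log pi_ref(y|x) + r(y; x) / beta."""
+    return logp_ref + reward / beta
+
+
+def tb_loss_vargrad(logp_theta, logp_ref, rewards, beta):
+    """Mean TB loss over K responses to the same prompt x.
+
+    Inputs are equal-length lists of per-response values.
+    Returns (mean_loss, log_z_estimate).
+    """
+    log_r = [log_reward(lr, r, beta) for lr, r in zip(logp_ref, rewards)]
+
+    # Assumed K-sample estimate of log Z(x): mean of log R - log pi_theta.
+    gaps = [lr - lt for lr, lt in zip(log_r, logp_theta)]
+    log_z = sum(gaps) / len(gaps)
+
+    # Squared log-ratio from the TB formula, one term per response.
+    losses = [(log_z + lt - lr) ** 2 for lt, lr in zip(logp_theta, log_r)]
+    return sum(losses) / len(losses), log_z
+```
+
+With this estimate the mean loss equals the variance of the per-response gaps, so it is zero exactly when $\log \pi_\theta(y_k|x)$ differs from $\log R(y_k; x)$ by the same constant for every sample, regardless of which policy produced the $y_k$.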
+
+## Experimental results
+
+### Mathematical reasoning (GSM8K, RhoMath-1B)
+| Method | Speedup | Accuracy |
+|--------|---------|----------|
+| VinePPO | — | ~53% |
+| TBA | **50×** | **55%** |
+
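+### Preference fine-tuning (TL;DR, Pythia 410M)
+- With data that is 16 steps off-policy, TBA **outperforms on-policy Online DPO**
+- Defines a new KL vs. win-rate **Pareto frontier**
+
+### Automated red-teaming (GPT-2, Llama 3.2 1B)
+- TBA reaches SOTA on the diversity-toxicity Pareto frontier
+- Adding more Searchers steadily improves attack success rate and diversity
+
+### Large-scale model (MATH, Qwen 2.5 7B)
+- TBA′ **significantly outperforms Dr. GRPO** in a highly off-policy setting (data 10 steps stale)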
+
+## Concept map
+
+```
+TBA framework
+├── Algorithmic foundations
+│   ├── [[trajectory-balance-objective]]: off-policy TB objective
+│   │   └── derived from [[gflownet-fine-tuning|GFlowNet fine-tuning]]
+│   └── KL-regularized RL: π* ∝ π_ref · exp(r/β)
+├── System architecture
+│   ├── [[asynchronous-rl-llm]]: decoupled exploration and learning
+│   ├── [[searcher-trainer-decoupling]]: Searcher ↔ Trainer
+│   └── [[replay-buffer-rl-llm]]: global replay buffer
+├── Sampling strategy
+│   └── [[reward-recency-sampling]]: reward vs. recency
+└── Baselines
+    ├── [[grpo]]: on-policy baseline
+    └── [[off-policy-llm-post-training]]: off-policy RL paradigm
+```
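+
+A short derivation, using only the symbols from the TB objective section, of why the `π* ∝ π_ref · exp(r/β)` entry in the tree is a zero of the TB loss. This is the standard KL-regularized RL result, sketched here rather than quoted from the paper, and it takes $Z(x)$ to be the exact normalizer rather than the batch estimate:
+
+$$\pi^*(y|x) = \frac{\pi_{ref}(y|x)\exp(\beta^{-1} r_\phi(y; x))}{Z(x)} = \frac{R(y; x)}{Z(x)}, \qquad Z(x) = \sum_{y'} R(y'; x)$$
+
+$$L_{TB}(y, x; \theta)\Big|_{\pi_\theta = \pi^*} = \left(\log \frac{Z(x)\,R(y; x)/Z(x)}{R(y; x)}\right)^2 = (\log 1)^2 = 0$$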
+
+## Paper info
+
+- **arXiv**: [2503.18929](https://arxiv.org/abs/2503.18929)
+- **Code**: [bbartoldson/TBA](https://github.com/bbartoldson/TBA)
+- **Affiliations**: LLNL × Mila × Université de Montréal × KAIST × CIFAR
+- **Venue**: NeurIPS 2025