20260514: add new content

2026-05-14 13:54:52 +08:00
parent 56c4d3ef7c
commit b116710e4c
294 changed files with 10682 additions and 255 deletions
--- a/concepts/asynchronous-rl-llm.md
+++ b/concepts/asynchronous-rl-llm.md
@@ -0,0 +1,67 @@
+---
+title: "Asynchronous Reinforcement Learning and LLM Post-Training"
+created: 2026-05-12
+updated: 2026-05-12
+type: concept
+tags: ["reinforcement-learning", "llm-post-training", "distributed-systems"]
+sources: ["arxiv:2503.18929"]
+---
+
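+# Asynchronous Reinforcement Learning and LLM Post-Training
+
+**Asynchronous RL** decouples data generation (exploration) from policy updates (learning) so that the two can run **independently and in parallel**, which substantially improves compute utilization.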
+
+## Serial Bottleneck (On-Policy)
+
+The standard on-policy RL loop:
+```
+generate rollouts → compute rewards → update policy → generate rollouts → ...
+        ↑______________________________________________↓
+          regenerate after every update (serial wait)
+```
+
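+The bottleneck takes one of two forms:
+- **Generation-bound**: the trainer idles until inference finishes
+- **Training-bound**: inference idles until training finishes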
+
+## Asynchronous Architecture
+
+```
+Searcher 1 ────┐          ┌── Trainer
+Searcher 2 ────┤ Replay   │   ↓
+Searcher 3 ────┤ Buffer ──┤ TB Loss
+    ...         │          │ Policy Update
+Searcher N ────┘          └── ......
+    ↑ sync weights every k steps ↓
+    └────────────────────────────┘
+```
+
+Searchers and the Trainer **never wait on each other**; they exchange weights and data only at synchronization points.
+
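+## Key Challenges
+
+On-policy algorithms (PPO, GRPO, RLOO) are sensitive to **off-policyness**:
+- Async DPO degrades markedly as the gap between the data-generating policy and the current policy grows
+- Proximal RLOO mitigates this with importance-sampling (IS) ratio clipping but remains limited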
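+
+For reference, IS ratio clipping can be written in its familiar PPO-style form. This is a sketch under assumptions: the behavior policy $\pi_{\text{old}}$, the advantage estimate $\hat{A}$, and the clip range $\epsilon$ are notation introduced here, and the exact objective used by Proximal RLOO may differ.
+
+$$
+\rho(\theta) = \frac{\pi_\theta(y \mid x)}{\pi_{\text{old}}(y \mid x)}, \qquad
+\mathcal{L}_{\text{clip}}(\theta) = -\,\mathbb{E}\left[\min\left(\rho(\theta)\,\hat{A},\ \operatorname{clip}\left(\rho(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}\right)\right]
+$$
+
+Clipping bounds how far a single stale sample can move the update, which helps when the lag is small. As the lag grows, more samples fall outside the clip range and can contribute zero gradient, which is one way to read the "remains limited" caveat.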
+
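+## TBA's Solution
+
+[[tba|TBA]] replaces the on-policy objective with the [[trajectory-balance-objective|TB objective]]. TB is inherently off-policy compatible, so stale data remains efficiently usable even when it lags the current policy by many steps.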
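+
+As an illustration, one common way to write a trajectory-balance loss for a single prompt/response pair is given here. It assumes a KL-regularized target $\pi^*(y \mid x) \propto \pi_{\text{ref}}(y \mid x)\exp(r(x, y)/\beta)$; the reference policy $\pi_{\text{ref}}$, the temperature $\beta$, and the learned log-partition estimate $\log Z_\phi(x)$ are notation assumed for this sketch, not taken from this page.
+
+$$
+\mathcal{L}_{\text{TB}}(x, y; \theta, \phi) = \left(\log Z_\phi(x) + \log \pi_\theta(y \mid x) - \log \pi_{\text{ref}}(y \mid x) - \frac{r(x, y)}{\beta}\right)^2
+$$
+
+The loss vanishes for every $(x, y)$ when $\pi_\theta$ equals the target and $Z_\phi(x)$ is its normalizer, regardless of which policy sampled $y$. No importance ratio against the sampling policy appears, which is why responses drawn from an older policy can still be used.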
+
+**Experimental validation**: even in a 15-step off-policy setting, TBA still outperforms on-policy Online DPO.
+
+## Relation to Classic Distributed RL Methods
+
+| Method | Year | What is exchanged | LLM suitability |
+|--------|------|-------------------|-----------------|
+| A3C | 2016 | Gradients | ❌ Requires a value function |
+| IMPALA | 2018 | Trajectories (s, a, r) | ⚠️ V-trace requires V(s) |
+| TBA | 2025 | Trajectories (x, y, r) | ✅ TB needs no critic |
+
+## Related Concepts
+
+- [[tba|TBA]]: framework implementation
+- [[searcher-trainer-decoupling]]: architecture pattern
+- [[replay-buffer-rl-llm]]: buffer design
+- [[off-policy-llm-post-training]]: off-policy paradigm
+- [[bartoldson-tba-2025|Paper page]]