20260514: Add new content
concepts/asynchronous-rl-llm.md
@@ -0,0 +1,67 @@
---
title: "Asynchronous Reinforcement Learning and LLM Post-Training"
created: 2026-05-12
updated: 2026-05-12
type: concept
tags: ["reinforcement-learning", "llm-post-training", "distributed-systems"]
sources: ["arxiv:2503.18929"]
---

# Asynchronous Reinforcement Learning and LLM Post-Training

**Asynchronous RL** decouples data generation (exploration) from policy updates (learning) so that the two can run **independently and in parallel**, which substantially improves utilization of compute resources.
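
## Serial Bottleneck (On-Policy)

The standard on-policy RL loop:

```
generate rollouts → compute rewards → update policy → generate rollouts → ...
        ↑____________________________________↓
        rollouts are regenerated after every update (serial waiting)
```

The bottleneck takes one of two forms:

- **Generation-bound**: the trainer sits idle while it waits for inference (rollout generation) to finish
- **Training-bound**: the inference workers sit idle while they wait for the policy update to finish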
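
The cost of the serial loop can be illustrated with a small back-of-the-envelope model. This is only a sketch under idealized assumptions: the function names and the per-step timings are hypothetical, and it ignores weight-sync overhead and any loss of sample quality from stale data.

```python
def serial_time(steps: int, t_gen: float, t_train: float) -> float:
    """Wall-clock time when generation and training strictly alternate."""
    return steps * (t_gen + t_train)


def overlapped_time(steps: int, t_gen: float, t_train: float) -> float:
    """Idealized wall-clock time when the two phases overlap.

    Modeled as a two-stage pipeline: the first batch must be generated and
    the last batch must be trained on, and in between the slower phase
    sets the pace.
    """
    return t_gen + t_train + (steps - 1) * max(t_gen, t_train)


def speedup(steps: int, t_gen: float, t_train: float) -> float:
    """Ratio of serial to overlapped wall-clock time."""
    return serial_time(steps, t_gen, t_train) / overlapped_time(steps, t_gen, t_train)


if __name__ == "__main__":
    # Hypothetical generation-bound workload: generation is 3x slower than training.
    print(serial_time(100, 3.0, 1.0))
    print(overlapped_time(100, 3.0, 1.0))
    print(round(speedup(100, 3.0, 1.0), 3))
```

With 100 steps, 3 time units of generation, and 1 unit of training per step, the serial loop takes 400 units and the overlapped one 301. In this model the speedup is bounded by `(t_gen + t_train) / max(t_gen, t_train)`, so the gain is largest when the two phases take about the same time.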

## Asynchronous Architecture

```
Searcher 1 ────┐              ┌── Trainer
Searcher 2 ────┤    Replay    │      ↓
Searcher 3 ────┤    Buffer  ──┤   TB Loss
    ...        │              │   Policy Update
Searcher N ────┘              └── ......
     ↑     sync weights every k steps     ↓
     └────────────────────────────────────┘
```

Searchers and the Trainer **never wait for each other**; they exchange weights and data only at synchronization points.
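
The architecture above can be sketched with plain Python threads. This is a toy illustration, not the TBA implementation: `PolicyStore`, `ReplayBuffer`, `searcher`, `trainer`, and `run_demo` are hypothetical names, the "policy" is just a version counter, and a sleep stands in for generation latency. The trainer pauses only while the buffer is still warming up.

```python
import random
import threading
import time
from collections import deque


class PolicyStore:
    """Hypothetical stand-in for published model weights: a version counter."""

    def __init__(self):
        self._version = 0
        self._lock = threading.Lock()

    def publish(self, version):
        with self._lock:
            self._version = version

    def latest(self):
        with self._lock:
            return self._version


class ReplayBuffer:
    """Thread-safe bounded buffer; the oldest samples are evicted first."""

    def __init__(self, capacity):
        self._items = deque(maxlen=capacity)
        self._lock = threading.Lock()

    def add(self, item):
        with self._lock:
            self._items.append(item)

    def sample(self, batch_size, rng):
        with self._lock:
            if len(self._items) < batch_size:
                return []  # not enough data yet
            return rng.sample(list(self._items), batch_size)


def searcher(worker_id, store, buffer, stop):
    """Generate samples with whatever weights were last published."""
    rng = random.Random(worker_id)
    while not stop.is_set():
        version = store.latest()  # may lag behind the trainer
        buffer.add({"worker": worker_id, "version": version, "reward": rng.random()})
        time.sleep(0.001)  # stand-in for generation latency


def trainer(store, buffer, total_steps, sync_every, batch_size):
    """Consume batches and publish weights every `sync_every` steps."""
    rng = random.Random(0)
    step, max_lag = 0, 0
    while step < total_steps:
        batch = buffer.sample(batch_size, rng)
        if not batch:
            time.sleep(0.001)  # buffer still warming up
            continue
        # Lag = updates completed so far minus the version that produced the sample.
        max_lag = max(max_lag, max(step - s["version"] for s in batch))
        step += 1  # stand-in for one gradient update on the batch
        if step % sync_every == 0:
            store.publish(step)
    return max_lag


def run_demo(num_searchers=4, total_steps=50, sync_every=5):
    store, buffer, stop = PolicyStore(), ReplayBuffer(256), threading.Event()
    workers = [
        threading.Thread(target=searcher, args=(i, store, buffer, stop))
        for i in range(num_searchers)
    ]
    for w in workers:
        w.start()
    max_lag = trainer(store, buffer, total_steps, sync_every, batch_size=8)
    stop.set()
    for w in workers:
        w.join()
    return store.latest(), max_lag


if __name__ == "__main__":
    final_version, max_lag = run_demo()
    print("published version:", final_version, "max sample lag:", max_lag)
```

The observed lag depends on thread scheduling, which is the point: the trainer routinely consumes samples produced by older weights, so the training objective has to tolerate off-policy data.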
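
## Key Challenge

On-policy algorithms (PPO, GRPO, RLOO) are sensitive to **off-policyness**, that is, to the gap between the policy that generated the data and the policy being updated:

- Async DPO degrades markedly as the policy drift grows
- Proximal RLOO mitigates this with importance-sampling (IS) ratio clipping, but remains limited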
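
To see why ratio clipping only partly helps, consider a PPO-style clipped surrogate for a single sample. This is a generic sketch; whether Proximal RLOO uses exactly this form is an assumption, and the function names and `clip_eps` default are illustrative.

```python
import math


def importance_ratio(logp_current: float, logp_behavior: float) -> float:
    """pi_current(y|x) / pi_behavior(y|x), computed from log-probabilities."""
    return math.exp(logp_current - logp_behavior)


def clipped_surrogate(logp_current: float, logp_behavior: float,
                      advantage: float, clip_eps: float = 0.2) -> float:
    """PPO-style clipped objective for one sample (to be maximized)."""
    ratio = importance_ratio(logp_current, logp_behavior)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def is_clipped(logp_current: float, logp_behavior: float,
               advantage: float, clip_eps: float = 0.2) -> bool:
    """True when the clipped branch is active, i.e. the sample gives zero gradient."""
    ratio = importance_ratio(logp_current, logp_behavior)
    if advantage >= 0:
        return ratio > 1.0 + clip_eps
    return ratio < 1.0 - clip_eps


if __name__ == "__main__":
    # Fresh sample: behavior and current policy agree, ratio is 1.
    print(clipped_surrogate(-1.0, -1.0, advantage=2.0), is_clipped(-1.0, -1.0, 2.0))
    # Stale sample: the current policy has drifted, ratio is about 7.4.
    print(clipped_surrogate(-1.0, -3.0, advantage=2.0), is_clipped(-1.0, -3.0, 2.0))
```

Once the ratio leaves the clip range in the direction the advantage favors, the objective is flat in the current policy and the sample no longer contributes a gradient. The staler the data, the more samples fall into this regime, which is one way to read "mitigates but remains limited".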
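
## TBA's Solution

[[tba|TBA]] replaces the on-policy objective with the [[trajectory-balance-objective|TB objective]]. TB is inherently off-policy compatible, so stale data remains efficiently usable even when it lags the current policy by many steps.

**Experimental validation**: even in a 15-step off-policy setting, TBA still outperforms on-policy Online DPO.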
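
A minimal sketch of why a trajectory balance (TB) loss tolerates stale data: the loss depends only on the current policy, a reference policy, and the reward, with no importance ratio against the policy that generated the sample. The form below is one common TB variant for KL-regularized reward maximization, with `log Z` estimated from several responses to the same prompt; that this matches TBA's exact objective is an assumption, and all names here are hypothetical.

```python
def estimate_log_z(samples: list, beta: float) -> float:
    """Batch estimate of log Z(x) from K responses to the same prompt.

    Each sample holds `logp_policy`, `logp_ref`, and `reward` for one response.
    """
    terms = [s["logp_ref"] - s["logp_policy"] + s["reward"] / beta for s in samples]
    return sum(terms) / len(terms)


def tb_loss(samples: list, beta: float) -> float:
    """Mean squared TB residual over the responses to one prompt.

    residual = log Z + log pi_theta(y|x) - log pi_ref(y|x) - r(x, y) / beta
    Nothing here refers to the (possibly stale) policy that produced y.
    """
    log_z = estimate_log_z(samples, beta)
    residuals = [
        log_z + s["logp_policy"] - s["logp_ref"] - s["reward"] / beta
        for s in samples
    ]
    return sum(r * r for r in residuals) / len(residuals)


if __name__ == "__main__":
    # Policy already proportional to pi_ref * exp(reward / beta): zero loss.
    balanced = [
        {"logp_policy": -1.5, "logp_ref": -2.0, "reward": 1.0},
        {"logp_policy": -3.0, "logp_ref": -3.0, "reward": 0.5},
    ]
    # Policy that ignores the reward: positive loss.
    unbalanced = [
        {"logp_policy": -2.0, "logp_ref": -2.0, "reward": 1.0},
        {"logp_policy": -3.0, "logp_ref": -3.0, "reward": 0.5},
    ]
    print(tb_loss(balanced, beta=1.0))
    print(tb_loss(unbalanced, beta=1.0))
```

In a real training loop `logp_policy` would be recomputed under the current weights for each buffered response and the loss would be differentiated with respect to those weights; the plain floats here only illustrate the arithmetic.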

## Relationship to Classic Distributed RL Methods

| Method | Year | What is communicated | Suitability for LLMs |
|--------|------|----------------------|----------------------|
| A3C | 2016 | Gradients | ❌ Requires a value function |
| IMPALA | 2018 | Trajectories (s,a,r) | ⚠️ V-trace requires V(s) |
| TBA | 2025 | Trajectories (x,y,r) | ✅ TB needs no critic |

## Related Concepts

- [[tba|TBA]]: framework implementation
- [[searcher-trainer-decoupling]]: architectural pattern
- [[replay-buffer-rl-llm]]: buffer design
- [[off-policy-llm-post-training]]: off-policy paradigm
- [[bartoldson-tba-2025|Paper page]]