20260601
concepts/dual-space-rl.md
---
title: "Dual Space RL (DSRL)"
created: 2026-05-18
type: concept
tags: ["reinforcement-learning", "LLM", "policy-optimization", "GRPO"]
sources: ["https://arxiv.org/abs/2604.14142"]
---

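# Dual Space RL (DSRL)

## Definition

DSRL is a two-stage RL framework that unifies [[pre-train-space-reinforcement-learning|PreRL]] with standard RL through a [[policy-reincarnation|policy reincarnation]] strategy. With `s` the current training step and `S` the number of warmup steps:

1. **Phase 1 (s ≤ S)**: NSR-PreRL, which prunes incorrect reasoning paths in the pre-train space and broadens the reasoning horizon
2. **Phase 2 (s > S)**: standard GRPO, which performs fine-grained policy optimization in the post-train space
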
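The two-phase switch can be sketched as a small step-based selector. This is only an illustration: the function name `dsrl_phase`, the phase labels, and the 1-indexed step convention are assumptions, not taken from the source.

```python
def dsrl_phase(step: int, warmup_steps: int) -> str:
    """Return the (hypothetical) phase label for a training step.

    Assumes steps are counted from 1, so steps 1..warmup_steps are Phase 1.
    """
    if warmup_steps < 0:
        raise ValueError("warmup_steps must be non-negative")
    # Phase 1 while s <= S, Phase 2 once s > S.
    return "nsr_prerl" if step <= warmup_steps else "grpo"


# With S = 2, the first two steps use NSR-PreRL and later steps use GRPO.
schedule = [dsrl_phase(s, warmup_steps=2) for s in range(1, 5)]
print(schedule)  # ['nsr_prerl', 'nsr_prerl', 'grpo', 'grpo']
```
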
## Unified Formulation

```
∇J_DSRL = E[∑∇log π(y_t | x^{I[s>S]}, y_{<t}) · R(y) · I[s>S ∨ R(y)<0]]
```

- `x^{I[s>S]}`: `x` is masked in Phase 1 (pre-train space) and restored in Phase 2
- `I[s>S ∨ R(y)<0]`: Phase 1 updates only on negative samples (NSR), while Phase 2 uses all samples

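To make the two indicators concrete, here is a minimal sketch that computes, for one sampled response, whether `x` is visible and what coefficient multiplies the log-probability gradient. It follows the formula literally; the names `SampleUpdate` and `dsrl_sample_update` and the ±1 reward values are assumptions made for illustration, and a real GRPO implementation would typically use group-normalized advantages rather than the raw reward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleUpdate:
    condition_on_x: bool  # False: x is masked (pre-train space); True: x is restored
    weight: float         # R(y) * I[s > S or R(y) < 0]


def dsrl_sample_update(step: int, warmup_steps: int, reward: float) -> SampleUpdate:
    in_phase2 = step > warmup_steps  # I[s > S]
    keep = in_phase2 or reward < 0   # I[s > S or R(y) < 0]
    return SampleUpdate(condition_on_x=in_phase2, weight=reward if keep else 0.0)


rewards = [1.0, -1.0]  # assumed: +1 for a correct response, -1 for an incorrect one
phase1 = [dsrl_sample_update(step=5, warmup_steps=10, reward=r) for r in rewards]
phase2 = [dsrl_sample_update(step=11, warmup_steps=10, reward=r) for r in rewards]
print([u.weight for u in phase1])  # [0.0, -1.0]
print([u.weight for u in phase2])  # [1.0, -1.0]
```

In Phase 1 the positive sample contributes nothing and `x` is masked; in Phase 2 both samples contribute and `x` is restored.
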
## Key Results

### Main Results (Avg@32)

| Model | Baseline | GRPO | DSRL |
|------|----------|------|------|
| Qwen3-4B | 41.26 | 55.79 | **57.54** |
| Qwen3-8B | 41.62 | 57.00 | **58.47** |

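The gaps implied by the table can be recomputed directly. The snippet below is only arithmetic on the reported numbers; the dictionary layout and the helper name `gains` are illustrative.

```python
# Avg@32 scores copied from the table above.
AVG_AT_32 = {
    "Qwen3-4B": {"baseline": 41.26, "grpo": 55.79, "dsrl": 57.54},
    "Qwen3-8B": {"baseline": 41.62, "grpo": 57.00, "dsrl": 58.47},
}


def gains(scores: dict) -> dict:
    """Absolute DSRL improvement over GRPO and over the baseline, in points."""
    return {
        model: {
            "dsrl_vs_grpo": round(s["dsrl"] - s["grpo"], 2),
            "dsrl_vs_baseline": round(s["dsrl"] - s["baseline"], 2),
        }
        for model, s in scores.items()
    }


for model, g in gains(AVG_AT_32).items():
    print(model, g)
# Qwen3-4B {'dsrl_vs_grpo': 1.75, 'dsrl_vs_baseline': 16.28}
# Qwen3-8B {'dsrl_vs_grpo': 1.47, 'dsrl_vs_baseline': 16.85}
```
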
### Efficiency Gains

- Reaching 45% accuracy: **2.5×** fewer steps
- Reaching 58% accuracy: **1.6×** fewer steps

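As a reading aid, an "N× fewer steps" figure can be converted into a fraction of steps saved. This assumes the multiplier is the ratio of the comparison method's step count to DSRL's step count, which the note does not state explicitly; `step_reduction` is a hypothetical helper.

```python
def step_reduction(speedup: float) -> float:
    """Fraction of steps saved if a target is reached `speedup` times faster."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


for target, speedup in [("45%", 2.5), ("58%", 1.6)]:
    print(f"{target} accuracy: {step_reduction(speedup):.1%} fewer steps")
# 45% accuracy: 60.0% fewer steps
# 58% accuracy: 37.5% fewer steps
```
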
### OOD Generalization

- GPQA-Diamond: +3.79 (4B), +2.52 (8B)
- MMLU-Pro: +5.37 (4B), +4.32 (8B)
- HumanEval: +2.44 (8B)

## Warmup Step Ablation

Optimal range: S ∈ [10, 25] steps. Too few warmup steps (insufficient incentivization) or too many (over-exploration) both degrade performance.

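When configuring a run, a simple guard can flag a warmup length outside the range reported above. This is a sketch only: `REPORTED_WARMUP_RANGE` and `check_warmup_steps` are hypothetical names, and the range comes from this note's ablation summary rather than from any library default.

```python
REPORTED_WARMUP_RANGE = (10, 25)  # S range reported as optimal in the ablation


def check_warmup_steps(warmup_steps: int) -> str:
    """Classify S relative to the reported optimal range."""
    low, high = REPORTED_WARMUP_RANGE
    if warmup_steps < low:
        return "below"   # the note associates this with insufficient incentivization
    if warmup_steps > high:
        return "above"   # the note associates this with over-exploration
    return "within"


print(check_warmup_steps(5), check_warmup_steps(15), check_warmup_steps(40))
# below within above
```
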
## Evolution of Reasoning Behaviors

The NSR-PreRL phase elicits several reasoning patterns:

- Subgoal Setting
- Enumeration
- Verification
- Backtracking

All of these patterns reach a higher frequency ceiling under DSRL.

## Related Concepts

- [[pre-train-space-reinforcement-learning|PreRL]]
- [[post-train-space-rl|Post-train Space RL]]
- [[negative-sample-reinforcement|NSR]]
- [[policy-reincarnation|policy reincarnation]]
- [[endogenous-reasoning|endogenous reasoning]]