---
title: "Dual Space RL (DSRL)"
created: 2026-05-18
type: concept
tags: ["reinforcement-learning", "LLM", "policy-optimization", "GRPO"]
sources: ["https://arxiv.org/abs/2604.14142"]
---

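# Dual Space RL (DSRL)

## Definition

DSRL is a two-phase RL framework that unifies [[pre-train-space-reinforcement-learning|PreRL]] with standard RL through a [[policy-reincarnation|policy reincarnation]] strategy. With s denoting the current training step and S the number of warmup steps, the phases are:

1. **Phase 1 (s ≤ S)**: NSR-PreRL, which prunes incorrect reasoning paths in the pre-train space and broadens the reasoning horizon
2. **Phase 2 (s > S)**: standard GRPO, which performs fine-grained policy optimization in the post-train space
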
## Unified formulation

```
∇J_DSRL = E[∑∇log π(y_t | x^{I[s>S]}, y_{<t}) · R(y) · I[s>S ∨ R(y)<0]]
```

- `x^{I[s>S]}`: the prompt x is masked during Phase 1 (pre-train space) and restored during Phase 2
- `I[s>S ∨ R(y)<0]`: Phase 1 updates only on negative samples (NSR), while Phase 2 uses all samples
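
The two indicator terms can be read as a per-step switch and a per-sample gate. The following is a minimal sketch of that reading in plain Python, not the authors' implementation. The names `Sample`, `conditions_on_prompt`, `update_mask`, and `dsrl_surrogate` are hypothetical, R(y) is treated as an already-computed scalar per response, and the token log-probabilities are assumed to have been computed by the caller with or without the prompt x as `conditions_on_prompt` dictates. In a real training loop the log-probabilities would come from an autodiff framework, so that differentiating the surrogate yields the gradient estimate.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Sample:
    token_logprobs: Sequence[float]  # log pi(y_t | context, y_<t) for each token
    reward: float                    # scalar R(y) for the whole response


def conditions_on_prompt(step: int, warmup_steps: int) -> bool:
    """Exponent I[s > S]: False in Phase 1 (x masked), True in Phase 2."""
    return step > warmup_steps


def update_mask(step: int, warmup_steps: int, reward: float) -> bool:
    """Indicator I[s > S or R(y) < 0]: which samples contribute to the update."""
    return step > warmup_steps or reward < 0


def dsrl_surrogate(samples: Sequence[Sample], step: int, warmup_steps: int) -> float:
    """Batch average of sum_t log pi(y_t | ...) * R(y) * I[s > S or R(y) < 0]."""
    total = 0.0
    for sample in samples:
        if update_mask(step, warmup_steps, sample.reward):
            total += sample.reward * sum(sample.token_logprobs)
    return total / len(samples)


# Hypothetical toy batch: one positive and one negative response.
batch = [Sample([-0.5, -0.5], 1.0), Sample([-1.0, -2.0], -1.0)]
print(dsrl_surrogate(batch, step=5, warmup_steps=10))   # Phase 1: 1.5 (negative sample only)
print(dsrl_surrogate(batch, step=11, warmup_steps=10))  # Phase 2: 1.0 (both samples)
```

In Phase 1 the positive sample is dropped entirely, so only the negative response pushes probability mass away from its tokens. After step S both samples contribute, which matches the standard policy-gradient form.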

## Key results

### Main Results (Avg@32)

| Model | Baseline | GRPO | DSRL |
|-------|----------|------|------|
| Qwen3-4B | 41.26 | 55.79 | **57.54** |
| Qwen3-8B | 41.62 | 57.00 | **58.47** |
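
To make the margins in the table explicit, the short sketch below computes DSRL's absolute gain in Avg@32 points over GRPO and over the baseline. The scores are copied from the table; the dictionary layout and the `gains` helper are illustrative names, not from the source.

```python
# Avg@32 scores copied from the table above.
scores = {
    "Qwen3-4B": {"baseline": 41.26, "grpo": 55.79, "dsrl": 57.54},
    "Qwen3-8B": {"baseline": 41.62, "grpo": 57.00, "dsrl": 58.47},
}


def gains(row):
    """Absolute DSRL gain (in points) over GRPO and over the baseline."""
    return {
        "over_grpo": round(row["dsrl"] - row["grpo"], 2),
        "over_baseline": round(row["dsrl"] - row["baseline"], 2),
    }


for model, row in scores.items():
    print(model, gains(row))
```

DSRL's margin over GRPO comes to 1.75 points for Qwen3-4B and 1.47 points for Qwen3-8B, against gains of 16.28 and 16.85 points over the respective baselines.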

### Efficiency gains
- Reaching 45% accuracy: **2.5×** fewer steps
- Reaching 58% accuracy: **1.6×** fewer steps

### OOD generalization
- GPQA-Diamond: +3.79 (4B), +2.52 (8B)
- MMLU-Pro: +5.37 (4B), +4.32 (8B)
- HumanEval: +2.44 (8B)

## Warmup-step ablation

Optimal range: S ∈ [10, 25] steps. Too few warmup steps (reasoning is insufficiently incentivized) or too many (over-exploration) both degrade performance.

## Evolution of reasoning behaviors

The NSR-PreRL phase elicits several reasoning patterns:
- Subgoal Setting
- Enumeration
- Verification
- Backtracking

All of these patterns reach a higher frequency ceiling under DSRL.

## Related concepts

- [[pre-train-space-reinforcement-learning|PreRL]]
- [[post-train-space-rl|Post-train Space RL]]
- [[negative-sample-reinforcement|NSR]]
- [[policy-reincarnation|Policy reincarnation]]
- [[endogenous-reasoning|Endogenous reasoning]]