---
title: "Dual Space RL (DSRL)"
created: 2026-05-18
type: concept
tags: ["reinforcement-learning", "LLM", "policy-optimization", "GRPO"]
sources: ["https://arxiv.org/abs/2604.14142"]
---
# Dual Space RL (DSRL)
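## Definition
DSRL is a two-phase RL framework that unifies [[pre-train-space-reinforcement-learning|PreRL]] with standard RL through a [[policy-reincarnation|policy reincarnation]] strategy, where s is the current training step and S is the number of warmup steps:
1. **Phase 1 (s ≤ S)**: NSR-PreRL, which prunes incorrect reasoning paths in the pre-train space and broadens the reasoning horizon
2. **Phase 2 (s > S)**: standard GRPO, which performs fine-grained policy optimization in the post-train space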
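The phase switch can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `dsrl_phase`, the phase labels, and 1-based step counting are hypothetical and do not come from the source.

```python
def dsrl_phase(step: int, warmup_steps: int) -> str:
    """Return the DSRL phase for training step `step` (s), given warmup length `warmup_steps` (S).

    Assumes steps are counted from 1, so steps 1..S fall in Phase 1.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    if warmup_steps < 0:
        raise ValueError("warmup_steps must be >= 0")
    # Phase 1 (s <= S): NSR-PreRL; Phase 2 (s > S): standard GRPO.
    return "nsr_prerl" if step <= warmup_steps else "grpo"


# Toy schedule with S = 3 over five steps.
schedule = [dsrl_phase(s, warmup_steps=3) for s in range(1, 6)]
print(schedule)
```

With `warmup_steps=3`, the first three steps map to NSR-PreRL and the remaining steps to GRPO.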
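## Unified Formula
```
∇J_DSRL = E[∑_t ∇log π(y_t | x^{I[s>S]}, y_{<t}) · R(y) · I[s>S ∨ R(y)<0]]
```
- `x^{I[s>S]}`: in Phase 1 the prompt x is masked (pre-train space); in Phase 2 x is restored
- `I[s>S ∨ R(y)<0]`: in Phase 1 only negative samples are updated (NSR); in Phase 2 all samples are used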
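The two indicator terms can be read as a pair of small gating functions. The sketch below is a hedged illustration, not the paper's implementation: `Rollout`, `context_for`, and `gradient_weight` are hypothetical names, and it assumes that "masking x" means dropping the prompt from the conditioning context.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    prompt: str      # x
    response: str    # y
    reward: float    # R(y); negative for an incorrect sample


def context_for(rollout: Rollout, step: int, warmup_steps: int) -> str:
    """x^{I[s>S]}: return an empty context in Phase 1, the prompt in Phase 2.

    Assumption: masking is modelled here as an empty string.
    """
    return rollout.prompt if step > warmup_steps else ""


def gradient_weight(reward: float, step: int, warmup_steps: int) -> float:
    """R(y) * I[s > S or R(y) < 0], the scalar that multiplies the log-prob gradient."""
    active = step > warmup_steps or reward < 0
    return reward if active else 0.0


good = Rollout(prompt="2+2=?", response="4", reward=1.0)
bad = Rollout(prompt="2+2=?", response="5", reward=-1.0)

# Phase 1 (s=5, S=10): only the negative sample contributes, and the prompt is masked.
print(gradient_weight(good.reward, 5, 10), gradient_weight(bad.reward, 5, 10))
# Phase 2 (s=11, S=10): both samples contribute, and the prompt is restored.
print(gradient_weight(good.reward, 11, 10), context_for(good, 11, 10))
```

In an actual trainer this weight would multiply the summed token log-probabilities of each response; that part is omitted because the source does not specify it.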
## Key Results
### Main Results (Avg@32)
| Model | Baseline | GRPO | DSRL |
|------|----------|------|------|
| Qwen3-4B | 41.26 | 55.79 | **57.54** |
| Qwen3-8B | 41.62 | 57.00 | **58.47** |
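To make the margins explicit, the snippet below recomputes the DSRL gains from the table values. The differences are derived here for illustration and are not reported separately in the source; the helper name `gain` is hypothetical.

```python
# Avg@32 scores copied from the table above.
results = {
    "Qwen3-4B": {"Baseline": 41.26, "GRPO": 55.79, "DSRL": 57.54},
    "Qwen3-8B": {"Baseline": 41.62, "GRPO": 57.00, "DSRL": 58.47},
}


def gain(model: str, over: str) -> float:
    """Absolute Avg@32 improvement of DSRL over the `over` column, rounded to 2 decimals."""
    row = results[model]
    return round(row["DSRL"] - row[over], 2)


for model in results:
    print(model, "vs GRPO:", gain(model, "GRPO"), "| vs Baseline:", gain(model, "Baseline"))
```

In points, DSRL leads GRPO by 1.75 (4B) and 1.47 (8B), and leads the baseline by 16.28 (4B) and 16.85 (8B).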
### Efficiency Gains
- Reaching 45% accuracy: **2.5×** fewer steps
- Reaching 58% accuracy: **1.6×** fewer steps
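One way to read these factors, assuming "k× fewer steps" means the reference method needs k times as many steps as DSRL, is sketched below. The reference step count of 1000 is a made-up number for illustration only, and `steps_with_speedup` is a hypothetical helper.

```python
def steps_with_speedup(reference_steps: float, speedup: float) -> float:
    """Steps needed when the reference method takes `reference_steps` and the speedup factor is `speedup`."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return reference_steps / speedup


# Hypothetical reference budget, not a figure from the source.
reference_steps = 1000
print(round(steps_with_speedup(reference_steps, 2.5)))  # target: 45% accuracy
print(round(steps_with_speedup(reference_steps, 1.6)))  # target: 58% accuracy
```

Under that reading, a 1000-step reference run would correspond to 400 and 625 DSRL steps respectively.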
### OOD Generalization
- GPQA-Diamond: +3.79 (4B), +2.52 (8B)
- MMLU-Pro: +5.37 (4B), +4.32 (8B)
- HumanEval: +2.44 (8B)
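## Warmup Step Ablation
Optimal range: S ∈ [10, 25] steps. Too few warmup steps (insufficient incentive) or too many (over-exploration) both degrade performance.
## Evolution of Reasoning Behaviors
The NSR-PreRL phase elicits several reasoning patterns: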
- Subgoal Setting
- Enumeration
- Verification
- Backtracking
All of these patterns reach a higher frequency ceiling under DSRL.
## Related Concepts
- [[pre-train-space-reinforcement-learning|PreRL]]
- [[post-train-space-rl|Post-train Space RL]]
- [[negative-sample-reinforcement|NSR]]
- [[policy-reincarnation|Policy Reincarnation]]
- [[endogenous-reasoning|Endogenous Reasoning]]