myWiki/papers/pre-train-space-reinforcement-learning.md
2026-06-01 10:46:01 +08:00

---
title: "Pre-train Space Reinforcement Learning (PreRL/DSRL)"
arxiv: "2604.14142"
authors: ["Yuqiao Tan", "Minzheng Wang", "Bo Liu", "Zichen Liu", "Tian Liang", "Shizhu He", "Jun Zhao", "Kang Liu"]
venue: "arXiv"
date: "2026-04-15"
created: "2026-05-18"
type: paper
tags: ["reinforcement-learning", "pre-training", "LLM-reasoning", "GRPO", "policy-optimization"]
sources: ["https://arxiv.org/abs/2604.14142"]
---
# Pre-train Space Reinforcement Learning (PreRL / DSRL)
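**From P(y|x) to P(y): Studying Reinforcement Learning in the Pre-train Space**
## Core Problem
Standard RLVR (e.g., GRPO) improves LLM reasoning by optimizing the conditional distribution P(y|x), but its ceiling is bounded by the base model's existing output distribution. PreRL proposes optimizing the marginal distribution P(y) directly in the **pre-train space**, expanding the foundation of reasoning ability at its root.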
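## Methodological Contributions
### 1. Pre-train Space RL (PreRL)
PreRL changes the RL objective from P(y|x) to P(y) by **masking the input condition x** during the gradient update. The core theoretical support is [[gradient-alignment|gradient alignment]]: the paper shows that the inner product of the gradients of log P(y) and log P(y|x) is always non-negative (mean +9.2), so optimizing the marginal distribution effectively improves the conditional policy.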
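
As a rough illustration of the gradient-alignment quantity, the sketch below computes the inner product between the gradient of log P(y) (prompt masked) and the gradient of log P(y|x) (prompt visible) for a toy bag-of-context softmax model. The model, the token ids, and the names `grad_log_prob` and `alignment` are assumptions made for this example and are not taken from the paper. The non-negativity result is the paper's claim about LLMs; this toy only shows how such an inner product can be measured.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(bias, weight, x, y, mask_x):
    """Gradient of sum_t log pi(y_t | context) for a toy bag-of-context model.

    logit[v] = bias[v] + sum of weight[c][v] over context tokens c.
    With mask_x=True the prompt x is dropped from the context (pre-train space);
    otherwise the context is x followed by y_<t (post-train space).
    """
    vocab = len(bias)
    g_bias = [0.0] * vocab
    g_weight = [[0.0] * vocab for _ in range(vocab)]
    for t, target in enumerate(y):
        context = ([] if mask_x else list(x)) + list(y[:t])
        logits = [bias[v] + sum(weight[c][v] for c in context) for v in range(vocab)]
        probs = softmax(logits)
        # d log softmax / d logit = one_hot(target) - probs
        delta = [(1.0 if v == target else 0.0) - probs[v] for v in range(vocab)]
        for v in range(vocab):
            g_bias[v] += delta[v]
            for c in context:
                g_weight[c][v] += delta[v]
    # Flatten all parameters into one vector so gradients can be compared.
    return g_bias + [g for row in g_weight for g in row]

def inner(a, b):
    return sum(p * q for p, q in zip(a, b))

VOCAB = 6
bias = [0.0] * VOCAB                           # toy parameters, all zero
weight = [[0.0] * VOCAB for _ in range(VOCAB)]
x = [0, 1]       # hypothetical prompt tokens
y = [2, 3, 2]    # hypothetical response tokens

g_marginal = grad_log_prob(bias, weight, x, y, mask_x=True)
g_conditional = grad_log_prob(bias, weight, x, y, mask_x=False)
alignment = inner(g_marginal, g_conditional)
```

In this zero-initialized toy the two gradients coincide on every parameter the masked pass touches, so `alignment` equals the squared norm of `g_marginal`. With a real model the value would have to be measured empirically over sampled (x, y) pairs.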
### 2. Negative Sample Reinforcement (NSR)
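Dissecting the roles of positive and negative samples in PreRL reveals a key asymmetry:
- **PSR (positive sample reinforcement)** degenerates into on-policy collapse in the pre-train space
- **NSR (negative sample reinforcement)** prunes incorrect reasoning paths and elicits [[endogenous-reasoning|endogenous reasoning]]: transition and reflection thoughts increase by **14.89×** and **6.54×**, respectively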
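
The PSR/NSR split can be sketched as a per-sample weight on the policy gradient: PSR keeps only positively rewarded samples, NSR keeps only negatively rewarded ones. The example below is a minimal sketch on a four-way categorical policy; the ±1 reward convention, the logits, and the function names are assumptions for illustration, not the paper's implementation.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_weight(reward, mode):
    """Weight on grad log pi: PSR keeps only R > 0, NSR keeps only R < 0."""
    if mode == "psr":
        return reward if reward > 0 else 0.0
    if mode == "nsr":
        return reward if reward < 0 else 0.0
    raise ValueError(f"unknown mode: {mode}")

def policy_gradient_step(logits, sampled, reward, mode, lr=1.0):
    """One ascent step on weight * log softmax(logits)[sampled]."""
    probs = softmax(logits)
    w = sample_weight(reward, mode)
    return [
        z + lr * w * ((1.0 if i == sampled else 0.0) - probs[i])
        for i, z in enumerate(logits)
    ]

logits = [2.0, 1.0, 0.5, 0.0]   # hypothetical logits over four reasoning paths
before = softmax(logits)

# Path 0 was sampled and judged wrong (reward -1): NSR pushes it down.
after = softmax(policy_gradient_step(logits, sampled=0, reward=-1.0, mode="nsr"))

# The same negative sample carries zero weight under PSR, so nothing changes.
unchanged = policy_gradient_step(logits, sampled=0, reward=-1.0, mode="psr")
```

After the NSR step the probability of the penalized path drops and the freed mass moves to the remaining paths, which is one way to picture the "pruning" described above.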
### 3. Dual Space RL (DSRL)
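DSRL adopts a [[policy-reincarnation|policy reincarnation]] strategy: first use NSR-PreRL to expand the reasoning horizon (eliminating fundamental errors), then switch to standard RL for fine-grained optimization. This is formalized as phase switching via a conditional mask:
$$
\nabla J_{\mathrm{DSRL}} = \mathbb{E}\left[\sum_t \nabla \log \pi\left(y_t \mid x^{\mathbb{I}[s>S]},\, y_{<t}\right) \cdot R(y) \cdot \mathbb{I}\left[s>S \,\lor\, R(y)<0\right]\right]
$$
Read against the two phases, s is the current training step and S the switch step: while s ≤ S the prompt x is masked and only negative samples (R(y) < 0) contribute; once s > S the prompt is restored and all samples contribute.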
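
The phase-switching rule can be sketched as a small function that decides, for each sampled response, whether the prompt is visible and what weight multiplies the gradient. This assumes the two conditions inside the final indicator are joined by a logical OR, which matches the "NSR-PreRL first, standard RL afterwards" description; `dsrl_update_rule` and `switch_step` are hypothetical names.

```python
def dsrl_update_rule(step, switch_step, reward):
    """Return (prompt_visible, weight) for one sampled response.

    prompt_visible mirrors the exponent I[s > S] on x.
    weight mirrors R(y) * I[s > S or R(y) < 0] (the OR is an assumption).
    """
    post_train_phase = step > switch_step
    prompt_visible = post_train_phase            # x is masked during the first phase
    active = post_train_phase or reward < 0      # first phase: negative samples only
    weight = reward if active else 0.0
    return prompt_visible, weight

# Phase 1 (s <= S): prompt masked, positive samples are ignored.
phase1_positive = dsrl_update_rule(step=10, switch_step=100, reward=1.0)
phase1_negative = dsrl_update_rule(step=10, switch_step=100, reward=-1.0)

# Phase 2 (s > S): prompt visible, every sample contributes.
phase2_positive = dsrl_update_rule(step=101, switch_step=100, reward=1.0)
phase2_negative = dsrl_update_rule(step=101, switch_step=100, reward=-1.0)
```

Keeping both phases behind one rule means a training loop only needs the step counter to move from NSR-PreRL to standard RL, with no change to the loss code.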
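## Key Findings
- DSRL outperforms GRPO/PPO/DAPO/Dr.GRPO across the board on Qwen3-4B/8B
- AIME24: +4.69, AIME25: +2.50 (Qwen3-4B)
- OOD generalization: GPQA-Diamond +3.79, MMLU-Pro +5.37
- Sample efficiency: reaches the same accuracy with 1.6×-2.5× fewer training steps
- Pass@K is better than GRPO at every value of K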
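
This note does not record how the paper computes Pass@K. A commonly used choice is the unbiased estimator 1 - C(n-c, k) / C(n, k), where n responses are sampled per problem and c of them are correct; the sketch below assumes that estimator, and `pass_at_k` is a hypothetical helper name.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct ones.

    Equals the probability that a random size-k subset of the n samples
    contains at least one correct sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples for one problem, 3 of them correct.
p1 = pass_at_k(10, 3, 1)
p5 = pass_at_k(10, 3, 5)
```

A per-benchmark Pass@K score is then the mean of this value over all problems.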
## Concept Network
- [[pre-train-space-reinforcement-learning|PreRL]] · [[post-train-space-rl|Post-train Space RL]] · [[dual-space-rl|DSRL]]
- [[negative-sample-reinforcement|NSR]] · [[positive-sample-reinforcement|PSR]]
- [[gradient-alignment|Gradient Alignment]] · [[shared-parameter-influence|Shared-Parameter Influence]]
- [[policy-reincarnation|Policy Reincarnation]] · [[endogenous-reasoning|Endogenous Reasoning]]
- [[distribution-shift|Distribution Shift]] · [[on-policy-learning-collapse|On-policy Collapse]]