20260601
53
papers/pre-train-space-reinforcement-learning.md
Normal file
@@ -0,0 +1,53 @@
---
title: "Pre-train Space Reinforcement Learning (PreRL/DSRL)"
arxiv: "2604.14142"
authors: ["Yuqiao Tan", "Minzheng Wang", "Bo Liu", "Zichen Liu", "Tian Liang", "Shizhu He", "Jun Zhao", "Kang Liu"]
venue: "arXiv"
date: "2026-04-15"
created: "2026-05-18"
type: paper
tags: ["reinforcement-learning", "pre-training", "LLM-reasoning", "GRPO", "policy-optimization"]
sources: ["https://arxiv.org/abs/2604.14142"]
---

# Pre-train Space Reinforcement Learning (PreRL / DSRL)

**From P(y|x) to P(y): Studying Reinforcement Learning in the Pre-train Space**
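
## Core Problem

Standard RLVR (e.g., GRPO) improves LLM reasoning by optimizing the conditional distribution P(y|x), but its ceiling is bounded by the base model's existing output distribution. PreRL proposes optimizing the marginal distribution P(y) directly in the **pre-train space**, expanding the foundation of reasoning ability at its root.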
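
## Methodological Contributions

### 1. Pre-train Space RL (PreRL)

PreRL changes the RL optimization target from P(y|x) to P(y) by **masking the input condition x** during gradient updates. The core theoretical support is [[gradient-alignment|gradient alignment]]: the inner product of the gradients of log P(y) and log P(y|x) is shown to be always non-negative (mean +9.2), so optimizing the marginal distribution can effectively improve the conditional policy.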
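
The alignment idea can be checked by hand in a toy setting. The sketch below is not the paper's proof or model: it assumes a hypothetical softmax policy whose logits are a shared bias plus a condition-dependent term, and it treats masking x as zeroing that term. All function names are illustrative. In this toy model, writing p for the masked probabilities and q for the conditional ones, the inner product over the shared bias equals (1 - p_y)(1 - q_y) + Σ_{j≠y} p_j q_j, which cannot be negative.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(shared_bias, cond_term, y):
    """Gradient of log softmax(shared_bias + cond_term)[y] w.r.t. shared_bias."""
    probs = softmax([b + c for b, c in zip(shared_bias, cond_term)])
    return [(1.0 if j == y else 0.0) - p for j, p in enumerate(probs)]


def alignment(shared_bias, cond_term, y):
    """Inner product of the masked-x and conditional gradients (toy model)."""
    masked = [0.0] * len(shared_bias)  # assumption: masking x zeroes its term
    g_marginal = grad_log_prob(shared_bias, masked, y)
    g_conditional = grad_log_prob(shared_bias, cond_term, y)
    return sum(a * b for a, b in zip(g_marginal, g_conditional))


if __name__ == "__main__":
    rng = random.Random(0)
    values = []
    for _ in range(1000):
        bias = [rng.gauss(0, 2) for _ in range(5)]
        cond = [rng.gauss(0, 2) for _ in range(5)]
        values.append(alignment(bias, cond, rng.randrange(5)))
    # Reports whether every sampled inner product is non-negative, and the mean.
    print(min(values) >= 0.0, sum(values) / len(values))
```

The toy only covers parameters shared between the masked and conditional forward passes; the +9.2 mean quoted above is a measurement on real models and is not reproduced by this sketch.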
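
### 2. Negative Sample Reinforcement (NSR)

Dissecting the roles of positive and negative samples in PreRL reveals a key asymmetry:

- **PSR (Positive Sample Reinforcement)** degenerates into on-policy collapse in the pre-train space
- **NSR (Negative Sample Reinforcement)** prunes incorrect reasoning paths and thereby elicits [[endogenous-reasoning|endogenous reasoning ability]]; transition and reflection thoughts increase by **14.89×** and **6.54×**, respectively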
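
The PSR/NSR split can be sketched as a sign filter on rewards. This is a minimal illustration, assuming a ±1 verifiable reward and a plain REINFORCE-style surrogate; the function names are hypothetical, and the paper's actual loss (normalization, advantages, clipping) is not specified in this note.

```python
def split_by_sign(rewards):
    """Return (psr_weights, nsr_weights): each keeps only one sign of reward."""
    psr = [r if r > 0 else 0.0 for r in rewards]
    nsr = [r if r < 0 else 0.0 for r in rewards]
    return psr, nsr


def surrogate_objective(log_probs, weights):
    """Mean of weight * log-prob; its gradient is a REINFORCE-style update."""
    return sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)


if __name__ == "__main__":
    rewards = [1.0, -1.0, -1.0, 1.0]       # assumed: +1 correct, -1 incorrect
    log_probs = [-2.0, -1.0, -3.0, -0.5]   # made-up sequence log-probabilities
    psr, nsr = split_by_sign(rewards)
    print(surrogate_objective(log_probs, psr))
    print(surrogate_objective(log_probs, nsr))
```

Because the NSR weights are negative, maximizing the NSR surrogate lowers the log-probability of incorrect responses and leaves correct ones untouched, which matches the pruning behavior described above.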
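
### 3. Dual Space RL (DSRL)

DSRL adopts a [[policy-reincarnation|policy reincarnation]] strategy: it first uses NSR-PreRL to expand the reasoning horizon (eliminating fundamental errors), then switches to standard RL for fine-grained optimization. This is formalized as phase switching via a conditional mask:
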
∇J_DSRL = E[∑∇log π(y_t | x^{I[s>S]}, y_{<t}) · R(y) · I[s>S ∨ R(y)<0]]
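
The two indicators in the formula can be sketched as a small gating function. This assumes s is the current training step and S the switch step, a reading this note does not state explicitly, and it interprets x^{I[s>S]} as "x is visible only once s > S". The name `dsrl_gate` is hypothetical.

```python
def dsrl_gate(step, switch_step, reward):
    """Return (condition_visible, sample_weight) for one sampled response.

    condition_visible mirrors I[s > S]: the prompt x is masked before the switch.
    sample_weight mirrors R(y) * I[s > S or R(y) < 0].
    """
    condition_visible = step > switch_step
    keep = condition_visible or reward < 0
    return condition_visible, (reward if keep else 0.0)


if __name__ == "__main__":
    for step in (10, 101):
        for reward in (1.0, -1.0):
            print(step, reward, dsrl_gate(step, switch_step=100, reward=reward))
```

Before the switch, only negative-reward samples contribute and x is masked (NSR-PreRL); after it, every sample contributes with x visible (standard RL).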

## Key Findings

- DSRL outperforms GRPO/PPO/DAPO/Dr.GRPO across the board on Qwen3-4B/8B
- AIME24: +4.69, AIME25: +2.50 (Qwen3-4B)
- OOD generalization: GPQA-Diamond +3.79, MMLU-Pro +5.37
- Sample efficiency: reaches the same accuracy with 1.6×-2.5× fewer training steps
- Pass@K is better than GRPO at every value of K
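
For reference, Pass@K is often computed with the unbiased estimator 1 - C(n - c, k) / C(n, k), where n responses are sampled per problem and c of them are correct. This note does not say which estimator the paper used, so the sketch below is an assumption about the metric, not a description of the paper's evaluation code.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, so the estimate becomes 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Made-up counts: 10 samples for one problem, 3 of them correct.
    for k in (1, 4, 8):
        print(k, pass_at_k(n=10, c=3, k=k))
```

The per-problem estimates are then averaged over the benchmark.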

## Concept Network

- [[pre-train-space-reinforcement-learning|PreRL]] · [[post-train-space-rl|Post-train Space RL]] · [[dual-space-rl|DSRL]]
- [[negative-sample-reinforcement|NSR]] · [[positive-sample-reinforcement|PSR]]
- [[gradient-alignment|gradient alignment]] · [[shared-parameter-influence|shared-parameter influence]]
- [[policy-reincarnation|policy reincarnation]] · [[endogenous-reasoning|endogenous reasoning]]
- [[distribution-shift|distribution shift]] · [[on-policy-learning-collapse|On-policy Collapse]]