20260601
raw/papers/pre-train-space-reinforcement-learning-2026.md
Normal file
@@ -0,0 +1,27 @@
---
title: "Pre-train Space Reinforcement Learning: From P(y|x) to P(y)"
arxiv: "2604.14142"
authors: ["Yuqiao Tan", "Minzheng Wang", "Bo Liu", "Zichen Liu", "Tian Liang", "Shizhu He", "Jun Zhao", "Kang Liu"]
venue: "arXiv preprint"
date: "2026-04-15"
type: paper
tags: ["reinforcement-learning", "pre-training", "LLM", "reasoning", "GRPO"]
---

# Pre-train Space Reinforcement Learning

> **arXiv**: [2604.14142](https://arxiv.org/abs/2604.14142)
> **Authors**: Yuqiao Tan¹²*, Minzheng Wang¹²*, Bo Liu³, Zichen Liu³, Tian Liang⁴, Shizhu He¹²†, Jun Zhao¹², Kang Liu¹²
> **Affiliations**: ¹ CASIA, ² UCAS, ³ NUS, ⁴ Tencent AI Lab
> \* Equal contribution, † Corresponding author

## Abstract

While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre-training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89× and 6.54×, respectively. Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization.
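To make the NSR and DSRL ideas in the abstract concrete, here is a minimal sketch in plain Python. It rests on assumptions that this note does not confirm: that NSR can be written as a REINFORCE-style surrogate with weight -1 on incorrect samples and 0 on correct ones, that PSR is the mirror image, and that DSRL switches stages at a fixed step. The names `nsr_weights`, `psr_weights`, `surrogate_loss`, `dsrl_stage`, and `switch_step` are hypothetical and are not taken from the paper.

```python
def nsr_weights(correct):
    """Assumed NSR weighting: -1 for incorrect samples, 0 for correct ones."""
    return [0.0 if c else -1.0 for c in correct]


def psr_weights(correct):
    """Assumed PSR weighting: +1 for correct samples, 0 for incorrect ones."""
    return [1.0 if c else 0.0 for c in correct]


def surrogate_loss(log_probs, weights):
    """REINFORCE-style surrogate: -(1/N) * sum_i w_i * log p_i.

    In the pre-train space, log_probs would be log P(y) scored without the
    prompt; in standard RL they would be log P(y|x). That mapping is an
    assumption of this sketch.
    """
    n = len(log_probs)
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / n


def dsrl_stage(step, switch_step):
    """Hypothetical two-stage schedule: NSR-PreRL first, then standard RL."""
    return "nsr_prerl" if step < switch_step else "standard_rl"


# Toy batch: three sampled responses, only the first is verified correct.
log_probs = [-2.0, -1.0, -3.0]
correct = [True, False, False]

nsr_loss = surrogate_loss(log_probs, nsr_weights(correct))
psr_loss = surrogate_loss(log_probs, psr_weights(correct))
```

Under these assumptions, minimizing `nsr_loss` lowers the log-probability of the two incorrect responses and leaves the correct one untouched, while minimizing `psr_loss` only raises the log-probability of the correct response.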

## Key Claims

1. **Gradient Alignment**: ⟨∇ log P(y), ∇ log P(y|x)⟩ ≥ 0 for all samples (empirically validated), which supports PreRL as a viable surrogate for standard RL
2. **NSR > PSR in Pre-train Space**: Negative Sample Reinforcement (NSR), which suppresses incorrect reasoning paths, is far more effective than Positive Sample Reinforcement (PSR) in the pre-train space
3. **DSRL outperforms GRPO**: Dual Space RL (DSRL) achieves a 2 to 5 point improvement over GRPO on benchmarks such as AIME24/25, with 1.6× to 2.5× better sample efficiency
4. **NSR-PreRL stimulates endogenous reasoning**: 14.89× more transition thoughts and 6.54× more reflection thoughts
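
The gradient-alignment claim can be illustrated on a toy model. The sketch below is not the paper's setup: it assumes a four-token softmax whose shared parameters are a bias vector `bias`, with a hypothetical `prompt_offset` added to the logits only when conditioning on a prompt x. It then compares the gradient of log P(y) with the gradient of log P(y|x), both taken with respect to `bias`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(logits, y):
    """Gradient of log softmax(logits)[y] w.r.t. the logits: onehot(y) - softmax."""
    probs = softmax(logits)
    return [(1.0 if i == y else 0.0) - p for i, p in enumerate(probs)]


def inner(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity; assumes neither vector is all zeros."""
    return inner(u, v) / (math.sqrt(inner(u, u)) * math.sqrt(inner(v, v)))


# Hypothetical toy parameters (not from the paper).
bias = [0.2, -0.1, 0.5, 0.0]            # shared parameters
prompt_offset = [0.3, 0.1, -0.4, 0.2]   # extra logits contributed by the prompt x
conditional_logits = [b + o for b, o in zip(bias, prompt_offset)]

alignments = []
for y in range(len(bias)):
    g_marginal = grad_log_prob(bias, y)                   # grad of log P(y)
    g_conditional = grad_log_prob(conditional_logits, y)  # grad of log P(y|x)
    alignments.append((inner(g_marginal, g_conditional),
                       cosine(g_marginal, g_conditional)))
```

In this toy the inner product equals (1 - p_y)(1 - q_y) + Σ_{i≠y} p_i q_i, where p and q are the marginal and conditional softmax outputs, so it is non-negative for every y. That property belongs to the toy construction and does not by itself establish anything about a full language model.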