---
title: "Pre-train Space Reinforcement Learning: From P(y|x) to P(y)"
arxiv: "2604.14142"
authors: ["Yuqiao Tan", "Minzheng Wang", "Bo Liu", "Zichen Liu", "Tian Liang", "Shizhu He", "Jun Zhao", "Kang Liu"]
venue: "arXiv preprint"
date: "2026-04-15"
type: paper
tags: ["reinforcement-learning", "pre-training", "LLM", "reasoning", "GRPO"]
---
# Pre-train Space Reinforcement Learning
> **arXiv**: [2604.14142](https://arxiv.org/abs/2604.14142)
> **Authors**: Yuqiao Tan¹²*, Minzheng Wang¹²*, Bo Liu³, Zichen Liu³, Tian Liang⁴, Shizhu He¹²†, Jun Zhao¹², Kang Liu¹²
> **Affiliations**: ¹ CASIA, ² UCAS, ³ NUS, ⁴ Tencent AI Lab
> \* Equal contribution, † Corresponding author
## Abstract
While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre-training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89× and 6.54×, respectively. Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization.
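
The NSR idea in the abstract can be illustrated with a minimal toy sketch. This is not the paper's implementation: it assumes a single categorical distribution over four candidate outputs (standing in for an unconditional P(y)), pretends a verifier has marked one output as incorrect, and takes a few gradient steps that lower that output's log-probability. The names `softmax`, `nsr_step`, the learning rate, and the logits are all hypothetical choices made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nsr_step(logits, bad_index, lr=0.5):
    """One gradient-descent step on log p(bad_index) (toy NSR update).

    d/dz_i log p_k = 1[i == k] - p_i, so descending on log p_k
    subtracts lr * (1[i == k] - p_i) from each logit z_i.
    """
    probs = softmax(logits)
    return [
        z - lr * ((1.0 if i == bad_index else 0.0) - p)
        for i, (z, p) in enumerate(zip(logits, probs))
    ]

logits = [2.0, 1.0, 0.5, 0.0]  # assumption: the toy model initially prefers output 0
bad = 0                        # assumption: a verifier marked output 0 as incorrect

before = softmax(logits)
for _ in range(3):             # a few illustrative steps, not a training schedule
    logits = nsr_step(logits, bad)
after = softmax(logits)

print("before:", [round(p, 3) for p in before])
print("after: ", [round(p, 3) for p in after])
```

In this toy, the probability of the penalized output falls and the freed mass is shared among the remaining outputs instead of being assigned to one designated "correct" answer. Real NSR-PreRL operates on sampled reasoning sequences from an LLM, so this only shows the direction of the update, not its effect at scale.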
## Key Claims
1. **Gradient Alignment**: ⟨∇ log P(y), ∇ log P(y|x)⟩ ≥ 0 for all samples (empirically validated), confirming PreRL as a viable surrogate for standard RL
2. **NSR > PSR in Pre-train Space**: Negative Sample Reinforcement (suppressing incorrect paths) is far more effective than Positive Sample Reinforcement in the pre-train space
3. **DSRL outperforms GRPO**: Dual Space RL achieves a +2 to +5 point improvement on benchmarks such as AIME24/25, with 1.6× to 2.5× better sample efficiency
4. **NSR-PreRL stimulates endogenous reasoning**: 14.89× more transition thoughts, 6.54× more reflection thoughts
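
The quantity behind the gradient alignment claim can be made concrete with a small numerical sketch. It is a toy under stated assumptions, not the paper's proof or experimental setup: outputs are single tokens, P(y) is modeled as `softmax(bias)`, P(y|x) as `softmax(bias + offset)` where `offset` is a hypothetical prompt-specific term, and gradients are taken only with respect to the shared `bias`. All function and variable names are illustrative.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(logits, y):
    """Gradient of log softmax(logits)[y] with respect to the logits.

    The shared bias enters the logits additively, so this is also the
    gradient with respect to the bias: one_hot(y) - softmax(logits).
    """
    probs = softmax(logits)
    return [(1.0 if i == y else 0.0) - p for i, p in enumerate(probs)]

def alignment(bias, offset, y):
    """Inner product of grad log P(y) and grad log P(y|x) w.r.t. the shared bias."""
    g_marginal = grad_log_prob(bias, y)
    g_conditional = grad_log_prob([b + w for b, w in zip(bias, offset)], y)
    return sum(a * c for a, c in zip(g_marginal, g_conditional))

rng = random.Random(0)  # fixed seed so the check is reproducible
vocab = 5               # assumption: tiny toy vocabulary

# Smallest inner product seen over 1000 random toy models and targets.
min_alignment = min(
    alignment(
        [rng.gauss(0, 1) for _ in range(vocab)],
        [rng.gauss(0, 1) for _ in range(vocab)],
        rng.randrange(vocab),
    )
    for _ in range(1000)
)
print("alignment is non-negative in every trial:", min_alignment >= 0.0)
```

In this toy the inner product equals (1 − p_y)(1 − q_y) + Σ_{i≠y} p_i q_i, where p and q are the conditional and marginal distributions, so it is non-negative by construction. The check passing therefore says nothing about full-scale LLMs; it only shows what kind of quantity claim 1 measures.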