20260601
reviews/pretrain-space-rl-review-20260518.md
@@ -0,0 +1,53 @@
---
title: "Review: Pre-train Space Reinforcement Learning"
paper: "pre-train-space-reinforcement-learning"
arxiv: "2604.14142"
date: "2026-05-18"
type: review
---

# Review: Pre-train Space Reinforcement Learning

📌 **Basic Information**
- Paper title: *Pre-train Space Reinforcement Learning: From P(y|x) to P(y)*
- Authors: Yuqiao Tan, Minzheng Wang (CASIA/UCAS), Bo Liu, Zichen Liu (NUS), Tian Liang (Tencent AI Lab), Shizhu He†, Jun Zhao, Kang Liu (CASIA)
- Field: LLM Reasoning, Reinforcement Learning, Pre-training
- arXiv: [2604.14142](https://arxiv.org/abs/2604.14142) | 2026-04-15
- Added: 2026-05-18
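
🎯 **Core Concepts**

1. **PreRL (pre-train space RL)**: moves the RL optimization target from P(y|x) to P(y) by masking the input condition x during gradient updates. Gradient alignment (⟨∇log P(y), ∇log P(y|x)⟩ ≥ 0) is used to show that this is a valid surrogate.
2. **NSR (Negative Sample Reinforcement)**: prunes incorrect reasoning paths in the pre-train space; transition thoughts increase 14.89× and reflection thoughts increase 6.54×.
3. **DSRL (Dual Space RL)**: a policy reincarnation strategy that first runs NSR-PreRL to broaden the reasoning horizon (10-25 steps), then switches to standard RL for fine-grained optimization.
4. **PSR degeneration**: Positive Sample Reinforcement in the pre-train space leads to on-policy collapse and requires out-of-distribution expert demonstrations.
5. **Endogenous reasoning**: NSR-PreRL unlocks reasoning abilities that were already encoded during pre-training but suppressed by the conditioning constraint.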
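
To make concepts 1 and 2 concrete, here is a minimal toy sketch, not the paper's implementation. It assumes a hypothetical tabular policy whose logits are `shared[y] + cond[x, y]`, so that masking x simply drops the `cond` term; `ToyPolicy`, `p_masked`, `alignment`, and `nsr_prerl_step` are illustrative names, and `p_masked` is only a stand-in for P(y). Under these assumptions, one NSR step taken with x masked also lowers P(y|x) for the penalized response through the shared parameters, and the gradient inner product is non-negative.

```python
import numpy as np


def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


class ToyPolicy:
    """Toy policy with logits(y | x) = shared[y] + cond[x, y].

    Hypothetical stand-in for an LLM: `shared` plays the role of parameters
    used with or without the prompt; masking x drops the `cond` term.
    """

    def __init__(self, n_prompts, n_responses, seed=0):
        rng = np.random.default_rng(seed)
        self.shared = rng.normal(size=n_responses)
        self.cond = rng.normal(scale=0.5, size=(n_prompts, n_responses))

    def p_masked(self):
        """Stand-in for P(y): the prompt x is masked out."""
        return softmax(self.shared)

    def p_conditional(self, x):
        """P(y | x) for prompt index x."""
        return softmax(self.shared + self.cond[x])


def grad_log_prob(p, y):
    """Gradient of log softmax(logits)[y] w.r.t. the logits: onehot(y) - p."""
    g = -p
    g[y] += 1.0
    return g


def alignment(policy, x, y):
    """<grad log P(y), grad log P(y|x)>, taken over the shared parameters."""
    g_masked = grad_log_prob(policy.p_masked(), y)
    g_cond = grad_log_prob(policy.p_conditional(x), y)
    return float(g_masked @ g_cond)


def nsr_prerl_step(policy, wrong_y, lr=0.5):
    """One NSR step with x masked: gradient descent on log P(wrong_y)."""
    policy.shared -= lr * grad_log_prob(policy.p_masked(), wrong_y)


policy = ToyPolicy(n_prompts=3, n_responses=4)
before = [policy.p_conditional(x)[2] for x in range(3)]
nsr_prerl_step(policy, wrong_y=2)  # treat response 2 as an incorrect sample
after = [policy.p_conditional(x)[2] for x in range(3)]
print("P(wrong | x) before:", before)
print("P(wrong | x) after: ", after)
```

In this toy the alignment holds by construction, since ⟨e_y − p, e_y − q⟩ = (1 − p_y)(1 − q_y) + Σ_{j≠y} p_j q_j ≥ 0 for any two probability vectors p and q. A real language model gives no such closed form.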

🔗 **Concept Network**

Core connections:
```
PreRL ←→ Post-train Space RL ←→ DSRL
↓ ↓ ↓
Gradient alignment P(y|x) bottleneck Policy reincarnation
↓ ↓
Shared-parameter influence NSR → PSR
↓
Endogenous reasoning ← on-policy collapse
```

- Core concepts: 11
- Link integrity: 100%, no broken links

📚 **Wiki Integration**
- New pages: 13 (1 paper + 1 raw + 11 concepts)
- Total size: 335 → 347 pages
- Network integrity: 100%
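
💡 **Key Insights**

1. **Paradigm shift**: from "sharpening the distribution in the conditional space" to "pruning incorrect paths in the marginal space". NSR shows that removing is more effective than adding, an important but overlooked asymmetry in RL for LLMs.

2. **The "negative optimization" advantage of the pre-train space**: PSR (Positive Sample Reinforcement) degenerates in the pre-train space, whereas NSR is highly effective. This asymmetry suggests that optimization in the pre-train space is essentially "constraint release" rather than "capability injection".

3. **Dual-space synergy**: the elegance of DSRL lies in recognizing that different training stages call for different "optimization spaces". Early on, fundamental errors are eliminated in P(y) (global pruning); later, the conditional policy is fine-tuned in P(y|x) (local optimization). This resembles a natural transition from exploration to exploitation.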
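
The two-phase switch described in insight 3 can be sketched as a per-step objective selector. This is a hedged illustration rather than the paper's algorithm: `dsrl_objective` and `warmup_steps` are hypothetical names, the default of 20 is just one value inside the 10-25 step range quoted above, and "standard RL" is represented here by a mean-baseline REINFORCE weighting, which is an assumption.

```python
def dsrl_objective(step, rewards, warmup_steps=20):
    """Choose the optimization space and per-sample weights for one step.

    Returns (mask_prompt, weights). Each weight multiplies the gradient of
    the sample's log-probability in a gradient-ascent update.
    Assumes binary rewards: 1 for a correct sample, 0 for an incorrect one.
    """
    if step < warmup_steps:
        # Phase 1 (NSR-PreRL): prompt masked, only incorrect samples
        # contribute, and their log-probability is pushed down.
        weights = [-1.0 if r <= 0 else 0.0 for r in rewards]
        return True, weights
    # Phase 2 (standard RL, assumed mean-baseline REINFORCE): prompt visible,
    # both correct and incorrect samples contribute.
    baseline = sum(rewards) / len(rewards)
    return False, [r - baseline for r in rewards]


rewards = [1, 0, 0, 1]
early = dsrl_objective(step=0, rewards=rewards)
late = dsrl_objective(step=20, rewards=rewards)
print(early)
print(late)
```

Correct samples get zero weight during the first phase, so no positive-sample reinforcement happens in the pre-train space, consistent with the PSR degeneration noted in insight 2.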