myWiki/concepts/post-train-space-rl.md
2026-06-01 10:46:01 +08:00

---
title: "Post-train Space Reinforcement Learning"
created: 2026-05-18
type: concept
tags: ["reinforcement-learning", "LLM", "GRPO", "RLVR"]
sources: ["https://arxiv.org/abs/2604.14142"]
---
# Post-train Space Reinforcement Learning
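## Definition
Post-train Space RL is the currently dominant reinforcement learning paradigm for LLMs. It optimizes the **conditional distribution** P(y|x): given an input question x, the policy π_θ generates a reasoning trajectory y and is optimized with verifiable rewards (RLVR).
## Standard objective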
```
J_RL(π_θ) = E_{x~X} E_{y~π_θ(·|x)} [R(y) - β·D_KL(π_θ||π_ref)]
```
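As a rough illustration of this objective, the sketch below evaluates it exactly for a toy setting that is an assumption of this note, not something from the source: a single prompt, a categorical policy over three candidate answers parameterized by logits, and a 0/1 verifiable reward. The names `softmax`, `kl_divergence`, and `objective` are hypothetical helpers.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p, q):
    # D_KL(p || q) for categorical distributions; terms with p_i = 0 contribute 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def objective(policy_logits, ref_logits, rewards, beta):
    # J = E_{y ~ pi}[R(y)] - beta * D_KL(pi || pi_ref), computed exactly
    # because the toy output space is small enough to enumerate.
    pi = softmax(policy_logits)
    pi_ref = softmax(ref_logits)
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    return expected_reward - beta * kl_divergence(pi, pi_ref)

# Toy setup (assumed): answer 1 is the only verified-correct answer.
ref_logits = [0.0, 0.0, 0.0]
rewards = [0.0, 1.0, 0.0]

j_ref = objective(ref_logits, ref_logits, rewards, beta=0.1)
j_sharp = objective([0.0, 2.0, 0.0], ref_logits, rewards, beta=0.1)
print(j_ref, j_sharp)
```

In this toy case, moving probability mass toward the rewarded answer raises the expected reward by more than the KL penalty costs, so the shifted policy scores higher than the reference policy.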
Gradient (when β = 0):
```
∇J = E_{x,y} [∑_{t=1}^{|y|} ∇log π_θ(y_t|x, y_{<t}) · R(y)]
```
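The gradient above is the score-function (REINFORCE) estimator. A minimal sketch, assuming a toy single-token setting where the sum over t collapses to one term and the policy is a softmax over logits θ, so that ∇_θ log π_θ(y) = onehot(y) − π_θ. The helper names and the toy rewards are illustrative assumptions, not part of the source.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def exact_gradient(logits, rewards):
    # E_y[grad log pi(y) * R(y)] in closed form for a softmax policy:
    # component k equals pi_k * (R_k - E[R]).
    pi = softmax(logits)
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    return [p * (r - expected_reward) for p, r in zip(pi, rewards)]

def sampled_gradient(logits, rewards, num_samples, rng):
    # Monte Carlo version: sample y ~ pi, then average grad log pi(y) * R(y).
    pi = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        y = rng.choices(range(len(pi)), weights=pi)[0]
        for k in range(len(pi)):
            score = (1.0 if k == y else 0.0) - pi[k]  # d log pi(y) / d theta_k
            grad[k] += score * rewards[y] / num_samples
    return grad

logits = [0.0, 0.0, 0.0]
rewards = [0.0, 1.0, 0.0]  # assumed 0/1 verifiable reward
g_exact = exact_gradient(logits, rewards)
g_mc = sampled_gradient(logits, rewards, num_samples=20000, rng=random.Random(0))
print(g_exact)
print(g_mc)
```

The sampled estimate should approach the exact gradient as the number of samples grows. For a real LLM the same estimator is applied per token, with π_θ(y_t|x, y_{<t}) taken from the model's output distribution.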
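## Inherent limitations
Core problems identified by the [[pre-train-space-reinforcement-learning|PreRL]] paper:
- Post-train space RL is **fundamentally constrained** by the base model's existing output distribution (Yue et al., 2025)
- RLVR merely "sharpens" the existing distribution rather than raising the ceiling of reasoning capability
- The conditioning constraint limits the exploration space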
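The sharpening effect can be seen in a deliberately simplified toy model, which is an assumption of this note and not an experiment from the PreRL paper. A softmax policy over three answers is trained by exact policy-gradient ascent; answers 1 and 2 are both rewarded, but answer 2 has zero probability under the base policy.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def train(logits, rewards, lr, steps):
    # Exact policy-gradient ascent on E[R] for a softmax policy:
    # d E[R] / d theta_k = pi_k * (R_k - E[R]).
    logits = list(logits)
    for _ in range(steps):
        pi = softmax(logits)
        expected_reward = sum(p * r for p, r in zip(pi, rewards))
        logits = [l + lr * p * (r - expected_reward)
                  for l, p, r in zip(logits, pi, rewards)]
    return softmax(logits)

# Base policy (assumed): answer 2 is correct but lies outside the base support.
base_logits = [math.log(0.7), math.log(0.3), float("-inf")]
rewards = [0.0, 1.0, 1.0]

base = softmax(base_logits)
final = train(base_logits, rewards, lr=1.0, steps=2000)
print(base)
print(final)
```

In this sketch, mass moves from the unrewarded answer to the rewarded answer that the base policy could already produce, while the rewarded answer with zero base probability receives zero gradient and is never discovered. Real LLM policies rarely assign exactly zero probability, so this is an idealized version of the constraint.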
## Related concepts
- [[pre-train-space-reinforcement-learning|PreRL]]: an alternative that optimizes in the P(y) space
- [[dual-space-rl|DSRL]]: combines PreRL and Post-train RL
- [[gradient-alignment|gradient alignment]]: shows that PreRL can serve as an effective proxy for Post-train RL