20260601

2026-06-01 10:46:01 +08:00
parent 2faf4bb002
commit e96b955fda
221 changed files with 10219 additions and 332 deletions
--- a/concepts/positive-sample-reinforcement.md
+++ b/concepts/positive-sample-reinforcement.md
@@ -0,0 +1,40 @@
+---
+title: "Positive Sample Reinforcement (PSR)"
+created: 2026-05-18
+type: concept
+tags: ["reinforcement-learning", "LLM", "GRPO"]
+sources: ["https://arxiv.org/abs/2604.14142"]
+---
+
+# Positive Sample Reinforcement (PSR)
+
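+## Definition
+
+PSR is the RL mechanism that reinforces **positive samples** (samples that receive a positive advantage): it encourages correct reasoning trajectories by maximizing log π(y|x).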
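+
+As a rough sketch of this objective (the indicator weighting, the advantage symbol $A(x, y)$, and the sampling policy $\pi_{\theta_{\mathrm{old}}}$ are notation assumed here for illustration, not taken from the source), PSR can be written as maximizing the log-likelihood of sampled responses whose advantage is positive:
+
+$$
+J_{\mathrm{PSR}}(\theta) = \mathbb{E}_{x,\; y \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid x)} \left[ \mathbb{1}[A(x, y) > 0] \, \log \pi_\theta(y \mid x) \right]
+$$
+
+Under this assumed form, samples with a non-positive advantage contribute nothing to the update. An actual implementation may instead weight each term by the advantage magnitude.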
+
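+## Degradation in PreRL
+
+Although the gradient directions of PSR and [[negative-sample-reinforcement|NSR]] are aligned (both point toward improving the conditional policy), in the **pre-train space** P(y):
+
+- **PSR-PreRL** cannot learn effectively from self-generated on-policy trajectories
+- Although it does increase the conditional probability π_θ(y|x) (confirming the gradient synergy effect), it ultimately degrades performance
+- By contrast, QFFT successfully optimizes the same objective, max P(y), using out-of-distribution long-CoT trajectories from a teacher model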
+
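+### Key lesson
+
+> Maximizing P(y) in the pre-train space **strictly requires high-quality, out-of-distribution expert demonstrations**. This is a fundamental limitation of on-policy RL in the pre-train space.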
+
+## PSR vs NSR
+
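+| Dimension | PSR-PreRL | NSR-PreRL |
+|-----------|-----------|-----------|
+| Learning outcome | Degrades | Highly effective |
+| Reasoning elicitation | Weak | 14.89× transitions |
+| Output length | Normal | Gradually becomes excessive (a double-edged sword) |
+| Mechanism | Accumulates probability mass | Redistributes probability mass |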
+
+## Related concepts
+
+- [[negative-sample-reinforcement|NSR]]: the asymmetric advantage of negative sample reinforcement
+- [[on-policy-learning-collapse|On-policy Learning Collapse]]
+- [[pre-train-space-reinforcement-learning|PreRL]]