20260601

2026-06-01 10:46:01 +08:00
parent 2faf4bb002
commit e96b955fda
221 changed files with 10219 additions and 332 deletions
--- a/concepts/on-policy-learning-collapse.md
+++ b/concepts/on-policy-learning-collapse.md
@@ -0,0 +1,38 @@
+---
+title: "On-policy Learning Collapse"
+created: 2026-05-18
+type: concept
+tags: ["reinforcement-learning", "LLM", "failure-mode"]
+sources: ["https://arxiv.org/abs/2604.14142"]
+---
+
+# On-policy Learning Collapse
+
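+## Definition
+
+A specific failure mode identified in the PreRL framework: when [[positive-sample-reinforcement|PSR]] is applied in the pre-train space to self-generated (on-policy) trajectories, the model fails to learn effectively and its performance eventually collapses.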
+
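+## Symptoms
+
+- PSR-PreRL performs close to standard RL for the first 150 steps
+- It then undergoes a **significant performance collapse** (Figure 3a)
+- The conditional probability P(y|x) does keep rising (a gradient synergy effect), yet learning quality deteriorates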
+
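+## Cause analysis
+
+A comparison with QFFT, which uses out-of-distribution long-CoT trajectories from a teacher model, reveals the key condition:
+
+> Maximizing P(y) in the pre-train space **strictly requires high-quality, out-of-distribution expert demonstrations**
+
+Self-generated on-policy samples are not of high enough quality in P(y) space to sustain continued learning: the model keeps accumulating probability mass on its own generations and ends up in a self-reinforcing degradation loop.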
+
+## Comparison with NSR
+
+- PSR-PreRL → degradation
+- [[negative-sample-reinforcement|NSR-PreRL]] → highly effective (it prunes rather than accumulates)
+
+## Related concepts
+
+- [[positive-sample-reinforcement|PSR]]
+- [[negative-sample-reinforcement|NSR]]
+- [[pre-train-space-reinforcement-learning|PreRL]]
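
The self-reinforcing loop this note describes can be illustrated with a toy model. The sketch below is not the PreRL setup from the cited paper: it assumes a hypothetical categorical "policy" over a handful of outputs, treats half of them as positive, and repeatedly raises log p(y) on positive samples that the policy itself generated. The names `simulate_psr`, `num_outputs`, `batch`, and `lr` are invented for this illustration. Entropy is used only as a rough proxy for how much probability mass has piled up on the policy's own generations.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def simulate_psr(num_outputs=20, steps=300, batch=8, lr=0.5, seed=0):
    """Toy positive-sample reinforcement on self-generated samples.

    Assumption: outputs 0 .. num_outputs/2 - 1 count as "positive".
    Returns the policy entropy before each step, plus the final entropy.
    """
    rng = random.Random(seed)
    logits = [0.0] * num_outputs          # start from a uniform policy
    positive = set(range(num_outputs // 2))
    history = []
    for _ in range(steps):
        probs = softmax(logits)
        history.append(entropy(probs))
        # On-policy: the training samples come from the current policy.
        samples = rng.choices(range(num_outputs), weights=probs, k=batch)
        for y in samples:
            if y not in positive:
                continue                  # PSR ignores negative samples
            # Gradient of log p(y) w.r.t. the logits is onehot(y) - probs.
            for k in range(num_outputs):
                indicator = 1.0 if k == y else 0.0
                logits[k] += (lr / batch) * (indicator - probs[k])
    history.append(entropy(softmax(logits)))
    return history


if __name__ == "__main__":
    h = simulate_psr()
    print(f"entropy at start: {h[0]:.3f}, at end: {h[-1]:.3f}")
```

In this toy, outputs that happen to be sampled early get reinforced, which makes them more likely to be sampled again. That is the accumulation pattern the note attributes to PSR on on-policy data. It says nothing about task performance, which the toy does not model.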