---
title: "On-policy Learning Collapse"
created: 2026-05-18
type: concept
tags: ["reinforcement-learning", "LLM", "failure-mode"]
sources: ["https://arxiv.org/abs/2604.14142"]
---
# On-policy Learning Collapse
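## Definition
A specific failure mode identified in the PreRL framework: when [[positive-sample-reinforcement|PSR]] is applied in the pre-train space to self-generated (on-policy) trajectories, the model fails to learn effectively, and its performance eventually collapses.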
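## Symptoms
- PSR-PreRL performs close to standard RL for the first 150 steps
- After that, it undergoes a **significant performance collapse** (Figure 3a)
- The conditional probability P(y|x) does keep rising (a gradient synergy effect), yet learning quality deteriorates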
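## Cause Analysis
A comparison with QFFT (which uses out-of-distribution long-CoT trajectories from a teacher model) reveals the key condition:
> Maximizing P(y) in the pre-train space **strictly requires high-quality, out-of-distribution expert demonstrations**
Self-generated on-policy samples are not of sufficient quality in P(y) space to sustain continued learning: the model keeps accumulating probability mass on its own generations and eventually falls into a self-reinforcing degradation loop.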
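One way to read this condition is sketched below. It is a simplification, not the paper's derivation: it assumes the update is plain maximum likelihood on sampled trajectories, and the symbols $p_\theta$ (the model) and $q$ (a hypothetical expert distribution) are introduced here only for illustration. For samples drawn from the model itself, the expected gradient vanishes, so any movement comes from sampling noise or from how the samples are filtered:

$$
\mathbb{E}_{y \sim p_\theta}\left[\nabla_\theta \log p_\theta(y)\right]
= \sum_y p_\theta(y)\,\nabla_\theta \log p_\theta(y)
= \nabla_\theta \sum_y p_\theta(y)
= 0
$$

For samples drawn from a different distribution $q \neq p_\theta$, the same update follows a direction that is generally non-zero and pulls the model toward $q$:

$$
\mathbb{E}_{y \sim q}\left[\nabla_\theta \log p_\theta(y)\right]
= -\nabla_\theta\, \mathrm{KL}\left(q \,\|\, p_\theta\right)
$$

Under these assumptions, out-of-distribution demonstrations supply a target the model does not already encode, whereas on-policy samples can only reshuffle probability mass the model already has.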
## Comparison with NSR
- PSR-PreRL → degradation
- [[negative-sample-reinforcement|NSR-PreRL]] → highly effective (it prunes rather than accumulates)
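
The "prune rather than accumulate" contrast can be illustrated with a toy softmax policy over eight answers. This is a minimal sketch under stated assumptions, not the PreRL setup: the vocabulary, the set `CORRECT`, the function names, and the hyperparameters are all hypothetical, and the toy does not model the difference between P(y) and P(y|x). PSR raises the log-probability of sampled correct answers; NSR lowers the log-probability of sampled wrong answers.

```python
import math
import random

CORRECT = {0, 1, 2, 3}   # hypothetical: four equally valid answers
VOCAB = 8                # answers 4..7 are treated as wrong


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_over_correct(logits):
    """Entropy of the policy renormalized over the correct answers."""
    probs = softmax(logits)
    mass = sum(probs[i] for i in CORRECT)
    cond = [probs[i] / mass for i in sorted(CORRECT)]
    return -sum(p * math.log(p) for p in cond if p > 0.0)


def train(mode, steps=3000, batch_size=8, lr=1.0, seed=0):
    """Train a uniform-initialized toy policy with 'psr' or 'nsr' updates."""
    rng = random.Random(seed)
    logits = [0.0] * VOCAB
    for _ in range(steps):
        probs = softmax(logits)
        samples = rng.choices(range(VOCAB), weights=probs, k=batch_size)
        grad = [0.0] * VOCAB
        for y in samples:
            is_correct = y in CORRECT
            if mode == "psr" and is_correct:
                sign = 1.0    # raise log p(y) for rewarded samples
            elif mode == "nsr" and not is_correct:
                sign = -1.0   # lower log p(y) for unrewarded samples
            else:
                continue      # this sample contributes no update
            # d log p(y) / d logit_j = 1[j == y] - p_j
            for j in range(VOCAB):
                indicator = 1.0 if j == y else 0.0
                grad[j] += sign * (indicator - probs[j]) / batch_size
        logits = [z + lr * g for z, g in zip(logits, grad)]
    return logits


if __name__ == "__main__":
    psr = entropy_over_correct(train("psr"))
    nsr = entropy_over_correct(train("nsr"))
    print(f"PSR: {psr:.3f}  NSR: {nsr:.3f}  max: {math.log(len(CORRECT)):.3f}")
```

In this toy, an NSR update never singles out one correct answer: every correct logit receives an increment that depends only on its current probability, so the four correct answers stay tied and the update stops once wrong answers are no longer sampled. PSR keeps updating on whichever correct answers happen to be sampled, so sampling noise breaks the tie and mass tends to pile up on fewer answers over time.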
## Related Concepts
- [[positive-sample-reinforcement|PSR]]
- [[negative-sample-reinforcement|NSR]]
- [[pre-train-space-reinforcement-learning|PreRL]]