SidneyZhang/myWiki

Sidney Zhang e96b955fda

20260601

2026-06-01 10:46:01 +08:00

title: On-policy Learning Collapse
created: 2026-05-18
type: concept
tags: reinforcement-learning, LLM, failure-mode
sources: https://arxiv.org/abs/2604.14142

On-policy Learning Collapse

Definition

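A specific failure mode identified in the PreRL framework: when positive-sample-reinforcement (PSR) is applied in the pre-training space to self-generated (on-policy) trajectories, the model fails to learn effectively and its performance eventually collapses.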

Symptoms

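For the first 150 steps, PSR-PreRL performs close to standard RL.
After that, it undergoes a significant performance collapse (Figure 3a).
The conditional probability P(y|x) does keep rising (a gradient synergy effect), yet the quality of what is learned deteriorates.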

Cause analysis

A comparison with QFFT, which trains on out-of-distribution long-CoT trajectories from a teacher model, reveals the key condition:

Maximizing P(y) in the pre-training space strictly requires high-quality, out-of-distribution expert demonstrations.

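Self-generated on-policy samples are not of high enough quality in P(y) space to sustain continued learning: the model keeps accumulating probability mass on its own generations and eventually falls into a self-feedback degradation loop.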

Comparison with NSR

PSR-PreRL → degradation
negative-sample-reinforcement (NSR) → highly effective (it prunes rather than accumulates)
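
The accumulate-versus-prune contrast can be illustrated with a toy sketch. This is not the PreRL objective or the paper's experimental setup: it assumes a tabular softmax policy over four outputs, and the names `reinforce_step`, `start`, `p_psr`, and `p_nsr` are hypothetical. Always reinforcing the policy's current most likely output is used as a deterministic stand-in for sampling on-policy.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def reinforce_step(logits, y, sign, lr=0.5):
    """One gradient step on sign * log p(y) for a tabular softmax policy.

    sign=+1 mimics a positive-sample update (raise p(y));
    sign=-1 mimics a negative-sample update (lower p(y)).
    The gradient of log p(y) w.r.t. the logits is onehot(y) - p.
    """
    probs = softmax(logits)
    return [
        z + sign * lr * ((1.0 if j == y else 0.0) - p)
        for j, (z, p) in enumerate(zip(logits, probs))
    ]


# Toy starting policy (assumed values, chosen only for illustration).
start = [0.5, 0.2, 0.0, -0.3]
p_start = softmax(start)

# Positive reinforcement on the policy's own mode, repeated 50 times:
# mass keeps piling onto what the policy already prefers.
logits = list(start)
for _ in range(50):
    probs = softmax(logits)
    y = probs.index(max(probs))
    logits = reinforce_step(logits, y, sign=+1)
p_psr = softmax(logits)

# A single negative update on output 3: its mass is removed and
# handed back to the remaining outputs instead of being concentrated.
p_nsr = softmax(reinforce_step(start, 3, sign=-1))

print("start:", [round(p, 3) for p in p_start], "entropy", round(entropy(p_start), 3))
print("PSR  :", [round(p, 3) for p in p_psr], "entropy", round(entropy(p_psr), 3))
print("NSR  :", [round(p, 3) for p in p_nsr], "entropy", round(entropy(p_nsr), 3))
```

In this toy, the positive loop drives the distribution toward a single output with no external signal, while the negative step only removes mass from the penalized output. It does not model why NSR works well in PreRL; it only shows the direction of each update.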

Related concepts