20260601
concepts/policy-reincarnation.md
Normal file
@@ -0,0 +1,38 @@
---
title: "Policy Reincarnation"
created: 2026-05-18
type: concept
tags: ["reinforcement-learning", "transfer-learning", "policy-optimization"]
sources: ["https://arxiv.org/abs/2604.14142", "https://proceedings.neurips.cc/paper_files/paper/2022/hash/ba1c5356d9164bb64c446a4b690226b0-Abstract-Conference.html"]
---

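# Policy Reincarnation

## Definition

Policy Reincarnation is a training strategy in which the base model is **replaced with an intermediate checkpoint** partway through training, and on-policy RL is then restarted from that checkpoint. The core idea is to reuse prior computation to accelerate subsequent training.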

## Application in DSRL

[[dual-space-rl|DSRL]] uses Policy Reincarnation to chain [[pre-train-space-reinforcement-learning|PreRL]] and standard RL:

1. Train with NSR-PreRL for 10-25 steps to obtain a checkpoint.
2. Treat that checkpoint as the new "base model".
3. Switch to standard GRPO and continue training in the Post-train Space.
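
The three steps above can be sketched as a two-phase loop. This is a minimal illustration, not the DSRL implementation: `train_with_reincarnation`, `toy_prerl_step`, and `toy_rl_step` are hypothetical names, the "model" is a one-entry dict standing in for real weights, and passing the checkpoint to the RL step as a reference model is an assumption rather than something the source states.

```python
import copy
from typing import Callable, Dict, List, Tuple

Params = Dict[str, float]  # toy stand-in for model weights (assumption)


def train_with_reincarnation(
    base: Params,
    prerl_step: Callable[[Params], Params],
    rl_step: Callable[[Params, Params], Params],
    switch_step: int,
    total_steps: int,
) -> Tuple[Params, Params, List[str]]:
    """Run `switch_step` PreRL updates, snapshot a checkpoint, then run RL from it."""
    phases: List[str] = []
    policy = copy.deepcopy(base)

    # Step 1: NSR-PreRL phase.
    for _ in range(switch_step):
        policy = prerl_step(policy)
        phases.append("nsr_prerl")

    # Step 2: the intermediate checkpoint becomes the new "base model".
    checkpoint = copy.deepcopy(policy)

    # Step 3: restart on-policy RL (GRPO in DSRL) from that checkpoint.
    for _ in range(total_steps - switch_step):
        policy = rl_step(policy, checkpoint)
        phases.append("grpo")

    return policy, checkpoint, phases


def toy_prerl_step(params: Params) -> Params:
    # Placeholder update; a real step would apply an NSR-PreRL gradient.
    return {"w": params["w"] + 1.0}


def toy_rl_step(params: Params, reference: Params) -> Params:
    # Placeholder update; `reference` is unused in this toy version.
    return {"w": params["w"] + 0.5}


final, ckpt, phases = train_with_reincarnation(
    {"w": 0.0}, toy_prerl_step, toy_rl_step, switch_step=10, total_steps=15
)
```

The checkpoint is deep-copied before the second phase so that later RL updates cannot mutate the snapshot that now plays the role of the base model.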
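
## Why It Works

- The NSR-PreRL checkpoint has already **eliminated fundamental error patterns**.
- The distribution P(y) has been pruned, giving the fine-grained optimization of P(y|x) a better starting point.
- Subsequent RL can focus on subtle, problem-specific differences rather than basic logical errors.
- Evidence: the number of "Fully Solved" problems in DSRL already rises sharply during the NSR-PreRL stage.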
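
To make the "Fully Solved" evidence concrete, the sketch below counts such problems from per-rollout correctness flags. The source does not define the metric, so this assumes a problem is "Fully Solved" when every sampled rollout for it is correct. `count_fully_solved` is a hypothetical helper, and the `before`/`after` data is made up for illustration, not taken from any experiment.

```python
from typing import Dict, List


def count_fully_solved(rollouts: Dict[str, List[bool]]) -> int:
    """Count problems whose sampled rollouts are all correct (assumed definition)."""
    # Problems with no rollouts are not counted as solved.
    return sum(1 for outcomes in rollouts.values() if outcomes and all(outcomes))


# Hypothetical correctness flags per problem; illustrative only.
before = {
    "p1": [True, False, True],
    "p2": [False, False, False],
    "p3": [True, True, True],
}
after = {
    "p1": [True, True, True],
    "p2": [True, False, True],
    "p3": [True, True, True],
}

solved_before = count_fully_solved(before)
solved_after = count_fully_solved(after)
```

Tracking this count at each checkpoint is one way to check whether the PreRL phase is already converting partially solved problems into fully solved ones before the switch to GRPO.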
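
## When to Reincarnate

Ablation experiments show that S ∈ [10, 25] is the optimal reincarnation window, where S is the number of NSR-PreRL steps run before the switch. Reincarnating too late lets NSR's "over-exploration" effect hinder subsequent fine-tuning.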
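
A schedule object can encode the switch point and flag values outside the reported window. This is a sketch under assumptions: `ReincarnationSchedule` and `phase_at` are hypothetical names, steps are taken to be 0-indexed, and the default window simply mirrors the [10, 25] range quoted above rather than any library default.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ReincarnationSchedule:
    switch_step: int                    # S: number of NSR-PreRL steps before the switch
    window: Tuple[int, int] = (10, 25)  # window reported by the ablation (inclusive)

    def in_window(self) -> bool:
        """Return True if S lies inside the reported window."""
        low, high = self.window
        return low <= self.switch_step <= high


def phase_at(schedule: ReincarnationSchedule, step: int) -> str:
    """Return the training phase for a 0-indexed global step."""
    return "nsr_prerl" if step < schedule.switch_step else "grpo"


schedule = ReincarnationSchedule(switch_step=15)
late = ReincarnationSchedule(switch_step=40)
```

Here `schedule.in_window()` holds while `late.in_window()` does not, which a training script could use to emit a warning before launching a run that switches too late.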

## Related Concepts

- [[dual-space-rl|DSRL]]
- [[pre-train-space-reinforcement-learning|PreRL]]
- [[negative-sample-reinforcement|NSR]]