myWiki/concepts/policy-reincarnation.md
2026-06-01 10:46:01 +08:00
---
title: "Policy Reincarnation"
created: 2026-05-18
type: concept
tags: ["reinforcement-learning", "transfer-learning", "policy-optimization"]
sources: ["https://arxiv.org/abs/2604.14142", "https://proceedings.neurips.cc/paper_files/paper/2022/hash/ba1c5356d9164bb64c446a4b690226b0-Abstract-Conference.html"]
---
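# Policy Reincarnation
## Definition
Policy Reincarnation is a training strategy: partway through training, **the base model is replaced with an intermediate checkpoint**, and on-policy RL is then restarted from it. The core idea is to reuse prior computation to speed up subsequent training.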
## Application in DSRL
[[dual-space-rl|DSRL]] uses Policy Reincarnation to chain [[pre-train-space-reinforcement-learning|PreRL]] and standard RL:
1. Train with NSR-PreRL for 10-25 steps to obtain a checkpoint
2. Treat that checkpoint as the new "base model"
3. Switch to standard GRPO and continue training in the Post-train Space
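
The three steps above can be sketched as a two-phase loop. This is a minimal illustration, not the DSRL implementation: `train_with_reincarnation`, `ReincarnationRun`, and the `StepFn` signature are hypothetical names, the two update functions are supplied by the caller, and passing a reference model into each update is an assumption of this sketch.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# Hypothetical signature: one update takes (policy, reference) and returns the new policy.
StepFn = Callable[[Dict[str, Any], Dict[str, Any]], Dict[str, Any]]


@dataclass
class ReincarnationRun:
    checkpoint: Dict[str, Any]    # policy snapshot taken at the switch
    final_policy: Dict[str, Any]  # policy after the second phase
    phases: List[str] = field(default_factory=list)  # phase label for each step


def train_with_reincarnation(
    base_policy: Dict[str, Any],
    prerl_step: StepFn,
    grpo_step: StepFn,
    switch_step: int,
    total_steps: int,
) -> ReincarnationRun:
    """Run `switch_step` PreRL updates, then restart RL from that checkpoint."""
    if not 0 < switch_step < total_steps:
        raise ValueError("switch_step must lie strictly between 0 and total_steps")

    phases: List[str] = []

    # Phase 1: NSR-PreRL updates, with the original base model as reference (assumed).
    policy = copy.deepcopy(base_policy)
    for _ in range(switch_step):
        policy = prerl_step(policy, base_policy)
        phases.append("prerl")

    # Reincarnation: the intermediate checkpoint becomes the new base model.
    checkpoint = copy.deepcopy(policy)

    # Phase 2: standard RL continues from the checkpoint, which is now the reference.
    for _ in range(total_steps - switch_step):
        policy = grpo_step(policy, checkpoint)
        phases.append("grpo")

    return ReincarnationRun(checkpoint=checkpoint, final_policy=policy, phases=phases)
```

The deep copy at the switch keeps the checkpoint frozen even if a later update function mutates the policy in place. In a real setup, optimizer state would presumably also be reset at that point, which this sketch leaves out.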
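## Why It Works
- The NSR-PreRL checkpoint has already **eliminated fundamental error patterns**
- The distribution P(y) has been pruned, which gives the fine-grained optimization of P(y|x) a better starting point
- Subsequent RL can focus on subtle, problem-specific differences rather than basic logical errors
- Evidence: DSRL's count of "Fully Solved" problems already rises sharply during the NSR-PreRL stage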
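## Reincarnation Timing
Ablation experiments show that S ∈ [10, 25] is the optimal reincarnation window, where S is the number of NSR-PreRL steps run before the switch. Reincarnating too late lets NSR's "over-exploration" effect hinder the subsequent fine-tuning.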
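
An ablation over the switch step can be organized as a simple sweep. The helper below is a hypothetical sketch: `sweep_switch_step` and the `run_and_score` callback are assumed names, and the callback stands in for whatever full training-plus-evaluation run produces a score for a given S.

```python
from typing import Callable, Dict, Iterable, Tuple


def sweep_switch_step(
    candidates: Iterable[int],
    run_and_score: Callable[[int], float],
) -> Tuple[int, Dict[int, float]]:
    """Score each candidate switch step S; return the best S and all scores."""
    scores = {s: run_and_score(s) for s in candidates}
    if not scores:
        raise ValueError("candidates must not be empty")
    # Highest score wins; ties resolve to the smallest S (the earliest switch).
    best = min(scores, key=lambda s: (-scores[s], s))
    return best, scores
```

Returning the full score table alongside the winner makes it easy to see whether the optimum is a broad window or a single sharp peak.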
## Related Concepts
- [[dual-space-rl|DSRL]]
- [[pre-train-space-reinforcement-learning|PreRL]]
- [[negative-sample-reinforcement|NSR]]