20260625: lots of new content
concepts/reinforced-online-policy-distillation.md
@@ -0,0 +1,58 @@
---
title: "Reinforced Online-Policy Distillation (ROPD)"
created: 2026-06-20
updated: 2026-06-20
type: concept
tags: ["post-training", "distillation", "reinforcement", "policy", "consolidation"]
sources: ["https://arxiv.org/abs/2606.17800"]
---

# Reinforced Online-Policy Distillation (ROPD)
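
**Reinforced Online-Policy Distillation (ROPD)** is the expert-consolidation strategy proposed by [[maineCoon|MaineCoon]]: it merges multiple domain-specific LoRA DPO experts into a **single deployable streaming policy**, with a domain verifier automatically weighting how strongly each expert intervenes.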
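
## Motivation

[[domain-aware-preference-optimization|Domain-Aware DPO]] trains a separate LoRA expert for each social-video domain (long shots, multi-person conversation, sports, etc.), but directly averaging the experts or routing among them adds deployment complexity. ROPD **merges the experts at training time** into a unified policy, so no routing is needed at inference.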

## Workflow

### 1. Student candidate generation
For a sample from domain `d`, the behavior student (the current student policy) generates `G` candidate chunks:
```
x̂_t^i ~ p_θ_old(x_t | x_<t, c), i=1,...,G
```
Each candidate goes through a complete denoising trajectory.
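As a rough illustration of this sampling step, the sketch below draws `G` candidates by Euler-integrating a velocity field from noise and records the denoising states each trajectory visits. The one-dimensional state, the `toy_velocity` field, the Euler integrator, and all function names are assumptions made for illustration; the source does not specify the sampler or how the context `c` is passed in.

```python
import random


def toy_velocity(x, t):
    """Hypothetical stand-in for the behavior student's velocity; pulls x toward 1.0."""
    return 1.0 - x


def sample_candidates(velocity_fn, G, num_steps=8, seed=0):
    """Draw G candidates, each via a full Euler denoising trajectory (assumed sampler)."""
    rng = random.Random(seed)
    dt = 1.0 / num_steps
    candidates = []
    for _ in range(G):
        x = rng.gauss(0.0, 1.0)  # start each trajectory from Gaussian noise
        visited = []
        for k in range(num_steps):
            t = k * dt
            visited.append((t, x))  # remember the student-visited state
            x = x + dt * velocity_fn(x, t)  # one Euler denoising step
        candidates.append({"chunk": x, "visited": visited})
    return candidates


candidates = sample_candidates(toy_velocity, G=4)
```

Keeping the visited `(t, x)` pairs around is a design choice of this sketch: the later distillation step is described as operating on student-visited denoising states, so they need to be stored somewhere.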

### 2. Domain verifier scoring
A domain-specific verifier assigns each candidate a score `R_i ∈ {0, 1}`, and the group success rate is computed as:
```
R̄ = (1/G) Σ R_i
```
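The group success rate is simply the mean of the binary verifier scores. A minimal sketch, where the function name and the input validation are assumptions and the verifier itself is left out:

```python
def group_success_rate(rewards):
    """Mean of binary verifier scores R_i over the G candidates."""
    if not rewards:
        raise ValueError("need at least one candidate")
    if any(r not in (0, 1) for r in rewards):
        raise ValueError("verifier scores must be 0 or 1")
    return sum(rewards) / len(rewards)


# Two of four candidates pass the (hypothetical) verifier.
r_bar = group_success_rate([1, 0, 0, 1])
print(r_bar)  # 0.5
```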

### 3. Adaptive expert weights
The key innovation of ROPD is that it **automatically adjusts how strongly the expert intervenes**:
```
η_i = α(1 - R̄) / (R_i + 1 - R̄)
```
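- All candidates fail (R̄=0): `η_i = α`, the maximum expert weight
- All candidates succeed (R̄=1): `η_i = 0`, zero expert intervention
- Mixed results: failed candidates (`R_i = 0`) receive the larger weight `η_i = α`, while successful candidates (`R_i = 1`) receive the smaller weight `η_i = α(1 - R̄) / (2 - R̄)`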
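The three cases can be checked numerically with a direct transcription of the weight formula. This is a sketch only: the function name and the example value `alpha = 0.5` are assumptions, not values from the source.

```python
def expert_weights(rewards, alpha):
    """Per-candidate expert weight: eta_i = alpha * (1 - R_bar) / (R_i + 1 - R_bar)."""
    if any(r not in (0, 1) for r in rewards):
        raise ValueError("verifier scores must be 0 or 1")
    r_bar = sum(rewards) / len(rewards)
    # With binary scores the denominator is never zero: it could only vanish
    # for R_i = 0 and R_bar = 1, which cannot happen together.
    return [alpha * (1 - r_bar) / (r + 1 - r_bar) for r in rewards]


alpha = 0.5  # assumed example value
all_fail = expert_weights([0, 0, 0, 0], alpha)  # every weight equals alpha
all_pass = expert_weights([1, 1, 1, 1], alpha)  # every weight equals 0
mixed = expert_weights([1, 0, 0, 1], alpha)     # failures get alpha, successes get less
```

In the mixed case the success rate is 0.5, so failed candidates keep the full weight `0.5` while successful ones drop to `0.5 * 0.5 / 1.5 ≈ 0.167`.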
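
### 4. Path-wise distillation objective
Construct a proximal target in velocity space, where `sg[·]` denotes stop-gradient, `f_θ_old` is the behavior student's velocity prediction, and `f_φ_d` is that of the domain-`d` expert: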
```
ṽ = (1-η_i) · sg[f_θ_old] + η_i · sg[f_φ_d]
```
The student fits this mixed velocity directly, with no PPO-style policy gradient needed.
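A small numeric sketch of the mixed target and a regression loss against it is given below. The mean-squared-error loss is an assumption (the source only says the student fits the mixed velocity), and stop-gradient is represented here by treating both velocity predictions as plain constants. All names and numbers are illustrative.

```python
def proximal_target(v_old, v_expert, eta):
    """Mix the frozen student and expert velocities: (1 - eta) * v_old + eta * v_expert."""
    return [(1 - eta) * vo + eta * ve for vo, ve in zip(v_old, v_expert)]


def distillation_loss(v_student, target):
    """Assumed objective: mean squared error between student velocity and the target."""
    return sum((vs - tg) ** 2 for vs, tg in zip(v_student, target)) / len(target)


v_old = [0.0, 2.0]     # behavior student's velocity at a visited state (constant)
v_expert = [1.0, 0.0]  # domain expert's velocity at the same state (constant)

target = proximal_target(v_old, v_expert, eta=0.25)  # [0.25, 1.5]
loss = distillation_loss(v_old, target)

# With eta = 0 the target collapses to the student's own velocity, so the loss is zero.
no_pull = distillation_loss(v_old, proximal_target(v_old, v_expert, eta=0.0))
```

When `eta` is zero the expert contributes nothing, which matches the all-candidates-succeed case of the weight formula.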
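
## Features

- **Adaptivity**: the weights are determined dynamically by the current student's **actual performance**, with no manual tuning
- **Path-wise optimization**: optimizes directly on student-visited denoising states, avoiding stochastic transition ratios
- **Discarded after training**: no domain adapter or verifier is needed at deployment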

## References
- [[maineCoon|MaineCoon paper]], Section 3.3
- [[domain-aware-preference-optimization|Domain-Aware DPO]]
- DiffusionOPD (Luo et al.)