20260625: lots of new content

2026-06-25 14:08:47 +08:00
parent 91fac5b6fc
commit 6021dea160
375 changed files with 19263 additions and 251 deletions
--- a/concepts/reinforced-online-policy-distillation.md
+++ b/concepts/reinforced-online-policy-distillation.md
@@ -0,0 +1,58 @@
+---
+title: "Reinforced Online-Policy Distillation (ROPD)"
+created: 2026-06-20
+updated: 2026-06-20
+type: concept
+tags: ["post-training", "distillation", "reinforcement", "policy", "consolidation"]
+sources: ["https://arxiv.org/abs/2606.17800"]
+---
+
+# Reinforced Online-Policy Distillation (ROPD)
+
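+**Reinforced Online-Policy Distillation (ROPD)** is the expert-consolidation strategy proposed in [[maineCoon|MaineCoon]]: it merges multiple domain-specific LoRA DPO experts into a **single deployable streaming policy**, with a domain verifier automatically weighting how strongly the expert intervenes.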
+
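+## Motivation
+
+[[domain-aware-preference-optimization|Domain-Aware DPO]] trains a separate LoRA expert for each social-video domain (long shots, multi-person conversation, motion, etc.), but averaging the experts directly or routing between them adds deployment complexity. ROPD **merges the experts at training time** into one unified policy, so no routing is needed at inference.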
+
+## Workflow
+
+### 1. Student candidate generation
+For a sample from domain `d`, the behavior student (the current student policy) generates `G` candidate chunks:
+```
+x̂_t^i ~ p_θ_old(x_t | x_<t, c),  i=1,...,G
+```
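+Each candidate goes through the full denoising trajectory.
+
+### 2. Domain verifier scoring
+A domain-specific verifier assigns each candidate a score `R_i ∈ {0, 1}`, and the group success rate is computed as: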
+```
+R̄ = (1/G) Σ R_i
+```
+
+### 3. Adaptive expert weights
+ROPD's key innovation is that it **automatically adjusts the degree of expert intervention**:
+```
+η_i = α(1 - R̄) / (R_i + 1 - R̄)
+```
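+- All candidates fail (R̄=0): `η_i = α`, the maximum expert weight
+- All candidates succeed (R̄=1): `η_i = 0`, no expert intervention
+- Mixed results: failed candidates (`R_i = 0`) get the full weight `η_i = α`, while successful candidates (`R_i = 1`) get the smaller weight `η_i = α(1 - R̄) / (2 - R̄)`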
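
The weighting rule above can be checked numerically with a minimal sketch. The function name `expert_weights` and the list-based interface are illustrative assumptions, not part of the source; only the formula for `η_i` and the group success rate `R̄` come from the note.

```python
from typing import List


def expert_weights(rewards: List[int], alpha: float) -> List[float]:
    """Return eta_i = alpha * (1 - R_bar) / (R_i + 1 - R_bar) for each candidate.

    `rewards` holds the binary verifier scores R_i in {0, 1} for one group.
    Hypothetical helper for illustration only.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one candidate score")
    if any(r not in (0, 1) for r in rewards):
        raise ValueError("verifier scores are assumed to be binary (0 or 1)")

    # Group success rate R_bar = (1/G) * sum(R_i).
    r_bar = sum(rewards) / len(rewards)

    weights = []
    for r in rewards:
        # The denominator is >= 1 - R_bar for failures and >= 1 for successes.
        # It could only be zero if r == 0 while r_bar == 1, which is impossible
        # because r_bar == 1 means every candidate succeeded.
        denom = r + 1.0 - r_bar
        weights.append(alpha * (1.0 - r_bar) / denom)
    return weights


if __name__ == "__main__":
    # Two successes and two failures out of G = 4 candidates.
    print(expert_weights([1, 0, 0, 1], alpha=0.5))
```

With a mixed group the failed candidates receive exactly `α`, and the successful ones receive `α(1 - R̄) / (2 - R̄)`, matching the three cases listed above.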
+
+### 4. Pathwise distillation objective
+A proximal target is constructed in velocity space:
+```
+ṽ = (1-η_i) · sg[f_θ_old] + η_i · sg[f_φ_d]
+```
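+Here `sg[·]` denotes stop-gradient, `f_θ_old` is the behavior student's velocity, and `f_φ_d` is the domain-`d` expert's velocity. The student fits this mixed velocity directly, with no PPO-style policy gradient required.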
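
As a rough illustration of the mixing step, the sketch below builds the proximal target from two velocity vectors and scores a student prediction against it. The function names and the plain-list representation are hypothetical. The mean-squared-error loss is an assumption as well: the note only says the student "fits" the target and does not name a loss. Because plain Python has no autodiff, both inputs are already constants, which plays the role of `sg[·]` here.

```python
from typing import List, Sequence


def proximal_target(
    v_student_old: Sequence[float],
    v_expert: Sequence[float],
    eta: float,
) -> List[float]:
    """Mix velocities: v_tilde = (1 - eta) * f_theta_old + eta * f_phi_d.

    Both inputs are treated as fixed numbers (the stop-gradient in the note).
    """
    if len(v_student_old) != len(v_expert):
        raise ValueError("velocity vectors must have the same length")
    return [(1.0 - eta) * s + eta * e for s, e in zip(v_student_old, v_expert)]


def mse(prediction: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error; an assumed regression loss, not confirmed by the source."""
    if len(prediction) != len(target):
        raise ValueError("prediction and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)


if __name__ == "__main__":
    target = proximal_target([1.0, 2.0], [3.0, 6.0], eta=0.25)
    print(target, mse([1.0, 2.0], target))
```

When `η_i = 0` the target collapses to the old student's own velocity, and when `η_i = 1` it is the expert's velocity, so `η_i` interpolates between self-consistency and full expert imitation.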
+
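+## Properties
+
+- **Adaptive**: the weights are set dynamically by the current student's **actual performance**, with no manual tuning
+- **Pathwise optimization**: optimizes directly on student-visited denoising states, avoiding stochastic transition ratios
+- **Discarded after training**: no domain adapter or verifier is needed at deployment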
+
+## References
+- [[maineCoon|MaineCoon paper]] Section 3.3
+- [[domain-aware-preference-optimization|Domain-Aware DPO]]
+- DiffusionOPD (Luo et al.)