20260617: currently 914 pages

2026-06-17 15:02:40 +08:00
parent e96b955fda
commit 91fac5b6fc
423 changed files with 20687 additions and 34 deletions
--- a/concepts/specialize-then-unify-rl.md
+++ b/concepts/specialize-then-unify-rl.md
@@ -0,0 +1,46 @@
+---
+title: "Specialize-then-Unify RL"
+created: 2026-06-10
+updated: 2026-06-10
+type: concept
+tags: [reinforcement-learning, recommendation, training-strategy]
+sources: [raw/papers/onereason-team-onereason-2026.md]
+---
+
+# Specialize-then-Unify RL
+
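+> The reinforcement learning training strategy proposed by OneReason: first optimize thinking mode within a single domain, then balance and refine across domains.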
+
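+## Motivation
+
+OneReason reports a counterintuitive finding:
+
+- **Multi-domain mixed RL**: thinking mode still lags behind non-thinking mode
+- **Single-domain RL**: thinking mode consistently outperforms non-thinking mode
+
+This suggests that the thinking-mode advantage is sensitive to domain mixing: reasoning ability has to mature within a domain before it can generalize across domains.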
+
+## Two-Phase Strategy
+
+### Phase 1: Specialize
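+Run RL within a single recommendation domain to fully realize the advantage of thinking mode.
+
+- Each domain is trained independently, without interference from the data distributions of other domains
+- thinking mode receives a full in-domain optimization signal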
+
+### Phase 2: Unify
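+Balance and refine across domains, with two candidate approaches:
+
+- **[[rejection-sampling-fine-tuning|Rejection Sampling Fine-tuning (RSFT)]]**: sample high-quality thinking trajectories and fine-tune on them
+- **[[multi-teacher-on-policy-distillation|Multi-Teacher On-Policy Distillation (MODPO)]]**: on-policy distillation from multiple teachers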
+
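+## Key Insight
+
+**Specialize first, then unify**: cross-domain generalization of reasoning ability presupposes sufficient in-domain development. This is an interesting contrast with the LLM pattern of "broad pretraining first, then specialized fine-tuning": recommendation reasoning takes the reverse path of "specialize first, then unify".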
+
+## References
+
+- [[onereason|OneReason]]
+- [[rejection-sampling-fine-tuning|Rejection Sampling FT]]
+- [[multi-teacher-on-policy-distillation|Multi-Teacher On-Policy Distillation]]
+- [[recommendation-reasoning|Recommendation Reasoning]]
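
Note on the two-phase strategy in `concepts/specialize-then-unify-rl.md`: the RSFT option for the Unify phase can be illustrated with a minimal sketch. The note only says that high-quality thinking trajectories are sampled for fine-tuning, so everything concrete here is an assumption made for illustration: the `Trajectory` type, the `sample_trajectories` stand-in for a Phase 1 specialist, the reward threshold, the per-domain cap used to approximate cross-domain balance, and the domain names. It is not OneReason's actual implementation.

```python
import random
from dataclasses import dataclass


@dataclass
class Trajectory:
    domain: str    # recommendation domain the trace came from
    text: str      # the thinking trace itself (placeholder string here)
    reward: float  # scalar quality score in [0, 1)


def sample_trajectories(domain: str, n: int, rng: random.Random) -> list[Trajectory]:
    """Hypothetical stand-in for sampling from a Phase 1 single-domain specialist.

    A real pipeline would run the specialist policy and score its outputs;
    here rewards are just random numbers.
    """
    return [Trajectory(domain, f"{domain}-trace-{i}", rng.random()) for i in range(n)]


def rejection_sample(trajs: list[Trajectory], threshold: float) -> list[Trajectory]:
    """Keep only trajectories whose reward meets the threshold."""
    return [t for t in trajs if t.reward >= threshold]


def build_unify_dataset(
    domains: list[str],
    n_per_domain: int,
    threshold: float,
    per_domain_cap: int,
    seed: int = 0,
) -> list[Trajectory]:
    """Assemble a cross-domain fine-tuning set from per-domain specialists.

    The cap (an assumption) keeps any one domain from dominating the mix.
    """
    rng = random.Random(seed)
    dataset: list[Trajectory] = []
    for domain in domains:
        kept = rejection_sample(sample_trajectories(domain, n_per_domain, rng), threshold)
        kept.sort(key=lambda t: t.reward, reverse=True)  # best traces first
        dataset.extend(kept[:per_domain_cap])
    return dataset


if __name__ == "__main__":
    data = build_unify_dataset(["video", "ecommerce"], n_per_domain=50,
                               threshold=0.5, per_domain_cap=5)
    for t in data:
        print(t.domain, t.text, round(t.reward, 3))
```

The resulting list would then feed an ordinary supervised fine-tuning step for the unified model; that step is omitted because the note does not describe it.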