---
title: "Specialize-then-Unify RL"
created: 2026-06-10
updated: 2026-06-10
type: concept
tags: [reinforcement-learning, recommendation, training-strategy]
sources: [raw/papers/onereason-team-onereason-2026.md]
---
# Specialize-then-Unify RL
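> The reinforcement learning training strategy proposed by OneReason: first optimize thinking mode within individual domains, then balance and refine across domains.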
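## Motivation
OneReason reports a counterintuitive observation:
- **Multi-domain mixed RL**: thinking mode still lags behind non-thinking mode
- **Single-domain RL**: thinking mode consistently outperforms non-thinking mode
This suggests that the advantage of thinking mode is sensitive to domain mixing: reasoning ability has to mature within a domain before it can generalize across domains.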
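## Two-Stage Strategy
### Phase 1: Specialize
Run RL within a single recommendation domain so that the advantage of thinking mode is fully realized.
- Each domain is trained independently, free from interference by the data distributions of other domains
- Thinking mode receives a sufficient in-domain optimization signal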
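### Phase 2: Unify
Balance and refine across domains, using one of two alternative approaches:
- **[[rejection-sampling-fine-tuning|Rejection Sampling Fine-tuning (RSFT)]]**: sample high-quality thinking trajectories and fine-tune on them
- **[[multi-teacher-on-policy-distillation|Multi-Teacher On-Policy Distillation (MODPO)]]**: on-policy distillation from multiple teachers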
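
The RSFT option can be illustrated with a minimal sketch of the data-selection step. This is not OneReason's implementation. It assumes that trajectories are sampled from the Phase 1 per-domain specialists, that a scalar reward is available for each trajectory, and that domains are balanced by downsampling to the smallest domain. The names `rejection_sample`, `balance_domains`, `samplers`, `reward_fn`, `k`, and `threshold` are all hypothetical.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Trajectory = Tuple[str, str]  # (prompt, thinking-mode response)


def rejection_sample(
    prompts_by_domain: Dict[str, List[str]],
    samplers: Dict[str, Callable[[str], str]],      # hypothetical: one Phase 1 specialist per domain
    reward_fn: Callable[[str, str, str], float],    # hypothetical: (domain, prompt, response) -> score
    k: int = 4,                                     # assumed number of samples per prompt
    threshold: float = 0.5,                         # assumed acceptance threshold
) -> Dict[str, List[Trajectory]]:
    """Keep the best of k sampled trajectories per prompt if it clears the threshold."""
    kept: Dict[str, List[Trajectory]] = defaultdict(list)
    for domain, prompts in prompts_by_domain.items():
        sample = samplers[domain]
        for prompt in prompts:
            candidates = [sample(prompt) for _ in range(k)]
            scored = [(reward_fn(domain, prompt, c), c) for c in candidates]
            best_reward, best = max(scored, key=lambda rc: rc[0])
            if best_reward >= threshold:
                kept[domain].append((prompt, best))
    return dict(kept)


def balance_domains(
    kept: Dict[str, List[Trajectory]], seed: int = 0
) -> List[Trajectory]:
    """Downsample every domain to the size of the smallest one, then shuffle."""
    if not kept:
        return []
    rng = random.Random(seed)
    n = min(len(trajs) for trajs in kept.values())
    mixed: List[Trajectory] = []
    for trajs in kept.values():
        mixed.extend(rng.sample(trajs, n))
    rng.shuffle(mixed)
    return mixed
```

The list returned by `balance_domains` would then serve as the supervised fine-tuning set for the unified model. Downsampling is only one possible balancing choice; reweighting or upsampling smaller domains would fit the same interface.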
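## Key Insight
**Specialize first, then unify**: cross-domain generalization of reasoning ability presupposes that the ability has first matured within each domain. This forms an interesting contrast with the LLM pattern of "broad pretraining first, then specialized fine-tuning": recommendation reasoning takes the reverse path of "specialize first, then unify".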
## References
- [[onereason|OneReason]]
- [[rejection-sampling-fine-tuning|Rejection Sampling FT]]
- [[multi-teacher-on-policy-distillation|Multi-Teacher On-Policy Distillation]]
- [[recommendation-reasoning|Recommendation Reasoning]]