20260617: currently 914 pages
46
concepts/specialize-then-unify-rl.md
Normal file
@@ -0,0 +1,46 @@
---
title: "Specialize-then-Unify RL"
created: 2026-06-10
updated: 2026-06-10
type: concept
tags: [reinforcement-learning, recommendation, training-strategy]
sources: [raw/papers/onereason-team-onereason-2026.md]
---

# Specialize-then-Unify RL

> A reinforcement learning training strategy proposed by OneReason: first optimize thinking mode within each single domain, then balance and refine across domains.

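## Motivation

OneReason reports a counterintuitive observation:

- **Multi-domain mixed RL**: thinking mode still lags behind non-thinking mode
- **Single-domain RL**: thinking mode consistently outperforms non-thinking mode

This suggests that the advantage of thinking is sensitive to domain mixing: reasoning ability has to mature within each domain before it can generalize across domains.
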
## Two-Stage Strategy

### Phase 1: Specialize
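
Run RL within a single recommendation domain so that thinking mode can fully realize its advantage.

- Each domain is trained independently, free from interference from other domains' data distributions
- Thinking mode receives a sufficient in-domain optimization signal
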
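The independence of Phase 1 can be illustrated with a minimal Python sketch. It only shows the data routing: samples are grouped by domain, and each group is handed to a separate RL run that starts from the same base policy. The names `specialize`, `train_rl`, and `base_policy` are hypothetical and not taken from the source, and the RL algorithm itself is deliberately left as a caller-supplied function.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List, Tuple


def specialize(
    base_policy: Any,
    samples: Iterable[Tuple[str, str]],
    train_rl: Callable[[Any, List[str]], Any],
) -> Dict[str, Any]:
    """Train one specialist per domain (hypothetical helper, not from the source).

    samples: (domain, prompt) pairs.
    train_rl: caller-supplied single-domain RL routine; assumed to return a new
        policy without mutating base_policy.
    """
    # Group prompts by domain so that no run ever sees another domain's data.
    by_domain: Dict[str, List[str]] = defaultdict(list)
    for domain, prompt in samples:
        by_domain[domain].append(prompt)

    # Every specialist starts from the same base policy and is trained alone.
    return {
        domain: train_rl(base_policy, prompts)
        for domain, prompts in by_domain.items()
    }
```

The result is a mapping from domain name to specialist policy, which is the input a Phase 2 procedure would consume.
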
### Phase 2: Unify

Balance and refine across domains, using one of two options:

- **[[rejection-sampling-fine-tuning|Rejection Sampling Fine-tuning (RSFT)]]**: sample high-quality thinking trajectories and fine-tune on them
- **[[multi-teacher-on-policy-distillation|Multi-Teacher On-Policy Distillation (MODPO)]]**: on-policy distillation from multiple teachers

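The RSFT option can be sketched as a generic rejection-sampling data builder. This is a minimal illustration under stated assumptions, not the procedure from the OneReason paper: the per-domain `samplers` (standing in for Phase 1 specialists), the `reward_fn`, the acceptance `threshold`, and the `per_domain_cap` used as a crude form of cross-domain balancing are all hypothetical choices.

```python
import random
from typing import Callable, Dict, List, Tuple

Sampler = Callable[[str, random.Random], str]


def build_rsft_dataset(
    prompts_by_domain: Dict[str, List[str]],
    samplers: Dict[str, Sampler],
    reward_fn: Callable[[str, str], float],
    k: int = 4,
    threshold: float = 0.5,
    per_domain_cap: int = 2,
    seed: int = 0,
) -> List[Tuple[str, str, str]]:
    """Return (domain, prompt, trajectory) triples for supervised fine-tuning.

    All parameters are illustrative assumptions; the source does not specify
    how trajectories are scored, filtered, or balanced.
    """
    rng = random.Random(seed)
    dataset: List[Tuple[str, str, str]] = []
    for domain, prompts in prompts_by_domain.items():
        kept: List[Tuple[str, str, str]] = []
        for prompt in prompts:
            # Draw k candidate thinking trajectories from the domain's sampler.
            candidates = [samplers[domain](prompt, rng) for _ in range(k)]
            # Score each candidate and keep only the best one per prompt.
            best_reward, best = max(
                ((reward_fn(prompt, c), c) for c in candidates),
                key=lambda pair: pair[0],
            )
            # Reject the prompt entirely if even the best sample is too weak.
            if best_reward >= threshold:
                kept.append((domain, prompt, best))
        # Cap each domain's contribution so no single domain dominates.
        dataset.extend(kept[:per_domain_cap])
    return dataset
```

The returned triples would then be used as ordinary supervised fine-tuning data for a single unified model. The cap is the simplest possible balancing rule and could be replaced by any weighting scheme.
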
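## Key Insight

**Specialize first, then unify**: cross-domain generalization of reasoning presupposes that reasoning has first matured within each domain. This forms an interesting contrast with the "broad pretraining first, then specialized fine-tuning" pattern in LLMs: recommendation reasoning takes the reverse path of "specialize first, then unify".
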
## References

- [[onereason|OneReason]]
- [[rejection-sampling-fine-tuning|Rejection Sampling FT]]
- [[multi-teacher-on-policy-distillation|Multi-Teacher On-Policy Distillation]]
- [[recommendation-reasoning|Recommendation Reasoning]]