SidneyZhang/myWiki

Sidney Zhang 91fac5b6fc

20260617: currently 914 pages

2026-06-17 15:02:40 +08:00

title: Specialize-then-Unify RL
created: 2026-06-10
updated: 2026-06-10
type: concept
tags: reinforcement-learning, recommendation, training-strategy
sources: raw/papers/onereason-team-onereason-2026.md

Specialize-then-Unify RL

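The reinforcement learning training strategy proposed by OneReason: first optimize thinking mode within a single domain, then balance and refine across domains.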

Motivation

OneReason reports a counterintuitive finding:

Multi-domain mixed RL: thinking mode still lags behind non-thinking mode
Single-domain RL: thinking mode consistently outperforms non-thinking mode

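This suggests that the thinking advantage is sensitive to domain mixing: reasoning ability has to develop fully within a domain before it can generalize across domains.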

Two-Phase Strategy

Phase 1: Specialize

Run RL within a single recommendation domain so that the advantage of thinking mode is fully realized.

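Each domain is trained independently, without interference from the data distributions of other domains
Thinking mode receives sufficient in-domain optimization signal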
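A minimal sketch of the Phase 1 control flow, assuming a mixed pool of `(domain, prompt)` pairs. The names `group_by_domain`, `specialize`, `init_policy`, and `train_rl` are hypothetical placeholders for illustration, not OneReason's actual interfaces; the RL algorithm itself is left abstract.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

Policy = TypeVar("Policy")
Sample = Tuple[str, str]  # (domain, prompt): an assumed data layout


def group_by_domain(samples: Iterable[Sample]) -> Dict[str, List[str]]:
    """Split a mixed pool into per-domain prompt lists."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for domain, prompt in samples:
        grouped[domain].append(prompt)
    return dict(grouped)


def specialize(
    samples: Iterable[Sample],
    init_policy: Callable[[], Policy],
    train_rl: Callable[[Policy, List[str]], Policy],
) -> Dict[str, Policy]:
    """Train one specialist per domain, each from a fresh base policy."""
    specialists: Dict[str, Policy] = {}
    for domain, prompts in group_by_domain(samples).items():
        # Each RL run sees only its own domain's prompts, so no other
        # domain's data distribution can interfere with it.
        specialists[domain] = train_rl(init_policy(), prompts)
    return specialists
```

Starting every specialist from a fresh copy of the base policy is one possible design choice; it keeps the runs independent and makes the resulting specialists directly comparable.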

Phase 2: Unify

Balance and refine across domains, with two alternative approaches:

rejection-sampling-fine-tuning: sample high-quality thinking trajectories and fine-tune on them
multi-teacher-on-policy-distillation: on-policy distillation from multiple teachers
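The two Phase 2 options can be sketched in toy form as follows. This is an illustration under stated assumptions, not the method as specified in the source: the acceptance rule (a fixed reward `threshold` over `k` samples), the use of reverse KL as the distillation objective, and routing each step to the teacher of its own domain are all assumptions, and every function name is hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple


def rejection_sample(
    prompts: Sequence[str],
    sample: Callable[[str], str],
    reward: Callable[[str, str], float],
    k: int = 4,
    threshold: float = 0.5,
) -> List[Tuple[str, str]]:
    """Draw k trajectories per prompt; keep those scoring >= threshold."""
    kept: List[Tuple[str, str]] = []
    for prompt in prompts:
        for _ in range(k):
            trajectory = sample(prompt)
            if reward(prompt, trajectory) >= threshold:
                kept.append((prompt, trajectory))
    return kept  # (prompt, trajectory) pairs for supervised fine-tuning


def reverse_kl(student: Sequence[float], teacher: Sequence[float]) -> float:
    """KL(student || teacher) over a shared token vocabulary.

    Assumes the teacher assigns non-zero probability wherever the
    student does.
    """
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0)


def distill_loss(
    domain: str,
    student_probs: Sequence[float],
    teacher_probs: Dict[str, Sequence[float]],
) -> float:
    """Score one student-generated step against its own domain's teacher."""
    return reverse_kl(student_probs, teacher_probs[domain])
```

In this sketch the Phase 1 specialists would play the role of `sample` (for rejection sampling) or supply `teacher_probs` (for distillation), while the unified model is the one being fine-tuned or distilled.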

Key Insight

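Specialize first, then unify: cross-domain generalization of reasoning ability presupposes that the ability has first developed fully within each domain. This forms an interesting contrast with the LLM pattern of "broad pretraining first, specialized fine-tuning second": recommendation reasoning takes the reverse path of "specialize first, unify second".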

References