---
title: "Rejection Sampling Fine-tuning (RSFT)"
created: 2026-06-10
updated: 2026-06-10
type: concept
tags: [reinforcement-learning, fine-tuning, rejection-sampling]
sources: [raw/papers/onereason-team-onereason-2026.md]
---

# Rejection Sampling Fine-tuning (RSFT)

> A technique that samples model outputs, keeps only the high-quality ones, and uses them for supervised fine-tuning. OneReason uses it in the Unify stage of [[specialize-then-unify-rl|specialize-then-unify RL]].

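## Core idea

The Rejection Sampling Fine-tuning workflow (Yuan et al., 2023):

1. **Sample**: draw a large number of reasoning trajectories from the current policy (or from a set of teacher models).
2. **Filter**: reject low-quality trajectories using a reward model or a verifier.
3. **Fine-tune**: run SFT only on the high-quality trajectories that pass the filter.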
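
The sample and filter steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation from Yuan et al. or OneReason: `rejection_sample`, `sample_fn`, `score_fn`, `num_samples`, and `threshold` are hypothetical names, and the toy policy and exact-match verifier stand in for a real language model and a real reward model or verifier.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    prompt: str
    response: str
    score: float


def rejection_sample(
    prompts: List[str],
    sample_fn: Callable[[str], str],          # assumed: draws one response from the policy
    score_fn: Callable[[str, str], float],    # assumed: reward model or verifier score
    num_samples: int = 8,
    threshold: float = 1.0,
) -> List[Trajectory]:
    """Steps 1 and 2: sample trajectories, then reject those scoring below `threshold`."""
    kept: List[Trajectory] = []
    for prompt in prompts:
        for _ in range(num_samples):
            response = sample_fn(prompt)
            score = score_fn(prompt, response)
            if score >= threshold:
                kept.append(Trajectory(prompt, response, score))
    return kept


# Toy stand-ins (assumptions for illustration only).
_rng = random.Random(0)


def toy_policy(prompt: str) -> str:
    """Answers 'a+b' prompts, sometimes off by one."""
    a, b = map(int, prompt.split("+"))
    return str(a + b + _rng.choice([-1, 0, 0, 1]))


def toy_verifier(prompt: str, response: str) -> float:
    """Returns 1.0 for an exactly correct sum, else 0.0."""
    a, b = map(int, prompt.split("+"))
    return 1.0 if int(response) == a + b else 0.0


sft_data = rejection_sample(["1+2", "10+5"], toy_policy, toy_verifier)
# Step 3 would run ordinary SFT on `sft_data`; the training loop is omitted here.
```

Keeping the sampler and the scorer as plain callables means the same loop covers both variants named in step 1 (current policy or teacher models) and both filters named in step 2 (reward model or verifier).
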
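
## Application in OneReason

In the Unify stage of [[specialize-then-unify-rl|specialize-then-unify]], RSFT is used to:

- sample thinking trajectories from each single-domain specialist model
- filter for reasoning patterns that are consistent across domains
- fine-tune to obtain a unified cross-domain reasoning capability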
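
One way to picture the Unify-stage data flow is to pool accepted trajectories from every specialist into a single SFT set. The sketch below is an assumption-heavy illustration, not OneReason's actual pipeline: `build_unified_sft_set`, `specialists`, `accept`, and `max_per_domain` are hypothetical names, the per-domain cap is an added assumption, and the source does not specify how "consistent across domains" is measured, so the filter is left as an opaque predicate.

```python
from typing import Callable, Dict, List, Optional, Tuple

Sample = Tuple[str, str, str]  # (domain, prompt, response)


def build_unified_sft_set(
    prompts_by_domain: Dict[str, List[str]],
    specialists: Dict[str, Callable[[str], str]],   # assumed: one sampler per domain
    accept: Callable[[str, str, str], bool],        # assumed: opaque quality filter
    samples_per_prompt: int = 4,
    max_per_domain: Optional[int] = None,
) -> List[Sample]:
    """Pool accepted trajectories from every specialist into one SFT dataset."""
    unified: List[Sample] = []
    for domain, prompts in prompts_by_domain.items():
        specialist = specialists[domain]
        kept: List[Sample] = []
        for prompt in prompts:
            for _ in range(samples_per_prompt):
                response = specialist(prompt)
                # The real filtering criterion is not given in the source.
                if accept(domain, prompt, response):
                    kept.append((domain, prompt, response))
        # Optional cap so no single domain dominates the mix (an assumption).
        unified.extend(kept[:max_per_domain])
    return unified


# Toy usage with stand-in specialists (illustrative only).
specialists = {
    "math": lambda p: f"<think>compute {p}</think> 4",
    "code": lambda p: f"<think>plan {p}</think> print('hi')",
}
prompts = {"math": ["2+2"], "code": ["say hi"]}


def has_thinking(domain: str, prompt: str, response: str) -> bool:
    return response.startswith("<think>")


unified = build_unified_sft_set(
    prompts, specialists, has_thinking, samples_per_prompt=2, max_per_domain=1
)
```

The unified model would then be fine-tuned on `unified` with ordinary SFT, as in step 3 of the core workflow.
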
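
## Relationship to other methods

- vs [[multi-teacher-on-policy-distillation|MODPO]]: RSFT is an offline method (sample first, then fine-tune), whereas MODPO is an online method (distillation happens during training).
- vs plain SFT: the key to RSFT is the sample-then-filter loop, which keeps low-quality trajectories out of the training data.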

## References

- [[specialize-then-unify-rl|Specialize-then-Unify RL]]
- [[multi-teacher-on-policy-distillation|MODPO]]
- [[onereason|OneReason]]