---
title: "Rejection Sampling Fine-tuning (RSFT)"
created: 2026-06-10
updated: 2026-06-10
type: concept
tags: [reinforcement-learning, fine-tuning, rejection-sampling]
sources: [raw/papers/onereason-team-onereason-2026.md]
---
# Rejection Sampling Fine-tuning (RSFT)
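> Supervised fine-tuning on model outputs that have been sampled and then filtered for quality. In OneReason it is used in the Unify stage of [[specialize-then-unify-rl|specialize-then-unify RL]].
## Core idea
Rejection Sampling FT (Yuan et al., 2023) works in three steps:
1. **Sample**: draw a large number of reasoning trajectories from the current policy (or from a set of teacher models).
2. **Filter**: reject low-quality trajectories using a reward model or a verifier.
3. **Fine-tune**: run SFT only on the high-quality trajectories that pass the filter.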
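The sample and filter steps can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the OneReason implementation: `build_rsft_dataset`, `sample_fn`, `verifier`, the budget `k`, and the toy addition task are all hypothetical stand-ins for a real policy, a real reward model or verifier, and a real sampling budget. Removing duplicate trajectories is also an added assumption.

```python
import random
from typing import Callable, List, Tuple


def build_rsft_dataset(
    prompts: List[str],
    sample_fn: Callable[[str, random.Random], str],  # hypothetical: draws one trajectory
    verifier: Callable[[str, str], bool],            # hypothetical: True means "accept"
    k: int = 8,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Return (prompt, trajectory) pairs that survive rejection sampling."""
    rng = random.Random(seed)
    dataset: List[Tuple[str, str]] = []
    for prompt in prompts:
        # Step 1 (sample): draw k candidate trajectories for this prompt.
        candidates = [sample_fn(prompt, rng) for _ in range(k)]
        # Step 2 (filter): drop rejected candidates and exact duplicates.
        accepted: List[str] = []
        for candidate in candidates:
            if verifier(prompt, candidate) and candidate not in accepted:
                accepted.append(candidate)
        # Step 3 (fine-tune) would consume these pairs as ordinary SFT data.
        dataset.extend((prompt, candidate) for candidate in accepted)
    return dataset


# Toy stand-ins: a noisy "policy" for addition and an exact-match verifier.
def toy_policy(prompt: str, rng: random.Random) -> str:
    a, b = map(int, prompt.split("+"))
    return str(a + b + rng.choice([0, 0, 1, -1]))


def toy_verifier(prompt: str, answer: str) -> bool:
    a, b = map(int, prompt.split("+"))
    return int(answer) == a + b


sft_data = build_rsft_dataset(["2+3", "10+7"], toy_policy, toy_verifier, k=8)
```

Every pair in `sft_data` has passed the verifier, so the later SFT step never sees a rejected trajectory. A prompt for which no candidate passes contributes nothing to the dataset.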
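## Application in OneReason
In the Unify stage of [[specialize-then-unify-rl|specialize-then-unify]], RSFT is used to:
- sample thinking trajectories from each single-domain specialist model
- filter for reasoning patterns that are consistent across domains
- fine-tune a single model to obtain unified cross-domain reasoning ability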
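A rough sketch of the data-pooling side of this stage follows. The source does not say how "consistent across domains" is decided, so the `keep` predicate here is purely a placeholder assumption, as are the names `Trajectory`, `pool_specialist_trajectories`, and the toy specialists. The toy filter only checks for a shared `<think>...</think>` format.

```python
from typing import Callable, Dict, List, NamedTuple


class Trajectory(NamedTuple):
    domain: str
    prompt: str
    thinking: str


def pool_specialist_trajectories(
    specialists: Dict[str, Callable[[str], List[str]]],  # hypothetical: domain -> sampler
    prompts_by_domain: Dict[str, List[str]],
    keep: Callable[[Trajectory], bool],                  # hypothetical filter criterion
) -> List[Trajectory]:
    """Sample from every specialist and pool the trajectories that pass `keep`."""
    pooled: List[Trajectory] = []
    for domain, sampler in specialists.items():
        for prompt in prompts_by_domain.get(domain, []):
            for thinking in sampler(prompt):
                trajectory = Trajectory(domain, prompt, thinking)
                if keep(trajectory):
                    pooled.append(trajectory)
    return pooled


# Toy specialists that return canned "thinking" strings.
specialists = {
    "math": lambda p: [f"<think>compute {p}</think>", "no reasoning shown"],
    "code": lambda p: [f"<think>trace {p}</think>"],
}
prompts_by_domain = {"math": ["2+2"], "code": ["print(1)"]}


def uses_think_format(t: Trajectory) -> bool:
    # Placeholder for a real cross-domain consistency check.
    return t.thinking.startswith("<think>") and t.thinking.endswith("</think>")


pooled = pool_specialist_trajectories(specialists, prompts_by_domain, uses_think_format)
```

The pooled list mixes trajectories from all domains into one SFT set, which is what lets a single model be fine-tuned on it afterwards.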
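## Relationship to other methods
- vs [[multi-teacher-on-policy-distillation|MODPO]]: RSFT is an offline method (sample first, then fine-tune), whereas MODPO is an online method (distillation happens during training).
- vs plain SFT: what distinguishes RSFT is the closed loop of sampling plus filtering, which is meant to ensure the quality of the training data.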
## References
- [[specialize-then-unify-rl|Specialize-then-Unify RL]]
- [[multi-teacher-on-policy-distillation|MODPO]]
- [[onereason|OneReason]]