20260617: currently 914 pages
39
concepts/rejection-sampling-fine-tuning.md
Normal file
@@ -0,0 +1,39 @@
---
title: "Rejection Sampling Fine-tuning (RSFT)"
created: 2026-06-10
updated: 2026-06-10
type: concept
tags: [reinforcement-learning, fine-tuning, rejection-sampling]
sources: [raw/papers/onereason-team-onereason-2026.md]
---

# Rejection Sampling Fine-tuning (RSFT)

> A technique that performs supervised fine-tuning on high-quality model outputs obtained by sampling and filtering; OneReason uses it in the Unify stage of [[specialize-then-unify-rl|specialize-then-unify RL]].

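## Core Idea

The Rejection Sampling FT (Yuan et al., 2023) workflow:

1. **Sample**: draw a large number of reasoning trajectories from the current policy (or from a set of teacher models)
2. **Filter**: reject low-quality trajectories using a reward model or a verifier
3. **Fine-tune**: run SFT on only the trajectories that pass the filter
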
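The sample-then-filter loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure from Yuan et al. or OneReason: `rejection_sample`, `toy_policy`, `toy_verifier`, the exact-duplicate check, and the `threshold` parameter are all hypothetical names and choices introduced here, and the "policy" simply replays canned answers so the result is reproducible.

```python
from itertools import cycle
from typing import Callable, Dict, List


def rejection_sample(
    prompts: List[str],
    sample_fn: Callable[[str], str],
    score_fn: Callable[[str, str], float],
    k: int = 4,
    threshold: float = 0.5,
) -> List[Dict[str, str]]:
    """Steps 1 and 2: draw k candidates per prompt, keep those scoring >= threshold."""
    kept: List[Dict[str, str]] = []
    for prompt in prompts:
        seen = set()  # assumption: skip exact duplicates so one trajectory is not over-weighted
        for _ in range(k):
            response = sample_fn(prompt)
            if response in seen:
                continue
            seen.add(response)
            if score_fn(prompt, response) >= threshold:
                kept.append({"prompt": prompt, "response": response})
    return kept


# Toy stand-ins (hypothetical): a "policy" that replays canned candidates
# and a verifier that compares the answer with a reference.
CANDIDATES = {"2+3": ["5", "6", "5", "4"], "4*6": ["24", "20", "24", "24"]}
REFERENCE = {"2+3": "5", "4*6": "24"}
_streams = {prompt: cycle(answers) for prompt, answers in CANDIDATES.items()}


def toy_policy(prompt: str) -> str:
    return next(_streams[prompt])


def toy_verifier(prompt: str, response: str) -> float:
    return 1.0 if response == REFERENCE[prompt] else 0.0


sft_data = rejection_sample(list(CANDIDATES), toy_policy, toy_verifier, k=4)
print(sft_data)
# Step 3 (not shown): pass sft_data to an ordinary SFT training loop.
```

In practice `sample_fn` would wrap a call to the policy or teacher model and `score_fn` a reward model or verifier; only the surviving prompt/response pairs reach the fine-tuning step.
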
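## Application in OneReason

In the Unify stage of [[specialize-then-unify-rl|specialize-then-unify]], RSFT is used to:

- Sample thinking trajectories from each single-domain specialist model
- Filter for reasoning patterns that are consistent across domains
- Fine-tune to obtain a unified cross-domain reasoning capability
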
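One way to picture the Unify-stage data pipeline is to route each domain's prompts to its own specialist and pool whatever passes a shared filter. The sketch below is purely illustrative: this note does not say how OneReason judges cross-domain consistency, so `has_reasoning_block` is a placeholder predicate (it only checks for a `<think>...</think>` block), and `build_unify_dataset`, `SPECIALISTS`, and `PROMPTS` are hypothetical names with toy contents.

```python
from typing import Callable, Dict, List

Sampler = Callable[[str], str]


def build_unify_dataset(
    prompts_by_domain: Dict[str, List[str]],
    specialists: Dict[str, Sampler],
    accept: Callable[[str, str, str], bool],
    k: int = 1,
) -> List[Dict[str, str]]:
    """Sample k traces per prompt from the matching specialist and pool the accepted ones."""
    dataset: List[Dict[str, str]] = []
    for domain, prompts in prompts_by_domain.items():
        teacher = specialists[domain]
        for prompt in prompts:
            for _ in range(k):
                trace = teacher(prompt)
                if accept(domain, prompt, trace):
                    dataset.append(
                        {"domain": domain, "prompt": prompt, "response": trace}
                    )
    return dataset


# Toy specialists (hypothetical): each returns a fixed string.
SPECIALISTS: Dict[str, Sampler] = {
    "math": lambda p: f"<think>work through {p}</think> 5",
    "code": lambda p: f"<think>plan {p}</think> def add(a, b): return a + b",
    "logic": lambda p: "yes",  # no visible reasoning, so the toy filter rejects it
}
PROMPTS = {
    "math": ["2+3"],
    "code": ["write add"],
    "logic": ["is A implied by B?"],
}


def has_reasoning_block(domain: str, prompt: str, trace: str) -> bool:
    # Placeholder filter: the same format rule is applied to every domain.
    return trace.startswith("<think>") and "</think>" in trace


unify_data = build_unify_dataset(PROMPTS, SPECIALISTS, has_reasoning_block)
print([record["domain"] for record in unify_data])
```

The pooled records would then feed a single SFT run, so one model is trained on accepted traces from every domain.
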
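## Relationship to Other Methods

- vs [[multi-teacher-on-policy-distillation|MODPO]]: RSFT is an offline method (sample first, then fine-tune), whereas MODPO is an online method (distillation happens during training)
- vs plain SFT: the key to RSFT is its sample-then-filter loop, which controls the quality of the training data
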
## References

- [[specialize-then-unify-rl|Specialize-then-Unify RL]]
- [[multi-teacher-on-policy-distillation|MODPO]]
- [[onereason|OneReason]]