---
title: "DPO (Direct Preference Optimization)"
created: 2026-06-03
updated: 2026-06-03
type: concept
tags: [DPO, alignment, LLM, training]
status: placeholder
---

# DPO (Direct Preference Optimization)
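> ⚠️ Placeholder page: to be completed

DPO (Direct Preference Optimization) optimizes a policy directly on preference data by reparameterizing the reward function used in RLHF, so no explicit reward model needs to be trained. The reward is expressed implicitly as the scaled log-ratio between the policy and a frozen reference policy, which turns preference learning into a classification-style loss over chosen and rejected responses. It is a simplified alternative to RLHF.

In the discussion by [[zhang-reconciling-sft-interaction-2026|Zhang et al. (2026)]], alternative post-training paradigms such as RLHF and DPO are contrasted with SFT.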
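
As a rough illustration of the reward reparameterization described above, the following sketch computes the per-example DPO loss from summed sequence log-probabilities. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for illustration; they are not taken from Zhang et al. (2026) or from any particular library, and a real training setup would operate on batched tensors rather than floats.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed default, chosen only for illustration
) -> float:
    """Per-example DPO loss (hypothetical helper, not a library API).

    Each *_logp argument is the summed log-probability of a full response
    under either the trainable policy or the frozen reference policy.
    """
    # Implicit rewards: beta * log(pi_theta(y|x) / pi_ref(y|x)).
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)

    # The loss is -log(sigmoid(margin)), written as softplus(-margin).
    margin = chosen_reward - rejected_reward
    return softplus(-margin)


if __name__ == "__main__":
    # Policy identical to the reference: zero margin.
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy shifted toward the chosen response: smaller loss.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

When the policy and the reference assign the same log-probabilities, the margin is zero and the loss equals log 2. Shifting probability mass toward the chosen response lowers the loss, and shifting it toward the rejected response raises it. No separate reward model appears anywhere in the computation.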
## Related concepts
- [[rlhf]]
- [[supervised-fine-tuning|SFT]]