---
title: "DPO (Direct Preference Optimization)"
created: 2026-06-03
updated: 2026-06-03
type: concept
tags: [DPO, alignment, LLM, training]
status: placeholder
---
# DPO (Direct Preference Optimization)
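> ⚠️ Placeholder page: to be completed

DPO is a direct preference optimization method. It reparameterizes the reward function used in RLHF so that the policy can be optimized directly on preference data, with no need to train an explicit reward model. It is a simplified alternative to RLHF.

In the discussion by [[zhang-reconciling-sft-interaction-2026|Zhang et al. (2026)]], alternative post-training paradigms such as RLHF and DPO are contrasted with SFT.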
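
As a rough illustration of the reparameterization described above, the sketch below computes the commonly cited per-pair DPO loss from sequence log-probabilities. The implicit reward of a response is taken to be `beta` times the log-ratio between the policy and a frozen reference model, and the loss is the negative log-sigmoid of the reward margin between the preferred and the rejected response. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for this example and do not come from any particular library.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss (hypothetical helper, illustrative only).

    Each argument is the summed log-probability of a full response
    under the trainable policy or the frozen reference model.
    """
    # Implicit rewards: beta * log(pi_policy / pi_ref) for each response.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) == softplus(-margin), written in a
    # numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# When the policy equals the reference, the margin is zero and the
# loss equals log(2).
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Raising the policy's log-probability of the preferred response
# lowers the loss.
improved = dpo_loss(-4.0, -7.0, -5.0, -7.0)
```

No reward model appears anywhere in this computation: only log-probabilities from the policy and the reference are needed, which is the sense in which DPO simplifies RLHF. In practice the same expression would be evaluated over batches of tensors in an autodiff framework so that gradients flow into the policy.
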
## Related concepts
- [[rlhf]]
- [[supervised-fine-tuning|SFT]]