20260617: currently 914 pages
21
concepts/dpo.md
Normal file
@@ -0,0 +1,21 @@
---
title: "DPO (Direct Preference Optimization)"
created: 2026-06-03
updated: 2026-06-03
type: concept
tags: [DPO, alignment, LLM, training]
status: placeholder
---
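
# DPO (Direct Preference Optimization)

> ⚠️ Placeholder page: to be completed

DPO (Direct Preference Optimization) reparameterizes the RLHF reward function in terms of the policy itself, so the policy can be optimized directly on preference data without training an explicit reward model. It is a simpler alternative to RLHF.

In the discussion by [[zhang-reconciling-sft-interaction-2026|Zhang et al. (2026)]], alternative post-training paradigms such as RLHF and DPO are contrasted with SFT.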
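
The reparameterization described above can be illustrated with a minimal sketch of the per-pair DPO objective. It assumes the summed sequence log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses are already available from both the trainable policy and a frozen reference model. The function names `softplus` and `dpo_loss`, the default `beta=0.1`, and the numbers in the usage lines are illustrative assumptions, not values taken from this page or from any particular library.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single preference pair (hypothetical helper).

    Each argument is log pi(y | x) summed over the tokens of one response.
    """
    # Implicit rewards: beta * log(pi_theta(y|x) / pi_ref(y|x)).
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)


# Policy identical to the reference: zero margin, loss equals log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy raises the chosen response and lowers the rejected one: loss drops.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

No reward model appears anywhere in this computation: the reward is implied by the log-ratio between the policy and the reference, which is what makes DPO simpler than a full RLHF pipeline. In practice the same expression would be averaged over a batch and differentiated with an autodiff framework.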

## Related concepts

- [[rlhf]]
- [[supervised-fine-tuning|SFT]]