20260617: currently 914 pages

2026-06-17 15:02:40 +08:00
parent e96b955fda
commit 91fac5b6fc
423 changed files with 20687 additions and 34 deletions
--- a/concepts/dpo.md
+++ b/concepts/dpo.md
@@ -0,0 +1,21 @@
+---
+title: "DPO (Direct Preference Optimization)"
+created: 2026-06-03
+updated: 2026-06-03
+type: concept
+tags: [DPO, alignment, LLM, training]
+status: placeholder
+---
+
+# DPO (Direct Preference Optimization)
+
+> ⚠️ Placeholder page: to be completed
+
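+DPO is a direct preference optimization method: it reparameterizes the reward function used in RLHF so that the policy can be optimized directly on preference data, without explicitly training a reward model. It is a simplified alternative to RLHF.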
+
+In the discussion in [[zhang-reconciling-sft-interaction-2026|Zhang et al. (2026)]], alternative post-training paradigms such as RLHF and DPO are contrasted with SFT.
+
+## Related concepts
+
+- [[rlhf]]
+- [[supervised-fine-tuning|SFT]]
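
The reparameterization described in `concepts/dpo.md` can be illustrated with a minimal sketch of the per-pair DPO loss. This is not part of the commit. The function name `dpo_loss`, the helper `softplus`, the argument names, and the default `beta=0.1` are all assumptions chosen for illustration. The inputs are assumed to be summed token log-probabilities of the chosen and rejected responses under the trained policy and under a frozen reference model.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single (chosen, rejected) preference pair.

    All names and the default beta are hypothetical, for illustration only.
    """
    # Implicit reward of each response: beta * log(pi_theta / pi_ref).
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)

    # Bradley-Terry style margin between the two implicit rewards.
    margin = chosen_reward - rejected_reward

    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)
```

When the policy matches the reference, the margin is zero and the loss equals `log 2`. The loss falls below that value as the policy shifts probability toward the chosen response relative to the reference, so no separate reward model is needed in this sketch.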