20260625: lots of new content

2026-06-25 14:08:47 +08:00
parent 91fac5b6fc
commit 6021dea160
375 changed files with 19263 additions and 251 deletions
--- a/concepts/dpo-bias-mitigation.md
+++ b/concepts/dpo-bias-mitigation.md
@@ -0,0 +1,38 @@
+---
+title: "DPO Bias Mitigation"
+created: 2026-06-24
+updated: 2026-06-24
+type: concept
+tags: ["dpo", "bias-mitigation", "alignment", "preference-optimization"]
+sources:
+  - "[[personalization-trap-2025]]"
+---
+
+# DPO Bias Mitigation
+
+DPO Bias Mitigation is a strategy proposed by Fang et al. (2025) that uses [[dpo|Direct Preference Optimization]] to reduce the influence of user personas on LLM emotional reasoning.
+
+## Preference Dataset Construction
+
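+1. **Data source**: 5000 questions sampled from Tulu3, each randomly paired with a user persona
+2. **Candidate generation**: 5 responses per question (3 instructed to check and state that the persona is irrelevant + 2 controls)
+3. **LLM Judge scoring**: three dimensions
+   - Correctness: whether the response covers every key point of the ground truth
+   - Bias detection: whether persona details affect the final judgment
+   - Persona-irrelevance statement: whether the response states that the persona information is irrelevant
+4. **Preference pairs**: chosen = correct + unbiased + states irrelevance; rejected = incorrect + bias-balanced
+5. **Reward Model filtering**: keep only pairs where chosen scores positive and rejected scores negative with a sufficient margin (~20% retention rate)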
+
+## Results
+
+| Model | STEU Before | STEU After | MMLU Change | Bias Influence (Before→After) |
+|------|-----------|-----------|------|--------|
+| Gemma-2-2B | 59.50% | 63.70% | +6.7pp | 5.50%→-2.30% |
+| Qwen-3-1.7B | 60.90% | 60.30% | +6.8pp | 1.70%→0.40% |
+
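+Only 500 samples were used. Bias Influence reverses sign for Gemma (5.50%→-2.30%, so it no longer favors advantaged personas), and MMLU improves at the same time.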
+
+## References
+- [[personalization-trap-2025]]
+- [[persona-invariant-reasoning]]
+- [[dpo]]
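
The pair-selection and reward-model filtering steps (steps 4 and 5 of the preference dataset construction in the note above) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Candidate` fields, the function names, and the `margin=1.0` threshold are all hypothetical, and the "bias-balanced" sampling of rejected responses is not modeled here (rejected is simplified to "judged incorrect").

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Candidate:
    """One generated response with hypothetical judge and reward-model outputs."""
    text: str
    correct: bool               # judge: covers all ground-truth key points
    biased: bool                # judge: persona details affected the final judgment
    declares_irrelevant: bool   # judge: states that the persona is irrelevant
    reward: float               # assumed scalar score from a reward model


def build_pairs(candidates):
    """Pair every 'chosen' candidate with every 'rejected' candidate."""
    chosen = [c for c in candidates
              if c.correct and not c.biased and c.declares_irrelevant]
    # Simplification: rejected = incorrect; bias balancing is not modeled.
    rejected = [c for c in candidates if not c.correct]
    return list(product(chosen, rejected))


def filter_by_reward(pairs, margin=1.0):
    """Keep pairs with chosen > 0, rejected < 0, and a gap of at least `margin`."""
    return [(c, r) for c, r in pairs
            if c.reward > 0 and r.reward < 0 and c.reward - r.reward >= margin]


# Five candidates for one question, mirroring the 3 instructed + 2 control split.
candidates = [
    Candidate("A", correct=True, biased=False, declares_irrelevant=True, reward=1.2),
    Candidate("B", correct=True, biased=False, declares_irrelevant=True, reward=0.1),
    Candidate("C", correct=False, biased=True, declares_irrelevant=True, reward=-0.8),
    Candidate("D", correct=True, biased=True, declares_irrelevant=False, reward=0.5),
    Candidate("E", correct=False, biased=False, declares_irrelevant=False, reward=-0.05),
]

pairs = build_pairs(candidates)
kept = filter_by_reward(pairs, margin=1.0)
print(len(pairs), "candidate pairs,", len(kept), "kept after filtering")
```

In this toy example, candidate D is correct but biased, so it lands in neither set. The margin filter then drops the pairs whose chosen response is only weakly preferred, which is the kind of pruning that would produce a low retention rate such as the ~20% reported in the note.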