---
title: "DPO Bias Mitigation"
created: 2026-06-24
updated: 2026-06-24
type: concept
tags: ["dpo", "bias-mitigation", "alignment", "preference-optimization"]
sources:
- "[[personalization-trap-2025]]"
---
# DPO Bias Mitigation
DPO Bias Mitigation is a strategy proposed by Fang et al. (2025) that uses [[dpo|Direct Preference Optimization]] to reduce the influence of user personas on LLM emotional reasoning.
## Building the preference dataset
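1. **Data source**: sample 5,000 questions from Tulu3 and pair each one with a randomly chosen user persona.
2. **Candidate generation**: generate 5 responses per question (3 instructed to check the persona and state that it is irrelevant, plus 2 controls).
3. **LLM judge scoring**: three dimensions
   - Correctness: whether the response covers every point in the ground truth
   - Bias detection: whether persona details influenced the final judgment
   - Persona-irrelevance statement: whether the response states that the persona information is irrelevant
4. **Preference pairs**: chosen = correct + unbiased + states irrelevance; rejected = incorrect + biased; balanced.
5. **Reward model filtering**: keep only pairs where the chosen response scores positive, the rejected response scores negative, and the margin between them is large enough (~20% retention rate).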
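
Steps 4 and 5 above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the `Candidate` fields, the helper names `build_pairs` and `filter_pairs`, the `min_margin` value, and the toy scores are all hypothetical, and the rejected rule is read here as "incorrect and biased".

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Candidate:
    text: str
    correct: bool              # judge: covers all ground-truth points
    biased: bool               # judge: persona details influenced the verdict
    declares_irrelevant: bool  # judge: states that the persona is irrelevant
    reward: float              # hypothetical reward-model score


def build_pairs(candidates):
    """Pair every 'chosen' candidate with every 'rejected' one (step 4)."""
    chosen = [c for c in candidates
              if c.correct and not c.biased and c.declares_irrelevant]
    # Assumption: a rejected response is both incorrect and biased.
    rejected = [c for c in candidates if not c.correct and c.biased]
    return list(product(chosen, rejected))


def filter_pairs(pairs, min_margin):
    """Keep pairs with chosen > 0, rejected < 0, and a wide enough margin (step 5)."""
    return [(c, r) for c, r in pairs
            if c.reward > 0 and r.reward < 0
            and c.reward - r.reward >= min_margin]


# Toy data for one question; scores are made up for illustration.
candidates = [
    Candidate("a", True, False, True, 2.0),
    Candidate("b", True, False, True, 0.3),
    Candidate("c", True, False, False, 1.0),    # correct but never states irrelevance
    Candidate("d", False, True, False, -1.5),
    Candidate("e", False, True, False, 0.2),    # rejected, but reward is not negative
]
pairs = build_pairs(candidates)
kept = filter_pairs(pairs, min_margin=2.0)
print([(c.text, r.text) for c, r in kept])
```

With this toy data only the pair with the widest margin survives, which mirrors how a strict reward-model filter can discard most candidate pairs.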
## Results
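| Model | STEU before | STEU after | MMLU change | Bias Influence (before → after) |
|------|-----------|-----------|------|--------|
| Gemma-2-2B | 59.50% | 63.70% | +6.7pp | 5.50% → -2.30% |
| Qwen-3-1.7B | 60.90% | 60.30% | +6.8pp | 1.70% → 0.40% |

Only 500 samples were used. Bias Influence reverses sign for Gemma, which no longer favors advantaged personas, and MMLU improves at the same time.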
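
The before/after changes implied by the table can be computed directly. The snippet below is only arithmetic on the figures above; the dictionary layout and the `delta_pp` helper are illustrative names, not part of the original work.

```python
# Before/after values copied from the results table (percent).
results = {
    "Gemma-2-2B": {"steu": (59.50, 63.70), "bias": (5.50, -2.30)},
    "Qwen-3-1.7B": {"steu": (60.90, 60.30), "bias": (1.70, 0.40)},
}


def delta_pp(before, after):
    """Change in percentage points, rounded to avoid float noise."""
    return round(after - before, 2)


deltas = {
    model: {metric: delta_pp(*pair) for metric, pair in metrics.items()}
    for model, metrics in results.items()
}

for model, d in deltas.items():
    print(f"{model}: STEU {d['steu']:+.2f} pp, Bias Influence {d['bias']:+.2f} pp")
```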
## References
- [[personalization-trap-2025]]
- [[persona-invariant-reasoning]]
- [[dpo]]