20260625: lots of new content
38
concepts/dpo-bias-mitigation.md
Normal file
@@ -0,0 +1,38 @@
---
title: "DPO Bias Mitigation"
created: 2026-06-24
updated: 2026-06-24
type: concept
tags: ["dpo", "bias-mitigation", "alignment", "preference-optimization"]
sources:
  - "[[personalization-trap-2025]]"
---

# DPO Bias Mitigation

DPO Bias Mitigation is a strategy proposed by Fang et al. (2025) that uses [[dpo|Direct Preference Optimization]] to reduce the influence of user profiles on LLM emotional reasoning.

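For reference, the preference pairs described in this note would feed the standard DPO objective. This is the generic DPO loss, not a formula taken from the source; the value of $\beta$ and the choice of reference policy $\pi_{\mathrm{ref}}$ are not stated in this note. Here $x$ is the prompt (question plus user profile), $y_w$ the chosen response, and $y_l$ the rejected response.

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$
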
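## Preference Dataset Construction

1. **Data source**: sample 5000 questions from Tulu3 and pair each with a randomly chosen user profile
2. **Candidate generation**: generate 5 responses per question (3 instructed to check the profile and declare it irrelevant, plus 2 controls)
3. **LLM Judge scoring**: three dimensions
   - Correctness: whether the response covers all key points of the ground truth
   - Bias detection: whether profile details influenced the final judgment
   - Profile-irrelevance declaration: whether the response states that the profile information is irrelevant
4. **Preference pairs**: chosen = correct + unbiased + declares irrelevance; rejected = incorrect, balanced with respect to bias
5. **Reward Model filtering**: keep only pairs where the chosen response scores positive, the rejected response scores negative, and the margin between them is large enough (~20% of pairs retained)
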
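Steps 3 to 5 above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `Candidate` fields, the `build_pairs` helper, the all-against-all pairing, and the `min_margin` threshold are all hypothetical, and the bias balancing of rejected responses is not modeled.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Candidate:
    """One judged response to a (question, user profile) prompt (hypothetical schema)."""
    text: str
    correct: bool              # judge: covers all ground-truth points
    biased: bool               # judge: profile details influenced the final judgment
    declares_irrelevant: bool  # judge: states that the profile is irrelevant
    reward: float              # assumed reward-model score, positive = good


def build_pairs(candidates, min_margin=1.0):
    """Return (chosen, rejected) pairs that pass the reward-model filter.

    min_margin is an assumed threshold; the note only says "enough margin".
    """
    chosen_pool = [c for c in candidates
                   if c.correct and not c.biased and c.declares_irrelevant]
    rejected_pool = [c for c in candidates if not c.correct]

    pairs = []
    for chosen, rejected in product(chosen_pool, rejected_pool):
        # Keep a pair only if chosen scores positive, rejected scores
        # negative, and the gap between them is at least min_margin.
        if (chosen.reward > 0 and rejected.reward < 0
                and chosen.reward - rejected.reward >= min_margin):
            pairs.append((chosen, rejected))
    return pairs


# Five candidates for one prompt, with made-up judge labels and scores.
candidates = [
    Candidate("A", correct=True, biased=False, declares_irrelevant=True, reward=2.1),
    Candidate("B", correct=True, biased=True, declares_irrelevant=False, reward=0.4),
    Candidate("C", correct=False, biased=True, declares_irrelevant=False, reward=-1.3),
    Candidate("D", correct=False, biased=False, declares_irrelevant=False, reward=-0.2),
    Candidate("E", correct=True, biased=False, declares_irrelevant=True, reward=0.3),
]
pairs = build_pairs(candidates, min_margin=1.0)
print([(c.text, r.text) for c, r in pairs])
```

With these made-up scores, the pair (E, D) is dropped because its margin of 0.5 is below the threshold, while candidate B never enters either pool because it is correct but biased.
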
## Results

| Model | STEU Before | STEU After | MMLU Change | Bias Influence (before → after) |
|------|-----------|-----------|------|--------|
| Gemma-2-2B | 59.50% | 63.70% | +6.7 pp | 5.50% → -2.30% |
| Qwen-3-1.7B | 60.90% | 60.30% | +6.8 pp | 1.70% → 0.40% |

Only 500 samples were used. Bias Influence flips sign for Gemma (it no longer favors advantaged profiles) and shrinks toward zero for Qwen, while MMLU improves for both models.

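The before/after columns can be turned into percentage-point changes with a few lines of arithmetic. The sketch below only recomputes differences from the table values; the dictionary layout and the `pp_change` helper are illustrative names, not part of the source.

```python
# Table values as (before, after) percentages.
results = {
    "Gemma-2-2B": {"steu": (59.50, 63.70), "bias_influence": (5.50, -2.30)},
    "Qwen-3-1.7B": {"steu": (60.90, 60.30), "bias_influence": (1.70, 0.40)},
}


def pp_change(before, after):
    """Percentage-point change, rounded to avoid floating-point noise."""
    return round(after - before, 2)


changes = {
    model: {metric: pp_change(*values) for metric, values in metrics.items()}
    for model, metrics in results.items()
}

for model, metric_changes in changes.items():
    print(model, metric_changes)
```

Read this way, Gemma-2-2B gains 4.2 pp on STEU while Qwen-3-1.7B loses 0.6 pp, and Bias Influence drops by 7.8 pp and 1.3 pp respectively.
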
## References

- [[personalization-trap-2025]]
- [[persona-invariant-reasoning]]
- [[dpo]]