---
title: "Difficulty-Aware Group Policy Optimization (DGPO)"
created: 2026-05-12
updated: 2026-05-12
type: concept
tags: ["grpo", "difficulty-aware", "reinforcement-learning", "policy-optimization"]
sources: ["arxiv:2601.20614"]
---
# Difficulty-Aware Group Policy Optimization (DGPO)
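**DGPO** is the algorithmic component of the [[mathforge|MathForge]] framework. It addresses the [[update-magnitude-imbalance|difficulty imbalance problem]] of [[grpo|GRPO]] with a two-step strategy.
## Optimization Objective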
$$J_{DGPO}(\theta) = \mathbb{E} \frac{1}{\sum_{s=1}^{B_v} \sum_{i=1}^{G} |o_{si}|} \sum_{s=1}^{B_v} \lambda_s \sum_{i=1}^{G} \sum_{t=1}^{|o_{si}|} \min(I_{sit}A_{DG,si}, \text{clip}(...))$$
## Two-Step Strategy: Balance-then-Reweight
### Step 1: [[dgae|DGAE]] (Balance)
**MAD (mean absolute deviation)** replaces std as the denominator for advantage normalization:
$$\hat{A}_{DG,i} = \frac{r_i - \text{mean}(\{r_i\})}{\text{MAD}(\{r_i\})}$$
**Effect**: the total update magnitude is always G, independent of the accuracy p (Theorem 2).
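
The normalization above can be checked numerically. The following is a minimal sketch, not the paper's implementation: it assumes binary 0/1 rewards, takes "total update magnitude" to mean the sum of absolute advantages over one group, and uses hypothetical function names (`dgae_advantages`, `total_magnitude`). Returning zeros when MAD is 0 is also an assumption for the degenerate all-correct or all-wrong case.

```python
from statistics import fmean


def dgae_advantages(rewards):
    """MAD-normalized advantages for one question's group of G rewards."""
    mean = fmean(rewards)
    # Mean absolute deviation around the group mean.
    mad = fmean([abs(r - mean) for r in rewards])
    if mad == 0:
        # Assumption: identical rewards carry no signal, so return zeros.
        return [0.0 for _ in rewards]
    return [(r - mean) / mad for r in rewards]


def total_magnitude(advantages):
    """Sum of absolute advantages over the group."""
    return sum(abs(a) for a in advantages)


G = 8
for num_correct in (1, 4, 7):
    rewards = [1.0] * num_correct + [0.0] * (G - num_correct)
    advantages = dgae_advantages(rewards)
    # The total stays at G (up to floating-point rounding) for every accuracy.
    print(num_correct / G, round(total_magnitude(advantages), 6))
```

The result follows directly from the definition: the sum of |r_i - mean| over the group equals G times MAD, so dividing by MAD always leaves G.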
### Step 2: [[dqw|DQW]] (Reweight)
Temperature-scaled softmax weighting explicitly prioritizes harder questions:
$$\lambda_s = B_v \cdot \frac{\exp(D_s/T)}{\sum \exp(D_s/T)}, \quad D_s = -\text{mean}(\{r_{si}\})$$
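
As an illustration of the weighting formula, here is a small sketch in plain Python. The function name `dqw_weights`, the example mean rewards, and the temperature value 0.5 are assumptions chosen for the example, not values from the paper.

```python
import math


def dqw_weights(mean_rewards, temperature=1.0):
    """lambda_s = B_v * softmax(D_s / T), with D_s = -(mean reward of question s)."""
    scaled = [-m / temperature for m in mean_rewards]
    # Subtracting the max before exponentiating improves numerical stability
    # and leaves the softmax result unchanged.
    shift = max(scaled)
    exps = [math.exp(x - shift) for x in scaled]
    normalizer = sum(exps)
    b_v = len(mean_rewards)
    return [b_v * e / normalizer for e in exps]


# Three questions, ordered from hardest (lowest mean reward) to easiest.
weights = dqw_weights([0.125, 0.5, 0.875], temperature=0.5)
print([round(w, 3) for w in weights])
```

Because of the B_v factor, the weights sum to B_v, so the average weight per question stays at 1. A lower temperature concentrates more weight on the hardest questions, and a very high temperature approaches uniform weighting.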
**Key point**: Balance-then-reweight offers better interpretability and controllability than direct advantage reweighting (e.g., GRPO-AD).
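## Key Differences from GRPO
| Component | GRPO | DGPO |
|------|------|------|
| Advantage estimation | std normalization | **MAD normalization** |
| Difficulty handling | Implicit imbalance (peaks at p=0.5) | **Explicitly prioritizes hard questions** |
| Question weights | Uniform | **Softmax difficulty weighting** |
| Valid queries | All | **Valid questions only (neither all-correct nor all-wrong)** |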
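
The first and last rows of the table can be illustrated with a short sketch. It assumes binary 0/1 rewards, a population standard deviation with no epsilon term for the GRPO side, and the sum of absolute advantages as the measure of update magnitude. The helper names `total_update_magnitude` and `is_valid_question` are hypothetical.

```python
from statistics import fmean, pstdev


def total_update_magnitude(rewards, normalizer):
    """Sum of |advantage| over one group, using 'std' or 'mad' normalization."""
    mean = fmean(rewards)
    deviations = [abs(r - mean) for r in rewards]
    if normalizer == "std":
        scale = pstdev(rewards)   # GRPO-style (assumed population std)
    else:
        scale = fmean(deviations)  # DGPO-style mean absolute deviation
    return sum(deviations) / scale


def is_valid_question(rewards):
    """A question is valid when its rollouts are neither all-correct nor all-wrong."""
    return min(rewards) != max(rewards)


G = 8
for num_correct in (1, 4, 7):
    rewards = [1.0] * num_correct + [0.0] * (G - num_correct)
    if is_valid_question(rewards):
        std_total = total_update_magnitude(rewards, "std")
        mad_total = total_update_magnitude(rewards, "mad")
        print(num_correct / G, round(std_total, 3), round(mad_total, 3))
```

Under these assumptions the std-normalized total works out to 2G·sqrt(p(1-p)), which is largest at p=0.5 and shrinks for very easy or very hard questions, while the MAD-normalized total stays at G.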
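## Combining DGPO with Other Methods
DGPO can be combined with methods such as GP6, DAPO, and GSPO; see Appendix G of the paper for details. In these combinations, the DQW difficulty score D_s is computed from the accuracy reward only, excluding auxiliary signals such as a length penalty.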
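
A minimal sketch of computing D_s from the accuracy reward alone is shown below. The rollout record layout and the field names `accuracy` and `length_penalty` are hypothetical, chosen only to show that auxiliary terms are left out of the difficulty score.

```python
from statistics import fmean


def difficulty_score(rollouts):
    """D_s = -(mean accuracy reward); auxiliary reward terms are ignored on purpose."""
    return -fmean([r["accuracy"] for r in rollouts])


# Hypothetical rollouts for one question: two correct, two incorrect.
rollouts = [
    {"accuracy": 1.0, "length_penalty": -0.2},
    {"accuracy": 0.0, "length_penalty": 0.0},
    {"accuracy": 0.0, "length_penalty": -0.5},
    {"accuracy": 1.0, "length_penalty": 0.0},
]
d_s = difficulty_score(rollouts)
print(d_s)
```

Keeping the difficulty score tied to accuracy means a question is not treated as harder merely because its responses were penalized for length.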
## Related Concepts
- [[dgae|DGAE]]: difficulty-balanced advantage estimation
- [[dqw|DQW]]: difficulty-aware question-level weighting
- [[grpo]]: baseline method
- [[mathforge]]: full framework
- [[dai-mathforge-2026|Paper page]]