20260514: Add new content
48
concepts/dgae.md
Normal file
@@ -0,0 +1,48 @@
---
title: "Difficulty-Balanced Group Advantage Estimation (DGAE)"
created: 2026-05-12
updated: 2026-05-12
type: concept
tags: ["grpo", "advantage-estimation", "reinforcement-learning"]
sources: ["arxiv:2601.20614"]
---

# Difficulty-Balanced Group Advantage Estimation (DGAE)

**DGAE** is one of the core techniques of [[dgpo|DGPO]]. It replaces the std denominator in GRPO's advantage estimate with the MAD (mean absolute deviation), which yields a **difficulty-balanced** update magnitude.

## Formula comparison

**GRPO (GRAE)**:
$$\hat{A}_{GR,i} = \frac{r_i - \text{mean}(\{r_i\})}{\text{std}(\{r_i\})}$$

**DGAE**:
$$\hat{A}_{DG,i} = \frac{r_i - \text{mean}(\{r_i\})}{\text{MAD}(\{r_i\})}, \quad \text{MAD}(\{r_i\}) = \frac{1}{G}\sum_{j=1}^{G}|r_j - \text{mean}(\{r_i\})|$$

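The two estimators differ only in the denominator, which a minimal NumPy sketch can make concrete. The function names, the `eps` guard against a zero denominator, and the use of the population std (`ddof=0`) are assumptions made for illustration; they do not come from the source.

```python
import numpy as np

def grpo_advantage(rewards, eps=1e-8):
    """GRPO-style advantage: center by the group mean, divide by the group std."""
    r = np.asarray(rewards, dtype=float)
    # Assumption: population std (ddof=0) and a small eps to avoid division by zero.
    return (r - r.mean()) / (r.std() + eps)

def dgae_advantage(rewards, eps=1e-8):
    """DGAE advantage: same centering, but divide by the mean absolute deviation."""
    r = np.asarray(rewards, dtype=float)
    centered = r - r.mean()
    mad = np.abs(centered).mean()
    return centered / (mad + eps)

# Hypothetical group of G = 8 rollouts on a hard question (accuracy p = 1/8).
rewards = [1, 0, 0, 0, 0, 0, 0, 0]
a_gr = grpo_advantage(rewards)
a_dg = dgae_advantage(rewards)

print("GRPO advantages:", np.round(a_gr, 4), "sum |A| =", round(float(np.abs(a_gr).sum()), 4))
print("DGAE advantages:", np.round(a_dg, 4), "sum |A| =", round(float(np.abs(a_dg).sum()), 4))
```

In this example the DGAE absolute advantages sum to the group size (up to the `eps` perturbation), while the GRPO sum is smaller because the question is far from $p = 0.5$.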
## Key theorem

**Theorem 2**: With DGAE, the total update magnitude for a single question (without clipping) is always:

$$\sum_{i=1}^{G} |\hat{A}_{DG,i}| = G$$

This holds regardless of the reward distribution: whatever the accuracy $p$, the update magnitude is constant.

**Compare Theorem 1** (GRPO): the total update magnitude is $\propto 2G\sqrt{p(1-p)}$, which peaks at $p = 0.5$.

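Both statements can be checked numerically for binary rewards. The sketch below sweeps the number of correct responses in a group of $G = 16$ and compares the std-normalized and MAD-normalized totals; the helper name `total_update_magnitude` and the use of the population std are assumptions made for illustration.

```python
import numpy as np

def total_update_magnitude(rewards, normalizer):
    """Sum of |advantage| over one group, normalizing by 'std' or 'mad'."""
    r = np.asarray(rewards, dtype=float)
    centered = r - r.mean()
    # Assumption: population std (ddof=0); MAD is the mean absolute deviation.
    scale = r.std() if normalizer == "std" else np.abs(centered).mean()
    return float(np.abs(centered / scale).sum())

G = 16
results = {}
for num_correct in (1, 4, 8, 12, 15):
    rewards = [1.0] * num_correct + [0.0] * (G - num_correct)
    p = num_correct / G
    results[p] = {
        "std_total": total_update_magnitude(rewards, "std"),
        "mad_total": total_update_magnitude(rewards, "mad"),
        "predicted_std_total": 2 * G * float(np.sqrt(p * (1 - p))),
    }
    print(p, results[p])
```

Under these assumptions the std-normalized total matches $2G\sqrt{p(1-p)}$ and is largest at $p = 0.5$, while the MAD-normalized total stays at $G$ for every $p$.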
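## Why is MAD better than std?

- **std** introduces a $\sqrt{p(1-p)}$ factor → the update magnitude depends on accuracy → [[update-magnitude-imbalance|difficulty imbalance]]
- **MAD = 2p(1-p)** for binary rewards → the summed absolute deviations equal $G \cdot \text{MAD}$, so dividing by MAD cancels the $p(1-p)$ factor exactly → difficulty balance
- MAD is linear in the absolute deviations (whereas std takes a square root), so the total update magnitude after normalization is constant
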
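For binary rewards the cancellation can be written out explicitly. The following is a short worked derivation, assuming a group of $G$ responses of which a fraction $p$ are correct, and population statistics over the group:

$$\sum_{i=1}^{G} |r_i - p| = Gp(1-p) + G(1-p)p = 2Gp(1-p)$$

$$\text{MAD}(\{r_i\}) = 2p(1-p), \qquad \text{std}(\{r_i\}) = \sqrt{p(1-p)}$$

$$\sum_{i=1}^{G} |\hat{A}_{DG,i}| = \frac{2Gp(1-p)}{2p(1-p)} = G, \qquad \sum_{i=1}^{G} |\hat{A}_{GR,i}| = \frac{2Gp(1-p)}{\sqrt{p(1-p)}} = 2G\sqrt{p(1-p)}$$

The MAD denominator has the same $p(1-p)$ dependence as the numerator, so it cancels; the std denominator only removes a square root of it.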
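## Generality

Theorem 2 **does not require binary rewards** ($r_i \in \{0,1\}$) and applies to any reward function, provided the rewards within a group are not all identical (otherwise MAD is zero and the advantage is undefined). DGAE can therefore be used in a broader range of RLVR settings, such as composite rewards with a length penalty.
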
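A quick check with non-binary rewards illustrates this. The composite reward below (correctness minus a small per-token length penalty) and all the numbers in it are hypothetical, chosen only to produce continuous-valued rewards; they are not taken from the source.

```python
import numpy as np

# Hypothetical group of G = 8 responses.
correct = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=float)   # 0/1 correctness
lengths = np.array([120, 480, 300, 95, 210, 350, 400, 60])  # response lengths in tokens
rewards = correct - 0.001 * lengths  # assumed composite reward with a length penalty

G = len(rewards)
centered = rewards - rewards.mean()
mad = np.abs(centered).mean()        # mean absolute deviation of the group
advantages = centered / mad          # DGAE advantages (no eps: MAD > 0 here)
total = float(np.abs(advantages).sum())

print("rewards:", np.round(rewards, 3))
print("sum |A_DG| =", total)
```

The total equals $G$ up to floating-point error, because the sum of absolute deviations is $G \cdot \text{MAD}$ by definition, whatever values the rewards take.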
## Related concepts

- [[dqw|DQW]]: the second step, difficulty weighting
- [[dgpo|DGPO]]: the overall algorithm
- [[update-magnitude-imbalance]]: the problem being solved
- [[grpo]]: the baseline method
- [[dai-mathforge-2026|Paper page]]