---
title: "Group Relative Policy Optimization (GRPO)"
created: 2025-04-15
updated: 2026-05-12
type: concept
tags: ["reinforcement-learning", "llm-training", "policy-optimization"]
sources: ["arxiv:2402.03300"]
---
# Group Relative Policy Optimization (GRPO)
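**GRPO** is a variant of PPO, introduced in DeepSeekMath and used extensively in DeepSeek-R1. Its core innovation is **eliminating the critic model**: advantages are estimated relative to the other responses sampled for the same question.
## Core Formula
For the G responses o_1, ..., o_G sampled for a question q, the GRPO objective is: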
$$\max_{\pi_\theta} \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min(I_{it}(\theta)\hat{A}_{GR,i}, \text{clip}(I_{it}(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_{GR,i})$$
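where $I_{it}(\theta)$ is the token-level importance ratio between the current policy and the policy that sampled the responses, and the group relative advantage estimate (GRAE) is: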
$$\hat{A}_{GR,i} = \frac{r_i - \text{mean}(\{r_j\}_{j=1}^G)}{\text{std}(\{r_j\}_{j=1}^G)}$$
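
The two formulas above can be sketched in NumPy as follows. This is a minimal illustration, not a training implementation: the function names, the `eps` stabilizer, the default `clip_eps=0.2`, and the use of the population standard deviation are assumptions that the formulas themselves do not fix.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same question."""
    r = np.asarray(rewards, dtype=np.float64)
    # Assumption: population std (NumPy default, ddof=0) plus a small eps
    # so that a group with identical rewards does not divide by zero.
    return (r - r.mean()) / (r.std() + eps)

def grpo_objective(ratios, advantages, clip_eps=0.2):
    """Clipped surrogate: mean over tokens of each response, then mean over the group.

    ratios[i][t] plays the role of I_it(theta); advantages[i] is shared by
    every token of response i.
    """
    per_response = []
    for ratio, adv in zip(ratios, advantages):
        ratio = np.asarray(ratio, dtype=np.float64)
        unclipped = ratio * adv
        clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
        per_response.append(np.minimum(unclipped, clipped).mean())  # 1/|o_i| sum over t
    return float(np.mean(per_response))  # 1/G sum over i

# Hypothetical group of G = 4 responses with binary rewards and different lengths.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_relative_advantages(rewards)
ratios = [np.ones(3), np.ones(5), np.ones(2), np.ones(4)]  # on-policy: every ratio is 1
objective = grpo_objective(ratios, adv)
```

With all ratios equal to 1 the objective reduces to the mean advantage, which is zero by construction; the gradient, not the value, is what drives the update.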
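## Key Properties
- **No critic required**: comparing responses to the same question within a group avoids training a separate value-function model
- **Compatible with binary rewards**: works naturally with rule-based verifiers (e.g., correct/incorrect for math answers)
- **GRPO variants**: GP6, DAPO, and others remove the KL divergence term and adopt a token-level loss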
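
To make the last bullet concrete, the sketch below contrasts the per-response averaging used in the objective above (the 1/|o_i| factor) with a token-level average. It assumes that "token-level loss" means pooling all tokens in the group and averaging once; the function names and the toy values are illustrative only.

```python
import numpy as np

def sample_level_mean(token_terms):
    """Average within each response first, then across responses."""
    return float(np.mean([np.mean(t) for t in token_terms]))

def token_level_mean(token_terms):
    """Pool every token in the group and average once."""
    return float(np.concatenate(token_terms).mean())

# Hypothetical per-token surrogate terms: a short response with positive
# advantage and a long response with negative advantage.
token_terms = [np.full(2, 1.0), np.full(8, -1.0)]

sample_level = sample_level_mean(token_terms)  # each response weighs the same
token_level = token_level_mean(token_terms)    # each token weighs the same
```

Under per-response averaging the two responses cancel out, while under token-level averaging the longer response dominates because it contributes more tokens.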
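## Known Limitations
GRPO has an [[update-magnitude-imbalance|implicit difficulty imbalance]]: the update magnitude peaks at p=0.5 and is suppressed for both hard and easy questions. [[dgpo|DGPO]] addresses this with DGAE.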
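
The imbalance can be checked numerically under two assumptions that this page does not spell out: rewards are binary, with p the fraction of correct responses in a group, and "update magnitude" is measured as the sum of absolute advantages in that group. Under those assumptions the sum works out to 2G·sqrt(p(1-p)), which is largest at p=0.5. The linked page may use a different definition, so treat this as a sketch.

```python
import numpy as np

def total_abs_advantage(num_correct, group_size):
    """Sum of |advantage| over a group with binary rewards."""
    rewards = np.zeros(group_size)
    rewards[:num_correct] = 1.0
    std = rewards.std()  # population std, an assumption
    if std == 0.0:
        # All responses agree; the advantage is treated as zero here.
        return 0.0
    adv = (rewards - rewards.mean()) / std
    return float(np.abs(adv).sum())

G = 8  # hypothetical group size
magnitudes = {k / G: total_abs_advantage(k, G) for k in range(G + 1)}
peak_p = max(magnitudes, key=magnitudes.get)
```

Questions that are nearly always solved or nearly always failed yield a smaller total than those near p=0.5, and groups where every response agrees contribute nothing at all.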
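## Related Concepts
- [[dgpo|DGPO]]: a difficulty-aware improvement on GRPO
- [[dgae|DGAE]]: difficulty-balanced advantage estimation
- [[rlvr-unified-framework]]: the RLVR training paradigm
- [[dai-mathforge-2026|MathForge]]: a difficulty-aware framework for mathematical reasoning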