20260514: Add new content
39
concepts/grpo.md
Normal file
@@ -0,0 +1,39 @@
---
title: "Group Relative Policy Optimization (GRPO)"
created: 2025-04-15
updated: 2026-05-12
type: concept
tags: ["reinforcement-learning", "llm-training", "policy-optimization"]
sources: ["arxiv:2402.03300"]
---

# Group Relative Policy Optimization (GRPO)
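
**GRPO** is a variant of PPO, introduced in DeepSeekMath and later adopted by DeepSeek-R1. Its core innovation is **eliminating the critic model**: advantages are instead estimated relative to one another within a group of responses sampled for the same question.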
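
## Core Formula

For the G responses sampled for question q, the GRPO objective is:

$$\max_{\pi_\theta} \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\left(I_{it}(\theta)\hat{A}_{GR,i},\ \text{clip}(I_{it}(\theta), 1-\epsilon, 1+\epsilon)\,\hat{A}_{GR,i}\right)$$

where $o_i$ is the $i$-th response, $r_i$ is its reward, $I_{it}(\theta)$ is the importance ratio between the current and old policies for token $t$ of response $i$ (as in PPO), $\epsilon$ is the clipping range, and the group relative advantage estimate (GRAE) is:

$$\hat{A}_{GR,i} = \frac{r_i - \text{mean}(\{r_j\}_{j=1}^G)}{\text{std}(\{r_j\}_{j=1}^G)}$$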
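
The two formulas above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not a training implementation: the function names, the `eps` guard against a zero standard deviation, the default `clip_eps=0.2`, and the use of the population standard deviation (`ddof=0`) are assumptions that do not come from the note.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """GRAE for one group: (r_i - mean) / std over the G rewards.

    eps is an added safeguard (not part of the formula) for groups whose
    rewards are all identical, where std would be zero.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)


def grpo_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate, averaged over tokens and then over responses.

    logp_new / logp_old: one sequence of per-token log-probabilities per
    response, under the current and the old policy respectively.
    """
    per_response = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Importance ratio I_it(theta) for every token of this response.
        ratio = np.exp(np.asarray(lp_new) - np.asarray(lp_old))
        clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        per_token = np.minimum(ratio * adv, clipped * adv)
        per_response.append(per_token.mean())  # (1 / |o_i|) * sum over t
    return float(np.mean(per_response))        # (1 / G) * sum over i
```

Every token of a response shares that response's advantage, so the only per-token quantity is the importance ratio.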
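
## Key Properties

- **No critic required**: comparing the responses to the same question within a group avoids training a separate value-function model
- **Compatible with binary rewards**: works naturally with rule-based verifiers (e.g., correct/incorrect for math answers)
- **GRPO variants**: GP6, DAPO, and others drop the KL-divergence term and use a token-level loss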
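
To illustrate the first two properties, the sketch below scores a group of responses with a rule-based check and uses the group mean as the baseline, so no value model is involved. The verifier `exact_match_reward` is hypothetical (a plain string comparison standing in for a real answer checker), and returning zero advantages for a group with identical rewards is an assumption, not something the note specifies.

```python
import numpy as np


def exact_match_reward(response: str, reference: str) -> float:
    """Hypothetical rule-based verifier: 1.0 if the answer matches, else 0.0."""
    return 1.0 if response.strip() == reference.strip() else 0.0


def binary_group_advantages(responses, reference):
    """Binary rewards for one group and their group-relative advantages."""
    rewards = np.array([exact_match_reward(r, reference) for r in responses])
    std = rewards.std()
    if std == 0.0:
        # All correct or all wrong: no within-group contrast, so this sketch
        # assigns zero advantage to every response (assumption).
        return rewards, np.zeros_like(rewards)
    # The group mean plays the role a learned critic would play in PPO.
    return rewards, (rewards - rewards.mean()) / std
```

For example, `binary_group_advantages(["42", "41", " 42 ", "7"], "42")` gives rewards of 1, 0, 1, 0, and the correct responses receive positive advantages while the incorrect ones receive negative advantages.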
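
## Known Limitations

GRPO has an [[update-magnitude-imbalance|implicit difficulty imbalance]]: the update magnitude peaks when the fraction of correct responses is p = 0.5 and is suppressed for both hard and easy questions. [[dgpo|DGPO]] addresses this with DGAE.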
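
A small numerical check can make the imbalance concrete. The sketch below assumes binary rewards and reads "update magnitude" as the sum of absolute advantages over the group; that reading is an assumption made for illustration, and the linked note may define the quantity differently. Under this assumption, a group of size G with a fraction p of correct responses has a total of 2G·sqrt(p(1−p)), which is largest at p = 0.5 and shrinks toward zero as p approaches 0 or 1.

```python
import numpy as np


def total_update_magnitude(num_correct: int, group_size: int) -> float:
    """Sum of |advantage| over one group with binary (0/1) rewards."""
    rewards = np.zeros(group_size)
    rewards[:num_correct] = 1.0
    std = rewards.std()
    if std == 0.0:
        # All correct or all wrong: every advantage is zero.
        return 0.0
    advantages = (rewards - rewards.mean()) / std
    return float(np.abs(advantages).sum())


G = 8  # illustrative group size
# Map accuracy p = k / G to the total magnitude for that group.
magnitudes = {k / G: total_update_magnitude(k, G) for k in range(G + 1)}
hardest_hit = max(magnitudes, key=magnitudes.get)  # accuracy with the largest total
```

The curve is symmetric around p = 0.5, so a question answered correctly once in eight tries contributes the same total as one answered correctly seven times in eight.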

## Related Concepts

- [[dgpo|DGPO]]: difficulty-aware improvement on GRPO
- [[dgae|DGAE]]: difficulty-balanced advantage estimation
- [[rlvr-unified-framework]]: the RLVR training paradigm
- [[dai-mathforge-2026|MathForge]]: difficulty-aware mathematical reasoning framework