20260514: Add new content
60
raw/papers/dai-mathforge-2026.md
Normal file
@@ -0,0 +1,60 @@
---
title: "Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation"
authors: ["Yanqi Dai", "Yuxiang Ji", "Xiao Zhang", "Yong Wang", "Xiangxiang Chu", "Zhiwu Lu"]
year: 2026
arxiv: "2601.20614"
venue: "ICLR 2026"
institutions: ["Renmin University", "AMAP Alibaba Group", "Xiamen University", "Dalian University of Technology"]
type: "paper"
created: 2026-05-12
tags: ["mathematical-reasoning", "reinforcement-learning", "grpo", "difficulty-aware", "data-augmentation"]
---

# Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation

**Authors**: Yanqi Dai, Yuxiang Ji, Xiao Zhang, Yong Wang, Xiangxiang Chu, Zhiwu Lu
**Venue**: ICLR 2026
**arXiv**: [2601.20614](https://arxiv.org/abs/2601.20614)
**Code**: https://github.com/AMAP-ML/MathForge

## Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) offers a robust mechanism for enhancing mathematical reasoning in large models. However, the authors find that existing methods systematically underemphasize more challenging questions, from both the algorithmic and the data perspective.

**Algorithmically**: GRPO suffers from an implicit imbalance: the magnitude of its policy update peaks at an accuracy rate of p = 0.5 and shrinks as p approaches 0, so harder questions receive smaller updates.

**Data-wise**: Augmentation approaches primarily rephrase questions to enhance diversity without systematically increasing intrinsic difficulty.

**Solution: MathForge**, a two-part framework comprising:

1. **DGPO** (Difficulty-Aware Group Policy Optimization): rectifies GRPO's imbalance via difficulty-balanced group advantage estimation (DGAE) and difficulty-aware question-level weighting (DQW)
2. **MQR** (Multi-Aspect Question Reformulation): reformulates questions across multiple aspects (Background, Term, Sub-Problem) to increase difficulty while preserving the original gold answer

## Key Findings

- GRPO's total update magnitude for a single question is ∝ 2G√(p(1-p)), where G is the group size and p the question's accuracy rate; it peaks at p = 0.5
- DGAE replaces the standard deviation (std) with the mean absolute deviation (MAD), giving a constant update magnitude of G regardless of accuracy
- MathForge averages 42.17% across 6 benchmarks versus 37.61% for GRPO (Qwen2.5-Math-7B)
- MQR generates three types of reformulations with a 97–99% answer preservation rate
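
The first two findings can be checked by hand. The following is a worked sketch, assuming binary rewards $r_i \in \{0, 1\}$, exactly $pG$ correct responses in a group of $G$, the population standard deviation, and MAD read as the mean absolute deviation around the group mean. These assumptions are not spelled out in this note; they are the reading under which the stated magnitudes come out exactly.

$$\text{mean} = p, \qquad \text{std} = \sqrt{p(1-p)}, \qquad \text{MAD} = p(1-p) + (1-p)p = 2p(1-p)$$

$$\sum_{i=1}^{G} |r_i - p| = pG(1-p) + (1-p)G\,p = 2Gp(1-p)$$

$$\sum_{i=1}^{G} |\hat{A}_{GR,i}| = \frac{2Gp(1-p)}{\sqrt{p(1-p)}} = 2G\sqrt{p(1-p)}, \qquad \sum_{i=1}^{G} |\hat{A}_{DG,i}| = \frac{2Gp(1-p)}{2p(1-p)} = G$$

For example, with $G = 8$ the GRPO magnitude is $8$ at $p = 0.5$ but only about $5.29$ at $p = 0.125$, while the DGAE magnitude is $8$ in both cases.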

## Core Equations

**GRPO Advantage (imbalanced)**:
$$\hat{A}_{GR,i} = \frac{r_i - \text{mean}(\{r_i\}_{i=1}^G)}{\text{std}(\{r_i\}_{i=1}^G)}$$

**DGAE Advantage (balanced)**:
$$\hat{A}_{DG,i} = \frac{r_i - \text{mean}(\{r_i\}_{i=1}^G)}{\text{MAD}(\{r_i\}_{i=1}^G)}$$
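
A minimal Python sketch of the two normalizations, to make the difference concrete. It is not taken from the MathForge repository: the function names, the `eps` guard against zero-variance groups, and the use of the population standard deviation are all assumptions made for illustration, and MAD is read here as the mean absolute deviation around the group mean.

```python
import math


def group_advantages(rewards, normalizer="std", eps=1e-8):
    """Center rewards on the group mean and divide by a spread statistic.

    normalizer="std" gives a GRPO-style advantage; normalizer="mad" gives a
    DGAE-style advantage. Both the names and eps are illustrative assumptions.
    """
    g = len(rewards)
    mean = sum(rewards) / g
    if normalizer == "std":
        # Population standard deviation (divides by G, not G - 1).
        scale = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)
    elif normalizer == "mad":
        # Mean absolute deviation around the group mean.
        scale = sum(abs(r - mean) for r in rewards) / g
    else:
        raise ValueError(f"unknown normalizer: {normalizer!r}")
    return [(r - mean) / (scale + eps) for r in rewards]


def update_magnitude(rewards, normalizer):
    """Sum of absolute advantages over one question's group."""
    return sum(abs(a) for a in group_advantages(rewards, normalizer))


def binary_group(g, num_correct):
    """Hypothetical helper: G binary rewards with num_correct ones."""
    return [1.0] * num_correct + [0.0] * (g - num_correct)


if __name__ == "__main__":
    g = 8
    for k in range(1, g):
        rewards = binary_group(g, k)
        print(
            f"p={k / g:.3f}  "
            f"std-normalized={update_magnitude(rewards, 'std'):.3f}  "
            f"mad-normalized={update_magnitude(rewards, 'mad'):.3f}"
        )
```

Under these assumptions the std-normalized column traces 2G√(p(1-p)), while the MAD-normalized column stays at G for every accuracy rate strictly between 0 and 1.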

**DQW Weighting**:
$$\lambda_s = B_v \cdot \frac{\exp(D_s/T)}{\sum_{s=1}^{B_v} \exp(D_s/T)}, \quad D_s = -\text{mean}(\{r_{si}\}_{i=1}^G)$$
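
The DQW formula is a temperature-scaled softmax over per-question difficulties, rescaled so the weights average to 1. The sketch below illustrates it under stated assumptions: $B_v$ is taken to be the number of questions being weighted (this note does not define it further), the function name `dqw_weights` and the default temperature are hypothetical, and subtracting the maximum logit is a standard numerical-stability step that is not part of the formula.

```python
import math


def dqw_weights(group_rewards, temperature=1.0):
    """Compute lambda_s for each question from its group of rewards.

    group_rewards: one list of rewards per question (B_v lists, by assumption).
    temperature:   T in the formula; the default is an illustrative choice.
    """
    # D_s = -mean(rewards of question s): lower accuracy means higher difficulty.
    difficulties = [-sum(rewards) / len(rewards) for rewards in group_rewards]
    logits = [d / temperature for d in difficulties]
    # Subtract the max logit before exponentiating; the softmax is unchanged.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    b_v = len(group_rewards)
    return [b_v * e / total for e in exps]


if __name__ == "__main__":
    # Three questions with accuracy 0.75, 0.25, and 0.5.
    batch = [[1, 1, 1, 0], [1, 0, 0, 0], [1, 1, 0, 0]]
    print(dqw_weights(batch, temperature=0.5))
```

Because the softmax is multiplied by $B_v$, the weights sum to $B_v$, so harder questions are upweighted and easier ones downweighted without changing the average weight. A smaller $T$ sharpens the contrast, and a larger $T$ pushes every weight toward 1.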

## Key Concepts

- [[dgpo|DGPO]]: Difficulty-Aware Group Policy Optimization
- [[dgae|DGAE]]: Difficulty-Balanced Group Advantage Estimation
- [[dqw|DQW]]: Difficulty-Aware Question-Level Weighting
- [[mqr|MQR]]: Multi-Aspect Question Reformulation
- [[mathforge]]: The complete MathForge framework
- [[grpo]]: Group Relative Policy Optimization
- [[update-magnitude-imbalance]]: GRPO's implicit difficulty imbalance
- [[math-question-reformulation]]: MQR's three reformulation strategies
- [[rlvr-unified-framework]]: RLVR training paradigm