| title | authors | year | arxiv | venue | institutions | type | created | tags |
|---|---|---|---|---|---|---|---|---|
| Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation | | 2026 | 2601.20614 | ICLR 2026 | | paper | 2026-05-12 | |
Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation
Authors: Yanqi Dai, Yuxiang Ji, Xiao Zhang, Yong Wang, Xiangxiang Chu, Zhiwu Lu
Venue: ICLR 2026
arXiv: 2601.20614
Code: https://github.com/AMAP-ML/MathForge
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) offers a robust mechanism for enhancing mathematical reasoning in large models. However, the authors identify a systematic lack of emphasis on more challenging questions in existing methods from both algorithmic and data perspectives.
Algorithmically: GRPO suffers from an implicit imbalance: the magnitude of its policy updates peaks at an accuracy rate of p=0.5 and shrinks as p approaches 0, so harder questions receive smaller updates.
Data-wise: Augmentation approaches primarily rephrase questions to enhance diversity without systematically increasing intrinsic difficulty.
Solution: MathForge, a two-dual framework that targets harder questions from both the algorithmic and the data perspective, comprising:
- DGPO (Difficulty-Aware Group Policy Optimization): rectifies GRPO's imbalance via difficulty-balanced group advantage estimation (DGAE) and difficulty-aware question-level weighting (DQW)
- MQR (Multi-Aspect Question Reformulation): reformulates questions across multiple aspects (Background, Term, Sub-Problem) to increase difficulty while preserving the original gold answer
Key Findings
- GRPO's total update magnitude for a single question is ∝ 2G√(p(1-p)), peaking at p=0.5
- DGAE replaces the standard deviation (std) in the advantage denominator with the mean absolute deviation (MAD), which makes the total update magnitude a constant G regardless of accuracy
- MathForge achieves 42.17% avg on 6 benchmarks vs 37.61% for GRPO (Qwen2.5-Math-7B)
- MQR generates three types of reformulations with 97-99% answer preservation rate
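The two update-magnitude claims above can be checked by hand. The sketch below assumes binary rewards r_i ∈ {0, 1}, a group of G responses of which a fraction p is correct (0 < p < 1), population statistics, and "total update magnitude" read as the sum of absolute advantages over the group. These assumptions are not spelled out in this note; they are the reading under which the stated formulas come out.

```latex
% Assumptions: r_i \in \{0,1\}, a fraction p of the G responses is correct, 0 < p < 1.
\text{mean} = p, \qquad
\text{std} = \sqrt{p(1-p)}, \qquad
\text{MAD} = \frac{1}{G}\sum_{i=1}^{G} |r_i - p| = p(1-p) + (1-p)p = 2p(1-p)

% GRPO: normalizing by std leaves a p-dependent total that peaks at p = 0.5.
\sum_{i=1}^{G} |\hat{A}_{GR,i}|
  = \frac{G \cdot 2p(1-p)}{\sqrt{p(1-p)}}
  = 2G\sqrt{p(1-p)}

% DGAE: normalizing by MAD cancels the p-dependence exactly.
\sum_{i=1}^{G} |\hat{A}_{DG,i}|
  = \frac{G \cdot 2p(1-p)}{2p(1-p)}
  = G
```

At p=0.5 the GRPO total equals G, and it falls toward 0 as p approaches 0 or 1, which is the imbalance DGAE removes.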
Core Equations
GRPO Advantage (imbalanced):
\hat{A}_{GR,i} = \frac{r_i - \text{mean}(\{r_i\}_{i=1}^G)}{\text{std}(\{r_i\}_{i=1}^G)}
DGAE Advantage (balanced):
\hat{A}_{DG,i} = \frac{r_i - \text{mean}(\{r_i\}_{i=1}^G)}{\text{MAD}(\{r_i\}_{i=1}^G)}
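The two estimators differ only in their denominator, which a few lines of NumPy make concrete. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, MAD is taken to be the mean absolute deviation around the group mean, std is the population standard deviation, and returning zeros for a group whose rewards are all identical is an assumption made here to avoid dividing by zero.

```python
import numpy as np


def grpo_advantages(rewards):
    """Group-relative advantage: (r - mean) / std, with population std."""
    r = np.asarray(rewards, dtype=float)
    centered = r - r.mean()
    denom = r.std()  # ddof=0, i.e. population standard deviation
    if denom < 1e-12:
        # Assumption: a group with identical rewards carries no signal.
        return np.zeros_like(r)
    return centered / denom


def dgae_advantages(rewards):
    """DGAE-style advantage: (r - mean) / MAD, MAD = mean absolute deviation."""
    r = np.asarray(rewards, dtype=float)
    centered = r - r.mean()
    denom = np.abs(centered).mean()
    if denom < 1e-12:
        # Same assumption as above for degenerate groups.
        return np.zeros_like(r)
    return centered / denom


if __name__ == "__main__":
    G = 8
    # Sweep the number of correct responses in a group of binary rewards.
    for k in range(1, G):
        rewards = [1.0] * k + [0.0] * (G - k)
        grpo_total = np.abs(grpo_advantages(rewards)).sum()
        dgae_total = np.abs(dgae_advantages(rewards)).sum()
        print(f"p={k / G:.3f}  GRPO total={grpo_total:.3f}  DGAE total={dgae_total:.3f}")
```

Running the sweep prints one line per accuracy level. Under the binary-reward assumption, the GRPO column varies with p while the DGAE column stays at the group size.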
DQW Weighting:
\lambda_s = B_v \cdot \frac{\exp(D_s/T)}{\sum_{s=1}^{B_v} \exp(D_s/T)}, \quad D_s = -\text{mean}(\{r_{si}\}_{i=1}^G)
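The weighting formula is a temperature-scaled softmax over per-question difficulty, rescaled so the weights sum to B_v. The sketch below illustrates that reading under stated assumptions: B_v is taken to be the number of questions passed in (this note does not define it), `dqw_weights` is a hypothetical name, and the default temperature of 1.0 is a placeholder rather than a value from the paper.

```python
import numpy as np


def dqw_weights(group_rewards, temperature=1.0):
    """Question-level weights lambda_s = B_v * softmax(D_s / T).

    group_rewards: array-like of shape (B_v, G), one row of G rewards per question.
    temperature:   T in the formula; 1.0 is a placeholder, not a reported setting.
    """
    r = np.asarray(group_rewards, dtype=float)
    b_v = r.shape[0]                      # assumed meaning of B_v
    difficulty = -r.mean(axis=1)          # D_s = -mean reward of question s
    logits = difficulty / temperature
    logits = logits - logits.max()        # stability shift; softmax is unchanged
    soft = np.exp(logits) / np.exp(logits).sum()
    return b_v * soft


if __name__ == "__main__":
    rewards = [
        [1, 1, 1, 0],  # easier question: 3 of 4 responses correct
        [1, 0, 0, 0],  # harder question: 1 of 4 responses correct
    ]
    weights = dqw_weights(rewards, temperature=1.0)
    print("weights:", weights, "sum:", weights.sum())
```

Because the softmax is multiplied by B_v, questions of equal difficulty each get weight 1, and a lower mean reward pushes a question's weight above 1 at the expense of easier ones. A smaller T sharpens that contrast.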
Key Concepts
- dgpo — Difficulty-Aware GRPO algorithm
- dgae — Difficulty-Balanced Group Advantage Estimation
- dqw — Difficulty-Aware Question-Level Weighting
- mqr — Multi-Aspect Question Reformulation
- mathforge — The complete MathForge framework
- grpo — Group Relative Policy Optimization
- update-magnitude-imbalance — GRPO's implicit difficulty imbalance
- math-question-reformulation — MQR's three reformulation strategies
- rlvr-unified-framework — RLVR training paradigm