Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation

Authors: Yanqi Dai, Yuxiang Ji, Xiao Zhang, Yong Wang, Xiangxiang Chu, Zhiwu Lu
Venue: ICLR 2026
arXiv: 2601.20614
Code: https://github.com/AMAP-ML/MathForge

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) offers a robust mechanism for enhancing mathematical reasoning in large models. However, the authors identify a systematic lack of emphasis on more challenging questions in existing methods from both algorithmic and data perspectives.

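Algorithmically: GRPO suffers from an implicit imbalance. The magnitude of its policy update for a question depends on that question's accuracy rate p: it peaks at p=0.5 and shrinks as p moves toward 0 or 1, so harder questions (low p) receive smaller updates than moderately difficult ones.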

Data-wise: Augmentation approaches primarily rephrase questions to enhance diversity without systematically increasing intrinsic difficulty.

Solution: MathForge, a two-part framework that targets harder questions from both the algorithmic and the data side, comprising:

DGPO (Difficulty-Aware Group Policy Optimization): rectifies GRPO's imbalance via difficulty-balanced group advantage estimation (DGAE) and difficulty-aware question-level weighting (DQW)
MQR (Multi-Aspect Question Reformulation): reformulates questions across multiple aspects (Background, Term, Sub-Problem) to increase difficulty while preserving the original gold answer

Key Findings

GRPO's total update magnitude for a single question is ∝ 2G√(p(1-p)), peaking at p=0.5
DGAE replaces the standard deviation (std) with the mean absolute deviation (MAD), which makes the update magnitude a constant G regardless of accuracy
MathForge achieves 42.17% avg on 6 benchmarks vs 37.61% for GRPO (Qwen2.5-Math-7B)
MQR generates three types of reformulations with 97-99% answer preservation rate
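
The two update-magnitude findings above can be checked by hand. The derivation below is a sketch under stated assumptions that this summary does not spell out: rewards are binary ($r_i \in \{0,1\}$), a fraction $p$ of the $G$ responses is correct with $0 < p < 1$, std is the population standard deviation, and "update magnitude" is read as the sum of absolute advantages over the group.

```latex
\begin{align*}
\text{mean}(\{r_i\}_{i=1}^G) &= p, \qquad
\text{std}(\{r_i\}_{i=1}^G) = \sqrt{p(1-p)}, \qquad
\text{MAD}(\{r_i\}_{i=1}^G) = \frac{1}{G}\sum_{i=1}^{G} |r_i - p| = 2p(1-p) \\
\sum_{i=1}^{G} |r_i - p| &= Gp\,(1-p) + G(1-p)\,p = 2Gp(1-p) \\
\sum_{i=1}^{G} |\hat{A}_{GR,i}| &= \frac{2Gp(1-p)}{\sqrt{p(1-p)}} = 2G\sqrt{p(1-p)} \\
\sum_{i=1}^{G} |\hat{A}_{DG,i}| &= \frac{2Gp(1-p)}{2p(1-p)} = G
\end{align*}
```

Under these assumptions the GRPO magnitude reaches its maximum, G, only at p=0.5, while the MAD-normalized magnitude equals G for every p strictly between 0 and 1.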

Core Equations

GRPO Advantage (imbalanced):

\hat{A}_{GR,i} = \frac{r_i - \text{mean}(\{r_i\}_{i=1}^G)}{\text{std}(\{r_i\}_{i=1}^G)}

DGAE Advantage (balanced):

\hat{A}_{DG,i} = \frac{r_i - \text{mean}(\{r_i\}_{i=1}^G)}{\text{MAD}(\{r_i\}_{i=1}^G)}
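
To make the two advantage definitions concrete, here is a minimal NumPy sketch of both normalizations for a single question's group of rewards. It is an illustration, not the authors' implementation: the function names, the `eps` stabilizer, and the choice of population std (NumPy's default) are assumptions, and "update magnitude" is taken as the sum of absolute advantages.

```python
import numpy as np


def grpo_advantages(rewards, eps=1e-8):
    """Std-normalized group advantages (population std, an assumption)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


def dgae_advantages(rewards, eps=1e-8):
    """MAD-normalized group advantages, MAD = mean absolute deviation."""
    r = np.asarray(rewards, dtype=float)
    centered = r - r.mean()
    mad = np.abs(centered).mean()
    return centered / (mad + eps)


def update_magnitude(advantages):
    """Sum of absolute advantages over the group (assumed definition)."""
    return float(np.abs(advantages).sum())


if __name__ == "__main__":
    G = 8  # hypothetical group size
    for n_correct in (1, 4, 7):
        # Binary rewards: 1.0 for a correct response, 0.0 otherwise.
        rewards = [1.0] * n_correct + [0.0] * (G - n_correct)
        p = n_correct / G
        print(
            f"p={p:.3f}",
            f"GRPO={update_magnitude(grpo_advantages(rewards)):.4f}",
            f"DGAE={update_magnitude(dgae_advantages(rewards)):.4f}",
        )
```

Groups in which every response is correct or every response is wrong have zero deviation from the mean, so both functions return all-zero advantages for them; `eps` only prevents a division by zero in that case.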

DQW Weighting:

\lambda_s = B_v \cdot \frac{\exp(D_s/T)}{\sum_{s=1}^{B_v} \exp(D_s/T)}, \quad D_s = -\text{mean}(\{r_{si}\}_{i=1}^G)
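
The DQW formula is a temperature-scaled softmax over per-question difficulty, rescaled so that the weights sum to B_v. The sketch below is one way to compute it with NumPy. It assumes that B_v is the number of questions entering the weighting (this summary does not define it), that rewards arrive as a (B_v, G) array, and that T is a positive temperature; the function name and the example batch are hypothetical.

```python
import numpy as np


def dqw_weights(rewards, temperature=1.0):
    """Question-level weights lambda_s from a (B_v, G) reward array."""
    r = np.asarray(rewards, dtype=float)
    num_questions = r.shape[0]           # B_v (assumed meaning)
    difficulty = -r.mean(axis=1)         # D_s = -mean reward of question s
    logits = difficulty / temperature
    logits = logits - logits.max()       # shift for numerical stability
    softmax = np.exp(logits) / np.exp(logits).sum()
    return num_questions * softmax       # weights sum to B_v


if __name__ == "__main__":
    # Three hypothetical questions with accuracy 0.25, 0.50 and 0.75.
    batch = [
        [1, 0, 0, 0],
        [1, 1, 0, 0],
        [1, 1, 1, 0],
    ]
    print(dqw_weights(batch, temperature=0.5))
```

Because D_s is the negated mean reward, questions with lower accuracy get larger weights, and a smaller T sharpens that preference. Scaling by B_v keeps the average weight at 1, so the overall loss scale matches the unweighted case.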

Key Concepts

dgpo — Difficulty-Aware GRPO algorithm
dgae — Difficulty-Balanced Group Advantage Estimation
dqw — Difficulty-Aware Question-Level Weighting
mqr — Multi-Aspect Question Reformulation
mathforge — The complete MathForge framework
grpo — Group Relative Policy Optimization
update-magnitude-imbalance — GRPO's implicit difficulty imbalance
math-question-reformulation — MQR's three reformulation strategies
rlvr-unified-framework — RLVR training paradigm
