
title: Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation
authors: Yanqi Dai, Yuxiang Ji, Xiao Zhang, Yong Wang, Xiangxiang Chu, Zhiwu Lu
year: 2026
arxiv: 2601.20614
venue: ICLR 2026
institutions: Renmin University, AMAP Alibaba Group, Xiamen University, Dalian University of Technology
type: paper
created: 2026-05-12
tags: mathematical-reasoning, reinforcement-learning, grpo, difficulty-aware, data-augmentation

Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation

Authors: Yanqi Dai, Yuxiang Ji, Xiao Zhang, Yong Wang, Xiangxiang Chu, Zhiwu Lu
Venue: ICLR 2026
arXiv: 2601.20614
Code: https://github.com/AMAP-ML/MathForge

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) offers a robust mechanism for enhancing mathematical reasoning in large models. However, the authors identify a systematic lack of emphasis on more challenging questions in existing methods from both algorithmic and data perspectives.

Algorithmically: GRPO suffers from an implicit imbalance. The magnitude of its policy update for a question peaks at an accuracy rate of p = 0.5 and falls off toward both extremes, so harder questions (low p) receive smaller updates.

Data-wise: Augmentation approaches primarily rephrase questions to enhance diversity without systematically increasing intrinsic difficulty.

Solution: MathForge — a two-dual framework comprising:

  1. DGPO (Difficulty-Aware Group Policy Optimization): rectifies GRPO's imbalance via difficulty-balanced group advantage estimation (DGAE) and difficulty-aware question-level weighting (DQW)
  2. MQR (Multi-Aspect Question Reformulation): reformulates questions across multiple aspects (Background, Term, Sub-Problem) to increase difficulty while preserving the original gold answer

Key Findings

  • GRPO's total update magnitude for a single question scales as 2G√(p(1-p)), where G is the group size and p the question's accuracy rate; it peaks at p=0.5
  • DGAE replaces std with MAD (mean absolute deviation), giving a constant update magnitude of G regardless of accuracy
  • MathForge achieves 42.17% avg on 6 benchmarks vs 37.61% for GRPO (Qwen2.5-Math-7B)
  • MQR generates three types of reformulations with 97-99% answer preservation rate
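
The first two findings can be checked numerically. The sketch below is an illustration, not the authors' code: it assumes binary rewards, uses the population standard deviation (which is what the 2G√(p(1-p)) expression implies), omits any epsilon in the denominator, and takes the update magnitude to be the sum of absolute advantages over the group. The function names are hypothetical.

```python
import math


def group_advantages(rewards, scale="std"):
    """Center rewards on the group mean, then divide by std or MAD."""
    g = len(rewards)
    mean = sum(rewards) / g
    deviations = [r - mean for r in rewards]
    if scale == "std":
        # Population standard deviation (assumption: no Bessel correction).
        denom = math.sqrt(sum(d * d for d in deviations) / g)
    elif scale == "mad":
        # Mean absolute deviation from the group mean.
        denom = sum(abs(d) for d in deviations) / g
    else:
        raise ValueError(f"unknown scale: {scale}")
    if denom == 0.0:
        # All rewards identical: no learning signal for this group.
        return [0.0] * g
    return [d / denom for d in deviations]


def total_update_magnitude(rewards, scale):
    """Sum of absolute advantages over the group."""
    return sum(abs(a) for a in group_advantages(rewards, scale))


if __name__ == "__main__":
    G = 8
    for correct in range(1, G):
        rewards = [1.0] * correct + [0.0] * (G - correct)
        p = correct / G
        grpo = total_update_magnitude(rewards, "std")
        dgae = total_update_magnitude(rewards, "mad")
        closed_form = 2 * G * math.sqrt(p * (1 - p))
        print(f"p={p:.3f}  std-scaled={grpo:.3f}  "
              f"closed-form={closed_form:.3f}  mad-scaled={dgae:.3f}")
```

Under these assumptions the std-scaled column tracks the closed form and is largest at p = 0.5, while the MAD-scaled column stays at G for every p strictly between 0 and 1.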

Core Equations

GRPO Advantage (imbalanced):

\hat{A}_{GR,i} = \frac{r_i - \text{mean}(\{r_i\}_{i=1}^G)}{\text{std}(\{r_i\}_{i=1}^G)}

DGAE Advantage (balanced):

\hat{A}_{DG,i} = \frac{r_i - \text{mean}(\{r_i\}_{i=1}^G)}{\text{MAD}(\{r_i\}_{i=1}^G)}
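
A short worked derivation shows why the choice of denominator matters. It assumes binary rewards r_i ∈ {0, 1}, an accuracy rate 0 < p < 1 (so Gp responses are correct), the population standard deviation, and MAD read as the mean absolute deviation; these assumptions are not spelled out in this note.

```latex
\text{mean} = p, \qquad
\text{std} = \sqrt{p(1-p)}, \qquad
\text{MAD} = \frac{1}{G}\sum_{i=1}^{G} |r_i - p| = 2p(1-p)

\sum_{i=1}^{G} \left|\hat{A}_{GR,i}\right|
  = \frac{Gp(1-p) + G(1-p)p}{\sqrt{p(1-p)}}
  = 2G\sqrt{p(1-p)}

\sum_{i=1}^{G} \left|\hat{A}_{DG,i}\right|
  = \frac{2Gp(1-p)}{2p(1-p)}
  = G
```

The std-normalized sum depends on p and is largest at p = 0.5, whereas the MAD-normalized sum equals G at every accuracy rate in the open interval (0, 1).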

DQW Weighting:

\lambda_s = B_v \cdot \frac{\exp(D_s/T)}{\sum_{s=1}^{B_v} \exp(D_s/T)}, \quad D_s = -\text{mean}(\{r_{si}\}_{i=1}^G)
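
The weighting above is a temperature-scaled softmax over per-question difficulties, rescaled so the weights sum to B_v. The sketch below is a minimal illustration rather than the authors' implementation. It assumes B_v is the number of questions the softmax runs over and T is a temperature (neither is defined in this note), and the function name and default temperature are hypothetical.

```python
import math


def dqw_weights(group_rewards, temperature=1.0):
    """Question-level weights: B_v * softmax(D_s / T), with D_s = -mean reward.

    group_rewards: list of per-question reward lists (one list of G rewards
    per question). temperature: T in the formula (default is illustrative).
    """
    # Difficulty is the negative mean reward: lower accuracy -> higher D_s.
    difficulties = [-sum(r) / len(r) for r in group_rewards]
    scaled = [d / temperature for d in difficulties]
    # Subtract the max before exponentiating for numerical stability;
    # this leaves the softmax unchanged.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    b_v = len(group_rewards)  # assumption: B_v = number of questions weighted
    return [b_v * e / z for e in exps]


if __name__ == "__main__":
    # Hypothetical batch: an easy question (3/4 correct) and a hard one (1/4).
    batch = [[1, 1, 1, 0], [1, 0, 0, 0]]
    print(dqw_weights(batch, temperature=0.5))
```

Because the weights sum to B_v, their average is 1: harder questions get a weight above 1 and easier ones below 1, and a smaller T widens that gap.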

Key Concepts