---
title: "Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation"
authors: ["Yanqi Dai", "Yuxiang Ji", "Xiao Zhang", "Yong Wang", "Xiangxiang Chu", "Zhiwu Lu"]
year: 2026
arxiv: "2601.20614"
venue: "ICLR 2026"
institutions: ["Renmin University", "AMAP Alibaba Group", "Xiamen University", "Dalian University of Technology"]
type: "paper"
created: 2026-05-12
tags: ["mathematical-reasoning", "reinforcement-learning", "grpo", "difficulty-aware", "data-augmentation"]
---

# Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation

**Authors**: Yanqi Dai, Yuxiang Ji, Xiao Zhang, Yong Wang, Xiangxiang Chu, Zhiwu Lu
**Venue**: ICLR 2026
**arXiv**: [2601.20614](https://arxiv.org/abs/2601.20614)
**Code**: https://github.com/AMAP-ML/MathForge

## Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) offers a robust mechanism for enhancing mathematical reasoning in large models. However, the authors identify a systematic lack of emphasis on more challenging questions in existing methods from both algorithmic and data perspectives.

**Algorithmically**: GRPO suffers from an implicit imbalance: a question's total update magnitude peaks at an accuracy rate of p=0.5 and shrinks as p moves toward 0 or 1, so harder (low-accuracy) questions receive smaller policy updates.

**Data-wise**: Augmentation approaches primarily rephrase questions to enhance diversity without systematically increasing intrinsic difficulty.

**Solution: MathForge** — a two-dual framework comprising:
1. **DGPO** (Difficulty-Aware Group Policy Optimization): rectifies GRPO's imbalance via difficulty-balanced group advantage estimation (DGAE) and difficulty-aware question-level weighting (DQW)
2. **MQR** (Multi-Aspect Question Reformulation): reformulates questions across multiple aspects (Background, Term, Sub-Problem) to increase difficulty while preserving the original gold answer

## Key Findings

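- GRPO's total update magnitude for a single question is ∝ 2G√(p(1-p)), where G is the group size and p is the question's accuracy rate; it peaks at p=0.5
- DGAE replaces the standard deviation (std) with the mean absolute deviation (MAD), giving a constant update magnitude (G) regardless of accuracy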
- MathForge achieves 42.17% avg on 6 benchmarks vs 37.61% for GRPO (Qwen2.5-Math-7B)
- MQR generates three types of reformulations with 97-99% answer preservation rate
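
The two update magnitudes in the first two findings can be checked with a short derivation. This is a sketch under assumptions not stated in this note: rewards are binary ($r_i \in \{0,1\}$), a fraction $p \in (0,1)$ of the $G$ responses is correct, std is the population standard deviation, and MAD is the mean absolute deviation around the group mean.

$$
\begin{aligned}
\text{mean} &= p, \qquad \text{std} = \sqrt{p(1-p)}, \qquad \text{MAD} = 2p(1-p) \\
\sum_{i=1}^{G} |\hat{A}_{GR,i}| &= \frac{Gp(1-p) + G(1-p)p}{\sqrt{p(1-p)}} = 2G\sqrt{p(1-p)} \\
\sum_{i=1}^{G} |\hat{A}_{DG,i}| &= \frac{2Gp(1-p)}{2p(1-p)} = G
\end{aligned}
$$

Under these assumptions the GRPO sum reaches its maximum value of G only at p=0.5, while the DGAE sum equals G at every accuracy level.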

## Core Equations

**GRPO Advantage (imbalanced)**:
$$\hat{A}_{GR,i} = \frac{r_i - \text{mean}(\{r_i\}_{i=1}^G)}{\text{std}(\{r_i\}_{i=1}^G)}$$

**DGAE Advantage (balanced)**:
$$\hat{A}_{DG,i} = \frac{r_i - \text{mean}(\{r_i\}_{i=1}^G)}{\text{MAD}(\{r_i\}_{i=1}^G)}$$
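
The two advantage estimators can be compared numerically with a minimal pure-Python sketch. It is not taken from the MathForge repository: the function names, the `eps` stabilizer in the denominator, the use of the population standard deviation, and the reading of MAD as mean absolute deviation are all assumptions made for illustration.

```python
from statistics import fmean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r_i - mean) / std (population std assumed)."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)
    # eps is an illustrative guard against division by zero, not from the note.
    return [(r - mu) / (sigma + eps) for r in rewards]


def dgae_advantages(rewards, eps=1e-8):
    """Same numerator, but normalized by the mean absolute deviation (MAD)."""
    mu = fmean(rewards)
    mad = fmean([abs(r - mu) for r in rewards])
    return [(r - mu) / (mad + eps) for r in rewards]


def total_magnitude(advantages):
    """Sum of absolute advantages over one group of G responses."""
    return sum(abs(a) for a in advantages)


if __name__ == "__main__":
    # A hard question: 1 correct response out of G = 8 (p = 0.125).
    hard = [1, 0, 0, 0, 0, 0, 0, 0]
    # A medium question: 4 correct responses out of G = 8 (p = 0.5).
    medium = [1, 1, 1, 1, 0, 0, 0, 0]
    for name, rewards in [("hard", hard), ("medium", medium)]:
        print(name,
              total_magnitude(grpo_advantages(rewards)),
              total_magnitude(dgae_advantages(rewards)))
```

With binary rewards, the GRPO total should be smaller for the hard group than for the medium one, while the DGAE total should stay close to G = 8 for both, up to the small `eps` perturbation.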

**DQW Weighting**:
$$\lambda_s = B_v \cdot \frac{\exp(D_s/T)}{\sum_{s=1}^{B_v} \exp(D_s/T)}, \quad D_s = -\text{mean}(\{r_{si}\}_{i=1}^G)$$
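
The weighting formula is a softmax over per-question difficulties, rescaled so the weights sum to $B_v$. The sketch below illustrates that reading. This note does not define $B_v$ or $T$, so the code assumes $B_v$ is the number of questions entering the softmax and $T$ is a temperature. The function name `dqw_weights` and the max-subtraction for numerical stability are illustrative choices, not the authors' implementation.

```python
import math


def dqw_weights(group_rewards, temperature=1.0):
    """Question-level weights lambda_s = B_v * softmax(D_s / T).

    group_rewards: one list of rewards per question (G rewards each).
    temperature:   T in the formula (assumed to be a softmax temperature).
    """
    # Difficulty D_s is the negative mean reward of question s.
    difficulties = [-sum(rs) / len(rs) for rs in group_rewards]
    scaled = [d / temperature for d in difficulties]
    # Subtracting the max leaves the softmax unchanged and avoids overflow.
    shift = max(scaled)
    exps = [math.exp(x - shift) for x in scaled]
    normalizer = sum(exps)
    b_v = len(group_rewards)  # assumed meaning of B_v
    return [b_v * e / normalizer for e in exps]


if __name__ == "__main__":
    batch = [
        [1, 1, 1, 0],  # easier question, mean reward 0.75
        [1, 0, 0, 0],  # harder question, mean reward 0.25
    ]
    print(dqw_weights(batch, temperature=1.0))
```

Because the weights sum to $B_v$, their average is 1: questions with a lower mean reward get a weight above 1 and easier ones a weight below 1. A smaller `temperature` sharpens that contrast.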

## Key Concepts

- [[dgpo|DGPO]] — Difficulty-Aware GRPO algorithm
- [[dgae|DGAE]] — Difficulty-Balanced Group Advantage Estimation
- [[dqw|DQW]] — Difficulty-Aware Question-Level Weighting
- [[mqr|MQR]] — Multi-Aspect Question Reformulation
- [[mathforge]] — The complete MathForge framework
- [[grpo]] — Group Relative Policy Optimization
- [[update-magnitude-imbalance]] — GRPO's implicit difficulty imbalance
- [[math-question-reformulation]] — MQR's three reformulation strategies
- [[rlvr-unified-framework]] — RLVR training paradigm