---
title: "Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation"
authors: ["Yanqi Dai", "Yuxiang Ji", "Xiao Zhang", "Yong Wang", "Xiangxiang Chu", "Zhiwu Lu"]
year: 2026
arxiv: "2601.20614"
venue: "ICLR 2026"
institutions: ["Renmin University", "AMAP Alibaba Group", "Xiamen University", "Dalian University of Technology"]
type: "paper"
created: 2026-05-12
tags: ["mathematical-reasoning", "reinforcement-learning", "grpo", "difficulty-aware", "data-augmentation"]
---
# Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation
**Authors**: Yanqi Dai, Yuxiang Ji, Xiao Zhang, Yong Wang, Xiangxiang Chu, Zhiwu Lu
**Venue**: ICLR 2026
**arXiv**: [2601.20614](https://arxiv.org/abs/2601.20614)
**Code**: https://github.com/AMAP-ML/MathForge
## Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) offers a robust mechanism for enhancing mathematical reasoning in large models. However, the authors identify a systematic lack of emphasis on more challenging questions in existing methods from both algorithmic and data perspectives.
**Algorithmically**: GRPO suffers from an implicit imbalance: the magnitude of policy updates depends on a question's accuracy rate p, peaking at p=0.5 and shrinking as questions get harder (p approaching 0).
**Data-wise**: Augmentation approaches primarily rephrase questions to enhance diversity without systematically increasing intrinsic difficulty.
**Solution: MathForge** — a two-dual framework comprising:
1. **DGPO** (Difficulty-Aware Group Policy Optimization): rectifies GRPO's imbalance via difficulty-balanced group advantage estimation (DGAE) and difficulty-aware question-level weighting (DQW)
2. **MQR** (Multi-Aspect Question Reformulation): reformulates questions across multiple aspects (Background, Term, Sub-Problem) to increase difficulty while preserving the original gold answer
## Key Findings
- GRPO's total update magnitude for a single question is ∝ 2G√(p(1-p)), peaking at p=0.5
- DGAE replaces the standard deviation (std) in GRPO's advantage normalization with the mean absolute deviation (MAD), giving a constant total update magnitude of G regardless of accuracy
- MathForge achieves 42.17% avg on 6 benchmarks vs 37.61% for GRPO (Qwen2.5-Math-7B)
- MQR generates three types of reformulations with 97-99% answer preservation rate
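
The first two findings can be checked numerically. The sketch below is not taken from the MathForge code base: it assumes binary rewards (1 for a correct response, 0 otherwise), the population standard deviation, and MAD read as the mean absolute deviation. The function names `group_advantages` and `total_update_magnitude` are hypothetical.

```python
import math


def group_advantages(rewards, normalizer="std"):
    """Center rewards on the group mean and divide by the chosen spread.

    normalizer="std" mimics GRPO-style normalization (population std);
    normalizer="mad" mimics DGAE-style normalization (mean absolute deviation).
    Both readings are assumptions made for this sketch.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    centered = [r - mean for r in rewards]
    if normalizer == "std":
        scale = math.sqrt(sum(c * c for c in centered) / n)
    elif normalizer == "mad":
        scale = sum(abs(c) for c in centered) / n
    else:
        raise ValueError(f"unknown normalizer: {normalizer}")
    if scale == 0.0:
        # All responses got the same reward, so there is no learning signal.
        return [0.0] * n
    return [c / scale for c in centered]


def total_update_magnitude(rewards, normalizer):
    """Sum of absolute advantages over one question's response group."""
    return sum(abs(a) for a in group_advantages(rewards, normalizer))


if __name__ == "__main__":
    G = 8  # group size, chosen arbitrarily for illustration
    for k in range(1, G):
        rewards = [1.0] * k + [0.0] * (G - k)
        p = k / G
        grpo = total_update_magnitude(rewards, "std")
        dgae = total_update_magnitude(rewards, "mad")
        print(f"p={p:.3f}  GRPO={grpo:.3f}  DGAE={dgae:.3f}")
```

Under these assumptions the std-normalized column follows 2G√(p(1-p)) and the MAD-normalized column stays at G for every accuracy rate strictly between 0 and 1.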
## Core Equations
**GRPO Advantage (imbalanced)**:
$$\hat{A}_{GR,i} = \frac{r_i - \text{mean}(\{r_i\}_{i=1}^G)}{\text{std}(\{r_i\}_{i=1}^G)}$$
**DGAE Advantage (balanced)**:
$$\hat{A}_{DG,i} = \frac{r_i - \text{mean}(\{r_i\}_{i=1}^G)}{\text{MAD}(\{r_i\}_{i=1}^G)}$$
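
The two magnitudes quoted under Key Findings can be recovered with a short worked derivation. It assumes binary rewards $r_i \in \{0, 1\}$, a group of size $G$ with accuracy rate $p$, the population standard deviation, and MAD read as the mean absolute deviation; this summary does not state these assumptions explicitly.

With $pG$ correct and $(1-p)G$ incorrect responses, the group mean is $p$, so:

$$\sum_{i=1}^G |r_i - p| = pG(1-p) + (1-p)G\,p = 2Gp(1-p)$$

$$\text{std} = \sqrt{p(1-p)}, \quad \text{MAD} = \frac{1}{G}\sum_{i=1}^G |r_i - p| = 2p(1-p)$$

$$\sum_{i=1}^G |\hat{A}_{GR,i}| = \frac{2Gp(1-p)}{\sqrt{p(1-p)}} = 2G\sqrt{p(1-p)}, \quad \sum_{i=1}^G |\hat{A}_{DG,i}| = \frac{2Gp(1-p)}{2p(1-p)} = G$$

The std-normalized total depends on $p$ and is largest at $p = 0.5$, while the MAD-normalized total is the same for every $0 < p < 1$.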
**DQW Weighting**:
$$\lambda_s = B_v \cdot \frac{\exp(D_s/T)}{\sum_{s=1}^{B_v} \exp(D_s/T)}, \quad D_s = -\text{mean}(\{r_{si}\}_{i=1}^G)$$
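
As a rough illustration of the DQW formula, the sketch below computes question-level weights from per-question reward groups. It is a minimal sketch rather than the authors' implementation: `dqw_weights` is a hypothetical name, $B_v$ is taken to be the number of questions passed in (the summary does not define it), and the default temperature is a placeholder, not a value from the paper.

```python
import math


def dqw_weights(group_rewards, temperature=1.0):
    """Softmax over per-question difficulty, rescaled to sum to B_v.

    group_rewards: list of reward lists, one list per question.
    Difficulty D_s is the negative mean reward of question s, so
    questions with lower accuracy get larger weights.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    difficulties = [-sum(g) / len(g) for g in group_rewards]
    scaled = [d / temperature for d in difficulties]
    # Subtract the maximum before exponentiating for numerical stability;
    # this leaves the softmax result unchanged.
    shift = max(scaled)
    exps = [math.exp(x - shift) for x in scaled]
    total = sum(exps)
    b_v = len(group_rewards)  # assumed meaning of B_v in this sketch
    return [b_v * e / total for e in exps]


if __name__ == "__main__":
    # Three questions with accuracy rates 0.75, 0.25 and 0.5.
    groups = [[1, 1, 1, 0], [1, 0, 0, 0], [1, 1, 0, 0]]
    for rewards, weight in zip(groups, dqw_weights(groups)):
        print(f"accuracy={sum(rewards) / len(rewards):.2f}  weight={weight:.3f}")
```

Because the softmax is multiplied by $B_v$, the weights average to 1 across the batch. A lower temperature concentrates more weight on the hardest questions, and a very high temperature approaches uniform weighting.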
## Key Concepts
- [[dgpo|DGPO]] — Difficulty-Aware GRPO algorithm
- [[dgae|DGAE]] — Difficulty-Balanced Group Advantage Estimation
- [[dqw|DQW]] — Difficulty-Aware Question-Level Weighting
- [[mqr|MQR]] — Multi-Aspect Question Reformulation
- [[mathforge]] — The complete MathForge framework
- [[grpo]] — Group Relative Policy Optimization
- [[update-magnitude-imbalance]] — GRPO's implicit difficulty imbalance
- [[math-question-reformulation]] — MQR's three reformulation strategies
- [[rlvr-unified-framework]] — RLVR training paradigm