20260514: Add new content

2026-05-14 13:54:52 +08:00
parent 56c4d3ef7c
commit b116710e4c
294 changed files with 10682 additions and 255 deletions
--- a/raw/papers/dai-mathforge-2026.md
+++ b/raw/papers/dai-mathforge-2026.md
@@ -0,0 +1,60 @@
+---
+title: "Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation"
+authors: ["Yanqi Dai", "Yuxiang Ji", "Xiao Zhang", "Yong Wang", "Xiangxiang Chu", "Zhiwu Lu"]
+year: 2026
+arxiv: "2601.20614"
+venue: "ICLR 2026"
+institutions: ["Renmin University", "AMAP Alibaba Group", "Xiamen University", "Dalian University of Technology"]
+type: "paper"
+created: 2026-05-12
+tags: ["mathematical-reasoning", "reinforcement-learning", "grpo", "difficulty-aware", "data-augmentation"]
+---
+
+# Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation
+
+**Authors**: Yanqi Dai, Yuxiang Ji, Xiao Zhang, Yong Wang, Xiangxiang Chu, Zhiwu Lu
+**Venue**: ICLR 2026
+**arXiv**: [2601.20614](https://arxiv.org/abs/2601.20614)
+**Code**: https://github.com/AMAP-ML/MathForge
+
+## Abstract
+
+Reinforcement Learning with Verifiable Rewards (RLVR) offers a robust mechanism for enhancing mathematical reasoning in large models. However, the authors identify a systematic lack of emphasis on more challenging questions in existing methods from both algorithmic and data perspectives.
+
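+**Algorithmically**: GRPO suffers from an implicit imbalance: the magnitude of its policy update for a question depends on the accuracy rate p, peaking at p=0.5 and shrinking toward both extremes, so hard questions (low p) receive weaker updates than moderately difficult ones.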
+
+**Data-wise**: Augmentation approaches primarily rephrase questions to enhance diversity without systematically increasing intrinsic difficulty.
+
+**Solution: MathForge**, a two-part framework comprising:
+1. **DGPO** (Difficulty-Aware Group Policy Optimization): rectifies GRPO's imbalance via difficulty-balanced group advantage estimation (DGAE) and difficulty-aware question-level weighting (DQW)
+2. **MQR** (Multi-Aspect Question Reformulation): reformulates questions across multiple aspects (Background, Term, Sub-Problem) to increase difficulty while preserving the original gold answer
+
+## Key Findings
+
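+- GRPO's total update magnitude for a single question is ∝ 2G√(p(1-p)), where p is the question's accuracy rate over its G sampled responses; it peaks at p=0.5
+- DGAE replaces the standard deviation (std) with the mean absolute deviation (MAD), making the update magnitude a constant G regardless of accuracy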
+- MathForge achieves 42.17% avg on 6 benchmarks vs 37.61% for GRPO (Qwen2.5-Math-7B)
+- MQR generates three types of reformulations with 97-99% answer preservation rate
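
The first two findings can be checked with a short calculation. This is a sketch under two assumptions that the note does not spell out: rewards are binary ($r_i \in \{0,1\}$) with a fraction $p$ of the $G$ responses correct, and "update magnitude" means the sum of absolute advantages over the group. With the population standard deviation:

$$\text{mean} = p, \quad \text{std} = \sqrt{p(1-p)}, \quad \text{MAD} = \frac{1}{G}\sum_{i=1}^{G} |r_i - p| = 2p(1-p)$$

$$\sum_{i=1}^{G} |\hat{A}_{GR,i}| = \frac{2Gp(1-p)}{\sqrt{p(1-p)}} = 2G\sqrt{p(1-p)}, \qquad \sum_{i=1}^{G} |\hat{A}_{DG,i}| = \frac{2Gp(1-p)}{2p(1-p)} = G$$

For example, with $G = 8$ and one correct response ($p = 0.125$), the std-normalized sum is about 5.29, while the MAD-normalized sum is 8.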
+
+## Core Equations
+
+**GRPO Advantage (imbalanced)**:
+$$\hat{A}_{GR,i} = \frac{r_i - \text{mean}(\{r_i\}_{i=1}^G)}{\text{std}(\{r_i\}_{i=1}^G)}$$
+
+**DGAE Advantage (balanced)**:
+$$\hat{A}_{DG,i} = \frac{r_i - \text{mean}(\{r_i\}_{i=1}^G)}{\text{MAD}(\{r_i\}_{i=1}^G)}$$
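
The two advantage formulas differ only in the denominator, which a few lines of Python make concrete. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, the rewards are assumed binary, MAD is read as the mean absolute deviation, and returning zeros for a group with identical rewards is an assumed convention.

```python
from statistics import fmean, pstdev


def grpo_advantages(rewards):
    """Center by the group mean, scale by the population standard deviation."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:  # all rewards identical: no signal (assumed convention)
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


def dgae_advantages(rewards):
    """Same centering, but scale by the mean absolute deviation (MAD)."""
    mu = fmean(rewards)
    mad = fmean([abs(r - mu) for r in rewards])
    if mad == 0:  # all rewards identical: no signal (assumed convention)
        return [0.0] * len(rewards)
    return [(r - mu) / mad for r in rewards]


def update_magnitude(advantages):
    """Sum of absolute advantages over one question's group of responses."""
    return sum(abs(a) for a in advantages)


if __name__ == "__main__":
    G = 8
    for n_correct in (1, 4, 7):
        rewards = [1.0] * n_correct + [0.0] * (G - n_correct)
        print(
            n_correct / G,
            round(update_magnitude(grpo_advantages(rewards)), 3),
            round(update_magnitude(dgae_advantages(rewards)), 3),
        )
```

In this sketch the std-normalized magnitude varies with the number of correct responses, while the MAD-normalized magnitude equals the group size whenever the rewards are not all identical.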
+
+**DQW Weighting**:
+$$\lambda_s = B_v \cdot \frac{\exp(D_s/T)}{\sum_{s=1}^{B_v} \exp(D_s/T)}, \quad D_s = -\text{mean}(\{r_{si}\}_{i=1}^G)$$
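
The DQW formula is a temperature-scaled softmax over per-question difficulty, rescaled so the weights sum to $B_v$. The sketch below illustrates that reading; it is not the authors' code. The function name `dqw_weights` and the default temperature are hypothetical, and $B_v$ is assumed to be the number of questions passed in.

```python
import math
from statistics import fmean


def dqw_weights(group_rewards, temperature=1.0):
    """Return one weight per question: B_v * softmax(D_s / T), D_s = -mean reward.

    group_rewards: list of per-question reward lists (one list of G rewards each).
    temperature:   softmax temperature T (hypothetical default).
    """
    difficulties = [-fmean(rewards) for rewards in group_rewards]
    scaled = [d / temperature for d in difficulties]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    b_v = len(group_rewards)  # assumed meaning of B_v
    return [b_v * e / total for e in exps]


if __name__ == "__main__":
    batch = [
        [1, 1, 1, 0],  # easy question: 75% accuracy
        [1, 0, 0, 0],  # hard question: 25% accuracy
        [1, 1, 0, 0],  # medium question: 50% accuracy
    ]
    print(dqw_weights(batch, temperature=0.5))
```

Under these assumptions the weights average to 1 across the batch, so the weighting shifts emphasis toward lower-accuracy questions without changing the overall loss scale. A smaller temperature sharpens that shift.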
+
+## Key Concepts
+
+- [[dgpo|DGPO]] — Difficulty-Aware GRPO algorithm
+- [[dgae|DGAE]] — Difficulty-Balanced Group Advantage Estimation
+- [[dqw|DQW]] — Difficulty-Aware Question-Level Weighting
+- [[mqr|MQR]] — Multi-Aspect Question Reformulation
+- [[mathforge]] — The complete MathForge framework
+- [[grpo]] — Group Relative Policy Optimization
+- [[update-magnitude-imbalance]] — GRPO's implicit difficulty imbalance
+- [[math-question-reformulation]] — MQR's three reformulation strategies
+- [[rlvr-unified-framework]] — RLVR training paradigm