---
title: "Characterizing, Evaluating, and Optimizing Complex Reasoning (ME² + TRM)"
author: "Haoran Zhang, Yafu Li, Zhi Wang, Zhilin Wang, Shunkai Zhang, Xiaoye Qu, Yu Cheng"
source: "arXiv 2602.08498v2"
date: "2026-02-09 (updated 2026-06-03)"
type: paper
venue: "ICML 2026 (cs.CL)"
tags: ["reasoning", "reward-model", "dag", "grpo", "test-time-scaling", "rl"]
code: "https://github.com/Simplified-Reasoning/TRM"
---
# Characterizing, Evaluating, and Optimizing Complex Reasoning
> Zhang, Li, Wang, Wang, Zhang, Qu, Cheng | SJTU / Shanghai AI Lab / CUHK / NJU / USTC / PKU
> ICML 2026 | arXiv:2602.08498v2 | cs.CL
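## Three core questions
1. **Q1**: What defines high-quality reasoning?
2. **Q2**: How can long, implicitly structured reasoning traces be evaluated reliably?
3. **Q3**: How can that evaluation signal be used to optimize reasoning?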
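## Core approach
### The ME² principle
Reasoning quality is characterized along two orthogonal axes:
- **Macro vs Micro**: global structural organization vs local step-level properties
- **Effectiveness vs Efficiency**: how well the reasoning works vs how economically it gets there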
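### DAG reasoning model
Each reasoning trace is abstracted as a directed acyclic graph (DAG) that explicitly models progression, branching, and merging. A DAG is a practical compromise between a tree and a complete graph: it captures rich structure while keeping a topological order consistent with the generation order.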
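As a rough illustration of the DAG view, the sketch below stores a trace as numbered steps plus dependency edges and then reports which steps branch and which merge. It assumes steps are indexed by generation order and that every edge points from an earlier step to a later one, which is what keeps the generation order a valid topological order. The function name `classify_steps` and the edge-list encoding are hypothetical and are not taken from the paper or its code.

```python
from collections import defaultdict

def classify_steps(num_steps, edges):
    """Return (branch_steps, merge_steps) for a reasoning trace.

    Assumption for this sketch: steps are numbered 0..num_steps-1 in the
    order they were generated, and each edge (src, dst) means step dst
    builds on step src.
    """
    out_degree = defaultdict(int)
    in_degree = defaultdict(int)
    for src, dst in edges:
        # Requiring src < dst rules out cycles and guarantees that the
        # generation order is itself a topological order of the graph.
        if not (0 <= src < dst < num_steps):
            raise ValueError(f"edge {(src, dst)} violates generation order")
        out_degree[src] += 1
        in_degree[dst] += 1
    # A branch step feeds more than one later step; a merge step
    # combines more than one earlier step.
    branch_steps = [i for i in range(num_steps) if out_degree[i] > 1]
    merge_steps = [i for i in range(num_steps) if in_degree[i] > 1]
    return branch_steps, merge_steps

# Toy trace: step 0 branches into steps 1 and 2, which merge at step 3.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
branch_steps, merge_steps = classify_steps(5, edges)
```

A tree would forbid the merge at step 3, while a complete graph would connect every pair of steps and lose the ordering constraint. The `src < dst` check is the piece that encodes the compromise described above.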
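### Thinking Reward Model (TRM)
- The TRM-Preference dataset (103K training pairs) is built from ME² + DAG pairwise evaluation
- A lightweight TRM (Llama-3.1-8B → scalar head) is trained with a Bradley-Terry objective
- Key point: the TRM is trained only on preference pairs of verified-correct reasoning, decoupling it from answer-correctness supervision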
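For reference, the standard Bradley-Terry pairwise loss on two scalar rewards can be written in a few lines. This is a minimal sketch of the generic objective, `-log(sigmoid(r_chosen - r_rejected))`, and not the paper's training code. The function name and the plain-float interface are assumptions made for illustration; a real setup would compute this over batches of scalar-head outputs.

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the chosen trace beats the rejected one.

    Under Bradley-Terry, P(chosen > rejected) = sigmoid(r_chosen - r_rejected),
    so the loss is -log(sigmoid(margin)) = log(1 + exp(-margin)).
    """
    margin = r_chosen - r_rejected
    # Numerically stable form: never exponentiate a large positive number.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# The loss shrinks as the preferred trace is scored further above the other.
loss_tied = bradley_terry_loss(0.0, 0.0)      # equals log(2)
loss_correct = bradley_terry_loss(2.0, 0.0)   # ranking agrees with the label
loss_wrong = bradley_terry_loss(0.0, 2.0)     # ranking contradicts the label
```

Because only the difference between the two rewards enters the loss, the scalar head learns a relative ranking of reasoning traces, not an absolute score.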
### Optimization signals
- Test-time: Best-of-N selection → +19.3% (AIME24, Qwen3-8B)
- Training: TRM-guided GRPO with gated reward shaping → +3.9%
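Both uses of the TRM signal can be sketched briefly. Best-of-N simply keeps the candidate that the reward model scores highest. For the training side, this note does not say how the gate works, so the sketch assumes one plausible reading: the TRM term is added only when the final answer is verified correct. The function names, the `weight` parameter, and its default value are all hypothetical.

```python
from typing import Callable, Sequence

def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest reward-model score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)

def gated_reward(is_correct: bool, trm_score: float, weight: float = 0.1) -> float:
    """Outcome reward plus a TRM bonus that is gated on correctness.

    Assumption for this sketch: the gate opens only for verified-correct
    answers, so a high TRM score can never compensate for a wrong answer.
    """
    outcome = 1.0 if is_correct else 0.0
    bonus = weight * trm_score if is_correct else 0.0
    return outcome + bonus

# Toy usage: string length stands in for a real TRM scoring function.
picked = best_of_n(["a", "bbb", "cc"], score=len)
reward_wrong = gated_reward(False, trm_score=5.0)
reward_right = gated_reward(True, trm_score=2.0, weight=0.5)
```

Gating on correctness would match the note's point that the TRM is trained only on verified-correct reasoning, but the actual shaping rule should be checked against the paper or the linked repository.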