---
title: "ME² + TRM: Complex Reasoning Optimization (Zhang et al., ICML 2026)"
created: 2026-06-24
updated: 2026-06-24
type: paper
tags: ["reasoning", "reward-model", "dag", "grpo", "test-time-scaling"]
sources:
- "https://arxiv.org/abs/2602.08498"
code: "https://github.com/Simplified-Reasoning/TRM"
---
# ME² + TRM: Representing, Evaluating, and Optimizing Complex Reasoning
> Zhang et al. | ICML 2026 | arXiv:2602.08498v2 | cs.CL
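## Motivation
[[large-reasoning-models|LRMs]] produce reasoning trajectories that are increasingly long and structurally complex, yet there is no unified answer to three questions: (1) What is high-quality reasoning? (2) How can it be evaluated reliably? (3) How can evaluation signals be used to optimize reasoning?
Limitations of existing methods: PRMs rely on step-level absolute scores and cannot capture long-range dependencies or non-linear structure; ORMs are designed to align final responses (helpful/honest/harmless) rather than to evaluate the quality of structured reasoning.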
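## Core Framework
### [[me2-principle|ME² Principle]]
Two orthogonal dimensions:
| | Macro (global) | Micro (local) |
|---|---|---|
| **Effectiveness** | Is the structure well organized, with no redundant branches? | Is each step correct and logical? |
| **Efficiency** | Is the reasoning path concise, with no detours? | Is each step compact, with no padding? |

Reasoning quality = Macro-Effectiveness × Macro-Efficiency × Micro-Effectiveness × Micro-Efficiency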
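
A minimal sketch of the multiplicative aggregation above. The `ME2Scores` class and the assumption that each dimension is a number in [0, 1] are illustrative only; the note does not say how (or whether) the paper assigns numeric values to each dimension.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ME2Scores:
    """Hypothetical container: four ME² scores, each assumed to lie in [0, 1]."""
    macro_effectiveness: float
    macro_efficiency: float
    micro_effectiveness: float
    micro_efficiency: float

    def quality(self) -> float:
        # Multiplicative aggregation: one weak dimension drags the total down.
        return (self.macro_effectiveness * self.macro_efficiency
                * self.micro_effectiveness * self.micro_efficiency)

    def mean(self) -> float:
        # Arithmetic mean, shown only for contrast with the product.
        return (self.macro_effectiveness + self.macro_efficiency
                + self.micro_effectiveness + self.micro_efficiency) / 4


balanced = ME2Scores(0.8, 0.8, 0.8, 0.8)
lopsided = ME2Scores(1.0, 1.0, 1.0, 0.2)  # perfect except for verbose steps

print(balanced.quality(), lopsided.quality())
print(balanced.mean(), lopsided.mean())
```

Both toy trajectories have the same mean score, but the product ranks the balanced one higher, which is the practical effect of multiplying the four dimensions rather than averaging them.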
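### [[dag-reasoning-evaluation|DAG Reasoning Modeling]]
A reasoning trajectory is abstracted as a DAG:
- Nodes: reasoning steps
- Edges: logical dependencies
- DAG vs. Tree: a tree cannot express merges (nodes with multiple predecessors); a DAG is a practical balance between expressiveness and tractability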
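
The DAG view can be sketched with the standard library alone. The step names and the dependency map below are a made-up toy trajectory, not data from the paper; the point is only to show a merge node, which a tree cannot represent.

```python
from graphlib import TopologicalSorter

# Hypothetical toy trajectory: each step maps to the set of steps it depends on.
deps = {
    "s1_parse_problem": set(),
    "s2_case_even": {"s1_parse_problem"},
    "s3_case_odd": {"s1_parse_problem"},
    "s4_combine_cases": {"s2_case_even", "s3_case_odd"},  # merge node
    "s5_final_answer": {"s4_combine_cases"},
}


def merge_nodes(graph):
    """Return steps with more than one predecessor (impossible in a tree)."""
    return sorted(step for step, preds in graph.items() if len(preds) > 1)


# static_order() raises graphlib.CycleError on cyclic input, so a successful
# call also confirms that the dependency graph is acyclic.
order = list(TopologicalSorter(deps).static_order())
merges = merge_nodes(deps)

print(order)
print(merges)
```

Any node reported by `merge_nodes` would force a tree representation to either duplicate a sub-derivation or drop one of the dependencies.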
### [[thinking-reward-model|Thinking Reward Model (TRM)]]
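Training pipeline:
1. Generate multiple candidate reasoning trajectories → build DAGs → annotate ME² pairwise preferences (DeepSeek-V3.2)
2. Build [[trm-preference-dataset|TRM-Preference]] (103K training pairs, 1.5K validation pairs)
3. Train TRM (Llama-3.1-8B + scalar head, Bradley-Terry loss)

**Core design**: TRM is trained only on verified-correct reasoning pairs, which decouples it from answer correctness so that it evaluates reasoning quality alone.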
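
For reference, the Bradley-Terry loss on a preference pair is the negative log-sigmoid of the score margin. The sketch below uses hand-picked floats in place of the scalar-head outputs; it is not the authors' training code, and the function names are illustrative.

```python
import math


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    margin = r_chosen - r_rejected
    # Equivalent to log(1 + exp(-margin)), rewritten to avoid overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def batch_loss(pairs):
    """Mean loss over (r_chosen, r_rejected) score pairs."""
    return sum(bradley_terry_loss(c, r) for c, r in pairs) / len(pairs)


# Hypothetical scalar scores for three preference pairs.
pairs = [(2.0, 0.5), (0.1, 0.0), (-1.0, 1.0)]
print(batch_loss(pairs))
```

The loss is log 2 when the two trajectories receive equal scores and shrinks toward zero as the preferred trajectory is scored increasingly higher than the rejected one.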
### [[reasoning-quality-optimization|Reasoning Quality Optimization]]
**Test-Time Scaling**: TRM Best-of-N selection → +19.3% (AIME24, N=16, Qwen3-8B: 44.7% → 64.0%)
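
Best-of-N selection itself is simple: score every sampled trajectory and keep the top one. In the sketch below, `toy_score` is a hypothetical stand-in for a reward model (it just prefers shorter text) and has nothing to do with how TRM actually scores a trajectory.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest score (first one wins on ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


def toy_score(trajectory: str) -> float:
    # Placeholder scorer, NOT a reward model: fewer words gives a higher score.
    return -float(len(trajectory.split()))


candidates = [
    "try x=1, fails, try x=2, fails, try x=3, works",
    "factor the quadratic, so x=3",
    "guess x=3, then verify by substitution, it works",
]
chosen = best_of_n(candidates, toy_score)
print(chosen)
```

With a real reward model, `score` would wrap a forward pass that returns the scalar-head output for one trajectory.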
**RL Training**: TRM-guided GRPO with gated reward shaping
$$r = r_v \cdot (1 - \alpha + \alpha \cdot \text{Sigmoid}(r_t))$$
r_v = outcome reward, r_t = thinking reward, α = balance weight
→ +3.9% across diverse tasks
## Key Results
| Method | Validation accuracy |
|------|------------|
| Qwen2.5-Math-PRM-7B | 46.3% |
| ReasonFlux-PRM-7B | 62.5% |
| PromptOnly (DeepSeek-V3.2) | 78.6% |
| **TRM (ours)** | **88.6%** |
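## Key Insights
1. **Decouple reasoning quality from answer correctness**: TRM is trained only on preference pairs of correct reasoning, showing that reasoning quality can be evaluated independently of answer correctness
2. **DAGs suit reasoning modeling better than trees**: merging (several steps resolving into one conclusion) is a common pattern in reasoning that a tree cannot express
3. **Structural signals matter**: direct prompt-based comparison yields many ties (232/1497), yet accuracy is 93% once ties are removed. With DAG structuring, ties drop to zero, indicating that structural signals are the key discriminator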
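
A quick arithmetic check on the tie figures in item 3. Counting ties as errors is an assumption made here, not something this note states; the 93% figure is also rounded, so the result is approximate.

```python
total_pairs = 1497
ties = 232
acc_without_ties = 0.93  # rounded figure quoted in item 3

decided = total_pairs - ties                  # pairs with a clear preference
tie_rate = ties / total_pairs                 # share of pairs left undecided
approx_correct = acc_without_ties * decided   # estimated correct decisions
acc_ties_as_errors = approx_correct / total_pairs

print(f"decided pairs: {decided}")
print(f"tie rate: {tie_rate:.1%}")
print(f"accuracy with ties counted as errors: {acc_ties_as_errors:.1%}")
```

The last value comes out close to the PromptOnly row in the results table, which is consistent with (but does not confirm) that row treating ties as incorrect.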
## Sources
[Original archive](raw/papers/me2-trm-reasoning-2026.md) | [arXiv](https://arxiv.org/abs/2602.08498) | [GitHub](https://github.com/Simplified-Reasoning/TRM)