
title: Characterizing, Evaluating, and Optimizing Complex Reasoning (ME² + TRM)
author: Haoran Zhang, Yafu Li, Zhi Wang, Zhilin Wang, Shunkai Zhang, Xiaoye Qu, Yu Cheng
source: arXiv 2602.08498v2
date: 2026-02-09 (updated 2026-06-03)
type: paper
venue: ICML 2026 (cs.CL)
tags: reasoning, reward-model, dag, grpo, test-time-scaling, rl
code: https://github.com/Simplified-Reasoning/TRM

Characterizing, Evaluating, and Optimizing Complex Reasoning

Zhang, Li, Wang, Wang, Zhang, Qu, Cheng | SJTU / Shanghai AI Lab / CUHK / NJU / USTC / PKU | ICML 2026 | arXiv:2602.08498v2 | cs.CL

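Three core questions

  1. Q1: What defines high-quality reasoning?
  2. Q2: How can long, implicitly structured reasoning traces be evaluated reliably?
  3. Q3: How can this evaluation signal be used to optimize reasoning?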

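Core approach

ME² principle

Reasoning quality is characterized along two orthogonal axes:

  • Macro vs Micro: global structural organization vs properties of individual local steps
  • Effectiveness vs Efficiency: whether the reasoning works vs how economically it gets there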

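DAG reasoning modeling

The reasoning trace is abstracted as a directed acyclic graph (DAG) that explicitly models progression, branching, and merging. A DAG is a practical compromise between a tree and a complete graph: it captures rich structure while keeping a topological order consistent with the generation order.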
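
To make the DAG abstraction concrete, here is a minimal Python sketch, not the authors' implementation. The class `ReasoningDAG`, its methods, and the example step texts are all hypothetical names chosen for illustration. The sketch assumes node ids are assigned in generation order and that a step may only depend on earlier steps, which is one simple way to keep the generation order a valid topological order.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ReasoningDAG:
    """Toy DAG over reasoning steps (hypothetical); node ids follow generation order."""
    steps: List[str] = field(default_factory=list)
    parents: Dict[int, List[int]] = field(default_factory=dict)

    def add_step(self, text: str, parents: Optional[List[int]] = None) -> int:
        node_id = len(self.steps)
        parent_ids = list(parents or [])
        # Edges may only run from earlier steps to later ones. This rules out
        # cycles and keeps generation order a topological order.
        if any(p < 0 or p >= node_id for p in parent_ids):
            raise ValueError("parents must be previously generated steps")
        self.steps.append(text)
        self.parents[node_id] = parent_ids
        return node_id

    def children(self, node_id: int) -> List[int]:
        return [n for n, ps in self.parents.items() if node_id in ps]

    def branch_points(self) -> List[int]:
        # A branch: one step feeds more than one later step.
        return [n for n in range(len(self.steps)) if len(self.children(n)) > 1]

    def merge_points(self) -> List[int]:
        # A merge: one step depends on more than one earlier step.
        return [n for n in range(len(self.steps)) if len(self.parents[n]) > 1]


dag = ReasoningDAG()
root = dag.add_step("Restate the problem")
a = dag.add_step("Try an algebraic approach", [root])
b = dag.add_step("Try a geometric approach", [root])
final = dag.add_step("Combine both results", [a, b])
```

In this toy trace the first step is a branch point and the last step is a merge point, while steps with a single parent and a single child represent plain progression.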

Thinking Reward Model (TRM)

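  • Build the TRM-Preference dataset (103K training pairs) from ME² + DAG pairwise evaluation
  • Train a lightweight TRM with a Bradley-Terry objective (Llama-3.1-8B → scalar head)
  • Key point: the TRM is trained only on preference pairs of verified-correct reasoning, which decouples it from answer-correctness supervision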
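
As a reminder of what a Bradley-Terry objective looks like for a single preference pair, the sketch below computes the standard pairwise loss, the negative log of sigmoid(score of preferred trace minus score of rejected trace). The function name `bradley_terry_loss` and the scalar inputs are illustrative assumptions; the paper's training code may batch, scale, or regularize this differently.

```python
import math


def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the chosen trace beats the rejected one.

    Under the Bradley-Terry model, P(chosen > rejected) = sigmoid(margin),
    where margin = score_chosen - score_rejected.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin)).
    # Branch on the sign so exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks toward zero as the scalar head scores the preferred trace further above the rejected one, and equals log 2 when the two scores tie.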

Optimization signals

  • Test-time: Best-of-N selection → +19.3% (AIME24, Qwen3-8B)
  • Training: TRM-guided GRPO with gated reward shaping → +3.9%
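
The two uses of the TRM signal can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's method: these notes do not spell out the gate, so `gated_reward` assumes the TRM term is added only when the answer is verified correct, and the `weight` value is hypothetical. `best_of_n` and `group_relative_advantages` follow the usual definitions of Best-of-N selection and GRPO-style within-group reward standardization; all function names are made up for this example.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Test-time use: return the candidate the reward model scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


def gated_reward(is_correct: bool, trm_score: float, weight: float = 0.5) -> float:
    """Training use, ASSUMED gate: TRM shaping applies only to correct answers.

    Both the gating rule and the default weight are assumptions for this sketch.
    """
    outcome = 1.0 if is_correct else 0.0
    shaping = weight * trm_score if is_correct else 0.0
    return outcome + shaping


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of four rollouts as (answer verified correct?, TRM score) pairs.
rollouts = [(True, 0.8), (True, 0.2), (False, 0.9), (False, 0.1)]
rewards = [gated_reward(ok, s) for ok, s in rollouts]
advantages = group_relative_advantages(rewards)
```

With this assumed gate, an incorrect rollout receives no credit no matter how highly the TRM rates its reasoning, so the shaping term only ranks rollouts that already reach the right answer.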