20260625: lots of new content

2026-06-25 14:08:47 +08:00
parent 91fac5b6fc
commit 6021dea160
375 changed files with 19263 additions and 251 deletions


@@ -0,0 +1,40 @@
---
title: "Characterizing, Evaluating, and Optimizing Complex Reasoning (ME² + TRM)"
author: "Haoran Zhang, Yafu Li, Zhi Wang, Zhilin Wang, Shunkai Zhang, Xiaoye Qu, Yu Cheng"
source: "arXiv 2602.08498v2"
date: "2026-02-09 (updated 2026-06-03)"
type: paper
venue: "ICML 2026 (cs.CL)"
tags: ["reasoning", "reward-model", "dag", "grpo", "test-time-scaling", "rl"]
code: "https://github.com/Simplified-Reasoning/TRM"
---
# Characterizing, Evaluating, and Optimizing Complex Reasoning
> Zhang, Li, Wang, Wang, Zhang, Qu, Cheng | SJTU / Shanghai AI Lab / CUHK / NJU / USTC / PKU
> ICML 2026 | arXiv:2602.08498v2 | cs.CL
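## Three Core Questions
1. **Q1**: What defines high-quality reasoning?
2. **Q2**: How can long, implicitly structured reasoning traces be evaluated reliably?
3. **Q3**: How can this evaluation signal be used to optimize reasoning?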
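## Core Approach
### The ME² Principle
Reasoning quality is characterized along two orthogonal axes:
- **Macro vs Micro**: global structural organization vs local step-level properties
- **Effectiveness vs Efficiency**: how effective the reasoning is vs how efficient it is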
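### Modeling Reasoning as a DAG
Reasoning traces are abstracted as directed acyclic graphs (DAGs) that explicitly model progression, branching, and merging. A DAG is a practical compromise between a tree and a complete graph: it captures rich structure while preserving a topological order consistent with the generation order.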
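
The DAG abstraction can be illustrated with a small sketch. The toy trace, the `classify` labels, and the helper names below are hypothetical and are not taken from the paper or its code. They only show how progression, branching, and merging appear in a dependency graph whose topological order agrees with generation order.

```python
from graphlib import TopologicalSorter

# Hypothetical toy trace: step index -> indices of earlier steps it builds on.
# Step 0 progresses to 1, step 1 branches into 2 and 3, step 4 merges 2 and 3.
trace = {0: [], 1: [0], 2: [1], 3: [1], 4: [2, 3]}


def is_generation_order_valid(deps):
    """Return True if every step depends only on steps generated before it."""
    return all(p < step for step, parents in deps.items() for p in parents)


def classify(deps):
    """Label each step as 'progress', 'branch', or 'merge' (illustrative labels)."""
    children = {step: 0 for step in deps}
    for parents in deps.values():
        for p in parents:
            children[p] += 1
    labels = {}
    for step, parents in deps.items():
        if len(parents) > 1:
            labels[step] = "merge"      # joins several earlier lines of thought
        elif children[step] > 1:
            labels[step] = "branch"     # opens several alternative continuations
        else:
            labels[step] = "progress"   # plain step-to-step advancement
    return labels


# TopologicalSorter takes a mapping node -> predecessors and raises CycleError
# if the graph is not acyclic, so this also checks the "A" in DAG.
order = list(TopologicalSorter(trace).static_order())
labels = classify(trace)
```

Because every edge points from an earlier step to a later one, the generation order `0, 1, 2, 3, 4` is itself a valid topological order, which is the property the note highlights.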
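### Thinking Reward Model (TRM)
- Build the TRM-Preference dataset (103K training pairs) from ME² + DAG pairwise evaluation
- Train a lightweight TRM (Llama-3.1-8B → scalar head) with a Bradley-Terry objective
- Key point: the TRM is trained only on preference pairs of verified-correct reasoning, decoupling it from answer-correctness supervision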
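
As a reminder of what a Bradley-Terry objective computes, here is a minimal sketch in plain Python. It assumes the standard pairwise form, loss = -log σ(s_chosen - s_rejected), applied to scalar scores. The function names and the example scores are illustrative, and the paper's actual training code may differ.

```python
import math


def bradley_terry_loss(score_chosen, score_rejected):
    """Pairwise loss -log(sigmoid(score_chosen - score_rejected)).

    Computed as softplus(-margin) in a numerically stable form.
    """
    margin = score_chosen - score_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def mean_pairwise_loss(pairs):
    """Average the loss over (score_chosen, score_rejected) pairs."""
    return sum(bradley_terry_loss(c, r) for c, r in pairs) / len(pairs)


# Hypothetical scalar-head outputs for three preference pairs.
example_pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0)]
example_loss = mean_pairwise_loss(example_pairs)
```

The loss shrinks as the preferred trace is scored further above the rejected one, and equals log 2 when the two scores tie. Since both traces in a pair are verified correct, the signal reflects reasoning quality rather than answer correctness.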
### Optimization Signals
- Test-time: Best-of-N selection → +19.3% (AIME24, Qwen3-8B)
- Training: TRM-guided GRPO with gated reward shaping → +3.9%
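
Both uses of the TRM score can be sketched in a few lines. The note does not spell out the gate, so the version below is an assumption: the TRM bonus is added only when the answer is verified correct. The `weight` parameter, the 0/1 base reward, and all function names are likewise hypothetical. The group-normalized advantage is the usual GRPO form, (r - mean) / std within a sampled group.

```python
import math


def best_of_n(candidates, score_fn):
    """Test-time selection: return the candidate with the highest TRM score."""
    return max(candidates, key=score_fn)


def gated_reward(is_correct, trm_score, weight=0.5):
    """Assumed gate: the TRM bonus counts only for verified-correct answers."""
    base = 1.0 if is_correct else 0.0
    bonus = weight * trm_score if is_correct else 0.0
    return base + bonus


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical TRM scores for three sampled traces of one prompt.
toy_scores = {"trace_a": 0.1, "trace_b": 0.9, "trace_c": 0.4}
selected = best_of_n(list(toy_scores), toy_scores.get)

# Hypothetical group of four rollouts: (answer correct?, TRM score).
rollouts = [(False, 0.8), (True, 0.5), (True, 0.0), (False, 0.1)]
advantages = group_advantages([gated_reward(c, s) for c, s in rollouts])
```

Under this assumed gate, an incorrect rollout cannot earn reward from well-structured reasoning alone, while correct rollouts are ranked against each other by their TRM score.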