---
title: "Thinking Reward Model (TRM)"
created: 2026-06-24
updated: 2026-06-24
type: concept
tags: ["reward-model", "reasoning", "preference-optimization"]
sources:
- "[[me2-trm-reasoning-2026]]"
---
# Thinking Reward Model (TRM)
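TRM is a reward model for assessing the quality of reasoning traces, proposed by Zhang et al. (ICML 2026). It is trained using the ME² principle and DAG modeling.
## Core design
- **Evaluates reasoning quality only**: trained on pairs of verified-correct reasoning traces, so the score is decoupled from answer correctness
- **Pairwise preference**: a Bradley-Terry objective, with no reliance on absolute scores
- **Lightweight**: Llama-3.1-8B with the LM head replaced by a scalar value head
- **Training data**: the TRM-Preference dataset (103K pairs)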
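
The pairwise objective listed above can be illustrated with a minimal sketch. It is not the authors' training code. `scalar_head`, its weights, and the toy hidden-state vectors are hypothetical stand-ins for a value head on top of a language model. Only the Bradley-Terry form of the loss, `-log(sigmoid(s_preferred - s_rejected))`, comes from the standard definition.

```python
import math


def scalar_head(hidden, weights, bias=0.0):
    """Project a final hidden-state vector to one scalar score.

    Hypothetical stand-in for a value head that replaces the LM head.
    """
    return sum(h * w for h, w in zip(hidden, weights)) + bias


def bradley_terry_loss(score_preferred, score_rejected):
    """Negative log-likelihood that the preferred trace wins the comparison."""
    margin = score_preferred - score_rejected
    # Numerically stable softplus(-margin) = log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def preference_probability(score_preferred, score_rejected):
    """Bradley-Terry probability that the preferred trace is ranked higher."""
    return math.exp(-bradley_terry_loss(score_preferred, score_rejected))


# Toy values for illustration only (assumed, not taken from the paper).
weights = [0.5, -0.25, 1.0]
good = scalar_head([1.0, 0.0, 2.0], weights)
bad = scalar_head([0.0, 2.0, 0.5], weights)
loss = bradley_terry_loss(good, bad)
```

Because the loss depends only on the score difference, adding the same constant to both scores leaves it unchanged. That is the sense in which a pairwise objective does not rely on absolute scores.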
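## Comparison with PRM/ORM
| Dimension | PRM | ORM | TRM |
|------|-----|-----|-----|
| Evaluation granularity | Step level | Response level | Reasoning-trace level |
| Supervision | Absolute scores | Pairwise | Pairwise |
| Long-range dependencies | Weak | N/A | Strong (DAG-structured) |
| Decoupled from answer correctness | No (usually entangled) | Yes | Yes |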
## Validation-set performance
TRM: 88.6% vs ReasonFlux-PRM-7B: 62.5% vs Qwen2.5-Math-PRM-7B: 46.3%
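
This note does not name the metric behind these percentages. If they are read as pairwise preference accuracy (an assumption), the computation could look like the sketch below. `pairwise_accuracy` and the toy scores are hypothetical and are not data from the paper.

```python
def pairwise_accuracy(scored_pairs):
    """Fraction of pairs where the preferred trace gets the strictly higher score.

    Each item is (score_of_preferred_trace, score_of_rejected_trace).
    """
    if not scored_pairs:
        raise ValueError("need at least one scored pair")
    correct = sum(1 for preferred, rejected in scored_pairs if preferred > rejected)
    return correct / len(scored_pairs)


# Made-up scores for illustration only.
toy_pairs = [(2.1, 0.3), (0.4, 1.2), (1.5, 1.1), (0.9, -0.2)]
accuracy = pairwise_accuracy(toy_pairs)  # 3 of 4 pairs ranked correctly
```

Ties count as errors here, which is one possible convention. An evaluation that scores ties as half-correct would give slightly different numbers.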
## References
- [[me2-trm-reasoning-2026]]
- [[me2-principle]]
- [[dag-reasoning-evaluation]]
- [[reward-model]]