20260625: lots of new content
39
concepts/thinking-reward-model.md
Normal file
@@ -0,0 +1,39 @@
---
title: "Thinking Reward Model (TRM)"
created: 2026-06-24
updated: 2026-06-24
type: concept
tags: ["reward-model", "reasoning", "preference-optimization"]
sources:
- "[[me2-trm-reasoning-2026]]"
---

# Thinking Reward Model (TRM)

TRM is a model for evaluating the quality of reasoning traces, proposed by Zhang et al. (ICML 2026) and trained using the ME² principle and DAG modeling.

## Core design

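- **Evaluates reasoning quality only**: trained on pairs of verified-correct reasoning traces, so the score is decoupled from answer correctness
- **Pairwise preference**: uses a Bradley-Terry objective and does not rely on absolute scores
- **Lightweight**: Llama-3.1-8B with the LM head replaced by a scalar value head
- **Training data**: the TRM-Preference dataset (103K pairs)
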
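As a rough illustration of the pairwise objective named above, the standard Bradley-Terry loss for one preference pair can be sketched as follows. This is a minimal sketch in plain Python, not the authors' training code: the function names and the use of bare floats as stand-ins for the value head's scalar outputs are assumptions made for illustration.

```python
import math


def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the chosen trace beats the rejected one.

    Under the Bradley-Terry model, P(chosen > rejected) = sigmoid(s_c - s_r).
    The scores are hypothetical stand-ins for scalar reward-model outputs.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def preference_probability(score_chosen: float, score_rejected: float) -> float:
    """Probability the model assigns to the chosen trace being preferred."""
    return math.exp(-bradley_terry_loss(score_chosen, score_rejected))
```

Only the score difference enters the loss, so adding the same constant to both scores leaves it unchanged. That is the sense in which a pairwise objective needs no absolute rating scale.
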
## Comparison with PRM/ORM

| Dimension | PRM | ORM | TRM |
|-----------|-----|-----|-----|
| Evaluation granularity | Step level | Response level | Reasoning-trace level |
| Supervision | Absolute scores | Pairwise | Pairwise |
| Long-range dependencies | Weak | N/A | Strong (DAG-structured) |
| Decoupled from the answer | No (usually entangled) | Yes | Yes |

## Validation-set performance

On the validation set, TRM scores 88.6%, versus 62.5% for ReasonFlux-PRM-7B and 46.3% for Qwen2.5-Math-PRM-7B.

## References
- [[me2-trm-reasoning-2026]]
- [[me2-principle]]
- [[dag-reasoning-evaluation]]
- [[reward-model]]