---
title: "Reasoning Quality Optimization"
created: 2026-06-24
updated: 2026-06-24
type: concept
tags: ["reasoning", "optimization", "rl", "test-time-scaling"]
sources:
- "[[me2-trm-reasoning-2026]]"
---

# Reasoning Quality Optimization

A methodology that uses the quality of reasoning traces as an optimization signal, systematically validated by Zhang et al. (ICML 2026) within the ME² + TRM framework.

## Two Optimization Modes

### Test-Time Scaling (Best-of-N)
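- The TRM scores N candidate reasoning traces.
- The trace best aligned with the ME² principles is selected.
- AIME24: Qwen3-8B improves from 44.7% (N=1) to 64.0% (N=16), a gain of 19.3 percentage points.
- Even though the TRM is never supervised on answer correctness, better reasoning leads to better outcomes.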
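
The selection step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `best_of_n`, `score_fn`, and `toy_score` are hypothetical names, and the keyword-counting scorer only stands in for a learned TRM so that the example runs on its own.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score_fn: Callable[[str], float]) -> str:
    """Return the candidate reasoning trace with the highest score.

    `score_fn` is a hypothetical stand-in for a TRM: it maps a trace to a scalar.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    # max() returns the first maximal element, so ties go to the earliest candidate.
    return max(candidates, key=score_fn)


def toy_score(trace: str) -> float:
    # Toy scorer for illustration only: counts explicit "check" steps.
    # A real TRM would be a learned model, not a keyword counter.
    return float(trace.lower().count("check"))


candidates = [
    "Guess 12.",
    "Compute 3 * 4 = 12. Check: 12 / 4 = 3.",
    "Compute 3 * 4 = 12.",
]
selected = best_of_n(candidates, toy_score)
```

Note that the scorer never sees a reference answer: selection depends only on the trace itself, which mirrors the point that the TRM is not supervised on answer correctness.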

### RL Training (GRPO + Thinking Rewards)
Gated reward shaping:

$$r = r_v \cdot (1 - \alpha + \alpha \cdot \text{Sigmoid}(r_t))$$

- r_v: verifiable reward (answer correctness, 0 or 1)
- r_t: thinking reward (reasoning quality, output by the TRM)
- α: balancing weight

Effect: +3.9% across diverse tasks
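
A direct transcription of the gated reward formula may help make its behavior concrete. The function names are hypothetical, and restricting α to the interval [0, 1] is an assumption made here for the sketch; the note itself only calls α a balancing weight.

```python
import math


def sigmoid(x: float) -> float:
    # Numerically stable logistic function (avoids overflow for large |x|).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_reward(r_v: float, r_t: float, alpha: float) -> float:
    """Compute r = r_v * (1 - alpha + alpha * sigmoid(r_t)).

    r_v   : verifiable reward (0 or 1)
    r_t   : thinking reward (TRM output, any real number)
    alpha : balancing weight; the [0, 1] range is an assumption of this sketch
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1]")
    return r_v * (1.0 - alpha + alpha * sigmoid(r_t))


wrong_answer = gated_reward(r_v=0.0, r_t=5.0, alpha=0.5)    # gated to zero
neutral_trace = gated_reward(r_v=1.0, r_t=0.0, alpha=0.5)   # 0.5 + 0.5 * 0.5
```

Because r_v multiplies the whole expression, an incorrect answer receives zero reward no matter how high r_t is, while a correct answer receives a reward strictly between 1 - α and 1 whenever α > 0. With α = 0.5 and r_t = 0, a correct answer gets 0.75.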
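
## Key Insight

The TRM's training data contains only verified-correct reasoning pairs, so the thinking reward selects "the better of two correct reasoning traces" rather than distinguishing "correct vs. incorrect". In GRPO, this naturally shapes a preference over reasoning paths without requiring any additional answer signal.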
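
To illustrate how this plays out inside a GRPO group, the sketch below combines the gated reward with the group-normalized advantage commonly used in GRPO, (r_i - mean) / std. The rollout values are invented for illustration, and the choice of population standard deviation and the `eps` constant are assumptions of this sketch, not details taken from the source.

```python
import math
from statistics import mean, pstdev


def gated_reward(r_v: float, r_t: float, alpha: float) -> float:
    # r = r_v * (1 - alpha + alpha * sigmoid(r_t)); fine for moderate r_t values.
    return r_v * (1.0 - alpha + alpha / (1.0 + math.exp(-r_t)))


def group_advantages(rewards, eps=1e-8):
    # Group-relative advantage: center and scale rewards within one prompt's group.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group where every rollout is correct (r_v = 1) but r_t differs.
thinking_scores = [2.0, 0.0, -2.0]

# With the verifiable reward alone, all rewards are equal and advantages vanish.
plain = group_advantages([1.0 for _ in thinking_scores])

# With gating, better reasoning gets a positive advantage, worse gets negative.
gated = group_advantages([gated_reward(1.0, r_t, alpha=0.5) for r_t in thinking_scores])
```

In the all-correct group, the verifiable reward by itself provides no learning signal, whereas the gated reward still ranks the rollouts by reasoning quality. This is one way to read the claim that the thinking reward shapes path preferences among correct traces.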

## References

- [[me2-trm-reasoning-2026]]
- [[thinking-reward-model]]
- [[grpo]]
- [[me2-principle]]