---
title: "Reasoning Quality Optimization"
created: 2026-06-24
updated: 2026-06-24
type: concept
tags: ["reasoning", "optimization", "rl", "test-time-scaling"]
sources:
- "[[me2-trm-reasoning-2026]]"
---
# Reasoning Quality Optimization
A methodology that uses the quality of reasoning traces as an optimization signal, systematically validated by Zhang et al. (ICML 2026) in the ME² + TRM framework.
## Two Optimization Modes
### Test-Time Scaling (Best-of-N)
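- The TRM scores N candidate reasoning traces
- The trace best aligned with the ME² principles is selected
- AIME24: Qwen3-8B improves from 44.7% (N=1) to 64.0% (N=16), a gain of 19.3 points
- Even though the TRM receives no supervision on answer correctness, better reasoning leads to better outcomes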
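
The selection step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `best_of_n` and `toy_trm_score` are hypothetical names, and the keyword-counting scorer merely stands in for a trained TRM.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score_fn: Callable[[str], float]) -> str:
    """Return the highest-scoring candidate; ties go to the earliest one."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score_fn)


def toy_trm_score(trace: str) -> float:
    """Hypothetical stand-in for a TRM: counts explicit inference steps."""
    return float(trace.count("therefore"))


traces = [
    "guess 7",
    "x+3=10, therefore x=7",
    "x+3=10, therefore x=10-3, therefore x=7",
]
chosen = best_of_n(traces, toy_trm_score)
print(chosen)
```

Note that the scorer never sees whether the final answer is right; it ranks traces only by a proxy for reasoning quality, which mirrors the setup described in the list above.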
### RL Training (GRPO + Thinking Rewards)
Gated reward shaping:
$$r = r_v \cdot (1 - \alpha + \alpha \cdot \text{Sigmoid}(r_t))$$
- r_v: verifiable reward (answer correctness, 0 or 1)
- r_t: thinking reward (reasoning quality, the TRM output)
- α: balancing weight

Effect: +3.9% across diverse tasks
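
As a rough sketch of the gated formula above: `gated_reward` and `sigmoid` are hypothetical helper names, and restricting α to [0, 1] is an assumption made here, not something the source states.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_reward(r_v: float, r_t: float, alpha: float) -> float:
    """r = r_v * (1 - alpha + alpha * Sigmoid(r_t)).

    r_v   -- verifiable reward, 0 or 1
    r_t   -- thinking reward (raw TRM score, any real number)
    alpha -- balancing weight, assumed here to lie in [0, 1]
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1]")
    return r_v * (1.0 - alpha + alpha * sigmoid(r_t))


# A wrong answer is gated to zero no matter how good the reasoning looks.
wrong = gated_reward(r_v=0.0, r_t=5.0, alpha=0.5)
# A correct answer with a neutral thinking score (Sigmoid(0) = 0.5).
neutral = gated_reward(r_v=1.0, r_t=0.0, alpha=0.5)
print(wrong, neutral)
```

Under that assumption on α, a wrong answer always receives reward 0, while a correct answer receives a reward strictly between 1 − α and 1, so the thinking reward can only rescale rewards that correctness has already earned.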
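## Key Insight
The TRM's training data contains only verified-correct reasoning pairs, so the thinking reward selects the better of several correct reasoning traces rather than separating correct from incorrect. Within GRPO, this naturally shapes a preference over reasoning paths without requiring any additional answer signal.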
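
To see how this plays out inside a GRPO group, the sketch below applies the commonly used group-relative normalization (reward minus group mean, divided by group standard deviation) to rewards computed with the gated formula. The α value, the sample scores, the use of population standard deviation, and all function names are illustrative assumptions rather than details from the source.

```python
import math
import statistics
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_reward(r_v: float, r_t: float, alpha: float) -> float:
    """r = r_v * (1 - alpha + alpha * Sigmoid(r_t))."""
    return r_v * (1.0 - alpha + alpha * sigmoid(r_t))


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one sampled group (population std assumed)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


ALPHA = 0.5  # illustrative value only
answer_correct = [1.0, 1.0, 0.0, 0.0]    # r_v for four sampled responses
thinking_scores = [2.0, -1.0, 3.0, 0.0]  # made-up TRM scores r_t

rewards = [gated_reward(v, t, ALPHA) for v, t in zip(answer_correct, thinking_scores)]
advantages = group_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.4f}  advantage={a:+.4f}")
```

Here the two correct responses receive different positive advantages, ordered by their thinking scores, while the two incorrect responses share the same negative advantage regardless of how their reasoning was scored.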
## References
- [[me2-trm-reasoning-2026]]
- [[thinking-reward-model]]
- [[grpo]]
- [[me2-principle]]