20260625: lots of new content
44
concepts/hybrid-reasoning-models.md
Normal file
@@ -0,0 +1,44 @@
---
title: "Hybrid Reasoning Models"
created: 2026-06-18
updated: 2026-06-18
type: concept
tags: [reasoning, efficiency, rl, thinking]
sources:
- gan-thinking-based-non-thinking-2026
---
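
# Hybrid Reasoning Models

Hybrid reasoning models are reasoning models that **dynamically decide whether to activate thinking mode**, switching automatically between [[thinking-mode|thinking mode]] and [[non-thinking-mode|non-thinking mode]] according to query complexity (Zhang et al., 2025; Fang et al., 2025; Tu et al., 2025).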
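
## Motivation: Addressing Overthinking

The strong performance of [[large-reasoning-models|large reasoning models]] depends on long chains of thought ([[chain-of-thought|CoT]]), but this leads to **[[overthinking|overthinking]]**: verbose, repetitive output on simple problems, which substantially increases inference cost and latency.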
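
## Training Methods

### Reinforcement Learning (Mainstream)
- Assign a **higher reward** to correct answers produced in non-thinking mode
- This incentivizes the model to skip thinking on simple problems
- Representative methods: Thinkless, AdaptThink, AutoThink, TNT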
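
The reward idea in the list above can be sketched in a few lines. This is a minimal illustration, not the reward function of any of the named methods: the `Rollout` type, the base reward of 1.0, and the `NON_THINKING_BONUS` value are all assumptions, since the note only states that correct non-thinking answers receive a higher reward.

```python
from dataclasses import dataclass

# Hypothetical bonus size; the note only says "higher reward".
NON_THINKING_BONUS = 0.5


@dataclass
class Rollout:
    correct: bool        # did the final answer match the reference?
    used_thinking: bool  # was the response produced in thinking mode?


def mode_aware_reward(rollout: Rollout, bonus: float = NON_THINKING_BONUS) -> float:
    """Reward correct answers, with an extra bonus when thinking was skipped."""
    if not rollout.correct:
        # Wrong answers earn nothing in either mode (an assumption of this sketch).
        return 0.0
    return 1.0 + (0.0 if rollout.used_thinking else bonus)
```

Because the bonus is only paid for correct answers, skipping thinking on a problem the model then gets wrong is never rewarded, which is what pushes the shortcut toward simple problems.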

### Supervised Fine-Tuning
- Uses an SFT dataset **far larger** than the RL dataset to fix the output format
- Used by Thinkless and similar methods, but computationally expensive

## Key Challenge

Hybrid reasoning models trained with RL are prone to **[[reward-hacking|reward hacking]]**: the model embeds thinking content in its non-thinking-mode output in order to collect the extra reward.
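
To make this failure mode concrete, the sketch below flags non-thinking responses that are suspiciously long, a rough sign that reasoning may have been moved into the answer. It is an illustrative heuristic only, not a method taken from the cited sources: the whitespace tokenization, the `MAX_NON_THINKING_TOKENS` threshold, and the function name are all hypothetical.

```python
# Hypothetical length cap for a genuine non-thinking answer.
MAX_NON_THINKING_TOKENS = 64


def looks_like_reward_hacking(
    response: str,
    used_thinking: bool,
    max_tokens: int = MAX_NON_THINKING_TOKENS,
) -> bool:
    """Flag non-thinking responses whose length suggests hidden reasoning."""
    if used_thinking:
        # Long output is expected in thinking mode, so never flag it.
        return False
    # Whitespace splitting stands in for a real tokenizer (an assumption).
    return len(response.split()) > max_tokens
```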

## Mode Identification

1. **First-token based**: whether the first token is `</think>` (Zhang et al., Tu et al., TNT)
2. **Special-token based**: whether the first token is `<short>` (Fang et al., Jiang et al.)
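
Both conventions come down to inspecting the first generated token, as the following sketch shows. It assumes that either marker in first position signals non-thinking mode (an immediately closed think block, or an explicit short-answer tag); the note lists the markers but does not spell out that mapping. The `Mode` enum and `detect_mode` function are hypothetical names.

```python
from enum import Enum
from typing import Sequence


class Mode(Enum):
    THINKING = "thinking"
    NON_THINKING = "non-thinking"


# Assumed mapping: either marker as the first token means thinking was skipped.
NON_THINKING_MARKERS = frozenset({"</think>", "<short>"})


def detect_mode(tokens: Sequence[str]) -> Mode:
    """Classify a response by looking only at its first token."""
    if tokens and tokens[0] in NON_THINKING_MARKERS:
        return Mode.NON_THINKING
    # Anything else, including an empty response, is treated as thinking mode.
    return Mode.THINKING
```

For example, `detect_mode(["</think>", "42"])` is classified as non-thinking, while a response that opens with `<think>` falls through to thinking mode.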

## References

- [[overthinking|Overthinking]]
- [[reward-hacking|Reward Hacking]]
- [[thinking-mode|Thinking mode]] / [[non-thinking-mode|Non-thinking mode]]
- [[gan-thinking-based-non-thinking-2026|TNT paper]]