20260625: lots of new content
60
reviews/gan-tnt-review-20260618.md
Normal file
@@ -0,0 +1,60 @@
---
title: "Review: Thinking-Based Non-Thinking (TNT)"
created: 2026-06-18
updated: 2026-06-18
type: review
source: gan-thinking-based-non-thinking-2026
---

# 📌 Basic Information

- **Paper title**: Thinking-Based Non-Thinking: Solving the Reward Hacking Problem in Training Hybrid Reasoning Models via Reinforcement Learning
- **Authors**: Siyuan Gan, Jiaheng Liu, Boyan Wang, et al. (Nanjing University, Jiutian Research Institute, Shanghai AI Lab)
- **Field**: cs.AI
- **arXiv ID**: 2601.04805
- **Type**: method paper (RL + training optimization for hybrid reasoning)
- **Added**: 2026-06-18

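# 🎯 Core Concepts

1. **[[hybrid-reasoning-models|Hybrid reasoning models]]**: models that automatically choose between thinking and non-thinking mode according to query complexity
2. **[[reward-hacking|Reward hacking]]**: during RL training, the model embeds thinking content inside the non-thinking format to collect extra reward
3. **[[overthinking|Overthinking]]**: LRMs produce lengthy CoT even for simple queries, wasting compute
4. **[[thinking-based-non-thinking|TNT]]**: "Thinking-Based Non-Thinking"; uses the solution length observed in thinking mode to set the non-thinking token cap dynamically
5. **[[dynamic-token-limit|Dynamic token limit]]**: the maximum token count for non-thinking mode is computed per query instead of using one uniform cap
6. **[[ellipsis-prompt|Ellipsis prompt]]**: a prompting technique that enables non-thinking-mode sampling without modifying the tokenizer
7. **[[large-reasoning-models|Large reasoning models]]**: CoT-centered models such as DeepSeek-R1 and OpenAI o1
8. **[[token-level-policy-gradient|Token-level policy gradient]]**: GRPO's fine-grained credit assignment at the token level
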
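The dynamic token limit (concepts 4 and 5) can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's implementation: the `</think>` delimiter, the whitespace token count, and the choice of the mean as the aggregation over sampled thinking responses are all hypothetical stand-ins.

```python
import math
from statistics import mean

THINK_END = "</think>"  # assumed delimiter between reasoning and solution


def solution_part(response: str) -> str:
    """Return the text after the closing think tag, or the whole response if absent."""
    _, sep, tail = response.partition(THINK_END)
    return tail if sep else response


def count_tokens(text: str) -> int:
    """Whitespace token count; a stand-in for a real tokenizer."""
    return len(text.split())


def non_thinking_token_limit(thinking_responses: list[str], floor: int = 1) -> int:
    """Per-query cap for non-thinking mode.

    Assumption: the cap is the mean solution length over the thinking-mode
    samples for this query, rounded up. The real aggregation may differ.
    """
    if not thinking_responses:
        raise ValueError("need at least one thinking-mode response")
    lengths = [count_tokens(solution_part(r)) for r in thinking_responses]
    return max(floor, math.ceil(mean(lengths)))


if __name__ == "__main__":
    samples = [
        "<think>2+2 is 4 because ...</think> The answer is 4",
        "<think>count</think> So the final answer is 4",
    ]
    print(non_thinking_token_limit(samples))
```

Because the cap is derived from each query's own thinking-mode samples, a query with a short solution gets a tight cap and a query with a long solution gets a looser one, which is the contrast with a uniform cap drawn in concept 5.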
# 🔗 Concept Network

```
overthinking            reward-hacking
     ↓                        ↓
hybrid-reasoning-models ←────── motivation for hybrid reasoning
     ↓                        ↓
large-reasoning-models ──→ thinking-mode + non-thinking-mode
                              ↓
                 ellipsis-prompt (implementation)
                              ↓
        dynamic-token-limit ← thinking solution length
                              ↓
              thinking-based-non-thinking (TNT)
                              ↓
            token-level-policy-gradient → GRPO
```

Concept structure: the concepts follow a clear **optimization chain**:
problem (overthinking) → direction (hybrid reasoning) → training obstacle (reward hacking) → TNT's fix (dynamic token limit derived from thinking) → RL implementation (token-level GRPO)

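The last link in the chain, token-level GRPO, can be illustrated with a simplified surrogate loss. This is a hedged sketch of one common reading of "token-level": each response's group-relative advantage is broadcast to all of its tokens and the loss is averaged over every token in the group. Clipping, importance ratios, and the KL term are omitted, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_level_loss(token_logprobs: list[list[float]], rewards: list[float]) -> float:
    """Simplified policy-gradient surrogate averaged over all tokens in the group.

    token_logprobs[i] holds the log-probabilities of the tokens of response i.
    Averaging over tokens (not over responses) means longer responses
    contribute proportionally more terms to the loss.
    """
    advantages = group_relative_advantages(rewards)
    total, n_tokens = 0.0, 0
    for logprobs, adv in zip(token_logprobs, advantages):
        for lp in logprobs:
            total += -adv * lp
            n_tokens += 1
    if n_tokens == 0:
        raise ValueError("no tokens in group")
    return total / n_tokens


if __name__ == "__main__":
    # Two responses to the same query: one rewarded +2, one penalized -2.
    print(token_level_loss([[-1.0], [-1.0, -1.0, -1.0]], [2.0, -2.0]))
```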
# 📚 Wiki Integration

- **New pages**: 11 (1 paper + 10 concepts)
- **Reused pages**: 4 (token-efficiency, grpo, reinforcement-learning, chain-of-thought)
- **Net increase**: +11 pages

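# 💡 Key Insights

1. **Elegant symmetry**: the core of TNT is "using thinking to constrain non-thinking". The solution part of a thinking-mode response is a natural upper bound on the length of a non-thinking response. This is simpler and more efficient than Adaptive Think's uniform token cap or Thinkless's large-scale SFT, and it adds no extra training stage.

2. **A well-judged reward design**: a non-thinking response flagged as hacking receives -2 regardless of correctness, which strongly suppresses hacking. The penalty is large enough to cancel the +2 gain from "thinking first, then disguising the output as non-thinking", and combined with the token-level policy gradient it yields fine-grained behavior correction.
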
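The reward rule in insight 2 can be written as a small function. Only two values come from this review: the -2 penalty for a hacking non-thinking response and the +2 gain that hacking would otherwise capture. Everything else is an assumption made for illustration: the `Rollout` type, the +1 reward for a correct thinking response, the -1 reward for an incorrect response, and the rule that a non-thinking response counts as hacking when it exceeds its per-query token limit.

```python
from dataclasses import dataclass

HACKING_PENALTY = -2.0       # from the review: applied regardless of correctness
NON_THINKING_CORRECT = 2.0   # from the review: the gain hacking tries to capture
THINKING_CORRECT = 1.0       # assumption, placeholder value
INCORRECT = -1.0             # assumption, placeholder value


@dataclass
class Rollout:
    mode: str         # "thinking" or "non_thinking"
    correct: bool
    num_tokens: int


def is_hacking(rollout: Rollout, token_limit: int) -> bool:
    """Assumed detection rule: a non-thinking response longer than its per-query cap."""
    return rollout.mode == "non_thinking" and rollout.num_tokens > token_limit


def reward(rollout: Rollout, token_limit: int) -> float:
    """Score one rollout; the hacking check comes first so correctness cannot rescue it."""
    if is_hacking(rollout, token_limit):
        return HACKING_PENALTY
    if not rollout.correct:
        return INCORRECT
    if rollout.mode == "non_thinking":
        return NON_THINKING_CORRECT
    return THINKING_CORRECT


if __name__ == "__main__":
    limit = 20
    print(reward(Rollout("non_thinking", True, 50), limit))  # over the cap
    print(reward(Rollout("non_thinking", True, 10), limit))  # within the cap
```

Checking for hacking before correctness is what makes the penalty apply "regardless of correctness": a disguised thinking response that reaches the right answer still scores -2 instead of +2.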