---
title: "Thinking-Based Non-Thinking: Solving the Reward Hacking Problem in Training Hybrid Reasoning Models via Reinforcement Learning"
created: 2026-06-18
updated: 2026-06-18
type: paper
authors:
- Siyuan Gan (Nanjing University)
- Jiaheng Liu (Nanjing University)
- Boyan Wang (Nanjing University)
- Tianpei Yang (Nanjing University)
- Runqing Miao (Jiutian Research)
- Yuyao Zhang (Jiutian Research)
- Fanyu Meng (Jiutian Research)
- Junlan Feng (Jiutian Research)
- Linjian Meng (Shanghai AI Laboratory)
- Jing Huo (Nanjing University)
- Yang Gao (Nanjing University)
source: arXiv
source_id: 2601.04805
published: 2026-01-08
categories:
- cs.AI
---
# Thinking-Based Non-Thinking (TNT)
> Gan et al. (2026) — arXiv:2601.04805
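## Core Problem
When RL is used to train [[hybrid-reasoning-models|hybrid reasoning models]] (which decide on their own whether to think or not), the model engages in **reward hacking**: it embeds thinking content inside the non-thinking format and collects a higher reward than it has earned. Existing remedies are either computationally expensive (large-scale SFT) or of limited effectiveness (a uniform token limit).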
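## TNT's Core Idea
**Use thinking to bound non-thinking**: the length of the solution part of thinking-mode responses is used to set a **per-query, dynamic** token limit for non-thinking mode.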
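### Why This Works
Thinking-mode training of an [[large-reasoning-models|LRM]] ensures that the solution after `</think>` **contains no additional thinking**, so it closely matches genuine non-thinking-mode output. The thinking solution length is therefore a reliable estimate of the natural non-thinking length.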
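
The solution length h(y) can be measured by cutting a thinking-mode response at `</think>`. The following is a minimal sketch, not the paper's implementation. It assumes whitespace-separated words stand in for tokenizer tokens, and the helper names `solution_part` and `solution_length` are hypothetical.

```python
def solution_part(response: str, close_tag: str = "</think>") -> str:
    """Return the text after the first closing think tag.

    If the tag is absent, the whole response is returned (an assumption
    made for this sketch; the note does not specify that case).
    """
    _, sep, tail = response.partition(close_tag)
    return tail.strip() if sep else response.strip()


def solution_length(response: str) -> int:
    """Approximate h(y): number of whitespace-separated words in the solution.

    A real pipeline would count tokens with the model's tokenizer instead.
    """
    return len(solution_part(response).split())
```

For example, `solution_length("<think>2+2 is 4</think> The answer is 4.")` counts only the four words after the tag.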
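### Algorithm
```
For each query x:
1. Sample K responses (using the ellipsis prompt)
2. Compute the average solution length over the thinking-mode response set M_T^x
3. L_N^x = ω × avg(h(y))   (dynamic limit; ω = 2)
4. A non-thinking response longer than L_N^x → reward hacking → penalty of -2
```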
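
Steps 2 to 4 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the `Response` dataclass, its `mode` labels, and the function names are hypothetical, and returning `None` when no thinking-mode response was sampled is a choice made here because the note does not say how that case is handled.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass
class Response:
    mode: str    # "thinking" or "non_thinking" (hypothetical labels)
    length: int  # h(y) for thinking responses; response length for non-thinking


def dynamic_limit(responses: Sequence[Response], omega: float = 2.0) -> Optional[float]:
    """Compute L_N^x = omega * average solution length over thinking responses."""
    lengths = [r.length for r in responses if r.mode == "thinking"]
    if not lengths:
        return None  # assumption: limit is undefined without thinking samples
    return omega * mean(lengths)


def is_reward_hacking(response: Response, limit: Optional[float]) -> bool:
    """Flag a non-thinking response whose length exceeds the dynamic limit."""
    if response.mode != "non_thinking" or limit is None:
        return False
    return response.length > limit
```

With thinking solution lengths of 100 and 140 and ω = 2, the limit is 240, so a 300-token non-thinking response is flagged while a 90-token one is not.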
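## Reward Function Design
| Mode | Correct | Incorrect |
|------|:--:|:--:|
| Thinking | +1 | 0 |
| Non-thinking, no hacking | **+2** | -1 |
| Non-thinking, reward hacking | **-2** | **-2** |

Key point: **any response over the token limit receives -2**, whether it is correct or not, which strongly suppresses hacking.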
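
The table maps directly onto a small lookup function. The sketch below is illustrative only: the name `tnt_reward` and the string labels for `mode` are hypothetical, and the `hacked` flag is assumed to come from a separate length check against the per-query limit.

```python
def tnt_reward(mode: str, correct: bool, hacked: bool = False) -> int:
    """Return the reward from the table for one response.

    mode:    "thinking" or "non_thinking" (hypothetical labels)
    correct: whether the final answer is correct
    hacked:  whether a non-thinking response exceeded its token limit
    """
    if mode == "thinking":
        return 1 if correct else 0
    if mode != "non_thinking":
        raise ValueError(f"unknown mode: {mode!r}")
    if hacked:
        return -2  # same penalty regardless of correctness
    return 2 if correct else -1
```

Because the hacking branch is checked before correctness, a correct but over-limit non-thinking response still scores below an incorrect thinking response.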
## Experimental Highlights
| Metric | TNT vs Base |
|------|------------|
| Token usage | **↓ ~50%** |
| Accuracy | **↑ 4.1%** |
| Reward hacking rate | **< 10%** |
| Efficiency trade-off | **Best** among all methods |

Five math benchmarks: AIME24, AIME25, Minerva, AMC23, Olympiad. Base models: DeepSeek-R1-Distill-Qwen-1.5B/7B, DeepScaleR-1.5B.
## Concept Network
```
overthinking → hybrid-reasoning-models → reward-hacking
↓ ↓ ↓
large-reasoning-models thinking-mode dynamic-token-limit
non-thinking-mode ↓
ellipsis-prompt thinking-based-non-thinking (TNT)
token-level-policy-gradient → GRPO
```
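## Compatibility
TNT concerns only how the token limit is set and is decoupled from the RL algorithm: GRPO, PPO, DAPO, Dr.GRPO, and GSPO can all be used. It can also be combined with techniques such as CoT Compression, Batch-Level Reward Balancing, and Length-Aware Reward.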
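
One way to picture this decoupling is that TNT only produces per-response rewards, which any policy-gradient method can then consume. The sketch below shows a GRPO-style group normalization of such rewards. It is an assumption-laden illustration rather than the paper's training code: the function name is hypothetical, and the use of the population standard deviation and the `eps` value are choices made here.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_normalized_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one query's sample group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std (a choice made for this sketch)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Feeding in rewards such as `[2, 1, -2, -1]` keeps their ordering, so a hacked response (reward -2) receives the lowest advantage in its group.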
## Sources
[arXiv:2601.04805](https://arxiv.org/abs/2601.04805) | [Original archive](raw/papers/gan-thinking-based-non-thinking-2026.md)