20260625: lots of new content
papers/gan-thinking-based-non-thinking-2026.md
@@ -0,0 +1,90 @@
---
title: "Thinking-Based Non-Thinking: Solving the Reward Hacking Problem in Training Hybrid Reasoning Models via Reinforcement Learning"
created: 2026-06-18
updated: 2026-06-18
type: paper
authors:
  - Siyuan Gan (Nanjing University)
  - Jiaheng Liu (Nanjing University)
  - Boyan Wang (Nanjing University)
  - Tianpei Yang (Nanjing University)
  - Runqing Miao (Jiutian Research)
  - Yuyao Zhang (Jiutian Research)
  - Fanyu Meng (Jiutian Research)
  - Junlan Feng (Jiutian Research)
  - Linjian Meng (Shanghai AI Laboratory)
  - Jing Huo (Nanjing University)
  - Yang Gao (Nanjing University)
source: arXiv
source_id: 2601.04805
published: 2026-01-08
categories:
  - cs.AI
---

# Thinking-Based Non-Thinking (TNT)

> Gan et al. (2026), arXiv:2601.04805

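## Core Problem

When [[hybrid-reasoning-models|hybrid reasoning models]] (which decide on their own whether to think or not) are trained with RL, the model learns to **reward hack**: it embeds thinking content inside the non-thinking format to collect a higher reward it has not earned. Existing fixes are either too expensive computationally (large-scale SFT) or of limited effect (a uniform token cap).
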
## Core Idea of TNT

**Let thinking set the bound for non-thinking**: use the length of the solution part of thinking-mode responses to set the non-thinking token cap **dynamically for each query**.

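### Why This Works

Thinking-mode training of [[large-reasoning-models|LRMs]] ensures that the solution after `</think>` **contains no further thinking**, so it closely matches genuine non-thinking output. The thinking-mode solution length is therefore a reliable estimate of the natural length of a non-thinking response.
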
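
As a minimal sketch of the length measure this argument relies on, the snippet below splits a thinking-mode response at `</think>` and counts the tokens that follow. The names `split_solution` and `solution_length` and the whitespace tokenizer are illustrative assumptions, not taken from the paper; a real setup would presumably count tokens with the model's own tokenizer.

```python
from typing import Callable, List

THINK_END = "</think>"


def split_solution(response: str) -> str:
    """Return the text after the last `</think>` tag (the solution part).

    If the tag is absent, the whole response is treated as the solution.
    """
    _, sep, tail = response.rpartition(THINK_END)
    return tail.strip() if sep else response.strip()


def solution_length(response: str,
                    tokenize: Callable[[str], List[str]] = str.split) -> int:
    """Hypothetical h(y): number of tokens in the solution part.

    Whitespace splitting is a stand-in for a real tokenizer (assumption).
    """
    return len(tokenize(split_solution(response)))


example = "<think>Try x = 2. Check: 2 + 2 = 4.</think>The answer is 4."
print(solution_length(example))
```

Passing a different `tokenize` callable swaps in another tokenizer without touching the split logic.
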
### Algorithm

```
For each query x:
1. Sample K responses (using the ellipsis prompt)
2. Compute the average solution length over the thinking-mode response set M_T^x
3. L_N^x = ω × avg(h(y)), the dynamic cap (ω = 2)
4. A non-thinking response longer than L_N^x → reward hacking → penalty of -2
```

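
Steps 2 to 4 can be sketched in a few lines of Python. This is an illustration under assumptions, not the paper's code: the function names are hypothetical, lengths are taken as precomputed token counts, and returning `None` when a query has no thinking-mode sample is a choice made here because the note does not say how that case is handled.

```python
from statistics import mean
from typing import List, Optional

OMEGA = 2.0  # scaling factor ω from the note (ω = 2)


def dynamic_cap(thinking_solution_lengths: List[int],
                omega: float = OMEGA) -> Optional[float]:
    """L_N^x = ω × mean solution length over the thinking-mode responses.

    Returns None when no thinking-mode response was sampled for the query
    (an assumption; the note does not cover this case).
    """
    if not thinking_solution_lengths:
        return None
    return omega * mean(thinking_solution_lengths)


def is_reward_hacking(non_thinking_length: int, cap: Optional[float]) -> bool:
    """Flag a non-thinking response whose length exceeds the per-query cap."""
    return cap is not None and non_thinking_length > cap


# Solution lengths h(y) of the thinking-mode samples for one query x.
cap = dynamic_cap([40, 60, 50])
print(cap)
print(is_reward_hacking(90, cap), is_reward_hacking(120, cap))
```

Because the cap is recomputed per query, a hard problem with long solutions tolerates a longer non-thinking answer than an easy one, which a uniform cap cannot do.
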
## Reward Function Design

| Mode | Correct | Incorrect |
|------|:--:|:--:|
| Thinking | +1 | 0 |
| Non-thinking, no hacking | **+2** | -1 |
| Non-thinking, reward hacking | **-2** | **-2** |

Key point: **any non-thinking response over the token cap gets -2**, correct or not, which strongly suppresses hacking.

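
The table maps directly onto a small function. The sketch below is one way to encode it; the name `tnt_reward` and the boolean arguments are hypothetical, and the correctness check and cap comparison are assumed to happen upstream.

```python
def tnt_reward(is_thinking: bool, is_correct: bool,
               exceeds_cap: bool = False) -> int:
    """Reward values from the table above.

    `exceeds_cap` only matters for non-thinking responses: it marks a
    response longer than the per-query dynamic cap (reward hacking).
    """
    if is_thinking:
        return 1 if is_correct else 0
    if exceeds_cap:
        return -2  # flat penalty regardless of correctness
    return 2 if is_correct else -1


# One row of the table per line: (correct, incorrect)
print(tnt_reward(True, True), tnt_reward(True, False))
print(tnt_reward(False, True), tnt_reward(False, False))
print(tnt_reward(False, True, exceeds_cap=True),
      tnt_reward(False, False, exceeds_cap=True))
```

A correct non-thinking answer earns more than a correct thinking answer (+2 vs. +1), which is what makes hacking tempting, and the flat -2 removes that incentive once the cap is exceeded.
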
## Experimental Highlights

| Metric | TNT vs. Base |
|------|------------|
| Token usage | **↓ ~50%** |
| Accuracy | **↑ 4.1%** |
| Reward hacking rate | **< 10%** |
| Efficiency trade-off | **Best** (among all methods) |

Five math benchmarks: AIME24, AIME25, Minerva, AMC23, Olympiad. Base models: DeepSeek-R1-Distill-Qwen-1.5B/7B, DeepScaleR-1.5B.

## Concept Network

```
overthinking → hybrid-reasoning-models → reward-hacking
     ↓                      ↓                   ↓
large-reasoning-models  thinking-mode       dynamic-token-limit
                        non-thinking-mode       ↓
                        ellipsis-prompt     thinking-based-non-thinking (TNT)
                                                ↓
                                   token-level-policy-gradient → GRPO
```

## Compatibility

TNT only concerns how the token cap is set, so it is decoupled from the RL algorithm: GRPO, PPO, DAPO, Dr.GRPO, and GSPO can all be used. It can also be combined with techniques such as CoT Compression, Batch-Level Reward Balancing, and Length-Aware Reward.

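
To illustrate the decoupling, the sketch below feeds already-computed scalar rewards into a GRPO-style group normalization, where each response's advantage is its reward minus the group mean, divided by the group standard deviation. The function name, the `eps` term, and the use of the population standard deviation are assumptions; implementations differ on these details, and the paper's training code is not shown in this note.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: (r_i - group mean) / (group std + eps).

    The rewards can come from any scheme; the policy-gradient side never
    sees how the token cap was chosen.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for K = 4 sampled responses to one query, e.g. thinking-correct,
# thinking-wrong, non-thinking-correct, non-thinking over the cap.
rewards = [1, 0, 2, -2]
advantages = group_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Swapping in another optimizer would change only how the rewards are consumed, not how they are produced.
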
## Source

[arXiv:2601.04805](https://arxiv.org/abs/2601.04805) | [Raw archive](raw/papers/gan-thinking-based-non-thinking-2026.md)