---
title: "Thinking-Based Non-Thinking: Solving the Reward Hacking Problem in Training Hybrid Reasoning Models via Reinforcement Learning"
source: arXiv
source_id: 2601.04805
authors:
- Siyuan Gan (Nanjing University)
- Jiaheng Liu (Nanjing University)
- Boyan Wang (Nanjing University)
- Tianpei Yang (Nanjing University)
- Runqing Miao (Jiutian Research)
- Yuyao Zhang (Jiutian Research)
- Fanyu Meng (Jiutian Research)
- Junlan Feng (Jiutian Research)
- Linjian Meng (Shanghai AI Laboratory)
- Jing Huo (Nanjing University)
- Yang Gao (Nanjing University)
published: 2026-01-08
updated: 2026-06-07
categories:
- cs.AI
venue: Preprint
---

# Thinking-Based Non-Thinking (TNT)

## Abstract
Large reasoning models (LRMs) achieve strong performance through long Chain-of-Thought reasoning (thinking), but this incurs substantial computational overhead, known as the overthinking problem. Hybrid reasoning models trained with reinforcement learning (RL) choose dynamically between thinking and non-thinking modes, yet they suffer from **reward hacking**: the model generates a thinking-like response that is nonetheless classified as non-thinking, and so receives an undeserved reward.

Existing mitigations either (1) apply supervised fine-tuning (SFT) on large datasets, which is costly, or (2) impose a uniform token limit on non-thinking responses, which is ineffective because query difficulty varies. TNT instead proposes **per-query dynamic token limits** derived from the solution length of thinking-mode responses. This relies on the fact that an LRM's thinking mode confines reasoning to the thinking component, so its solution component contains no additional thinking.
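The per-query limit can be illustrated with a minimal sketch. It is not the authors' implementation: the `</think>` delimiter, the whitespace token count, and taking the maximum over sampled thinking-mode responses are all assumptions made for illustration, and the function names are hypothetical.

```python
THINK_END = "</think>"  # assumed delimiter between thinking and solution


def solution_part(response: str) -> str:
    """Return the text after the thinking block (whole response if no delimiter)."""
    _, sep, tail = response.partition(THINK_END)
    return tail.strip() if sep else response.strip()


def count_tokens(text: str) -> int:
    """Whitespace token count; a stand-in for the model's real tokenizer."""
    return len(text.split())


def dynamic_limit(thinking_responses: list[str]) -> int:
    """Per-query cap for non-thinking mode.

    Assumption: the cap is the longest solution component among the
    thinking-mode samples for this query. The paper may aggregate differently.
    """
    if not thinking_responses:
        raise ValueError("need at least one thinking-mode response")
    return max(count_tokens(solution_part(r)) for r in thinking_responses)


def within_limit(non_thinking_response: str, limit: int) -> bool:
    """True if a non-thinking response fits under the per-query cap."""
    return count_tokens(non_thinking_response) <= limit


samples = [
    "<think>try 2+2 ... so 4</think>The answer is 4.",
    "<think>hmm</think>2 + 2 = 4, so the answer is 4.",
]
limit = dynamic_limit(samples)
short = "The answer is 4."
long_winded = "Let me think step by step about this: first 2 plus 2 equals 4."
print(limit, within_limit(short, limit), within_limit(long_winded, limit))
```

Because the cap is computed from each query's own thinking-mode solutions, an easy query gets a tight budget and a hard one gets a looser budget, which is the property a uniform limit lacks.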

## Core Contributions
1. **TNT (Thinking-Based Non-Thinking)**: a dynamic, per-query maximum token budget for non-thinking mode, derived from the solution component of thinking-mode responses
2. **50% token reduction** compared with DeepSeek-R1-Distill-Qwen, while **improving accuracy** across 5 math benchmarks
3. **Optimal accuracy-efficiency trade-off** among all tested hybrid reasoning methods
4. **<10% reward hacking rate** across all datasets
5. Compatible with any RL algorithm (GRPO, PPO, DAPO, Dr.GRPO, GSPO)
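
One way to read the algorithm-agnostic claim is that the token limit only affects the scalar reward, which any of the listed RL algorithms can consume. The sketch below is hypothetical: the `Rollout` fields, the zero reward for over-limit non-thinking responses, and the definition of the hacking rate as the flagged share of non-thinking rollouts are assumptions, not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    mode: str        # "thinking" or "non_thinking" (assumed labels)
    num_tokens: int  # length of the generated response
    correct: bool    # outcome of an answer checker


def is_hacking(r: Rollout, limit: int) -> bool:
    """Flag a non-thinking rollout that exceeds the per-query cap."""
    return r.mode == "non_thinking" and r.num_tokens > limit


def reward(r: Rollout, limit: int) -> float:
    """Scalar reward: 0.0 for flagged rollouts, else 1.0 if correct (assumed shaping)."""
    if is_hacking(r, limit):
        return 0.0
    return 1.0 if r.correct else 0.0


def hacking_rate(rollouts: list[Rollout], limit: int) -> float:
    """Share of non-thinking rollouts that exceed the cap (assumed definition)."""
    non_thinking = [r for r in rollouts if r.mode == "non_thinking"]
    if not non_thinking:
        return 0.0
    return sum(is_hacking(r, limit) for r in non_thinking) / len(non_thinking)


rollouts = [
    Rollout("non_thinking", 8, True),
    Rollout("non_thinking", 40, True),   # over the cap: treated as hacking
    Rollout("thinking", 300, True),      # thinking mode is not capped here
    Rollout("non_thinking", 5, False),
]
rewards = [reward(r, limit=10) for r in rollouts]
print(rewards, hacking_rate(rollouts, limit=10))
```

The resulting list of rewards could then be passed to whichever advantage estimator the chosen algorithm uses; nothing in the sketch depends on the optimizer.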

## URL
https://arxiv.org/abs/2601.04805