myWiki/raw/papers/gan-thinking-based-non-thinking-2026.md

---
title: "Thinking-Based Non-Thinking: Solving the Reward Hacking Problem in Training Hybrid Reasoning Models via Reinforcement Learning"
source: arXiv
source_id: 2601.04805
authors:
  - Siyuan Gan (Nanjing University)
  - Jiaheng Liu (Nanjing University)
  - Boyan Wang (Nanjing University)
  - Tianpei Yang (Nanjing University)
  - Runqing Miao (Jiutian Research)
  - Yuyao Zhang (Jiutian Research)
  - Fanyu Meng (Jiutian Research)
  - Junlan Feng (Jiutian Research)
  - Linjian Meng (Shanghai AI Laboratory)
  - Jing Huo (Nanjing University)
  - Yang Gao (Nanjing University)
published: 2026-01-08
updated: 2026-06-07
categories:
  - cs.AI
venue: Preprint
---

# Thinking-Based Non-Thinking (TNT)

## Abstract
Large reasoning models (LRMs) achieve exceptional performance through long Chain-of-Thought reasoning (thinking), but this incurs substantial computational overhead: the overthinking problem. Hybrid reasoning models trained with reinforcement learning (RL) to choose dynamically between thinking and non-thinking modes suffer from **reward hacking**: the model generates a thinking-like response that is nevertheless classified as non-thinking, and so receives an undeserved reward.

Existing mitigations either (1) apply supervised fine-tuning (SFT) on large datasets, which is costly, or (2) impose a uniform token limit on non-thinking responses, which is ineffective because query difficulty varies. TNT instead sets **per-query dynamic token limits** derived from the length of the solution component of thinking-mode responses. This relies on the observation that, in an LRM's thinking-mode response, the solution component contains no additional thinking.
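
The per-query limit can be illustrated with a minimal Python sketch. It is not the paper's implementation and rests on several assumptions: the `</think>` delimiter (as emitted by DeepSeek-R1-style models), whitespace splitting as a stand-in for a real tokenizer, and taking the maximum solution length over sampled thinking-mode responses as the aggregation rule. All function names are hypothetical.

```python
from typing import List

# Assumption: thinking ends with this tag, as in DeepSeek-R1-style outputs.
THINK_END = "</think>"


def solution_part(response: str) -> str:
    """Return the text after the closing think tag (the solution component).

    If the tag is absent, the whole response is treated as the solution.
    """
    _, sep, tail = response.partition(THINK_END)
    return tail.strip() if sep else response.strip()


def count_tokens(text: str) -> int:
    """Whitespace token count; a stand-in for a real tokenizer (assumption)."""
    return len(text.split())


def dynamic_limit(thinking_responses: List[str]) -> int:
    """Per-query budget for non-thinking responses.

    Assumption: use the longest solution component among the sampled
    thinking-mode responses for this query.
    """
    return max(count_tokens(solution_part(r)) for r in thinking_responses)


def within_budget(non_thinking_response: str, limit: int) -> bool:
    """True if a non-thinking response fits the query's dynamic limit."""
    return count_tokens(non_thinking_response) <= limit


# Hypothetical responses to the query "What is 2 + 2?"
thinking_samples = [
    "<think>let me try 2+2 ... wait, check again</think> The answer is 4.",
    "<think>hmm</think> 2 + 2 = 4, so the answer is 4.",
]
limit = dynamic_limit(thinking_samples)
short_ok = within_budget("The answer is 4.", limit)
long_ok = within_budget(
    "Wait, let me think step by step about this sum before I answer. It is 4.",
    limit,
)
```

Because the budget tracks how long a clean solution to this particular query is, a response that smuggles reasoning into non-thinking mode tends to exceed it, whereas a uniform limit would be too loose for easy queries and too tight for hard ones.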

## Core Contributions
1. **TNT (Thinking-Based Non-Thinking)**: Dynamic per-query maximum token usage for non-thinking mode, derived from the solution component of thinking mode responses
2. **50% token reduction** vs DeepSeek-R1-Distill-Qwen while **improving accuracy** across 5 math benchmarks
3. **Optimal accuracy-efficiency trade-off** among all tested hybrid reasoning methods
4. **<10% reward hacking rate** across all datasets
5. Compatible with any RL algorithm (e.g., GRPO, PPO, DAPO, Dr. GRPO, GSPO)
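
The two headline metrics above, token reduction and reward hacking rate, can be made concrete with a short sketch. The definitions here are assumptions for illustration: the paper's exact evaluation protocol is not given in this note, the detector for thinking-like content is left as a caller-supplied function, and every number in the usage lines is hypothetical rather than a reported result.

```python
from typing import Callable, Sequence


def token_reduction(baseline_tokens: float, method_tokens: float) -> float:
    """Relative token saving against a baseline (0.5 means 50% fewer tokens)."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 1.0 - method_tokens / baseline_tokens


def reward_hacking_rate(
    non_thinking_responses: Sequence[str],
    looks_like_thinking: Callable[[str], bool],
) -> float:
    """Share of responses classified as non-thinking that the detector flags.

    The detector is supplied by the caller; this sketch does not assume
    how thinking-like content is actually identified.
    """
    if not non_thinking_responses:
        return 0.0
    flagged = sum(1 for r in non_thinking_responses if looks_like_thinking(r))
    return flagged / len(non_thinking_responses)


# Hypothetical numbers and a toy keyword detector, for illustration only.
saving = token_reduction(baseline_tokens=8000, method_tokens=4000)
rate = reward_hacking_rate(
    ["The answer is 4.", "Wait, let me reconsider... the answer is 4."],
    looks_like_thinking=lambda r: "wait" in r.lower(),
)
```

Under these definitions, a 50% token reduction corresponds to `saving == 0.5`, and a reward hacking rate below 10% means fewer than one in ten non-thinking responses is flagged.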

## URL
https://arxiv.org/abs/2601.04805