Thinking-Based Non-Thinking: Solving the Reward Hacking Problem in Training Hybrid Reasoning Models via Reinforcement Learning
arXiv
2601.04805
Siyuan Gan (Nanjing University)
Jiaheng Liu (Nanjing University)
Boyan Wang (Nanjing University)
Tianpei Yang (Nanjing University)
Runqing Miao (Jiutian Research)
Yuyao Zhang (Jiutian Research)
Fanyu Meng (Jiutian Research)
Junlan Feng (Jiutian Research)
Linjian Meng (Shanghai AI Laboratory)
Jing Huo (Nanjing University)
Yang Gao (Nanjing University)
2026-01-08
2026-06-07
cs.AI
Preprint
Thinking-Based Non-Thinking (TNT)
Abstract
Large reasoning models (LRMs) achieve exceptional performance through long Chain-of-Thought reasoning (thinking), but this incurs substantial computational overhead, known as the overthinking problem. Hybrid reasoning models trained with RL to choose dynamically between thinking and non-thinking modes suffer from reward hacking: the model generates a thinking-like response that is nonetheless classified as non-thinking, and so receives an undeserved reward.
Existing mitigations either (1) apply supervised fine-tuning (SFT) on large datasets, which is costly, or (2) impose a uniform token limit on non-thinking responses, which is ineffective because query difficulty varies. TNT instead sets a dynamic, per-query token limit derived from the length of the solution component of thinking-mode responses. This relies on the observation that, in an LRM's thinking-mode response, the solution component contains no additional thinking.
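The per-query limit described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes thinking-mode responses end their reasoning with a `</think>` delimiter, counts tokens by whitespace splitting in place of a real tokenizer, and takes the maximum solution length over the sampled thinking-mode responses as the cap. The function names are hypothetical.

```python
from typing import List

THINK_END = "</think>"  # assumed delimiter between thinking and solution


def solution_part(response: str) -> str:
    """Return the text after the thinking block (the solution component)."""
    _, sep, tail = response.partition(THINK_END)
    # If the delimiter is missing, treat the whole response as the solution.
    return tail.strip() if sep else response.strip()


def count_tokens(text: str) -> int:
    """Whitespace token count; a stand-in for a real tokenizer."""
    return len(text.split())


def dynamic_token_limit(thinking_responses: List[str]) -> int:
    """Per-query cap for non-thinking responses.

    Assumption: the cap is the longest solution component among the
    thinking-mode responses sampled for this query.
    """
    if not thinking_responses:
        raise ValueError("need at least one thinking-mode response")
    return max(count_tokens(solution_part(r)) for r in thinking_responses)


def within_limit(non_thinking_response: str, limit: int) -> bool:
    """True if a non-thinking response stays within the per-query cap."""
    return count_tokens(non_thinking_response) <= limit
```

Under these assumptions, an easy query whose thinking-mode solutions are short gets a tight cap, while a harder query with longer solutions gets a looser one, which is the behavior a uniform limit cannot provide.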
Core Contributions
TNT (Thinking-Based Non-Thinking): Dynamic per-query maximum token usage for non-thinking mode, derived from the solution component of thinking mode responses
50% token reduction relative to DeepSeek-R1-Distill-Qwen, with improved accuracy across 5 math benchmarks
Best accuracy-efficiency trade-off among all tested hybrid reasoning methods
Reward hacking rate below 10% across all datasets
Compatible with any RL algorithm (GRPO, PPO, DAPO, Dr.GRPO, GSPO)
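One plausible reading of the compatibility claim is that the per-query limit only changes the scalar reward, so the policy-optimization step itself is untouched. The sketch below illustrates that idea. The gating rule (zero reward for an over-limit non-thinking response), the binary correctness reward, and the `Rollout` fields are all assumptions made for illustration, not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    text: str          # generated response
    is_thinking: bool  # mode label from the classifier (assumed given)
    correct: bool      # result of the answer check (assumed given)
    n_tokens: int      # response length in tokens


def gated_reward(rollout: Rollout, non_thinking_limit: int) -> float:
    """Scalar reward with a per-query cap on non-thinking responses.

    Assumed rule (hypothetical): a response labeled non-thinking that
    exceeds the cap earns no reward, even if its answer is correct.
    """
    if not rollout.is_thinking and rollout.n_tokens > non_thinking_limit:
        return 0.0
    return 1.0 if rollout.correct else 0.0
```

Because the function returns an ordinary per-rollout scalar, it could be passed to whichever advantage estimator the chosen algorithm uses without modifying that algorithm.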