Thinking-Based Non-Thinking (TNT)

Abstract

Large reasoning models (LRMs) achieve exceptional performance through long Chain-of-Thought reasoning (thinking), but this incurs substantial computational overhead, known as the overthinking problem. Hybrid reasoning models trained with RL to choose dynamically between thinking and non-thinking modes suffer from reward hacking: the model generates thinking-like responses that are still classified as non-thinking and so receive undeserved rewards.

Existing mitigations either (1) apply SFT on large datasets, which is costly, or (2) impose a uniform token limit on non-thinking responses, which is ineffective because query difficulty varies. TNT instead sets a per-query dynamic token limit derived from the length of the solution component of the thinking-mode response. This relies on the fact that, in an LRM's thinking mode, the solution component contains no additional thinking.
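The per-query limit described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation. The `</think>` delimiter, whitespace token counting, taking the maximum over sampled solution lengths, and the zero reward for over-limit responses are all assumptions made for the example, and every function name is hypothetical.

```python
THINK_END = "</think>"  # assumed delimiter between reasoning and solution


def count_tokens(text: str) -> int:
    # Whitespace split stands in for a real tokenizer (assumption).
    return len(text.split())


def solution_part(response: str) -> str:
    # Text after the closing think tag; the whole response if the tag is absent.
    _, sep, tail = response.partition(THINK_END)
    return tail if sep else response


def dynamic_limit(thinking_responses: list[str]) -> int:
    # Per-query budget: the longest solution component among the sampled
    # thinking-mode responses. Using max (not mean) is an assumption.
    if not thinking_responses:
        raise ValueError("need at least one thinking-mode response")
    return max(count_tokens(solution_part(r)) for r in thinking_responses)


def non_thinking_reward(response: str, is_correct: bool, limit: int) -> float:
    # A non-thinking response longer than the budget is treated as hidden
    # thinking and gets no reward (assumed penalty scheme).
    if count_tokens(response) > limit:
        return 0.0
    return 1.0 if is_correct else 0.0


thinking = [
    "<think>try a b c d e</think> The answer is 4.",
    "<think>x</think> So 2 + 2 = 4.",
]
limit = dynamic_limit(thinking)
short_reward = non_thinking_reward("The answer is 4.", True, limit)
long_reward = non_thinking_reward(
    "Let me check: 2 plus 2, hmm, wait, that gives 4.", True, limit
)
```

Because the budget in this sketch only changes the reward assigned to a non-thinking response, it can sit in front of whichever policy-gradient update is in use.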

Core Contributions

TNT (Thinking-Based Non-Thinking): a dynamic per-query maximum token budget for non-thinking mode, derived from the solution component of thinking-mode responses
50% token reduction vs. DeepSeek-R1-Distill-Qwen, with improved accuracy across 5 math benchmarks
Best accuracy-efficiency trade-off among all tested hybrid reasoning methods
Reward hacking rate below 10% across all datasets
Compatible with any RL algorithm (GRPO, PPO, DAPO, Dr.GRPO, GSPO)

URL

https://arxiv.org/abs/2601.04805
