20260625: lots of new content
39
raw/papers/gan-thinking-based-non-thinking-2026.md
Normal file
@@ -0,0 +1,39 @@
---
title: "Thinking-Based Non-Thinking: Solving the Reward Hacking Problem in Training Hybrid Reasoning Models via Reinforcement Learning"
source: arXiv
source_id: 2601.04805
authors:
- Siyuan Gan (Nanjing University)
- Jiaheng Liu (Nanjing University)
- Boyan Wang (Nanjing University)
- Tianpei Yang (Nanjing University)
- Runqing Miao (Jiutian Research)
- Yuyao Zhang (Jiutian Research)
- Fanyu Meng (Jiutian Research)
- Junlan Feng (Jiutian Research)
- Linjian Meng (Shanghai AI Laboratory)
- Jing Huo (Nanjing University)
- Yang Gao (Nanjing University)
published: 2026-01-08
updated: 2026-06-07
categories:
- cs.AI
venue: Preprint
---

# Thinking-Based Non-Thinking (TNT)

## Abstract
Large reasoning models (LRMs) achieve strong performance through long Chain-of-Thought reasoning (thinking), but this incurs substantial computational overhead, known as the overthinking problem. Hybrid reasoning models trained with RL to choose dynamically between thinking and non-thinking modes suffer from **reward hacking**: the model produces a thinking-like response that is nonetheless classified as non-thinking, and so receives an undeserved reward.

Existing mitigations either (1) rely on SFT with large datasets, which is costly, or (2) impose a uniform token limit on non-thinking responses, which is ineffective when query difficulty varies. TNT instead sets a **per-query dynamic token limit** derived from the length of the solution component of thinking-mode responses. This relies on the observation that, in an LRM's thinking mode, the solution component that follows the thinking contains no additional thinking.

## Core Contributions
1. **TNT (Thinking-Based Non-Thinking)**: Dynamic per-query maximum token usage for non-thinking mode, derived from the solution component of thinking mode responses
2. **50% token reduction** vs DeepSeek-R1-Distill-Qwen while **improving accuracy** across 5 math benchmarks
3. **Optimal accuracy-efficiency trade-off** among all tested hybrid reasoning methods
4. **<10% reward hacking rate** across all datasets
5. Compatible with any RL algorithm (GRPO, PPO, DAPO, Dr.GRPO, GSPO)

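
The per-query limit in contribution 1 can be illustrated with a minimal sketch. It is not the paper's implementation, and several details are assumptions made only for illustration: the `Response` type, whitespace token counting in place of a real tokenizer, taking the maximum solution length as the cap, applying no cap when a query has no thinking-mode samples, and giving zero reward to non-thinking responses that exceed the cap.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Response:
    """One sampled response to a query (hypothetical container)."""
    thinking: str   # text of the thinking part; empty for non-thinking mode
    solution: str   # text of the solution part
    correct: bool   # whether the final answer was judged correct


def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting stands in for a real tokenizer.
    return len(text.split())


def dynamic_limit(group: List[Response]) -> Optional[int]:
    """Per-query cap taken from thinking-mode solution lengths.

    Assumption: the cap is the longest solution component among the
    thinking-mode samples; None means no thinking sample exists, so no cap.
    """
    lengths = [count_tokens(r.solution) for r in group if r.thinking.strip()]
    return max(lengths) if lengths else None


def reward(resp: Response, limit: Optional[int]) -> float:
    """Correctness reward, zeroed for over-long non-thinking responses."""
    base = 1.0 if resp.correct else 0.0
    is_non_thinking = not resp.thinking.strip()
    if is_non_thinking and limit is not None and count_tokens(resp.solution) > limit:
        # Assumption: an over-limit "non-thinking" response is treated as
        # suspected reward hacking and earns nothing.
        return 0.0
    return base


# Three samples for one query: thinking, short non-thinking, long non-thinking.
group = [
    Response("Let me add the numbers step by step.", "The answer is 4.", True),
    Response("", "4", True),
    Response("", "First try 2 + 2, hmm, wait, let me double check, so the answer is 4.", True),
]
limit = dynamic_limit(group)
rewards = [reward(r, limit) for r in group]
print(limit, rewards)
```

Because a cap of this kind only changes the scalar reward, a sketch like this would sit in front of the advantage computation of whichever RL algorithm is used, which is consistent with contribution 5.
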
## URL
https://arxiv.org/abs/2601.04805