20260625: Lots of new content

2026-06-25 14:08:47 +08:00
parent 91fac5b6fc
commit 6021dea160
375 changed files with 19263 additions and 251 deletions


@@ -0,0 +1,39 @@
---
title: "Thinking-Based Non-Thinking: Solving the Reward Hacking Problem in Training Hybrid Reasoning Models via Reinforcement Learning"
source: arXiv
source_id: 2601.04805
authors:
- Siyuan Gan (Nanjing University)
- Jiaheng Liu (Nanjing University)
- Boyan Wang (Nanjing University)
- Tianpei Yang (Nanjing University)
- Runqing Miao (Jiutian Research)
- Yuyao Zhang (Jiutian Research)
- Fanyu Meng (Jiutian Research)
- Junlan Feng (Jiutian Research)
- Linjian Meng (Shanghai AI Laboratory)
- Jing Huo (Nanjing University)
- Yang Gao (Nanjing University)
published: 2026-01-08
updated: 2026-06-07
categories:
- cs.AI
venue: Preprint
---
# Thinking-Based Non-Thinking (TNT)
## Abstract
Large reasoning models (LRMs) achieve exceptional performance through long Chain-of-Thought reasoning (thinking), but this incurs substantial computational overhead, known as the overthinking problem. Hybrid reasoning models trained with RL to choose dynamically between thinking and non-thinking modes suffer from **reward hacking**: the model generates a thinking-like response that is nonetheless classified as non-thinking, and so receives an undeserved reward.
Existing mitigations either use SFT with large datasets (high cost) or apply a uniform token limit to non-thinking responses (ineffective across queries of varying difficulty). TNT instead sets **per-query dynamic token limits** derived from the solution length of thinking-mode responses, leveraging the fact that the solution component of an LRM's thinking-mode response contains no additional thinking.
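To make the per-query limit concrete, here is a minimal Python sketch of how such a limit might be computed. It is an illustration under stated assumptions, not the paper's implementation: the `</think>` delimiter, the whitespace token count, and the choice of the maximum solution length as the limit are all assumptions, and every function name is hypothetical.

```python
THINK_END = "</think>"  # assumed delimiter between thinking and solution


def solution_part(response: str) -> str:
    """Return the text after the thinking block (the solution component)."""
    _, sep, tail = response.partition(THINK_END)
    # If no delimiter is present, treat the whole response as the solution.
    return tail.strip() if sep else response.strip()


def count_tokens(text: str) -> int:
    """Whitespace token count; a stand-in for the model's real tokenizer."""
    return len(text.split())


def dynamic_limit(thinking_responses: list[str]) -> int:
    """Per-query token limit for non-thinking responses.

    Assumption: the limit is the longest solution component among the
    thinking-mode responses sampled for this query.
    """
    if not thinking_responses:
        raise ValueError("need at least one thinking-mode response")
    return max(count_tokens(solution_part(r)) for r in thinking_responses)


def exceeds_limit(non_thinking_response: str, limit: int) -> bool:
    """Flag a non-thinking response that is longer than the query's limit."""
    return count_tokens(non_thinking_response) > limit


thinking = [
    "<think>a b c d</think> The answer is 4",
    "<think>x</think> So x = 4 exactly",
]
limit = dynamic_limit(thinking)
```

Because the limit is derived from the same query's thinking-mode outputs, an easy query ends up with a tight budget and a hard query with a looser one, which is the contrast with a uniform token limit.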
## Core Contributions
1. **TNT (Thinking-Based Non-Thinking)**: Dynamic per-query maximum token usage for non-thinking mode, derived from the solution component of thinking mode responses
2. **50% token reduction** vs DeepSeek-R1-Distill-Qwen while **improving accuracy** across 5 math benchmarks
3. **Optimal accuracy-efficiency trade-off** among all tested hybrid reasoning methods
4. **<10% reward hacking rate** across all datasets
5. Compatible with any RL algorithm (GRPO, PPO, DAPO, Dr.GRPO, GSPO)
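
One reading of the RL-algorithm compatibility claim is that the dynamic limit only affects the scalar reward, so the policy-gradient machinery needs no changes. The Python sketch below illustrates that idea along with one possible way to measure a hacking rate. The reward values, the `Rollout` fields, and the rate definition are assumptions made for illustration; this note does not record the paper's actual reward design or metric.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """Hypothetical record of one sampled response."""
    mode: str       # "thinking" or "non_thinking"
    n_tokens: int   # length of the response in tokens
    correct: bool   # whether the final answer is correct


def reward(r: Rollout, limit: int) -> float:
    """Assumed reward: a non-thinking response over the limit earns nothing."""
    if r.mode == "non_thinking" and r.n_tokens > limit:
        return 0.0
    return 1.0 if r.correct else 0.0


def hacking_rate(rollouts: list[Rollout], limit: int) -> float:
    """Fraction of non-thinking rollouts that exceed the per-query limit."""
    non_thinking = [r for r in rollouts if r.mode == "non_thinking"]
    if not non_thinking:
        return 0.0
    over = sum(r.n_tokens > limit for r in non_thinking)
    return over / len(non_thinking)


rollouts = [
    Rollout("non_thinking", 3, True),
    Rollout("non_thinking", 40, True),   # too long for a non-thinking reply
    Rollout("thinking", 200, True),
    Rollout("non_thinking", 5, False),
]
rewards = [reward(r, limit=10) for r in rollouts]
rate = hacking_rate(rollouts, limit=10)
```

The resulting `rewards` list could then be passed to whichever advantage estimator the chosen algorithm uses, since nothing in the sketch depends on GRPO, PPO, or any other specific method.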
## URL
https://arxiv.org/abs/2601.04805