20260625: lots of new content

2026-06-25 14:08:47 +08:00
parent 91fac5b6fc
commit 6021dea160
375 changed files with 19263 additions and 251 deletions
--- a/raw/papers/gan-thinking-based-non-thinking-2026.md
+++ b/raw/papers/gan-thinking-based-non-thinking-2026.md
@@ -0,0 +1,39 @@
+---
+title: "Thinking-Based Non-Thinking: Solving the Reward Hacking Problem in Training Hybrid Reasoning Models via Reinforcement Learning"
+source: arXiv
+source_id: 2601.04805
+authors:
+  - Siyuan Gan (Nanjing University)
+  - Jiaheng Liu (Nanjing University)
+  - Boyan Wang (Nanjing University)
+  - Tianpei Yang (Nanjing University)
+  - Runqing Miao (Jiutian Research)
+  - Yuyao Zhang (Jiutian Research)
+  - Fanyu Meng (Jiutian Research)
+  - Junlan Feng (Jiutian Research)
+  - Linjian Meng (Shanghai AI Laboratory)
+  - Jing Huo (Nanjing University)
+  - Yang Gao (Nanjing University)
+published: 2026-01-08
+updated: 2026-06-07
+categories:
+  - cs.AI
+venue: Preprint
+---
+
+# Thinking-Based Non-Thinking (TNT)
+
+## Abstract
+Large reasoning models (LRMs) achieve strong performance through long Chain-of-Thought reasoning (thinking), but this incurs substantial computational overhead, known as the overthinking problem. Hybrid reasoning models trained with RL to choose dynamically between thinking and non-thinking modes suffer from **reward hacking**: the model generates a thinking-like response that is nonetheless classified as non-thinking, and so receives an undeserved reward.
+
+Existing mitigations either (1) apply SFT on large datasets, which is costly, or (2) impose a uniform token limit on non-thinking responses, which is ineffective because query difficulty varies. TNT instead proposes **per-query dynamic token limits** derived from the length of the thinking mode's solution component, exploiting the fact that in thinking mode the solution component contains no additional thinking.
+
+## Core Contributions
+1. **TNT (Thinking-Based Non-Thinking)**: Dynamic per-query maximum token usage for non-thinking mode, derived from the solution component of thinking mode responses
+2. **50% token reduction** vs DeepSeek-R1-Distill-Qwen while **improving accuracy** across 5 math benchmarks
+3. **Optimal accuracy-efficiency trade-off** among all tested hybrid reasoning methods
+4. **<10% reward hacking rate** across all datasets
+5. Compatible with any RL algorithm (GRPO, PPO, DAPO, Dr.GRPO, GSPO)
+
+## URL
+https://arxiv.org/abs/2601.04805
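
Reviewer note (not part of the committed file): the per-query limit summarized in the abstract above can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the `</think>` delimiter, the whitespace tokenizer, taking the maximum solution length over sampled thinking responses, and the zero reward for over-limit responses. All function names are hypothetical.

```python
THINK_END = "</think>"  # assumed delimiter between thinking and solution


def count_tokens(text: str) -> int:
    """Whitespace token count; a stand-in for the model's real tokenizer."""
    return len(text.split())


def solution_part(thinking_response: str) -> str:
    """Return the text after the thinking segment (the solution component).

    If the delimiter is absent, the whole response is treated as the solution.
    """
    _, sep, tail = thinking_response.partition(THINK_END)
    return tail if sep else thinking_response


def dynamic_limit(thinking_responses: list[str]) -> int:
    """Per-query cap on non-thinking length.

    Assumed aggregation: the longest solution component among the sampled
    thinking-mode responses for this query. Requires a non-empty list.
    """
    return max(count_tokens(solution_part(r)) for r in thinking_responses)


def non_thinking_reward(response: str, correct: bool, limit: int) -> float:
    """Assumed reward rule: an over-limit non-thinking response earns nothing."""
    if count_tokens(response) > limit:
        return 0.0  # treated as hidden thinking, i.e. a reward-hacking attempt
    return 1.0 if correct else 0.0


if __name__ == "__main__":
    sampled = [
        "<think>try adding the numbers</think> The answer is 4.",
        "<think>add</think> 2 + 2 = 4, so the answer is 4.",
    ]
    limit = dynamic_limit(sampled)
    print(limit)
    print(non_thinking_reward("The answer is 4.", True, limit))
```

Because the cap is computed per query, an easy question gets a tight cap while a question with a long worked solution gets a looser one, which is the contrast with a uniform token limit drawn in the abstract.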