20260625: lots of new content
43
concepts/dynamic-token-limit.md
Normal file
@@ -0,0 +1,43 @@
---
title: "Dynamic Token Limit"
created: 2026-06-18
updated: 2026-06-18
type: concept
tags: [token-efficiency, hybrid-reasoning, reward-hacking]
sources:
  - gan-thinking-based-non-thinking-2026
---

# Dynamic Token Limit

A dynamic token limit is the core technique of TNT: the maximum number of tokens a non-thinking-mode response may use is set **per query**, instead of applying one uniform cap to all queries (Gan et al., 2026).
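
## Why a dynamic limit is needed

### Failure of a uniform cap (the AdaptThink approach)
AdaptThink (Zhang et al., 2025) sets the same small max-token value for every query:
- For a simple query, the solution part of a thinking-mode response may be **shorter than** 100 tokens
- For a complex query, a natural non-thinking answer may need **300+ tokens**
- A cap loose enough for complex queries **misses** reward hacking on simple ones, while a cap tight enough for simple queries **wrongly penalizes** legitimate non-thinking responses on complex ones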

### TNT's dynamic approach
```
L_N^x = ω × mean(solution_length of thinking_mode_responses for x)
```
- Simple query → small L_N^x → strict detection of reward hacking
- Complex query → large L_N^x → enough room for legitimate non-thinking responses
- ω = 2 leaves a 2× tolerance margin, so small deviations are not misjudged as reward hacking
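
The contrast with a uniform cap can be checked with a few lines of Python. This is only a sketch under stated assumptions: the token counts, the `UNIFORM_CAP` value, the function names, and the strict `>` comparison are illustrative choices, not values or code from the TNT paper.

```python
from statistics import mean

OMEGA = 2.0  # multiplier from the note above (ω = 2)


def dynamic_limit(thinking_solution_lengths, omega=OMEGA):
    """L_N^x = ω × mean solution length of the thinking-mode responses for x."""
    return omega * mean(thinking_solution_lengths)


def exceeds_limit(response_tokens, limit):
    """Assumption: a response is over the limit when strictly longer than it."""
    return response_tokens > limit


# Hypothetical solution lengths (in tokens) of thinking-mode responses.
simple_limit = dynamic_limit([70, 80, 90])       # 2 × 80  = 160.0
complex_limit = dynamic_limit([280, 300, 320])   # 2 × 300 = 600.0

UNIFORM_CAP = 200  # hypothetical single cap shared by all queries

# Simple query, 190-token non-thinking response: slips under the uniform cap,
# but is caught by the per-query limit.
print(exceeds_limit(190, UNIFORM_CAP), exceeds_limit(190, simple_limit))

# Complex query, 350-token non-thinking response: blocked by the uniform cap,
# but allowed by the per-query limit.
print(exceeds_limit(350, UNIFORM_CAP), exceeds_limit(350, complex_limit))
```

With these made-up numbers, the same uniform cap is too loose for the simple query and too tight for the complex one, while the per-query limit handles both.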
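
## Implementation details

- At each training step, K responses are sampled for every prompt x
- The mean solution length is computed over the set of thinking-mode responses M_T^x
- If M_T^x is empty (on-policy sampling produced no thinking-mode response), the limit falls back to L_∅ = 1000
- Training uses a token-level policy gradient (GRPO)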
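
The per-prompt bookkeeping in the list above might look roughly like the following. It is a minimal sketch, not the paper's implementation: the `Sample` record, the function names, and the decision to simply flag over-limit non-thinking responses are assumptions made for illustration. How a flagged response affects the reward is not covered here.

```python
from dataclasses import dataclass
from statistics import mean

OMEGA = 2.0            # ω from the note above
FALLBACK_LIMIT = 1000  # L_∅, used when no thinking-mode response was sampled


@dataclass
class Sample:
    """Hypothetical record for one of the K responses sampled for a prompt."""
    thinking: bool        # True if the response was produced in thinking mode
    solution_tokens: int  # length of the solution part, in tokens


def limit_for_prompt(samples, omega=OMEGA, fallback=FALLBACK_LIMIT):
    """Compute L_N^x from the thinking-mode samples M_T^x of one prompt."""
    thinking_lengths = [s.solution_tokens for s in samples if s.thinking]
    if not thinking_lengths:  # M_T^x is empty
        return float(fallback)
    return omega * mean(thinking_lengths)


def over_limit_flags(samples):
    """Flag non-thinking samples whose length exceeds the prompt's limit."""
    limit = limit_for_prompt(samples)
    return [(not s.thinking) and s.solution_tokens > limit for s in samples]


# K = 4 samples, two of them in thinking mode: limit = 2 × mean(100, 140) = 240
mixed = [Sample(True, 100), Sample(True, 140), Sample(False, 200), Sample(False, 300)]
print(limit_for_prompt(mixed), over_limit_flags(mixed))

# No thinking-mode sample at all: the limit falls back to 1000
no_thinking = [Sample(False, 900), Sample(False, 1200)]
print(limit_for_prompt(no_thinking), over_limit_flags(no_thinking))
```

Because the limit is recomputed from on-policy samples at every step, it follows the current policy's solution lengths rather than a fixed estimate.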

## References

- [[thinking-based-non-thinking|TNT]]
- [[reward-hacking|Reward Hacking]]
- [[token-efficiency|Token Efficiency]]
- [[gan-thinking-based-non-thinking-2026|TNT paper]]