20260514: Add new content
55
concepts/token-efficiency.md
Normal file
@@ -0,0 +1,55 @@
---
title: "Token Efficiency"
domain: "Multimodal AI / Efficiency"
tags: [token-efficiency, visual-token, compression]
sources: [[thinking-with-visual-primitives]]
---

# Token Efficiency

> Achieving comparable or stronger reasoning with fewer visual tokens: the core architectural advantage of "Thinking with Visual Primitives".

## Motivation

Frontier multimodal models generally rely on large numbers of visual tokens to compensate for visual deficiencies:
- GPT-5.4: ~740 tokens/image
- Claude-Sonnet-4.6: ~870 tokens/image
- Gemini-3-Flash: ~1,100 tokens/image

A high token budget means:
- Longer inference latency
- A larger KV cache memory footprint
- Higher API cost

## DeepSeek's Approach

```
756×756 image
→ Patch Embedding (14×14): 2,916 tokens
→ 3×3 spatial compression: 324 visual tokens
→ CSA compression: 81 KV entries (~90 in KV cache)
```

**Overall compression ratio: 7056×** (756×756 = 571,536 pixels down to 81 KV entries)
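
The stage-by-stage counts above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the model's actual code: the function name and parameters are hypothetical, and the 4× CSA factor is an assumption inferred from the 324 → 81 step rather than something this note states. Reading 7056× as "pixels per KV entry" is likewise an inference that happens to match the numbers exactly.

```python
def visual_token_pipeline(image_side=756, patch_size=14, spatial_merge=3, csa_ratio=4):
    """Recompute the token counts at each stage (hypothetical helper).

    csa_ratio=4 is an assumption inferred from 324 -> 81; it is not stated in the note.
    """
    patches_per_side = image_side // patch_size            # 756 / 14 = 54
    patch_tokens = patches_per_side ** 2                   # one token per 14x14 patch
    visual_tokens = patch_tokens // (spatial_merge ** 2)   # merge each 3x3 neighbourhood
    kv_entries = visual_tokens // csa_ratio                # assumed CSA compression factor
    pixels = image_side ** 2
    return {
        "patch_tokens": patch_tokens,
        "visual_tokens": visual_tokens,
        "kv_entries": kv_entries,
        "pixels_per_kv_entry": pixels // kv_entries,
        "patch_tokens_per_kv_entry": patch_tokens // kv_entries,
    }


counts = visual_token_pipeline()
for stage, value in counts.items():
    print(f"{stage}: {value}")
```

Under these assumptions the token-level reduction from patch embedding to KV entries is 36×, while the 7056× figure only appears when raw pixels are taken as the starting point.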

## Performance Comparison

| Model | KV entries (approx.) | CountQA EM | SpatialMQA |
|-------|----------------------|------------|------------|
| **Ours** | **~90** | **66.1** | **69.4** |
| GPT-5.4 | ~740 | 48.3 | 61.9 |
| Gemini-3-Flash | ~1,100 | 34.8 | 58.2 |

> With 1/8 to 1/12 of the token budget, it achieves better or comparable performance.
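
The "1/8 to 1/12" range follows directly from the approximate budgets in the table. A small sketch of that check, using only the table's figures (the variable names are illustrative):

```python
# Approximate per-image budgets copied from the table above.
budgets = {"Ours": 90, "GPT-5.4": 740, "Gemini-3-Flash": 1100}

# How many times larger each baseline's budget is than ours.
ratios = {
    name: round(budget / budgets["Ours"], 1)
    for name, budget in budgets.items()
    if name != "Ours"
}

for name, ratio in ratios.items():
    print(f"{name}: about {ratio}x our budget")
```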

## Key Enabling Techniques

- [[compressed-sparse-attention|Compressed Sparse Attention]]: compression at the KV cache level
- [[deepseek-vit|DeepSeek-ViT]]: 3×3 spatial token compression
- [[visual-primitives|Visual Primitives]]: higher information density per token

## Related Concepts

- [[compressed-sparse-attention|Compressed Sparse Attention]]: the core compression mechanism
- [[deepseek-vit|DeepSeek-ViT]]: the vision encoder
- [[visual-primitives|Visual Primitives]]: increased information density