---
title: "Thinking with Visual Primitives"
authors: "Ruijie Lu, Yiyang Ma, Xiaokang Chen (Project Lead), Lingxiao Luo, Zhiyu Wu, Zizheng Pan, Xingchao Liu, Yutong Lin, Hao Li, Wen Liu, Zhewen Hao, Xi Gao, Shaoheng Nie, Yixuan Wei, Zhenda Xie, Ting Chen, Gang Zeng"
affiliations: "DeepSeek-AI, Peking University, Tsinghua University"
year: 2026
source: "https://github.com/deepseek-ai/Thinking-with-Visual-Primitives"
domain: "Multimodal AI / Visual Reasoning"
tags: [visual-primitives, multimodal, chain-of-thought, spatial-reasoning, token-efficiency, deepseek]
---

# Thinking with Visual Primitives

**Authors:** Ruijie Lu¹²\*, Yiyang Ma¹\*, Xiaokang Chen¹\*‡, Lingxiao Luo¹³\*, Zhiyu Wu¹\*, Zizheng Pan¹\*, Xingchao Liu¹\*, Yutong Lin¹, Hao Li¹, Wen Liu¹, Zhewen Hao¹, Xi Gao¹, Shaoheng Nie¹, Yixuan Wei¹, Zhenda Xie¹, Ting Chen³, Gang Zeng²
- ¹ DeepSeek-AI, ² Peking University, ³ Tsinghua University
- \* Core contributors, ‡ Project lead

## Abstract

Despite remarkable progress in Multimodal Large Language Models (MLLMs), prevailing Chain-of-Thought (CoT) paradigms remain largely confined to the linguistic space. Recent work has focused on bridging the [[perception-gap]] through high-resolution cropping, but it overlooks a more fundamental bottleneck: the **[[reference-gap]]**. Natural language is inherently ambiguous and often cannot point precisely at elements of a complex spatial layout, which leads to logical collapse in tasks that require rigorous grounding.

In this work, the authors introduce **Thinking with Visual Primitives**, a reasoning framework that elevates spatial markers, such as points and bounding boxes, to "minimal units of thought". By interleaving these [[visual-primitives]] directly into the thinking process, the model can "point" while it "reasons", grounding its cognitive trajectory in the physical coordinates of the image.

The framework is built on a highly optimized architecture with extreme visual token efficiency: a 756×756 image is reduced to 81 KV-cache entries. Despite its compact model scale and significantly lower image-token budget, the model achieves frontier-competitive performance on challenging visual QA tasks, matching or exceeding models such as GPT-5.4, Claude-Sonnet-4.6, and Gemini-3-Flash.
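
To make the idea of interleaving primitives with text concrete, here is a minimal sketch of how a reasoning trace with embedded points and boxes might be parsed. The tag syntax (`<point>x,y</point>`, `<box>x1,y1,x2,y2</box>`), the class names, and `extract_primitives` are all hypothetical; this note does not specify the model's actual output format.

```python
import re
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Point:
    x: int
    y: int


@dataclass(frozen=True)
class Box:
    x1: int
    y1: int
    x2: int
    y2: int


Primitive = Union[Point, Box]

# Hypothetical markup, assumed for illustration only.
_TAG = re.compile(r"<(point|box)>\s*(\d+(?:\s*,\s*\d+)*)\s*</\1>")


def extract_primitives(trace: str) -> List[Primitive]:
    """Return the points and boxes embedded in a reasoning trace, in order."""
    found: List[Primitive] = []
    for kind, body in _TAG.findall(trace):
        nums = [int(n) for n in body.split(",")]
        if kind == "point" and len(nums) == 2:
            found.append(Point(*nums))
        elif kind == "box" and len(nums) == 4:
            found.append(Box(*nums))
        # Tags with the wrong number of coordinates are skipped.
    return found


trace = (
    "The left mug <box>12,40,88,130</box> sits behind "
    "the spoon tip <point>150, 96</point>, so the mug is farther away."
)
print(extract_primitives(trace))
```

Each reference in the text resolves to explicit coordinates rather than to a phrase such as "the left mug", which is the ambiguity the reference gap describes.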

## Key Concepts

- [[visual-primitives]] — Bounding boxes and points as minimal cognitive units
- [[reference-gap]] — Language ambiguity in spatial referencing
- [[perception-gap]] — Seeing vs. reasoning in MLLMs
- [[compressed-sparse-attention]] — KV cache compression (a further 4×; 7056× is the overall pixel-to-KV ratio)
- [[mixture-of-experts]] — 284B total / 13B active parameters
- [[specialized-sft]] — Train separate experts for box/point primitives
- [[specialized-rl]] — GRPO-based RL per expert
- [[group-relative-policy-optimization]] — RL algorithm
- [[unified-rft]] — Rejection Fine-Tuning to unify experts
- [[on-policy-distillation]] — Consolidating expert capabilities
- [[coarse-grained-counting]] — Category-level counting with boxes
- [[fine-grained-counting]] — Attribute-constrained counting
- [[maze-navigation]] — Topological reasoning with point primitives
- [[path-tracing]] — Curve following with visual primitives
- [[exponential-decay-reward]] — Smooth reward for counting accuracy
- [[bidirectional-trajectory-evaluation]] — Forward+reverse path scoring
- [[token-efficiency]] — 7056× overall compression from pixels to KV cache
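
As an illustration of what a smooth, exponentially decaying counting reward could look like, the sketch below scores a predicted count against the ground truth. The functional form `exp(-alpha * relative_error)`, the name `counting_reward`, and the default `alpha` are assumptions made for this example; the note only states that the reward is smooth and decays with counting error.

```python
import math


def counting_reward(predicted: int, target: int, alpha: float = 5.0) -> float:
    """Reward of 1.0 for an exact count, decaying smoothly as the error grows.

    The formula and the alpha default are illustrative assumptions,
    not values taken from the source.
    """
    if predicted < 0 or target < 0:
        raise ValueError("counts must be non-negative")
    # Relative error; max(target, 1) avoids division by zero for empty scenes.
    relative_error = abs(predicted - target) / max(target, 1)
    return math.exp(-alpha * relative_error)


for guess in (10, 9, 7, 3):
    print(guess, round(counting_reward(guess, 10), 3))
```

Compared with a binary exact-match signal, a reward of this shape still gives partial credit to near misses, which may provide a denser learning signal during RL.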

## Architecture

- Vision: [[deepseek-vit]] (in-house ViT, 14×14 patch, 3×3 spatial compression)
- Language: [[deepseek-v4-flash]] (284B MoE, 13B active)
- KV cache: [[compressed-sparse-attention]] — further 4× compression
- Overall compression: 756×756 image → 2,916 patches → 324 visual tokens → 81 KV entries (7056×)
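
The compression chain above can be checked with a few lines of arithmetic. The sketch below uses only the numbers listed in this section; the function name `visual_token_budget` and its parameter names are illustrative, and the 7056× figure is read here as pixels per KV entry (ignoring color channels).

```python
def visual_token_budget(height=756, width=756, patch=14, spatial=3, kv_factor=4):
    """Follow an image through the compression stages listed above."""
    pixels = height * width
    # 14x14 patches: 756 / 14 = 54 patches per side.
    patches = (height // patch) * (width // patch)
    # 3x3 spatial compression merges 9 patches into one visual token.
    visual_tokens = patches // (spatial * spatial)
    # The attention-side compression shrinks the KV cache by a further 4x.
    kv_entries = visual_tokens // kv_factor
    return {
        "pixels": pixels,
        "patches": patches,
        "visual_tokens": visual_tokens,
        "kv_entries": kv_entries,
        "pixels_per_kv_entry": pixels / kv_entries,
    }


print(visual_token_budget())
```

For the default 756×756 input this gives 2,916 patches, 324 visual tokens, 81 KV entries, and 7,056 pixels per KV entry, matching the figures in the list.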

## Training Pipeline

1. **Pretraining**: Web-scale data curation (97,984 sources → 31,701 after filtering, >40M samples) for visual primitive capabilities
2. **Specialized SFT**: Separate training for box-grounding (FTwG) and point-tracking (FTwP)
3. **Specialized RL**: GRPO with Format/Quality/Accuracy reward models
4. **Unified RFT**: On-policy rollouts → rejection sampling → unified SFT
5. **On-Policy Distillation**: KL-divergence consolidation of expert models
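
Two of the steps above lend themselves to a short sketch: the group-relative advantage at the core of GRPO (step 3) and the rejection step of RFT (step 4). The advantage follows the standard GRPO formulation, in which each rollout's reward is normalized by the mean and standard deviation of its sampling group. The use of the population standard deviation, the `eps` term, the helper names, and the reward threshold are assumptions for illustration, not details from this note.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; the estimator choice is an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


def rejection_filter(rollouts: Sequence[Tuple[str, float]], threshold: float = 1.0) -> List[str]:
    """Keep rollouts whose reward reaches the threshold (hypothetical criterion)."""
    return [text for text, reward in rollouts if reward >= threshold]


# Four rollouts for one prompt: two judged correct, two incorrect.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
print(rejection_filter([("trace A", 1.0), ("trace B", 0.0)]))
```

Because advantages are computed relative to the group, no separate value network is needed; a group whose rollouts all receive the same reward yields zero advantage for every member.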

## Key Results

- CountQA: 66.1/75.1 (EM/RA@10) vs Gemini-3-Flash 48.3/60.3
- Pixmo-Count: 89.2 EM
- SpatialMQA: 69.4 ACC
- DS_Maze_Navigation: 66.9 ACC (frontier models ~49-50)
- DS_Path_Tracing: 56.7 ACC (frontier models ~25-46)
- Token consumption: ~90 KV entries vs 660-1100 for frontier models
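
To put the token-consumption figures in relative terms, the short calculation below converts them into a reduction factor. It uses only the approximate numbers quoted in the list; the helper name `kv_reduction` is illustrative.

```python
from typing import Tuple


def kv_reduction(ours: float, others: Tuple[float, float]) -> Tuple[float, float]:
    """Return how many times more KV entries the comparison range uses."""
    low, high = others
    return (low / ours, high / ours)


low, high = kv_reduction(90, (660, 1100))
print(f"{low:.1f}x to {high:.1f}x fewer KV entries")
```

On these figures the model uses roughly 7 to 12 times fewer KV entries per image than the frontier models it is compared with.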