Thinking with Visual Primitives

Authors: Ruijie Lu¹²*, Yiyang Ma¹*, Xiaokang Chen¹*‡, Lingxiao Luo¹³*, Zhiyu Wu¹*, Zizheng Pan¹*, Xingchao Liu¹*, Yutong Lin¹, Hao Li¹, Wen Liu¹, Zhewen Hao¹, Xi Gao¹, Shaoheng Nie¹, Yixuan Wei¹, Zhenda Xie¹, Ting Chen³, Gang Zeng²

¹ DeepSeek-AI, ² Peking University, ³ Tsinghua University
* Core contributors, ‡ Project lead

Abstract

Despite the remarkable progress in Multimodal Large Language Models (MLLMs), the prevailing Chain-of-Thought (CoT) paradigms remain predominantly confined to the linguistic space. While recent advancements have focused on bridging the perception-gap through high-resolution cropping, they overlook a more fundamental bottleneck: the reference-gap. The inherent ambiguity of natural language often fails to provide precise, unambiguous pointers to complex spatial layouts, leading to logical collapse in tasks requiring rigorous grounding.

In this work, the authors introduce Thinking with Visual Primitives, a novel reasoning framework that elevates spatial markers—such as points and bounding boxes—to "minimal units of thought". By interleaving these visual-primitives directly into the thinking process, the model can "point" while it "reasons", effectively grounding its cognitive trajectory in the physical coordinates of the image.

The framework is built on a highly optimized architecture with extreme visual token efficiency. Despite its compact model scale and significantly lower image-token budget, the model achieves frontier-competitive performance on challenging visual QA tasks, matching or exceeding models such as GPT-5.4, Claude-Sonnet-4.6, and Gemini-3-Flash.

Key Concepts

visual-primitives — Bounding boxes and points as minimal cognitive units
reference-gap — Language ambiguity in spatial referencing
perception-gap — Seeing vs. reasoning in MLLMs
compressed-sparse-attention — KV cache compression (4× on its own; 7056× is the overall pixel-to-KV ratio)
mixture-of-experts — 284B total / 13B active parameters
specialized-sft — Train separate experts for box/point primitives
specialized-rl — GRPO-based RL per expert
group-relative-policy-optimization — RL algorithm
unified-rft — Rejection Fine-Tuning to unify experts
on-policy-distillation — Consolidating expert capabilities
coarse-grained-counting — Category-level counting with boxes
fine-grained-counting — Attribute-constrained counting
maze-navigation — Topological reasoning with point primitives
path-tracing — Curve following with visual primitives
exponential-decay-reward — Smooth reward for counting accuracy
bidirectional-trajectory-evaluation — Forward+reverse path scoring
token-efficiency — 7056× overall compression from pixels to KV cache
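
To make the visual-primitives and coarse-grained-counting entries above more concrete, here is a minimal sketch of a reasoning trace that interleaves text with boxes. The `Point`, `Box`, `Step`, and `count_boxes` names, and the coordinate values, are hypothetical illustrations; the source does not specify the model's actual output format.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Point:
    """Hypothetical point primitive in image pixel coordinates."""
    x: int
    y: int


@dataclass(frozen=True)
class Box:
    """Hypothetical bounding-box primitive (top-left and bottom-right corners)."""
    x1: int
    y1: int
    x2: int
    y2: int

    def contains(self, p: Point) -> bool:
        # Inclusive containment check, handy for relating points to boxes.
        return self.x1 <= p.x <= self.x2 and self.y1 <= p.y <= self.y2


# A reasoning step is either plain text or a visual primitive.
Step = Union[str, Point, Box]


def count_boxes(trace: List[Step]) -> int:
    """Counting by grounding: the answer is the number of boxes emitted."""
    return sum(isinstance(step, Box) for step in trace)


# Illustrative trace: every counted object is pinned to its own box.
trace: List[Step] = [
    "First apple, top left:", Box(12, 20, 88, 96),
    "Second apple, partly hidden by the bowl:", Box(140, 35, 210, 110),
    "Third apple, bottom right:", Box(300, 220, 372, 290),
    "No other apples remain, so the answer is 3.",
]

print(count_boxes(trace))  # 3
```

Tying each counted object to an explicit box means the final count can be checked against the primitives rather than against a free-text description.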

Architecture

Vision: deepseek-vit (in-house ViT, 14×14 patch, 3×3 spatial compression)
Language: deepseek-v4-flash (284B MoE, 13B active)
KV cache: compressed-sparse-attention — further 4× compression
Overall compression: 756×756 image → 2,916 patches → 324 visual tokens → 81 KV entries (7056×)
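
The compression chain above can be checked with a few lines of arithmetic. This is a sketch that assumes a square image whose side is divisible by the patch size; the function name `visual_token_budget` and its parameter names are illustrative, not taken from the source.

```python
def visual_token_budget(image_side: int, patch: int = 14,
                        spatial: int = 3, kv_factor: int = 4) -> dict:
    """Follow an image through patching, spatial compression, and KV compression."""
    patches_per_side = image_side // patch          # 756 // 14 = 54
    patches = patches_per_side ** 2                 # ViT patches
    visual_tokens = patches // (spatial * spatial)  # 3x3 spatial compression
    kv_entries = visual_tokens // kv_factor         # further 4x in the KV cache
    pixels = image_side * image_side
    return {
        "patches": patches,
        "visual_tokens": visual_tokens,
        "kv_entries": kv_entries,
        "pixels_per_kv_entry": pixels // kv_entries,
    }


budget = visual_token_budget(756)
print(budget)
# {'patches': 2916, 'visual_tokens': 324, 'kv_entries': 81, 'pixels_per_kv_entry': 7056}
```

The 7056× figure is therefore the ratio of pixels to KV entries (571,536 / 81), of which the KV cache compression contributes only the final 4×.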

Training Pipeline

Pretraining: Web-scale data curation (97,984 sources → 31,701 after filtering, >40M samples) for visual primitive capabilities
Specialized SFT: Separate training for box-grounding (FTwG) and point-tracking (FTwP)
Specialized RL: GRPO with Format/Quality/Accuracy reward models
Unified RFT: On-policy rollouts → rejection sampling → unified SFT
On-Policy Distillation: KL-divergence consolidation of expert models
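
For the Specialized RL stage, the sketch below shows how a smooth counting reward could feed the group-relative advantage used by GRPO. The reward formula (exponential in the relative count error, with a hypothetical `alpha` parameter) is an assumption, since the source only describes it as a smooth reward for counting accuracy. The advantage follows the standard GRPO formulation of normalizing rewards within a group of rollouts for the same prompt; the policy update itself is omitted.

```python
import math
from statistics import mean, pstdev
from typing import List


def exp_decay_count_reward(predicted: int, target: int, alpha: float = 1.0) -> float:
    """Assumed reward shape: 1.0 for an exact count, decaying with relative error."""
    relative_error = abs(predicted - target) / max(target, 1)
    return math.exp(-alpha * relative_error)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward by the group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one counting prompt whose true answer is 10.
target_count = 10
rollout_counts = [10, 9, 12, 5]
rewards = [exp_decay_count_reward(c, target_count) for c in rollout_counts]
advantages = group_relative_advantages(rewards)

for count, reward, adv in zip(rollout_counts, rewards, advantages):
    print(f"predicted={count:2d}  reward={reward:.3f}  advantage={adv:+.3f}")
```

A smooth reward gives near-miss counts partial credit, so rollouts within a group still receive distinct advantages even when none of them is exactly right.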

Key Results

CountQA: 66.1/75.1 (EM/RA@10) vs Gemini-3-Flash 48.3/60.3
Pixmo-Count: 89.2 EM
SpatialMQA: 69.4 ACC
DS_Maze_Navigation: 66.9 ACC (frontier models ~49-50)
DS_Path_Tracing: 56.7 ACC (frontier models ~25-46)
Token consumption: ~90 KV entries vs 660-1100 for frontier models
