title: Thinking with Visual Primitives
authors: Ruijie Lu, Yiyang Ma, Xiaokang Chen (Project Lead), Lingxiao Luo, Zhiyu Wu, Zizheng Pan, Xingchao Liu, Yutong Lin, Hao Li, Wen Liu, Zhewen Hao, Xi Gao, Shaoheng Nie, Yixuan Wei, Zhenda Xie, Ting Chen, Gang Zeng
affiliations: DeepSeek-AI, Peking University, Tsinghua University
year: 2026
source: https://github.com/deepseek-ai/Thinking-with-Visual-Primitives
domain: Multimodal AI / Visual Reasoning
tags: visual-primitives, multimodal, chain-of-thought, spatial-reasoning, token-efficiency, deepseek

Thinking with Visual Primitives

Authors: Ruijie Lu¹²*, Yiyang Ma¹*, Xiaokang Chen¹*‡, Lingxiao Luo¹³*, Zhiyu Wu¹*, Zizheng Pan¹*, Xingchao Liu¹*, Yutong Lin¹, Hao Li¹, Wen Liu¹, Zhewen Hao¹, Xi Gao¹, Shaoheng Nie¹, Yixuan Wei¹, Zhenda Xie¹, Ting Chen³, Gang Zeng²

  • ¹ DeepSeek-AI, ² Peking University, ³ Tsinghua University
  • * Core contributors, ‡ Project lead

Abstract

Despite the remarkable progress in Multimodal Large Language Models (MLLMs), the prevailing Chain-of-Thought (CoT) paradigms remain largely confined to the linguistic space. Recent work has focused on bridging the perception gap through high-resolution cropping, but it overlooks a more fundamental bottleneck: the reference gap. Natural language is inherently ambiguous and often cannot provide precise pointers into complex spatial layouts, which leads to logical collapse in tasks that require rigorous grounding.

In this work, the authors introduce Thinking with Visual Primitives, a reasoning framework that elevates spatial markers (such as points and bounding boxes) to "minimal units of thought". By interleaving these visual primitives directly into the thinking process, the model can "point" while it "reasons", grounding its reasoning trajectory in the pixel coordinates of the image.
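To make the idea of interleaved primitives concrete, here is a minimal sketch of pulling points and boxes out of a reasoning trace. The `<point>(x, y)</point>` and `<box>(x1, y1, x2, y2)</box>` tag syntax, the `Primitive` type, and `extract_primitives` are all hypothetical and chosen only for illustration; the summary above does not specify the model's actual output format.

```python
import re
from typing import NamedTuple


class Primitive(NamedTuple):
    kind: str      # "point" or "box"
    coords: tuple  # (x, y) or (x1, y1, x2, y2), in pixels


# Hypothetical tag syntax (an assumption, not the paper's format).
_PATTERN = re.compile(r"<(point|box)>\(\s*(\d+(?:\s*,\s*\d+)*)\s*\)</\1>")


def extract_primitives(trace: str) -> list[Primitive]:
    """Return the primitives found in a reasoning trace, in order."""
    found = []
    for match in _PATTERN.finditer(trace):
        kind = match.group(1)
        coords = tuple(int(v) for v in match.group(2).split(","))
        expected = 2 if kind == "point" else 4
        if len(coords) == expected:  # skip malformed primitives
            found.append(Primitive(kind, coords))
    return found


trace = (
    "The cup <point>(120, 340)</point> sits left of "
    "the plate <box>(200, 300, 420, 480)</box>."
)
print(extract_primitives(trace))
```

Each extracted primitive ties a phrase in the reasoning ("the cup", "the plate") to explicit coordinates, which is the kind of unambiguous reference that plain text alone does not give.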

The framework is built on a highly optimized architecture with extreme visual token efficiency. Despite its compact model scale and significantly lower image-token budget, the model achieves frontier-competitive performance on challenging visual QA tasks, matching or exceeding models such as GPT-5.4, Claude-Sonnet-4.6, and Gemini-3-Flash.

Key Concepts

Architecture

  • Vision: deepseek-vit (in-house ViT, 14×14 patch, 3×3 spatial compression)
  • Language: deepseek-v4-flash (284B MoE, 13B active)
  • KV cache: compressed-sparse-attention — further 4× compression
  • Overall compression: 756×756 image → 2,916 patches → 324 visual tokens → 81 KV entries (7,056×, counted from 571,536 input pixels to 81 KV entries)
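The compression chain above can be checked with a few lines of arithmetic. This is a sketch only: the function name and parameter names are illustrative, and it assumes the image side divides evenly by the patch size, as it does for 756 and 14.

```python
def visual_token_budget(side=756, patch=14, spatial=3, kv=4):
    """Return (patches, visual_tokens, kv_entries, pixels_per_kv_entry)."""
    grid = side // patch                            # 54 patches per side
    patches = grid * grid                           # 14x14 patching
    visual_tokens = patches // (spatial * spatial)  # 3x3 spatial compression
    kv_entries = visual_tokens // kv                # further 4x KV compression
    pixels_per_kv = (side * side) // kv_entries     # overall ratio
    return patches, visual_tokens, kv_entries, pixels_per_kv


print(visual_token_budget())  # (2916, 324, 81, 7056)
```

The 7,056× figure is therefore the product of the three stages: 196 pixels per patch, 9 patches per visual token, and 4 visual tokens per KV entry.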

Training Pipeline

  1. Pretraining: Web-scale data curation (97,984 sources → 31,701 after filtering, >40M samples) for visual primitive capabilities
  2. Specialized SFT: Separate training for box-grounding (FTwG) and point-tracking (FTwP)
  3. Specialized RL: GRPO with Format/Quality/Accuracy reward models
  4. Unified RFT: On-policy rollouts → rejection sampling → unified SFT
  5. On-Policy Distillation: KL-divergence consolidation of expert models
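For step 3, the core of GRPO is a group-relative advantage: several responses are sampled for the same prompt, and each reward is normalized against the group's mean and standard deviation. The sketch below shows only that normalization. How the Format, Quality, and Accuracy rewards are combined is not stated in this summary, so the equal-weight sum in `combined_reward` is an assumption, as are all function names.

```python
from statistics import mean, pstdev


def combined_reward(fmt, quality, accuracy, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three reward signals (weights are assumed)."""
    w_fmt, w_quality, w_acc = weights
    return w_fmt * fmt + w_quality * quality + w_acc * accuracy


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt, scored as (format, quality, accuracy).
scores = [(1, 1, 1), (1, 0, 0), (1, 1, 0), (0, 0, 0)]
rewards = [combined_reward(*s) for s in scores]
print(group_relative_advantages(rewards))
```

Rollouts that score above the group mean get positive advantages and are reinforced; those below get negative ones. If every rollout in a group receives the same reward, all advantages are zero and that group contributes no gradient signal.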

Key Results

  • CountQA: 66.1/75.1 (EM/RA@10) vs Gemini-3-Flash 48.3/60.3
  • Pixmo-Count: 89.2 EM
  • SpatialMQA: 69.4 ACC
  • DS_Maze_Navigation: 66.9 ACC (frontier models ~49-50)
  • DS_Path_Tracing: 56.7 ACC (frontier models ~25-46)
  • Token consumption: ~90 KV entries vs 660-1100 for frontier models
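The token-consumption figures translate into a rough savings factor. The helper below is a small illustrative calculation using only the numbers quoted above; the function name is hypothetical.

```python
def kv_savings(model_kv=90, frontier_range=(660, 1100)):
    """Return how many times more KV entries frontier models use (low, high)."""
    low, high = frontier_range
    return low / model_kv, high / model_kv


low, high = kv_savings()
print(f"{low:.1f}x to {high:.1f}x")  # 7.3x to 12.2x
```

On these figures, frontier models use roughly 7 to 12 times as many KV entries per image.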