# DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
> **Source**: Hugging Face (technical report)
> **Authors**: DeepSeek-AI
> **Date**: 2026
> **Link**: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf
> **Models**: DeepSeek-V4-Pro (1.6T total / 49B activated parameters), DeepSeek-V4-Flash (284B total / 13B activated parameters)
## Abstract
We present a preview version of the DeepSeek-V4 series, comprising two strong Mixture-of-Experts (MoE) language models: DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated). Both support a context length of one million tokens.
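To put the total and activated parameter counts in perspective, the share of weights used per token follows directly from the figures in the abstract. The short calculation below is only illustrative arithmetic; the dictionary and function names are hypothetical and do not come from the report. It works out to roughly 3.1% for DeepSeek-V4-Pro and 4.6% for DeepSeek-V4-Flash.

```python
# Parameter counts as stated in the abstract, in billions.
MODELS = {
    "DeepSeek-V4-Pro": {"total_b": 1600.0, "activated_b": 49.0},
    "DeepSeek-V4-Flash": {"total_b": 284.0, "activated_b": 13.0},
}


def activated_fraction(total_b: float, activated_b: float) -> float:
    """Share of an MoE model's parameters that is used for each token."""
    return activated_b / total_b


for name, p in MODELS.items():
    frac = activated_fraction(p["total_b"], p["activated_b"])
    print(f"{name}: {frac:.1%} of parameters activated per token")
```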
## Key Upgrades over DeepSeek-V3
1. **Hybrid attention architecture**: Compressed Sparse Attention (CSA) + Heavily Compressed Attention (HCA) for long-context efficiency
2. **Manifold-Constrained Hyper-Connections (mHC)**: Upgrades conventional residual connections for stability and expressivity
3. **Muon optimizer**: Faster convergence and greater training stability
## Architecture Summary
- Retains the DeepSeekMoE framework (fine-grained and shared experts) and Multi-Token Prediction (MTP)
- Hybrid CSA/HCA: CSA compresses the KV cache along the sequence dimension, then applies sparse attention; HCA applies aggressive compression with dense attention
- mHC constrains the residual mapping to doubly stochastic matrices (the Birkhoff polytope) via the Sinkhorn-Knopp algorithm
- Muon with hybrid Newton-Schulz orthogonalization for most modules; AdamW for embeddings, heads, biases, and RMSNorm
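The mHC and Muon bullets each rest on a short iterative matrix routine, and a minimal NumPy sketch of both may help make them concrete. Everything here is an assumption for illustration: the function names, the iteration counts, and the use of a plain exponential to obtain positive entries are not taken from the report. The Newton-Schulz routine is the textbook cubic iteration, not the "hybrid" scheme the report mentions, and Muon implementations may use other polynomial variants.

```python
import numpy as np


def sinkhorn_knopp(logits: np.ndarray, n_iters: int = 100) -> np.ndarray:
    """Map an unconstrained square matrix to an approximately doubly stochastic one."""
    # Exponentiate to get strictly positive entries (shifted by the max for stability).
    m = np.exp(logits - logits.max())
    for _ in range(n_iters):
        m = m / m.sum(axis=1, keepdims=True)  # normalize rows to sum to 1
        m = m / m.sum(axis=0, keepdims=True)  # normalize columns to sum to 1
    return m


def newton_schulz_orthogonalize(g: np.ndarray, n_iters: int = 30) -> np.ndarray:
    """Approximate the polar factor U V^T of g = U S V^T with the cubic iteration."""
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # which is inside the convergence region of the iteration.
    x = g / (np.linalg.norm(g) + 1e-12)
    for _ in range(n_iters):
        x = 1.5 * x - 0.5 * x @ x.T @ x  # pushes each singular value toward 1
    return x


rng = np.random.default_rng(0)
h = sinkhorn_knopp(rng.normal(size=(4, 4)))
print(h.sum(axis=0), h.sum(axis=1))  # column sums and row sums

g = np.array([[2.0, 1.0], [0.0, 3.0], [1.0, 1.0]])
q = newton_schulz_orthogonalize(g)
print(q.T @ q)  # Gram matrix of the orthogonalized update
```

In this sketch the Sinkhorn-Knopp output has non-negative entries with rows and columns summing to approximately one, which is the defining property of the Birkhoff polytope. The Newton-Schulz output keeps the singular vectors of the input while driving its singular values toward one, which is the orthogonalization step Muon applies to its update matrix.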
## Infrastructure Highlights
- Fine-grained communication-computation overlap in Expert Parallelism (1.5-1.73x speedup)
- MegaMoE2 mega-kernel (open-sourced)
- TileLang DSL with Z3 SMT solver integration
- Batch-invariant and deterministic kernel libraries
- FP4 quantization-aware training for MoE experts
- Inference: heterogeneous KV cache with on-disk storage
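For the FP4 quantization-aware training bullet, the forward-pass idea can be sketched as a quantize-then-dequantize ("fake quantization") step. This summary does not state which FP4 encoding, block size, or scaling scheme the report uses, so the E2M1 value grid, the per-block absolute-maximum scaling, the block size of 32, and the function name below are all assumptions chosen for illustration.

```python
import numpy as np

# Non-negative magnitudes representable in an E2M1 FP4 format (assumed encoding).
FP4_E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def fake_quantize_fp4(w: np.ndarray, block_size: int = 32) -> np.ndarray:
    """Quantize and dequantize w block-wise so the forward pass sees FP4-rounded weights."""
    flat = w.reshape(-1, block_size)  # requires w.size to be a multiple of block_size
    # One scale per block maps the largest magnitude onto the top grid value (6.0).
    scale = np.abs(flat).max(axis=1, keepdims=True) / FP4_E2M1_GRID[-1]
    scale = np.where(scale == 0.0, 1.0, scale)  # all-zero blocks keep a dummy scale
    scaled = np.abs(flat) / scale
    # Snap every scaled magnitude to its nearest grid point.
    idx = np.abs(scaled[..., None] - FP4_E2M1_GRID).argmin(axis=-1)
    return (np.sign(flat) * FP4_E2M1_GRID[idx] * scale).reshape(w.shape)


rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 64))
quantized = fake_quantize_fp4(weights)
print(np.abs(weights - quantized).max())  # worst-case rounding error in this sample
```

Quantization-aware training commonly pairs such a forward pass with a straight-through estimator, which treats the rounding as the identity in the backward pass. NumPy has no automatic differentiation, so only the forward rounding is shown here.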
## Pre-Training
- DeepSeek-V4-Flash is pre-trained on 32T tokens; DeepSeek-V4-Pro on 33T tokens
- Both natively support 1M-token contexts after pre-training
## Post-Training Pipeline
Two-stage paradigm:
1. **Specialist Training**: Independent expert models trained per domain (math, coding, agent, instruction following) via SFT and RL (GRPO)
2. **On-Policy Distillation (OPD)**: Multi-teacher reverse-KL distillation that merges the experts' capabilities into a unified model
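Both stages lean on compact formulas that are easy to state in code. The sketch below shows the group-relative advantage commonly associated with GRPO (rewards standardized within the group of samples drawn for one prompt) and a per-position reverse KL divergence, KL(student || teacher), of the kind an on-policy distillation loss would average over student-generated tokens. The function names and shapes are hypothetical, and how the report selects or weights multiple teachers is not specified in this summary, so a single teacher is assumed per call.

```python
import numpy as np


def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Standardize rewards within one prompt's group of sampled responses."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def reverse_kl(student_logits: np.ndarray, teacher_logits: np.ndarray) -> np.ndarray:
    """Per-position KL(student || teacher); the expectation is under the student."""
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)


# Four sampled responses to one prompt, two of them rewarded.
print(grpo_advantages(np.array([1.0, 0.0, 0.0, 1.0])))

rng = np.random.default_rng(0)
student = rng.normal(size=(5, 16))  # 5 positions, toy vocabulary of 16
teacher = rng.normal(size=(5, 16))
print(reverse_kl(student, teacher))  # non-negative, zero only where the distributions match
```

Because the expectation is taken under the student's own distribution, the reverse direction penalizes the student for placing probability where the teacher places little, which is why it is usually described as mode-seeking.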
## Key Evaluation Results
- **Knowledge (SimpleQA, MMLU-Pro, HLE, GPQA)**: Significantly outperforms open-source models; closing the gap with Gemini-3.1-Pro
- **Reasoning**: Superior to GPT-5.2 and Gemini-3.0-Pro; trails GPT-5.4 and Gemini-3.1-Pro by roughly 3-6 months
- **Agent**: On par with Kimi-K2.6 and GLM-5.1; outperforms Claude Sonnet 4.5 in internal evaluation
- **Long-Context**: Surpasses Gemini-3.1-Pro on academic benchmarks at 1M tokens
- **Chinese Writing**: 62.7% win rate vs Gemini-3.1-Pro
## Efficiency (1M-token context vs DeepSeek-V3.2)
- DeepSeek-V4-Pro: 27% of the FLOPs and 10% of the KV cache
- DeepSeek-V4-Flash: 10% of the FLOPs and 7% of the KV cache
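Read as reduction factors, the percentages above mean that DeepSeek-V4-Pro needs about 3.7x fewer FLOPs and a 10x smaller KV cache than DeepSeek-V3.2 at a 1M-token context, while DeepSeek-V4-Flash needs 10x fewer FLOPs and a roughly 14.3x smaller KV cache. The conversion is simple arithmetic; the names in this snippet are illustrative only.

```python
# Cost at 1M-token context as a fraction of DeepSeek-V3.2 (figures from the list above).
RELATIVE_COST = {
    "DeepSeek-V4-Pro": {"flops": 0.27, "kv_cache": 0.10},
    "DeepSeek-V4-Flash": {"flops": 0.10, "kv_cache": 0.07},
}


def reduction_factor(relative_cost: float) -> float:
    """Convert 'X% of the baseline' into an 'N times lower' factor."""
    return 1.0 / relative_cost


for model, costs in RELATIVE_COST.items():
    for metric, rel in costs.items():
        print(f"{model} {metric}: {reduction_factor(rel):.1f}x lower than DeepSeek-V3.2")
```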
---
*Format: Raw paper archive. See [[deepseek-v4-million-token-context]] for the wiki page.*
*Last Updated: 2026-04-27*