20260429: Some new things
62
raw/papers/deepseek-ai-deepseek-v4-2026.md
Normal file
@@ -0,0 +1,62 @@
# DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

> **Source**: Hugging Face (technical report)
> **Authors**: DeepSeek-AI
> **Date**: 2026
> **Link**: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf
> **Models**: DeepSeek-V4-Pro (1.6T total / 49B activated), DeepSeek-V4-Flash (284B total / 13B activated)

## Abstract

We present a preview version of the DeepSeek-V4 series, including two strong Mixture-of-Experts (MoE) language models: DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated). Both support a context length of one million tokens.

## Key Upgrades over DeepSeek-V3

1. **Hybrid attention architecture**: Compressed Sparse Attention (CSA) combined with Heavily Compressed Attention (HCA) for long-context efficiency
2. **Manifold-Constrained Hyper-Connections (mHC)**: Replace conventional residual connections to improve stability and expressivity
3. **Muon optimizer**: Faster convergence and greater training stability
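
Muon's central step replaces a raw momentum matrix with an approximately orthogonal one by running a Newton-Schulz iteration. The following is a minimal sketch of that idea using the classic cubic iteration `X <- 1.5 X - 0.5 X Xᵀ X`. It is not the hybrid variant used in the report, whose coefficients this note does not give. The function names and the step count are illustrative assumptions.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g, steps=20):
    """Approximate the nearest semi-orthogonal matrix to g.

    Assumes g is non-zero, has full rank, and has no more rows than columns.
    The classic cubic iteration is used here for illustration only.
    """
    fro = math.sqrt(sum(v * v for row in g for v in row))
    if fro == 0.0:
        raise ValueError("cannot orthogonalize the zero matrix")
    # Scale by the Frobenius norm so that every singular value is at most 1,
    # which keeps the iteration inside its convergence region.
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(ra, rb)] for ra, rb in zip(x, xxt_x)]
    return x


if __name__ == "__main__":
    update = newton_schulz_orthogonalize([[3.0, 0.0, 1.0], [0.0, 2.0, 1.0]])
    # Rows of the result should be (nearly) orthonormal.
    print(matmul(update, transpose(update)))
```

Each step pushes every singular value toward 1 while leaving the singular vectors unchanged, so the result keeps the "direction" of the update but discards its scale.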

## Architecture Summary

- Retains the DeepSeekMoE framework (fine-grained experts plus shared experts) and Multi-Token Prediction (MTP)
- Hybrid CSA/HCA: CSA compresses the KV cache along the sequence dimension and then applies sparse attention; HCA compresses more aggressively and applies dense attention
- mHC constrains the residual mapping to doubly stochastic matrices (the Birkhoff polytope) via the Sinkhorn-Knopp algorithm
- Muon with hybrid Newton-Schulz orthogonalization for most modules; AdamW for embeddings, heads, biases, and RMSNorm
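
The Sinkhorn-Knopp step named in the mHC bullet can be sketched as follows: alternately normalize the rows and columns of a strictly positive square matrix until both sets of sums are close to 1. How the report parameterizes the residual mapping before this projection is not stated in this note, so the input matrix, function name, and tolerance below are assumptions for illustration.

```python
def sinkhorn_knopp(matrix, max_iters=500, tol=1e-10):
    """Scale a strictly positive square matrix toward a doubly stochastic one."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("matrix must be square")
    if any(v <= 0 for row in matrix for v in row):
        raise ValueError("entries must be strictly positive")
    a = [[float(v) for v in row] for row in matrix]
    for _ in range(max_iters):
        # Row normalization: every row sums to 1 afterwards.
        row_sums = [sum(row) for row in a]
        a = [[v / s for v in row] for row, s in zip(a, row_sums)]
        # Column normalization: every column sums to 1 afterwards.
        col_sums = [sum(col) for col in zip(*a)]
        a = [[v / s for v, s in zip(row, col_sums)] for row in a]
        # Columns are exact at this point, so only the rows need checking.
        if max(abs(sum(row) - 1.0) for row in a) < tol:
            break
    return a


if __name__ == "__main__":
    mixing = sinkhorn_knopp([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 10.0]])
    for row in mixing:
        print([round(v, 4) for v in row])
```

A doubly stochastic matrix is a convex combination of permutation matrices, so a residual mixing matrix of this form can only redistribute signal across streams. It cannot amplify it, which is one plausible reading of the stability benefit claimed above.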

## Infrastructure Highlights

- Fine-grained communication-computation overlap in Expert Parallelism (1.5-1.73x speedup)
- MegaMoE2 mega-kernel (open-sourced)
- TileLang DSL with Z3 SMT solver integration
- Batch-invariant and deterministic kernel libraries
- FP4 quantization-aware training for MoE experts
- Inference: heterogeneous KV cache with on-disk storage

## Pre-Training

- DeepSeek-V4-Flash: 32T tokens; DeepSeek-V4-Pro: 33T tokens
- Both natively support 1M-token contexts after pre-training

## Post-Training Pipeline

Two-stage paradigm:
1. **Specialist Training**: Independent expert models are trained per domain (math, coding, agent, instruction following) via SFT followed by RL (GRPO)
2. **On-Policy Distillation (OPD)**: Multi-teacher reverse-KL distillation merges the experts' capabilities into a single unified model
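
The reverse-KL objective named in the second stage can be illustrated at the level of a single token position. The sketch below computes KL(student || teacher) from raw logits and contrasts it with the forward direction. It is a toy under stated assumptions: the function names and logits are hypothetical, and the way the report selects or combines multiple teachers is not described in this note.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher): the expectation is taken under the student."""
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_s, log_t))


def forward_kl(student_logits, teacher_logits):
    """KL(teacher || student), shown only for contrast."""
    return reverse_kl(teacher_logits, student_logits)


if __name__ == "__main__":
    student = [0.0, 0.0, 5.0]   # hypothetical student logits, peaked on token 2
    teacher = [0.0, 0.0, 0.0]   # hypothetical teacher logits, uniform
    print("reverse KL:", reverse_kl(student, teacher))
    print("forward KL:", forward_kl(student, teacher))
```

Because the expectation is taken under the student's own distribution, the loss is evaluated on what the student actually generates, which is what makes the distillation on-policy. Reverse KL is also mode-seeking: the student is penalized for putting mass where the teacher has little, not for failing to cover every teacher mode.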

## Key Evaluation Results

- **Knowledge (SimpleQA, MMLU-Pro, HLE, GPQA)**: Significantly outperforms open-source models; closing the gap with Gemini-3.1-Pro
- **Reasoning**: Superior to GPT-5.2 and Gemini-3.0-Pro; trails GPT-5.4 and Gemini-3.1-Pro by roughly 3 to 6 months
- **Agent**: On par with Kimi-K2.6 and GLM-5.1; outperforms Claude Sonnet 4.5 in an internal evaluation
- **Long-Context**: Surpasses Gemini-3.1-Pro on academic benchmarks at 1M tokens
- **Chinese Writing**: 62.7% win rate vs Gemini-3.1-Pro

## Efficiency (1M-token context vs DeepSeek-V3.2)

- DeepSeek-V4-Pro: 27% of the FLOPs, 10% of the KV cache
- DeepSeek-V4-Flash: 10% of the FLOPs, 7% of the KV cache
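
Read as fractions of the DeepSeek-V3.2 cost at a 1M-token context (an interpretation taken from the section heading), these percentages convert to reduction factors by taking reciprocals. The short sketch below does only that arithmetic; the dictionary layout and function name are illustrative.

```python
# Fractions of DeepSeek-V3.2 cost at 1M tokens, as listed in this note.
RELATIVE_COST = {
    "DeepSeek-V4-Pro": {"flops": 0.27, "kv_cache": 0.10},
    "DeepSeek-V4-Flash": {"flops": 0.10, "kv_cache": 0.07},
}


def reduction_factor(fraction):
    """Convert 'uses this fraction of the baseline' into 'N times smaller'."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return 1.0 / fraction


if __name__ == "__main__":
    for model, costs in RELATIVE_COST.items():
        for resource, fraction in costs.items():
            print(f"{model} {resource}: about {reduction_factor(fraction):.1f}x less")
```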

---

*Format: Raw paper archive. See [[deepseek-v4-million-token-context]] for the wiki page.*
*Last Updated: 2026-04-27*
46
raw/papers/godel-tutorial-2026.md
Normal file
@@ -0,0 +1,46 @@
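# Gödel's Incompleteness Theorems Tutorial: Raw Archive

- **Title**: Gödel's Incompleteness Theorems Tutorial: From Gödel Numbering to Exploring the Boundaries of Artificial Intelligence
- **Type**: Comprehensive tutorial / teaching material (for undergraduate mathematics students)
- **Date**: April 2026
- **Language**: Chinese
- **Pages**: 43 (including appendices)
- **Source**: PDF submitted directly
- **File**: godel_tutorial.pdf

## Abstract

Gödel's incompleteness theorems are among the most profound results of 20th-century mathematics and logic. In 1931, the Austrian logician Kurt Gödel, then only 25 years old, proved two far-reaching theorems in his paper:

- **First incompleteness theorem**: Any consistent, effectively axiomatized formal system that contains Peano arithmetic has true statements that can be neither proved nor refuted within the system.
- **Second incompleteness theorem**: Any consistent, effectively axiomatized formal system that contains Peano arithmetic cannot prove its own consistency from within the system.

This tutorial is aimed at undergraduate mathematics students. Starting from the historical background of Hilbert's program, it systematically covers the origins, core content, and proof techniques of Gödel's incompleteness theorems, along with their far-reaching impact on the foundations of mathematics, computer science, and philosophy.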
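
For reference, the two theorems above can be written compactly in standard textbook notation. The symbols are assumptions of this note rather than the tutorial's own: $T$ is a consistent, effectively axiomatized theory extending Peano arithmetic $\mathsf{PA}$, $\mathrm{Prov}_T$ is its provability predicate, and $\ulcorner \varphi \urcorner$ is the Gödel number of $\varphi$.

$$
\mathsf{PA} \vdash G_T \leftrightarrow \neg\,\mathrm{Prov}_T(\ulcorner G_T \urcorner),
\qquad T \nvdash G_T,
\qquad T \nvdash \mathrm{Con}(T),
\quad\text{where } \mathrm{Con}(T) := \neg\,\mathrm{Prov}_T(\ulcorner 0 = 1 \urcorner).
$$

The remaining half of the first theorem, $T \nvdash \neg G_T$, needs slightly more: Gödel's original argument assumes $\omega$-consistency, while Rosser's modified sentence gets by with plain consistency.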
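
## Chapter Structure

1. **Historical background**: Hilbert's program and the crisis in mathematics (set-theoretic paradoxes, the three schools, Gödel's life)
2. **Gödel's first incompleteness theorem**: Formal systems, Gödel coding, representability, primitive recursive functions, proof outline
3. **Gödel's second incompleteness theorem**: Formalizing the consistency statement, proof sketch
4. **Proof techniques in detail**: Gödel numbering, the diagonal substitution function Sub, construction of the self-referential sentence G
5. **Impact on the foundations of mathematics**: The end of Hilbert's program, independence of the continuum hypothesis, the decline of formalism and the rise of pluralism
6. **Impact on computer science**: Computability theory, the halting problem, formal verification, automated theorem proving
7. **Philosophical impact and the human mind**: The nature of mathematical truth, the Lucas-Penrose argument, limits of knowledge, the Gödel universe
8. **Uses and misuses**: Discussions in physics, discussions in AI, clarification of common misconceptions
9. **Modern developments**: The Paris-Harrington theorem, Goodstein's theorem, Chaitin's algorithmic information theory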
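
The Gödel numbering listed under chapter 4 can be made concrete with the usual prime-power coding: a sequence of positive integers $(a_1, \dots, a_n)$ is mapped to $2^{a_1} \cdot 3^{a_2} \cdots p_n^{a_n}$, and unique factorization lets it be recovered. This is a minimal sketch of that standard scheme; the tutorial's own symbol table is not reproduced in this note, so the symbol codes below are hypothetical.

```python
def first_primes(count):
    """Return the first `count` primes by trial division (fine for small inputs)."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def godel_encode(codes):
    """Encode a sequence of positive integers as a product of prime powers."""
    if any(c < 1 for c in codes):
        raise ValueError("symbol codes must be positive integers")
    number = 1
    for prime, code in zip(first_primes(len(codes)), codes):
        number *= prime ** code
    return number


def godel_decode(number):
    """Recover the sequence from its Gödel number by repeated division."""
    if number < 1:
        raise ValueError("Gödel numbers are positive")
    codes = []
    index = 1
    while number > 1:
        prime = first_primes(index)[-1]
        exponent = 0
        while number % prime == 0:
            number //= prime
            exponent += 1
        if exponent == 0:
            raise ValueError("not the code of a sequence of positive integers")
        codes.append(exponent)
        index += 1
    return codes


if __name__ == "__main__":
    formula = [3, 1, 2]                 # hypothetical symbol codes
    n = godel_encode(formula)           # 2**3 * 3**1 * 5**2
    print(n, godel_decode(n))
```

Since encoding and decoding are both mechanical, statements about formulas and proofs become statements about numbers, which is what allows a formal system of arithmetic to talk about its own syntax.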

## Key Concepts

[[godel-incompleteness-theorems]] · [[godel-numbering]] · [[hilberts-program]] · [[peano-arithmetic]] · [[self-reference]] · [[diagonalization-method]] · [[halting-problem]] · [[lucas-penrose-argument]] · [[chaitin-algorithmic-information-theory]] · [[metamathematics]]

## Selected References

- Gödel, K. (1931). Über formal unentscheidbare Sätze...
- Nagel, E. & Newman, J. R. (1958). Gödel's Proof
- Hofstadter, D. R. (1979). Gödel, Escher, Bach
- Smullyan, R. M. (1992). Gödel's Incompleteness Theorems
- Franzén, T. (2005). Gödel's Theorem: An Incomplete Guide to Its Use and Abuse
- Paris, J. & Harrington, L. (1977). A Mathematical Incompleteness in Peano Arithmetic
- Chaitin, G. J. (1974). Information-Theoretic Limitations of Formal Systems
- Lucas, J. R. (1961). Minds, Machines and Gödel
- Penrose, R. (1989). The Emperor's New Mind
38
raw/papers/llm-attention-survey-2026.md
Normal file
@@ -0,0 +1,38 @@
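# A Comprehensive Analysis of Attention Mechanisms in Large Language Models

- **Type**: Review paper
- **Date**: April 2026
- **Source**: PDF uploaded directly
- **Filename**: LLM注意力机制全面分析
- **Tags**: #attention-mechanism #LLM #transformer #survey

## Abstract

The attention mechanism is the core component of the Transformer architecture and a key factor behind the breakthroughs of large language models (LLMs). This paper surveys attention in LLMs along several dimensions: mathematical principles, a taxonomy of mechanisms, practical problems, and solutions. First, starting from the mathematical basis of scaled dot-product attention, it derives the formulations of self-attention, multi-head attention, and their variants. Second, it traces the development from standard multi-head attention (MHA) to multi-query attention (MQA), grouped-query attention (GQA), multi-head latent attention (MLA), and the various sparse and linear attention architectures. Third, it analyzes the core challenges that attention currently faces: quadratic computational complexity, the KV-cache memory bottleneck, attention entropy collapse, the long-context "Lost in the Middle" phenomenon, and hallucinations caused by attention drift. Finally, it reviews state-of-the-art solutions, including the FlashAttention series, KV-cache compression and quantization, sparse attention optimization, architectural innovations, and improved training strategies.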
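
The scaled dot-product attention that the abstract takes as its starting point is $\mathrm{softmax}(QK^\top / \sqrt{d_k})\,V$. A minimal single-head sketch in plain Python follows. It omits masking, batching, and learned projections, and the function names are illustrative, not taken from the survey.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q, k, v):
    """Single-head attention over lists of row vectors.

    q: queries (n_q x d_k), k: keys (n_kv x d_k), v: values (n_kv x d_v).
    Returns (outputs, weights).
    """
    scale = 1.0 / math.sqrt(len(k[0]))
    outputs, weights = [], []
    for query in q:
        # Similarity of this query to every key, scaled by 1/sqrt(d_k).
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in k]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([sum(wj * vj[c] for wj, vj in zip(w, v)) for c in range(len(v[0]))])
    return outputs, weights


if __name__ == "__main__":
    out, attn = scaled_dot_product_attention(
        q=[[1.0, 0.0]],
        k=[[1.0, 0.0], [1.0, 0.0]],
        v=[[2.0, 0.0], [4.0, 2.0]],
    )
    print(attn, out)
```

The score matrix has one entry per query-key pair, which is where the quadratic cost in sequence length mentioned in the abstract comes from.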

## Key Concepts

- [[multi-head-attention]] (MHA): the standard multi-head attention mechanism
- [[multi-query-attention]] (MQA): an attention variant in which all query heads share one KV head
- [[grouped-query-attention]] (GQA): a compromise between MHA and MQA
- [[multi-head-latent-attention]] (MLA): low-rank compression of the KV cache
- [[flash-attention]]: IO-aware attention optimization
- [[attention-entropy-collapse]]: attention degeneration and entropy collapse
- [[kv-cache-bottleneck]]: the KV-cache memory bottleneck
- [[lost-in-the-middle]]: information loss in long contexts
- [[sparse-attention-patterns]]: sparse attention patterns
- [[linear-attention-methods]]: linear attention and alternative architectures
- [[rotary-position-embedding]]: rotary position embedding (RoPE)
- [[attention-sinks]]: the attention-sink technique
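
The difference between MHA, GQA, and MQA in the list above is easiest to see as KV-cache arithmetic: the cache stores one key and one value vector per layer, per KV head, per token. The sketch below evaluates that formula for a hypothetical model (32 layers, 32 query heads, head dimension 128, 4096 tokens, 2-byte elements). None of these numbers come from the survey. MLA is left out because it caches a compressed latent, not per-head keys and values.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_element=2, batch_size=1):
    """Size of the KV cache: keys plus values for every layer, head, and token."""
    keys_and_values = 2
    return (keys_and_values * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_element * batch_size)


if __name__ == "__main__":
    # Hypothetical configuration, chosen only to make the ratios visible.
    config = dict(num_layers=32, head_dim=128, seq_len=4096)
    variants = {
        "MHA (32 KV heads)": 32,  # one KV head per query head
        "GQA (8 KV heads)": 8,    # query heads share KV heads in groups of 4
        "MQA (1 KV head)": 1,     # all query heads share a single KV head
    }
    for name, kv_heads in variants.items():
        size = kv_cache_bytes(num_kv_heads=kv_heads, **config)
        print(f"{name}: {size / 2**20:.0f} MiB")
```

The cache shrinks in direct proportion to the number of KV heads, which is why GQA and MQA reduce memory without touching the number of query heads.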

## Structure

1. Mathematical principles of attention
2. Main variants (MHA / MQA / GQA / MLA / sparse / linear)
3. Challenges and problems (complexity / cache / entropy collapse / Lost in the Middle / hallucination)
4. Optimization strategies (FlashAttention / KV compression / sparse optimization / architectural innovation / training strategies)
5. Future outlook and conclusion

## References

43 references in total, covering core works such as Vaswani 2017 (Attention Is All You Need), Shazeer 2019 (MQA), Ainslie 2023 (GQA), DeepSeek 2024 (MLA/V2), Dao 2022 (FlashAttention), and Gu & Dao 2024 (Mamba).