20260429: Some new things

2026-04-29 16:28:13 +08:00
parent 0b1535dfaf
commit 56c4d3ef7c
70 changed files with 2798 additions and 3 deletions
--- a/raw/papers/deepseek-ai-deepseek-v4-2026.md
+++ b/raw/papers/deepseek-ai-deepseek-v4-2026.md
@@ -0,0 +1,62 @@
+# DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
+
+> **Source**: Hugging Face (technical report)
+> **Authors**: DeepSeek-AI
+> **Date**: 2026
+> **Link**: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf
+> **Models**: DeepSeek-V4-Pro (1.6T/49B activated), DeepSeek-V4-Flash (284B/13B activated)
+
+## Abstract
+
+We present a preview version of DeepSeek-V4 series, including two strong Mixture-of-Experts (MoE) language models — DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated) — both supporting a context length of one million tokens.
+
+## Key Upgrades over DeepSeek-V3
+
+1. **Hybrid attention architecture**: Compressed Sparse Attention (CSA) + Heavily Compressed Attention (HCA) for long-context efficiency
+2. **Manifold-Constrained Hyper-Connections (mHC)**: Upgrades conventional residual connections for stability and expressivity
+3. **Muon optimizer**: Faster convergence and greater training stability
+
+## Architecture Summary
+
+- Retains DeepSeekMoE framework (fine-grained + shared experts) and Multi-Token Prediction (MTP)
+- Hybrid CSA/HCA: CSA compresses KV cache along sequence dimension then applies sparse attention; HCA applies aggressive compression with dense attention
+- mHC constrains residual mapping to doubly stochastic matrices (Birkhoff polytope) via Sinkhorn-Knopp algorithm
+- Muon with hybrid Newton-Schulz orthogonalization for most modules; AdamW for embeddings, heads, biases, RMSNorm
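A minimal sketch of the Sinkhorn-Knopp step named in the mHC bullet above: alternately normalizing rows and columns of a positive matrix drives it toward a doubly stochastic matrix (every row and column sums to 1). This is the plain textbook iteration, not the report's implementation; the function name, the 3x3 score matrix, and the iteration count are illustrative assumptions.

```python
import math

def sinkhorn_knopp(scores, n_iters=100):
    """Map a square score matrix to an (approximately) doubly stochastic one."""
    n = len(scores)
    # Exponentiate so every entry is strictly positive.
    m = [[math.exp(x) for x in row] for row in scores]
    for _ in range(n_iters):
        # Row step: scale each row so it sums to 1.
        for i in range(n):
            total = sum(m[i])
            m[i] = [v / total for v in m[i]]
        # Column step: scale each column so it sums to 1.
        for j in range(n):
            total = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= total
    return m

# Hypothetical unconstrained scores for a 3-stream residual mixing matrix.
mixing = sinkhorn_knopp([[0.2, -1.0, 0.5], [1.5, 0.0, -0.3], [-0.7, 0.9, 0.1]])
row_sums = [sum(row) for row in mixing]
col_sums = [sum(row[j] for row in mixing) for j in range(3)]
```

Because a doubly stochastic matrix is a convex combination of permutation matrices, applying it mixes the residual streams without changing their total, which is one intuition for the stability claim.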
+
+## Infrastructure Highlights
+
+- Fine-grained communication-computation overlap in Expert Parallelism (1.5-1.73x speedup)
+- MegaMoE2 mega-kernel (open-sourced)
+- TileLang DSL with Z3 SMT solver integration
+- Batch-invariant and deterministic kernel libraries
+- FP4 quantization-aware training for MoE experts
+- Inference: heterogeneous KV cache with on-disk storage
+
+## Pre-Training
+
+- DeepSeek-V4-Flash: 32T tokens; DeepSeek-V4-Pro: 33T tokens
+- Both natively support 1M-length contexts after pre-training
+
+## Post-Training Pipeline
+
+Two-stage paradigm:
+1. **Specialist Training**: Independent expert models trained per domain (math, coding, agent, instruction following) via SFT + RL (GRPO)
+2. **On-Policy Distillation (OPD)**: Multi-teacher reverse-KL distillation merging expert capabilities into unified model
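As a rough illustration of the reverse-KL objective mentioned in stage 2, the sketch below compares reverse KL (expectation under the student) with forward KL (expectation under the teacher) for one next-token distribution. The two probability vectors and the function names are made up for illustration; the report's actual multi-teacher loss is not reproduced here.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def reverse_kl_loss(student_probs, teacher_probs):
    # Reverse KL weights the log-ratio by the student's own probabilities,
    # which matches on-policy training: samples come from the student.
    return kl_divergence(student_probs, teacher_probs)

# Hypothetical next-token distributions over a 3-token vocabulary.
teacher = [0.70, 0.20, 0.10]
student = [0.50, 0.30, 0.20]

reverse_kl = reverse_kl_loss(student, teacher)
forward_kl = kl_divergence(teacher, student)
```

The two directions give different values for the same pair of distributions, so the choice of direction changes what the student is penalized for.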
+
+## Key Evaluation Results
+
+- **Knowledge (SimpleQA, MMLU-Pro, HLE, GPQA)**: Significantly outperforms open-source models; closing gap with Gemini-3.1-Pro
+- **Reasoning**: Superior to GPT-5.2, Gemini-3.0-Pro; trails GPT-5.4/Gemini-3.1-Pro by ~3-6 months
+- **Agent**: On par with Kimi-K2.6, GLM-5.1; outperforms Claude Sonnet 4.5 in internal eval
+- **Long-Context**: Surpasses Gemini-3.1-Pro on academic benchmarks at 1M tokens
+- **Chinese Writing**: 62.7% win rate vs Gemini-3.1-Pro
+
+## Efficiency (1M-token context vs DeepSeek-V3.2)
+
+- DeepSeek-V4-Pro: 27% FLOPs, 10% KV cache
+- DeepSeek-V4-Flash: 10% FLOPs, 7% KV cache
+
+---
+
+*Format: Raw paper archive. See [[deepseek-v4-million-token-context]] for the wiki page.*
+*Last Updated: 2026-04-27*
--- a/raw/papers/godel-tutorial-2026.md
+++ b/raw/papers/godel-tutorial-2026.md
@@ -0,0 +1,46 @@
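+# Gödel's Incompleteness Theorems Tutorial: Raw Archive
+
+- **Title**: A Tutorial on Gödel's Incompleteness Theorems: From Gödel Numbering to Exploring the Limits of Artificial Intelligence
+- **Type**: Comprehensive tutorial / teaching material (for undergraduate mathematics majors)
+- **Date**: April 2026
+- **Language**: Chinese
+- **Pages**: 43 (including appendices)
+- **Source**: PDF submitted directly
+- **File**: godel_tutorial.pdf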
+
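+## Abstract
+
+Gödel's incompleteness theorems are among the most profound results in 20th-century mathematics and logic. In 1931, the Austrian logician Kurt Gödel, then only 25 years old, proved two far-reaching theorems in his paper: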
+
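+- **First incompleteness theorem**: In any consistent, effectively axiomatized formal system that contains Peano arithmetic, there are true statements that can be neither proved nor refuted within the system.
+- **Second incompleteness theorem**: No consistent, effectively axiomatized formal system that contains Peano arithmetic can prove its own consistency from within.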
+
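+This tutorial is aimed at undergraduate mathematics majors. Starting from the historical background of Hilbert's program, it systematically covers how the incompleteness theorems came about, their core content, and their proof techniques, as well as their far-reaching impact on the foundations of mathematics, computer science, and philosophy.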
+
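+## Chapter Structure
+
+1. **Historical background**: Hilbert's program and the crisis in mathematics (set-theoretic paradoxes, the three schools of thought, Gödel's life)
+2. **Gödel's first incompleteness theorem**: formal systems, Gödel coding, representability, primitive recursive functions, proof outline
+3. **Gödel's second incompleteness theorem**: formalizing the consistency statement, proof sketch
+4. **Proof techniques in detail**: Gödel numbering, the diagonal substitution function Sub, construction of the self-referential sentence G
+5. **Impact on the foundations of mathematics**: the end of Hilbert's program, independence of the continuum hypothesis, the decline of formalism and the rise of pluralism
+6. **Impact on computer science**: computability theory, the halting problem, formal verification, automated theorem proving
+7. **Philosophical impact and the human mind**: the nature of mathematical truth, the Lucas-Penrose argument, the limits of knowledge, the Gödel universe
+8. **Uses and misuses**: discussions in physics, discussions in AI, clarifying common misconceptions
+9. **Modern developments**: the Paris-Harrington theorem, Goodstein's theorem, Chaitin's algorithmic information theory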
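Chapter 4 lists Gödel numbering among the proof techniques. A small sketch of the classic prime-power encoding may help: a sequence of symbol codes c1, c2, c3, ... becomes the single integer 2^c1 * 3^c2 * 5^c3 * ..., and unique factorization recovers the sequence. The symbol table below is a hypothetical toy alphabet, not the coding used in the tutorial.

```python
def first_primes(n):
    """Return the first n primes by trial division (fine for short formulas)."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

def godel_encode(codes):
    """Encode positive integer codes as a product of consecutive prime powers."""
    number = 1
    for prime, code in zip(first_primes(len(codes)), codes):
        number *= prime ** code
    return number

def godel_decode(number):
    """Recover the code sequence by reading off exponents of 2, 3, 5, ..."""
    codes, primes, candidate = [], [], 2
    while number > 1:
        if all(candidate % p for p in primes):
            primes.append(candidate)
            exponent = 0
            while number % candidate == 0:
                number //= candidate
                exponent += 1
            codes.append(exponent)
        candidate += 1
    return codes

# Hypothetical symbol table: every symbol gets a positive code.
SYMBOLS = {"0": 1, "S": 2, "=": 3}
formula = "S0=S0"
codes = [SYMBOLS[ch] for ch in formula]   # [2, 1, 3, 2, 1]
godel_number = godel_encode(codes)        # 2^2 * 3^1 * 5^3 * 7^2 * 11^1
decoded = godel_decode(godel_number)
```

Since statements about formulas then become statements about integers, arithmetic can talk about its own syntax, which is what the self-referential sentence G relies on.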
+
+## Key Concepts
+
+[[godel-incompleteness-theorems]] · [[godel-numbering]] · [[hilberts-program]] · [[peano-arithmetic]] · [[self-reference]] · [[diagonalization-method]] · [[halting-problem]] · [[lucas-penrose-argument]] · [[chaitin-algorithmic-information-theory]] · [[metamathematics]]
+
+## Selected References
+
+- Gödel, K. (1931). Über formal unentscheidbare Sätze...
+- Nagel & Newman (1958). Gödel's Proof
+- Hofstadter, D. R. (1979). Gödel, Escher, Bach
+- Smullyan, R. M. (1992). Gödel's Incompleteness Theorems
+- Franzén, T. (2005). Gödel's Theorem: An Incomplete Guide to Its Use and Abuse
+- Paris & Harrington (1977). A Mathematical Incompleteness in Peano Arithmetic
+- Chaitin, G. J. (1974). Information-Theoretic Limitations of Formal Systems
+- Lucas, J. R. (1961). Minds, Machines and Gödel
+- Penrose, R. (1989). The Emperor's New Mind
--- a/raw/papers/llm-attention-survey-2026.md
+++ b/raw/papers/llm-attention-survey-2026.md
@@ -0,0 +1,38 @@
+# A Comprehensive Analysis of Attention Mechanisms in Large Language Models
+
+- **Type**: Review paper
+- **Date**: April 2026
+- **Source**: PDF uploaded directly
+- **Filename**: LLM注意力机制全面分析
+- **Tags**: #attention-mechanism #LLM #transformer #survey
+
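+## Abstract
+
+The attention mechanism is the core component of the Transformer architecture and a key factor behind the breakthroughs of large language models (LLMs). This paper gives a comprehensive, systematic survey of attention mechanisms in LLMs along several dimensions: mathematical principles, a taxonomy of mechanisms, practical problems, and solutions. First, starting from the mathematical foundations of scaled dot-product attention, it derives in detail the formulations of self-attention, multi-head attention, and their variants. Second, it traces the line of development from standard Multi-Head Attention (MHA) to Multi-Query Attention (MQA), Grouped-Query Attention (GQA), Multi-head Latent Attention (MLA), and the various sparse-attention and linear-attention architectures. It then analyzes the core challenges attention mechanisms currently face, including quadratic computational complexity, the KV cache memory bottleneck, attention entropy collapse, the long-context "Lost in the Middle" phenomenon, and hallucinations caused by attention drift. Finally, it surveys frontier solutions, including the FlashAttention series, KV cache compression and quantization, sparse attention optimization, architectural innovations, and training strategy optimization.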
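For orientation, here is a dependency-free sketch of the scaled dot-product attention the abstract starts from, softmax(Q K^T / sqrt(d_k)) V, written with plain lists for a single head. The toy queries, keys, and values are invented for illustration, and the survey's own derivation is not reproduced here.

```python
import math

def softmax(scores):
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Single-head attention over lists of vectors; returns (outputs, weights)."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy example: one query that aligns with the first key.
queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
outputs, weights = scaled_dot_product_attention(queries, keys, values)
```

Every query is scored against every key, so the score matrix grows with the square of the sequence length, which is the quadratic cost the abstract lists as a core challenge.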
+
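+## Key Concepts
+
+- [[multi-head-attention]] (MHA): the standard multi-head attention mechanism
+- [[multi-query-attention]] (MQA): an attention variant in which all query heads share a single KV head
+- [[grouped-query-attention]] (GQA): a compromise between MHA and MQA
+- [[multi-head-latent-attention]] (MLA): low-rank compression of the KV cache
+- [[flash-attention]]: IO-aware attention optimization
+- [[attention-entropy-collapse]]: attention degeneration and entropy collapse
+- [[kv-cache-bottleneck]]: the KV cache memory bottleneck
+- [[lost-in-the-middle]]: information loss in long contexts
+- [[sparse-attention-patterns]]: sparse attention patterns
+- [[linear-attention-methods]]: linear attention and alternative architectures
+- [[rotary-position-embedding]]: rotary position embedding
+- [[attention-sinks]]: the attention sink technique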
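To make the MHA/MQA/GQA trade-off in the list above concrete, this sketch counts KV cache bytes per token: each layer stores one key and one value vector per KV head. The model shape (32 layers, 32 query heads, head dimension 128, 8 KV groups, 2-byte values) is a hypothetical configuration chosen only for the arithmetic. MLA is left out because it caches a compressed latent rather than per-head keys and values.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: one key vector and one value vector per KV head per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

# Hypothetical model shape, assumed for illustration only.
N_LAYERS, N_QUERY_HEADS, HEAD_DIM = 32, 32, 128

mha = kv_cache_bytes_per_token(N_LAYERS, N_QUERY_HEADS, HEAD_DIM)  # one KV head per query head
gqa = kv_cache_bytes_per_token(N_LAYERS, 8, HEAD_DIM)              # 8 shared KV groups
mqa = kv_cache_bytes_per_token(N_LAYERS, 1, HEAD_DIM)              # a single shared KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size // 1024} KiB per token")
```

Under these assumptions the cache shrinks in proportion to the number of KV heads, so GQA with 8 groups uses a quarter of the MHA cache and MQA uses 1/32 of it.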
+
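+## Structure
+
+1. Mathematical principles of attention
+2. Main variants (MHA/MQA/GQA/MLA/sparse/linear)
+3. Challenges and problems (complexity/cache/entropy collapse/Lost in the Middle/hallucination)
+4. Optimization strategies (FlashAttention/KV compression/sparse optimization/architectural innovation/training strategies)
+5. Future outlook and conclusion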
+
+## References
+
+43 references in total, covering core works such as Vaswani 2017 (Attention Is All You Need), Shazeer 2019 (MQA), Ainslie 2023 (GQA), DeepSeek 2024 (MLA/V2), Dao 2022 (FlashAttention), and Gu & Dao 2024 (Mamba).