20260429: some new stuff

2026-04-29 16:28:13 +08:00
parent 0b1535dfaf
commit 56c4d3ef7c
70 changed files with 2798 additions and 3 deletions
--- a/concepts/mixture-of-experts.md
+++ b/concepts/mixture-of-experts.md
@@ -0,0 +1,54 @@
+---
+title: "Mixture of Experts (MoE)"
+domain: "Deep Learning / Model Architecture"
+tags: [moe, architecture, sparsity, transformer]
+sources: [[deepseek-v4-million-token-context]], Dai et al. (2024)
+---
+
+# Mixture of Experts (MoE)
+
+> **Type**: Concept (Tier 2 - Foundation)
+> **Source**: [[deepseek-v4-million-token-context]]
+
+## Definition
+
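+Mixture of Experts (MoE) is a neural network architecture paradigm built on sparse activation: each token is routed to only a subset of the model's parameters (the experts), so the total parameter count can grow while per-token compute stays under control.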
+
+## DeepSeekMoE Design
+
+DeepSeek-V4 inherits and extends the DeepSeekMoE framework:
+
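+### Core Components
+- **Fine-grained routed experts**: a large number of small experts, of which each token activates its top-k
+- **Shared experts**: experts that are always active for every token and capture common knowledge
+- **Routing strategy**: affinity scores are computed with Sqrt(Softplus(·)) instead of Sigmoid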
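+
+One way to read these components together is sketched below, borrowing notation from earlier DeepSeekMoE descriptions. All symbols are assumptions for illustration and are not defined in the source: $u_t$ is the FFN input for token $t$, $e_i$ a learned vector for routed expert $i$, $N_s$ and $N_r$ the numbers of shared and routed experts, and $k$ the number of routed experts activated per token. Normalizing the gates over the selected experts is likewise an assumption carried over from earlier versions.
+
+$$
+\begin{aligned}
+s_{i,t} &= \sqrt{\operatorname{Softplus}(u_t^{\top} e_i)}, \qquad \operatorname{Softplus}(x) = \ln(1 + e^{x}) \\
+g_{i,t} &= \begin{cases} s_{i,t} \Big/ \sum_{j \in \operatorname{TopK}_t} s_{j,t}, & i \in \operatorname{TopK}_t \\ 0, & \text{otherwise} \end{cases} \\
+h_t &= u_t + \sum_{i=1}^{N_s} \operatorname{FFN}^{(s)}_i(u_t) + \sum_{i=1}^{N_r} g_{i,t}\, \operatorname{FFN}^{(r)}_i(u_t)
+\end{aligned}
+$$
+
+Here $\operatorname{TopK}_t$ denotes the $k$ routed experts with the largest $s_{i,t}$ for token $t$.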
+
+### Improvements in DeepSeek-V4
+
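+1. **Load balancing**: an auxiliary-loss-free strategy plus a slight sequence-level balance loss
+2. **Limit on the number of routing targets removed**: allows a flexible routing topology
+3. **Hash routing**: the FFNs in the first few Transformer layers use hash routing in place of dense layers
+4. **FP4 quantization**: routed-expert weights use FP4 precision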
+
+### Expert Parallelism Optimization
+
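+[[deepseek-v4-million-token-context|DeepSeek-V4]] introduces fine-grained communication-computation overlap:
+- Experts are grouped into waves, so that dispatch/compute/combine are pipelined
+- MegaMoE2 mega-kernel: theoretical speedup of 1.92×
+- On each GPU, communication latency can be fully hidden behind computation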
+
+## Efficiency Analysis
+
+For a single token-expert pair in V4-Pro:
+- Compute: 6hd FLOPs (SwiGLU gate + up + down projections)
+- Communication: 3h bytes (FP8 dispatch + BF16 combine)
+- Requirement: C/B ≤ 6144 FLOPs/Byte (i.e. each GB/s of bandwidth can sustain about 6.1 TFLOP/s of compute)
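+
+The bound in the last item can be reproduced with a short derivation. This is a sketch under assumed notation that the source does not define: $h$ is the model hidden size, $d$ the intermediate size of one routed expert, $C$ the compute throughput in FLOP/s, and $B$ the interconnect bandwidth in bytes/s. Each of the three SwiGLU projections is an $h \times d$ matrix multiply costing $2hd$ FLOPs per token, FP8 dispatch moves $h$ bytes, and BF16 combine moves $2h$ bytes. Communication can hide behind compute when its time does not exceed the compute time:
+
+$$
+\frac{3h}{B} \le \frac{6hd}{C}
+\quad\Longleftrightarrow\quad
+\frac{C}{B} \le \frac{6hd}{3h} = 2d
+$$
+
+Matching $2d = 6144$ would imply $d = 3072$, a value inferred here rather than quoted from the source. At $B = 1$ GB/s $= 10^{9}$ bytes/s, the bound gives $C \le 6144 \times 10^{9}$ FLOP/s $\approx 6.1$ TFLOP/s, which agrees with the figure in the list.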
+
+## Related Concepts
+
+- [[fp4-quantization-training]]: FP4 quantization training
+- [[subquadratic-transformer-alternatives]]: alternative architectures to the Transformer
+
+---
+
+*Last Updated: 2026-04-27*