20260429: Some new things
54
concepts/mixture-of-experts.md
Normal file
@@ -0,0 +1,54 @@
---
title: "Mixture of Experts (MoE)"
domain: "Deep Learning / Model Architecture"
tags: [moe, architecture, sparsity, transformer]
sources: [[deepseek-v4-million-token-context]], Dai et al. (2024)
---

# Mixture of Experts (MoE)

> **Type**: Concept (Tier 2: Foundation)
> **Source**: [[deepseek-v4-million-token-context]]
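
## Definition

Mixture of Experts (MoE) is a neural network architecture paradigm built on sparse activation: each token is routed to only a subset of the model's parameters (the experts), so the total parameter count can grow while the compute spent per token stays bounded.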
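
The sparse activation described above can be illustrated with a minimal sketch. It assumes a softmax over the selected router logits and uses scalar toy functions as experts; the names `moe_forward`, `gate_logits`, and `experts` are hypothetical and are not taken from any DeepSeek code.

```python
import math

def softmax(xs):
    # Subtract the maximum for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, gate_logits, experts, k):
    """Route input x to the top-k experts and mix their outputs.

    x           -- input value (a scalar here, a vector in a real model)
    gate_logits -- one router logit per expert for this token
    experts     -- list of callables, one per expert
    k           -- number of experts activated per token
    """
    # Pick the k experts with the highest router logits.
    top = sorted(range(len(experts)), key=lambda i: gate_logits[i], reverse=True)[:k]
    # Normalize the weights over the selected experts only (an assumption).
    weights = softmax([gate_logits[i] for i in top])
    # Only the selected experts are evaluated; the others are skipped entirely.
    output = sum(w * experts[i](x) for w, i in zip(weights, top))
    return output, top

# Four toy experts; only two of them run for this token.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
y, chosen = moe_forward(3.0, [0.1, 2.0, -1.0, 2.0], experts, k=2)
print(chosen, y)
```

With `k=2` out of four experts, half of the expert parameters are untouched for this token, which is the source of the compute savings.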
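
## DeepSeekMoE Design

DeepSeek-V4 inherits and extends the DeepSeekMoE framework:

### Core Components

- **Fine-grained routed experts**: a large number of small experts, of which each token activates its top-k.
- **Shared experts**: experts that are always active for every token and capture general knowledge.
- **Routing strategy**: affinity scores are computed with Sqrt(Softplus(·)) instead of Sigmoid.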
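
As a rough illustration of how these three components fit together, the sketch below computes Sqrt(Softplus(·)) affinities, selects the top-k routed experts, and adds the output of the always-on shared experts. Normalizing the selected affinities so they sum to 1 is an assumption, and `moe_layer` and its parameters are hypothetical names; the residual connection and all tensor shapes are omitted.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def affinity_sqrt_softplus(logit):
    # Sqrt(Softplus(.)) affinity: always positive and unbounded above.
    return math.sqrt(softplus(logit))

def affinity_sigmoid(logit):
    # Sigmoid affinity, shown for comparison: bounded in (0, 1).
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)

def moe_layer(x, logits, routed_experts, shared_experts, k):
    """Combine always-on shared experts with top-k routed experts."""
    scores = [affinity_sqrt_softplus(z) for z in logits]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Assumption: gate values are the selected affinities, renormalized to sum to 1.
    norm = sum(scores[i] for i in top)
    routed = sum(scores[i] / norm * routed_experts[i](x) for i in top)
    # Shared experts see every token regardless of the router's decision.
    shared = sum(f(x) for f in shared_experts)
    return shared + routed

routed_experts = [lambda x: x, lambda x: 3 * x, lambda x: 100 * x]
shared_experts = [lambda x: 10 * x]
result = moe_layer(2.0, [0.0, 0.0, -5.0], routed_experts, shared_experts, k=2)
print(result)
```

Unlike Sigmoid, the Sqrt(Softplus(·)) score does not saturate for large logits, so strongly preferred experts remain distinguishable from one another.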
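
### Improvements in DeepSeek-V4

1. **Load balancing**: an auxiliary-loss-free strategy plus a light sequence-wise balance loss.
2. **No cap on the number of routing targets**: allows a more flexible routing topology.
3. **Hash routing**: in the first few Transformer layers, the FFN uses hash routing in place of dense layers.
4. **FP4 quantization**: routed expert weights use FP4 precision.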
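
Two of the items above lend themselves to a short sketch. The note does not give the exact formulation used in DeepSeek-V4, so the code below follows the generally published forms of these ideas: hash routing as a fixed, non-learned mapping from token id to expert, and auxiliary-loss-free balancing as a per-expert bias that influences top-k selection only and is nudged after each step according to observed load. All function names, the use of SHA-256, and the step size `gamma` are assumptions made for illustration.

```python
import hashlib

def hash_route(token_id, num_experts, k=1):
    """Pick k distinct experts from a fixed hash of the token id (no learned router)."""
    if k > num_experts:
        raise ValueError("k cannot exceed num_experts")
    chosen = []
    salt = 0
    while len(chosen) < k:
        digest = hashlib.sha256(f"{token_id}:{salt}".encode()).digest()
        expert = int.from_bytes(digest[:8], "big") % num_experts
        if expert not in chosen:
            chosen.append(expert)
        salt += 1  # Re-hash with a new salt until k distinct experts are found.
    return chosen

def select_top_k(scores, bias, k):
    """Top-k selection on score + bias; the bias steers selection only."""
    order = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    return order[:k]

def update_bias(bias, load, gamma):
    """Nudge each expert's bias toward balanced load, with no auxiliary loss term."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - gamma)  # Overloaded: make selection less likely.
        elif n < mean_load:
            new_bias.append(b + gamma)  # Underloaded: make selection more likely.
        else:
            new_bias.append(b)
    return new_bias

# Every token prefers expert 0 at first; one bias update shifts the choice.
scores = [1.0, 0.95, 0.5]
bias = [0.0, 0.0, 0.0]
before = select_top_k(scores, bias, k=1)
bias = update_bias(bias, load=[10, 0, 0], gamma=0.1)
after = select_top_k(scores, bias, k=1)
print(before, after, hash_route(42, num_experts=8, k=2))
```

Because the bias enters only the selection step and not the gate weights, balancing pressure in this form does not add a gradient term that competes with the language-modeling loss.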
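
### Expert Parallelism Optimization

[[deepseek-v4-million-token-context|DeepSeek-V4]] introduces fine-grained communication-computation overlap:
- Experts are grouped into waves, and dispatch/compute/combine are pipelined across waves.
- MegaMoE2 mega-kernel: theoretical speedup of 1.92×.
- On each GPU, communication latency can be fully hidden behind computation.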
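
To see why wave pipelining hides communication, the following toy timing model compares a fully serial schedule with a three-stage pipeline. It assumes each stage (dispatch, compute, combine) handles one wave at a time and that dispatch and combine do not contend for the same link; the per-wave durations are made-up numbers. It is not a model of the MegaMoE2 kernel and does not attempt to reproduce the 1.92× figure.

```python
def serial_time(num_waves, t_dispatch, t_compute, t_combine):
    # No overlap: every wave runs dispatch, compute, combine back to back.
    return num_waves * (t_dispatch + t_compute + t_combine)

def pipelined_time(num_waves, t_dispatch, t_compute, t_combine):
    # Track when each stage finishes its most recent wave.
    dispatch_done = compute_done = combine_done = 0.0
    for _ in range(num_waves):
        # Dispatch of the next wave starts as soon as the previous dispatch ends.
        dispatch_done = dispatch_done + t_dispatch
        # Compute needs its inputs and a free compute unit.
        compute_done = max(dispatch_done, compute_done) + t_compute
        # Combine needs the wave's results and a free combine stage.
        combine_done = max(compute_done, combine_done) + t_combine
    return combine_done

waves = 4
serial = serial_time(waves, 1.0, 2.0, 1.0)
overlapped = pipelined_time(waves, 1.0, 2.0, 1.0)
print(serial, overlapped, serial / overlapped)
```

When compute is the slowest stage, the pipelined total approaches `num_waves * t_compute` plus one dispatch and one combine, so only the first dispatch and the last combine remain exposed.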
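
## Efficiency Analysis

For one token-expert pair in V4-Pro (h: hidden dimension, d: expert intermediate dimension):
- Compute: 6hd FLOPs (the SwiGLU gate, up, and down projections, 2hd FLOPs each).
- Communication: 3h bytes (h bytes for the FP8 dispatch plus 2h bytes for the BF16 combine).
- Requirement: C/B ≤ 6144 FLOPs/Byte, where C is compute throughput and B is interconnect bandwidth (that is, each GB/s of bandwidth can sustain about 6.1 TFLOP/s of compute).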
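
The bound follows from requiring the communication time 3h/B to be no longer than the compute time 6hd/C, which gives C/B ≤ 2d. The sketch below reproduces the arithmetic. The value d = 3072 is inferred from 2d = 6144 and is not stated in the note; h = 7168 is a purely hypothetical value that cancels out of the ratio.

```python
def flops_per_pair(h, d):
    # Three h-by-d matrix multiplies (gate, up, down), 2*h*d FLOPs each.
    return 6 * h * d

def bytes_per_pair(h, dispatch_bytes=1, combine_bytes=2):
    # FP8 dispatch (1 byte per element) plus BF16 combine (2 bytes per element).
    return h * (dispatch_bytes + combine_bytes)

def max_compute_per_bandwidth(h, d):
    # Communication hides behind compute when bytes / B <= flops / C,
    # which rearranges to C / B <= flops / bytes.
    return flops_per_pair(h, d) / bytes_per_pair(h)

d = 3072  # assumed: inferred from 2 * d == 6144, not stated in the note
h = 7168  # hypothetical: cancels out of the ratio
ratio = max_compute_per_bandwidth(h, d)   # FLOPs per byte
tflops_per_gbps = ratio * 1e9 / 1e12      # TFLOP/s sustained per GB/s of bandwidth
print(ratio, tflops_per_gbps)
```

Since h cancels, the bound depends only on the expert intermediate size and the byte widths of the dispatch and combine payloads; a wider combine format would lower the ratio and demand more bandwidth per unit of compute.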

## Related Concepts

- [[fp4-quantization-training]]: FP4 quantization training
- [[subquadratic-transformer-alternatives]]: alternative architectures to the Transformer

---

*Last Updated: 2026-04-27*