---
title: "Mixture of Experts (MoE)"
domain: "Deep Learning / Model Architecture"
tags: [moe, architecture, sparsity, transformer]
sources: [[deepseek-v4-million-token-context]], Dai et al. (2024)
---
# Mixture of Experts (MoE)
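> **Type**: Concept (Tier 2, Foundation)
> **Source**: [[deepseek-v4-million-token-context]]
## Definition
Mixture of Experts (MoE) is a neural network architecture paradigm built on sparse activation: each token is routed to only a subset of the model's parameters (the experts), so the total parameter count can grow while per-token compute stays under control.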
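## DeepSeekMoE Design
DeepSeek-V4 inherits and extends the DeepSeekMoE framework:
### Core Components
- **Fine-grained routed experts**: a large number of small experts; each token activates its top-k choices
- **Shared experts**: experts that are always active for every token and capture general knowledge
- **Routing strategy**: affinity scores are computed with Sqrt(Softplus(·)) instead of Sigmoid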
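
The three components above can be sketched in a few lines of plain Python. This is a minimal illustration, not DeepSeek's implementation: the function names, the scalar "experts", and the renormalization of the top-k weights so they sum to 1 are all assumptions made for the example; only the Sqrt(Softplus(·)) affinity, top-k selection, and always-on shared experts come from the text.

```python
import math

def softplus(x: float) -> float:
    # Numerically stable softplus: log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def affinity(logit: float) -> float:
    # Sqrt(Softplus(.)) affinity score, as named in the text.
    return math.sqrt(softplus(logit))

def route(logits, k):
    """Return (expert_index, weight) pairs for the top-k routed experts.

    Assumption: weights are renormalized over the selected experts.
    """
    scores = [affinity(z) for z in logits]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    return [(i, scores[i] / total) for i in top]

def moe_output(x, logits, routed_experts, shared_experts, k):
    # Shared experts run for every token; routed experts only when selected.
    out = sum(f(x) for f in shared_experts)
    for i, w in route(logits, k):
        out += w * routed_experts[i](x)
    return out
```

With four routed experts and `k = 2`, only two routed expert functions are evaluated per token, which is where the compute saving relative to a dense layer of the same total size comes from.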
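### Improvements in DeepSeek-V4
1. **Load balancing**: an auxiliary-loss-free strategy plus a slight sequence-level balance loss
2. **Removal of the limit on the number of routing targets**: allows a more flexible routing topology
3. **Hash routing**: in the first few Transformer layers, the FFN uses hash routing in place of a dense layer
4. **FP4 quantization**: routed-expert weights use FP4 precision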
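
To make the hash-routing item concrete, here is a small sketch of routing without a learned router: the expert set is a deterministic function of the token ID. The source does not say which hash function DeepSeek-V4 uses, so the multiplicative hash, the linear congruential step, and the name `hash_route` are all hypothetical choices for illustration.

```python
def hash_route(token_id: int, num_experts: int, k: int, seed: int = 0) -> list:
    """Pick k distinct experts for a token ID, with no learned parameters.

    Hypothetical hash: a 32-bit multiplicative mix followed by LCG steps.
    """
    if not 0 < k <= num_experts:
        raise ValueError("k must be between 1 and num_experts")
    chosen = []
    state = (token_id * 2654435761 + seed) & 0xFFFFFFFF
    while len(chosen) < k:
        # One linear congruential step; high bits are better mixed.
        state = (state * 1664525 + 1013904223) & 0xFFFFFFFF
        expert = (state >> 16) % num_experts
        if expert not in chosen:
            chosen.append(expert)
    return chosen
```

Because the mapping depends only on the token ID, the same token always reaches the same experts, and no routing logits need to be computed in those layers.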
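### Expert Parallelism Optimizations
[[deepseek-v4-million-token-context|DeepSeek-V4]] introduces fine-grained communication-computation overlap:
- Experts are grouped into waves, and dispatch/compute/combine are pipelined across the waves
- MegaMoE2 mega-kernel: theoretical speedup of 1.92×
- On each GPU, communication latency can be fully hidden behind computation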
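
A toy timing model shows why splitting experts into waves helps. It assumes, for illustration only, that every wave has the same dispatch, compute, and combine time and that the three stages can run concurrently on different waves; the function names and the numbers are hypothetical and are not meant to reproduce the 1.92× figure.

```python
def serial_time(n_waves, dispatch, compute, combine):
    # No overlap: each wave finishes all three stages before the next starts.
    return n_waves * (dispatch + compute + combine)

def pipelined_time(n_waves, dispatch, compute, combine):
    # Classic pipeline makespan for identical items: fill the pipeline once,
    # then each further wave costs one slot of the slowest stage.
    stages = (dispatch, compute, combine)
    return sum(stages) + (n_waves - 1) * max(stages)

# Example with hypothetical stage times (arbitrary units).
waves, d, c, b = 4, 1.0, 2.0, 1.0
speedup = serial_time(waves, d, c, b) / pipelined_time(waves, d, c, b)
```

In this model the steady-state cost per wave is the slowest stage, so when compute is the slowest stage the communication time disappears from the critical path except for pipeline fill and drain.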
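## Efficiency Analysis
For a single token-expert pair in V4-Pro:
- Compute: 6hd FLOPs (SwiGLU gate + up + down projections)
- Communication: 3h bytes (FP8 dispatch + BF16 combine)
- Requirement: C/B ≤ 6144 FLOPs/Byte (that is, each GB/s of bandwidth can sustain about 6.1 TFLOP/s of compute)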
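
The arithmetic behind these figures can be checked directly. The sketch below reads h as the hidden size, d as the expert intermediate size, C as compute throughput, and B as interconnect bandwidth; these readings are assumptions, since the note does not define the symbols. Communication is hidden when compute time is at least communication time, 6hd/C ≥ 3h/B, which rearranges to C/B ≤ 2d. The value d = 3072 is inferred from 6144 = 2d rather than stated in the source, and h = 7168 is a placeholder that cancels out.

```python
def flops_per_pair(h: int, d: int) -> int:
    # Gate, up, and down projections: three matmuls of 2*h*d FLOPs each.
    return 6 * h * d

def bytes_per_pair(h: int) -> int:
    # FP8 dispatch moves 1 byte per element; BF16 combine moves 2 bytes.
    return 1 * h + 2 * h

def max_compute_to_bandwidth(h: int, d: int) -> float:
    # Largest C/B (FLOPs per byte) at which communication is still hidden.
    return flops_per_pair(h, d) / bytes_per_pair(h)

ratio = max_compute_to_bandwidth(h=7168, d=3072)  # d inferred, h placeholder
tflops_per_gbps = ratio * 1e9 / 1e12              # 1 GB/s -> TFLOP/s
```

Here `ratio` evaluates to 6144 FLOPs/Byte and `tflops_per_gbps` to about 6.1, matching the figures in the list above.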
## Related Concepts
- [[fp4-quantization-training]]: FP4 quantization training
- [[subquadratic-transformer-alternatives]]: alternative architectures to the Transformer
---
*Last Updated: 2026-04-27*