20260625: lots of new content
62
concepts/memory-compute-decoupling.md
Normal file
@@ -0,0 +1,62 @@
---
title: "Memory-Compute Decoupling"
created: 2026-06-25
updated: 2026-06-25
type: concept
tags: ["infrastructure", "efficiency", "memory", "prefetching"]
sources:
- "[[engram-conditional-memory-2026]]"
---

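# Memory-Compute Decoupling

Memory-Compute Decoupling is an infrastructure-aware design principle proposed by Engram: deterministic addressing lets large embedding tables be offloaded from GPU memory to host memory, and runtime prefetching overlaps communication with computation.
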
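## Motivation

Dynamic routing in MoE means that:
- Expert selection depends on the current token's hidden state
- There is no way to know in advance which expert the next token will activate
- All expert parameters must therefore stay resident in GPU memory

Engram's deterministic hashing provides the opposite property: lookup indices depend only on the input tokens, so they are known before the layer runs.
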
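## Mechanism

### Deterministic addressing
- The index of an N-gram embedding is determined by the hash function 𝜑_{n,k}(g_{t,n})
- **It depends only on the input tokens, not on the hidden state**
- → The embedding vectors needed for the next token can be prefetched ahead of time
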
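The property above can be illustrated with a minimal sketch. It assumes a simple polynomial rolling hash as a stand-in for 𝜑_{n,k}. The names `ngram_slot` and `slots_for_sequence`, the `SEEDS` constants, and the table size are hypothetical and are not Engram's actual hash. The point is only that every slot is computable from token IDs alone, before any hidden state exists.

```python
from typing import List, Sequence

# Hypothetical per-head multipliers standing in for the k index of the hash family.
SEEDS = (0x9E3779B1, 0x85EBCA77, 0xC2B2AE3D)


def ngram_slot(tokens: Sequence[int], t: int, n: int, k: int, table_size: int) -> int:
    """Map the n-gram ending at position t to a table row using token IDs only."""
    gram = tokens[max(0, t - n + 1): t + 1]  # shorter than n at the sequence start
    h = 0
    for tok in gram:
        # Simple polynomial rolling hash (an assumption, not Engram's hash).
        h = (h * SEEDS[k] + tok + 1) % (2 ** 32)
    return h % table_size


def slots_for_sequence(tokens: Sequence[int], n: int, k: int, table_size: int) -> List[int]:
    """Compute the slot for every position; no model forward pass is needed."""
    return [ngram_slot(tokens, t, n, k, table_size) for t in range(len(tokens))]
```

Because `slots_for_sequence` never touches model activations, the slots for a prompt (or for a token as soon as it has been sampled) can be handed to a prefetcher before the corresponding layer runs.
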
### Memory hierarchy
```
GPU HBM: resident backbone network (Attention + MoE)
Host Memory: large-capacity Engram embedding table
↓
Runtime: a prefetch thread moves the next batch of embeddings Host → GPU ahead of time
```

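The prefetch pattern in the diagram can be sketched as a one-step-ahead pipeline. This is an illustration under stated assumptions, not Engram's implementation. `gather_rows` is a hypothetical stand-in for the host-to-GPU copy, `run_with_prefetch` is a hypothetical driver, and a Python worker thread stands in for what would be an asynchronous copy on real hardware.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

Row = List[float]


def gather_rows(host_table: Sequence[Row], slots: Sequence[int]) -> List[Row]:
    """Stand-in for a host-to-GPU copy: read the requested rows from the host table."""
    return [host_table[s] for s in slots]


def run_with_prefetch(
    host_table: Sequence[Row],
    slot_batches: Sequence[Sequence[int]],
    compute: Callable[[List[Row]], float],
) -> List[float]:
    """Process batches in order, fetching batch i+1 while batch i is being computed."""
    outputs: List[float] = []
    if not slot_batches:
        return outputs
    with ThreadPoolExecutor(max_workers=1) as prefetcher:
        pending = prefetcher.submit(gather_rows, host_table, slot_batches[0])
        for i in range(len(slot_batches)):
            rows = pending.result()  # wait for the rows of the current batch
            if i + 1 < len(slot_batches):
                # Slots are known ahead of time, so the next copy can start now.
                pending = prefetcher.submit(gather_rows, host_table, slot_batches[i + 1])
            outputs.append(compute(rows))  # runs while the next fetch is in flight
    return outputs
```

The sketch only works because `slot_batches` is available up front. With hidden-state-dependent routing, the argument to the next `submit` call would not be known until the current step had finished.
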
### Overhead
- A 100B-parameter embedding table is offloaded to host memory
- Latency overhead is < 3%
- Communication is overlapped with computation

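To get a feel for why a table of this size is offloaded, here is a back-of-the-envelope footprint calculation. The source does not state the storage precision, so the byte widths below are assumptions, and `table_gib` is a hypothetical helper.

```python
def table_gib(num_params: float, bytes_per_param: int) -> float:
    """Size of a parameter table in GiB (1 GiB = 2**30 bytes)."""
    return num_params * bytes_per_param / 2 ** 30


params = 100e9  # 100B parameters, as stated above
# Byte widths are assumptions; the source does not give the precision.
for name, width in (("fp32", 4), ("bf16", 2), ("int8", 1)):
    print(f"{name}: {table_gib(params, width):.1f} GiB")
```

At two bytes per parameter this comes to roughly 186 GiB for the embedding table alone, before any backbone weights or activations are counted.
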
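## Significance

1. **Breaking the GPU memory wall**: embedding table size is no longer limited by GPU HBM
2. **Aggressive parameter scaling**: memory modules far larger than GPU capacity can be deployed
3. **Predictable scaling**: growing memory capacity does not increase compute cost
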
## Comparison with MoE Offloading

| Dimension | MoE Offloading | Engram Decoupling |
|-----------|----------------|-------------------|
| Addressing | Dynamic routing (depends on hidden state) | Deterministic hashing (depends only on token IDs) |
| Prefetchability | Hard (not known in advance) | Easy (indices known ahead of time) |
| Latency impact | Significant | <3% |

## References
- [[engram-conditional-memory-2026]]
- [[engram]]
- [[conditional-memory]]
- [[mixture-of-experts]]