20260625: lots of new content

2026-06-25 14:08:47 +08:00
parent 91fac5b6fc
commit 6021dea160
375 changed files with 19263 additions and 251 deletions
--- a/concepts/sparsity-allocation.md
+++ b/concepts/sparsity-allocation.md
@@ -0,0 +1,64 @@
+---
+title: "Sparsity Allocation (U-shaped Law)"
+created: 2026-06-25
+updated: 2026-06-25
+type: concept
+tags: ["sparsity", "scaling-law", "mixture-of-experts", "architecture"]
+sources:
+  - "[[engram-conditional-memory-2026]]"
+---
+
+# Sparsity Allocation (U-shaped Law)
+
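+Sparsity Allocation is the problem formalized in the Engram paper: under a fixed total parameter budget, how should sparse capacity be split between MoE (conditional computation) and Engram (conditional memory) to get the best result?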
+
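+## Problem Definition
+
+Three parameter counts are given:
+- **P_tot**: total trainable parameters
+- **P_act**: parameters activated per token (these determine FLOPs)
+- **P_sparse** = P_tot - P_act: inactive parameters (the "free" budget, since they add no per-token FLOPs)
+
+The allocation ratio ρ ∈ [0,1] is the fraction of P_sparse assigned to MoE.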
+
+```
+P_MoE(sparse) = ρ · P_sparse
+P_Engram = (1-ρ) · P_sparse
+```
+
+- ρ = 1 → pure MoE (all inactive parameters are routed experts)
+- ρ < 1 → fewer routed experts, with the freed parameters given to Engram embedding slots
+
+## U-shaped Scaling Law
+
+Experiments were run at two compute scales (C=2e20 FLOPs, P_tot=5.7B; C=6e20 FLOPs, P_tot=9.9B), holding P_tot/P_act ≈ 10:
+
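+**Key findings**:
+
+1. **U-shaped validation loss curve**: both pure MoE (ρ=1) and very low ρ do worse than intermediate values
+2. **Optimal ρ ≈ 75-80%**: about 20-25% of the sparse budget goes to Engram
+3. **ρ=40% still matches ρ=100%**: with only 46 experts (vs 106), the Engram model performs close to pure MoE
+4. **The optimum is stable**: across both compute scales (5.7B vs 9.9B), the optimal ρ stays at 75-80%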
+
+At the 10B scale (the P_tot=9.9B setting): validation loss improves from 1.7248 (ρ=1) to 1.7109 (ρ≈0.8), Δ=0.0139.
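+
+As a back-of-envelope illustration, the definitions of P_sparse and ρ can be applied to the larger of the two settings. This is a sketch that assumes P_tot/P_act is exactly 10 and ρ is exactly 0.8; the parameter counts below are derived here for illustration and are not figures reported by the source.
+
+```latex
+\begin{aligned}
+P_{\text{act}} &\approx P_{\text{tot}} / 10 = 9.9\text{B} / 10 = 0.99\text{B} \\
+P_{\text{sparse}} &= P_{\text{tot}} - P_{\text{act}} \approx 9.9\text{B} - 0.99\text{B} = 8.91\text{B} \\
+P_{\text{MoE}}^{(\text{sparse})} &= \rho \cdot P_{\text{sparse}} \approx 0.8 \times 8.91\text{B} \approx 7.13\text{B} \\
+P_{\text{Engram}} &= (1-\rho) \cdot P_{\text{sparse}} \approx 0.2 \times 8.91\text{B} \approx 1.78\text{B} \\
+\Delta &= 1.7248 - 1.7109 = 0.0139 \quad (\approx 0.8\% \text{ relative to } \rho = 1)
+\end{aligned}
+```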
+
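+## Structural Implications
+
+| Regime | Observation | Cause |
+|------|------|------|
+| MoE-dominated (ρ→1) | Suboptimal | No dedicated memory, so static patterns must be reconstructed through computation |
+| Engram-dominated (ρ→0) | Degraded | Conditional computation capacity is lost, so dynamic reasoning cannot be handled |
+| Optimal (ρ≈0.75-0.80) | Best | The complementary roles of computation and memory are in balance |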
+
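+## Infinite Memory Scaling
+
+With a fixed MoE backbone (P_tot≈3B, P_act=568M), only the Engram embedding slots are scaled up (2.58e5 → 1e7, an extra +13B parameters):
+- Validation loss follows a **strict power law** (linear on a log-log plot)
+- Engram unlocks far more scaling potential than OverEncoding (which averages N-gram embeddings directly into the vocabulary embeddings)
+- This gives a **predictable scaling knob**: larger memory keeps yielding gains without extra compute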
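+
+"Log-log linear" can be written out as the generic form of a pure power law. This is only a sketch: N is taken to be the number of Engram embedding slots (an assumption about the x-axis), and a, b > 0 are hypothetical fit constants; neither the symbols nor any fitted values come from the source.
+
+```latex
+L(N) = a \, N^{-b}
+\quad\Longleftrightarrow\quad
+\log L(N) = \log a - b \log N
+```
+
+Under this form, multiplying N by a fixed factor k always scales the loss by the same factor k^{-b}, which is what would make the knob predictable.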
+
+## References
+- [[engram-conditional-memory-2026]]
+- [[conditional-memory]]
+- [[engram]]
+- [[mixture-of-experts]]