20260420:first commit
papers/zhu-moda-mixture-of-depths.md
@@ -0,0 +1,39 @@
---
title: "Mixture-of-Depths Attention (MoDA)"
created: 2026-04-19
updated: 2026-04-19
type: paper
tags: [llm, architecture, deep-learning, transformer]
sources: [raw/papers/zhu-moda-mixture-of-depths-2026.md]
---

# Mixture-of-Depths Attention (MoDA)

**arXiv:** 2603.15619 [cs.LG] · 2026-03-26
**Authors:** Lianghui Zhu, Yuxin Fang, Bencheng Liao, Shijie Wang, Tianheng Cheng, Zilong Huang, Chen Chen, Lai Wei, Yutao Zeng, Ya Wang, Yi Lin, Yu Li, Xinggang Wang
**Code:** https://github.com/hustvl/MoDA

## Core Contribution

The paper proposes **Mixture-of-Depths Attention (MoDA)**, an attention mechanism that addresses **signal degradation** when large models are scaled in depth. MoDA lets each attention head attend to both the sequence KV pairs of the current layer and the depth KV pairs from preceding layers, so that informative features formed in shallow layers are preserved in deep networks.
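
The idea of one head attending to two kinds of KV pairs can be sketched for a single query position. This is only an illustration under stated assumptions, not the paper's implementation: it assumes the sequence KV and depth KV are concatenated and share one softmax, and it omits causal masking, multi-head projections, and batching. The function name `moda_head` and its parameters are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def moda_head(query, seq_keys, seq_values, depth_keys, depth_values):
    """One head, one query position (hypothetical helper).

    ASSUMPTION: sequence KV (current layer) and depth KV (preceding layers)
    are concatenated and normalized by a single joint softmax.
    """
    keys = seq_keys + depth_keys
    values = seq_values + depth_values
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


# Toy usage: two sequence KV pairs from the current layer, one depth KV pair
# standing in for a feature computed by an earlier layer.
q = [1.0, 0.0]
seq_k = [[1.0, 0.0], [0.0, 1.0]]
seq_v = [[1.0, 0.0], [0.0, 1.0]]
depth_k = [[1.0, 0.0]]
depth_v = [[5.0, 5.0]]
out, w = moda_head(q, seq_k, seq_v, depth_k, depth_v)
```

With empty `depth_keys` and `depth_values`, the function reduces to ordinary scaled dot-product attention over the sequence, which is one way to see the depth KV as an additive extension of the standard head.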
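
## Key Findings

- **Signal degradation**: as LLMs get deeper, informative features formed in shallow layers are diluted by repeated residual updates and become hard for deep layers to recover
- **MoDA mechanism**:
  - Each attention head mixes attention over the current layer's sequence KV and the preceding layers' depth KV
  - Acts like a cross-layer "shortcut", but is built on attention rather than a plain residual connection
- **Hardware-efficient implementation**:
  - Resolves the non-contiguous memory access pattern
  - Reaches **97.3%** of FlashAttention-2's efficiency at a 64K sequence length
  - Adds only **3.7%** FLOPs overhead
- **Experimental results** (1.5B-parameter model):
  - Average perplexity improves by **0.2** across 10 validation benchmarks
  - Average performance on 10 downstream tasks improves by **2.11%**
- **Normalization placement**: MoDA + **Post-Norm** outperforms Pre-Norm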
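
For readers unfamiliar with the two normalization placements in the last item, the standard definitions can be sketched as follows. This is the textbook ordering only, not the paper's block design: the layer norm here has no learnable gain or bias, and `sublayer` is a hypothetical stand-in for an attention or feed-forward module.

```python
import math


def layer_norm(x, eps=1e-5):
    """Layer normalization without learnable gain/bias (simplification)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_norm_block(x, sublayer):
    """Pre-Norm: y = x + f(norm(x)). The residual stream is never normalized."""
    update = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, update)]


def post_norm_block(x, sublayer):
    """Post-Norm: y = norm(x + f(x)). The sum itself is normalized."""
    update = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, update)])


# Toy usage with a hypothetical sublayer that halves its input.
x = [1.0, 2.0, 3.0, 4.0]
half = lambda h: [0.5 * v for v in h]
pre = pre_norm_block(x, half)
post = post_norm_block(x, half)
```

The only difference is where the normalization sits relative to the residual addition: Pre-Norm leaves the running residual sum unnormalized, while Post-Norm renormalizes it after every block.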

## Related Concepts

- [[mixture-of-depths-attention]]: detailed explanation of the MoDA mechanism
- [[depth-scaling-llms]]: techniques and challenges of scaling LLM depth
- [[signal-degradation]]: signal degradation in deep networks