---
title: "Mixture-of-Depths Attention (MoDA)"
created: 2026-04-19
updated: 2026-04-19
type: paper
tags: [llm, architecture, deep-learning, transformer]
sources: [raw/papers/zhu-moda-mixture-of-depths-2026.md]
---
# Mixture-of-Depths Attention (MoDA)
**arXiv:** 2603.15619 [cs.LG] · 2026-03-26
**Authors:** Lianghui Zhu, Yuxin Fang, Bencheng Liao, Shijie Wang, Tianheng Cheng, Zilong Huang, Chen Chen, Lai Wei, Yutao Zeng, Ya Wang, Yi Lin, Yu Li, Xinggang Wang
**Code:** https://github.com/hustvl/MoDA
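## Core Contribution
The paper proposes **Mixture-of-Depths Attention (MoDA)**, an attention mechanism that addresses **signal degradation** when large models are scaled in depth. MoDA lets each attention head attend jointly to the sequence KV pairs of the current layer and the depth KV pairs of preceding layers, so that informative features formed in shallow layers are preserved in deep networks.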
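The joint attention described above can be illustrated with a minimal single-head NumPy sketch. It is not the authors' implementation: the function name `mixed_depth_attention`, the tensor layout, and the choice to take depth KV pairs from the same token position in earlier layers and normalize them under one shared softmax are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mixed_depth_attention(q, k_seq, v_seq, k_depth, v_depth):
    """Hypothetical single-head sketch of attention over sequence KV + depth KV.

    q, k_seq, v_seq:   (T, d)    queries/keys/values of the current layer
    k_depth, v_depth:  (T, D, d) keys/values of the same token position taken
                                 from D earlier layers (assumed layout)
    returns:           (T, d)
    """
    T, d = q.shape
    scale = 1.0 / np.sqrt(d)

    # Sequence scores with a causal mask: token t sees tokens 0..t.
    s_seq = (q @ k_seq.T) * scale                           # (T, T)
    causal = np.tril(np.ones((T, T), dtype=bool))
    s_seq = np.where(causal, s_seq, -np.inf)

    # Depth scores: each token scores its own earlier-layer keys.
    s_depth = np.einsum("td,tld->tl", q, k_depth) * scale   # (T, D)

    # One softmax over both sources, so the head decides how to split
    # its attention mass between the sequence and the depth entries.
    w = softmax(np.concatenate([s_seq, s_depth], axis=-1))  # (T, T + D)
    w_seq, w_depth = w[:, :T], w[:, T:]

    return w_seq @ v_seq + np.einsum("tl,tld->td", w_depth, v_depth)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, D, d = 6, 3, 16
    out = mixed_depth_attention(
        rng.normal(size=(T, d)), rng.normal(size=(T, d)), rng.normal(size=(T, d)),
        rng.normal(size=(T, D, d)), rng.normal(size=(T, D, d)),
    )
    print(out.shape)
```

With `D = 0` the sketch reduces to ordinary causal self-attention, which is a convenient sanity check. A real kernel would fuse these steps rather than materialize the full score matrices.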
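## Key Findings
- **Signal degradation**: as LLMs get deeper, informative features formed in shallow layers are diluted by repeated residual updates and become hard for deeper layers to recover
- **MoDA mechanism**:
  - Each attention head mixes two sources: current-layer sequence KV and preceding-layer depth KV
  - Acts like a cross-layer "shortcut", but one built on attention rather than a plain residual connection
- **Hardware-efficient implementation**:
  - Resolves the non-contiguous memory access pattern
  - Reaches **97.3%** of FlashAttention-2's efficiency at a sequence length of 64K
  - Adds only **3.7%** FLOPs overhead
- **Experimental results** (1.5B-parameter model):
  - Average perplexity improves by **0.2** across 10 validation benchmarks
  - Average performance on 10 downstream tasks improves by **2.11%**
- **Normalization placement**: MoDA combined with **Post-Norm** performs better than with Pre-Norm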
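
The dilution behind the signal degradation finding can be made concrete with a toy model. The sketch below is an illustration under strong assumptions, not an experiment from the paper: each layer's residual update is modeled as an independent random unit vector, and the hypothetical helper `shallow_feature_share` measures how much of the final residual stream still points along the first layer's update.

```python
import numpy as np


def shallow_feature_share(depth, dim=1024, seed=0):
    """Cosine similarity between the first residual update and the
    residual stream after `depth` updates (toy model, assumed setup)."""
    rng = np.random.default_rng(seed)
    # One random unit-norm update per layer (a simplifying assumption;
    # real layer outputs are neither random nor independent).
    updates = rng.normal(size=(depth, dim))
    updates /= np.linalg.norm(updates, axis=1, keepdims=True)
    # The residual stream is the running sum of all updates.
    stream = updates.sum(axis=0)
    return float(updates[0] @ stream / np.linalg.norm(stream))


if __name__ == "__main__":
    for depth in (1, 4, 16, 64):
        print(depth, round(shallow_feature_share(depth), 3))
```

In this toy setting the share of the first update shrinks roughly like one over the square root of the depth, which is the kind of dilution that a direct attention path to earlier-layer KV pairs is meant to bypass.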
## Related Concepts
- [[mixture-of-depths-attention]]: the MoDA mechanism in detail
- [[depth-scaling-signal-degradation]]: techniques and challenges in scaling LLM depth
- [[depth-scaling-signal-degradation]]: signal degradation in deep networks