20260420:first commit
23
raw/papers/zhu-moda-mixture-of-depths-2026.md
Normal file
@@ -0,0 +1,23 @@
---
title: "Mixture-of-Depths Attention"
arxiv_id: "2603.15619"
authors: ["Lianghui Zhu", "Yuxin Fang", "Bencheng Liao", "Shijie Wang", "Tianheng Cheng", "Zilong Huang", "Chen Chen", "Lai Wei", "Yutao Zeng", "Ya Wang", "Yi Lin", "Yu Li", "Xinggang Wang"]
published: "2026-03-26"
updated: "2026-03-26"
categories: ["cs.LG", "cs.AI", "cs.CL"]
primary_category: "cs.LG"
url: "https://arxiv.org/abs/2603.15619"
github: "https://github.com/hustvl/MoDA"
abstract: |
  Scaling depth is a key driver for large language models (LLMs). Yet, as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixture-of-depths attention (MoDA), a mechanism that allows each attention head to attend to sequence KV pairs at the current layer and depth KV pairs from preceding layers. We further describe a hardware-efficient algorithm for MoDA that resolves non-contiguous memory-access patterns, achieving 97.3% of FlashAttention-2's efficiency at a sequence length of 64K. Experiments on 1.5B-parameter models demonstrate that MoDA consistently outperforms strong baselines. Notably, it improves average perplexity by 0.2 across 10 validation benchmarks and increases average performance by 2.11% on 10 downstream tasks, with a negligible 3.7% FLOPs computational overhead. We also find that combining MoDA with post-norm yields better performance than using it with pre-norm. These results suggest that MoDA is a promising primitive for depth scaling.
---

# Mixture-of-Depths Attention

**arXiv:** 2603.15619 [cs.LG]
**Published:** 2026-03-26
**Authors:** Lianghui Zhu, Yuxin Fang, Bencheng Liao, Shijie Wang, Tianheng Cheng, Zilong Huang, Chen Chen, Lai Wei, Yutao Zeng, Ya Wang, Yi Lin, Yu Li, Xinggang Wang

## Abstract

Scaling depth is a key driver for large language models (LLMs). Yet, as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixture-of-depths attention (MoDA), a mechanism that allows each attention head to attend to sequence KV pairs at the current layer and depth KV pairs from preceding layers. We further describe a hardware-efficient algorithm for MoDA that resolves non-contiguous memory-access patterns, achieving 97.3% of FlashAttention-2's efficiency at a sequence length of 64K. Experiments on 1.5B-parameter models demonstrate that MoDA consistently outperforms strong baselines. Notably, it improves average perplexity by 0.2 across 10 validation benchmarks and increases average performance by 2.11% on 10 downstream tasks, with a negligible 3.7% FLOPs computational overhead. We also find that combining MoDA with post-norm yields better performance than using it with pre-norm. These results suggest that MoDA is a promising primitive for depth scaling.
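
The abstract describes each attention head attending jointly to sequence KV pairs at the current layer and depth KV pairs from preceding layers. The following is a minimal, illustrative sketch of that idea for a single query and a single head. It is not the authors' implementation: the function name `moda_head_sketch`, the choice to take depth KV pairs from the same token position, and the use of one softmax over the concatenated keys are all assumptions made for illustration, and the hardware-efficient kernel is not modeled.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def moda_head_sketch(q, seq_k, seq_v, depth_k, depth_v):
    """Hypothetical single-query, single-head sketch.

    q        : query vector of the current token at the current layer
    seq_k/v  : keys/values of the visible (causal) tokens at the current layer
    depth_k/v: keys/values taken from preceding layers (assumption: same token)
    Returns the attended output and the attention weights.
    """
    # Assumption: sequence and depth KV pairs share one softmax.
    keys = seq_k + depth_k
    values = seq_v + depth_v
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, k) * scale for k in keys])
    dim = len(values[0])
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return out, weights


# Two visible tokens at the current layer plus one KV pair from an earlier layer.
q = [1.0, 0.0]
seq_k = [[1.0, 0.0], [0.0, 1.0]]
seq_v = [[1.0, 0.0], [0.0, 1.0]]
depth_k = [[1.0, 0.0]]
depth_v = [[2.0, 2.0]]
out, weights = moda_head_sketch(q, seq_k, seq_v, depth_k, depth_v)
```

With empty `depth_k` and `depth_v`, the sketch reduces to ordinary scaled dot-product attention over the sequence, which makes the depth KV pairs easy to see as an additive extension of the key/value set.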