SidneyZhang/myWiki

Sidney Zhang dd8345a6ea

20260420:first commit

2026-04-20 11:42:41 +08:00

title: LLM Depth Scaling and Signal Degradation
created: 2026-04-19
updated: 2026-04-19
type: concept
tags: architecture, deep-learning, transformer
sources: raw/papers/zhu-moda-mixture-of-depths-2026.md

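LLM Depth Scaling & Signal Degradation

Background

Increasing model depth is one of the key ways to improve LLM performance. Depth scaling, however, runs into a signal degradation problem: as layers are added, the informative features extracted by shallow layers are diluted by repeated residual updates, so deep layers struggle to use them effectively.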

Signal Degradation Mechanism

In the residual stream of a standard Transformer:

x_{l+1} = x_l + f_l(x_l)

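where f_l is the transformation applied at layer l (attention + FFN). Unrolling the recurrence gives x_L = x_0 + \sum_{k=0}^{L-1} f_k(x_k): x_0 is never erased, but as l grows it is buried under a growing pile of accumulated f_k terms, so deeper layers effectively "forget" it.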
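
The dilution described above can be made concrete with a toy simulation. This is only a sketch under strong assumptions: each f_l is replaced by Gaussian noise that ignores its input, and the function name `simulate_residual_stream`, the dimension, and the layer count are all hypothetical choices, not taken from the source. It tracks how the direction of x_l drifts away from x_0 as updates accumulate.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def simulate_residual_stream(dim=256, num_layers=48, seed=0):
    """Run x_{l+1} = x_l + f_l(x_l) with random stand-in updates.

    Returns cos(x_0, x_l) for l = 0..num_layers.
    """
    rng = random.Random(seed)
    x0 = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    x = list(x0)
    sims = [cosine(x0, x)]
    for _ in range(num_layers):
        # Stand-in for f_l(x_l): unit-variance noise that ignores x_l
        # (a simplifying assumption, not a real attention/FFN block).
        update = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x = [xi + ui for xi, ui in zip(x, update)]
        sims.append(cosine(x0, x))
    return sims


sims = simulate_residual_stream()
for layer in (0, 1, 4, 16, 48):
    print(f"layer {layer:2d}: cos(x_0, x_l) = {sims[layer]:.3f}")
```

With independent unit-variance updates in a high-dimensional space, the similarity is expected to fall roughly as 1/sqrt(1 + l). Real layers produce input-dependent updates, so the actual decay in a trained model can differ substantially.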

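Mitigation Strategies

Architecture Level

MoDA (Mixture-of-Depths Attention): attention heads directly access the KV of preceding layers across depth (see mixture-of-depths-attention)
Residual connection variants: e.g. Pre-Norm vs Post-Norm, which affect gradient flow
Layer normalization placement: Post-Norm performs better in MoDA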
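
The architecture-level ideas above can be sketched in a few lines of plain Python. The Pre-Norm and Post-Norm updates follow their standard definitions. The `cross_depth_attention` function is only a hypothetical illustration of "one query attending over keys and values pooled from earlier layers"; it is not the MoDA paper's implementation, and its interface is an assumption made for this example.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_norm_step(x, sublayer):
    """Pre-Norm: x + f(LN(x)). The skip path carries x through unnormalized."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def post_norm_step(x, sublayer):
    """Post-Norm: LN(x + f(x)). The sum is renormalized after every layer."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_depth_attention(query, kv_by_layer):
    """Single-head attention for one query over KV pairs pooled across layers.

    kv_by_layer: list over layers (earlier layers first, current layer last),
    each a list of (key, value) vector pairs. Hypothetical interface.
    """
    pairs = [kv for layer in kv_by_layer for kv in layer]
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key, _ in pairs]
    weights = softmax(scores)
    dim = len(pairs[0][1])
    out = [sum(w * value[i] for w, (_, value) in zip(weights, pairs)) for i in range(dim)]
    return out, weights


# Toy usage: a sublayer that doubles its input, and two layers of KV pairs.
x = [1.0, 2.0, 3.0, 4.0]
double = lambda v: [2.0 * t for t in v]
pre = pre_norm_step(x, double)
post = post_norm_step(x, double)

query = [1.0, 0.0]
kv_by_layer = [
    [([1.0, 0.0], [1.0, 0.0])],  # KV kept from an earlier layer
    [([0.0, 1.0], [0.0, 1.0])],  # KV from the current layer
]
out, weights = cross_depth_attention(query, kv_by_layer)
```

In this toy setup the query matches the earlier layer's key, so that layer's value receives the larger attention weight. That is the sense in which cross-depth access lets a deep layer read shallow features directly instead of through the diluted residual stream.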

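Training Level

Depth-aware initialization: special initialization schemes that preserve signal magnitude
Gradient clipping and scaling: prevent exploding or vanishing gradients in deep layers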
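
Both training-level techniques can be sketched briefly. The source does not name a specific initialization scheme, so the rule below (shrinking the standard deviation of residual-branch output weights by 1/sqrt(2 * num_layers), similar in spirit to the 1/sqrt(N) scaling described for GPT-2) is an assumption chosen for illustration. The helper names `residual_branch_std`, `init_weights`, and `clip_by_global_norm` are likewise hypothetical.

```python
import math
import random


def residual_branch_std(base_std, num_layers):
    """Shrink the init std of residual-branch output weights as depth grows.

    Assumed rule: base_std / sqrt(2 * num_layers), counting two residual
    branches (attention and FFN) per layer.
    """
    return base_std / math.sqrt(2 * num_layers)


def init_weights(rows, cols, std, rng):
    """Draw a rows x cols matrix of zero-mean Gaussian weights."""
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


def clip_by_global_norm(grads, max_norm):
    """Rescale gradient vectors so their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total <= max_norm or total == 0.0:
        return [list(vec) for vec in grads], total
    scale = max_norm / total
    return [[g * scale for g in vec] for vec in grads], total


rng = random.Random(0)
std = residual_branch_std(base_std=0.02, num_layers=48)
w_out = init_weights(4, 4, std, rng)

grads = [[3.0, 4.0], [12.0]]  # joint L2 norm is sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, norm_after)
```

Clipping by the joint norm keeps the direction of the overall gradient intact and only limits its length, which is why it is usually preferred over clipping each parameter independently.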

Related Concepts

mixture-of-depths-attention: the MoDA mechanism
zhu-moda-mixture-of-depths: the MoDA paper
transformer-architecture: the basic Transformer architecture