20260420:first commit
37
concepts/depth-scaling-signal-degradation.md
Normal file
@@ -0,0 +1,37 @@
---
title: "LLM Depth Scaling and Signal Degradation"
created: 2026-04-19
updated: 2026-04-19
type: concept
tags: [architecture, deep-learning, transformer]
sources: [raw/papers/zhu-moda-mixture-of-depths-2026.md]
---
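
# LLM Depth Scaling and Signal Degradation

## Background

Increasing model depth is one of the key ways to improve LLM performance. However, depth scaling runs into **signal degradation**: as the number of layers grows, the informative features extracted by shallow layers are diluted by repeated residual updates, so deeper layers struggle to use them effectively.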
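
## Signal degradation mechanism

In the residual stream of a standard Transformer:
$$x_{l+1} = x_l + f_l(x_l)$$
where $f_l$ is the transformation at layer $l$ (attention + FFN). Unrolling the recurrence gives $x_L = x_0 + \sum_{k=0}^{L-1} f_k(x_k)$, so $x_0$ is never erased outright. As $l$ grows, however, its original information is swamped by the accumulated $f_k$ updates, which leads to "forgetting".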
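
The dilution described above can be illustrated with a toy NumPy sketch. It is not a Transformer: each $f_l(x_l)$ is replaced by an independent random vector with roughly the same norm as $x_0$, which is an assumption made purely for illustration. The function name `residual_stream_similarity` and its parameters are hypothetical.

```python
import numpy as np


def residual_stream_similarity(num_layers=64, dim=512, update_scale=1.0, seed=0):
    """Track cosine similarity between x_0 and x_l in a toy residual stream.

    Assumption: each layer update f_l(x_l) is modelled as an independent
    Gaussian vector, so this only illustrates dilution, not real layer behaviour.
    """
    rng = np.random.default_rng(seed)
    x0 = rng.standard_normal(dim)
    x = x0.copy()
    sims = []
    for _ in range(num_layers):
        # Stand-in for f_l(x_l): a random update independent of x.
        update = update_scale * rng.standard_normal(dim)
        # Residual update: x_{l+1} = x_l + f_l(x_l)
        x = x + update
        cosine = float(x @ x0 / (np.linalg.norm(x) * np.linalg.norm(x0)))
        sims.append(cosine)
    return sims


if __name__ == "__main__":
    sims = residual_stream_similarity()
    for layer in (1, 8, 64):
        print(f"layer {layer:3d}: cos(x_l, x_0) = {sims[layer - 1]:.3f}")
```

Under this assumption the similarity falls off roughly as $1/\sqrt{1 + l}$: $x_0$ is still present in the sum, but it makes up a shrinking fraction of the stream's norm.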
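
## Mitigation strategies

### Architecture level
- **MoDA (Mixture-of-Depths Attention)**: attention heads directly access the KV of preceding layers across depth [[mixture-of-depths-attention]]
- **Residual connection variants**: for example Pre-Norm vs Post-Norm, which affects gradient flow
- **Layer normalization placement**: Post-Norm is reported to perform better than Pre-Norm in MoDA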
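
As a rough illustration of the items above, the NumPy sketch below contrasts the two normalization placements and shows one way a query could attend over keys and values cached from several layers. The Pre-Norm and Post-Norm formulas are the standard ones. `cross_layer_attention` is a hypothetical single-head simplification with no causal mask and no learned projections; it is not the MoDA paper's implementation.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned scale/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def pre_norm_block(x, f):
    """Pre-Norm: normalize the branch input; the residual path stays untouched."""
    return x + f(layer_norm(x))


def post_norm_block(x, f):
    """Post-Norm: normalize after the residual sum."""
    return layer_norm(x + f(x))


def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


def cross_layer_attention(q, kv_cache):
    """Hypothetical sketch: attend over K/V collected from several layers.

    q:        (seq, d) queries of the current layer
    kv_cache: list of (K, V) pairs, each of shape (seq, d), one per layer
    """
    keys = np.concatenate([k for k, _ in kv_cache], axis=0)    # (layers * seq, d)
    values = np.concatenate([v for _, v in kv_cache], axis=0)  # (layers * seq, d)
    weights = softmax(q @ keys.T / np.sqrt(q.shape[-1]))
    return weights @ values


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 8))
    cache = [(rng.standard_normal((4, 8)), rng.standard_normal((4, 8))) for _ in range(3)]
    print(post_norm_block(x, lambda h: 0.5 * h).shape)
    print(cross_layer_attention(x, cache).shape)
```

In this sketch the Pre-Norm residual path carries `x` through unchanged, while Post-Norm rescales the whole sum at every layer. The cross-layer read gives a deep layer a direct route to shallow-layer features instead of relying only on what survives in the residual stream.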
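
### Training level
- **Depth-aware initialization**: special initialization schemes that preserve signal magnitude
- **Gradient clipping and scaling**: prevents exploding/vanishing gradients in deep layers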
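
The note does not say which initialization or clipping scheme it has in mind. As one possible reading, the sketch below uses GPT-2-style scaling of residual-branch projections by $1/\sqrt{N}$, where $N$ is the number of residual branches, together with clipping by global gradient norm. The function names, `base_std=0.02`, and the count of two branches per layer are assumptions made for illustration.

```python
import numpy as np


def scaled_residual_init(d_model, num_layers, base_std=0.02, seed=0):
    """Initialize a residual-branch output projection with depth-scaled std.

    Assumption: two residual branches per layer (attention + FFN), so the
    standard deviation is divided by sqrt(2 * num_layers).
    """
    rng = np.random.default_rng(seed)
    std = base_std / np.sqrt(2 * num_layers)
    return rng.standard_normal((d_model, d_model)) * std


def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm


if __name__ == "__main__":
    w = scaled_residual_init(d_model=64, num_layers=32)
    print(f"projection std: {w.std():.5f}")

    grads = [np.full((2, 2), 3.0), np.full((3,), 4.0)]
    clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
    norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
    print(f"grad norm before: {norm_before:.3f}, after: {norm_after:.3f}")
```

Shrinking each branch's initial output keeps the summed residual stream from growing with depth at the start of training. Global-norm clipping bounds the update size without changing the gradient's direction.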

## Related concepts

- [[mixture-of-depths-attention]]: the MoDA mechanism
- [[zhu-moda-mixture-of-depths]]: the MoDA paper
- [[transformer-architecture]]: the basic Transformer architecture