---
title: "Sparse Autoencoder (SAE)"
created: 2026-06-17
updated: 2026-06-17
type: concept
tags: [interpretability, architecture, dictionary-learning, sparse-coding]
sources: [raw/papers/zhang-geometric-sae-2026.md]
confidence: high
---
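# Sparse Autoencoder (SAE)
An SAE is a **core tool of mechanistic interpretability**: it learns an overcomplete sparse representation that decomposes neural network activations into interpretable features.
## Basic structure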
```
z = W_enc (x - b_pre) + b_enc # encode: map the n-dim activation to d dims (d >> n)
a = Act(z) # sparse activation
x̂ = W_dec a + b_dec # decode: reconstruct the original activation
```
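
The three equations above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `sae_forward`, the toy sizes, the random weights, and the choice of ReLU for `Act` are all assumptions made for the example.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec, b_pre,
                act=lambda z: np.maximum(z, 0.0)):
    """One SAE forward pass for a single activation vector x of shape (n,).

    `act` stands in for Act(z); ReLU is assumed here for illustration.
    """
    z = W_enc @ (x - b_pre) + b_enc   # encode: (d, n) @ (n,) -> (d,)
    a = act(z)                        # sparse activation
    x_hat = W_dec @ a + b_dec         # decode: (n, d) @ (d,) -> (n,)
    return z, a, x_hat

# Toy setup (hypothetical sizes and untrained random weights).
rng = np.random.default_rng(0)
n, d = 4, 16                          # d > n: overcomplete dictionary
W_enc = rng.normal(size=(d, n))
W_dec = rng.normal(size=(n, d))
b_enc, b_dec, b_pre = np.zeros(d), np.zeros(n), np.zeros(n)

x = rng.normal(size=n)
z, a, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec, b_pre)
```

With untrained weights `x_hat` will not be close to `x`; the sketch only shows the shapes and the order of operations.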
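## Main variants
[[geometric-sae-concepts|Zhang et al. (2026)]] divide SAEs into two classes:
### [[absolute-gating|Absolute gating]]
Each neuron's activation is determined independently of the others:
- **ReLU SAE**: `L = ‖x - x̂‖² + λ‖a‖₁`; the L1 penalty enforces sparsity
- **Gated SAE**: introduces a gating mechanism to improve selectivity
- **JumpReLU SAE**: uses a JumpReLU activation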
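
Two of the absolute-gating ingredients above can be sketched directly. The loss follows the ReLU SAE formula as written. The JumpReLU function assumes the commonly used thresholded form `z · H(z − θ)`; the threshold name `theta` and both function names are illustrative choices, not taken from the source.

```python
import numpy as np

def relu_sae_loss(x, x_hat, a, lam):
    """L = ||x - x_hat||^2 + lam * ||a||_1 (reconstruction error + L1 penalty)."""
    return float(np.sum((x - x_hat) ** 2) + lam * np.sum(np.abs(a)))

def jump_relu(z, theta):
    """Assumed JumpReLU form: keep z where z > theta, otherwise output 0."""
    return np.where(z > theta, z, 0.0)

x = np.array([1.0, 2.0])
x_hat = np.array([0.5, 2.0])
a = np.array([0.0, 3.0, 0.0, 1.0])
loss = relu_sae_loss(x, x_hat, a, lam=0.1)   # 0.25 + 0.1 * 4.0

z = np.array([-1.0, 0.2, 0.8])
gated = jump_relu(z, theta=0.5)              # only the last unit passes the threshold
```

In both functions each unit is handled elementwise, so whether a unit fires never depends on the values of the other units, which is what makes the gating "absolute".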
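### [[absolute-gating|Relative gating]]
A neuron's activation depends on the other neurons (competitive selection):
- **Top-K SAE**: keeps only the k largest activations and zeroes the rest
- **Matching Pursuit SAE**: iteratively selects the neurons that contribute most
- **SPaDE**: structured sparse decomposition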
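
The Top-K rule is the simplest case of relative gating and can be sketched as below. The function name `top_k_activation` is hypothetical, and the sketch applies the selection directly to the pre-activations `z`; real implementations may differ in details such as applying a ReLU first.

```python
import numpy as np

def top_k_activation(z, k):
    """Keep the k largest entries of z and set all others to zero.

    Assumes 1 <= k <= len(z).
    """
    a = np.zeros_like(z)
    idx = np.argpartition(z, -k)[-k:]   # indices of the k largest values
    a[idx] = z[idx]
    return a

z = np.array([0.1, 2.0, -0.5, 1.5, 0.3])
a = top_k_activation(z, k=2)            # keeps 2.0 and 1.5, zeroes the rest
```

Unlike a fixed threshold, the fate of each unit here depends on the others: the value 0.3 is dropped only because two larger entries exist in the same vector.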
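## Core idea
SAEs rest on the [[linear-representation-hypothesis|linear representation hypothesis]]: semantic concepts correspond to directions in activation space and can be combined linearly. By enforcing sparsity, an SAE disentangles these directions, pushing individual neurons toward [[polysemanticity|monosemanticity]].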
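## References
- [[polysemanticity|Polysemanticity / monosemanticity]]
- [[mechanistic-interpretability|Mechanistic interpretability]]
- [[linear-representation-hypothesis|Linear representation hypothesis]]
- [[geometric-sae-concepts|Geometric framework paper]]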