20260617: 914 pages so far

2026-06-17 15:02:40 +08:00
parent e96b955fda
commit 91fac5b6fc
423 changed files with 20687 additions and 34 deletions
--- a/concepts/activation-steering.md
+++ b/concepts/activation-steering.md
@@ -0,0 +1,46 @@
+---
+title: "Activation Steering"
+created: 2026-06-01
+updated: 2026-06-01
+type: concept
+tags: [steering, interpretability, inference-time-intervention]
+sources: [raw/papers/xu-why-steering-works-2026.md]
+---
+
+# Activation Steering
+
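+## Definition
+
+Activation steering modifies an LLM's intermediate-layer representations at inference time by adding a steering vector to selected activations:
+
+$$h_{i+1} = W h_i + b + mv$$
+
+where $v$ is a predetermined direction and $m$ is a scalar coefficient.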
+
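+## Theoretical Basis
+
+Activation steering builds on the **linear representation hypothesis** ([[linear-representation-hypothesis]]): abstract concepts correspond approximately to linear subspaces of the representation space. The steering vector $v$ can be extracted from the difference between the activations of positive and negative examples of a concept (DiffMean).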
+
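+## In the Unified Framework
+
+In the unified dynamic-weight view of Xu et al. (2026), activation steering is equivalent to modifying only the bias $b$:
+
+$$h_{i+1} = W h_i + (b + m\Delta b)$$
+
+$$\Delta h = m\Delta b$$
+
+Setting $\Delta b = v$ recovers the update $h_{i+1} = W h_i + b + mv$. Among dynamic weight updates, this is the form with the **smallest parameter count** (only $d_{out}$ parameters).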
+
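+## Common Methods
+
+- **DiffMean** (Marks & Tegmark, 2023): training-free; takes the mean of the activation differences over contrastive pairs
+- **SFT**: learns the steering vector by supervised fine-tuning
+- **RePS**: preference-based training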
+
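+## Related Concepts
+
+- [[dynamic-weight-updates]]: the unified framework
+- [[steering-vector]]: methods for extracting steering vectors
+- [[linear-representation-hypothesis]]: the linear-space assumption
+- [[split-steering]]: an improved method for training the vector
+- [[xu-why-steering-works]]: the source paper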
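
Note on the page above: the update rule, the DiffMean extraction, and the bias-only equivalence can be checked numerically with a small sketch. This is an illustration under stated assumptions, not code from the source paper: the function names (`diffmean_vector`, `linear_layer`, `steered_layer`), the toy dimensions, and the synthetic "concept" activations are all hypothetical, and a single linear map stands in for a real LLM layer.

```python
import numpy as np


def diffmean_vector(pos_acts, neg_acts):
    """DiffMean: mean activation of positive examples minus mean of negatives."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)


def linear_layer(h, W, b):
    """Unsteered layer: h_{i+1} = W h_i + b."""
    return W @ h + b


def steered_layer(h, W, b, v, m):
    """Steered layer: h_{i+1} = W h_i + b + m v."""
    return W @ h + b + m * v


# Toy setup (assumed sizes; real hidden sizes are far larger).
rng = np.random.default_rng(0)
d_in, d_out = 8, 4
W = rng.normal(size=(d_out, d_in))
b = rng.normal(size=d_out)
h = rng.normal(size=d_in)

# Synthetic activations: positives are shifted along one fixed direction,
# mimicking a concept that is (approximately) linearly represented.
direction = np.array([1.0, 0.0, 0.0, 0.0])
neg_acts = rng.normal(size=(200, d_out))
pos_acts = rng.normal(size=(200, d_out)) + 3.0 * direction

v = diffmean_vector(pos_acts, neg_acts)  # steering vector, shape (d_out,)
m = 2.0                                  # scalar steering coefficient

steered = steered_layer(h, W, b, v, m)
bias_view = linear_layer(h, W, b + m * v)   # bias-only view with delta_b = v
delta_h = steered - linear_layer(h, W, b)   # change caused by steering
```

In this sketch `steered` and `bias_view` agree up to floating-point error, and `delta_h` equals `m * v`, which is the $\Delta h = m\Delta b$ relation with $\Delta b = v$. The vector has `d_out` entries, matching the stated parameter count.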