20260617: currently 914 pages
41
concepts/linear-representation-hypothesis.md
Normal file
@@ -0,0 +1,41 @@
---
title: "Linear Representation Hypothesis"
created: 2026-06-01
updated: 2026-06-01
type: concept
tags: [interpretability, representation-geometry, steering]
sources: [raw/papers/xu-why-steering-works-2026.md]
confidence: medium
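---

# Linear Representation Hypothesis

## Definition

The linear representation hypothesis (LRH) holds that high-level semantic concepts are encoded approximately as linear directions or subspaces in a neural network's representation space. Formally: for a hidden representation $h$, there exist a direction vector $\omega$ and a bias $b$ such that the strength of the concept is approximately $\omega^T h + b$.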
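
As a rough illustration of the read-out $\omega^T h + b$, the sketch below evaluates the score and recovers $\omega$ and $b$ from synthetic activations by ordinary least squares. The function names, the synthetic data, and the least-squares fit are assumptions made for this example; the page does not say how the direction is estimated in practice.

```python
import numpy as np


def concept_strength(h, omega, b):
    """Linear read-out omega^T h + b; h is one vector (d,) or a batch (n, d)."""
    return h @ omega + b


def fit_linear_readout(H, y):
    """Least-squares estimate of (omega, b) from activations H (n, d) and scores y (n,)."""
    # Append a constant column so the bias is fitted together with the direction.
    X = np.hstack([H, np.ones((H.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[:-1], coef[-1]


# Synthetic stand-in for hidden states: the concept is planted along true_omega.
rng = np.random.default_rng(0)
d, n = 16, 500
true_omega = rng.normal(size=d)
true_b = 0.5
H = rng.normal(size=(n, d))
y = concept_strength(H, true_omega, true_b) + 0.01 * rng.normal(size=n)

omega_hat, b_hat = fit_linear_readout(H, y)

# Cosine similarity between the recovered and the planted direction.
cosine = omega_hat @ true_omega / (
    np.linalg.norm(omega_hat) * np.linalg.norm(true_omega)
)
```

On real model activations the scores `y` would come from labelled examples, and a good linear fit would count as evidence for LRH rather than proof of it.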

## Origins

The hypothesis originated in analogical reasoning with word embeddings (Mikolov et al., 2013: king - man + woman ≈ queen) and was later extended to the intermediate layers of LLMs (Nanda et al., 2023; Park et al., 2024; Tigges et al., 2023).

## Role in Steering

LRH is the theoretical foundation of activation steering ([[activation-steering]]):
- The preference probability can be modeled in logistic form: $P(p_p|h) = \sigma(-(\omega_p^T h + b_p))$
- The steering vector $\Delta h$ is aligned with the preference direction $\omega_p$
- The projection gain $\alpha_p = \omega_p^T \Delta h$ quantifies the steering effect
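
The three quantities above can be tied together numerically. The sketch below uses the logistic form exactly as written on this page, including its minus sign, and checks that adding $\Delta h$ changes the log-odds by $-\alpha_p$. The vectors, the step size of 0.5, and all function names are hypothetical choices for illustration, not values from the source paper.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def preference_prob(h, omega_p, b_p):
    """P(p_p | h) = sigma(-(omega_p^T h + b_p)), sign convention as on this page."""
    return sigmoid(-(omega_p @ h + b_p))


def projection_gain(omega_p, delta_h):
    """alpha_p = omega_p^T delta_h."""
    return omega_p @ delta_h


def log_odds(p):
    return np.log(p / (1.0 - p))


rng = np.random.default_rng(0)
d = 8
omega_p = rng.normal(size=d)  # hypothetical preference direction
b_p = 0.1                     # hypothetical bias
h = rng.normal(size=d)        # hypothetical hidden state

# Hypothetical steering vector: a step of length 0.5 along the unit preference direction.
delta_h = 0.5 * omega_p / np.linalg.norm(omega_p)

alpha_p = projection_gain(omega_p, delta_h)
shift = log_odds(preference_prob(h + delta_h, omega_p, b_p)) - log_odds(
    preference_prob(h, omega_p, b_p)
)
# Under this sign convention the log-odds move by exactly -alpha_p.
```

Because the log-odds are linear in $h$, only the component of $\Delta h$ along $\omega_p$ matters; any component orthogonal to $\omega_p$ leaves the modeled probability unchanged.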
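
## Extension in the Activation-Manifold Framework

Xu et al. (2026) build on LRH by introducing an activation-manifold constraint:
- LRH explains the mechanism by which preference grows linearly with $m$
- The manifold validity decay $D(m)$ explains why the linear relationship breaks down in the medium-to-large $m$ regime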
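
To make the contrast between the two regimes concrete, here is a toy sketch that relies on several assumptions not stated on this page: $m$ is taken to be a scalar steering strength, "preference" is read as a log-odds shift, the decay is assumed to enter multiplicatively, and $D(m) = \exp(-(m/m_0)^2)$ is a made-up functional form. None of these should be read as the definitions used by Xu et al. (2026).

```python
import numpy as np


def decay(m, m0=4.0):
    """Hypothetical validity decay D(m) in (0, 1]; NOT the form from the source paper."""
    return np.exp(-((m / m0) ** 2))


def linear_shift(m, alpha):
    """Pure-LRH prediction: the log-odds shift grows linearly with m."""
    return alpha * m


def attenuated_shift(m, alpha, m0=4.0):
    """Assumed combination: the linear shift multiplied by the decay factor."""
    return alpha * m * decay(m, m0)


alpha = 0.8                     # hypothetical projection gain per unit of m
ms = np.linspace(0.0, 8.0, 81)  # grid of steering strengths
lin = linear_shift(ms, alpha)
att = attenuated_shift(ms, alpha)

# Relative gap between the two curves equals 1 - D(m) under the assumed form.
rel_gap = 1.0 - decay(ms)

# First grid point where the linear prediction is off by more than 10 percent.
m_break = ms[np.argmax(rel_gap > 0.10)]
```

For small `ms` the two curves nearly coincide, which is the regime LRH alone describes; the point `m_break` is one simple way to mark where the linear picture stops being a good approximation in this toy setup.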

## Related Concepts

- [[activation-steering]]: direct application of LRH
- [[activation-manifold]]: geometric constraint on LRH
- [[preference-log-odds]]: formalization of preference under LRH
- [[steering-vector]]: extraction of steering vectors
- [[representation-space]]: geometry of the representation space
- [[xu-why-steering-works]]: source paper