SidneyZhang/myWiki

Sidney Zhang 91fac5b6fc

20260617: currently 914 pages

2026-06-17 15:02:40 +08:00

title: Linear Representation Hypothesis
created: 2026-06-01
updated: 2026-06-01
type: concept
tags: interpretability, representation-geometry, steering
sources: raw/papers/xu-why-steering-works-2026.md
confidence: medium

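Linear Representation Hypothesis (LRH)

Definition

The linear representation hypothesis (LRH) holds that high-level semantic concepts are approximately encoded as linear directions or subspaces in a neural network's representation space. Formally: for a representation $h$, there exist a direction vector $\omega$ and a bias $b$ such that the strength of the concept is approximately $\omega^T h + b$.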
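The linear readout in the definition can be sketched in a few lines of Python. This is only an illustration: the function name `concept_strength` and the three-dimensional vectors are hypothetical and do not come from any model or from the source paper.

```python
def concept_strength(omega, h, b=0.0):
    """Return the linear readout omega^T h + b for one activation vector h."""
    if len(omega) != len(h):
        raise ValueError("omega and h must have the same dimension")
    return sum(w * x for w, x in zip(omega, h)) + b


# Made-up numbers, chosen only so the arithmetic is easy to follow.
omega = [1.0, 0.0, -1.0]   # hypothetical concept direction
h = [2.0, 5.0, 0.5]        # hypothetical hidden activation
score = concept_strength(omega, h, b=0.25)
print(score)
```

The second coordinate of `h` has no effect on the score because `omega` is zero there: under LRH, only the component of the activation along the concept direction matters.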

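Origins

The hypothesis grew out of analogical reasoning in word embeddings (Mikolov et al., 2013: king - man + woman ≈ queen) and was later extended to the intermediate layers of LLMs (Nanda et al., 2023; Park et al., 2024; Tigges et al., 2023).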

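Role in Steering

LRH is the theoretical basis for activation steering (activation-steering):

The preference probability can be modeled in logistic form: $P(p_p \mid h) = \sigma(-(\omega_p^T h + b_p))$
The steering vector $\Delta h$ is aligned with the preference direction $\omega_p$
The projection gain $\alpha_p = \omega_p^T \Delta h$ quantifies the steering effect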
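The three quantities above can be tied together in a minimal sketch. It assumes the steering vector is simply added to the activation (`h + delta_h`), and it keeps the sign convention of the logistic formula exactly as written on this page. All vectors and helper names are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def preference_prob(omega_p, h, b_p):
    """Logistic form as written above: sigma(-(omega_p^T h + b_p))."""
    return sigmoid(-(dot(omega_p, h) + b_p))


def projection_gain(omega_p, delta_h):
    """alpha_p = omega_p^T delta_h."""
    return dot(omega_p, delta_h)


# Made-up numbers for illustration only.
omega_p = [0.5, -1.0, 0.0]
b_p = 0.0
h = [1.0, 1.0, 3.0]
delta_h = [1.0, -2.0, 0.0]  # 2 * omega_p, i.e. aligned with omega_p

alpha_p = projection_gain(omega_p, delta_h)
h_steered = [x + d for x, d in zip(h, delta_h)]  # assumed additive steering

score_before = dot(omega_p, h) + b_p
score_after = dot(omega_p, h_steered) + b_p

# The linear score moves by exactly alpha_p, whatever h is.
print(alpha_p, score_after - score_before)
print(preference_prob(omega_p, h, b_p), preference_prob(omega_p, h_steered, b_p))
```

Because the score is linear in the activation, the shift depends only on `delta_h` and not on the starting point `h`. This is why a single projection gain can summarize the steering effect.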

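Extension in the Activation-Manifold Framework

Xu et al. (2026) build on LRH by introducing an activation-manifold constraint:

LRH explains the mechanism by which preference grows linearly with $m$
The manifold validity decay $D(m)$ explains why the linear relationship breaks down in the medium-to-large $m$ range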
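The linear growth in $m$ can be made explicit with a short derivation. It rests on an assumption this page does not state: that $m$ is a scalar multiplier on the steering vector, so the steered activation is $h + m\,\Delta h$.

$$
\omega_p^T (h + m\,\Delta h) + b_p
= (\omega_p^T h + b_p) + m\,\omega_p^T \Delta h
= (\omega_p^T h + b_p) + m\,\alpha_p
$$

Under that assumption the argument of the logistic is affine in $m$ with slope $\alpha_p$. The derivation does not model $D(m)$, whose form is not given on this page.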

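Related concepts

activation-steering: a direct application of LRH
activation-manifold: geometric constraints on LRH
preference-log-odds: the formalization of preference under LRH
steering-vector: extraction of steering vectors
representation-space: geometry of the representation space
xu-why-steering-works: source paper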