20260617: currently 914 pages

2026-06-17 15:02:40 +08:00
parent e96b955fda
commit 91fac5b6fc
423 changed files with 20687 additions and 34 deletions
--- a/concepts/split-steering.md
+++ b/concepts/split-steering.md
@@ -0,0 +1,61 @@
+---
+title: "SPLIT Steering"
+created: 2026-06-01
+updated: 2026-06-01
+type: concept
+tags: [steering, optimization, controllability]
+sources: [raw/papers/xu-why-steering-works-2026.md]
+---
+
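+# SPLIT Steering (Preference–Utility Joint Intervention)
+
+## Definition
+
+SPLIT (**S**teering with **P**reference–Uti**L**ity **I**nterven**T**ion) is a training objective proposed by Xu et al. (2026) that explicitly optimizes preference while preserving utility; it is designed to address the preference–utility trade-off directly.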
+
+## Objective Function
+
+### Utility loss (preserving general capability)
+
+$$L_{util} = \lambda_p L_p + \lambda_n L_n$$
+
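+Training on positive and negative samples at the same time ensures that the model retains its ability to generate coherent text.
+
+### Preference loss (maximizing the control effect)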
+
+$$L_{pref} = \gamma \cdot \sigma(\theta - (L_n - L_p))$$
+
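+This is a hinge-style margin loss: when $L_n - L_p$ (the preference log-odds) exceeds the threshold $\theta$, the loss is 0; otherwise it pushes the gap to grow.
+
+- $\sigma(\cdot)$ is ReLU
+- $\theta$ is the margin threshold
+- $\gamma$ balances preference gain against utility preservation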
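+
+As a worked illustration of the hinge behaviour, take $\theta = 1$ and $\gamma = 2$. These values are assumptions chosen only to make the arithmetic easy and are not taken from Xu et al. With $\sigma$ read as ReLU, the loss can be written as:
+
+$$L_{pref} = \gamma \cdot \max(0,\ \theta - (L_n - L_p))$$
+
+If the gap is below the margin, say $L_n - L_p = 0.4$, the loss is active:
+
+$$L_{pref} = 2 \cdot \max(0,\ 1 - 0.4) = 1.2$$
+
+If the gap already exceeds the margin, say $L_n - L_p = 1.5$, the loss vanishes:
+
+$$L_{pref} = 2 \cdot \max(0,\ 1 - 1.5) = 0$$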
+
+### Joint objective
+
+$$L = L_{util} + L_{pref}$$
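+
+Substituting the two definitions above gives the expanded form below. The derivatives that follow are a sketch that treats $L_p$ and $L_n$ as free scalar variables, which is an assumption made for illustration rather than something stated in the source:
+
+$$L = \lambda_p L_p + \lambda_n L_n + \gamma \cdot \max(0,\ \theta - (L_n - L_p))$$
+
+While the margin is violated ($L_n - L_p < \theta$), the preference term contributes $+\gamma$ to the weight on $L_p$ and $-\gamma$ to the weight on $L_n$:
+
+$$\frac{\partial L}{\partial L_p} = \lambda_p + \gamma, \qquad \frac{\partial L}{\partial L_n} = \lambda_n - \gamma$$
+
+Once the margin is satisfied ($L_n - L_p > \theta$), only the utility term remains:
+
+$$\frac{\partial L}{\partial L_p} = \lambda_p, \qquad \frac{\partial L}{\partial L_n} = \lambda_n$$
+
+Under this reading, the objective increases $L_n$ only while the margin is violated and $\gamma > \lambda_n$; after that, both losses are simply minimized with their utility weights.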
+
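+## Experimental Results
+
+Across three forms of intervention (Local Weight, LoRA, Vector), SPLIT **consistently outperforms** the SFT and RePS baselines on the Psychopathy, PowerSeeking, and AxBench tasks:
+
+| Model | Method | Psychopathy Acc (%) | PowerSeeking Concept (0-4) |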
+|------|------|-------------------|--------------------------|
+| Gemma-2-9B | SPLIT (Vector) | 99.00 | 3.62 |
+| Gemma-2-9B | SFT (Vector) | 97.00 | 3.30 |
+| Qwen-2.5-7B | SPLIT (Local Weight) | 98.00 | 3.66 |
+
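+## Design Rationale
+
+SPLIT's core innovation is to treat preference and utility as **separable optimization objectives**:
+
+- $L_{util}$ keeps the model from drifting too far off the manifold (preserving utility)
+- $L_{pref}$ maximizes alignment with the preference direction within the manifold constraint (projection gain)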
+
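+## Related Concepts
+
+- [[preference-utility-analysis]]: the theoretical basis of SPLIT
+- [[activation-manifold]]: the geometric interpretation of utility preservation
+- [[validity-decay]]: the degradation that SPLIT tries to delay
+- [[preference-log-odds]]: $L_n - L_p$ as the optimization target
+- [[xu-why-steering-works]]: the source paper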