---
title: "SPLIT Steering"
created: 2026-06-01
updated: 2026-06-01
type: concept
tags: [steering, optimization, controllability]
sources: [raw/papers/xu-why-steering-works-2026.md]
---

# SPLIT Steering (Joint Preference-Utility Intervention)

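## Definition

SPLIT (**S**teering with **P**reference–Uti**L**ity **I**nterven**T**ion) is a training objective proposed by Xu et al. (2026) that explicitly optimizes preference while preserving utility. It is designed directly for the preference–utility trade-off.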

## Objective Function

### Utility loss (preserving general capability)

$$L_{util} = \lambda_p L_p + \lambda_n L_n$$

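Here $L_p$ and $L_n$ denote the losses on positive and negative samples, weighted by $\lambda_p$ and $\lambda_n$. Training on both kinds of samples together ensures the model keeps its ability to generate coherent text.

### Preference loss (maximizing control effect)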

$$L_{pref} = \gamma \cdot \sigma(\theta - (L_n - L_p))$$

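This is a hinge-style margin loss: when $L_n - L_p$ (the preference log-odds) exceeds the threshold $\theta$, the loss is 0; otherwise the loss pushes the gap to widen.

- $\sigma(\cdot)$ is ReLU (not the sigmoid that $\sigma$ often denotes)
- $\theta$ is the margin threshold
- $\gamma$ balances preference gain against utility preservation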
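
The hinge behaviour can be checked with a few lines of plain Python. This is a minimal sketch under assumptions: the function names, the scalar inputs, and the numeric values are illustrative and do not come from the paper. In practice $L_p$ and $L_n$ would be per-batch losses computed by a training framework.

```python
def relu(x: float) -> float:
    """ReLU, the function written as sigma in the preference loss."""
    return max(0.0, x)


def preference_loss(loss_pos: float, loss_neg: float,
                    theta: float, gamma: float) -> float:
    """Hinge-style margin loss: gamma * ReLU(theta - (L_n - L_p)).

    loss_pos / loss_neg stand in for L_p / L_n (hypothetical scalar inputs).
    """
    gap = loss_neg - loss_pos          # the preference log-odds L_n - L_p
    return gamma * relu(theta - gap)   # zero once the gap exceeds theta


# Gap (0.5) is below the margin (1.0): the loss is positive.
below_margin = preference_loss(loss_pos=2.0, loss_neg=2.5, theta=1.0, gamma=0.5)
print(below_margin)  # 0.25

# Gap (2.0) exceeds the margin (1.0): the loss vanishes.
above_margin = preference_loss(loss_pos=1.0, loss_neg=3.0, theta=1.0, gamma=0.5)
print(above_margin)  # 0.0
```

Once the margin is satisfied the term contributes nothing, so it stops pushing the gap wider and leaves the remaining optimization to the utility loss.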

### Joint objective

$$L = L_{util} + L_{pref}$$
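
Putting the two terms together, the joint objective might be sketched as follows. The `SplitConfig` container, its default values, and the `split_loss` helper are hypothetical names introduced here for illustration. The source note does not state the hyperparameter values used in the paper.

```python
from dataclasses import dataclass


@dataclass
class SplitConfig:
    """Hyperparameters of the objective (default values are placeholders)."""
    lambda_p: float = 1.0  # weight on the positive-sample loss L_p
    lambda_n: float = 1.0  # weight on the negative-sample loss L_n
    gamma: float = 1.0     # weight on the preference term
    theta: float = 1.0     # margin threshold


def split_loss(loss_pos: float, loss_neg: float, cfg: SplitConfig):
    """Return (total, utility, preference) for scalar losses L_p and L_n."""
    utility = cfg.lambda_p * loss_pos + cfg.lambda_n * loss_neg
    gap = loss_neg - loss_pos                          # L_n - L_p
    preference = cfg.gamma * max(0.0, cfg.theta - gap)  # ReLU hinge
    return utility + preference, utility, preference


cfg = SplitConfig(lambda_p=1.0, lambda_n=0.5, gamma=2.0, theta=1.0)
total, utility, preference = split_loss(loss_pos=2.0, loss_neg=2.5, cfg=cfg)
print(total, utility, preference)  # 4.25 3.25 1.0
```

Returning the two components alongside the total makes it easy to log them separately and see which term dominates at a given point in training.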

## Experimental Results

Across three intervention forms (Local Weight, LoRA, Vector), SPLIT **outperforms** the SFT and RePS baselines on the Psychopathy, PowerSeeking, and AxBench tasks:

| Model | Method | Psychopathy Acc (%) | PowerSeeking Concept (0-4) |
|-------|--------|---------------------|----------------------------|
| Gemma-2-9B | SPLIT (Vector) | 99.00 | 3.62 |
| Gemma-2-9B | SFT (Vector) | 97.00 | 3.30 |
| Qwen-2.5-7B | SPLIT (Local Weight) | 98.00 | 3.66 |

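## Design Rationale

SPLIT's core innovation is treating preference and utility as **separable optimization objectives**:

- $L_{util}$ keeps the model from drifting too far off the manifold (preserving utility)
- $L_{pref}$ maximizes alignment with the preference direction within the manifold constraint (projection gain)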
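
One way to see how the two terms interact is to differentiate the joint objective, as written in this note, with respect to the trainable parameters. The symbol $w$ for those parameters is introduced here for illustration, to avoid clashing with the margin $\theta$. This is a sketch derived from the formulas in this note, not a result quoted from the paper:

$$\nabla_w L = \begin{cases} (\lambda_p + \gamma)\,\nabla_w L_p + (\lambda_n - \gamma)\,\nabla_w L_n, & L_n - L_p < \theta \\ \lambda_p\,\nabla_w L_p + \lambda_n\,\nabla_w L_n, & L_n - L_p > \theta \end{cases}$$

Below the margin, the preference term adds $\gamma$ to the weight on $L_p$ and subtracts $\gamma$ from the weight on $L_n$, so the update favours widening the gap. Above the margin, only the utility term contributes. At $L_n - L_p = \theta$ the ReLU is not differentiable and a subgradient would be used.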

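## Related Concepts

- [[preference-utility-analysis]]: the theoretical basis of SPLIT
- [[activation-manifold]]: the geometric interpretation of utility preservation
- [[validity-decay]]: the degradation SPLIT tries to delay
- [[preference-log-odds]]: $L_n - L_p$ as the optimization target
- [[xu-why-steering-works]]: the source paper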