myWiki/raw/papers/xu-why-steering-works-2026.md

---
source_url: https://arxiv.org/abs/2602.02343
ingested: 2026-06-01
sha256: raw-from-pdf
---

# Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics

**Authors:** Ziwen Xu¹², Chenyan Wu¹, Hengyu Sun¹, Haiwen Hong²*, Mengru Wang¹, Yunzhi Yao¹, Longtao Huang², Hui Xue², Shumin Deng¹, Zhixuan Chu¹, Huajun Chen¹, Ningyu Zhang¹*

**Affiliations:** ¹Zhejiang University, ²Alibaba Group

**arXiv:** 2602.02343 (v3, 12 Apr 2026)

**Code:** https://github.com/zjunlp/EasyEdit/blob/main/examples/SPLIT.md

## Abstract

Methods for controlling large language models (LLMs), including local weight fine-tuning, LoRA-based adaptation, and activation-based interventions, are often studied in isolation, obscuring their connections and making comparison difficult. In this work, we present a unified view that frames these interventions as dynamic weight updates induced by a control signal, placing them within a single conceptual framework. Building on this view, we propose a unified preference-utility analysis. This analysis separates control effects into two components: preference, defined as the tendency toward a target concept, and utility, defined as coherent and task-valid generation. Both components are measured on a shared log-odds scale using polarity-paired contrastive examples. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility. We further explain this behavior through an activation manifold perspective, in which control shifts representations along target-concept directions to enhance preference, while utility declines primarily when interventions push representations off the model's valid-generation manifold. Finally, we introduce a new steering approach SPLIT guided by this analysis that improves preference while better preserving utility.

## Key Contributions

1. **Unified View**: casts local weight fine-tuning, LoRA, and activation steering as dynamic weight updates of the form `h_{i+1} = (W + m₁ΔW)h_i + (b + m₂Δb)`, where `m₁` and `m₂` scale the weight update `ΔW` and the bias update `Δb`
2. **Preference–Utility Analysis**: decomposes control effects into preference (tendency toward the target concept) and utility (coherent, task-valid generation), both measured on a shared log-odds scale
3. **Activation Manifold Hypothesis**: explains the preference–utility trade-off, in which steering pushes representations off the training-induced activation manifold and utility degrades as a result
4. **Three-Stage Preference Dynamics**: Linear Region → Transitional Region → Convergence Region as the steering factor `m` varies
5. **SPLIT Method**: Steering with Preference-UtiLity IntervenTion, a training objective that jointly optimizes preference and utility
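
To make the unified update in item 1 concrete, here is a minimal pure-Python sketch of a single linear layer under `h_{i+1} = (W + m₁ΔW)h_i + (b + m₂Δb)`. The function names (`controlled_layer`, `lora_delta`), the toy values, and the mapping of each method to a special case are illustrative assumptions, not the paper's code or notation: activation steering is read here as `ΔW = 0` with `Δb` set to a steering vector, and LoRA as a low-rank `ΔW = B·A`.

```python
from typing import List, Optional

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, h: Vector) -> Vector:
    """Plain matrix-vector product W @ h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def lora_delta(B: Matrix, A: Matrix) -> Matrix:
    """Low-rank weight update dW = B @ A (the usual LoRA parameterization)."""
    inner, cols = len(A), len(A[0])
    return [[sum(B[i][k] * A[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(len(B))]


def controlled_layer(W: Matrix, b: Vector, h: Vector,
                     dW: Optional[Matrix] = None, db: Optional[Vector] = None,
                     m1: float = 0.0, m2: float = 0.0) -> Vector:
    """Compute h_next = (W + m1*dW) h + (b + m2*db) without forming W + m1*dW."""
    out = [y + bi for y, bi in zip(matvec(W, h), b)]
    if dW is not None:
        out = [y + m1 * d for y, d in zip(out, matvec(dW, h))]
    if db is not None:
        out = [y + m2 * d for y, d in zip(out, db)]
    return out


# Toy layer (hypothetical values): identity weights, zero bias.
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
h = [1.0, 2.0]

# No control signal: the layer is unchanged.
base = controlled_layer(W, b, h)

# Activation steering (assumed special case): dW = 0, db = steering vector.
steer_vec = [0.5, -0.5]
steered = controlled_layer(W, b, h, db=steer_vec, m2=2.0)

# LoRA-style control (assumed special case): dW = B @ A with rank 1.
B = [[1.0], [0.0]]
A = [[0.0, 1.0]]
lora = controlled_layer(W, b, h, dW=lora_delta(B, A), m1=0.5)

print(base, steered, lora)
```

In this reading, the shift produced by a steering vector is exactly `m₂Δb` and does not depend on the input, while a weight update contributes `m₁ΔW·h_i`, which does. Setting `m₁ = m₂ = 0` recovers the unmodified layer.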

## Experimental Setup

- Models: Gemma-2-9B-IT, Qwen-2.5-7B-Instruct
- Tasks: Psychopathy, PowerSeeking, AxBench (top 10 concepts)
- Intervention forms: Local Weight, LoRA, Vector (DiffMean/SFT/RePS)
- Curve-fit quality: R² > 0.95 in most settings