| source_url | ingested | sha256 |
|---|---|---|
| https://arxiv.org/abs/2602.02343 | 2026-06-01 | raw-from-pdf |
Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics
Authors: Ziwen Xu¹², Chenyan Wu¹, Hengyu Sun¹, Haiwen Hong²*, Mengru Wang¹, Yunzhi Yao¹, Longtao Huang², Hui Xue², Shumin Deng¹, Zhixuan Chu¹, Huajun Chen¹, Ningyu Zhang¹*
Affiliations: ¹Zhejiang University, ²Alibaba Group
arXiv: 2602.02343 (v3, 12 Apr 2026)
Code: https://github.com/zjunlp/EasyEdit/blob/main/examples/SPLIT.md
Abstract
Methods for controlling large language models (LLMs), including local weight fine-tuning, LoRA-based adaptation, and activation-based interventions, are often studied in isolation, obscuring their connections and making comparison difficult. In this work, we present a unified view that frames these interventions as dynamic weight updates induced by a control signal, placing them within a single conceptual framework. Building on this view, we propose a unified preference-utility analysis. This analysis separates control effects into two components: preference, defined as the tendency toward a target concept, and utility, defined as coherent and task-valid generation. Both components are measured on a shared log-odds scale using polarity-paired contrastive examples. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility. We further explain this behavior through an activation manifold perspective, in which control shifts representations along target-concept directions to enhance preference, while utility declines primarily when interventions push representations off the model's valid-generation manifold. Finally, we introduce a new steering approach SPLIT guided by this analysis that improves preference while better preserving utility.
Key Contributions
- Unified View — casts local weight fine-tuning, LoRA, and activation steering as dynamic weight updates:
  $h_{i+1} = (W + m_1 \Delta W)\,h_i + (b + m_2 \Delta b)$
- Preference–Utility Analysis — decomposes control into preference (target concept alignment) and utility (task validity) on a shared log-odds scale
- Activation Manifold Hypothesis — explains the preference–utility trade-off: steering pushes representations off the training-induced activation manifold, causing utility degradation
- Three-Stage Preference Dynamics — Linear Region → Transitional Region → Convergence Region as steering factor m varies
- SPLIT Method — Steering with Preference-UtiLity IntervenTion, a training objective that jointly optimizes preference and utility
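The update rule in the Unified View bullet can be made concrete with a small pure-Python sketch. This is an illustration under assumptions, not the paper's or EasyEdit's code: the function names (`dynamic_update`, `low_rank_delta`) and the toy numbers are hypothetical, and mapping vector steering to a bias shift (ΔW = 0) and LoRA to a low-rank ΔW = BA is one natural reading of the formula rather than something stated in this summary.

```python
def matvec(W, h):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def low_rank_delta(B, A):
    """LoRA-style weight delta dW = B @ A (assumed parameterisation)."""
    rank = len(A)
    return [
        [sum(B[i][k] * A[k][j] for k in range(rank)) for j in range(len(A[0]))]
        for i in range(len(B))
    ]


def dynamic_update(W, b, h, dW=None, db=None, m1=0.0, m2=0.0):
    """Compute h_next = (W + m1 * dW) h + (b + m2 * db)."""
    if dW is None:
        dW = [[0.0] * len(h) for _ in W]  # no weight intervention
    if db is None:
        db = [0.0] * len(b)  # no bias intervention
    W_eff = [[w + m1 * d for w, d in zip(rw, rd)] for rw, rd in zip(W, dW)]
    b_eff = [bi + m2 * di for bi, di in zip(b, db)]
    return [y + bi for y, bi in zip(matvec(W_eff, h), b_eff)]


# Toy layer: identity weights, zero bias (hypothetical values).
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
h = [1.0, 2.0]

# Vector steering read as a bias shift: dW = 0, db = steering vector v.
v = [0.5, -0.5]
steered = dynamic_update(W, b, h, db=v, m2=2.0)

# LoRA read as a rank-1 weight delta: dW = B @ A, db = 0.
B = [[1.0], [0.0]]
A = [[0.0, 1.0]]
lora_out = dynamic_update(W, b, h, dW=low_rank_delta(B, A), m1=0.5)

print(steered)   # [2.0, 1.0]
print(lora_out)  # [2.0, 2.0]
```

With m₁ = m₂ = 0 the layer output is unchanged, so in this reading the two scalars act as the steering strength for each form of intervention.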
Experimental Setup
- Models: Gemma-2-9B-IT, Qwen-2.5-7B-Instruct
- Tasks: Psychopathy, PowerSeeking, AxBench (top 10 concepts)
- Intervention forms: Local Weight, LoRA, Vector (DiffMean/SFT/RePS)
- Curve fitting: R² > 0.95 across most settings
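The R² reported for curve fitting is the standard coefficient of determination. The functional form fitted in the paper is not given in this summary, so the sketch below only shows how the metric itself is computed; the function name `r_squared` and the numbers are made up for illustration and are not results from the paper.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Synthetic example values (not from the paper).
observed = [0.0, 1.0, 2.0, 3.0]
fitted = [0.1, 0.9, 2.1, 2.9]

r2 = r_squared(observed, fitted)
print(round(r2, 3))  # 0.992
```

A value near 1 means the fitted curve explains almost all of the variance in the observed points; a value of 0 means it does no better than predicting the mean.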