---
source_url: https://arxiv.org/abs/2602.02343
ingested: 2026-06-01
sha256: raw-from-pdf
---
# Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics
**Authors:** Ziwen Xu¹², Chenyan Wu¹, Hengyu Sun¹, Haiwen Hong²*, Mengru Wang¹, Yunzhi Yao¹, Longtao Huang², Hui Xue², Shumin Deng¹, Zhixuan Chu¹, Huajun Chen¹, Ningyu Zhang¹*

**Affiliations:** ¹Zhejiang University, ²Alibaba Group

**arXiv:** 2602.02343 (v3, 12 Apr 2026)

**Code:** https://github.com/zjunlp/EasyEdit/blob/main/examples/SPLIT.md
## Abstract
Methods for controlling large language models (LLMs), including local weight fine-tuning, LoRA-based adaptation, and activation-based interventions, are often studied in isolation, obscuring their connections and making comparison difficult. In this work, we present a unified view that frames these interventions as dynamic weight updates induced by a control signal, placing them within a single conceptual framework. Building on this view, we propose a unified preference-utility analysis. This analysis separates control effects into two components: preference, defined as the tendency toward a target concept, and utility, defined as coherent and task-valid generation. Both components are measured on a shared log-odds scale using polarity-paired contrastive examples. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility. We further explain this behavior through an activation manifold perspective, in which control shifts representations along target-concept directions to enhance preference, while utility declines primarily when interventions push representations off the model's valid-generation manifold. Finally, we introduce a new steering approach SPLIT guided by this analysis that improves preference while better preserving utility.
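
The abstract does not spell out how the shared log-odds scale is computed. As a minimal sketch, assume each polarity pair is scored by the model's sequence log-probabilities for a concept-positive and a concept-negative completion (`logp_pos` and `logp_neg`, both hypothetical inputs). The preference for one pair can then be read as the log-odds of the positive completion once the two probabilities are renormalized against each other. The function names are illustrative, and the utility score is left out because this summary does not define it.

```python
import math
from typing import Sequence, Tuple


def log_odds(p: float) -> float:
    """Log-odds (logit) of a probability strictly between 0 and 1."""
    return math.log(p) - math.log1p(-p)


def pair_preference(logp_pos: float, logp_neg: float) -> float:
    """Log-odds of the positive completion within one polarity pair.

    Renormalizing over the pair gives p = e^pos / (e^pos + e^neg), and
    logit(p) simplifies to logp_pos - logp_neg. (Assumed scoring rule.)
    """
    return logp_pos - logp_neg


def mean_preference(pairs: Sequence[Tuple[float, float]]) -> float:
    """Average pairwise log-odds over (logp_pos, logp_neg) pairs."""
    return sum(pair_preference(pos, neg) for pos, neg in pairs) / len(pairs)


# Hypothetical sequence log-probabilities, not results from the paper.
example_pairs = [(-12.0, -15.0), (-9.5, -9.0), (-20.0, -23.5)]
preference_score = mean_preference(example_pairs)

# Sanity check: the shortcut matches the explicit renormalized logit.
p_first = math.exp(-12.0) / (math.exp(-12.0) + math.exp(-15.0))
shortcut_matches = math.isclose(log_odds(p_first), pair_preference(-12.0, -15.0))
```

Under this assumed scoring rule, a positive score means the model favors the target-concept completion on average, and a score of zero means no preference either way.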
## Key Contributions
1. **Unified View** — casts local weight fine-tuning, LoRA, and activation steering as dynamic weight updates: `h_{i+1} = (W + m₁ΔW)h_i + (b + m₂Δb)`
2. **Preference–Utility Analysis** — decomposes control into preference (target concept alignment) and utility (task validity) on a shared log-odds scale
3. **Activation Manifold Hypothesis** — explains the preference–utility trade-off: steering pushes representations off the training-induced activation manifold, causing utility degradation
4. **Three-Stage Preference Dynamics** — Linear Region → Transitional Region → Convergence Region as steering factor m varies
5. **SPLIT Method** — Steering with Preference-UtiLity IntervenTion, a training objective that jointly optimizes preference and utility
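
The update rule in the first contribution can be made concrete with a small sketch. The code below is illustrative only: it uses plain Python lists, a single linear layer, and scalar factors `m1` and `m2`. The helper names (`controlled_layer`, `outer`) and the mapping of each method to a choice of `dW` and `db` are assumptions for exposition, not the paper's implementation. Under that reading, adding a steering vector to the layer output is a bias-only update (`dW = 0`), and a LoRA-style adapter is a low-rank `dW`.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, h: Vector) -> Vector:
    """Matrix-vector product W @ h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def outer(u: Vector, v: Vector) -> Matrix:
    """Outer product u v^T, used here to build a rank-1 update."""
    return [[ui * vj for vj in v] for ui in u]


def controlled_layer(W: Matrix, b: Vector, h: Vector,
                     dW: Matrix, db: Vector,
                     m1: float = 0.0, m2: float = 0.0) -> Vector:
    """Compute h_next = (W + m1*dW) h + (b + m2*db)."""
    W_eff = [[w + m1 * d for w, d in zip(w_row, d_row)]
             for w_row, d_row in zip(W, dW)]
    b_eff = [bi + m2 * di for bi, di in zip(b, db)]
    return [y + bi for y, bi in zip(matvec(W_eff, h), b_eff)]


# Toy 2-dimensional layer (identity weights, zero bias).
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
h = [1.0, 2.0]
no_dW = [[0.0, 0.0], [0.0, 0.0]]
no_db = [0.0, 0.0]

# No control: m1 = m2 = 0 recovers the original layer.
base = controlled_layer(W, b, h, no_dW, no_db)

# Activation steering: bias-only update with steering vector db.
steered = controlled_layer(W, b, h, no_dW, [0.5, -0.5], m2=2.0)

# LoRA-style update: rank-1 dW = B A^T scaled by m1.
lora_dW = outer([1.0, 0.0], [0.0, 1.0])
lora = controlled_layer(W, b, h, lora_dW, no_db, m1=0.5)
```

In the same picture, local weight fine-tuning would correspond to a dense, unconstrained `dW`. Varying `m1` or `m2` then plays the role of the steering factor whose effect on preference and utility the paper analyzes.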
## Experimental Setup
- Models: Gemma-2-9B-IT, Qwen-2.5-7B-Instruct
- Tasks: Psychopathy, PowerSeeking, AxBench (top 10 concepts)
- Intervention forms: Local Weight, LoRA, Vector (DiffMean/SFT/RePS)
- Curve fitting: R² > 0.95 across most settings
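
This summary does not say which curve family is fitted or how R² is computed. As a rough sketch, assume the standard coefficient of determination and an illustrative saturating curve, `a * tanh(k * m)`. That curve is chosen only because it is linear near `m = 0` and flattens for large `|m|`, loosely echoing the Linear, Transitional, and Convergence regions. Goodness of fit against the steering factor `m` could then be checked as follows. The curve, its parameters, and the data points are all made up for illustration.

```python
import math
from typing import Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - f) ** 2 for y, f in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


def saturating_curve(m: float, a: float, k: float) -> float:
    """Hypothetical stand-in curve: linear near 0, saturating at +/- a."""
    return a * math.tanh(k * m)


# Synthetic log-odds measurements at several steering factors (not paper data).
factors = [-4.0, -2.0, -1.0, 0.0, 1.0, 2.0, 4.0]
offsets = [0.05, -0.04, 0.03, 0.0, -0.03, 0.04, -0.05]
observed = [saturating_curve(m, 3.0, 0.8) + e for m, e in zip(factors, offsets)]

# Score a candidate fit; here we simply reuse the generating parameters.
predicted = [saturating_curve(m, 3.0, 0.8) for m in factors]
fit_quality = r_squared(observed, predicted)
```

In practice the parameters would come from a least-squares fit rather than being known in advance; the sketch only shows how a threshold such as R² > 0.95 would be evaluated once a fitted curve is available.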