---
source_url: https://arxiv.org/abs/2602.02343
ingested: 2026-06-01
sha256: raw-from-pdf
---
# Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics
**Authors:** Ziwen Xu¹², Chenyan Wu¹, Hengyu Sun¹, Haiwen Hong²*, Mengru Wang¹, Yunzhi Yao¹, Longtao Huang², Hui Xue², Shumin Deng¹, Zhixuan Chu¹, Huajun Chen¹, Ningyu Zhang¹*
**Affiliations:** ¹Zhejiang University, ²Alibaba Group
**arXiv:** 2602.02343 (v3, 12 Apr 2026)
**Code:** https://github.com/zjunlp/EasyEdit/blob/main/examples/SPLIT.md
## Abstract
Methods for controlling large language models (LLMs), including local weight fine-tuning, LoRA-based adaptation, and activation-based interventions, are often studied in isolation, obscuring their connections and making comparison difficult. In this work, we present a unified view that frames these interventions as dynamic weight updates induced by a control signal, placing them within a single conceptual framework. Building on this view, we propose a unified preference-utility analysis. This analysis separates control effects into two components: preference, defined as the tendency toward a target concept, and utility, defined as coherent and task-valid generation. Both components are measured on a shared log-odds scale using polarity-paired contrastive examples. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility. We further explain this behavior through an activation manifold perspective, in which control shifts representations along target-concept directions to enhance preference, while utility declines primarily when interventions push representations off the model's valid-generation manifold. Finally, we introduce a new steering approach SPLIT guided by this analysis that improves preference while better preserving utility.
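As a rough illustration of the shared log-odds scale, the sketch below scores a single polarity pair. The formulas are assumptions for illustration, not taken from the paper. Preference is read as the log-odds of the concept-positive completion against its concept-negative counterpart. Utility is read as the log-odds of the probability mass on either paired completion against everything else. The function names and the example probabilities are hypothetical.

```python
import math

def preference_log_odds(p_pos: float, p_neg: float) -> float:
    """Log-odds of the concept-positive completion versus the negative one (assumed definition)."""
    return math.log(p_pos) - math.log(p_neg)

def utility_log_odds(p_pos: float, p_neg: float) -> float:
    """Log-odds of producing either paired completion versus anything else (assumed definition)."""
    p_valid = p_pos + p_neg
    return math.log(p_valid) - math.log(1.0 - p_valid)

# Hypothetical (p_pos, p_neg) for one contrastive pair, before and after steering.
before = (0.30, 0.50)
after = (0.45, 0.05)

for label, (p_pos, p_neg) in (("before", before), ("after", after)):
    print(label, preference_log_odds(p_pos, p_neg), utility_log_odds(p_pos, p_neg))
```

With these made-up numbers, steering raises preference while the mass on valid completions shrinks, so utility drops. This is the shape of the trade-off the abstract describes.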
## Key Contributions
1. **Unified View** — casts local weight fine-tuning, LoRA, and activation steering as dynamic weight updates: `h_{i+1} = (W + m₁ΔW)h_i + (b + m₂Δb)`
2. **Preference-Utility Analysis** — decomposes control into preference (target concept alignment) and utility (task validity) on a shared log-odds scale
3. **Activation Manifold Hypothesis** — explains the preference-utility trade-off: steering pushes representations off the training-induced activation manifold, causing utility degradation
4. **Three-Stage Preference Dynamics** — Linear Region → Transitional Region → Convergence Region as steering factor m varies
5. **SPLIT Method** — Steering with Preference-UtiLity IntervenTion, a training objective that jointly optimizes preference and utility
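The update rule in contribution 1 can be sketched in plain Python. How each method maps onto ΔW and Δb here is an assumption made for illustration: activation steering is treated as a bias-only update (m₁ = 0, with Δb equal to a steering vector), and LoRA as a low-rank ΔW = BA. The helper `controlled_layer` and all numeric values are hypothetical.

```python
def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def matmul(A, B):
    """Multiply matrix A (n x r) by matrix B (r x m)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def controlled_layer(W, b, h, dW=None, db=None, m1=0.0, m2=0.0):
    """Compute h_next = (W + m1*dW) h + (b + m2*db)."""
    n_out, n_in = len(W), len(W[0])
    dW = dW if dW is not None else [[0.0] * n_in for _ in range(n_out)]
    db = db if db is not None else [0.0] * n_out
    W_eff = [[W[i][j] + m1 * dW[i][j] for j in range(n_in)] for i in range(n_out)]
    b_eff = [b[i] + m2 * db[i] for i in range(n_out)]
    return [y + c for y, c in zip(matvec(W_eff, h), b_eff)]

W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
h = [1.0, 2.0]

# No control signal: the layer is unchanged.
base = controlled_layer(W, b, h)

# Activation steering (assumed mapping): add m2 * steering_vector to the output.
steering_vector = [0.5, -0.5]
steered = controlled_layer(W, b, h, db=steering_vector, m2=2.0)

# LoRA (assumed mapping): a rank-1 weight update dW = B @ A.
B = [[1.0], [0.0]]
A = [[0.0, 1.0]]
lora = controlled_layer(W, b, h, dW=matmul(B, A), m1=1.0)
```

Setting both multipliers to zero recovers the unmodified layer. Under this reading, the multipliers act as the steering strength that the rest of the analysis sweeps over.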
## Experimental Setup
- Models: Gemma-2-9B-IT, Qwen-2.5-7B-Instruct
- Tasks: Psychopathy, PowerSeeking, AxBench (top 10 concepts)
- Intervention forms: Local Weight, LoRA, Vector (DiffMean/SFT/RePS)
- Curve fitting R² > 0.95 across most settings
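
For reference, R² is the coefficient of determination, 1 − SS_res / SS_tot. The sketch below computes it for a fitted curve. The source does not state which functional form was fitted, so the tanh-shaped `saturating_curve`, its parameters, and the noisy "observations" are hypothetical stand-ins, chosen only because they are near-linear for small steering factors and flat for large ones.

```python
import math

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, predicted))
    ss_tot = sum((y - mean) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot

def saturating_curve(m, a=3.0, k=0.5):
    """Hypothetical stand-in curve: roughly a*k*m near zero, approaching a for large m."""
    return a * math.tanh(k * m)

# Hypothetical steering factors and made-up noise; not data from the paper.
factors = [0.0, 0.5, 1.0, 2.0, 4.0, 8.0]
noise = [0.0, 0.05, -0.05, 0.05, -0.05, 0.0]
observed = [saturating_curve(m) + e for m, e in zip(factors, noise)]
predicted = [saturating_curve(m) for m in factors]

fit_quality = r_squared(observed, predicted)
print(fit_quality)
```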