source_url: https://arxiv.org/abs/2602.02343
ingested: 2026-06-01
sha256: raw-from-pdf

Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics

Authors: Ziwen Xu¹², Chenyan Wu¹, Hengyu Sun¹, Haiwen Hong²*, Mengru Wang¹, Yunzhi Yao¹, Longtao Huang², Hui Xue², Shumin Deng¹, Zhixuan Chu¹, Huajun Chen¹, Ningyu Zhang¹*

Affiliations: ¹Zhejiang University, ²Alibaba Group

arXiv: 2602.02343 (v3, 12 Apr 2026)

Code: https://github.com/zjunlp/EasyEdit/blob/main/examples/SPLIT.md

Abstract

Methods for controlling large language models (LLMs), including local weight fine-tuning, LoRA-based adaptation, and activation-based interventions, are often studied in isolation, obscuring their connections and making comparison difficult. In this work, we present a unified view that frames these interventions as dynamic weight updates induced by a control signal, placing them within a single conceptual framework. Building on this view, we propose a unified preference-utility analysis. This analysis separates control effects into two components: preference, defined as the tendency toward a target concept, and utility, defined as coherent and task-valid generation. Both components are measured on a shared log-odds scale using polarity-paired contrastive examples. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility. We further explain this behavior through an activation manifold perspective, in which control shifts representations along target-concept directions to enhance preference, while utility declines primarily when interventions push representations off the model's valid-generation manifold. Finally, we introduce a new steering approach SPLIT guided by this analysis that improves preference while better preserving utility.
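As a rough illustration of what a log-odds score over a polarity-paired example can look like, the sketch below compares the log-probability a model assigns to a concept-positive completion against its concept-negative counterpart. This note does not record the paper's exact formulas, so the pairwise definition and the helper names (`sequence_logprob`, `pairwise_log_odds`, `pairwise_probability`) are assumptions for illustration, and the token log-probabilities are made-up numbers rather than model outputs.

```python
import math


def sequence_logprob(token_logprobs):
    """Sum per-token log-probabilities into one sequence log-probability."""
    return sum(token_logprobs)


def pairwise_log_odds(logp_pos, logp_neg):
    """Log-odds of the positive completion within the pair.

    If the choice is restricted to the two completions, then
    P(pos) = p_pos / (p_pos + p_neg) and logit(P(pos)) = logp_pos - logp_neg.
    """
    return logp_pos - logp_neg


def pairwise_probability(logp_pos, logp_neg):
    """Map the pairwise log-odds back to a probability with a sigmoid."""
    return 1.0 / (1.0 + math.exp(-pairwise_log_odds(logp_pos, logp_neg)))


# Hypothetical per-token log-probs for one polarity pair (not real data).
pos_tokens = [math.log(0.6), math.log(0.5)]
neg_tokens = [math.log(0.2), math.log(0.5)]

logp_pos = sequence_logprob(pos_tokens)
logp_neg = sequence_logprob(neg_tokens)

preference = pairwise_log_odds(logp_pos, logp_neg)  # log(0.3 / 0.1)
p_pos = pairwise_probability(logp_pos, logp_neg)    # 0.3 / (0.3 + 0.1)
```

A score on this scale is zero when the model is indifferent between the two completions and grows without bound as it favors the positive one, which is one reason a log-odds scale is convenient for comparing interventions of different strength.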

Key Contributions

  1. Unified View: casts local weight fine-tuning, LoRA, and activation steering as dynamic weight updates of the form h_{i+1} = (W + m₁ΔW)h_i + (b + m₂Δb), where m₁ and m₂ scale the weight and bias updates
  2. Preference-Utility Analysis: decomposes control into preference (target-concept alignment) and utility (task validity) on a shared log-odds scale
  3. Activation Manifold Hypothesis: explains the preference-utility trade-off; steering pushes representations off the training-induced activation manifold, which degrades utility
  4. Three-Stage Preference Dynamics: Linear Region → Transitional Region → Convergence Region as the steering factor m varies
  5. SPLIT Method: Steering with Preference-UtiLity IntervenTion, a training objective that jointly optimizes preference and utility
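The unified update in the first contribution can be made concrete with a toy linear layer. The sketch below implements h_{i+1} = (W + m₁ΔW)h_i + (b + m₂Δb) in plain Python and instantiates two special cases. The mapping of each method onto (ΔW, Δb) is an assumption made here for illustration: activation steering is treated as ΔW = 0 with Δb equal to a steering vector, and LoRA as a low-rank ΔW = BA. The function names and all numeric values are hypothetical.

```python
def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def matmul(B, A):
    """Multiply matrix B (n x r) by matrix A (r x d)."""
    cols = list(zip(*A))
    return [[sum(b * a for b, a in zip(row, col)) for col in cols] for row in B]


def controlled_layer(W, b, h, dW, db, m1, m2):
    """Compute (W + m1*dW) h + (b + m2*db)."""
    W_eff = [[w + m1 * d for w, d in zip(w_row, d_row)]
             for w_row, d_row in zip(W, dW)]
    b_eff = [bi + m2 * di for bi, di in zip(b, db)]
    return [y + c for y, c in zip(matvec(W_eff, h), b_eff)]


# Toy base layer: identity weights, zero bias.
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
h = [1.0, 2.0]
zero_W = [[0.0, 0.0], [0.0, 0.0]]
zero_b = [0.0, 0.0]

# No control signal (m1 = m2 = 0) recovers the base layer output.
base = controlled_layer(W, b, h, zero_W, zero_b, m1=0.0, m2=0.0)

# Activation steering (assumed mapping): dW = 0, db = steering vector v,
# so the output is W h + b + m2 * v.
v = [0.5, -0.5]
steered = controlled_layer(W, b, h, zero_W, v, m1=0.0, m2=2.0)

# LoRA (assumed mapping): dW = B A with rank 1, bias left untouched.
B = [[1.0], [0.0]]  # 2 x 1
A = [[0.0, 1.0]]    # 1 x 2
lora = controlled_layer(W, b, h, matmul(B, A), zero_b, m1=1.0, m2=0.0)
```

Under this reading, the three intervention families differ only in how ΔW and Δb are parameterized, while m₁ and m₂ act as a shared dial for control strength.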

Experimental Setup

  • Models: Gemma-2-9B-IT, Qwen-2.5-7B-Instruct
  • Tasks: Psychopathy, PowerSeeking, AxBench (top 10 concepts)
  • Intervention forms: Local Weight, LoRA, Vector (DiffMean/SFT/RePS)
  • Curve fitting: R² > 0.95 across most settings
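For reference, the R² figure quoted above is the standard coefficient of determination, R² = 1 − SS_res / SS_tot. The sketch below computes it for a set of observations and fitted values. The functional form the authors fit is not recorded in this note, so the data points and the `r_squared` helper are hypothetical and only show how such a threshold would be checked.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical scores at increasing steering factors, with fitted values.
observed = [0.0, 1.0, 2.0, 3.0, 4.0]
fitted = [0.1, 0.9, 2.0, 3.1, 3.9]

score = r_squared(observed, fitted)
meets_threshold = score > 0.95
```

A perfect fit gives R² = 1, and a fit no better than predicting the mean of the observations gives R² = 0.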