source_url	ingested	sha256
https://arxiv.org/abs/2602.02343	2026-06-01	raw-from-pdf

Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics

Authors: Ziwen Xu¹², Chenyan Wu¹, Hengyu Sun¹, Haiwen Hong²*, Mengru Wang¹, Yunzhi Yao¹, Longtao Huang², Hui Xue², Shumin Deng¹, Zhixuan Chu¹, Huajun Chen¹, Ningyu Zhang¹*

Affiliations: ¹Zhejiang University, ²Alibaba Group

arXiv: 2602.02343 (v3, 12 Apr 2026)

Code: https://github.com/zjunlp/EasyEdit/blob/main/examples/SPLIT.md

Abstract

Methods for controlling large language models (LLMs), including local weight fine-tuning, LoRA-based adaptation, and activation-based interventions, are often studied in isolation, obscuring their connections and making comparison difficult. In this work, we present a unified view that frames these interventions as dynamic weight updates induced by a control signal, placing them within a single conceptual framework. Building on this view, we propose a unified preference-utility analysis. This analysis separates control effects into two components: preference, defined as the tendency toward a target concept, and utility, defined as coherent and task-valid generation. Both components are measured on a shared log-odds scale using polarity-paired contrastive examples. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility. We further explain this behavior through an activation manifold perspective, in which control shifts representations along target-concept directions to enhance preference, while utility declines primarily when interventions push representations off the model's valid-generation manifold. Finally, we introduce a new steering approach SPLIT guided by this analysis that improves preference while better preserving utility.

Key Contributions

Unified View: casts local weight fine-tuning, LoRA, and activation steering as dynamic weight updates of the form h_{i+1} = (W + m_1 ΔW) h_i + (b + m_2 Δb)
Preference–Utility Analysis: decomposes control into preference (target concept alignment) and utility (task validity) on a shared log-odds scale
Activation Manifold Hypothesis: explains the preference–utility trade-off: steering pushes representations off the training-induced activation manifold, causing utility degradation
Three-Stage Preference Dynamics: Linear Region → Transitional Region → Convergence Region as steering factor m varies
SPLIT Method: Steering with Preference-UtiLity IntervenTion, a training objective that jointly optimizes preference and utility
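
The unified update listed under Key Contributions, h_{i+1} = (W + m_1 ΔW) h_i + (b + m_2 Δb), can be illustrated with a small NumPy sketch. The mapping of each method onto ΔW and Δb below is an assumed, natural reading of that formula (dense ΔW for local weight fine-tuning, low-rank ΔW = BA for LoRA, a steering vector as Δb for activation steering), not code from the paper or from EasyEdit. The function name `controlled_layer`, the toy dimensions, and the random tensors are hypothetical.

```python
import numpy as np

def controlled_layer(h, W, b, dW=None, db=None, m1=0.0, m2=0.0):
    """Return (W + m1*dW) @ h + (b + m2*db); a missing delta is treated as zero."""
    W_eff = W if dW is None else W + m1 * dW
    b_eff = b if db is None else b + m2 * db
    return W_eff @ h + b_eff

rng = np.random.default_rng(0)
d, r = 8, 2                       # toy hidden size and low-rank dimension (assumed)
W = rng.normal(size=(d, d))       # frozen base weight
b = rng.normal(size=d)            # frozen base bias
h = rng.normal(size=d)            # incoming activation h_i

# Local weight fine-tuning (assumed form): a dense delta on W, applied at full strength.
dW_local = 0.01 * rng.normal(size=(d, d))
h_local = controlled_layer(h, W, b, dW=dW_local, m1=1.0)

# LoRA (assumed form): the delta is a low-rank product B @ A, scaled by m1.
B = rng.normal(size=(d, r))
A = rng.normal(size=(r, d))
h_lora = controlled_layer(h, W, b, dW=B @ A, m1=0.5)

# Activation steering (assumed form): no weight delta; a steering vector v
# acts as a bias delta, scaled by the steering factor m2.
v = rng.normal(size=d)
h_steer = controlled_layer(h, W, b, db=v, m2=2.0)
```

Under these assumptions, activation steering reduces to adding m_2 · v to the unmodified layer output, and setting both multipliers to zero recovers the base model. This is why a single steering factor can be swept across all three intervention forms.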

Experimental Setup

Models: Gemma-2-9B-IT, Qwen-2.5-7B-Instruct
Tasks: Psychopathy, PowerSeeking, AxBench (top 10 concepts)
Intervention forms: Local Weight, LoRA, Vector (DiffMean/SFT/RePS)
Curve fitting: R² > 0.95 across most settings
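
For readers unfamiliar with the R² figure quoted above, the following is a minimal sketch of how a coefficient of determination is computed for a fitted curve. The data are synthetic and the straight-line fit is only a stand-in. The note does not state which functional form the paper fits, so nothing here reproduces its curves or results, and the helper name `r_squared` is hypothetical.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Synthetic illustration only: a noisy, roughly linear response to a steering factor m.
rng = np.random.default_rng(0)
m = np.linspace(-1.0, 1.0, 21)
log_odds = 1.5 * m + 0.2 + rng.normal(scale=0.05, size=m.shape)

# Least-squares straight-line fit, used here as a placeholder model.
slope, intercept = np.polyfit(m, log_odds, deg=1)
r2 = r_squared(log_odds, slope * m + intercept)
```

An R² close to 1 means the fitted curve accounts for almost all of the variance in the measured values, while predicting the mean everywhere gives an R² of 0.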
