---
title: "Model Steering"
created: 2026-06-01
updated: 2026-06-01
type: concept
tags: [steering, controllability, llm]
sources: [raw/papers/xu-why-steering-works-2026.md]
---
# Model Steering
> Placeholder page covering the broader literature on LLM steering.
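## Overview
Model steering refers broadly to techniques that control LLM behavior at inference time by modifying the model's internal representations or parameters, including but not limited to:
- **Activation steering** ([[activation-steering]]): adding a direction vector to the hidden states
- **Parameter intervention**: local weight fine-tuning and LoRA ([[lora]]) adaptation
- **Inference-time alignment**: control through the system prompt or context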
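
To make the activation-steering item concrete, here is a minimal pure-Python sketch of adding a scaled direction vector to a hidden state. The function names `steer_hidden_state` and `affine`, the toy weights, and the choice to unit-normalise the direction are assumptions made for illustration; they are not taken from Xu et al. (2026) or from any particular library.

```python
import math


def steer_hidden_state(hidden, direction, strength):
    """Add `strength` times the unit-normalised `direction` to `hidden`.

    Hypothetical helper: normalising the direction is an illustrative
    choice, not something prescribed by the source.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [h + strength * d / norm for h, d in zip(hidden, direction)]


def affine(weights, bias, x):
    """Compute W x + b, with W given as a list of rows."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b_i
        for row, b_i in zip(weights, bias)
    ]


# Toy values, chosen only for illustration.
W = [[1.0, 0.0], [0.0, 2.0]]
b = [0.5, -0.5]
x = [1.0, 1.0]
direction = [3.0, 4.0]  # Euclidean norm is 5

hidden = affine(W, b, x)
steered = steer_hidden_state(hidden, direction, strength=5.0)
print(hidden, steered)
```

Setting `strength` to zero leaves the hidden state unchanged, and larger values push it further along the chosen direction.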
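## Unified View
The [[dynamic-weight-updates]] framework of Xu et al. (2026) unifies all of these methods as dynamic weight updates, revealing that they share a common [[preference-utility-analysis|preference-utility trade-off]] pattern.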
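## Core Challenges
- **Preference-utility trade-off**: stronger control yields higher preference but lower utility
- **Direction selection**: how to find the optimal $\Delta W$ / $\Delta b$
- **Strength tuning**: the best value of $m$ depends on the specific task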
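
As a reading aid, the symbols in this list can be placed in a single affine-layer update. This is a sketch under the assumption that the intervention acts on one linear layer with weights $W$, bias $b$, and input $h$; the exact formulation in Xu et al. (2026) may differ.

$$
h' = (W + m\,\Delta W)\,h + (b + m\,\Delta b)
$$

Under this reading, activation steering corresponds to $\Delta W = 0$ with $\Delta b$ set to the steering vector, while a LoRA-style update corresponds to a low-rank $\Delta W = BA$ (here $B$ and $A$ are the conventional LoRA factor names, not symbols from the source). The scalar $m$ plays the role of the steering strength.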
## Related Concepts
- [[activation-steering]]
- [[lora]]
- [[dynamic-weight-updates]]
- [[steering-vector]]
- [[split-steering]]
- [[controlled-text-generation]]