---
title: "Preference Log-Odds"
created: 2026-06-01
updated: 2026-06-01
type: concept
tags: [steering, evaluation, metrics]
sources: [raw/papers/xu-why-steering-works-2026.md]
---
# Preference Log-Odds
## Definition
Preference Log-Odds (PrefOdds) is a metric introduced by Xu et al. (2026) that quantifies an LLM's intrinsic preference for a target concept on a shared log-odds scale:
$$\text{PrefOdds}(q) = \log \frac{P(p_p | q)}{P(p_n | q)} = L_n - L_p$$
where $(A_p, A_n)$ is a pair of polarity-contrast examples, $L_p = -\log P(A_p|q)$, and $L_n = -\log P(A_n|q)$.
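
The definition can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: it assumes $P(A|q)$ is the product of per-token probabilities, so that per-token log-probabilities add, and the function names and sample numbers are hypothetical.

```python
import math
from typing import Sequence


def sequence_nll(token_logprobs: Sequence[float]) -> float:
    """L = -log P(A | q), assuming P(A | q) is the product of per-token
    probabilities (an assumption of this sketch)."""
    return -sum(token_logprobs)


def pref_odds(logprobs_pos: Sequence[float], logprobs_neg: Sequence[float]) -> float:
    """PrefOdds(q) = L_n - L_p for a polarity-contrast pair (A_p, A_n)."""
    l_p = sequence_nll(logprobs_pos)
    l_n = sequence_nll(logprobs_neg)
    return l_n - l_p


# Hypothetical per-token log-probabilities for A_p and A_n under the same q.
lp_pos = [math.log(0.5), math.log(0.8)]  # P(A_p | q) = 0.5 * 0.8
lp_neg = [math.log(0.5), math.log(0.2)]  # P(A_n | q) = 0.5 * 0.2
score = pref_odds(lp_pos, lp_neg)        # positive: the model prefers A_p
print(f"PrefOdds = {score:.4f}")
```

A positive score means the model assigns more probability to $A_p$ than to $A_n$; swapping the two examples flips the sign.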
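## Key Properties
1. **Utility-independent**: the shared utility $P(u|q)$ cancels in the likelihood ratio, so PrefOdds measures preference only.
2. **Relation to the intervention multiplier**: under the activation-manifold framework, $\log\frac{P(p_p)}{1-P(p_p)} = (\alpha_p m + \beta_p)D_p(m) + b_p$, where $m$ is the intervention multiplier and $D_p(m)$ is the validity decay.
3. **Fit quality**: the RQ decay model fits with $R^2 > 0.95$.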
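
Property 1 can be spelled out in one line. The step below assumes the factorization $P(A_p|q) = P(u|q)\,P(p_p|q)$ and $P(A_n|q) = P(u|q)\,P(p_n|q)$; this is inferred from the statement that the shared utility cancels, not quoted from the paper.

$$\text{PrefOdds}(q) = L_n - L_p = \log\frac{P(A_p|q)}{P(A_n|q)} = \log\frac{P(u|q)\,P(p_p|q)}{P(u|q)\,P(p_n|q)} = \log\frac{P(p_p|q)}{P(p_n|q)}$$

If in addition $P(p_n|q) = 1 - P(p_p|q)$, the right-hand side is the logit $\log\frac{P(p_p)}{1-P(p_p)}$ that appears in property 2.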
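## Three-Stage Response
When PrefOdds is plotted against $m$:
- Linear regime: the $\alpha_p m$ term dominates
- Transition regime: $D_p(m)$ begins to fall
- Convergence regime: $D_p(m)$ decays to a very low value and PrefOdds levels off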
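
The three regimes can be reproduced with a small numerical sketch of $(\alpha_p m + \beta_p)D_p(m) + b_p$. This note does not give the functional form of $D_p(m)$; the code assumes a rational-quadratic decay purely for illustration, and all parameter values (`alpha`, `beta`, `bias`, `scale`, `shape`) are hypothetical rather than fitted.

```python
def rq_decay(m: float, scale: float = 4.0, shape: float = 1.0) -> float:
    """ASSUMED form of D_p(m): a rational-quadratic decay that equals 1 at
    m = 0 and falls towards 0 as |m| grows. Not taken from the source."""
    return (1.0 + (m * m) / (2.0 * shape * scale * scale)) ** (-shape)


def pref_log_odds_curve(m: float, alpha: float = 1.0, beta: float = 0.0,
                        bias: float = 0.0, scale: float = 4.0,
                        shape: float = 1.0) -> float:
    """(alpha_p * m + beta_p) * D_p(m) + b_p with the assumed decay above."""
    return (alpha * m + beta) * rq_decay(m, scale, shape) + bias


# Sweep the multiplier: near-linear growth for small m, a bend once the decay
# factor starts to fall, then a flattening as the decay factor becomes small.
for m in (0.5, 2.0, 6.0, 20.0, 100.0):
    print(f"m={m:6.1f}  D={rq_decay(m):.3f}  PrefOdds={pref_log_odds_curve(m):.3f}")
```

With these defaults the curve tracks $\alpha_p m$ while $D_p(m) \approx 1$ and drifts back towards $b_p$ once the decay dominates; a different decay form or shape parameter would change where it settles.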
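## Comparison: PrefOdds vs UtilOdds
| Property | PrefOdds | UtilOdds |
|------|----------|----------|
| Formula | $L_n - L_p$ | $\log\frac{e^{-L_p}+e^{-L_n}}{1-e^{-L_p}-e^{-L_n}}$ |
| Meaning | Preference for the target concept | Task coherence |
| Projection onto the steering direction | $\alpha_p m$ | ≈ 0 ($\omega_u \perp \Delta h$) |
| Decay dependence | Projection × decay | Decay only |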
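
Both formulas in the table take the same two losses as input, which the following sketch makes concrete. The function names and the sample probabilities are hypothetical; the only inputs are $L_p$ and $L_n$ as defined above.

```python
import math


def pref_odds_from_losses(l_p: float, l_n: float) -> float:
    """PrefOdds = L_n - L_p."""
    return l_n - l_p


def util_odds_from_losses(l_p: float, l_n: float) -> float:
    """UtilOdds = log((e^-L_p + e^-L_n) / (1 - e^-L_p - e^-L_n))."""
    joint = math.exp(-l_p) + math.exp(-l_n)  # P(A_p | q) + P(A_n | q)
    if not 0.0 < joint < 1.0:
        raise ValueError("P(A_p|q) + P(A_n|q) must lie strictly between 0 and 1")
    return math.log(joint / (1.0 - joint))


# Hypothetical example: P(A_p | q) = 0.6 and P(A_n | q) = 0.2.
l_p, l_n = -math.log(0.6), -math.log(0.2)
pref = pref_odds_from_losses(l_p, l_n)   # log(0.6 / 0.2)
util = util_odds_from_losses(l_p, l_n)   # log(0.8 / 0.2)
print(f"PrefOdds = {pref:.4f}, UtilOdds = {util:.4f}")
```

Scaling both probabilities by the same factor leaves PrefOdds unchanged and moves only UtilOdds, which is the separation the table describes.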
## Related Concepts
- [[preference-utility-analysis]]: the measurement framework
- [[intervention-multiplier]]: the control variable $m$
- [[validity-decay]]: the decay term $D(m)$
- [[steering-dynamics]]: the three-stage behavior of PrefOdds
- [[xu-why-steering-works]]: the source paper