20260514: Add new content

2026-05-14 13:54:52 +08:00
parent 56c4d3ef7c
commit b116710e4c
294 changed files with 10682 additions and 255 deletions
--- a/concepts/certainty-based-rewards.md
+++ b/concepts/certainty-based-rewards.md
@@ -0,0 +1,42 @@
+---
+title: Certainty-Based Rewards
+created: 2025-04-15
+updated: 2026-05-01
+type: concept
+tags: []
+sources: []
+---
+
+# Certainty-Based Rewards
+
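+**One of the intrinsic-reward paradigms in URLVR.** The reward is derived from the policy's own confidence (its logits or output probability distribution), under the assumption that higher confidence means the answer is more likely to be correct.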
+
+## Representative Methods
+
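+| Method | Reward function | Core idea |
+|--------|-----------------|-----------|
+| EM-RL | Trajectory-level mean log-probability | Encourages low-entropy (high-confidence) trajectories |
+| RENT | Sequence-level entropy minimization | Same idea, with a different normalization |
+| RLIF | Self-certainty (KL divergence) | Encourages the output distribution to move away from uniform |
+| RLSC | Probabilistic self-consistency | Self-consistency of high-probability samples |
+| RLSF | Probability difference | Contrasts probabilities across samples |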
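+
+The first three rows of the table can be written out roughly as follows. This is a sketch under assumed notation, not the exact definitions from the respective papers: $x$ is the prompt, $y = (y_1, \dots, y_T)$ a sampled response, $\pi_\theta$ the policy, $V$ the vocabulary, $p_t = \pi_\theta(\cdot \mid x, y_{<t})$ the next-token distribution at step $t$, and $U$ the uniform distribution over $V$. The per-method normalizations in the original work may differ.
+
+$$
+\begin{aligned}
+r_{\text{logprob}}(x, y) &= \frac{1}{T} \sum_{t=1}^{T} \log \pi_\theta(y_t \mid x, y_{<t}) \\
+r_{\text{entropy}}(x, y) &= -\frac{1}{T} \sum_{t=1}^{T} H(p_t), \qquad H(p_t) = -\sum_{v \in V} p_t(v) \log p_t(v) \\
+r_{\text{self-cert}}(x, y) &= \frac{1}{T} \sum_{t=1}^{T} \mathrm{KL}(U \,\|\, p_t) = -\frac{1}{T\,|V|} \sum_{t=1}^{T} \sum_{v \in V} \log\bigl(|V|\, p_t(v)\bigr)
+\end{aligned}
+$$
+
+All three grow as the next-token distributions become more peaked, and none of them refers to a ground-truth answer.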
+
+## Theoretical Limitations
+
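+[[intrinsic-rewards-sharpening|Sharpening theory]] exposes the fundamental problem with certainty-based rewards: confidence is an internal state of the model. It reflects only what the model believes is correct, not what is objectively correct. When the model is confident but wrong, a certainty-based reward actively reinforces the error.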
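+
+A hypothetical two-token example makes the failure mode concrete. The numbers are illustrative assumptions, not taken from the source. Under a mean log-probability reward, a wrong answer whose tokens each have probability 0.9 outscores a correct answer whose tokens each have probability 0.5:
+
+$$
+r_{\text{wrong}} = \tfrac{1}{2}\,(\ln 0.9 + \ln 0.9) \approx -0.105,
+\qquad
+r_{\text{correct}} = \tfrac{1}{2}\,(\ln 0.5 + \ln 0.5) \approx -0.693
+$$
+
+Because $r_{\text{wrong}} > r_{\text{correct}}$, an update that maximizes this reward would shift probability mass toward the wrong answer.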
+
+## Comparison with Ensemble-Based Rewards
+
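+| Certainty-based rewards | [[ensemble-based-rewards\|Ensemble-based rewards]] |
+|-------------------------|------|
+| Single forward pass | Requires multiple samples |
+| Low compute cost | High compute cost |
+| Relies entirely on the model's internal state | Cross-checks across multiple samples |
+| Subject to the Sharpening limitation | Subject to the same Sharpening limitation |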
+
+## Related Concepts
+
+- [[ensemble-based-rewards]]: the other intrinsic-reward paradigm
+- [[intrinsic-rewards-sharpening]]: the unifying theory
+- [[unsupervised-rlvr]]: overview of URLVR
+- [[he-urlvr-sharpening-2026]]: survey reference