20260514: Add new content

2026-05-14 13:54:52 +08:00
parent 56c4d3ef7c
commit b116710e4c
294 changed files with 10682 additions and 255 deletions
--- a/concepts/unsupervised-rlvr.md
+++ b/concepts/unsupervised-rlvr.md
@@ -0,0 +1,41 @@
+---
+title: Unsupervised RL with Verifiable Rewards (URLVR)
+created: 2025-04-15
+updated: 2026-05-01
+type: concept
+tags: []
+sources: []
+---
+
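+# Unsupervised RL with Verifiable Rewards (URLVR)
+
+**Unsupervised RL with Verifiable Rewards (URLVR)**: a reinforcement learning paradigm that needs no ground-truth labels and uses proxy reward signals to scale LLM post-training.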
+
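+## Definition
+
+URLVR extends standard RLVR. Standard RLVR (e.g., DeepSeek-R1) relies on a verifiable ground truth (whether a math answer is correct, whether code passes its tests), whereas URLVR derives its reward signal from the model itself or from unlabeled data.
+
+### Formulation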
+
+$$\max_{\pi_\theta} \mathbb{E}_{y \sim \pi_\theta(\cdot|x)} [r(x, y)] - \beta D_{KL}[\pi_\theta \| \pi_{ref}]$$
+
+The key difference lies in where $r(x, y)$ comes from.
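+
+As a rough illustration of that difference (the symbols $a(y)$ for the final answer extracted from $y$, $a^*(x)$ for the labeled answer, and $\hat{r}$ for a generic proxy are introduced here and are not taken from the source), a standard RLVR reward checks the answer against a label, while a URLVR reward may only look at the policy itself or at unlabeled data $\mathcal{D}_u$:
+
+$$r_{\text{RLVR}}(x, y) = \mathbb{1}[a(y) = a^*(x)], \qquad r_{\text{URLVR}}(x, y) = \hat{r}(x, y;\, \pi_\theta, \mathcal{D}_u)$$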
+
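+## Taxonomy (He et al. 2026)
+
+| Category | Reward source | Representative methods |
+|------|---------|---------|
+| [[certainty-based-rewards\|Certainty-based rewards]] | Policy confidence (logits/entropy) | EM-RL, RENT, RLIF, RLSC |
+| [[ensemble-based-rewards\|Ensemble-based rewards]] | Agreement across multiple samples (majority voting) | TTRL, SRT, SeRL, R-Zero |
+| [[self-verification-rewards\|External rewards]] | Generation-verification asymmetry | Self-verification, Co-Reward |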
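+
+To make the first two rows concrete, here is a simplified sketch of what each reward can look like. These forms are assumptions in the spirit of entropy-minimization and majority-vote methods; the individual methods in the table differ in their details. Here $H$ is the Shannon entropy of the next-token distribution, $a(y)$ is the final answer extracted from $y$, and $y_1, \dots, y_N \sim \pi_\theta(\cdot|x)$ are $N$ samples for the same prompt:
+
+$$r_{\text{cert}}(x, y) = -\frac{1}{|y|} \sum_{t=1}^{|y|} H\big(\pi_\theta(\cdot|x, y_{<t})\big)$$
+
+$$\hat{a}(x) = \arg\max_{a} \sum_{i=1}^{N} \mathbb{1}[a(y_i) = a], \qquad r_{\text{ens}}(x, y_i) = \mathbb{1}[a(y_i) = \hat{a}(x)]$$
+
+Neither reward uses a label: the first rewards low-entropy (confident) generations, the second rewards agreement with the majority answer.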
+
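+## Key finding
+
+He et al. (2026) show that **all intrinsic URLVR methods converge to [[intrinsic-rewards-sharpening|sharpening the model's initial distribution]]**. This is their strength when confidence and correctness are aligned, and their fundamental limitation when they are misaligned, where training fails catastrophically.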
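+
+A one-step illustration of why a confidence-style reward sharpens the starting distribution (a simplified sketch under stated assumptions, not the argument of He et al.): for a fixed prompt $x$ and an unrestricted policy class, the KL-regularized objective from the formulation section has the closed-form maximizer
+
+$$\pi^*(y|x) = \frac{1}{Z(x)}\, \pi_{ref}(y|x) \exp\big(r(x, y)/\beta\big), \qquad Z(x) = \sum_{y'} \pi_{ref}(y'|x) \exp\big(r(x, y')/\beta\big)$$
+
+If we assume, purely for illustration, that the reward is the reference model's own log-likelihood, $r(x, y) = \log \pi_{ref}(y|x)$, this becomes
+
+$$\pi^*(y|x) \propto \pi_{ref}(y|x)^{1 + 1/\beta}$$
+
+Since the exponent exceeds 1, probability mass concentrates on outputs that were already likely. That helps only if those outputs are also the correct ones.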
+
+## Related concepts
+
+- [[intrinsic-rewards-sharpening]]: the sharpening mechanism
+- [[model-collapse-step]]: a metric for model collapse
+- [[reward-hacking-llm]]: reward hacking
+- [[he-urlvr-sharpening-2026]]: survey reference