---
title: Unsupervised RL with Verifiable Rewards (URLVR)
created: 2025-04-15
updated: 2026-05-01
type: concept
tags: []
sources: []
---
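# Unsupervised RL with Verifiable Rewards (URLVR)
**Unsupervised RL with Verifiable Rewards**: a reinforcement learning paradigm that needs no ground-truth labels and uses proxy reward signals to scale LLM post-training.
## Definition
URLVR extends standard RLVR. Standard RLVR (e.g., DeepSeek-R1) relies on verifiable ground truth (whether a math answer is correct, whether code passes its tests), whereas URLVR derives the reward signal from the model itself or from unlabeled data.
### Formulation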
$$\max_{\pi_\theta} \mathbb{E}_{y \sim \pi_\theta(\cdot|x)} [r(x, y)] - \beta D_{KL}[\pi_\theta \| \pi_{ref}]$$
The key difference lies in where the reward $r(x,y)$ comes from.
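
As a rough illustration of the objective above, the sketch below evaluates it for a single prompt with a small, discrete set of candidate responses. The function name `kl_regularized_objective` and the list-based representation are assumptions made for this example; a real post-training setup would estimate both terms from sampled sequences rather than enumerate responses.

```python
import math


def kl_regularized_objective(policy, reference, rewards, beta):
    """Expected reward minus beta * KL(policy || reference) for one prompt.

    `policy` and `reference` are probability lists over the same candidate
    responses; `rewards` holds r(x, y) for each candidate. All names here
    are hypothetical and chosen only for illustration.
    """
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    # Terms with p == 0 contribute nothing to the KL divergence.
    kl = sum(p * math.log(p / q) for p, q in zip(policy, reference) if p > 0)
    return expected_reward - beta * kl


# When the policy equals the reference, the KL penalty vanishes.
same = kl_regularized_objective([0.5, 0.5], [0.5, 0.5], [1.0, 0.0], beta=0.1)
# Moving all mass onto the rewarded response raises the expected reward
# but pays a KL cost of log(2) relative to the uniform reference.
peaked = kl_regularized_objective([1.0, 0.0], [0.5, 0.5], [1.0, 0.0], beta=0.1)
```

The same function applies whether `rewards` comes from ground truth (RLVR) or from a proxy signal (URLVR); only the source of the numbers changes.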
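## Taxonomy (He et al. 2026)
| Category | Reward source | Representative methods |
|------|---------|---------|
| [[certainty-based-rewards|Certainty-based rewards]] | Policy confidence (logits/entropy) | EM-RL, RENT, RLIF, RLSC |
| [[ensemble-based-rewards|Ensemble-based rewards]] | Multi-sample agreement (majority voting) | TTRL, SRT, SeRL, R-Zero |
| [[self-verification-rewards|External rewards]] | Generation-verification asymmetry | Self-verification, Co-Reward |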
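
To make the first two rows of the table more concrete, here is a minimal sketch of one proxy reward per category: a majority-vote reward over sampled answers and a negative-entropy reward over token distributions. Both functions are simplified assumptions written for this note and are not the exact reward used by any of the methods listed; the names `majority_vote_rewards` and `negative_entropy_reward` are hypothetical.

```python
import math
from collections import Counter


def majority_vote_rewards(answers):
    """Ensemble-style proxy: 1.0 for samples that match the most common answer.

    The most common answer acts as a pseudo-label in place of ground truth.
    """
    if not answers:
        return []
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    return [1.0 if answer == pseudo_label else 0.0 for answer in answers]


def negative_entropy_reward(token_distributions):
    """Certainty-style proxy: mean negative Shannon entropy per token.

    Each element of `token_distributions` is a probability list over the
    vocabulary at one position. Lower entropy (higher confidence) gives a
    higher reward; the maximum is 0 for fully deterministic predictions.
    """
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0)
        for dist in token_distributions
    ]
    return -sum(entropies) / len(entropies)


vote_rewards = majority_vote_rewards(["42", "42", "17"])
confident = negative_entropy_reward([[1.0, 0.0]])
uncertain = negative_entropy_reward([[0.5, 0.5]])
```

Neither function checks correctness: both reward whatever the policy already favors, which is why the source of $r(x,y)$ matters so much for these methods.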
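## Key finding
He et al. (2026) show that **all intrinsic URLVR methods converge on [[intrinsic-rewards-sharpening|sharpening the initial distribution]]**. This is both their strength (when confidence and correctness are aligned) and their fundamental limitation (catastrophic failure when the two are misaligned).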
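
The two regimes can be illustrated with a toy model. The sketch below assumes, purely for illustration, that sharpening can be modeled by raising each answer's probability to a power greater than 1 and renormalizing; this specific form is an assumption of this example and is not taken from the source.

```python
def sharpen(dist, power=2.0):
    """Raise each probability to `power` (> 1) and renormalize.

    This concentrates mass on whichever answer is already most likely,
    regardless of whether that answer is correct.
    """
    weights = {answer: p ** power for answer, p in dist.items()}
    total = sum(weights.values())
    return {answer: w / total for answer, w in weights.items()}


def iterate_sharpening(dist, steps, power=2.0):
    """Apply `sharpen` repeatedly, mimicking successive training rounds."""
    for _ in range(steps):
        dist = sharpen(dist, power)
    return dist


# Aligned case: the initially preferred answer is the correct one.
aligned = iterate_sharpening({"correct": 0.6, "wrong": 0.4}, steps=3)
# Misaligned case: the initially preferred answer is wrong.
misaligned = iterate_sharpening({"correct": 0.4, "wrong": 0.6}, steps=3)
```

In the aligned case the probability of the correct answer rises toward 1, while in the misaligned case the same update drives it toward 0, mirroring the strength and the failure mode described above.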
## Related concepts
- [[intrinsic-rewards-sharpening]]: the sharpening mechanism
- [[model-collapse-step]]: collapse metric
- [[reward-hacking-llm]]: reward hacking
- [[he-urlvr-sharpening-2026]]: survey reference