20260514: Add new content
41
concepts/unsupervised-rlvr.md
Normal file
@@ -0,0 +1,41 @@
---
title: Unsupervised Reinforcement Learning with Verifiable Rewards (URLVR)
created: 2025-04-15
updated: 2026-05-01
type: concept
tags: []
sources: []
---
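
# Unsupervised Reinforcement Learning with Verifiable Rewards (URLVR)

**Unsupervised RL with Verifiable Rewards**: a reinforcement learning paradigm that requires no ground-truth labels and uses proxy reward signals to scale LLM post-training.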
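
## Definition

URLVR extends standard RLVR. Standard RLVR (e.g., DeepSeek-R1) relies on verifiable ground truth (whether a math answer is correct, whether code passes its tests), whereas URLVR derives its reward signal from the model itself or from unlabeled data.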

### Formulation

$$\max_{\pi_\theta} \mathbb{E}_{y \sim \pi_\theta(\cdot|x)} [r(x, y)] - \beta D_{KL}[\pi_\theta \| \pi_{ref}]$$

The key difference from standard RLVR lies in the source of $r(x, y)$.
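
To make the objective concrete, the following is a minimal sketch that evaluates it for a single prompt with a small discrete set of candidate responses. The function names, the three-response toy distributions, and the reward values are assumptions made for illustration; they do not come from the source.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def regularized_objective(policy, ref_policy, rewards, beta):
    """E_{y ~ policy}[r(x, y)] - beta * KL(policy || ref_policy) for one prompt x."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, ref_policy)

# Hypothetical toy setup: three candidate responses for one prompt.
ref_policy = [0.5, 0.3, 0.2]   # pi_ref(y | x)
policy = [0.7, 0.2, 0.1]       # pi_theta(y | x)
rewards = [1.0, 0.0, 0.0]      # r(x, y); in URLVR this would be a proxy signal

value = regularized_objective(policy, ref_policy, rewards, beta=0.1)
print(round(value, 4))
```

Moving probability mass toward the rewarded response raises the expected reward, while the KL term subtracts a penalty that grows as the policy drifts from the reference.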

## Taxonomy (He et al. 2026)

| Category | Reward source | Representative methods |
|------|---------|---------|
| [[certainty-based-rewards|Certainty-based rewards]] | Policy confidence (logits/entropy) | EM-RL, RENT, RLIF, RLSC |
| [[ensemble-based-rewards|Ensemble-based rewards]] | Multi-sample agreement (majority voting) | TTRL, SRT, SeRL, R-Zero |
| [[self-verification-rewards|External rewards]] | Generation-verification asymmetry | Self-verification, Co-Reward |
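
The first two reward sources in the table can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the exact reward used by any of the listed methods: `certainty_reward` assumes the reward is the negative mean token entropy, and `majority_vote_rewards` assumes a binary reward for agreeing with the most common sampled answer. Both function names are hypothetical.

```python
import math
from collections import Counter

def certainty_reward(token_distributions):
    """Negative mean token entropy: higher when the policy is more confident."""
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0)
        for dist in token_distributions
    ]
    return -sum(entropies) / len(entropies)

def majority_vote_rewards(answers):
    """Reward 1.0 for samples that match the most common answer, else 0.0."""
    # Ties are broken by first occurrence, which is how Counter.most_common orders equal counts.
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority_answer else 0.0 for a in answers]

# Hypothetical per-token distributions over a two-token vocabulary.
confident = [[0.9, 0.1], [0.8, 0.2]]
uncertain = [[0.5, 0.5], [0.5, 0.5]]
print(certainty_reward(confident) > certainty_reward(uncertain))  # True

# Hypothetical final answers extracted from four sampled responses.
sampled_answers = ["42", "42", "17", "42"]
vote_rewards = majority_vote_rewards(sampled_answers)
print(vote_rewards)  # [1.0, 1.0, 0.0, 1.0]
```

Neither function consults a ground-truth label: both rewards are computed from the policy's own outputs.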
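
## Key Findings

He et al. (2026) show that **all intrinsic URLVR methods converge to [[intrinsic-rewards-sharpening|sharpening the initial distribution]]**. This is both their strength (when confidence and correctness are aligned) and their fundamental limitation (catastrophic failure when they are misaligned).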
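
A toy illustration of why sharpening cuts both ways is given below. It assumes sharpening can be modeled as repeatedly raising the answer distribution to a power and renormalizing; this modeling choice, the `sharpen` helper, and the numbers are assumptions for illustration only and are not the derivation in He et al. (2026).

```python
def sharpen(probs, temperature):
    """Raise probabilities to 1/temperature and renormalize (temperature < 1 sharpens)."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

# Hypothetical initial distribution over three candidate answers.
initial = [0.6, 0.3, 0.1]

sharpened = initial
for _ in range(5):
    sharpened = sharpen(sharpened, temperature=0.5)

# Aligned case: the correct answer is the initial mode (index 0).
print(initial[0], "->", sharpened[0])
# Misaligned case: the correct answer is index 1, which is not the mode.
print(initial[1], "->", sharpened[1])
```

In the aligned case the probability of the correct answer approaches 1, while in the misaligned case it is driven toward 0: the same update amplifies whatever the initial distribution already preferred.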

## Related Concepts

- [[intrinsic-rewards-sharpening]]: the sharpening mechanism
- [[model-collapse-step]]: collapse metric
- [[reward-hacking-llm]]: reward hacking
- [[he-urlvr-sharpening-2026]]: survey reference