20260514: Add new content
26
concepts/rlvr-unified-framework.md
Normal file
@@ -0,0 +1,26 @@
---
title: RLVR Unified Theoretical Framework
created: 2025-04-15
updated: 2026-05-01
type: concept
tags: []
sources: []
---

# RLVR Unified Theoretical Framework

**A unified mathematical framework for URLVR**, established by He et al. (ICLR 2026), which derives the convergence behavior of all intrinsic methods from the KL-regularized RL objective.
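
For orientation, the KL-regularized RL objective is usually written in the standard form below. This is a sketch in conventional notation, not a transcription from He et al.: the prompt distribution $\mathcal{D}$ and the expectation layout are assumptions, while $r$, $\beta$, $\pi_\theta$ and $\pi_{ref}$ match the symbols used in this note.

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot|x)}\left[r(x,y)\right] - \beta \, D_{KL}\left(\pi_\theta(\cdot|x) \,\|\, \pi_{ref}(\cdot|x)\right)$$

The coefficient $\beta$ controls how far the policy may move away from $\pi_{ref}$: a larger $\beta$ keeps it closer to the reference policy, a smaller $\beta$ lets the reward dominate.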

## Core Conclusion

Regardless of how the reward function is designed, the optimal policy of intrinsic URLVR always takes the same closed form:

$$\pi_\theta^*(y|x) = \frac{1}{Z(x)} \pi_{ref}(y|x) \exp\left(\frac{1}{\beta}r(x,y)\right)$$
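
Here $Z(x)$ is the normalizing constant that makes the distribution over $y$ sum to one. The result shows that every such method is, in essence, "sharpening the initial distribution".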
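
The sharpening effect can be checked numerically. The following is a minimal sketch, not the authors' implementation: the four-answer toy distribution, the function names, and the choice of intrinsic reward (the reference policy's own log-probability) are all assumptions made for illustration.

```python
import numpy as np

def optimal_policy(pi_ref, reward, beta):
    """Closed form: pi*(y|x) = pi_ref(y|x) * exp(r(x,y) / beta) / Z(x)."""
    logits = np.log(pi_ref) + reward / beta
    logits = logits - logits.max()   # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()   # dividing by the sum plays the role of Z(x)

def entropy(p):
    """Shannon entropy in nats; lower means a sharper distribution."""
    return float(-(p * np.log(p)).sum())

# Toy reference policy over four candidate answers y for a single prompt x.
pi_ref = np.array([0.4, 0.3, 0.2, 0.1])

# ASSUMPTION: a hypothetical intrinsic reward, the model's own log-probability.
reward = np.log(pi_ref)

for beta in (4.0, 1.0, 0.25):
    pi_star = optimal_policy(pi_ref, reward, beta)
    print(f"beta={beta}: pi*={np.round(pi_star, 3)}, entropy={entropy(pi_star):.3f}")
```

With this particular reward the optimum is proportional to $\pi_{ref}^{1 + 1/\beta}$, so the most likely answer under $\pi_{ref}$ stays the most likely one and gains mass as $\beta$ shrinks, while the entropy drops.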

## Related Concepts

- [[intrinsic-rewards-sharpening]]: the sharpening mechanism
- [[unsupervised-rlvr]]: overview of the URLVR landscape
- [[he-urlvr-sharpening-2026]]: survey reference