SidneyZhang/myWiki

Sidney Zhang b116710e4c

20260514: add new content

2026-05-14 13:54:52 +08:00

title: Unified Theoretical Framework for RLVR
created: 2025-04-15
updated: 2026-05-01
type: concept
tags:
sources:

Unified Theoretical Framework for RLVR

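A unified mathematical framework for URLVR, established by He et al. (ICLR 2026), that derives the convergence behavior of all intrinsic-reward methods from the KL-regularized RL objective.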

Core result

However the reward function is designed, the closed-form optimal policy of intrinsic URLVR always takes the same form:

$$
\pi_\theta^*(y|x) = \frac{1}{Z(x)} \pi_{ref}(y|x) \exp\left(\frac{1}{\beta}r(x,y)\right)
$$
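As a sketch of where this form comes from, the standard KL-regularized objective can be rewritten as a single KL divergence to the tilted distribution. This assumes the usual objective $J(\pi)$ with temperature $\beta > 0$, a finite response set, and a reward $r(x,y)$ treated as fixed with respect to the policy being optimized. The notation $J$ and the exact objective are assumptions here, not taken from the source.

```latex
\begin{align*}
J(\pi) &= \mathbb{E}_{y \sim \pi(\cdot|x)}\big[r(x,y)\big]
          - \beta\,\mathrm{KL}\big(\pi(\cdot|x)\,\|\,\pi_{ref}(\cdot|x)\big) \\
       &= -\beta\,\mathbb{E}_{y \sim \pi(\cdot|x)}\left[
          \log\frac{\pi(y|x)}{\pi_{ref}(y|x)\exp\left(\frac{1}{\beta}r(x,y)\right)}\right] \\
       &= -\beta\,\mathrm{KL}\big(\pi(\cdot|x)\,\|\,\pi_\theta^*(\cdot|x)\big) + \beta\log Z(x),
\qquad Z(x) = \sum_{y}\pi_{ref}(y|x)\exp\left(\tfrac{1}{\beta}r(x,y)\right).
\end{align*}
```

Since $\beta\log Z(x)$ does not depend on $\pi$ and the KL term is non-negative, $J$ is maximized exactly when $\pi = \pi_\theta^*$, under the assumptions stated above.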

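This shows that all of these methods are, at their core, "sharpening the initial distribution": the optimum is $\pi_{ref}$ reweighted by $\exp(r/\beta)$, so probability mass concentrates on the responses the reward favors.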
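The sharpening effect can be checked numerically on a toy example. The sketch below evaluates the closed form over a finite set of four responses. The helper names (`tilted_policy`, `entropy`), the example probabilities, and the choice of reward $r = \log \pi_{ref}$ (a self-confidence style intrinsic reward) are all illustrative assumptions, not the specific rewards analyzed by He et al.

```python
import math

def tilted_policy(pi_ref, rewards, beta):
    """Return pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z over a finite response set.

    Assumes every entry of pi_ref is strictly positive and beta > 0.
    """
    # Work in log space; subtracting the max cancels in the normalization
    # and avoids overflow for small beta.
    logits = [math.log(p) + r / beta for p, r in zip(pi_ref, rewards)]
    shift = max(logits)
    weights = [math.exp(l - shift) for l in logits]
    z = sum(weights)
    return [w / z for w in weights]

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(q * math.log(q) for q in p if q > 0)

# Hypothetical reference policy over four candidate responses.
pi_ref = [0.5, 0.3, 0.15, 0.05]

# Assumption for illustration: the intrinsic reward is the model's own
# log-probability, so pi* is proportional to pi_ref ** (1 + 1 / beta).
rewards = [math.log(p) for p in pi_ref]

pi_star = tilted_policy(pi_ref, rewards, beta=1.0)

print("pi_ref :", pi_ref, "entropy =", entropy(pi_ref))
print("pi_star:", pi_star, "entropy =", entropy(pi_star))
```

With this reward the tilted policy puts more mass on the response that was already most likely and has lower entropy than `pi_ref`. A smaller `beta` sharpens more strongly, while a very large `beta` leaves the policy close to `pi_ref`.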

Related concepts

intrinsic-rewards-sharpening: the sharpening mechanism
unsupervised-rlvr: overview of URLVR
he-urlvr-sharpening-2026: survey reference