# How Far Can Unsupervised RLVR Scale LLM Training?
- **arXiv ID**: 2603.08660
- **Authors**: Bingxiang He, Yuxin Zuo, Zeyuan Liu, et al. (22 authors)
- **Affiliations**: Tsinghua University, Shanghai AI Lab, Xi'an Jiaotong, UIUC, SJTU, Peking University, Frontis.AI
- **Date**: 2026-03-09
- **Venue**: Accepted to ICLR 2026
- **GitHub**: https://github.com/PRIME-RL/TTRL
- **Tags**: #RLVR #unsupervised-learning #LLM-training #reward-hacking #model-collapse
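## Abstract
Unsupervised reinforcement learning with verifiable rewards (URLVR) scales LLM training using reward signals that require no ground-truth labels. This paper establishes a unified theoretical framework showing that all intrinsic reward methods in essence converge to sharpening the model's initial distribution: when initial confidence is aligned with correctness, sharpening amplifies gains; when the two are misaligned, it fails catastrophically. Experiments show that intrinsic rewards consistently follow a rise-then-fall pattern. The paper proposes Model Collapse Step as a practical indicator of the model prior. Finally, it explores external reward methods based on computational asymmetry (self-verification) and presents preliminary evidence that they may break through the confidence-correctness ceiling.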
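
The sharpening claim can be illustrated with a toy simulation. This is a minimal sketch, not the paper's derivation: it assumes that each training step can be approximated by raising the answer probabilities to a power greater than 1 and renormalizing. The function name `sharpen`, the `rate` parameter, and the two example distributions are hypothetical and chosen only for illustration.

```python
def sharpen(probs, steps, rate=0.5):
    """Toy sharpening: repeatedly raise probabilities to (1 + rate) and renormalize.

    Assumption: this power update stands in for intrinsic-reward RL;
    it is not the update rule analyzed in the paper.
    """
    p = list(probs)
    for _ in range(steps):
        powered = [x ** (1.0 + rate) for x in p]
        total = sum(powered)
        p = [x / total for x in powered]
    return p


# In both toy cases, index 0 is the correct answer.
aligned = [0.5, 0.3, 0.2]     # the initial mode is the correct answer
misaligned = [0.3, 0.5, 0.2]  # the initial mode is a wrong answer

for name, init in [("aligned", aligned), ("misaligned", misaligned)]:
    final = sharpen(init, steps=10)
    print(f"{name}: P(correct) {init[0]:.3f} -> {final[0]:.3f}")
```

In this sketch the probability of the correct answer grows when the initial mode is correct and shrinks toward zero when it is not, which mirrors the aligned versus misaligned cases described in the abstract.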
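## Core Contributions
1. **URLVR taxonomy**: Divides methods into two categories, intrinsic rewards and external rewards
2. **Unified theoretical framework**: Proves that all intrinsic methods converge to sharpening the initial distribution
3. **Rise-then-fall pattern**: Systematic experiments across multiple methods confirm a consistent rise-then-fall trajectory
4. **Model Collapse Step**: A measure of the model prior that requires no ground-truth labels and predicts RL trainability
5. **External breakthrough path**: Self-verification shows sustained improvement without a collapse pattern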
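## Structure
- Sec 2: URLVR method taxonomy: certainty-based / ensemble-based
- Sec 3: Theoretical derivation of the sharpening mechanism
- Sec 4: When intrinsic URLVR succeeds or fails
- Sec 5: Safe application in test-time training
- Sec 6: The Model Collapse Step metric
- Sec 7: Breakthrough via external reward methods: self-verification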
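
The two intrinsic-reward families listed under Sec 2 can be sketched as follows. This is an illustrative approximation rather than the paper's or the TTRL repository's implementation: the ensemble-based example uses a majority vote over sampled answers as a pseudo-label, and the certainty-based example uses negative entropy of an answer distribution. The function names `majority_vote_rewards` and `negative_entropy_reward` are hypothetical.

```python
import math
from collections import Counter


def majority_vote_rewards(answers):
    """Ensemble-based sketch: reward 1.0 for answers that match the most common answer.

    Assumption: final answers are already extracted as comparable strings.
    """
    if not answers:
        raise ValueError("answers must be non-empty")
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


def negative_entropy_reward(probs):
    """Certainty-based sketch: higher reward for a more peaked distribution."""
    return sum(p * math.log(p) for p in probs if p > 0.0)


# Four sampled answers to the same prompt; no ground-truth label is used.
label, rewards = majority_vote_rewards(["42", "41", "42", "7"])
print(label, rewards)

# A confident distribution scores higher than a uniform one.
print(negative_entropy_reward([0.9, 0.05, 0.05]) > negative_entropy_reward([1/3, 1/3, 1/3]))
```

Both rewards depend only on the model's own outputs, so they reinforce whatever the model already prefers. That is consistent with the note's point that intrinsic methods sharpen the initial distribution regardless of whether it is correct.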