SidneyZhang/myWiki

Sidney Zhang b116710e4c

20260514: add new content

2026-05-14 13:54:52 +08:00

title: Test-Time Training with RL
created: 2025-04-15
updated: 2026-05-01
type: concept
tags:
sources:

Test-Time Training with RL

A lightweight adaptation technique that applies intrinsic URLVR to a small amount of domain-specific data at inference time.

Findings of He et al.

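Although intrinsic URLVR has fundamental limitations in large-scale training, it is safe and effective on small datasets and in test-time training scenarios: it avoids collapse even when the model's initial preference is completely wrong.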

This makes intrinsic rewards an ideal tool for "fast adaptation at inference time" rather than for "large-scale post-training".
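To make the idea of an intrinsic reward concrete, here is a minimal sketch of one test-time update. It rests on several assumptions: the note does not say which intrinsic reward is used, so majority voting over sampled answers stands in as one common choice; the "policy" is a toy categorical distribution over a fixed set of candidate answers rather than a language model; and the names `majority_vote_rewards`, `softmax`, and `ttt_step` are hypothetical, introduced only for illustration.

```python
from collections import Counter
import math
import random


def majority_vote_rewards(answers):
    """Return (pseudo_label, rewards).

    The most frequent sampled answer serves as a pseudo-label; each sample
    gets reward 1.0 if it agrees with it and 0.0 otherwise. No ground-truth
    label is involved, so the reward is intrinsic.
    """
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    return pseudo_label, [1.0 if a == pseudo_label else 0.0 for a in answers]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ttt_step(logits, candidates, rng, group_size=16, lr=0.5):
    """One REINFORCE step on a toy categorical policy (an assumed stand-in
    for a language model) using the majority-vote reward above."""
    probs = softmax(logits)
    samples = rng.choices(range(len(candidates)), weights=probs, k=group_size)
    _, rewards = majority_vote_rewards([candidates[i] for i in samples])
    baseline = sum(rewards) / len(rewards)  # group-mean baseline
    new_logits = list(logits)
    for i, r in zip(samples, rewards):
        advantage = r - baseline
        for j in range(len(logits)):
            # d log pi(i) / d logit_j = 1[j == i] - pi(j)
            grad = (1.0 if j == i else 0.0) - probs[j]
            new_logits[j] += lr * advantage * grad / group_size
    return new_logits


if __name__ == "__main__":
    rng = random.Random(0)
    candidates = ["42", "41", "40"]
    logits = [1.0, 0.0, 0.0]  # the policy initially prefers "42"
    for _ in range(50):
        logits = ttt_step(logits, candidates, rng)
    print(dict(zip(candidates, softmax(logits))))
```

Because the pseudo-label comes from the policy's own samples, each step tends to push probability toward whatever answer the policy already favors, whether or not that answer is correct. The sketch shows only this mechanism and makes no claim about the stability results reported by He et al.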

Related concepts

unsupervised-rlvr: overview of URLVR
intrinsic-rewards-sharpening: underlying mechanism
he-urlvr-sharpening-2026: survey reference