20260514: Add new content
24
concepts/test-time-training-rl.md
Normal file
@@ -0,0 +1,24 @@
---
title: Test-Time Training with RL
created: 2025-04-15
updated: 2026-05-01
type: concept
tags: []
sources: []
---

# Test-Time Training with RL

**A lightweight adaptation technique that applies intrinsic URLVR (unsupervised RLVR with intrinsic rewards) to a small amount of domain-specific data at inference time.**
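## Findings from He et al.

Although intrinsic URLVR has fundamental limitations in large-scale training, it is safe and effective on small datasets and in test-time training settings: it avoids collapse even when the model's initial preference is entirely wrong.

This makes intrinsic rewards an ideal tool for "fast adaptation at inference time" rather than for "large-scale post-training".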
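
To make "intrinsic reward" concrete, here is a minimal sketch of one possible label-free reward signal: majority voting over a group of sampled answers. This note does not say which intrinsic reward He et al. use, so the choice of majority voting, the function name `majority_vote_rewards`, and the sample data are all assumptions made for illustration. The sketch computes rewards only and does not include a policy update.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Return (pseudo_label, rewards) for a group of sampled answers.

    Hypothetical helper: answers that match the most common answer get
    reward 1.0, all others get 0.0. No ground-truth label is consulted,
    so the reward reflects the model's own current preference.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # most_common(1) returns the answer with the highest count.
    pseudo_label, _ = counts.most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


# Illustrative data: five answers sampled for a single test-time prompt.
samples = ["42", "41", "42", "42", "7"]
label, rewards = majority_vote_rewards(samples)
print(label, rewards)
```

Because the pseudo-label comes from the model's own samples, a reward like this reinforces whatever the model already prefers, which is why it matters whether the initial preference is correct.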
## Related Concepts

- [[unsupervised-rlvr]]: overview of the URLVR landscape
- [[intrinsic-rewards-sharpening]]: the underlying mechanism
- [[he-urlvr-sharpening-2026]]: survey reference
|
||||