20260514: Add new content

2026-05-14 13:54:52 +08:00
parent 56c4d3ef7c
commit b116710e4c
294 changed files with 10682 additions and 255 deletions
--- a/concepts/test-time-training-rl.md
+++ b/concepts/test-time-training-rl.md
@@ -0,0 +1,24 @@
+---
+title: Test-Time Training with RL
+created: 2025-04-15
+updated: 2026-05-01
+type: concept
+tags: []
+sources: []
+---
+
+# Test-Time Training with RL
+
+**A lightweight adaptation technique that applies intrinsic URLVR to a small amount of domain-specific data at inference time**.
+
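+## Findings from He et al.
+
+Although intrinsic URLVR has fundamental limitations when training at scale, it is safe and effective on small datasets and in test-time training scenarios: it avoids collapse even when the initial preference is completely wrong.
+
+This makes intrinsic rewards an ideal tool for "fast adaptation at inference time" rather than for "large-scale post-training".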
+
+## Related concepts
+
+- [[unsupervised-rlvr]]: overview of URLVR
+- [[intrinsic-rewards-sharpening]]: underlying mechanism
+- [[he-urlvr-sharpening-2026]]: survey reference