20260617: currently 914 pages

2026-06-17 15:02:40 +08:00
parent e96b955fda
commit 91fac5b6fc
423 changed files with 20687 additions and 34 deletions
--- a/concepts/reparameterization-exploration.md
+++ b/concepts/reparameterization-exploration.md
@@ -0,0 +1,53 @@
+---
+title: "Reparameterization Exploration"
+created: 2026-06-17
+updated: 2026-06-17
+type: concept
+tags: [reinforcement-learning, latent-reasoning, exploration]
+sources: [raw/papers/zhang-tarpo-2026.md]
+confidence: high
+---
+
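+# Reparameterization Exploration
+
+Reparameterization exploration is a line of work in [[latent-reasoning|latent reasoning]] RL that addresses the **determinism dilemma of continuous representations** by injecting noise, which makes the continuous representations stochastic.
+
+## Motivation
+
+Continuous latent representations (such as [[soft-token]]) are inherently deterministic: each one is a weighted sum computed from the logits and involves no sampling randomness. This limits policy exploration in RL.
+
+## Main Methods
+
+### Gaussian Noise Injection
+
+Inject Gaussian noise into the compressed latent variables or the continuous token embeddings: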
+
+```
+u_noisy = u + eps,  eps ~ N(0, sigma^2)
+```
+
+Representative work: Soft Tokens (Butt et al., 2025), Latent-GRPO (Deng et al., 2026)
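As a rough illustration of the formula above (not the implementation from any of the cited papers), the noise injection can be sketched in NumPy. The function name `inject_gaussian_noise`, the batch shape, and the value of `sigma` are all assumptions made for this example.

```python
import numpy as np

def inject_gaussian_noise(u, sigma, rng):
    """Return (u + eps, eps) with eps ~ N(0, sigma^2), drawn independently per element."""
    eps = rng.normal(loc=0.0, scale=sigma, size=u.shape)
    return u + eps, eps

rng = np.random.default_rng(0)
u = np.zeros((4, 8))  # hypothetical batch: 4 latent vectors of dimension 8
u_noisy, eps = inject_gaussian_noise(u, sigma=0.1, rng=rng)
```

Returning `eps` alongside the noisy vector is a design choice for the sketch: an RL objective would typically need the sampled noise (or its log-density) to score the perturbed action.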
+
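+### Gumbel-Softmax Reparameterization
+
+Use the [[gumbel-softmax|Gumbel-Softmax trick]] to derive a differentiable, probabilistic soft-token distribution from a categorical distribution:
+
+- Preserves the sampling randomness of discrete tokens
+- Still supports gradient backpropagation
+- Constructs an approximation of discrete sampling under a top-k restriction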
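The three properties above can be sketched as a forward pass in NumPy. This is a minimal sketch under stated assumptions, not the procedure of any specific paper: the function name `gumbel_softmax_topk`, the temperature `tau`, the example logits, and the embedding matrix `E` are hypothetical. NumPy has no automatic differentiation, so obtaining gradients would require porting the same arithmetic to an autodiff framework.

```python
import numpy as np

def gumbel_softmax_topk(logits, tau, k, rng):
    """Sample a relaxed categorical distribution restricted to the top-k logits."""
    logits = np.asarray(logits, dtype=float)
    # Keep only the k largest logits; everything else is masked to -inf.
    topk_idx = np.argpartition(logits, -k)[-k:]
    masked = np.full_like(logits, -np.inf)
    masked[topk_idx] = logits[topk_idx]
    # Gumbel(0, 1) noise via the inverse CDF: g = -log(-log(U)).
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    g = -np.log(-np.log(u))
    # Temperature-scaled softmax over the perturbed logits (numerically stabilized).
    y = (masked + g) / tau
    y = y - y.max()
    p = np.exp(y)
    return p / p.sum()

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5, -1.0, -3.0])  # hypothetical vocabulary of 5 tokens
p = gumbel_softmax_topk(logits, tau=0.5, k=3, rng=rng)

E = rng.normal(size=(5, 4))  # hypothetical embedding matrix (vocab x dim)
soft_token = p @ E           # stochastic soft token: probability-weighted embedding
```

Lower values of `tau` push `p` toward a one-hot sample, while higher values spread mass more evenly across the k retained tokens.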
+
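+## Relationship to TARPO
+
+[[tarpo|TARPO]] takes an **orthogonal strategy**: rather than modifying the continuous representation itself, it introduces **structural exploration**:
+
+- Reparameterization exploration = **representation-level** randomness (noise added inside the continuous vector)
+- TARPO's routing exploration = **structure-level** randomness (sampling between hard and soft modes)
+
+The TARPO paper explicitly names combining the two as a future direction.
+
+## References
+
+- [[gumbel-softmax|Gumbel-Softmax]]
+- [[latent-reasoning|Latent Reasoning]]
+- [[hybrid-reasoning|Hybrid Reasoning]]
+- [[tarpo|TARPO]]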