20260617: currently 914 pages

2026-06-17 15:02:40 +08:00
parent e96b955fda
commit 91fac5b6fc
423 changed files with 20687 additions and 34 deletions
--- a/concepts/btsd-ppo.md
+++ b/concepts/btsd-ppo.md
@@ -0,0 +1,50 @@
+---
+title: "BTSD-PPO"
+created: 2026-06-17
+updated: 2026-06-17
+type: concept
+tags: [reinforcement-learning, algorithm, ppo, action-interface]
+sources: [raw/papers/chen-bellman-taylor-score-2026.md]
+confidence: high
+---
+
+# BTSD-PPO
+
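+BTSD-PPO is the **concrete algorithm instance** that combines the [[bellman-taylor-score-decoding|Bellman-Taylor Score Decoding]] framework with PPO: standard PPO is trained on the latent score MDP, and no gradient is ever taken through the action decoder.
+
+## Algorithm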
+
+```
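+Each iteration:
+  1. Observe state s_t
+  2. Policy π̃ samples a score z_t ~ π̃(·|s_t)
+  3. Decoder: a_t = Γ(s_t, z_t)  # forward pass only, no gradient
+  4. Environment executes a_t and returns r_t, s_{t+1}
+  5. Append (s_t, z_t, r_t) to the trajectory buffer
+  
+  6. PPO update: θ ← θ + η ∇_θ L_PPO(θ)
+     # the gradient involves only log π̃_θ(z|s), never Γ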
+```
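+
+For reference, step 6 can be written out with the standard PPO clipped surrogate, with the score z_t taking the place of the action. This is a sketch of the usual PPO objective, not a formula quoted from the source; the clip range ε and the advantage estimate Â_t are the usual PPO quantities and are assumed here.
+
+```latex
+\rho_t(\theta) = \frac{\tilde{\pi}_\theta(z_t \mid s_t)}{\tilde{\pi}_{\theta_{\text{old}}}(z_t \mid s_t)},
+\qquad
+L_{\text{PPO}}(\theta) = \mathbb{E}_t\!\left[ \min\!\big( \rho_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t \big) \right]
+```
+
+The ratio depends only on the policy's density over scores, and Â_t is computed from logged rewards and a value estimate, so Γ appears nowhere in the objective.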
+
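+## Key properties
+
+- **Gradient-free decoupling**: the decoder is treated as a complete black-box optimizer; the PPO policy gradient never passes through it
+- **Standardized interface**: π̃ outputs a continuous vector z and is a standard Gaussian policy
+- **No architecture changes**: an off-the-shelf PPO implementation can be used directly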
+
+## Difference from differentiable optimization layers
+
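+Differentiable-optimization-layer methods need `∂a/∂z = ∂Γ(s,z)/∂z` for backpropagation, and this derivative is not available for combinatorial or integer optimization problems, where the decoder is not differentiable. BTSD-PPO uses the score-function term `∇_θ log π̃_θ(z|s)` instead, which sidesteps the problem entirely.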
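+
+A short derivation sketch of why the decoder's derivative is never needed. The notation is assumed for illustration and is not taken from the source: P and r denote the original transition kernel and reward, the tilde versions are their counterparts in the latent score MDP, J is the expected return, and Â_t is an advantage estimate.
+
+```latex
+% The decoder is absorbed into the environment of the latent score MDP:
+\tilde{P}(s' \mid s, z) = P\big(s' \mid s, \Gamma(s, z)\big),
+\qquad
+\tilde{r}(s, z) = r\big(s, \Gamma(s, z)\big)
+
+% The policy gradient theorem applied to that MDP:
+\nabla_\theta J(\theta)
+  = \mathbb{E}_{\tau \sim \tilde{\pi}_\theta}\!\left[ \sum_t \nabla_\theta \log \tilde{\pi}_\theta(z_t \mid s_t)\, \hat{A}_t \right]
+```
+
+Under these assumptions Γ enters only through sampled transitions and rewards, so the estimator contains no `∂Γ/∂z` term.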
+
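+## Compatibility with other DRL algorithms
+
+The BTSD framework is not limited to PPO; any continuous-action DRL algorithm can be applied:
+- Actor-critic methods such as SAC and TD3
+- DQN, once the score space has been discretized
+- Experiments indicate that the performance gain comes from the BTSD framework itself, not from a particular optimizer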
+
+## References
+
+- [[bellman-taylor-score-decoding|BTSD]]
+- [[latent-score-mdp|Latent score MDP]]
+- [[action-decoder|Action decoder]]