20260617: currently 914 pages

2026-06-17 15:02:40 +08:00
parent e96b955fda
commit 91fac5b6fc
423 changed files with 20687 additions and 34 deletions
--- a/concepts/bellman-taylor-score-decoding.md
+++ b/concepts/bellman-taylor-score-decoding.md
@@ -0,0 +1,44 @@
+---
+title: "Bellman-Taylor Score Decoding (BTSD)"
+created: 2026-06-17
+updated: 2026-06-17
+type: concept
+tags: [reinforcement-learning, mdp, action-interface, operations-research]
+sources: [raw/papers/chen-bellman-taylor-score-2026.md]
+confidence: high
+---
+
+# Bellman-Taylor Score Decoding (BTSD)
+
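+BTSD is a framework proposed by [[bellman-taylor-score-decoding|Chen et al. (2026)]] that uses a **Taylor expansion of the optimal Q function** to turn an MDP's complex, constrained action space into an unconstrained Euclidean score space.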
+
+## Core mechanism
+
+```
+Original MDP (s, a ∈ A(s), constrained)  →  Taylor-expand Q*  →  Score MDP (s, z ∈ R^d)
+```
+
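+1. **Taylor approximation**: `Q*(s,a) ≈ ψ_s(a) + γ⟨∇G*_s, φ_s(a)⟩ + const`
+2. **Action decoder**: `Γ(s,z) = argmax_{a ∈ A(s)} [ψ_s(a) + ⟨z, φ_s(a)⟩]`
+3. **Policy learning**: the policy π̃ outputs a score z ∈ R^d (an unconstrained continuous action)
+4. **Forward decoding**: the decoder Γ(s,z) maps z to a feasible action a ∈ A(s)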
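+
+The four steps above can be sketched for one state with a finite feasible set. This is a hypothetical illustration, not the paper's implementation: it assumes A(s) is small enough to enumerate and that ψ_s(a) and φ_s(a) are supplied as precomputed lists (`decode`, `psi`, `phi`, `grad_G` and the toy numbers are illustrative names and data). Setting z = γ∇G*_s makes the decoder objective equal to the Taylor approximation of Q*(s,·) minus the constant, so Γ returns the greedy action under that approximation; any other z is simply another point the policy π̃ may output.
+
+```python
+from typing import Sequence
+
+
+def decode(psi: Sequence[float], phi: Sequence[Sequence[float]], z: Sequence[float]) -> int:
+    """Hypothetical sketch of Γ(s, z) for a fixed state s with feasible actions a_0..a_{n-1}.
+
+    psi[i] = ψ_s(a_i); phi[i] = φ_s(a_i) ∈ R^d; z ∈ R^d is the policy's score.
+    Returns the index i maximizing ψ_s(a_i) + <z, φ_s(a_i)> (first index on ties).
+    """
+    def objective(i: int) -> float:
+        return psi[i] + sum(zk * fk for zk, fk in zip(z, phi[i]))
+
+    # Brute-force argmax over the enumerated feasible set A(s).
+    return max(range(len(psi)), key=objective)
+
+
+# Toy data (hypothetical): three feasible actions, d = 2.
+psi = [1.0, 0.5, 0.0]
+phi = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]
+gamma = 0.9
+grad_G = [0.5, -0.2]  # hypothetical value of ∇G*_s
+
+z_taylor = [gamma * g for g in grad_G]  # z = γ∇G*_s
+print(decode(psi, phi, z_taylor))    # → 1: greedy action under the Taylor surrogate
+print(decode(psi, phi, [0.0, 0.0]))  # → 0: with z = 0 the decoder just maximizes ψ_s
+print(decode(psi, phi, [1.0, 1.0]))  # → 2: a large score on both coordinates favors a_2
+```
+
+For a large or combinatorial A(s), the enumeration would have to be replaced by a solver for the same objective, which is affine in z; the interface Γ(s, z) stays the same.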
+
+## Difference from optimization layers
+
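+| Method | Role of the decoder | Gradient requirement |
+|------|----------|---------|
+| Differentiable optimization | Trainable layer | Requires backpropagation through the optimizer |
+| BTSD | Fixed action-selection map | Forward pass only; no gradients through the decoder |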
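+
+Because BTSD only evaluates the decoder in the forward direction, Γ can be folded into the environment: the learner emits z, a wrapper decodes it to a feasible a and steps the original MDP, and the learner can treat (s, z, r, s') as an ordinary continuous-action transition. This is a minimal sketch under assumptions: `ScoreMDPWrapper` and the toy inventory problem (stock capped at 5, demand fixed at 2, ψ_s(a) = −a as an ordering cost, φ_s(a) = [s + a] as the post-order stock) are hypothetical and not taken from the paper.
+
+```python
+from typing import Callable, List, Sequence, Tuple
+
+
+class ScoreMDPWrapper:
+    """Hypothetical sketch: expose the score MDP (s, z ∈ R^d) on top of an original MDP."""
+
+    def __init__(
+        self,
+        step_fn: Callable[[int, int], Tuple[int, float]],  # original MDP: (s, a) -> (s', r)
+        feasible_fn: Callable[[int], List[int]],           # s -> feasible actions A(s)
+        psi_fn: Callable[[int, int], float],               # (s, a) -> ψ_s(a)
+        phi_fn: Callable[[int, int], List[float]],         # (s, a) -> φ_s(a) ∈ R^d
+    ) -> None:
+        self.step_fn = step_fn
+        self.feasible_fn = feasible_fn
+        self.psi_fn = psi_fn
+        self.phi_fn = phi_fn
+
+    def decode(self, s: int, z: Sequence[float]) -> int:
+        # Γ(s, z): fixed action-selection map, evaluated forward only.
+        def objective(a: int) -> float:
+            return self.psi_fn(s, a) + sum(zk * fk for zk, fk in zip(z, self.phi_fn(s, a)))
+
+        return max(self.feasible_fn(s), key=objective)
+
+    def step(self, s: int, z: Sequence[float]) -> Tuple[int, float, int]:
+        a = self.decode(s, z)  # plain forward call; no gradient is taken through this argmax
+        s_next, r = self.step_fn(s, a)
+        return s_next, r, a
+
+
+# Toy inventory problem (hypothetical): capacity 5, demand 2, price 3, unit cost 1.
+CAP, DEMAND, PRICE, COST = 5, 2, 3.0, 1.0
+
+env = ScoreMDPWrapper(
+    step_fn=lambda s, a: (max(s + a - DEMAND, 0), -COST * a + PRICE * min(s + a, DEMAND)),
+    feasible_fn=lambda s: list(range(CAP - s + 1)),  # constraint: s + a <= CAP
+    psi_fn=lambda s, a: -COST * a,
+    phi_fn=lambda s, a: [float(s + a)],
+)
+
+print(env.decode(1, [0.5]))  # → 0: low score on stock, order nothing
+print(env.decode(1, [2.0]))  # → 4: high score on stock, order up to capacity
+print(env.step(1, [2.0]))    # → (3, 2.0, 4)
+```
+
+Since feasibility is enforced by `feasible_fn` inside Γ, every z ∈ R^d, including extreme values, still yields a feasible order; the policy network never has to encode the constraint itself.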
+
+## Performance guarantee
+
+Optimality gap: `J* − J_decode ≤ ε_approx + ε_learn`, where
+- `ε_approx` is controlled by the Taylor remainder
+- `ε_learn` is the standard DRL generalization error
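+
+The exact form of the paper's bound is not reproduced here. As a hedged illustration of why a small Taylor remainder keeps ε_approx small, consider one state and assume (this structure is an assumption, not stated in the source) that Q*(s,a) = ψ_s(a) + γ G*_s(φ_s(a)) exactly; expanding G*_s to first order around a reference point y0 then gives the Taylor approximation from the core mechanism. If that surrogate Q̂ differs from Q*(s,·) by at most δ on every feasible action, the action â it selects loses at most 2δ of Q* value at that state, since Q*(a*) − Q*(â) = [Q*(a*) − Q̂(a*)] + [Q̂(a*) − Q̂(â)] + [Q̂(â) − Q*(â)] ≤ δ + 0 + δ. The toy check below uses a hypothetical concave G(y) = −‖y‖² and two actions; `taylor_decision_loss` is an illustrative helper.
+
+```python
+from typing import Callable, List, Sequence, Tuple
+
+Vec = List[float]
+
+
+def argmax(xs: Sequence[float]) -> int:
+    return max(range(len(xs)), key=lambda i: xs[i])
+
+
+def taylor_decision_loss(
+    psi: Sequence[float],
+    phi: Sequence[Vec],
+    G: Callable[[Vec], float],
+    grad_G: Callable[[Vec], Vec],
+    y0: Vec,
+    gamma: float,
+) -> Tuple[float, float]:
+    """Return (loss, delta) for one state under the assumed structure Q* = ψ + γ G(φ)."""
+    q_true = [p + gamma * G(f) for p, f in zip(psi, phi)]
+    g0 = grad_G(y0)
+    # First-order Taylor expansion of G around y0 (constant kept so values are comparable).
+    q_hat = [
+        p + gamma * (G(y0) + sum(gk * (fk - yk) for gk, fk, yk in zip(g0, f, y0)))
+        for p, f in zip(psi, phi)
+    ]
+    delta = max(abs(qt - qh) for qt, qh in zip(q_true, q_hat))  # worst per-action surrogate error
+    loss = q_true[argmax(q_true)] - q_true[argmax(q_hat)]        # Q* value lost by trusting q_hat
+    return loss, delta
+
+
+# Hypothetical toy state: G(y) = -||y||^2, actions with φ = 1 and φ = 3, expansion at y0 = 1.
+loss, delta = taylor_decision_loss(
+    psi=[0.0, 4.5],
+    phi=[[1.0], [3.0]],
+    G=lambda y: -sum(v * v for v in y),
+    grad_G=lambda y: [-2.0 * v for v in y],
+    y0=[1.0],
+    gamma=1.0,
+)
+print(loss, delta)        # → 3.5 4.0
+print(loss <= 2 * delta)  # → True
+```
+
+Here the remainder is −(y − y0)², so the linear surrogate overrates the far action (φ = 3) and picks it, losing 3.5 ≤ 2δ = 8; keeping φ_s(a) close to y0 shrinks δ and therefore this loss.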
+
+## References
+
+- [[latent-score-mdp|Latent score MDP]]
+- [[action-decoder|Action decoder]]
+- [[taylor-expansion-q-function|Taylor expansion of the Q function]]
+- [[bellman-taylor-score-decoding|BTSD paper]]