---
title: "Bellman-Taylor Score Decoding (BTSD)"
created: 2026-06-17
updated: 2026-06-17
type: concept
tags: [reinforcement-learning, mdp, action-interface, operations-research]
sources: [raw/papers/chen-bellman-taylor-score-2026.md]
confidence: high
---
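# Bellman-Taylor Score Decoding (BTSD)
BTSD is a framework proposed by [[bellman-taylor-score-decoding|Chen et al. (2026)]]. By **Taylor-expanding the optimal Q-function**, it converts an MDP's action space from a complex constrained space into an unconstrained Euclidean score space.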
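## Core mechanism
```
Original MDP (s, a ∈ A(s), constrained) → Taylor expansion of Q* → score MDP (s, z ∈ R^d)
```
1. **Taylor approximation**: `Q*(s,a) ≈ ψ_s(a) + γ⟨∇G*_s, φ_s(a)⟩ + const`
2. **Action decoder**: `Γ(s,z) = argmax_{a ∈ A(s)} [ψ_s(a) + ⟨z, φ_s(a)⟩]`
3. **Policy learning**: π̃ outputs a score z ∈ R^d, an unconstrained continuous action.
4. **Forward decoding**: the decoder Γ(s,z) maps z to a feasible action a.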
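The decoder in step 2 can be sketched for the simplest case, a finite feasible set that can be enumerated. This is a minimal illustration, not the paper's implementation: the array layout, the helper name `decode`, and the random stand-ins for `ψ_s`, `φ_s` and `∇G*_s` are all assumptions. The final assertion checks that when Q has exactly the Taylor form of step 1, decoding the score `z = γ∇G*_s` recovers the greedy action.

```python
import numpy as np

def decode(psi, phi, z):
    """Return the index of the feasible action maximizing psi[a] + <z, phi[a]>.

    psi: shape (n_actions,), the term psi_s(a) for each feasible action
    phi: shape (n_actions, d), the feature vectors phi_s(a)
    z:   shape (d,), the score emitted by the policy
    """
    return int(np.argmax(psi + phi @ z))

# Hypothetical data for one state s; none of these values come from the paper.
rng = np.random.default_rng(0)
n_actions, d, gamma = 6, 3, 0.9
psi = rng.normal(size=n_actions)
phi = rng.normal(size=(n_actions, d))
grad_G = rng.normal(size=d)   # stand-in for the gradient term in step 1
const = 1.7                   # action-independent constant

# A Q-function that has exactly the Taylor form of step 1.
q_taylor = psi + gamma * (phi @ grad_G) + const

# Decoding the score z = gamma * grad_G picks the same action as argmax Q.
a_decoded = decode(psi, phi, gamma * grad_G)
assert a_decoded == int(np.argmax(q_taylor))
```

The constant drops out of the argmax, which is why the decoder only needs `ψ_s`, `φ_s` and the score. For large or structured action sets, the enumeration would presumably be replaced by a problem-specific solver.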
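## Difference from optimization layers
| Method | Decoder role | Gradient requirement |
|------|----------|---------|
| Differentiable Optimization | Trainable layer | Must backpropagate through the optimizer |
| BTSD | Fixed action-selection map | Forward pass only; no gradients required |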
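One way to picture the "forward pass only" row is to put the decoder inside the environment, so that the learner only ever sees scores going in and rewards coming out. The sketch below assumes a toy single-item inventory problem. The class `ScoreEnv`, the helper functions, and every constant are hypothetical and chosen only for illustration.

```python
import numpy as np

class ScoreEnv:
    """Wraps a constrained environment so the agent acts with scores z in R^d."""

    def __init__(self, base_step, feasible_actions, psi, phi):
        self.base_step = base_step                # (s, a) -> (next_state, reward)
        self.feasible_actions = feasible_actions  # s -> list of feasible actions
        self.psi = psi                            # (s, a) -> float
        self.phi = phi                            # (s, a) -> array of shape (d,)

    def decode(self, s, z):
        # Fixed map: plain argmax over the feasible set, nothing trainable here.
        actions = self.feasible_actions(s)
        values = [self.psi(s, a) + float(np.dot(z, self.phi(s, a))) for a in actions]
        return actions[int(np.argmax(values))]

    def step(self, s, z):
        a = self.decode(s, z)
        next_state, reward = self.base_step(s, a)
        return next_state, reward, a

# Toy inventory problem: state = stock level, action = order quantity.
CAPACITY, DEMAND, UNIT_COST, PRICE = 5, 2, 0.5, 1.0

def feasible_actions(s):
    return list(range(CAPACITY - s + 1))  # cannot order beyond capacity

def psi(s, a):
    return -UNIT_COST * a                 # immediate ordering cost

def phi(s, a):
    return np.array([float(a)])           # one-dimensional feature (d = 1)

def base_step(s, a):
    sold = min(s + a, DEMAND)
    return s + a - sold, PRICE * sold - UNIT_COST * a

env = ScoreEnv(base_step, feasible_actions, psi, phi)
next_state, reward, a = env.step(3, np.array([1.0]))
```

Whatever score the policy emits, `decode` returns an element of the feasible set, so a standard continuous-action DRL algorithm could be trained against `step` without differentiating through the argmax.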
## Performance guarantee
Optimality gap: `J* - J_decode ≤ ε_approx + ε_learn`
- `ε_approx` is controlled by the Taylor remainder
- `ε_learn` is the generalization error of standard DRL
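One possible reading of the two terms is a telescoping split of the gap. The intermediate quantity $\tilde J^*$, taken here to be the best return any score policy can reach through the fixed decoder, is an assumption introduced for illustration and is not defined in the source.

```latex
J^* - J_{\text{decode}}
  = \underbrace{\bigl(J^* - \tilde J^*\bigr)}_{\le\, \varepsilon_{\text{approx}}}
  + \underbrace{\bigl(\tilde J^* - J_{\text{decode}}\bigr)}_{\le\, \varepsilon_{\text{learn}}}
  \le \varepsilon_{\text{approx}} + \varepsilon_{\text{learn}}
```

Under that reading, the first term reflects what is lost by restricting to decoded actions and the second what is lost by imperfect policy learning in the score space.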
## References
- [[latent-score-mdp|Latent score MDP]]
- [[action-decoder|Action decoder]]
- [[taylor-expansion-q-function|Taylor expansion of the Q-function]]
- [[bellman-taylor-score-decoding|BTSD paper]]