20260617: currently 914 pages
concepts/post-action-configuration.md
---
title: "Post-Action Configuration"
created: 2026-06-17
updated: 2026-06-17
type: concept
tags: [mdp, reinforcement-learning, operations-research]
sources: [raw/papers/chen-bellman-taylor-score-2026.md]
confidence: high
---
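
# Post-Action Configuration

The post-action configuration is the key structural representation in the [[bellman-taylor-score-decoding|BTSD]] framework: it captures the intermediate state of the system **after the action has been executed but before the uncertainty is realized**.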

## Definition

Given a state `s` and a natural action `a`:

```
φ_s(a) ∈ X_s ⊆ R^{d_s}
```

`φ_s(a)` is a deterministic intermediate configuration. The full transition is `s' = Ξ_s(φ_s(a), ξ_s)`, where `ξ_s` is an exogenous disturbance.
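
The two-stage transition can be illustrated with a minimal sketch. The inventory setting, the function names `post_action`, `realize`, and `step`, and the uniform demand on 0..5 are all assumptions made for illustration; none of them comes from the source paper.

```python
import random


def post_action(stock: int, order: int) -> int:
    """phi_s(a): deterministic configuration after ordering, before demand is seen."""
    return stock + order


def realize(config: int, demand: int) -> int:
    """Xi_s(x, xi): apply the exogenous disturbance (demand) to the configuration."""
    return max(config - demand, 0)


def step(stock: int, order: int, rng: random.Random) -> int:
    """Full transition s' = Xi_s(phi_s(a), xi_s), split into its two stages."""
    x = post_action(stock, order)   # stage 1: deterministic, depends on (s, a) only
    xi = rng.randint(0, 5)          # stage 2: disturbance (hypothetical uniform demand)
    return realize(x, xi)
```

All randomness enters through `realize`, so any quantity that depends only on `post_action(stock, order)` can be evaluated before the disturbance is sampled.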

## Why it matters

`φ_s(a)` is the **anchor of the Taylor-expanded Q-function**:

```
Q*(s,a) ≈ ψ_s(a) + γ⟨∇G*_s, φ_s(a) - x_ref⟩ + const
```

The score z has the semantic interpretation `γ ∇G*_s`: the **marginal contribution** of the post-action configuration to downstream value.
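
As a rough numerical check of this first-order approximation, the sketch below compares an exact and a linearized Q-value. It assumes that `ψ_s(a)` is the immediate term, that `Q*(s,a) = ψ_s(a) + γ G*_s(φ_s(a))`, and that `const = γ G*_s(x_ref)`. The quadratic `G` is a hypothetical stand-in for the continuation value, chosen only because its gradient is known in closed form.

```python
GAMMA = 0.9


def G(x):
    """Hypothetical continuation value: concave quadratic penalty on the configuration."""
    return -sum(0.5 * xi * xi for xi in x)


def grad_G(x):
    """Closed-form gradient of the stand-in G."""
    return [-xi for xi in x]


def q_exact(psi, phi):
    """Q = psi + gamma * G(phi), under the assumption stated in the lead-in."""
    return psi + GAMMA * G(phi)


def q_linear(psi, phi, x_ref):
    """First-order Taylor expansion of q_exact around x_ref."""
    z = [GAMMA * g for g in grad_G(x_ref)]   # score z = gamma * grad G at x_ref
    const = GAMMA * G(x_ref)                 # the action-independent constant
    inner = sum(zi * (p - r) for zi, p, r in zip(z, phi, x_ref))
    return psi + inner + const


x_ref = [2.0, 3.0]
phi = [2.1, 2.9]   # a post-action configuration close to x_ref
gap = abs(q_exact(1.0, phi) - q_linear(1.0, phi, x_ref))
```

For this stand-in, `gap` equals the second-order remainder `γ · 0.5 · ‖φ − x_ref‖²`, so the linear form is accurate only for configurations near `x_ref`.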
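
## Instance in queueing networks

- State s: current queue lengths + server availability
- Action a: scheduling/allocation decision
- `φ_s(a)`: queue configuration after the action is executed and before new arrivals or service completions
- Score z: estimate of each queue's post-action marginal value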
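
The mapping above can be sketched for a toy routing problem. This is one plausible reading, not the paper's decoding procedure: it assumes the action routes a single waiting job to one queue, ignores server availability, treats `ψ_s(a)` as identical across actions, and picks the action that maximizes `⟨z, φ_s(a)⟩`. The names `post_action_config` and `decode_action` are hypothetical.

```python
def post_action_config(queues, route_to):
    """phi_s(a): queue lengths after routing one job, before arrivals or services."""
    x = list(queues)
    x[route_to] += 1
    return x


def decode_action(queues, z):
    """Choose the routing whose post-action configuration scores highest under z.

    Assumes the immediate term psi_s(a) is the same for every action, so only
    the linear term <z, phi_s(a)> separates the candidates.
    """
    def linear_value(a):
        return sum(zi * xi for zi, xi in zip(z, post_action_config(queues, a)))

    return max(range(len(queues)), key=linear_value)


# Scores are negative here: one more job in a queue lowers downstream value.
best = decode_action([3, 1, 2], z=[-0.5, -2.0, -1.0])
```

Because every candidate adds exactly one job, the comparison reduces to picking the queue with the largest score component, regardless of the current queue lengths.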
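
## Relationship to the continuation value function

`G*_s(x) = E[V*(Ξ_s(x, ξ_s))]` is the expected downstream return of the post-action configuration x. `φ_s(a)` determines the input to the continuation value, which gives score decoding in the BTSD framework a clear economic meaning.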
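
The expectation can be approximated by Monte Carlo sampling of the disturbance. The sketch below assumes a hypothetical two-queue transition with Bernoulli arrivals and services, and uses `value` as a stand-in for `V*`, which is unknown in practice. The unit-step finite difference is likewise only a proxy for `γ ∇G*_s` on an integer-valued configuration.

```python
import random


def transition(x, xi):
    """Xi_s(x, xi): apply sampled arrivals and service completions to configuration x."""
    arrivals, services = xi
    return [max(q + a - d, 0) for q, a, d in zip(x, arrivals, services)]


def value(s):
    """Stand-in for V* (hypothetical): negative total queue length."""
    return -float(sum(s))


def sample_disturbance(rng, n):
    """Hypothetical disturbance: 0 or 1 arrivals and services per queue."""
    return ([rng.randint(0, 1) for _ in range(n)],
            [rng.randint(0, 1) for _ in range(n)])


def estimate_G(x, n_samples=1000, seed=0):
    """Monte Carlo estimate of G(x) = E[V(Xi(x, xi))]."""
    rng = random.Random(seed)   # fixed seed gives common random numbers across calls
    total = 0.0
    for _ in range(n_samples):
        total += value(transition(x, sample_disturbance(rng, len(x))))
    return total / n_samples


def finite_difference_score(x, gamma=0.9, n_samples=1000, seed=0):
    """Approximate z = gamma * grad G by unit-step forward differences."""
    base = estimate_G(x, n_samples, seed)
    scores = []
    for i in range(len(x)):
        bumped = x[:i] + [x[i] + 1] + x[i + 1:]
        scores.append(gamma * (estimate_G(bumped, n_samples, seed) - base))
    return scores
```

Reusing the same seed for the base and bumped estimates makes the sampling noise cancel in the difference, which is why the score estimate is stable even with few samples.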

## References

- [[continuation-value-function|Continuation value function]]
- [[bellman-taylor-score-decoding|BTSD]]
- [[taylor-expansion-q-function|Taylor expansion of the Q-function]]