20260617: currently 914 pages
48
concepts/epoch-based-optimistic-mle.md
Normal file
@@ -0,0 +1,48 @@
---
title: "Epoch-based Optimistic MLE"
created: 2026-06-10
updated: 2026-06-10
type: concept
tags: ["rl-algorithms", "optimism", "online-learning", "exploration"]
sources: ["[[minimax-policy-regret-pomg]]"]
---
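
# Epoch-based Optimistic MLE

**Epoch-based Optimistic MLE** is the policy-regret minimization algorithm for POMGs proposed by [[minimax-policy-regret-pomg|Arora (2026)]]. Its core idea is to keep the transfer cost under control by switching policies only a very small number of times.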

## Algorithm Structure

```
for e = 0, 1, 2, ...:
    T_e = 2^e   // epoch length grows geometrically
    build the MLE confidence set C_e from the cumulative data
    choose the optimistic policy pi_e = argmax_{pi, xi in C_e} V^{pi, xi}
    run pi_e for the whole epoch (T_e episodes)
```
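
The loop above can be made concrete with a toy stand-in. The sketch below is not the POMG algorithm from the source. It assumes a two-armed Bernoulli bandit in place of the game, a small finite model class, and a hypothetical likelihood-ratio threshold `beta`. All function names are illustrative.

```python
import math
import random


def epoch_lengths(total_episodes):
    """Yield (e, T_e) with T_e = 2**e, truncating the last epoch at the budget."""
    e, used = 0, 0
    while used < total_episodes:
        length = min(2 ** e, total_episodes - used)
        yield e, length
        used += length
        e += 1


def log_likelihood(model, data):
    """Bernoulli log-likelihood of (arm, reward) pairs under candidate arm means."""
    total = 0.0
    for arm, reward in data:
        p = model[arm]
        total += math.log(p if reward == 1 else 1.0 - p)
    return total


def confidence_set(models, data, beta):
    """Keep every model whose log-likelihood is within beta of the maximum."""
    scores = [log_likelihood(m, data) for m in models]
    best = max(scores)
    return [m for m, s in zip(models, scores) if s >= best - beta]


def optimistic_arm(conf_set, n_arms):
    """Maximise the value jointly over arms and models in the confidence set."""
    return max(range(n_arms), key=lambda a: max(m[a] for m in conf_set))


def run(models, true_model, total_episodes, beta=3.0, seed=0):
    """Toy epoch loop: one frozen choice per epoch, data accumulated across epochs."""
    rng = random.Random(seed)
    data, deployed = [], []
    for e, length in epoch_lengths(total_episodes):
        conf = confidence_set(models, data, beta)  # uses all data so far
        arm = optimistic_arm(conf, len(true_model))
        deployed.append(arm)
        for _ in range(length):  # the choice is not revised within the epoch
            reward = 1 if rng.random() < true_model[arm] else 0
            data.append((arm, reward))
    return deployed, len(data)


if __name__ == "__main__":
    candidate_models = [(0.2, 0.8), (0.8, 0.2), (0.5, 0.5)]  # means must lie in (0, 1)
    deployed, n = run(candidate_models, true_model=(0.2, 0.8), total_episodes=1000)
    print(len(deployed), "epochs over", n, "episodes; arms deployed:", deployed)
```

In this toy, the number of entries in `deployed` equals the number of epochs, so it grows only logarithmically with the episode budget.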
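
## Key Design Choices

1. **Geometrically growing epochs**: T_e = 2^e
   - Only O(log T) distinct policies are ever deployed
   - The switching cost stays polylogarithmic

2. **Cumulative confidence sets**: each epoch builds its set from all historical data
   - The confidence sets shrink monotonically
   - Guarantees optimism: with high probability the true parameter lies in the confidence set

3. **Optimistic policy selection**: maximize the value over the confidence set
   - The classic optimism principle for the exploration-exploitation trade-off
   - Combined with the [[eluder-dimension|Eluder dimension]] to ensure efficiency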
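
As a quick check of the O(log T) claim in item 1, here is a worked count. It assumes (this is not stated in the note) an episode budget of $T$ that is covered by $E$ full epochs:

$$
\sum_{e=0}^{E-1} T_e \;=\; \sum_{e=0}^{E-1} 2^e \;=\; 2^E - 1 \;\ge\; T
\quad\Longleftrightarrow\quad
E \;\ge\; \log_2(T+1).
$$

So $E = \lceil \log_2(T+1) \rceil$ epochs suffice, which gives at most $E - 1 = O(\log T)$ policy switches.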
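
## Transfer Cost of Policy Switching

In a [[posterior-lipschitz-adversary|Posterior-Lipschitz POMG]], every policy switch triggers a change in the adversary's response. The epoch structure ensures that:
- There are only O(log T) switches
- The adversary's adaptation cost per switch is controlled by the Lipschitz constant
- The total transfer cost is O(m * H * log T), which does not break the sqrt(T) rate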
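
To see why a term of order m * H * log T is dominated by sqrt(T), a small numeric sketch can help. It assumes, purely for illustration, that each switch costs at most `per_switch_const * m * horizon`. The constant, the parameter names, and the sample values of `m` and `horizon` are hypothetical and do not come from the source.

```python
import math


def num_epochs(total_episodes):
    """Epochs of length 2**e needed to cover the budget: ceil(log2(T + 1))."""
    return total_episodes.bit_length()


def transfer_cost_bound(total_episodes, m, horizon, per_switch_const=1.0):
    """Assumed bound: (number of switches) * per_switch_const * m * horizon."""
    switches = num_epochs(total_episodes) - 1
    return per_switch_const * m * horizon * switches


if __name__ == "__main__":
    for T in (10**3, 10**5, 10**7):
        bound = transfer_cost_bound(T, m=2, horizon=5)
        # The ratio to sqrt(T) shrinks as T grows, so the term is lower order.
        print(T, bound, round(bound / math.sqrt(T), 4))
```

The ratio printed in the last column decreases as T grows, which is the sense in which the switching term does not affect the sqrt(T) rate.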

## References
- [[minimax-policy-regret-pomg|Minimax-Optimal Policy Regret in POMGs]]
- [[policy-regret|Policy Regret]]
- [[eluder-dimension|Eluder Dimension]]