20260617: currently 914 pages

2026-06-17 15:02:40 +08:00
parent e96b955fda
commit 91fac5b6fc
423 changed files with 20687 additions and 34 deletions
--- a/concepts/mrq-algorithm.md
+++ b/concepts/mrq-algorithm.md
@@ -0,0 +1,54 @@
+---
+title: "MR.Q Algorithm"
+created: 2026-06-10
+updated: 2026-06-10
+type: concept
+tags: ["deep-rl", "model-free-rl", "actor-critic", "predictive-learning"]
+sources: ["[[predictive-representations-scalable-mtrl]]"]
+---
+
+# MR.Q Algorithm
+
+**MR.Q** (Fujimoto et al., 2025) is a model-free RL agent whose core innovation is integrating [[auxiliary-predictive-objectives|predictive objectives]] into TD learning to shape the learned representation.
+
+## Architecture
+
+```
+Observation s_t, task tau → encoder phi → latent state z_t
+                                      ↓
+                          Actor pi(a|z)  +  Twin Critics Q(z,a)
+                                      ↓
+                          Prediction heads: z_{t+1}, r_t, d_t
+```
+
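+## Core Components
+
+1. **Encoder** phi_xi: (s_t, tau) -> z_t, mapping the observation and task into the latent space
+2. **Actor-Critic**: TD3-style twin Q-networks plus a deterministic policy
+3. **Prediction module**: predicts (z_{t+1}, r_t, d_t) from (z_t, a_t)
+4. **Gradient flow**: the prediction losses backpropagate into the encoder, which shapes the representation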
+
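+## Key Design Choices
+
+- **No planning**: the predictive model is used only for representation learning, not for latent-space rollouts
+- **Shared encoder**: the actor, the critics, and the prediction heads all share one encoder
+- **TD3 foundation**: twin critics mitigate overestimation bias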
+
+## Why the Name MR.Q
+
+MR = Model-based Representations
+Q = Q-learning / Critic
+
+In short: model-based representation learning combined with model-free control.
+
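+## In the [[predictive-representations-scalable-mtrl|Multitask Extension]]
+
+- Extended to the language-conditioned multitask setting (following the Newt protocol)
+- Evaluated in the low-data regime of 10M steps (vs. the conventional 100M)
+- Outperforms Newt on all 10 MMBench domains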
+
+## References
+- [[predictive-representations-scalable-mtrl|Scalable Multitask Deep RL]]
+- [[predictive-representation-learning|Predictive Representation Learning]]
+- [[auxiliary-predictive-objectives|Auxiliary Predictive Objectives]]
+- [[model-free-rl|Model-Free RL]]
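
The two training signals described in the MR.Q note above (a TD3-style twin-critic target and an auxiliary prediction loss over z_{t+1}, r_t, d_t) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the loss weights `w_dyn`, `w_rew`, `w_done`, the discount `gamma`, and the use of plain squared error for every prediction head are all assumptions made for illustration, and scalars and lists stand in for network outputs.

```python
from typing import Sequence


def clipped_double_q_target(reward: float, done: float,
                            q1_next: float, q2_next: float,
                            gamma: float = 0.99) -> float:
    """TD3-style target: bootstrap from the smaller of the two critic estimates.

    Taking the minimum of the twin critics is what mitigates overestimation.
    `gamma` is an assumed discount factor, not a value from the note.
    """
    return reward + gamma * (1.0 - done) * min(q1_next, q2_next)


def mean_squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def prediction_loss(z_next_pred: Sequence[float], z_next_target: Sequence[float],
                    r_pred: float, r: float,
                    d_pred: float, d: float,
                    w_dyn: float = 1.0, w_rew: float = 1.0,
                    w_done: float = 1.0) -> float:
    """Auxiliary loss over the three prediction heads (next latent, reward, done).

    Hypothetical weighting and squared-error form. In a real agent this loss
    would be backpropagated into the shared encoder; the model is never
    rolled out for planning.
    """
    return (w_dyn * mean_squared_error(z_next_pred, z_next_target)
            + w_rew * (r_pred - r) ** 2
            + w_done * (d_pred - d) ** 2)


if __name__ == "__main__":
    # Non-terminal transition: target bootstraps from min(2.0, 3.0).
    print(clipped_double_q_target(1.0, 0.0, 2.0, 3.0, gamma=0.5))
    # Terminal transition: the bootstrap term is masked out.
    print(clipped_double_q_target(1.0, 1.0, 2.0, 3.0, gamma=0.5))
    print(prediction_loss([1.0, 2.0], [1.0, 0.0], 0.5, 1.0, 0.0, 0.0))
```

In an actual agent the critic target would come from target networks evaluated on the encoded next state, and both losses would be minimized with an autodiff framework; the sketch only shows how the quantities named in the note combine.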