20260617: currently 914 pages
54
concepts/mrq-algorithm.md
Normal file
@@ -0,0 +1,54 @@
---
title: "MR.Q Algorithm"
created: 2026-06-10
updated: 2026-06-10
type: concept
tags: ["deep-rl", "model-free-rl", "actor-critic", "predictive-learning"]
sources: ["[[predictive-representations-scalable-mtrl]]"]
---

# MR.Q Algorithm

**MR.Q** (Fujimoto et al., 2025) is a model-free RL agent whose core innovation is integrating [[auxiliary-predictive-objectives|predictive objectives]] into TD learning to shape its representations.

## Architecture

```
observation s_t, task tau → encoder phi → latent state z_t
    ↓
Actor pi(a|z) + Twin Critics Q(z,a)
    ↓
prediction heads: z_{t+1}, r_t, d_t
```
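
The data flow in the diagram above can be sketched as a single forward pass. This is a minimal illustration, not the authors' implementation: the layer sizes, the single linear layers, the `tanh` activations, and the names `init_linear`, `matvec`, and `forward` are all assumptions made for the example.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical sizes, chosen only for illustration.
OBS_DIM, TASK_DIM, LATENT_DIM, ACT_DIM = 4, 2, 8, 2


def init_linear(in_dim, out_dim):
    """Small random weight matrix with shape (out_dim, in_dim)."""
    return [[random.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)]


def matvec(weights, x):
    """Multiply a weight matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


W_enc = init_linear(OBS_DIM + TASK_DIM, LATENT_DIM)          # encoder phi
W_actor = init_linear(LATENT_DIM, ACT_DIM)                   # actor pi(a|z)
W_q1 = init_linear(LATENT_DIM + ACT_DIM, 1)                  # critic Q1(z, a)
W_q2 = init_linear(LATENT_DIM + ACT_DIM, 1)                  # critic Q2(z, a)
W_pred = init_linear(LATENT_DIM + ACT_DIM, LATENT_DIM + 2)   # prediction heads


def forward(obs, task):
    """Observation + task -> latent -> actor, twin critics, predictions."""
    z = [math.tanh(v) for v in matvec(W_enc, obs + task)]
    action = [math.tanh(v) for v in matvec(W_actor, z)]
    za = z + action  # concatenate latent state and action
    pred = matvec(W_pred, za)
    return {
        "z": z,
        "action": action,
        "q1": matvec(W_q1, za)[0],
        "q2": matvec(W_q2, za)[0],
        "next_z": pred[:LATENT_DIM],                                   # z_{t+1}
        "reward": pred[LATENT_DIM],                                    # r_t
        "done_prob": 1.0 / (1.0 + math.exp(-pred[LATENT_DIM + 1])),    # d_t
    }


out = forward(obs=[0.1, -0.2, 0.3, 0.0], task=[1.0, 0.0])
print(len(out["z"]), len(out["action"]), len(out["next_z"]))
```

Every head reads the same latent `z`, which is the point of the diagram: the encoder output is the only path from the observation to the actor, the critics, and the predictions.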
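
## Core Components

1. **Encoder** phi_xi: (s_t, tau) -> z_t, mapping the observation and task into a latent space
2. **Actor-Critic**: TD3-style twin Q-networks plus a deterministic policy
3. **Prediction module**: predicts (z_{t+1}, r_t, d_t) from (z_t, a_t)
4. **Gradient flow**: the prediction losses backpropagate into the encoder, which shapes the representation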
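
To make the prediction module and its gradient flow concrete, here is a hedged sketch of a combined prediction loss over (z_{t+1}, r_t, d_t). The specific terms (mean squared error for the latent and reward, binary cross-entropy for termination), the equal weighting, and the function name `prediction_loss` are assumptions for illustration; the loss in the MR.Q paper may differ in form and weighting. The target latent is treated as a fixed value here, which is also an assumption.

```python
import math


def prediction_loss(pred_next_z, target_next_z, pred_reward, reward,
                    pred_done_prob, done):
    """Sum of three hypothetical prediction losses for one transition."""
    # Latent dynamics: mean squared error against the target next latent.
    latent_loss = sum((p - t) ** 2
                      for p, t in zip(pred_next_z, target_next_z))
    latent_loss /= len(target_next_z)

    # Reward: squared error against the observed reward.
    reward_loss = (pred_reward - reward) ** 2

    # Termination: binary cross-entropy, clamped for numerical safety.
    eps = 1e-8
    p = min(max(pred_done_prob, eps), 1.0 - eps)
    done_loss = -(done * math.log(p) + (1.0 - done) * math.log(1.0 - p))

    return latent_loss + reward_loss + done_loss


loss = prediction_loss(
    pred_next_z=[0.5, -0.5], target_next_z=[0.5, 0.5],
    pred_reward=1.0, reward=0.0,
    pred_done_prob=0.5, done=0.0,
)
print(loss)
```

This plain-Python version only evaluates the loss. In an autodiff framework, differentiating it with respect to the encoder parameters is what lets the prediction errors shape the latent space.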
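
## Key Design Choices

- **No planning**: the predictive model is used only for representation learning; no latent-space rollouts are performed
- **Shared encoder**: the actor, the critics, and the prediction heads share a single encoder
- **TD3 foundation**: twin critics mitigate overestimation bias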
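
The twin-critic point can be illustrated with the standard TD3 clipped double-Q target, which bootstraps from the smaller of the two target-critic estimates. This is a sketch of plain one-step TD3, not a claim about MR.Q's exact target (which may, for example, use a different return horizon); the names `td3_target` and `critic_loss` and the discount values are assumptions.

```python
def td3_target(reward, done, q1_next, q2_next, gamma=0.99):
    """One-step TD target using the minimum of the two target critics.

    Taking the minimum counteracts the upward bias that arises when a
    single noisy estimate is used for bootstrapping.
    """
    return reward + gamma * (1.0 - done) * min(q1_next, q2_next)


def critic_loss(q1, q2, target):
    """Both critics regress toward the same shared target."""
    return (q1 - target) ** 2 + (q2 - target) ** 2


# The two target critics disagree (5.0 vs 3.0); the target uses 3.0.
y = td3_target(reward=1.0, done=0.0, q1_next=5.0, q2_next=3.0, gamma=0.9)
print(y, critic_loss(q1=4.0, q2=3.0, target=y))
```

When `done` is 1.0 the bootstrap term vanishes and the target reduces to the immediate reward.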

## Why the Name MR.Q

MR = Model-based Representations
Q = Q-learning / Critic

In other words: model-based representation learning combined with model-free control.

## In the [[predictive-representations-scalable-mtrl|Multitask Extension]]

- Extended to a language-conditioned multitask setting (following the Newt protocol)
- Evaluated in the low-data regime of 10M steps (vs. the conventional 100M)
- Outperforms Newt on all 10 MMBench domains

## References

- [[predictive-representations-scalable-mtrl|Scalable Multitask Deep RL]]
- [[predictive-representation-learning|Predictive Representation Learning]]
- [[auxiliary-predictive-objectives|Auxiliary Predictive Objectives]]
- [[model-free-rl|Model-Free RL]]