---
title: "MR.Q Algorithm"
created: 2026-06-10
updated: 2026-06-10
type: concept
tags: ["deep-rl", "model-free-rl", "actor-critic", "predictive-learning"]
sources: ["[[predictive-representations-scalable-mtrl]]"]
---
# MR.Q Algorithm
**MR.Q** (Fujimoto et al., 2025) is a model-free RL agent whose core innovation is integrating [[auxiliary-predictive-objectives|predictive objectives]] into TD learning to shape its representations.
## Architecture
```
observation s_t, task tau → encoder phi → latent state z_t
Actor pi(a|z) + Twin Critics Q(z,a)
Prediction heads: z_{t+1}, r_t, d_t
```
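
The data flow in the diagram above can be sketched as a single forward pass. This is a minimal illustration, not the MR.Q implementation: the layer sizes, the single-matrix "networks", the `tanh` activations, and the names `linear` and `forward` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
OBS_DIM, TASK_DIM, LATENT_DIM, ACT_DIM = 8, 4, 16, 2

def linear(in_dim, out_dim):
    """Random weight matrix standing in for a trained network (assumption)."""
    return rng.normal(scale=0.1, size=(in_dim, out_dim))

W_enc = linear(OBS_DIM + TASK_DIM, LATENT_DIM)          # encoder phi
W_pi = linear(LATENT_DIM, ACT_DIM)                      # actor pi
W_q1 = linear(LATENT_DIM + ACT_DIM, 1)                  # critic Q1
W_q2 = linear(LATENT_DIM + ACT_DIM, 1)                  # critic Q2
W_pred = linear(LATENT_DIM + ACT_DIM, LATENT_DIM + 2)   # heads: z_{t+1}, r_t, d_t

def forward(s_t, tau):
    # (s_t, tau) -> z_t: the encoder sees both observation and task.
    z_t = np.tanh(np.concatenate([s_t, tau]) @ W_enc)
    # Deterministic actor acting on the latent state.
    a_t = np.tanh(z_t @ W_pi)
    za = np.concatenate([z_t, a_t])
    # Twin critics evaluate the same (z, a) pair.
    q1, q2 = (za @ W_q1).item(), (za @ W_q2).item()
    # Prediction heads: next latent, reward, and termination signal.
    pred = za @ W_pred
    z_next, r_hat, d_hat = pred[:LATENT_DIM], pred[LATENT_DIM], pred[LATENT_DIM + 1]
    return z_t, a_t, (q1, q2), (z_next, r_hat, d_hat)

z_t, a_t, qs, (z_next, r_hat, d_hat) = forward(
    rng.normal(size=OBS_DIM), rng.normal(size=TASK_DIM)
)
print(z_t.shape, a_t.shape, z_next.shape)
```

Every head reads from the same `z_t`, which is the property the rest of this note relies on.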
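## Core Components
1. **Encoder** phi_xi: (s_t, tau) -> z_t, mapping the observation and task into the latent space
2. **Actor-Critic**: TD3-style twin Q-networks plus a deterministic policy
3. **Prediction module**: predicts (z_{t+1}, r_t, d_t) from (z_t, a_t)
4. **Gradient flow**: the prediction loss backpropagates into the encoder, which shapes the representation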
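
To make the gradient-flow point concrete, the sketch below checks numerically that a prediction loss depends on the encoder weights, so minimizing it changes the representation. Everything here is an assumption for illustration: a linear encoder without the task input, plain squared-error terms for all three predictions, a fixed `z_next_target`, and finite differences in place of backpropagation. The actual loss terms used by MR.Q may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
S_DIM, Z_DIM, A_DIM = 3, 4, 2  # hypothetical sizes

W_enc = rng.normal(scale=0.5, size=(S_DIM, Z_DIM))               # linear encoder
W_pred = rng.normal(scale=0.5, size=(Z_DIM + A_DIM, Z_DIM + 2))  # prediction heads

# One made-up transition; z_next_target is treated as a fixed target.
s_t, a_t = rng.normal(size=S_DIM), rng.normal(size=A_DIM)
z_next_target, r_t, d_t = rng.normal(size=Z_DIM), 1.0, 0.0

def prediction_loss(W):
    """Squared error of the (z_{t+1}, r_t, d_t) predictions for encoder weights W."""
    z_t = s_t @ W
    out = np.concatenate([z_t, a_t]) @ W_pred
    z_hat, r_hat, d_hat = out[:Z_DIM], out[Z_DIM], out[Z_DIM + 1]
    return float(
        np.sum((z_hat - z_next_target) ** 2) + (r_hat - r_t) ** 2 + (d_hat - d_t) ** 2
    )

def encoder_gradient(W, eps=1e-5):
    """Central-difference gradient of the prediction loss w.r.t. the encoder."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        step = np.zeros_like(W)
        step[idx] = eps
        grad[idx] = (prediction_loss(W + step) - prediction_loss(W - step)) / (2 * eps)
    return grad

grad = encoder_gradient(W_enc)
loss_before = prediction_loss(W_enc)
loss_after = prediction_loss(W_enc - 1e-3 * grad)  # one small gradient step on the encoder
print(np.linalg.norm(grad) > 0, loss_after < loss_before)
```

A nonzero gradient on `W_enc` is what "the prediction loss shapes the representation" means in practice: the encoder is updated by the auxiliary objective, not only by the TD loss.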
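## Key Design Choices
- **No planning**: the predictive model is used only for representation learning, with no latent-space rollouts
- **Shared encoder**: the actor, critic, and prediction heads share a single encoder
- **TD3 foundation**: twin critics mitigate overestimation bias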
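
The way twin critics counter overestimation can be shown with the standard TD3 clipped double-Q target, which bootstraps from the smaller of the two critic estimates. This is a sketch of the generic TD3 target only. The function name `td3_target` and the discount `gamma=0.99` are assumptions, target-policy smoothing is omitted, and MR.Q's exact target may differ in detail.

```python
def td3_target(reward, done, q1_next, q2_next, gamma=0.99):
    """Clipped double-Q target: y = r + gamma * (1 - d) * min(Q1', Q2')."""
    return reward + gamma * (1.0 - done) * min(q1_next, q2_next)

# The more pessimistic critic (8.0) sets the bootstrap value, not the optimistic one (10.0).
y = td3_target(reward=1.0, done=0.0, q1_next=10.0, q2_next=8.0)

# On a terminal transition there is no bootstrapping, so the target is just the reward.
y_terminal = td3_target(reward=1.0, done=1.0, q1_next=10.0, q2_next=8.0)
print(y, y_terminal)
```

Taking the minimum biases the target downward, which offsets the upward bias that a single maximizing critic tends to accumulate.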
## Why the Name MR.Q
- **MR** = Model-based Representations
- **Q** = Q-learning / Critic

In other words: model-based representation learning combined with model-free control.
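## In the [[predictive-representations-scalable-mtrl|Multitask Extension]]
- Extended to a language-conditioned multitask setting (following the Newt protocol)
- Evaluated in a low-data regime of 10M steps (vs. the conventional 100M)
- Outperforms Newt on all 10 MMBench domains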
## References
- [[predictive-representations-scalable-mtrl|Scalable Multitask Deep RL]]
- [[predictive-representation-learning|Predictive Representation Learning]]
- [[auxiliary-predictive-objectives|Auxiliary Predictive Objectives]]
- [[model-free-rl|Model-Free RL]]