---
title: "Multitask Reinforcement Learning (Multitask RL)"
created: 2026-06-10
updated: 2026-06-10
type: concept
tags: ["deep-rl", "multitask-learning", "transfer-learning"]
sources: ["[[predictive-representations-scalable-mtrl]]"]
---
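# Multitask Reinforcement Learning (Multitask RL, MTRL)
**Multitask RL** aims to train a single agent to maximize expected return over a **task distribution** p(tau), rather than on a single task.
## Formalization
Each task tau defines an MDP M_tau = (S, A, T_tau, R_tau, gamma). The state and action spaces are usually shared, while the dynamics and rewards vary by task.
Objective: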
```
E_{tau~p(tau), pi} [ sum_t gamma^t * r_t ]
```
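The objective above can be approximated by Monte Carlo sampling: draw a task, roll out the policy, and average the discounted returns. The sketch below is a minimal illustration that assumes a discrete task distribution; `estimate_objective`, `toy_rollout`, and the task names `"reach"` and `"push"` are hypothetical and do not come from the source.

```python
import random


def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r_t for one episode's reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def estimate_objective(tasks, weights, rollout, gamma, num_episodes, seed=0):
    """Monte Carlo estimate of E_{tau~p(tau), pi}[sum_t gamma^t * r_t].

    tasks/weights describe a discrete p(tau) (an assumption of this sketch);
    rollout(task, rng) is a hypothetical callable that runs the policy on
    `task` and returns the list of rewards it collected.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_episodes):
        task = rng.choices(tasks, weights=weights, k=1)[0]
        total += discounted_return(rollout(task, rng), gamma)
    return total / num_episodes


# Toy stand-in for "policy + environment": reward depends only on the task.
def toy_rollout(task, rng):
    return [1.0, 1.0, 1.0] if task == "reach" else [0.0, 0.0, 0.0]


estimate = estimate_objective(["reach", "push"], [0.5, 0.5], toy_rollout,
                              gamma=0.5, num_episodes=1000)
```

In a real setup, `rollout` would step an environment with task-specific dynamics T_tau and rewards R_tau; here it is collapsed into a fixed reward list to keep the example self-contained.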
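## Core Challenges
1. **Task interference**: the shared representation must support multiple, possibly conflicting objectives; gradients from different tasks may be misaligned.
2. **Non-stationarity**: the data distribution shifts as the task mixture changes, violating the i.i.d. assumption.
3. **Capacity underutilization**: large models often fail to make effective use of added parameters in the multitask setting.
4. **Representation bottleneck**: [[predictive-representations-scalable-mtrl|Obando-Ceron et al.]] argue that representation quality is the core bottleneck.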
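One common way to make task interference concrete is to compare per-task gradients on the shared parameters: a negative cosine similarity means an update that helps one task hurts the other. The sketch below is a toy illustration, assuming a linear model with squared-error loss; the function names and the two "tasks" are hypothetical, not taken from the source.

```python
import math


def cosine_similarity(g1, g2):
    """Cosine of the angle between two gradient vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(g1, g2))
    n1 = math.sqrt(sum(a * a for a in g1))
    n2 = math.sqrt(sum(b * b for b in g2))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0
    return dot / (n1 * n2)


def squared_error_grad(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to the shared weights w."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]


# Shared weights, same input, but the two tasks want opposite outputs.
w = [0.0, 0.0]
x = [1.0, 0.0]
grad_task_a = squared_error_grad(w, x, y=1.0)   # [-1.0, 0.0]
grad_task_b = squared_error_grad(w, x, y=-1.0)  # [1.0, 0.0]

alignment = cosine_similarity(grad_task_a, grad_task_b)  # -1.0: fully conflicting
```

Here the two gradients point in exactly opposite directions, so summing them cancels the update entirely, an extreme case of the misalignment described in item 1.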
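## Taxonomy of Methods
| Approach | Representative methods | Core mechanism |
|------|------|---------|
| Model-based | Newt, Dreamer | Shared world model + planning |
| Model-free + predictive representations | [[mrq-algorithm|MR.Q]] | Auxiliary prediction objectives shape the representation |
| Pure model-free | PPO, SAC | Reward-only supervision |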
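The difference between the last two rows can be sketched as a difference in the training loss: reward-only supervision uses a value loss alone, while the predictive-representation variant adds an auxiliary term for predicting the next latent state. The code below is a generic illustration under that assumption and is not MR.Q's actual objective; `combined_loss`, `aux_weight`, and the latent vectors are hypothetical.

```python
def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def td_loss(q_value, reward, next_q_value, gamma):
    """Squared one-step TD error: the reward-only supervision signal."""
    target = reward + gamma * next_q_value
    return (q_value - target) ** 2


def combined_loss(q_value, reward, next_q_value, gamma,
                  predicted_next_latent, next_latent, aux_weight):
    """Value loss plus a weighted auxiliary latent-prediction loss.

    aux_weight = 0 recovers the pure model-free (reward-only) case.
    """
    value_term = td_loss(q_value, reward, next_q_value, gamma)
    aux_term = mse(predicted_next_latent, next_latent)
    return value_term + aux_weight * aux_term


reward_only = combined_loss(1.0, 1.0, 2.0, 0.5, [1.0, 0.0], [0.0, 0.0], aux_weight=0.0)
with_aux = combined_loss(1.0, 1.0, 2.0, 0.5, [1.0, 0.0], [0.0, 0.0], aux_weight=2.0)
```

The auxiliary term gives the encoder a learning signal even when rewards are sparse or conflict across tasks, which is the intuition behind "shaping the representation" in the table.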
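## Scaling Behavior
Key finding: pure model-free methods gain nothing, or even degrade, as model size grows; adding [[predictive-representation-learning|predictive representation learning]] yields continued improvement, indicating that representation quality is the bottleneck for scaling.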
## References
- [[predictive-representations-scalable-mtrl|Scalable Multitask Deep RL]]
- [[predictive-representation-learning|Predictive Representation Learning]]
- [[deep-rl-scaling|Scaling Deep RL]]