---
title: "Soft Actor-Critic (SAC)"
created: 2026-06-17
updated: 2026-06-17
type: concept
tags: [reinforcement-learning, algorithm, actor-critic, entropy]
sources: [raw/papers/naveen-repmt-sac-2026.md]
confidence: high
---
# Soft Actor-Critic (SAC)
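SAC is a representative **maximum-entropy reinforcement learning** algorithm. It serves as the base framework in [[repmt-sac|RepMT-SAC]].
## Core Idea
Standard RL maximizes only the expected cumulative reward. SAC additionally maximizes the **policy entropy**: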
```
π* = argmax E[Σ γ^t (r_t + α H(π(·|s_t)))]
```
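Here `H(π) = -E[log π(a|s)]` is the policy entropy, γ is the discount factor, and α is the temperature. The entropy term encourages exploration and policy diversity.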
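
As a rough illustration of the objective above, the following Python sketch evaluates the entropy-augmented discounted return for a single trajectory. The function names, the default values of `gamma` and `alpha`, and the use of a 1-D Gaussian for the entropy are assumptions made for this example only.

```python
import math


def gaussian_entropy(std: float) -> float:
    """Differential entropy of a 1-D Gaussian: 0.5 * ln(2 * pi * e * std^2)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std ** 2)


def soft_return(rewards, entropies, gamma=0.99, alpha=0.2):
    """Sum over t of gamma^t * (r_t + alpha * H_t) for one trajectory.

    `entropies[t]` stands for H(pi(.|s_t)); gamma and alpha are illustrative defaults.
    """
    return sum(
        (gamma ** t) * (r + alpha * h)
        for t, (r, h) in enumerate(zip(rewards, entropies))
    )


# Hypothetical two-step trajectory with a unit-variance Gaussian policy.
h = gaussian_entropy(1.0)
print(soft_return([1.0, 1.0], [h, h]))
```

Setting `alpha=0` recovers the ordinary discounted return, which makes the role of the temperature easy to check by hand.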
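## Architecture
- **Actor**: a parameterized policy π_θ(a|s) that outputs an action distribution (usually Gaussian)
- **Twin critics**: two Q networks that reduce overestimation bias
- **Temperature parameter α**: controls the reward-entropy trade-off
- **Reparameterization trick**: `a = f_θ(s, ε)` with ε ~ N(0, I), which allows low-variance gradients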
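
A minimal scalar sketch of two of these components, assuming a plain 1-D Gaussian policy. `sample_action` and `soft_td_target` are hypothetical names. Taking the minimum of the two critic estimates is the usual way twin critics are combined, but this note does not spell that out, so treat it as an assumption.

```python
import math
import random


def sample_action(mean, log_std, rng):
    """Reparameterized Gaussian sample a = mean + std * eps, plus its log-density."""
    std = math.exp(log_std)
    eps = rng.gauss(0.0, 1.0)  # noise drawn independently of the policy parameters
    action = mean + std * eps  # deterministic, differentiable function of (mean, std)
    log_prob = -0.5 * eps ** 2 - log_std - 0.5 * math.log(2.0 * math.pi)
    return action, log_prob


def soft_td_target(reward, done, q1_next, q2_next, log_prob_next,
                   gamma=0.99, alpha=0.2):
    """Entropy-regularized TD target built from the smaller of two critic estimates."""
    min_q = min(q1_next, q2_next)  # pessimistic estimate to counter overestimation
    return reward + gamma * (1.0 - done) * (min_q - alpha * log_prob_next)


rng = random.Random(0)
a_next, logp_next = sample_action(mean=0.0, log_std=0.0, rng=rng)
print(soft_td_target(1.0, 0.0, q1_next=2.0, q2_next=3.0, log_prob_next=logp_next))
```

Because the randomness lives in `eps` alone, gradients with respect to `mean` and `log_std` can flow through `action` in an autodiff framework. This scalar version only shows the structure.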
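## Extension in RepMT-SAC
In [[rep-mt-sac|RepMT-SAC]], SAC is extended to a multi-task variant:
- Linearized Q function: `Q(s,a;τ) = ⟨φ(s,a), w(τ)⟩`
- Task-conditioned policy: `π(a|s,τ)`
- Upstream: jointly learn φ, w, and π
- Downstream: freeze φ and fine-tune w and π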
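
The linear form of the Q function can be illustrated with a toy sketch. Everything beyond the inner product itself is an assumption for illustration: the feature values, the squared-error loss, and the plain gradient step are hypothetical and are not taken from the RepMT-SAC source.

```python
def linear_q(phi_sa, w_tau):
    """Q(s, a; tau) = <phi(s, a), w(tau)> as a plain inner product."""
    return sum(p * w for p, w in zip(phi_sa, w_tau))


def finetune_w_step(phi_sa, w_tau, target, lr=0.1):
    """One gradient step on 0.5 * (Q - target)^2 with respect to w only.

    phi_sa is treated as a constant, mirroring the frozen representation
    in the downstream phase. The loss and step rule are assumptions.
    """
    td_error = linear_q(phi_sa, w_tau) - target
    return [w - lr * td_error * p for w, p in zip(w_tau, phi_sa)]


phi = [1.0, 2.0]   # hypothetical shared features phi(s, a)
w = [0.5, -0.25]   # hypothetical task weights w(tau)
w_new = finetune_w_step(phi, w, target=1.0)
print(linear_q(phi, w), linear_q(phi, w_new))
```

With φ fixed, adapting to a new task in this sketch only touches the low-dimensional vector `w`, which is one plausible reading of why the representation is frozen downstream.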
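## Key Properties
- **Off-policy**: learns from a replay buffer
- **Automatic temperature tuning**: α can be adjusted adaptively
- **Continuous actions**: natively supports continuous control
- **Sample efficiency**: more sample-efficient than on-policy methods (such as PPO)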
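
One common way to adapt α is a gradient step on the loss `J(α) = E[-α (log π(a|s) + H_target)]`, sketched below on `log α` so that α stays positive. The name `update_log_alpha`, the learning rate, and the target entropy value are assumptions for this example. Implementations often set the target to the negative action dimension, but this note does not specify one.

```python
import math


def update_log_alpha(log_alpha, log_probs, target_entropy, lr=3e-4):
    """One gradient-descent step on J = mean(-alpha * (log_prob + target_entropy)).

    alpha = exp(log_alpha), so dJ/dlog_alpha = -alpha * mean(log_prob + target_entropy).
    """
    alpha = math.exp(log_alpha)
    gap = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    grad = -alpha * gap
    return log_alpha - lr * grad


# Hypothetical batch whose entropy estimate (-log_prob) is below the target.
log_alpha = update_log_alpha(0.0, log_probs=[0.0, 0.0], target_entropy=1.0, lr=0.1)
print(math.exp(log_alpha))
```

When the sampled entropy estimate falls below the target, α rises and the entropy bonus gains weight. When it exceeds the target, α shrinks.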
## References
- [[rep-mt-sac|RepMT-SAC]]
- [[multitask-rl|Multi-task RL]]
- [[reinforcement-learning|Reinforcement learning]]