20260617: currently 914 pages
52
concepts/soft-actor-critic.md
Normal file
@@ -0,0 +1,52 @@
---
title: "Soft Actor-Critic (SAC)"
created: 2026-06-17
updated: 2026-06-17
type: concept
tags: [reinforcement-learning, algorithm, actor-critic, entropy]
sources: [raw/papers/naveen-repmt-sac-2026.md]
confidence: high
---

# Soft Actor-Critic (SAC)

SAC is a representative **maximum-entropy reinforcement learning** algorithm and serves as the base framework in [[repmt-sac|RepMT-SAC]].

## Core idea

Standard RL maximizes only the expected return. SAC additionally maximizes the **policy entropy**:

```
π* = argmax_π E[Σ_t γ^t (r_t + α H(π(·|s_t)))]
```

where `H(π(·|s)) = -E_{a~π}[log π(a|s)]` rewards exploration and policy diversity.

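As a rough illustration of the objective above, the sketch below evaluates the entropy-augmented discounted return for a single trajectory. The function names `gaussian_entropy` and `soft_return`, the toy rewards, and the unit-variance Gaussian policy are assumptions made for this example, not part of the source.

```python
import math

def gaussian_entropy(stds):
    """Differential entropy of a diagonal Gaussian with the given std devs."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e * s * s) for s in stds)

def soft_return(rewards, entropies, alpha, gamma):
    """Discounted sum of (r_t + alpha * H_t) along one trajectory."""
    return sum((gamma ** t) * (r + alpha * h)
               for t, (r, h) in enumerate(zip(rewards, entropies)))

# Toy trajectory (hypothetical numbers): three steps, reward 1.0 each,
# policy is a 1-D Gaussian with std 1.0 at every step.
rewards = [1.0, 1.0, 1.0]
entropies = [gaussian_entropy([1.0])] * 3

plain = soft_return(rewards, entropies, alpha=0.0, gamma=0.9)  # standard RL return
soft = soft_return(rewards, entropies, alpha=0.2, gamma=0.9)   # entropy bonus added
```

Setting `alpha=0.0` recovers the ordinary discounted return, so the gap between `soft` and `plain` is exactly the weighted entropy bonus.
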
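## Architecture

- **Actor**: a parameterized policy π_θ(a|s) that outputs an action distribution (usually Gaussian)
- **Twin critics**: two Q-networks; using the smaller of the two estimates reduces overestimation bias
- **Temperature α**: controls the reward-entropy trade-off
- **Reparameterization trick**: `a = f_θ(s, ε)` with ε ~ N(0, I), which yields low-variance gradients
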
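The components listed above can be sketched in a few lines. This is a minimal pure-Python sketch under stated assumptions: the helper names `sample_action` and `soft_target` are hypothetical, the policy is an unsquashed diagonal Gaussian (practical implementations commonly apply a tanh squash and correct the log-probability accordingly), and the critic outputs are passed in as plain numbers rather than computed by networks.

```python
import math
import random

def sample_action(mean, log_std, rng):
    """Reparameterized sample a = mean + std * eps with eps ~ N(0, 1).

    Returns the action and its Gaussian log-density.
    """
    action, log_prob = [], 0.0
    for m, ls in zip(mean, log_std):
        eps = rng.gauss(0.0, 1.0)      # noise is independent of the parameters
        action.append(m + math.exp(ls) * eps)
        log_prob += -0.5 * eps * eps - ls - 0.5 * math.log(2.0 * math.pi)
    return action, log_prob

def soft_target(reward, done, q1_next, q2_next, log_prob_next, alpha, gamma):
    """Soft Bellman target built from the minimum of the two critic estimates."""
    min_q = min(q1_next, q2_next)
    return reward + gamma * (1.0 - done) * (min_q - alpha * log_prob_next)

rng = random.Random(0)
action, log_prob = sample_action([0.0, 0.5], [0.0, -1.0], rng)
target = soft_target(reward=1.0, done=0.0, q1_next=2.0, q2_next=3.0,
                     log_prob_next=-1.0, alpha=0.2, gamma=0.99)
```

Because the randomness enters only through `eps`, the sampled action is a deterministic function of the policy parameters, which is what lets gradients flow through the sample in an autodiff framework.
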
## Extension in RepMT-SAC

In [[rep-mt-sac|RepMT-SAC]], SAC is extended to a multi-task variant:

- Linearized Q-function: `Q(s,a;τ) = ⟨φ(s,a), w(τ)⟩`
- Task-conditioned policy: `π(a|s,τ)`
- Upstream: φ, w, and π are learned jointly
- Downstream: φ is frozen; w and π are fine-tuned

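To make the linear Q-function concrete, here is a small sketch of the inner product and of a downstream update that touches only the task weights. Everything beyond the formula `Q(s,a;τ) = ⟨φ(s,a), w(τ)⟩` is an assumption for illustration: the names `q_value` and `finetune_w`, the feature values, the task names, and the squared-error gradient step (the source does not specify how w is fine-tuned). Policy fine-tuning is omitted.

```python
def q_value(phi_sa, w_task):
    """Q(s, a; tau) = <phi(s, a), w(tau)> as a plain dot product."""
    return sum(p * w for p, w in zip(phi_sa, w_task))

def finetune_w(w_task, phi_sa, target, lr):
    """One gradient step on 0.5 * (Q - target)^2 with respect to w only.

    phi_sa is treated as a constant, mirroring the frozen representation.
    """
    error = q_value(phi_sa, w_task) - target
    return [w - lr * error * p for w, p in zip(w_task, phi_sa)]

# Hypothetical frozen features for one (s, a) pair, shared across tasks.
phi_sa = [1.0, 0.5, -2.0]
# Hypothetical per-task weight vectors w(tau).
task_weights = {"task_a": [0.1, 0.2, 0.0], "task_b": [0.0, -0.3, 0.4]}

q_a_before = q_value(phi_sa, task_weights["task_a"])
task_weights["task_a"] = finetune_w(task_weights["task_a"], phi_sa,
                                    target=1.0, lr=0.1)
q_a_after = q_value(phi_sa, task_weights["task_a"])
```

The same `phi_sa` serves every task; only the weight vector changes, which is why adapting to a new task reduces to fitting a new `w` (and policy) on top of fixed features.
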
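## Key properties

- **Off-policy**: learns from a replay buffer
- **Automatic temperature tuning**: α can be adjusted adaptively
- **Continuous actions**: naturally supports continuous control
- **Sample efficiency**: typically more sample-efficient than on-policy methods such as PPO, because stored transitions are reused across updates
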
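Automatic temperature tuning can be sketched as a gradient step on α toward a target entropy. The version below follows one common formulation, `J(α) = E[-α (log π(a|s) + H_target)]`, optimized in log-space so that α stays positive. The function name `update_log_alpha`, the learning rate, the sample log-probabilities, and the target-entropy heuristic of minus the action dimension are assumptions for this example.

```python
import math

def update_log_alpha(log_alpha, log_probs, target_entropy, lr):
    """One gradient step on J(alpha) = mean(-alpha * (log_prob + target_entropy))."""
    alpha = math.exp(log_alpha)
    mean_gap = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    grad = -alpha * mean_gap           # dJ / d(log_alpha)
    return log_alpha - lr * grad

action_dim = 2
target_entropy = -float(action_dim)    # common heuristic, assumed here

# Hypothetical batch where the policy is too deterministic:
# estimated entropy is about -3.0, below the target of -2.0.
low_entropy_batch = [3.0, 2.5, 3.5]
log_alpha_up = update_log_alpha(0.0, low_entropy_batch, target_entropy, lr=0.1)

# Hypothetical batch where the policy is more random than required:
# estimated entropy is about 0.0, above the target of -2.0.
high_entropy_batch = [0.0, 0.0, 0.0]
log_alpha_down = update_log_alpha(0.0, high_entropy_batch, target_entropy, lr=0.1)
```

When the policy's entropy falls below the target, α grows and the entropy bonus weighs more; when the entropy exceeds the target, α shrinks and the reward term dominates.
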
## References

- [[rep-mt-sac|RepMT-SAC]]
- [[multitask-rl|Multi-task RL]]
- [[reinforcement-learning|Reinforcement learning]]