---
title: "Action-Routing Policy"
created: 2026-06-17
updated: 2026-06-17
type: concept
tags: [reinforcement-learning, routing, policy-gradient]
sources: [raw/papers/zhang-tarpo-2026.md]
confidence: high
---

# Action-Routing Policy

The action-routing policy is the core abstraction of the [[tarpo|TARPO]] framework: it **formalizes reasoning-mode selection as an RL policy**.

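## Formal definition

Reasoning-mode selection is modeled as a stochastic policy over the binary discrete action space `D = {hard, soft}`:

```
ρ_θ(d_t | h_t) = Softmax(W_r * h_t + b_r)
```

Here `W_r ∈ R^{2×d}` and `b_r ∈ R^2` are the trainable parameters of the [[action-head-router|action head]], applied to the `d`-dimensional representation `h_t` at step `t`.
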
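The definition above can be made concrete with a small sketch. It is a minimal illustration in plain Python, not the TARPO implementation: the row order of `W_r` (`hard` first, `soft` second), the helper names `route_probs` and `sample_action`, and all numeric values are assumptions introduced here.

```python
import math
import random

# Assumed ordering of the two actions; the source does not fix which row is which.
ACTIONS = ("hard", "soft")

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_probs(W_r, b_r, h_t):
    """Compute rho_theta(. | h_t) = Softmax(W_r h_t + b_r) as [p_hard, p_soft]."""
    logits = [sum(w * x for w, x in zip(row, h_t)) + b for row, b in zip(W_r, b_r)]
    return softmax(logits)

def sample_action(probs, rng):
    """Draw d_t from the routing distribution instead of taking the argmax."""
    return rng.choices(ACTIONS, weights=probs, k=1)[0]

# Toy example with d = 3; every number here is made up for illustration.
W_r = [[0.5, -0.2, 0.1],
       [-0.3, 0.4, 0.0]]
b_r = [0.0, 0.0]
h_t = [1.0, 2.0, -1.0]

probs = route_probs(W_r, b_r, h_t)
d_t = sample_action(probs, random.Random(0))
```

With two actions, the softmax reduces to a sigmoid of the logit difference, so the router effectively learns a single decision boundary in the space of `h_t`.
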
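## Policy optimization objective

The routing policy shares the group-relative advantage signal with the LLM backbone. For the `i`-th sampled response with `T` routing decisions:

```
L_act = -(1/T) * sum_{t=1}^{T} log ρ_θ(d_{i,t} | h_{i,t}) * A_hat_i
```

- When the advantage is positive, the routing decisions that were taken are reinforced.
- When the advantage is negative, the routing decisions that were taken are penalized.
- The hyperparameter λ controls the weight of the routing objective in the total loss.
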
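The sign behaviour of `L_act` can be checked numerically. The sketch below assumes the usual GRPO-style advantage (rewards standardized within a sampled group); the source only says the advantage is group-relative, so that normalization, the function names, and the toy numbers are all assumptions.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Assumed GRPO-style advantage: standardize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]

def routing_loss(chosen_probs, advantage):
    """L_act for one response i: -(1/T) * sum_t log rho(d_{i,t} | h_{i,t}) * A_hat_i.

    chosen_probs[t] is the probability the router gave to the action it actually took.
    """
    T = len(chosen_probs)
    return -sum(math.log(p) for p in chosen_probs) * advantage / T

# Toy group of four responses with binary rewards (illustrative values only).
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)

# The same routing trace scored under a positive and a negative advantage.
chosen = [0.6, 0.7, 0.9]
loss_pos = routing_loss(chosen, advantages[0])  # minimizing this raises log-probs
loss_neg = routing_loss(chosen, advantages[1])  # minimizing this lowers log-probs
```

Because the advantage is constant across the steps of one response, every routing decision in that response is pushed in the same direction. How `L_act` enters the total loss is only described as a λ-weighted term; a form such as `L_total = L_backbone + λ * L_act` is a plausible reading, not something stated here.
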
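## KL regularization

To keep training stable, a KL penalty on the routing policy is added alongside the token-level KL term:

```
L_KL = sum_t [δ_t * D_KL(π_θ || π_ref) + α * D_KL(ρ_θ || ρ_ref)]
```

- `δ_t = ρ_θ(hard | h_t)`: the token-level KL is applied only in hard mode, weighted by the router's probability of choosing `hard`.
- α controls the strength of the KL penalty on the routing policy.
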
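A small sketch of how the two KL terms combine per step. It takes the per-token KL of the backbone as a given number and computes the router KL over the two actions; index 0 standing for `hard`, the function names, and the values are assumptions made for illustration.

```python
import math

def kl_categorical(p, q):
    """D_KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def kl_penalty(token_kls, router_probs, router_ref_probs, alpha):
    """L_KL = sum_t [delta_t * KL_token_t + alpha * KL(rho_theta || rho_ref)_t]."""
    total = 0.0
    for kl_tok, rho, rho_ref in zip(token_kls, router_probs, router_ref_probs):
        delta_t = rho[0]  # probability of hard; index 0 = hard is an assumed ordering
        total += delta_t * kl_tok + alpha * kl_categorical(rho, rho_ref)
    return total

# Two steps: the first routes mostly to hard, the second mostly to soft.
token_kls = [0.02, 0.05]                 # stand-ins for D_KL(pi_theta || pi_ref) per step
router_probs = [[0.9, 0.1], [0.2, 0.8]]
router_ref = [[0.5, 0.5], [0.5, 0.5]]
penalty = kl_penalty(token_kls, router_probs, router_ref, alpha=0.1)
```

In this reading, a step routed confidently to `soft` contributes almost nothing through the token-level term, while the α-weighted term still discourages the router from drifting far from its reference.
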
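## Key properties

- **Learnable**: optimized entirely through RL, with no preset heuristic thresholds.
- **Stochasticity preserved**: actions are sampled from the policy (rather than taken by argmax), which maintains exploration.
- **Initialization sensitivity**: the initial bias `b_0` affects the soft ratio and the training-reward dynamics.
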
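One way to see the initialization sensitivity: if `W_r * h_t` is close to zero at the start of training, the routing distribution is determined almost entirely by the bias. The sketch below assumes that `b_0` acts as a scalar offset on the `soft` logit relative to `hard`; the source does not specify how `b_0` is parameterized, so this is only an illustrative reading.

```python
import math

def initial_soft_prob(b0):
    """P(soft) when the two logits are [0, b0], which equals sigmoid(b0)."""
    return 1.0 / (1.0 + math.exp(-b0))

# Sweep a few hypothetical initial biases and print the implied initial soft ratio.
for b0 in (-2.0, 0.0, 2.0):
    print(f"b0={b0:+.1f} -> initial soft ratio ~ {initial_soft_prob(b0):.3f}")
```

Under this assumption, a strongly negative or positive `b_0` starts the router near one mode, which limits how often the other mode is sampled early on and therefore how much reward signal it receives.
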
## References

- [[action-head-router|Action-head router]]
- [[token-wise-routing|Token-wise routing]]
- [[tarpo|TARPO]]
- [[grpo|GRPO]]