---
title: "Action-Routing Policy"
created: 2026-06-17
updated: 2026-06-17
type: concept
tags: [reinforcement-learning, routing, policy-gradient]
sources: [raw/papers/zhang-tarpo-2026.md]
confidence: high
---
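# Action-Routing Policy
The action-routing policy is the core abstraction of the [[tarpo|TARPO]] framework: it **formalizes reasoning-mode selection as an RL policy**.
## Formal Definition
Reasoning-mode selection is modeled as a stochastic policy over the binary discrete action space `D = {hard, soft}`: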
```
ρ_θ(d_t | h_t) = Softmax(W_r * h_t + b_r)
```
where `W_r ∈ R^{2×d}` and `b_r ∈ R^2` are the trainable parameters of the [[action-head-router|action head]], and `d` is the dimension of `h_t`.
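
As a rough illustration of the definition above, the following sketch computes `ρ_θ(d_t | h_t)` for a toy hidden state and samples a mode from it. The function names, the toy weights, and the convention that index 0 means `hard` and index 1 means `soft` are assumptions made for this example; they do not come from the source paper.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_probs(W_r, b_r, h_t):
    """Return [P(hard), P(soft)] = Softmax(W_r * h_t + b_r).

    Assumed layout: row 0 of W_r scores `hard`, row 1 scores `soft`.
    """
    logits = [sum(w * h for w, h in zip(row, h_t)) + b
              for row, b in zip(W_r, b_r)]
    return softmax(logits)

def sample_mode(probs, rng):
    """Sample a routing decision from the policy (no argmax)."""
    return "hard" if rng.random() < probs[0] else "soft"

# Toy values with d = 3 (hypothetical, for illustration only).
W_r = [[0.5, -0.2, 0.1],
       [-0.5, 0.2, -0.1]]
b_r = [0.0, 0.0]
h_t = [1.0, 2.0, 3.0]

probs = route_probs(W_r, b_r, h_t)
mode = sample_mode(probs, random.Random(0))
```

Because there are only two actions, the softmax reduces to a sigmoid of the logit difference, so the router in effect learns a single decision boundary in the space of `h_t`.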
## Policy Optimization Objective
The routing policy shares the group-relative advantage signal with the LLM backbone:
```
L_act = -(1/T) * sum_{t=1}^{T} log ρ_θ(d_{i,t} | h_{i,t}) * A_hat_i
```
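- When the advantage is positive, the current routing decision is encouraged
- When the advantage is negative, the current routing decision is penalized
- The hyperparameter λ controls the weight of the routing objective in the total loss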
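
To make the sign behavior concrete, here is a minimal sketch of `L_act` for a single sampled response `i`. The helper name `action_loss` and the toy probabilities are hypothetical; the group-relative advantage `A_hat_i` is passed in as a number because its computation is outside the scope of this note.

```python
import math

def action_loss(log_probs, advantage):
    """L_act = -(1/T) * sum_t log rho(d_t | h_t) * A_hat for one response.

    log_probs: log-probabilities of the routing decisions actually sampled.
    advantage: the scalar group-relative advantage shared by all T steps.
    """
    T = len(log_probs)
    return -sum(lp * advantage for lp in log_probs) / T

# Toy probabilities the router assigned to its own sampled decisions.
chosen_probs = [0.7, 0.6, 0.9]
log_probs = [math.log(p) for p in chosen_probs]

loss_pos = action_loss(log_probs, advantage=1.0)   # positive advantage
loss_neg = action_loss(log_probs, advantage=-1.0)  # negative advantage

# With a positive advantage, raising the chosen probabilities lowers the loss.
more_confident = [math.log(p) for p in (0.8, 0.7, 0.95)]
loss_pos_confident = action_loss(more_confident, advantage=1.0)
```

With a positive advantage, minimizing the loss pushes the sampled decisions' probabilities up; with a negative advantage the same gradient step pushes them down.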
## KL Regularization
To keep training stable, a KL penalty is applied to the routing policy alongside the token-level KL term:
```
L_KL = sum_t [δ_t * D_KL(π_θ || π_ref) + α * D_KL(ρ_θ || ρ_ref)]
```
- `δ_t = ρ_θ(hard | h_t)`: gates the token-level KL so that it applies only in hard mode
- α controls the strength of the routing-policy KL
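
The penalty above can be sketched for small categorical distributions as follows. This is an illustrative reading of the formula, not the paper's implementation: the function names and toy distributions are assumptions, index 0 of each routing distribution is assumed to be `hard`, and all reference probabilities are assumed to be strictly positive.

```python
import math

def kl_categorical(p, q):
    """D_KL(p || q) for two categorical distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_penalty(pi, pi_ref, rho, rho_ref, alpha):
    """L_KL = sum_t [delta_t * KL(pi || pi_ref) + alpha * KL(rho || rho_ref)].

    pi, pi_ref:   per-step token distributions (current and reference).
    rho, rho_ref: per-step routing distributions as [P(hard), P(soft)].
    """
    total = 0.0
    for pi_t, pi_ref_t, rho_t, rho_ref_t in zip(pi, pi_ref, rho, rho_ref):
        delta_t = rho_t[0]  # probability of hard mode at step t
        total += delta_t * kl_categorical(pi_t, pi_ref_t)
        total += alpha * kl_categorical(rho_t, rho_ref_t)
    return total

# Toy example: 2 steps, vocabulary of size 3 (hypothetical values).
pi = [[0.7, 0.2, 0.1], [0.3, 0.3, 0.4]]
pi_ref = [[0.6, 0.3, 0.1], [0.3, 0.3, 0.4]]
rho = [[0.8, 0.2], [0.1, 0.9]]
rho_ref = [[0.5, 0.5], [0.5, 0.5]]

penalty = kl_penalty(pi, pi_ref, rho, rho_ref, alpha=0.1)
no_drift = kl_penalty(pi, pi, rho, rho, alpha=0.1)
```

Since `δ_t` is a probability, the token-level term fades smoothly as the router shifts toward soft mode, while the α term keeps the router itself close to its reference.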
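## Key Properties
- **Learnable**: optimized entirely through RL, with no preset heuristic thresholds
- **Stochasticity preserved**: routing decisions are sampled from the policy (rather than taken by argmax), which preserves exploration
- **Initialization sensitivity**: the initial bias `b_0` affects the soft ratio and the training reward dynamics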
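
One way to see why initialization matters: for a two-way softmax, `P(soft)` equals a sigmoid of the logit difference. The sketch below assumes `W_r * h_t` contributes nothing at initialization (for example, a zero-initialized `W_r`), so only the bias gap sets the starting soft ratio. That assumption, the helper name, and the treatment of `b_0` as a bias offset toward `soft` are illustrative and not taken from the source.

```python
import math

def initial_soft_prob(b_hard, b_soft):
    """P(soft) under a two-way softmax when only the biases contribute.

    Softmax over two logits equals a sigmoid of their difference.
    """
    return 1.0 / (1.0 + math.exp(-(b_soft - b_hard)))

# Sweep a hypothetical bias offset toward `soft`, with b_hard fixed at 0.
offsets = [-2.0, 0.0, 2.0]
soft_ratios = [initial_soft_prob(0.0, b) for b in offsets]
```

Under these assumptions a zero offset starts the router at an even split, while a moderate positive or negative offset already skews most early decisions toward one mode, which in turn changes which trajectories the RL signal gets to evaluate.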
## References
- [[action-head-router|Action-head router]]
- [[token-wise-routing|Token-wise routing]]
- [[tarpo|TARPO]]
- [[grpo|GRPO]]