SidneyZhang/myWiki


Sidney Zhang 91fac5b6fc

20260617: currently 914 pages

2026-06-17 15:02:40 +08:00


title: Action-Routing Policy (动作路由策略)
created: 2026-06-17
updated: 2026-06-17
type: concept
tags: reinforcement-learning, routing, policy-gradient
sources: raw/papers/zhang-tarpo-2026.md
confidence: high

Action-Routing Policy

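The action-routing policy is the core abstraction in the tarpo framework: it formalizes the choice of reasoning mode as a reinforcement-learning (RL) policy.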

Formal Definition

Reasoning-mode selection is modeled as a stochastic policy over the binary discrete action space D = {hard, soft}:

ρ_θ(d_t | h_t) = Softmax(W_r * h_t + b_r)

where W_r ∈ R^{2×d} and b_r ∈ R^2 are the trainable parameters of the action-head-router, and h_t is the d-dimensional hidden state at step t.
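As a minimal sketch of the definition above, the router can be written as a linear map followed by a two-way softmax. The function names (`softmax`, `route_probs`), the toy dimension d = 3, the numeric values, and the index order (0 = hard, 1 = soft) are illustrative assumptions and do not come from the source.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_probs(W_r, b_r, h_t):
    """Return [P(hard | h_t), P(soft | h_t)] = Softmax(W_r * h_t + b_r).

    W_r: 2 x d list of lists, b_r: length-2 list, h_t: length-d list.
    The index order (0 = hard, 1 = soft) is an assumption of this sketch.
    """
    logits = [sum(w * h for w, h in zip(row, h_t)) + b
              for row, b in zip(W_r, b_r)]
    return softmax(logits)

# Toy example with d = 3; all values are made up for illustration.
W_r = [[0.5, -0.2, 0.1],
       [-0.3, 0.4, 0.0]]
b_r = [0.0, 0.0]
h_t = [1.0, 2.0, -1.0]
probs = route_probs(W_r, b_r, h_t)  # logits are roughly [0.0, 0.5]
```

In this toy case the soft logit is larger, so the router puts more probability on soft than on hard.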

Policy Optimization Objective

The routing policy shares the group-relative advantage signal with the LLM backbone:

L_act = -(1/T) * sum_{t=1}^{T} log ρ_θ(d_{i,t} | h_{i,t}) * A_hat_i

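When the advantage is positive, the current routing decision is encouraged.
When the advantage is negative, the current routing decision is penalized.
The hyperparameter λ controls the weight of the routing objective in the total loss.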
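The objective can be sketched for a single rollout as follows. The source does not spell out how the group-relative advantage is computed, so the within-group reward standardization in `group_relative_advantages` is an assumption (a common GRPO-style choice), as are the function names and the toy numbers.

```python
import math

def routing_loss(log_probs, advantage):
    """L_act = -(1/T) * sum_t log rho(d_t | h_t) * A_hat for one rollout.

    log_probs: log-probabilities of the routing decisions actually sampled.
    advantage: scalar advantage A_hat_i shared by every step of rollout i.
    """
    T = len(log_probs)
    return -sum(lp * advantage for lp in log_probs) / T

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within a group (assumed form, not from the source)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

# Two rollouts in one group; the first one earned the higher reward.
rewards = [1.0, 0.0]
advs = group_relative_advantages(rewards)

# The same routing log-probabilities scored under each advantage.
log_probs = [math.log(0.6), math.log(0.7)]
loss_pos = routing_loss(log_probs, advs[0])  # positive advantage
loss_neg = routing_loss(log_probs, advs[1])  # negative advantage
```

With a positive advantage the loss is positive and minimizing it raises the log-probability of the sampled decisions. With a negative advantage the sign flips, so minimizing it lowers that log-probability.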

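KL Regularization

To keep training stable, a KL penalty is imposed on the routing policy, alongside the token-level KL on the LLM policy: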

L_KL = sum_t [δ_t * D_KL(π_θ || π_ref) + α * D_KL(ρ_θ || ρ_ref)]

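δ_t = ρ_θ(hard | h_t): gates the token-level KL by the probability of hard mode, so the penalty applies only in hard mode.
α controls the strength of the routing-policy KL.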
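A minimal sketch of this loss is shown below. It assumes the token-level KL values are already computed elsewhere and that router outputs are ordered as [P(hard), P(soft)]. The helper names and the toy inputs are hypothetical.

```python
import math

def kl_categorical(p, q):
    """D_KL(p || q) for discrete distributions given as probability lists.

    Assumes q > 0 wherever p > 0; terms with p == 0 contribute nothing.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def kl_loss(token_kls, router_probs, router_ref_probs, alpha):
    """L_KL = sum_t [delta_t * KL(pi || pi_ref) + alpha * KL(rho || rho_ref)].

    token_kls[t]: precomputed token-level KL at step t.
    router_probs[t], router_ref_probs[t]: [P(hard), P(soft)] (order assumed).
    """
    total = 0.0
    for tok_kl, rho, rho_ref in zip(token_kls, router_probs, router_ref_probs):
        delta_t = rho[0]  # probability of hard mode gates the token-level KL
        total += delta_t * tok_kl + alpha * kl_categorical(rho, rho_ref)
    return total

# Step 0 routes fully to hard, step 1 fully to soft (extreme toy values).
token_kls = [0.2, 0.4]
router_probs = [[1.0, 0.0], [0.0, 1.0]]
router_ref_probs = [[0.5, 0.5], [0.5, 0.5]]
loss = kl_loss(token_kls, router_probs, router_ref_probs, alpha=0.1)
```

In this example the token-level KL of step 1 (0.4) drops out entirely because δ_t is 0 there, while both steps still pay the router KL term scaled by α.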

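Key Properties

Learnable: optimized entirely through RL, with no preset heuristic thresholds.
Stochasticity preserved: decisions are sampled from the policy (rather than taken by argmax), which ensures exploration.
Initialization sensitivity: the initial bias b_0 affects the soft ratio and the training reward dynamics.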
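The last two properties can be illustrated with a small simulation. It assumes the router's linear term is zero at initialization, so only the bias determines the output, and it models the initial bias as a pair of logits (`bias_hard`, `bias_soft`). That parameterization, the function names, and the chosen values are assumptions of this sketch, not details from the source.

```python
import math
import random

def hard_soft_probs(bias_hard, bias_soft):
    """Router output when W_r * h_t = 0, so only the bias matters."""
    m = max(bias_hard, bias_soft)
    e_h = math.exp(bias_hard - m)
    e_s = math.exp(bias_soft - m)
    return e_h / (e_h + e_s), e_s / (e_h + e_s)

def sample_mode(p_hard, rng):
    """Sample a routing decision from the policy instead of taking the argmax."""
    return "hard" if rng.random() < p_hard else "soft"

rng = random.Random(0)  # fixed seed for reproducibility

# A bias that favors hard: P(soft) = 1 / (1 + e), roughly 0.269.
p_hard, p_soft = hard_soft_probs(bias_hard=1.0, bias_soft=0.0)

samples = [sample_mode(p_hard, rng) for _ in range(10_000)]
soft_ratio = samples.count("soft") / len(samples)

# Argmax routing under the same bias would never pick soft.
argmax_mode = "hard" if p_hard >= p_soft else "soft"
```

Sampling keeps the empirical soft ratio close to `p_soft`, whereas argmax would route every step to hard under this bias and never explore soft. Shifting the bias changes `p_soft` directly, which is one way to see why the initial bias shapes the early soft ratio.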

References