20260617: currently 914 pages
47
concepts/hrpo.md
Normal file
@@ -0,0 +1,47 @@
---
title: "HRPO: Hybrid Reasoning Policy Optimization"
created: 2026-06-17
updated: 2026-06-17
type: concept
tags: [reasoning, architecture, latent-reasoning, reinforcement-learning]
sources: [raw/papers/zhang-tarpo-2026.md]
confidence: high
---

# HRPO: Hybrid Reasoning Policy Optimization

HRPO (Yue et al., 2026) is a representative RL method for **dense-fusion hybrid reasoning** and serves as the core comparison baseline in the [[tarpo|TARPO]] paper.

## Core Mechanism

HRPO builds a fused representation of the discrete token and the continuous representation at **every decoding step**:

```
u_fused = g * E(v_t) + (1-g) * h_t
```

where:

- `g` is a learnable gating parameter
- `E(v_t)` is the discrete token embedding
- `h_t` is the hidden-state representation

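The fusion formula can be illustrated with a minimal sketch. It assumes a single scalar gate obtained by passing a learnable logit through a sigmoid, and plain Python lists as vectors; the note does not say how HRPO parameterizes `g`, so the names `fuse` and `gate_logit` are hypothetical and not taken from the paper.

```python
import math


def sigmoid(x: float) -> float:
    """Squash a real-valued logit into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse(token_embedding, hidden_state, gate_logit):
    """Compute u_fused = g * E(v_t) + (1 - g) * h_t element-wise.

    Assumption: g is one scalar shared across all dimensions, derived
    from a learnable logit. The real HRPO gate may be parameterized
    differently (for example, per dimension).
    """
    if len(token_embedding) != len(hidden_state):
        raise ValueError("E(v_t) and h_t must have the same dimension")
    g = sigmoid(gate_logit)
    return [g * e + (1.0 - g) * h for e, h in zip(token_embedding, hidden_state)]


e_vt = [1.0, 0.0, -1.0]  # toy discrete token embedding E(v_t)
h_t = [0.0, 2.0, 1.0]    # toy hidden-state representation h_t

# gate_logit = 0 gives g = 0.5, an equal mix of the two inputs.
u_fused = fuse(e_vt, h_t, gate_logit=0.0)
print(u_fused)
```

With a large positive `gate_logit` the result approaches `E(v_t)`, and with a large negative one it approaches `h_t`, so the gate interpolates continuously between the discrete and continuous inputs.
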
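## Differences from TARPO

| Dimension | HRPO | [[tarpo|TARPO]] |
|------|------|------|
| Fusion style | Dense fusion (mixes at every step) | Binary switching (hard or soft) |
| Router | Learnable gate | Lightweight action head |
| Decision granularity | Continuous weight | Discrete binary |
| Training dynamics | Prone to entropy spikes late in training | Stable training |
| Source of stochasticity | Discrete token sampling | Routing-decision sampling |
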
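The contrast in decision granularity can be sketched in a few lines. This is only an illustration under stated assumptions: "hard" is taken to mean feeding the discrete token embedding, "soft" to mean feeding the continuous representation, and the action head is reduced to a single probability `p_soft`. The function names are hypothetical, and neither function reproduces the actual HRPO or TARPO implementation.

```python
import random


def dense_fusion_input(token_emb, hidden, gate):
    """HRPO-style step: always mix both inputs with a continuous weight."""
    return [gate * e + (1.0 - gate) * h for e, h in zip(token_emb, hidden)]


def binary_routing_input(token_emb, hidden, p_soft, rng):
    """Binary-routing step: sample one discrete decision per step.

    Assumption: p_soft stands in for the output of a lightweight action
    head. The sampled decision selects either the continuous input
    ("soft") or the token embedding ("hard"), never a blend of the two.
    """
    use_soft = rng.random() < p_soft
    chosen = hidden if use_soft else token_emb
    return list(chosen), use_soft


rng = random.Random(0)
token_emb = [1.0, 0.0]
hidden = [0.0, 1.0]

dense = dense_fusion_input(token_emb, hidden, gate=0.25)
routed, used_soft = binary_routing_input(token_emb, hidden, p_soft=0.5, rng=rng)
print(dense, routed, used_soft)
```

In the dense case the next-step input depends smoothly on the gate value, whereas in the binary case the stochasticity lives in the sampled routing decision itself.
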
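## Training Dynamics Issue

The TARPO paper finds that HRPO exhibits an **entropy spike** (an abnormal rise in token entropy) in the later stages of training, possibly because the continuous weights of the gating mechanism make optimization unstable. TARPO's discrete binary routing preserves training stability better.
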
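Token entropy, the quantity in which the spike is observed, is the Shannon entropy of the next-token distribution. The sketch below shows one way it could be computed from raw logits and averaged over a sequence for monitoring; the helper names are hypothetical, and the note does not describe how the TARPO paper actually measures or aggregates it.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]


def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_token_entropy(step_logits):
    """Average token entropy over the decoding steps of one sequence."""
    return sum(token_entropy(l) for l in step_logits) / len(step_logits)


# A peaked distribution has near-zero entropy; a uniform one over four
# tokens has entropy log(4).
confident = [100.0, 0.0, 0.0, 0.0]
uniform = [0.0, 0.0, 0.0, 0.0]
print(token_entropy(confident), token_entropy(uniform))
print(mean_token_entropy([confident, uniform]))
```

Logging this average per training step would make a late-stage rise visible as an upward drift away from the values seen earlier in training.
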
## References

- [[hybrid-reasoning|Hybrid reasoning]]
- [[tarpo|TARPO]]
- [[latent-reasoning|Latent reasoning]]
- [[soft-token]]