myWiki/raw/papers/zhang-tarpo-2026.md

---
title: "TARPO: Token-Wise Latent-Explicit Reasoning via Action-Routing Policy Optimization"
source_url: https://arxiv.org/abs/2606.05859
ingested: 2026-06-17
sha256: <computed>
---

# TARPO: Token-Wise Latent-Explicit Reasoning via Action-Routing Policy Optimization

**Authors:** Liting Zhang, Shiwan Zhao, Xuyang Zhao, Zichen Xu, Jianye Wang, Qicheng Li (TMCC, College of Computer Science, Nankai University, Tianjin)

**arXiv:** 2606.05859v1 [cs.CL] (2026-06-04)

**Code:** https://github.com/NKU-LITI/TARPO-master

## Abstract

Latent reasoning has emerged as a promising alternative to discrete Chain-of-Thought (CoT) in large language models (LLMs), enabling more expressive reasoning by operating over continuous representations. However, the inherently deterministic nature of continuous representations limits policy exploration in reinforcement learning (RL). To address this, we propose TARPO (Token-Wise Latent-Explicit Reasoning via Action-Routing Policy Optimization), a pure RL framework that adaptively switches between discrete token generation and continuous latent reasoning at each step. TARPO introduces a lightweight action head router that observes the current hidden state and samples a routing decision from a binary mode-selection space, preserving the stochasticity of discrete token sampling from the vocabulary. The LLM backbone and router are jointly optimized end-to-end with a shared group-relative advantage signal. Extensive experiments across Qwen2.5 (from 1.5B to 7B) and Llama-3.1-8B backbones demonstrate that TARPO consistently outperforms existing explicit and latent reasoning RL baselines across diverse benchmarks. Further analysis shows that TARPO learns adaptive token-wise switching behaviors while maintaining stable training dynamics.
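The token-wise routing step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the router is a single linear layer with a sigmoid output, that soft mode feeds the probability-weighted mixture of token embeddings as the next input, and that hard mode samples one token from the vocabulary. The names `route_step`, `router_w`, and `router_b` are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def route_step(hidden, logits, embeddings, router_w, router_b, rng):
    """One decoding step; returns (mode, token_id, next_input, log_prob).

    Assumed form: a linear router on the hidden state yields the
    probability of choosing the soft (latent) mode for this step.
    """
    score = sum(w * h for w, h in zip(router_w, hidden)) + router_b
    p_soft = sigmoid(score)
    probs = softmax(logits)

    # Sample the binary routing decision, so the step stays stochastic.
    if rng.random() < p_soft:
        # Soft mode (assumption): expected embedding under the token distribution.
        dim = len(embeddings[0])
        next_input = [
            sum(p * emb[d] for p, emb in zip(probs, embeddings))
            for d in range(dim)
        ]
        return "soft", None, next_input, math.log(p_soft)

    # Hard mode: sample a discrete token and feed its embedding.
    token_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    log_prob = math.log(1.0 - p_soft) + math.log(probs[token_id])
    return "hard", token_id, list(embeddings[token_id]), log_prob
```

Under these assumptions, with near-zero router weights the value of `router_b` alone sets the initial soft-routing probability through `sigmoid(router_b)`.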

## Key Concepts

- [[latent-reasoning|latent reasoning]] vs [[chain-of-thought|chain of thought]]
- [[continuous-representation|continuous representation]]
- [[soft-token]] / [[hard-token]]
- [[action-routing-policy|action-routing policy]]
- [[action-head-router|action head router]]
- [[token-wise-routing|token-wise routing]]
- [[hybrid-reasoning|hybrid reasoning]]
- [[grpo|GRPO]]
- [[coconut|COCONUT]]
- [[hrpo|HRPO]]

## Key Findings

- TARPO achieves the best in-domain performance across Qwen2.5 (1.5B, 3B, 7B), improving on GRPO by 0.52% Pass@1 and 1.22% Pass@32 on average
- Out-of-distribution generalization: 4.76% improvement on HumanEval over GRPO, with 18% fewer generated tokens
- Cross-architecture generalization verified on Llama-3.1-8B
- Adaptive switching behavior: the router learns to select soft tokens for key mathematical operations and hard tokens for structural text
- Action head bias initialization and KL penalty are critical hyperparameters for stable training
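
The shared group-relative advantage signal mentioned in the abstract can be illustrated with a GRPO-style computation. This is a hedged sketch, not the paper's objective. It assumes each sampled response in a group gets one scalar reward, that advantages are the rewards standardized within the group, and that the same advantage weights both the token log-probabilities and the routing log-probabilities. Ratio clipping and the KL penalty are omitted for brevity, and the function names are hypothetical.

```python
import math


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def shared_policy_loss(rewards, token_log_probs, route_log_probs):
    """Policy-gradient surrogate with one advantage per response.

    token_log_probs[i] and route_log_probs[i] are lists of per-step
    log-probabilities for response i (assumed layout). The same
    advantage scales both, so backbone and router share one signal.
    """
    advantages = group_relative_advantages(rewards)
    total = 0.0
    for adv, tok_lp, route_lp in zip(advantages, token_log_probs, route_log_probs):
        total += -adv * (sum(tok_lp) + sum(route_lp))
    return total / len(rewards)
```

If every response in a group earns the same reward, all advantages are zero and the group contributes no gradient to either the backbone or the router.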