---
title: "OneReason: Unlocking Reasoning in Generative Recommendation"
created: 2026-06-10
updated: 2026-06-10
type: paper
tags: [recommendation, reasoning, chain-of-thought, generative-model, rl]
sources: [raw/papers/onereason-team-onereason-2026.md]
confidence: high
---
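# OneReason: Unlocking Reasoning in Generative Recommendation
> **arXiv:2606.06260** | OneRec Team (Kuaishou) | 2026-06-04
> From a "scaling advantage" to a "reasoning advantage": teaching generative recommendation models to truly "think before recommending".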
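## Core Problem
The [[onerec|OneRec]] family of generative recommendation models is widely deployed in industry (Kuaishou short video, live streaming, advertising, e-commerce), but these models only benefit from **scaling gains**. Their reasoning ability is hard to activate, because sequences made purely of [[itemic-tokens|itemic tokens]] cannot form a meaningful [[chain-of-thought|chain of thought (CoT)]].
Early explorations (OneRec-Think, OpenOneRec) successfully extended the "think before answer" paradigm to recommendation tasks, yet showed an **unexpected phenomenon: thinking mode does not outperform non-thinking mode**.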
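## Methodological Contributions
Drawing on research into CoT robustness in multimodal LLMs, the paper proposes two pillars of recommendation reasoning:
1. **[[perception-cognition-recommendation|Perception]]**: deeply align itemic tokens with their underlying language semantics, so that they become referable, composable semantic units
2. **[[perception-cognition-recommendation|Cognition]]**: design a recommendation-specific three-layer CoT structure to support deliberate reasoning
Building on these, it proposes **OneReason**, which comprises three technical stages: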
| Stage | Technique | Goal |
|------|------|------|
| Pre-training | Strengthened [[itemic-text-alignment|itemic-text alignment]] | Build strong item perception |
| SFT | Three-layer [[recommendation-cot|cognition-enhanced CoT]] | Build recommendation reasoning ability |
| RL | [[specialize-then-unify-rl|specialize-then-unify]] | Strengthen the thinking advantage |
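
To make the pre-training row concrete, here is a minimal sketch of what itemic-text alignment data might look like. It is an illustration, not the paper's data format: the `Item` class, the `alignment_pairs` helper, the prompt wording, and the token spelling (`<a_12>` and so on) are all hypothetical. The only idea taken from the text above is that each itemic token sequence is tied to its language description, here in both directions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    # Hypothetical itemic token spelling; the real vocabulary is not given in this note.
    itemic_tokens: tuple[str, ...]
    description: str


def alignment_pairs(item: Item) -> list[dict[str, str]]:
    """Build both directions of an itemic-text alignment example (assumed format)."""
    token_str = "".join(item.itemic_tokens)
    return [
        # itemic tokens -> text: the model must verbalize what the item is
        {"prompt": f"Describe the item {token_str}.", "target": item.description},
        # text -> itemic tokens: the model must ground a description to an item
        {"prompt": f"Which item matches: {item.description}", "target": token_str},
    ]


item = Item(("<a_12>", "<b_7>", "<c_301>"), "a short video about street food")
pairs = alignment_pairs(item)
for pair in pairs:
    print(pair)
```

Training on both directions is one plausible way to make an itemic token "referable" inside a CoT; whether OneReason uses this exact pairing is an assumption.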
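## Key Findings
- **Specialize-then-Unify**: under multi-domain mixed RL, thinking mode still lags behind non-thinking mode, but under single-domain RL it consistently outperforms it. The approach is therefore to run specialized in-domain RL first, then balance across domains via [[rejection-sampling-fine-tuning|Rejection Sampling FT]] or [[multi-teacher-on-policy-distillation|Multi-Teacher On-Policy Distillation]]
- **[[thinking-supervision-transfer|Thinking Supervision Transfer]]**: replacing unCoT data with CoT-supervised data improves non-thinking mode performance, suggesting that the CoT supervision signal may transfer to direct decoding
- **[[abductive-reasoning-recommendation|Abductive Reasoning]]**: recommendation reasoning is abductive rather than deductive; it infers latent points of interest backward from the behavior sequence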
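
The "unify" step of Specialize-then-Unify via rejection sampling can be sketched roughly as follows. This is a generic rejection-sampling data builder under stated assumptions, not the paper's implementation: `rejection_sample`, `build_unified_set`, the reward threshold, the per-domain cap used as a crude balancing device, and the toy generators are all hypothetical. Each per-domain specialist (the output of single-domain RL) proposes candidates, only high-reward ones are kept, and the survivors are merged into one fine-tuning set.

```python
from typing import Callable

Generate = Callable[[str, int], list[str]]  # (prompt, n) -> n candidate responses
Reward = Callable[[str, str], float]        # (prompt, response) -> score


def rejection_sample(prompts: list[str], generate: Generate, reward: Reward,
                     num_candidates: int = 4, threshold: float = 0.5) -> list[dict[str, str]]:
    """Keep, per prompt, the best candidate if its reward clears the threshold."""
    kept = []
    for prompt in prompts:
        scored = [(reward(prompt, c), c) for c in generate(prompt, num_candidates)]
        if not scored:
            continue
        best_score, best = max(scored, key=lambda pair: pair[0])
        if best_score >= threshold:
            kept.append({"prompt": prompt, "response": best})
    return kept


def build_unified_set(specialists: dict[str, Generate],
                      prompts_by_domain: dict[str, list[str]],
                      reward: Reward, max_per_domain: int = 1000) -> list[dict[str, str]]:
    """Merge accepted samples from every domain specialist into one SFT set."""
    unified = []
    for domain, generate in specialists.items():
        kept = rejection_sample(prompts_by_domain[domain], generate, reward)
        # Assumed balancing rule: cap each domain's contribution.
        unified.extend({"domain": domain, **s} for s in kept[:max_per_domain])
    return unified


# Toy stand-ins for trained specialists and a reward model.
def toy_generate(prompt: str, n: int) -> list[str]:
    return [f"{prompt}-cand{i}" for i in range(n)]


def weak_generate(prompt: str, n: int) -> list[str]:
    return [f"{prompt}-cand0"] * n


def toy_reward(prompt: str, response: str) -> float:
    return int(response[-1]) / 3  # the trailing digit decides the score


specialists = {"video": toy_generate, "ads": weak_generate}
prompts_by_domain = {"video": ["u1", "u2"], "ads": ["u3"]}
unified = build_unified_set(specialists, prompts_by_domain, toy_reward)
print(unified)
```

In this toy run the "ads" specialist only produces zero-reward candidates, so all of its samples are rejected and the unified set contains the two "video" samples.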
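## Evaluation
[[onereason-bench|OneReason-Bench]] evaluates recommendation reasoning ability across four progressive levels, R0 through R3.
## Open Source
The OneReason-8B and OneReason-0.8B models will be open-sourced.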
## References
- [[onerec|OneRec generative recommendation]]
- [[chain-of-thought|Chain of thought (CoT)]]
- [[generative-recommendation|Generative recommendation]]
- [Original archive](raw/papers/onereason-team-onereason-2026.md)