20260617: 914 pages so far
56
papers/onereason.md
Normal file
@@ -0,0 +1,56 @@
---
title: "OneReason: Unlocking Reasoning Capability in Generative Recommendation"
created: 2026-06-10
updated: 2026-06-10
type: paper
tags: [recommendation, reasoning, chain-of-thought, generative-model, rl]
sources: [raw/papers/onereason-team-onereason-2026.md]
confidence: high
---
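# OneReason: Unlocking Reasoning Capability in Generative Recommendation

> **arXiv:2606.06260** | OneRec Team (Kuaishou) | 2026-06-04
> From a "scaling advantage" to a "reasoning advantage": teaching generative recommendation models to genuinely "think before recommending"

## Core Problem

The [[onerec|OneRec]] family of generative recommendation models is widely deployed in industry (Kuaishou short video, live streaming, advertising, e-commerce), but these models benefit only from the **scaling dividend**. Their reasoning ability is hard to activate, because a pure [[itemic-tokens|itemic token]] sequence cannot form a meaningful [[chain-of-thought|chain of thought (CoT)]].

Early explorations (OneRec-Think, OpenOneRec) successfully extended the "think before answer" paradigm to recommendation tasks, yet revealed an **unexpected phenomenon: thinking mode does not outperform non-thinking mode**.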
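## Methodological Contributions

Drawing on research into CoT robustness in multimodal LLMs, the paper proposes two pillars of recommendation reasoning:

1. **[[perception-cognition-recommendation|Perception]]**: deeply align itemic tokens with their underlying linguistic semantics, so that they become referable, composable semantic units
2. **[[perception-cognition-recommendation|Cognition]]**: design a recommendation-specific three-layer CoT structure to support deliberate reasoning

Building on these two pillars, the paper proposes **OneReason**, which comprises three technical stages:

| Stage | Technique | Goal |
|------|------|------|
| Pre-training | Strengthened [[itemic-text-alignment|itemic-text alignment]] | Establish strong item perception |
| SFT | Three-layer [[recommendation-cot|cognition-enhanced CoT]] | Build recommendation reasoning ability |
| RL | [[specialize-then-unify-rl|specialize-then-unify]] | Strengthen the thinking-mode advantage |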
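## Key Findings

- **Specialize-then-Unify**: under multi-domain mixed RL, thinking mode still lags behind non-thinking mode, but under single-domain RL it consistently surpasses it. The approach therefore runs specialized RL within each domain first, then balances across domains via [[rejection-sampling-fine-tuning|Rejection Sampling FT]] or [[multi-teacher-on-policy-distillation|Multi-Teacher On-Policy Distillation]]
- **[[thinking-supervision-transfer|Thinking Supervision Transfer]]**: replacing unCoT data with CoT supervision data improves non-thinking mode performance, suggesting that the CoT supervision signal may transfer to direct decoding
- **[[abductive-reasoning-recommendation|Abductive Reasoning]]**: recommendation reasoning is abductive rather than deductive, inferring latent interests backward from the behavior sequence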
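The specialize-then-unify finding above can be illustrated with a minimal sketch of the rejection-sampling route. This is not the paper's implementation: `generate` stands in for a hypothetical domain-specialist policy, `reward_fn` for a hypothetical reward, and the choices of best-of-`k` selection, a fixed `threshold`, and balancing by downsampling every domain to the smallest one are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Sample:
    domain: str
    prompt: str
    response: str
    reward: float


def rejection_sample(
    domain: str,
    prompts: List[str],
    generate: Callable[[str], str],          # hypothetical domain-specialist policy
    reward_fn: Callable[[str, str], float],  # hypothetical reward function
    k: int = 4,
    threshold: float = 0.5,
) -> List[Sample]:
    """Draw k responses per prompt; keep the best one if it clears the threshold."""
    kept: List[Sample] = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(k)]
        scored = [(reward_fn(prompt, c), c) for c in candidates]
        best_reward, best = max(scored, key=lambda rc: rc[0])
        if best_reward >= threshold:
            kept.append(Sample(domain, prompt, best, best_reward))
    return kept


def unify(per_domain: Dict[str, List[Sample]], seed: int = 0) -> List[Sample]:
    """Assumed balancing rule: downsample each domain to the smallest domain's size."""
    rng = random.Random(seed)
    n = min(len(samples) for samples in per_domain.values())
    mixed: List[Sample] = []
    for samples in per_domain.values():
        mixed.extend(rng.sample(samples, n))
    rng.shuffle(mixed)
    return mixed
```

In this sketch, each domain's specialist would be run through `rejection_sample` separately, and the output of `unify` would serve as the SFT set for a single cross-domain model. `unify` expects at least one domain in `per_domain`.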
## Evaluation Framework

[[onereason-bench|OneReason-Bench]] evaluates recommendation reasoning ability across four progressive levels, R0 through R3.

## Open Source

The OneReason-8B and OneReason-0.8B models will be open-sourced.

## References

- [[onerec|OneRec generative recommendation]]
- [[chain-of-thought|Chain of thought (CoT)]]
- [[generative-recommendation|Generative recommendation]]
- [Original archive](raw/papers/onereason-team-onereason-2026.md)