---
title: "Unlimited OCR Works: Welcome the Era of One-shot Long-horizon Parsing"
author: "Youyang Yin, Huanhuan Liu*, YY†, et al. (Baidu Inc.)"
source: "arXiv 2606.23050"
date: "2026-06-22"
type: paper
venue: "arXiv (cs.CV, cs.CL)"
tags: ["ocr", "attention-mechanism", "long-horizon", "kv-cache", "r-swa", "end-to-end"]
code: "https://github.com/baidu/Unlimited-OCR"
---
# Unlimited OCR Works
> Youyang Yin, Huanhuan Liu*, YY†, Qunyi Xie, Chaorun Liu, Shiqi Yang, Shaohua Wang, Zhanlong Liu, Hao Zou, Jinyue Chen, Shu Wei, Jingjing Wu, Mingxin Huang, Zhen Wu, Guibin Wang, Tengyu Du, Lei Jia
> Baidu Inc. | arXiv:2606.23050 | Jun 2026
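## Core Problem
Existing end-to-end OCR models (such as DeepSeek OCR) use an LLM as the decoder and rely on language priors to improve accuracy. The cost is that the KV cache grows linearly as the output sequence lengthens, so inference speed keeps dropping. Humans do not slow down on long transcription tasks, which makes this a fundamental architectural bottleneck.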
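To make the linear growth concrete, here is a minimal sketch of the cache arithmetic for a full-attention decoder. The layer count, KV head count, head dimension, and element size are hypothetical defaults chosen for illustration; they are not the configuration of DeepSeek OCR or of the paper's model.

```python
def kv_cache_bytes(num_tokens, num_layers=12, num_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """Bytes needed to cache keys and values for `num_tokens` positions.

    All default sizes are hypothetical, picked only to show the scaling.
    """
    # The leading 2 accounts for storing both a key and a value per position.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


def full_attention_cache_tokens(num_reference, num_generated):
    """Full causal attention keeps every reference and every generated token."""
    return num_reference + num_generated


if __name__ == "__main__":
    num_reference = 1000  # hypothetical count of visual + prompt tokens
    for num_generated in (1000, 3000, 6000):
        tokens = full_attention_cache_tokens(num_reference, num_generated)
        print(num_generated, tokens, kv_cache_bytes(tokens))
```

Since each decoding step attends over everything in the cache, per-token cost rises with the cache, which is the slowdown described above.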
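## Core Approach: Reference Sliding Window Attention (R-SWA)
The paper proposes **R-SWA**, an attention mechanism that mimics human working memory during parsing:
1. Each generated token attends to all reference tokens (visual tokens + prompt) plus the previous n output tokens (default n = 128)
2. Reference tokens do not take part in state transitions, which keeps the visual features from gradually blurring
3. The KV cache stays at a constant size of Lm + n and does not grow with decoding length
4. Inference speed (TPS) and GPU memory remain constant throughout decoding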
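The attention pattern in the list above can be sketched as a boolean mask. This is an illustrative reading of the summary, not the paper's implementation. It assumes that Lm is the number of reference tokens (the note does not define it) and that the window of n output tokens includes the current position. The function names are hypothetical.

```python
def rswa_mask(num_reference, num_generated, window):
    """Build a mask with one row per generated token; True means "may attend".

    Columns are laid out as [reference tokens | generated tokens].
    Assumption: the window covers the `window` most recent output
    positions, counting the current one.
    """
    total = num_reference + num_generated
    mask = []
    for i in range(num_generated):
        row = [False] * total
        # Every generated token sees all reference tokens.
        for j in range(num_reference):
            row[j] = True
        # It also sees only the most recent `window` output tokens.
        first = max(0, i - window + 1)
        for k in range(first, i + 1):
            row[num_reference + k] = True
        mask.append(row)
    return mask


def rswa_cache_tokens(num_reference, num_generated, window):
    """Cached positions under this sketch: capped at num_reference + window."""
    return num_reference + min(num_generated, window)


if __name__ == "__main__":
    for row in rswa_mask(num_reference=3, num_generated=6, window=2):
        print("".join("x" if allowed else "." for allowed in row))
    print(rswa_cache_tokens(1000, 6000, 128))
```

Once `num_generated` exceeds `window`, each row attends to exactly `num_reference + window` positions, which matches the constant cache size of Lm + n stated above.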
## Key Results
- Uses DeepSeek OCR as the baseline and replaces all decoder attention with R-SWA
- OmniDocBench v1.5: **93% Overall**, 6 pp above the DeepSeek OCR baseline
- OmniDocBench v1.6: on par with SOTA (93.54%)
- Long-horizon parsing (books of 2 to 40+ pages): Distinct-n > 96%, Edit Distance < 0.11
- Inference efficiency (6000 tokens): 35% TPS difference relative to DeepSeek OCR
- 3B parameters, MoE architecture, only 500M activated
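## Limitations
Parsing is bounded by the prefill length (currently 32K), so it is not truly unlimited. Short-term direction: train with a 128K context. Long-term direction: build a prefill pool to simulate page turning.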
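## Generalization
R-SWA is a general-purpose parsing attention mechanism. Beyond OCR, it also applies to reference-based long-horizon tasks such as ASR and translation.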