---
title: "PDF Processing"
type: concept
created: 2026-06-04
tags: [pdf, document-processing, parsing, ocr]
---
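# PDF Processing
**Definition**: A family of tools and methods for extracting structured information from PDF documents, covering text extraction, layout parsing, table recognition, and formula handling.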
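## Method categories
| Method | Representative tools | Characteristics |
|------|---------|------|
| Rule-based | pdftotext | Simple and fast, but loses structure |
| Vision model | [[mineru]] | Preserves layout and hierarchical structure |
| OCR | Tesseract | Handles scanned documents |
| Deep learning | Nougat, Grobid | Specialized for academic literature |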
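
As a rough illustration of how the table might guide tool selection, the sketch below maps a few document traits to a method category. The `PdfTraits` fields, the `choose_category` helper, and the priority order (scanned first, then academic, then structure-sensitive) are assumptions made for this example; they are not part of any of the listed tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PdfTraits:
    """Hypothetical description of an input PDF (not any tool's API)."""
    is_scanned: bool = False        # pages are images without a text layer
    is_academic: bool = False       # paper-style document (formulas, references)
    needs_structure: bool = False   # layout and heading hierarchy must survive


# Category -> representative tools, copied from the table above.
TOOLS = {
    "rule-based": ["pdftotext"],
    "vision model": ["mineru"],
    "ocr": ["Tesseract"],
    "deep learning": ["Nougat", "Grobid"],
}


def choose_category(traits: PdfTraits) -> str:
    """Pick a method category; the priority order here is an assumption."""
    if traits.is_scanned:
        return "ocr"              # no text layer, so text must be recognized
    if traits.is_academic:
        return "deep learning"    # tools specialized for academic literature
    if traits.needs_structure:
        return "vision model"     # preserves layout and hierarchy
    return "rule-based"           # simple and fast, structure is lost


if __name__ == "__main__":
    category = choose_category(PdfTraits(needs_structure=True))
    print(category, TOOLS[category])
```

In practice the categories can overlap (a scanned academic paper, for example), so a real pipeline would likely combine methods rather than pick exactly one.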
## Related concepts
- [[mineru]]: PDF parsing driven by a vision model