High-precision document parsing engine for LLM · RAG · Agent workflows
From wangbindl.github.io:
"We believe better data leads to better models"
MinerU treats data engineering as a standalone research problem, not just a prerequisite for model training.
(Translated from Chinese):
"Our directions (Multimodal LLMs, Intelligent Document Parsing, and Data-Centric AI) are the inevitable path to AGI.
Here, we only do two things:
- Solve the most challenging pain points in the industry
- Tackle unsolved problems in academia
Do work that can be remembered by peers and truly used by developers."
Not a research project: built to solve real problems during InternLM training.
"We are committed to democratizing access to high-quality data for AI research"

    # Install
    pip install -U "mineru[all]"

    # Basic usage (CLI)
    mineru -p <input_path> -o <output_path>

    # CPU-only mode
    mineru -p document.pdf -o ./output -b pipeline

    Input: PDF / Image / DOCX / PPTX / XLSX
                      ↓
    ┌────────────────────────────────────┐
    │     CLI / API / WebUI / Router     │
    ├────────────────────────────────────┤
    │ Backends:                          │
    │ ┌──────────┐  ┌──────────┐         │
    │ │ pipeline │  │ hybrid   │         │
    │ │ 85+ acc  │  │ 95+ acc  │         │
    │ │ CPU OK   │  │ GPU 8GB+ │         │
    │ └──────────┘  └──────────┘         │
    ├────────────────────────────────────┤
    │ Models: Layout / OCR / Formula     │
    │         / Table / VLM              │
    └────────────────────────────────────┘
                      ↓
    Output: Markdown / JSON (LLM-ready)
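
For batch jobs it can be convenient to drive the CLI from a script. The following is a minimal sketch, not an official MinerU API: it only assembles the `-p`, `-o`, and `-b` flags shown in the commands above and hands them to `subprocess`. The helper names `build_mineru_command` and `run_mineru` are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def build_mineru_command(input_path: str, output_dir: str,
                         backend: Optional[str] = None) -> List[str]:
    """Assemble the argument list for the `mineru` CLI.

    Only the flags that appear in this README (-p, -o, -b) are used.
    """
    cmd = ["mineru", "-p", str(input_path), "-o", str(output_dir)]
    if backend is not None:
        cmd += ["-b", backend]
    return cmd


def run_mineru(input_path: str, output_dir: str,
               backend: Optional[str] = None) -> int:
    """Run the CLI and return its exit code (hypothetical helper)."""
    if shutil.which("mineru") is None:
        raise FileNotFoundError("`mineru` not found on PATH; install it first")
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    completed = subprocess.run(build_mineru_command(input_path, output_dir, backend))
    return completed.returncode
```

Calling `run_mineru("document.pdf", "./output", backend="pipeline")` would then mirror the CPU-only command above. Passing the arguments as a list avoids shell quoting issues with paths that contain spaces.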

    output_dir/
    ├── document.md        # Markdown with layout preserved
    ├── images/            # Extracted images
    ├── content.json       # Structured JSON by reading order
    ├── middle.json        # Rich intermediate format
    └── model_output.json  # Raw model outputs (optional)
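
A downstream RAG or agent pipeline usually wants the Markdown and the reading-order JSON in memory. Below is a minimal sketch of a loader, assuming the file names match the listing above exactly (`document.md`, `content.json`, `images/`); real output names may differ per input file, and the JSON schema is not assumed here, so it is returned as parsed. `load_parse_result` is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_parse_result(output_dir: str) -> Dict[str, Any]:
    """Load the Markdown, the reading-order JSON, and the image file names.

    File names follow the directory listing in this README (an assumption).
    """
    root = Path(output_dir)
    result: Dict[str, Any] = {
        "markdown": (root / "document.md").read_text(encoding="utf-8"),
        "content": json.loads((root / "content.json").read_text(encoding="utf-8")),
        "images": [],
    }
    images_dir = root / "images"
    if images_dir.is_dir():
        # Sort for a stable order across platforms.
        result["images"] = sorted(p.name for p in images_dir.iterdir() if p.is_file())
    return result
```

The optional `middle.json` and `model_output.json` are left out on purpose, since most consumers only need the final Markdown and structured content.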

| Backend | Accuracy | Hardware | Best For |
|---|---|---|---|
| `pipeline` | 85+ | CPU (4GB RAM) | Fast, stable, offline |
| `hybrid-auto-engine` | 95+ | GPU (8GB+) | Recommended default |
| `vlm-auto-engine` | 95+ | GPU (8GB+) | High accuracy |
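
The table can be turned into a simple selection rule. The sketch below is an illustration only: the 8 GB threshold and backend names come from the table, while `choose_backend`, its parameters, and the idea of passing the detected GPU memory in by hand are assumptions (MinerU may well do its own detection).

```python
from typing import Optional

# Threshold taken from the "GPU (8GB+)" rows of the table above.
MIN_GPU_MEMORY_GB = 8.0


def choose_backend(gpu_memory_gb: Optional[float], prefer_vlm: bool = False) -> str:
    """Pick a backend name for the `-b` flag.

    gpu_memory_gb: available GPU memory in GB, or None when no GPU is present.
    prefer_vlm:    choose the VLM backend instead of the recommended hybrid one.
    """
    if gpu_memory_gb is None or gpu_memory_gb < MIN_GPU_MEMORY_GB:
        return "pipeline"  # CPU-capable fallback
    return "vlm-auto-engine" if prefer_vlm else "hybrid-auto-engine"
```

For example, a machine without a GPU gets `pipeline`, and a 24 GB card gets `hybrid-auto-engine` unless `prefer_vlm=True` is passed.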
- Multi-format: PDF, images, DOCX, PPTX, XLSX
- Layout preservation: Headers, columns, reading order
- Formula → LaTeX
- Table → HTML
- 109-language OCR
- Pure CPU support
- Multi-GPU scaling via mineru-router
- Streaming output for long documents
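
Because tables are emitted as HTML, they can be pulled back out of the Markdown with the standard library. This is a rough sketch under the assumption that tables use plain `<table>`, `<tr>`, `<td>`, and `<th>` tags; nested tables and `rowspan`/`colspan` are not handled. `TableExtractor` and `extract_tables` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TableExtractor(HTMLParser):
    """Collect every HTML table in a document as a list of rows of cell text."""

    def __init__(self) -> None:
        super().__init__()
        self.tables: List[List[List[str]]] = []
        self._in_table = False
        self._row: Optional[List[str]] = None
        self._cell: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._in_table = True
            self.tables.append([])
        elif tag == "tr" and self._in_table:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only text inside an open cell is kept.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
        elif tag == "table":
            self._in_table = False


def extract_tables(markdown_text: str) -> List[List[List[str]]]:
    """Return all tables found in the Markdown text, in document order."""
    parser = TableExtractor()
    parser.feed(markdown_text)
    parser.close()
    return parser.tables
```

Each table comes back as a list of rows, which is easy to hand to a CSV writer or a DataFrame constructor.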
| Metric | Value |
|---|---|
| GitHub Stars | 60K+ |
| Created | Sept 2024 |
| Lead | Bin Wang (Shanghai AI Lab) |
| License | Apache 2.0 + custom |
- mineru.net - Online demo
- GitHub
- Docs
- HuggingFace Demo