peerasak-u/mineru_guide.md

🎯 MinerU — Concept, Philosophy & Usage Guide

High-precision document parsing engine for LLM · RAG · Agent workflows

🧠 Creator's Philosophy (Bin Wang - 王斌)

From wangbindl.github.io:

1. Data-Centric AI Philosophy

"We believe better data leads to better models"

MinerU treats data engineering as a standalone research problem — not just a prerequisite for model training.

2. Solve Real Problems, Not Just Publish Papers

(Translated from Chinese):

"Our directions — Multimodal LLMs, Intelligent Document Parsing, and Data-Centric AI — are the inevitable path to AGI.

Here, we only do two things:

Solve the most challenging pain points in the industry

Tackle unsolved problems in academia

Do work that can be remembered by peers and truly used by developers."

3. Born from InternLM Pretraining

Not a research project — built to solve real problems during InternLM training.

4. Open Source for Democratization

"We are committed to democratizing access to high-quality data for AI research"

🚀 Quick Start

# Install
pip install -U "mineru[all]"

# Basic usage (CLI)
mineru -p <input_path> -o <output_path>

# CPU-only mode
mineru -p document.pdf -o ./output -b pipeline

📐 Architecture

Input: PDF / Image / DOCX / PPTX / XLSX
       ↓
┌────────────────────────────────────┐
│  CLI / API / WebUI / Router        │
├────────────────────────────────────┤
│  Backends:                         │
│  ┌─────────┐  ┌────────────┐       │
│  │pipeline │  │hybrid      │       │
│  │85+ acc  │  │95+ acc     │       │
│  │CPU OK   │  │GPU 8GB+   │       │
│  └─────────┘  └────────────┘       │
├────────────────────────────────────┤
│  Models: Layout / OCR / Formula   │
│  / Table / VLM                     │
└────────────────────────────────────┘
       ↓
Output: Markdown / JSON (LLM-ready)

📁 Output Format

output_dir/
├── document.md          # Markdown with layout preserved
├── images/              # Extracted images
├── content.json         # Structured JSON by reading order
├── middle.json          # Rich intermediate format
└── model_output.json    # Raw model outputs (optional)

⚙️ Backend Selection

Backend	Accuracy	Hardware	Best For
`pipeline`	85+	CPU (4GB RAM)	Fast, stable, offline
`hybrid-auto-engine`	95+	GPU (8GB+)	Recommended default
`vlm-auto-engine`	95+	GPU (8GB+)	High accuracy

🔑 Key Features

✅ Multi-format: PDF, images, DOCX, PPTX, XLSX
✅ Layout preservation: Headers, columns, reading order
✅ Formula → LaTeX
✅ Table → HTML
✅ 109-language OCR
✅ Pure CPU support
✅ Multi-GPU scaling via mineru-router
✅ Streaming output for long documents

📊 Project Stats

Metric	Value
GitHub Stars	60K+
Created	Sept 2024
Lead	Bin Wang (Shanghai AI Lab)
License	Apache 2.0 + custom

🔗 Resources

🌐 mineru.net - Online demo
📦 GitHub
📚 Docs
🤗 HuggingFace Demo