@peerasak-u
Created April 26, 2026 14:55
MinerU - Concept, Philosophy & Usage Guide (60K+ stars open-source document parsing tool)

🎯 MinerU: Concept, Philosophy & Usage Guide

High-precision document parsing engine for LLM · RAG · Agent workflows


🧠 Creator's Philosophy (Bin Wang - 王斌)

From wangbindl.github.io:

1. Data-Centric AI Philosophy

"We believe better data leads to better models"

MinerU treats data engineering as a standalone research problem, not just a prerequisite for model training.

2. Solve Real Problems, Not Just Publish Papers

(Translated from Chinese):

"Our directions (Multimodal LLMs, Intelligent Document Parsing, and Data-Centric AI) are the inevitable path to AGI.

Here, we only do two things:

  1. Solve the most challenging pain points in the industry
  2. Tackle unsolved problems in academia

Do work that can be remembered by peers and truly used by developers."

3. Born from InternLM Pretraining

MinerU did not start as a research project; it was built to solve real problems encountered during InternLM training.

4. Open Source for Democratization

"We are committed to democratizing access to high-quality data for AI research"


🚀 Quick Start

# Install
pip install -U "mineru[all]"

# Basic usage (CLI)
mineru -p <input_path> -o <output_path>

# CPU-only mode
mineru -p document.pdf -o ./output -b pipeline
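
The same calls can be scripted. Below is a minimal Python sketch that wraps the CLI, assuming `mineru` is installed and on the PATH. It uses only the `-p`, `-o`, and `-b` flags shown above; the helper names `build_mineru_command` and `run_mineru` are hypothetical and not part of MinerU.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def build_mineru_command(input_path: str, output_dir: str,
                         backend: Optional[str] = None) -> List[str]:
    """Build the argument list for the `mineru` CLI (hypothetical helper)."""
    cmd = ["mineru", "-p", str(input_path), "-o", str(output_dir)]
    if backend is not None:
        # -b selects the backend, e.g. "pipeline" for CPU-only runs
        cmd += ["-b", backend]
    return cmd


def run_mineru(input_path: str, output_dir: str,
               backend: Optional[str] = None) -> int:
    """Run the CLI and return its exit code; requires mineru to be installed."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    completed = subprocess.run(build_mineru_command(input_path, output_dir, backend))
    return completed.returncode
```

Calling `run_mineru("document.pdf", "./output", backend="pipeline")` would issue the CPU-only command from above. Keeping command construction separate from execution makes the wrapper easy to check without running the parser.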

📐 Architecture

Input: PDF / Image / DOCX / PPTX / XLSX
       ↓
┌────────────────────────────────────┐
│  CLI / API / WebUI / Router        │
├────────────────────────────────────┤
│  Backends:                         │
│  ┌─────────┐  ┌────────────┐       │
│  │pipeline │  │hybrid      │       │
│  │85+ acc  │  │95+ acc     │       │
│  │CPU OK   │  │GPU 8GB+    │       │
│  └─────────┘  └────────────┘       │
├────────────────────────────────────┤
│  Models: Layout / OCR / Formula    │
│  / Table / VLM                     │
└────────────────────────────────────┘
       ↓
Output: Markdown / JSON (LLM-ready)

📁 Output Format

output_dir/
├── document.md          # Markdown with layout preserved
├── images/              # Extracted images
├── content.json         # Structured JSON by reading order
├── middle.json          # Rich intermediate format
└── model_output.json    # Raw model outputs (optional)
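
To show how this layout might be consumed downstream, here is a small Python sketch that loads the Markdown, the structured JSON, and the list of extracted images from one output folder. The file names come from the tree above and are passed as parameters because a real run may name them differently; `load_output` is a hypothetical helper, and nothing is assumed about the JSON schema beyond it being valid JSON.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_output(output_dir: str,
                markdown_name: str = "document.md",
                content_name: str = "content.json") -> Dict[str, Any]:
    """Load Markdown, structured JSON, and image names from one output folder.

    The default file names mirror the tree above (an assumption); override
    them if your run produces different names.
    """
    root = Path(output_dir)
    markdown = (root / markdown_name).read_text(encoding="utf-8")
    content = json.loads((root / content_name).read_text(encoding="utf-8"))
    images_dir = root / "images"
    # images/ is optional: documents without figures may not have it
    images = sorted(p.name for p in images_dir.iterdir()) if images_dir.is_dir() else []
    return {"markdown": markdown, "content": content, "images": images}
```

The Markdown is usually what goes into an LLM prompt or a RAG chunker, while the JSON is the better choice when reading order or block types matter.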

⚙️ Backend Selection

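| Backend            | Accuracy | Hardware      | Best For              |
|--------------------|----------|---------------|-----------------------|
| pipeline           | 85+      | CPU (4GB RAM) | Fast, stable, offline |
| hybrid-auto-engine | 95+      | GPU (8GB+)    | Recommended default   |
| vlm-auto-engine    | 95+      | GPU (8GB+)    | High accuracy         |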
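
The selection rule implied by this table can be written down as a tiny function. This is only a sketch: `choose_backend` is a hypothetical helper, the 8 GB threshold is taken from the Hardware column, and the returned strings are the backend names listed above.

```python
from typing import Optional


def choose_backend(gpu_vram_gb: Optional[float]) -> str:
    """Pick a backend name from the table above (hypothetical helper).

    gpu_vram_gb is None when no GPU is available.
    """
    if gpu_vram_gb is not None and gpu_vram_gb >= 8:
        return "hybrid-auto-engine"  # listed as the recommended default
    return "pipeline"  # CPU-capable fallback
```

The result can be passed to the `-b` flag of the CLI. The sketch never returns `vlm-auto-engine`, since the table lists it with the same hardware needs as the recommended default.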

🔑 Key Features

  • ✅ Multi-format: PDF, images, DOCX, PPTX, XLSX
  • ✅ Layout preservation: Headers, columns, reading order
  • ✅ Formula → LaTeX
  • ✅ Table → HTML
  • ✅ 109-language OCR
  • ✅ Pure CPU support
  • ✅ Multi-GPU scaling via mineru-router
  • ✅ Streaming output for long documents
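
Because tables are emitted as HTML, downstream code can turn them back into rows with the Python standard library alone. The sketch below is one way to do that; the `TableRows` class, the `html_table_to_rows` function, and any sample table fed to it are illustrative assumptions, not MinerU output or API.

```python
from html.parser import HTMLParser
from typing import List


class TableRows(HTMLParser):
    """Collect the cell text of each <tr> in an HTML table."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._cell: List[str] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])  # start a new row
        elif tag in ("td", "th"):
            self._in_cell = True
            self._cell = []

    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            if not self.rows:  # tolerate a cell that appears without a <tr>
                self.rows.append([])
            self.rows[-1].append("".join(self._cell).strip())
            self._in_cell = False


def html_table_to_rows(html: str) -> List[List[str]]:
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows
```

This handles flat tables only; merged cells (`rowspan`, `colspan`) and nested tables would need extra bookkeeping.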

📊 Project Stats

| Metric       | Value                      |
|--------------|----------------------------|
| GitHub Stars | 60K+                       |
| Created      | Sept 2024                  |
| Lead         | Bin Wang (Shanghai AI Lab) |
| License      | Apache 2.0 + custom        |

🔗 Resources
