@lauhub
Last active June 8, 2021 22:19
Generate statistics about font size of text from PDF document

README

Use this script to report how many characters of a PDF file are set in each font size.

My goal was to check if font size 12 was mainly used inside some documents.

Dependency: pdfminer.six (provides the pdfminer modules imported by the script)

Usage:

pdfminer_font_size_statistics.py filepath.pdf

Sample output:

>>> Percentage for 12.0 font size: 96.72% (12.0pt: 88456 chars for total of 91453 chars)
Percentage for 36.0 font size: 0.03% (36.0pt: 24 chars for total of 91453 chars)
Percentage for 27.0 font size: 0.03% (27.0pt: 30 chars for total of 91453 chars)
Percentage for 16.0 font size: 0.25% (16.0pt: 227 chars for total of 91453 chars)
Percentage for 6.5 font size: 0.55% (6.5pt: 499 chars for total of 91453 chars)
Percentage for 10.0 font size: 0.08% (10.0pt: 72 chars for total of 91453 chars)
Percentage for 13.0 font size: 0.86% (13.0pt: 784 chars for total of 91453 chars)
Percentage for 9.0 font size: 1.38% (9.0pt: 1261 chars for total of 91453 chars)
Percentage for 11.0 font size: 0.10% (11.0pt: 87 chars for total of 91453 chars)
Percentage for 14.0 font size: 0.01% (14.0pt: 13 chars for total of 91453 chars)
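
The report can be turned into a yes/no answer to the question "is 12pt mainly used?". Here is a minimal sketch, assuming the per-size character counts are available as a dict. The helper names `share_of_size` and `is_mainly_size` and the 90% threshold are illustrative choices, not part of the script.

```python
def share_of_size(char_counts, target_size):
    """Return the fraction (0.0 to 1.0) of characters set in target_size."""
    total = sum(char_counts.values())
    if total == 0:
        return 0.0
    return char_counts.get(target_size, 0) / total


def is_mainly_size(char_counts, target_size=12.0, threshold=0.9):
    """True when target_size covers at least `threshold` of all characters."""
    return share_of_size(char_counts, target_size) >= threshold


# Character counts copied from the sample output above.
sample_counts = {
    12.0: 88456, 36.0: 24, 27.0: 30, 16.0: 227, 6.5: 499,
    10.0: 72, 13.0: 784, 9.0: 1261, 11.0: 87, 14.0: 13,
}

print(f"{share_of_size(sample_counts, 12.0):.2%}")  # 96.72%, as in the report
print(is_mainly_size(sample_counts))
```

The threshold is arbitrary. Pick whatever share counts as "mainly" for the documents being checked.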

Inspired by this answer

#!/usr/bin/env python3
# -*- coding: utf8 -*-
from sys import argv

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTAnno, LTChar, LTTextContainer


def extract_from_file(path):
    # Collect one [font size, text, text length] entry per text item.
    extracted_data = []
    for page_layout in extract_pages(path):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for item in element:
                    if isinstance(item, LTChar):
                        extracted_data.append([round(item.size, 1), item.get_text(), len(item.get_text())])
                    elif isinstance(item, LTAnno):
                        # Layout-generated spaces and newlines carry no font size.
                        pass
                    else:
                        # A text line: the whole line is attributed to the
                        # font size of its first real character.
                        for character in item:
                            if isinstance(character, LTChar):
                                extracted_data.append([round(character.size, 1), item.get_text(), len(item.get_text())])
                                break

    # Sum the character counts per font size.
    font_size_stats = dict()
    font_size_count = 0
    for info in extracted_data:
        font_size = info[0]
        txt_length = info[2]
        font_size_count += txt_length
        if font_size in font_size_stats:
            font_size_stats[font_size] += txt_length
        else:
            font_size_stats[font_size] = txt_length

    # Print one line per font size; the 12pt line is flagged with ">>> ".
    for size in font_size_stats:
        print("{prefix}Percentage for {s} font size: {p:.2f}% ({s}pt: {c} chars for total of {t} chars)"
              .format(s=size, p=100 * font_size_stats[size] / font_size_count,
                      c=font_size_stats[size], t=font_size_count, prefix=">>> " if size == 12 else ""))


if __name__ == '__main__':
    extract_from_file(argv[1])
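
The tallying step can also be tried without a PDF. The sketch below separates it from the pdfminer traversal, uses `collections.Counter` in place of the manual dict bookkeeping, and lists sizes by descending share. The function names `count_chars_by_size` and `format_report`, the `(size, text)` input shape, and the shortened report format are assumptions made for illustration; the script itself does not work this way.

```python
from collections import Counter


def count_chars_by_size(sized_runs):
    """Sum text lengths per font size from (size, text) pairs."""
    counts = Counter()
    for size, text in sized_runs:
        # Round as the script does, so 12.02 and 12.0 land in one bucket.
        counts[round(size, 1)] += len(text)
    return counts


def format_report(counts):
    """Return one report line per font size, largest share first."""
    total = sum(counts.values())
    lines = []
    for size, n in counts.most_common():
        lines.append(f"{size}pt: {n} of {total} chars ({100 * n / total:.2f}%)")
    return lines


# Made-up runs standing in for text extracted from a PDF.
runs = [(12.0, "Body text. "), (12.02, "More body."), (16.0, "Title")]
for line in format_report(count_chars_by_size(runs)):
    print(line)
```

Keeping the counting separate from the extraction makes it easy to test with hand-written data, as in the `runs` list.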