@outcassed
Last active April 26, 2016 17:24
Look for abbreviations in pattern files

I put a commented version that explains what this does at the bottom

  1. Install http://brew.sh/
  2. In Terminal: brew install poppler hunspell
  3. mkdir ~/Dictionaries
  4. Download dictionaries from https://cgit.freedesktop.org/libreoffice/dictionaries/tree/en and put them in there. Use the "plain" link. You need both the .dic and the .aff files.
  5. pdftotext pattern.pdf - | tr -s '[:blank:][:punct:]' '\n' | awk 'length($1) >= 2 && length($1) <= 5 { print $1 }' | hunspell -d ~/Dictionaries/en_US -a | awk '{print $1,$2}' | grep -v '^[\*@]' | tr -d '&#' | sort | uniq -c
   3  10cm
   2  4sts
   1  5cm
   1  60cm
   2  7mm
   1  Aran
   3  Liesl
   3  bo
   1  bo2
   1  bo4
   1  dpn
   1  dpns
   2  grey  # funny, this is probably because I used the en_US dictionary instead of en_GB
  19  k1
   2  k17
   1  k2
  10  k2tog
  21  k3
   2  k31
   5  k4
   2  kwise
   5  liesl
   5  pdf
   2  psso
   3  pwise
   6  rnd
   6  sl
  22  sl1
  40  sts
   3  ws
   6  www
   3  yds

outcassed commented Apr 26, 2016

Explanatory notes:

  1. Install http://brew.sh and brew install poppler hunspell
    This is the easiest way to install the PDF to text converter and the spell checker.
  2. mkdir ~/Dictionaries, download dictionaries
    Unfortunately you have to manually install the dictionaries
  3. The long command
    To see how this works, you can start from the first command up to the pipe (|) and run it, and then add more pipes and commands to the end, one at a time, and run to see what you get. It's fun :)

pdftotext pattern.pdf -
extract the text from the pattern

tr -s '[:blank:][:punct:]' '\n'
replace all spaces and punctuation with newlines so that we get one word per line

awk 'length($1) >= 2 && length($1) <= 5 { print $1 }'
print out all words that are 2-5 characters long

hunspell -d ~/Dictionaries/en_US -a
run them through the spell checker. In this example we are using the en_US dictionary

awk '{print $1,$2}'
the spell checker prints out a lot of stuff, we only need to see the result of the check and the word that was submitted

grep -v '^[\*@]'
lines that start with * or @ are not bad spellings

tr -d '&#'
bad spellings get a & or # symbol, remove those symbols, we don't need to see them
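To try the three filtering steps above without a PDF or dictionaries, here is a minimal sketch that feeds them a few canned lines. The lines only imitate the shape of hunspell -a output (a banner starting with @, a * for a known word, & or # for unknown words); they are illustrative, not captured from a real run, the blank separator lines are left out, and fake_hunspell_output is a made-up helper name.

```shell
# Illustrative lines imitating `hunspell -a` output (not from a real run):
# a version banner, one correctly spelled word, and two unknown words.
fake_hunspell_output() {
  cat <<'EOF'
@(#) International Ispell Version 3.2.06 (but really Hunspell 1.7.0)
*
& sts 3 0: sits, sets, st
# k2tog 0
EOF
}

# Keep the first two fields, drop the banner and the known words,
# then strip the & and # markers from what is left.
unknown=$(fake_hunspell_output | awk '{print $1,$2}' | grep -v '^[\*@]' | tr -d '&#')
printf '%s\n' "$unknown"
```

Each surviving line keeps a leading space where the marker used to be, which is harmless for the count that follows.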

sort
sort the results so we can do a unique count

uniq -c
do a unique count
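If you want to experiment before pointing this at a real pattern, the word-splitting and counting steps can be tried on a plain string. This is only a sketch: short_words is a made-up helper name and the sample line is invented to stand in for the pdftotext output.

```shell
# Hypothetical helper: split stdin into one word per line
# and keep only words that are 2-5 characters long.
short_words() {
  tr -s '[:blank:][:punct:]' '\n' | awk 'length($1) >= 2 && length($1) <= 5 { print $1 }'
}

# Made-up sample line standing in for the output of `pdftotext pattern.pdf -`.
sample='Rnd 1: k2tog, sl1, k2tog (4sts remaining)'

words=$(printf '%s\n' "$sample" | short_words)

# Count how often each short word appears.
printf '%s\n' "$words" | sort | uniq -c
```

Here "1" and "remaining" are dropped for being too short and too long, and k2tog is counted twice. The order of the lines depends on your locale's sort rules.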


outcassed commented Apr 26, 2016

If you want to find abbreviations by using a master list instead of a spell checker

  1. Install http://brew.sh/
  2. In Terminal, use brew to install the PDF to text utility: brew install poppler
  3. Put abbreviations in abbreviations.txt. Note that k1 will also match things like k17. I figure it's safer to match partial strings.
  4. Run: pdftotext pattern.pdf - | tr -s '[:blank:][:punct:]' '\n' | grep -io -f abbreviations.txt | sort | uniq -c

  21 k1
  10 k2tog



outcassed commented Apr 26, 2016

Add this to the end to colorize your known abbreviations 🌈

| grep --color -f abbreviations.txt -e '$'
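The trick here is the extra pattern $: it matches the end of every line, so no line is filtered out, and only the text that matches an entry in the abbreviations file gets colored. A small sketch to see it in isolation, with a made-up helper name (highlight_known), a temporary file standing in for abbreviations.txt, and invented counts in place of real uniq -c output:

```shell
# Hypothetical helper: pass every line through, highlighting
# any text that matches a pattern from the file given as $1.
highlight_known() {
  grep --color=auto -f "$1" -e '$'
}

# Stand-in for abbreviations.txt.
abbrevs=$(mktemp)
printf 'k1\nk2tog\n' > "$abbrevs"

# Made-up counts standing in for the output of `sort | uniq -c`.
counts='   3 bo
  19 k1
  10 k2tog'

# In a terminal, k1 and k2tog are colored; bo is printed plain.
printf '%s\n' "$counts" | highlight_known "$abbrevs"

# Every input line survives the filter.
kept=$(printf '%s\n' "$counts" | highlight_known "$abbrevs" | wc -l | tr -d ' ')

rm -f "$abbrevs"
```

Anything left uncolored is a candidate abbreviation that is not on your list yet.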
