#!/usr/bin/perl
use open qw(:std :utf8);
# Program to filter Wikipedia XML dumps to "clean" text consisting only of lowercase
# letters (a-z, converted from A-Z) and single spaces (never consecutive).
# All other characters are converted to spaces. Only text which normally appears
# in the web browser is displayed. Tables are removed. Image captions are
# preserved. Links are converted to normal text. Digits are spelled out.