#!/usr/bin/perl
use open qw(:std :utf8);
# Program to filter Wikipedia XML dumps to "clean" text consisting only of lowercase
# letters (a-z, converted from A-Z), and spaces (never consecutive).
# All other characters are converted to spaces. Only text which normally appears
# in the web browser is displayed. Tables are removed. Image captions are
# preserved. Links are converted to normal text. Digits are spelled out.