This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| #!/usr/bin/perl | |
| use open qw(:std :utf8); | |
| # Program to filter Wikipedia XML dumps to "clean" text consisting only of lowercase | |
| # letters (a-z, converted from A-Z), and spaces (never consecutive). | |
| # All other characters are converted to spaces. Only text which normally appears | |
| 1 #!/usr/bin/perl | |
| # in the web browser is displayed. Tables are removed. Image captions are | |
| # preserved. Links are converted to normal text. Digits are spelled out. |