# Wikimedia article extractions
# STEP 1: Use the Medialab tool to generate doc-based XML from the Wikipedia dump
----------------------------
- Get the latest copy of the articles from the Wikipedia download page:
```shell
> wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
```
- Use Medialab's extractor tool: http://medialab.di.unipi.it/wiki/Wikipedia_Extractor
- Download and save the Python script as WikiExtractor.py: http://medialab.di.unipi.it/Project/SemaWiki/Tools/WikiExtractor.py
- Run the Python script to extract the articles, with the Wikipedia markup removed, into doc XML nodes.
- This might take some time, depending on the processing capacity of your computer.
```shell
> bzcat enwiki-latest-pages-articles.xml.bz2 |
  WikiExtractor.py -cb 250K -o extracted
```
- To combine all of the extracted text into a single file, run:
```shell
> find extracted -name '*bz2' -exec bunzip2 -c {} \; > wiki_parsed.xml
> rm -rf extracted
```
# STEP 2: Prepare for extraction
--------------------------------
- Remove any leftover incomplete tags (a few missing or incomplete HTML tags may be passed through unconverted by the Python script above).
- This also creates a backup of the original file; if you want to skip the backup, pass the -i flag with an empty '' argument instead of -i.bak.
```shell
> sed -i.bak -e 's/<[^doc>/!][^ >][^>]*>//g;s/<\/[^doc>][^>]*>//g' wiki_parsed.xml
```
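As a quick sanity check, here is a minimal illustration (an assumption: GNU sed) of what that expression does on a one-line sample: stray tags such as `<span>` are stripped, while the `<doc>` wrappers survive because tags beginning with d, o, or c are excluded from the match.

```shell
# Stray <span> tags are removed; <doc> and </doc> are kept intact.
echo '<doc id="1"><span class="x">text</span></doc>' \
  | sed -e 's/<[^doc>/!][^ >][^>]*>//g;s/<\/[^doc>][^>]*>//g'
# -> <doc id="1">text</doc>
```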
# STEP 3: Generate a valid XML to be parsed
-----------------------------------------
- Prepend a head tag so the file becomes proper XML that XML parsers will accept:
```shell
> echo "<head>" | cat - wiki_parsed.xml > /tmp/out && mv /tmp/out wiki_parsed.xml
```
OR
```shell
> sed -i '1s/^/<head> /' wiki_parsed.xml
```
- Check whether the last doc tag is closed; if it is not, close it.
- You can inspect the last lines with the following command:
```shell
> tail -200 wiki_parsed.xml
```
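Rather than eyeballing the tail and closing a truncated final article by hand, a small sketch (assuming, as above, that the file is wiki_parsed.xml and that a complete file ends with a bare `</doc>` line) that appends the closing tag only when it is missing:

```shell
# Append a closing </doc> only if the file does not already end with one.
if [ "$(tail -n 1 wiki_parsed.xml)" != "</doc>" ]; then
  echo "</doc>" >> wiki_parsed.xml
fi
```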
- Close the head tag:
```shell
> echo "</head>" >> wiki_parsed.xml
```
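Optionally, if xmllint (from libxml2) is available, you can confirm the file is now well-formed before splitting. Treat this as a best-effort check: unescaped characters such as stray `&` left in the article text can still trip a strict parser.

```shell
# Exits non-zero and reports the offending line if the XML is not well-formed.
xmllint --noout wiki_parsed.xml && echo "wiki_parsed.xml is well-formed"
```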
# STEP 4: Split the large XML into separate files, one per article
-------------------------------------------------------------------------------------------
- In case we want to split the big XML into smaller files, one for each article:
```shell
> awk '/<doc/{x="F"++i;}{print > x;}' wiki_parsed.xml
```
- More examples: http://www.theunixschool.com/2012/06/awk-10-examples-to-split-file-into.html
OR
- Use https://github.com/seamusabshere/xml_split (PREFERRED).
- Requires sgrep or sgrep2 (brew install sgrep).
- Assuming you want to store the split article files in a directory called 'splitted', with file names prefixed by monday:
```shell
> xml_split wiki_parsed.xml doc splitted/monday
```
- Might take some time, but should be quite fast since it uses sgrep.
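As a rough sanity check after splitting (a sketch, assuming the file and directory names used above), the number of files produced should match the number of `<doc ...>` openings in the source:

```shell
# Compare <doc ...> openings in the source with the number of split files.
docs=$(grep -c '<doc' wiki_parsed.xml 2>/dev/null || true)
files=$(ls splitted 2>/dev/null | wc -l)
echo "docs: $docs, files: $files"
```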
# STEP 5: Run the rake task to extract the data and dump it into the database
-------------------------------------------------------------------------------------------
Assuming the split docs are in the 'splitted' directory:
```shell
> bundle exec rake 'wiki_extractor:extract_to_db[splitted]'
```
And we should have all the articles populated into the wiki_articles table.
Phew! Now, this is easy once we figure out the proper tools.