@xecutioner
Last active August 29, 2015 13:57
# Wikimedia article extractions
# STEP 1: Use the Medialab tool to generate doc-based xml from the Wikipedia dump.
----------------------------
- Get the latest copy of the articles from the Wikipedia download page.
```shell
> wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
```
- Use Medialab's extractor tool: http://medialab.di.unipi.it/wiki/Wikipedia_Extractor
- Download the python script and save it as WikiExtractor.py: http://medialab.di.unipi.it/Project/SemaWiki/Tools/WikiExtractor.py
- Run the python script to strip the wikipedia markup from the articles and wrap them in doc xml nodes.
- This might take some time depending on the processing capacity of your computer.
```shell
> bzcat enwiki-latest-pages-articles.xml.bz2 |
  python WikiExtractor.py -cb 250K -o extracted
```
- To combine all of the extracted text into a single file:
```shell
> find extracted -name '*bz2' -exec bunzip2 -c {} \; > wiki_parsed.xml
> rm -rf extracted
```
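As a quick sanity check on the merged file, note that the extractor wraps each article in a doc node, so counting the opening tags gives a rough article count:

```shell
> grep -c '<doc' wiki_parsed.xml
```

If this prints 0, the extraction or the merge step went wrong.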
# STEP 2 (preparing for extraction)
--------------------------------
- Remove any leftover incomplete tags (a few missing or malformed html tags may be passed through unconverted by the python script above).
- The -i.bak flag also creates a backup of the original; if you want to skip the backup, pass -i with an empty '' argument.
```shell
> sed -i.bak -e 's/<[^doc>/!][^ >][^>]*>//g;s/<\/[^doc>][^>]*>//g' wiki_parsed.xml
```
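One way to verify the cleanup worked is to list any tags other than the doc wrappers that are still present in the file. This is just a sketch, not part of the original workflow; an empty result means the file is clean:

```shell
> grep -o '<[^>]*>' wiki_parsed.xml | grep -v -e '^<doc' -e '^</doc>' | sort | uniq -c | sort -rn | head
```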
# STEP 3 : Generate a valid xml to be parsed.
-----------------------------------------
- Prepend a head tag so that xml parsers can treat the file as a proper xml document.
```shell
> echo "<head>" | cat - wiki_parsed.xml > /tmp/out && mv /tmp/out wiki_parsed.xml
```
OR
```shell
> sed -i '1s/^/<head> /' wiki_parsed.xml
```
- Check whether the last doc tag is closed; if not, close it.
- You can inspect the last lines with the following command:
```shell
> tail -200 wiki_parsed.xml
```
- Close the head tag
```shell
> echo "</head>" >> wiki_parsed.xml
```
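Before handing the file to a parser, it is worth confirming that it is now well-formed xml. A minimal streaming check using Python's stdlib sax parser (any validator, such as xmllint, would do just as well):

```shell
> python3 -c "import xml.sax; xml.sax.parse('wiki_parsed.xml', xml.sax.ContentHandler())" && echo "well-formed"
```

A streaming parser is used here because the full dump is far too large to load into memory; if this command fails, it reports the line of the first malformed tag or entity.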
# STEP 4 : Split the large xml into different files with each file being an article.
-------------------------------------------------------------------------------------------
- In case we want to split the big xml into smaller files, one per article:
```shell
> awk '/<doc/{x="F"++i;}{print > x;}' wiki_parsed.xml
```
- more http://www.theunixschool.com/2012/06/awk-10-examples-to-split-file-into.html
OR
- https://github.com/seamusabshere/xml_split (PREFERRED).
- Requires sgrep or sgrep2, so: brew install sgrep.
- Assuming you want to store the split article files in a directory called 'splitted', with the file names prefixed by monday:
```shell
> xml_split wiki_parsed.xml doc splitted/monday
```
- Might take some time, but should be really fast since it uses sgrep.
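To confirm the split did not drop anything (assuming the 'splitted' directory contains only the generated files), compare the article count in the merged file against the number of files produced:

```shell
> grep -c '<doc' wiki_parsed.xml
> ls splitted | wc -l
```

The two numbers should match, one file per doc node.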
# STEP 5: Run the rake task to get the data out and dump it into the database.
-------------------------------------------------------------------------------------------
Assuming the split docs are in the 'splitted' directory:
```shell
> bundle exec rake 'wiki_extractor:extract_to_db[splitted]'
```
And we should have all the articles populated into the wiki_articles table.
Phew! This was easy once we figured out the proper tools.