- Get the latest copy of the articles from the Wikipedia download page:
> wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
- Use Medialab's extractor tool: http://medialab.di.unipi.it/wiki/Wikipedia_Extractor
- Download and save the Python script as WikiExtractor.py: http://medialab.di.unipi.it/Project/SemaWiki/Tools/WikiExtractor.py
- Run the Python script to extract the articles, stripping the Wikipedia markup and wrapping each article in a doc XML node.
- This might take some time depending on the processing capacity of your computer.
> bzcat enwiki-latest-pages-articles.xml.bz2 | python WikiExtractor.py -cb 250K -o extracted
- To combine all of the extracted text into a single file, issue:
> find extracted -name '*bz2' -exec bunzip2 -c {} \; > wiki_parsed.xml
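The combine step can be tried out on a miniature sample first; a sketch assuming bzip2 is installed (the demo directory and shard file names are illustrative, not what WikiExtractor actually emits):

```shell
# Two small compressed shards standing in for WikiExtractor's output.
mkdir -p extracted_demo
printf '<doc id="1">\nfirst article\n</doc>\n' | bzip2 > extracted_demo/wiki_00.bz2
printf '<doc id="2">\nsecond article\n</doc>\n' | bzip2 > extracted_demo/wiki_01.bz2

# Same find/bunzip2 pipeline as above, run on the demo directory.
find extracted_demo -name '*bz2' -exec bunzip2 -c {} \; > combined.xml

# Both doc nodes should now be present in the combined file.
grep -c '<doc ' combined.xml   # -> 2
```

Counting the `<doc ` headers afterwards is a quick way to confirm nothing was dropped during the merge.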
> rm -rf extracted
- Remove any leftover incomplete tags (a few missing or malformed HTML tags may be passed through unconverted by the Python script above).
- The -i.bak flag also creates a backup of the original; if you want to skip the backup, issue the -i flag with an empty '' argument.
> sed -i.bak -e 's/<[^doc>/!][^ >][^>]*>//g;s/<\/[^doc>][^>]*>//g' wiki_parsed.xml
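To see the kind of cleanup this step performs, here is a simplified sketch that strips an explicit list of stray tag names instead of the character-class pattern above (the tag list, sample text, and file name are illustrative):

```shell
# Sample extractor output with stray inline HTML left inside a doc node.
printf '<doc id="1" title="Test">\nSome <b>bold</b> text with a stray <ref name="x">tag.\n</doc>\n' > sample.txt

# Strip the listed stray tags while leaving <doc ...> and </doc> intact.
sed -E 's#</?(b|i|ref|span|div)[^>]*>##g' sample.txt
```

The doc wrapper survives untouched while the inline `<b>`, `</b>`, and `<ref ...>` tags are removed from the article text.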
- Prepend a head tag so the file is valid XML for XML parsers:
> echo "<head>" | cat - wiki_parsed.xml > /tmp/out && mv /tmp/out wiki_parsed.xml
OR
> sed -i '1s/^/<head> /' wiki_parsed.xml
- Check whether the last doc tag is closed; if not, close it.
- You can inspect the last lines with the following command:
> tail -200 wiki_parsed.xml
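The check-and-close step can also be scripted instead of eyeballed; a minimal sketch (the sample file name check_sample.xml is illustrative):

```shell
# Sample parsed file whose last article is missing its closing tag.
printf '<doc id="1">\nfirst article\n</doc>\n<doc id="2">\ntruncated article\n' > check_sample.xml

# If the last opening <doc appears after the last </doc>, the final
# article is unterminated, so append a closing tag.
last_open=$(grep -n '<doc ' check_sample.xml | tail -1 | cut -d: -f1)
last_close=$(grep -n '</doc>' check_sample.xml | tail -1 | cut -d: -f1)
if [ "${last_open:-0}" -gt "${last_close:-0}" ]; then
  echo '</doc>' >> check_sample.xml
fi

tail -1 check_sample.xml   # -> </doc>
```

After the fix, the counts of opening and closing doc tags should match.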
- Close the head tag:
> echo "</head>" >> wiki_parsed.xml
- In case we want to split the big XML into smaller files, one for each article:
> awk '/<doc/{x="F"++i;}{print > x;}' wiki_parsed.xml
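The awk one-liner can be sanity-checked on a tiny sample: each `<doc` header starts a new output file F1, F2, ... in the current directory (the demo.xml file name is illustrative):

```shell
# Two-article sample standing in for the full parsed dump.
printf '<doc id="1">\nfirst\n</doc>\n<doc id="2">\nsecond\n</doc>\n' > demo.xml

# Same split as above: a new file begins at every line matching /<doc/.
# Note that </doc> does NOT match, since '<' is followed by '/' there.
awk '/<doc/{x="F"++i;}{print > x;}' demo.xml

head -1 F2   # -> <doc id="2">
```

Each generated file should contain exactly one complete doc node, header and closing tag included.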
- More examples: http://www.theunixschool.com/2012/06/awk-10-examples-to-split-file-into.html
OR
- https://github.com/seamusabshere/xml_split (PREFERRED).
- Requires sgrep or sgrep2, so: brew install sgrep.
- Assuming you want to store the split article files in a directory called 'splitted', with the file names prefixed by 'monday':
> xml_split wiki_parsed.xml doc splitted/monday
- Might take some time, but should be really fast as it uses sgrep.
- Assuming the split docs are in the 'splitted' directory:
> bundle exec rake 'wiki_extractor:extract_to_db[splitted]'
- And we should have all the articles populated into the wiki_articles table. Phew! This is easy once we figure out the proper tools.