Last active: August 29, 2015 13:57
Revisions
xecutioner revised this gist
May 15, 2014 (1 changed file with 20 additions and 42 deletions)
```shell
> bzcat enwiki-latest-pages-articles.xml.bz2 | WikiExtractor.py -cb 250K -o extracted
```

- In order to combine the whole extracted text into a single file, one can issue:

```shell
> sed -i.bak -e 's/<[^d][^o>]*[^c>]*>//g' wiki_parsed.xml
```

# STEP 3: Split into article-based files.
-----------------------------------------

```shell
> mkdir -p splitted
> awk 'BEGIN { system("mkdir -p splitted/sub"++j) } /<doc/{x="F"++i".xml";}{ if (i%1995==0 ){ ++i; system("mkdir -p splitted/sub"++j"/"); } else{ print >> ("splitted/sub"j"/"x); close("splitted/sub"j"/"x); } }' wiki_parsed.xml
```

# STEP 4: Run the rake task to get the data out and dump it into the database.
-------------------------------------------------------------------------------------------
Considering the split docs are in the 'splitted' directory:

```shell
> bundle exec rake 'wiki_extractor:extract_to_db[splitted]'
```
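The subdirectory-splitting awk one-liner above is dense, and as written it can skip a line whenever the article counter lands on the `i%1995==0` boundary. Below is a minimal sketch of the same idea (a new `splitted/subN` directory every N articles, one file per article), simplified so no lines are dropped; the batch size of 2 and all file names here are illustrative, not from the gist:

```shell
# Tiny stand-in for wiki_parsed.xml: three one-line <doc> articles.
cat > /tmp/wiki_sub_demo.xml <<'EOF'
<doc id="1">a</doc>
<doc id="2">b</doc>
<doc id="3">c</doc>
EOF

# Start a new splitted/subJ directory every 2 articles; write each
# article to its own F<i>.xml inside the current subdirectory.
cd /tmp && rm -rf splitted && awk '
  /<doc/ { ++i
           if ((i - 1) % 2 == 0) { ++j; system("mkdir -p splitted/sub" j) }
           x = "splitted/sub" j "/F" i ".xml" }
  { print >> x; close(x) }' wiki_sub_demo.xml

# Inspect the result: sub1 holds F1/F2, sub2 holds F3.
find splitted -type f | sort
```

Note the demo input starts with a `<doc` line; on a file with leading non-article lines you would need to initialize `x` first.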
xecutioner revised this gist
Mar 17, 2014 (1 changed file with 1 addition and 0 deletions)
OR

- Considering you want to store the split article files in a directory called 'splitted', with the file names prefixed by monday:

```shell
> gem install xml_split
> xml_split wiki_parsed.xml doc splitted/monday
```

- Might take some time, but it should be really fast since it uses sgrep.
xecutioner revised this gist
Mar 17, 2014 (1 changed file with 2 additions and 2 deletions)
# STEP 3: Generate a valid XML to be parsed.
-----------------------------------------
- Prepend a head tag so XML parsers see a single valid root:

```shell
> echo "<head>" | cat - yourfile > /tmp/out && mv /tmp/out yourfile
```

- Might take some time, but it should be really fast since it uses sgrep.

# STEP 5: Run the rake task to get the data out and dump it into the database.
-------------------------------------------------------------------------------------------
Considering the split docs are in the 'splitted' directory:

```shell
> bundle exec rake 'wiki_extractor:extract_to_db[splitted]'
```
xecutioner renamed this gist
Mar 17, 2014 (1 changed file with 0 additions and 0 deletions)
File renamed without changes.
xecutioner revised this gist
Mar 17, 2014 (1 changed file with 71 additions and 17 deletions)
# STEP 1: Use Media Lab's tool to generate doc-based XML from the Wikipedia dump.
----------------------------
- Get the latest copy of the articles from the Wikipedia download page:

```shell
> wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
```

- Use medialab's extractor tool: http://medialab.di.unipi.it/wiki/Wikipedia_Extractor
- Download and save the Python script as WikiExtractor.py: http://medialab.di.unipi.it/Project/SemaWiki/Tools/WikiExtractor.py
- Run the Python script to strip the Wikipedia markup and extract the articles into doc XML nodes.
- This might take some time depending upon the processing capacity of your computer.

```shell
> bzcat enwiki-latest-pages-articles.xml.bz2 | WikiExtractor.py -cb 250K -o extracted
```

- In order to combine the whole extracted text into a single file, one can issue:

```shell
> find extracted -name '*bz2' -exec bunzip2 -c {} \; > wiki_parsed.xml
> rm -rf extracted
```

# STEP 2 (preparing for extraction)
--------------------------------
- Remove any untouched incomplete tags (a few missing or incomplete HTML tags may be passed through unconverted by the above Python script).
- This also creates a backup of the original; if you want to skip the backup, issue the -i flag with an empty '' argument.

```shell
> sed -i.bak -e 's/<[^doc>/!][^ >][^>]*>//g;s/<\/[^doc>][^>]*>//g' wiki_parsed.xml
```

# STEP 3: Generate a valid XML to be parsed.
-----------------------------------------
- Prepend a head tag so XML parsers see a single valid root:

```shell
> echo "<head>" | cat - yourfile > /tmp/out && mv /tmp/out yourfile
```

OR

```shell
> sed -i '1s/^/<head> /' file
```

- Check whether the last doc tag is closed; if not, close it. You can inspect the last lines with:

```shell
> tail -200 wiki_parsed.xml
```

- Close the head tag:

```shell
> echo "</head>" >> wiki_parsed.xml
```

# STEP 4: Split the large XML into different files, each file being an article.
-------------------------------------------------------------------------------------------
- In case we want to split the big XML into smaller files, one for each article:

```shell
> awk '/<doc/{x="F"++i;}{print > x;}' wiki_parsed.xml
```

- More: http://www.theunixschool.com/2012/06/awk-10-examples-to-split-file-into.html

OR

- https://github.com/seamusabshere/xml_split (PREFERRED).
- Requires sgrep or sgrep2, so: brew install sgrep.
- Considering you want to store the split article files in a directory called 'splitted', with the file names prefixed by monday:

```shell
> xml_split wiki_parsed.xml doc splitted/monday
```

- Might take some time, but it should be really fast since it uses sgrep.

# STEP 5: Run the rake task to get the data out and dump it into the database.
-------------------------------------------------------------------------------------------
Considering the split docs are in the 'splitted' directory:

```shell
> bundle exec rake 'wiki_extractor:extract_to_db[splitted]'
```

And we should have all the articles populated into the wiki_articles table. Phew, now this was easy once we figured out the proper tools.
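The head-wrapping in STEP 3 can be sanity-checked on a tiny file. A minimal sketch, assuming GNU sed (on BSD/macOS sed, `-i` needs an explicit suffix argument); the demo file name and ids are invented:

```shell
# Tiny stand-in for wiki_parsed.xml: sibling <doc> elements, no single root.
cat > /tmp/wiki_head_demo.xml <<'EOF'
<doc id="1">one</doc>
<doc id="2">two</doc>
EOF

# STEP 3: prepend <head> on the first line, append </head> at the end.
sed -i '1s/^/<head> /' /tmp/wiki_head_demo.xml
echo "</head>" >> /tmp/wiki_head_demo.xml

head -1 /tmp/wiki_head_demo.xml   # first line now starts with "<head> "
tail -1 /tmp/wiki_head_demo.xml   # last line is "</head>"
```

After this the file has exactly one root element, which is all most XML parsers need to stream the `<doc>` children.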
xecutioner revised this gist
Mar 17, 2014 (1 changed file with 1 addition and 1 deletion)
# STEP 1: Use Media Lab's tool to generate doc-based XML from the Wikipedia dump.
----------------------------
Use medialab's extractor tool: http://medialab.di.unipi.it/wiki/Wikipedia_Extractor

```shell
> wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
> bzcat enwiki-latest-pages-articles.xml.bz2 | WikiExtractor.py -cb 250K -o extracted
```
xecutioner revised this gist
Mar 17, 2014 (1 changed file with 1 addition and 1 deletion)
# STEP 1: Use Media Lab's tool to generate doc-based XML from the Wikipedia dump.
----------------------------
Use medialab's extractor tool: http://medialab.di.unipi.it/wiki/Wikipedia_Extractor
xecutioner revised this gist
Mar 17, 2014 (1 changed file with 4 additions and 3 deletions)
# STEP 1: Use Media Lab's tool to generate doc-based XML from the Wikipedia dump.
----------------------------
Use medialab's extractor tool: http://medialab.di.unipi.it/wiki/Wikipedia_Extractor

```shell
> wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
> bzcat enwiki-latest-pages-articles.xml.bz2 | WikiExtractor.py -cb 250K -o extracted
```

In order to combine the whole extracted text into a single file, one can issue:

```shell
> find extracted -name '*bz2' -exec bunzip2 -c {} \; > wiki_parsed.xml
```
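The combine step above can be exercised on a synthetic `extracted/` tree. A sketch, assuming bzip2/bunzip2 are installed; the directory layout and doc ids are made up for the demo (WikiExtractor's real output uses similar `AA/wiki_NN.bz2` chunks):

```shell
# Fake extractor output: a directory tree of bz2-compressed chunks.
mkdir -p /tmp/extracted_demo/AA
printf '<doc id="1">one</doc>\n' | bzip2 > /tmp/extracted_demo/AA/wiki_00.bz2
printf '<doc id="2">two</doc>\n' | bzip2 > /tmp/extracted_demo/AA/wiki_01.bz2

# Decompress every chunk to stdout and concatenate into a single file.
find /tmp/extracted_demo -name '*bz2' -exec bunzip2 -c {} \; > /tmp/wiki_combined.xml

wc -l < /tmp/wiki_combined.xml   # 2
```

Note `find` gives no ordering guarantee across chunks; for the article dump that is fine, since each `<doc>` is self-contained.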
xecutioner revised this gist
Mar 17, 2014 (1 changed file with 2 additions and 0 deletions)
STEP 5: Run the rake task to get the data out and dump it into the database.
-------------------------------------------------------------------------------------------
Considering the split docs are in the 'splitted' directory:

```shell
> bundle exec rake 'wiki_extractor:extract_to_db[splitted]'
```
xecutioner revised this gist
Mar 17, 2014 (1 changed file with 11 additions and 1 deletion)
STEP 1: Use Media Lab's tool to generate doc-based XML from the Wikipedia dump.
----------------------------
Use medialab's extractor tool: http://medialab.di.unipi.it/wiki/Wikipedia_Extractor

```shell
> wget http://download.wikimedia.org/itwiki/latest/itwiki-latest-pages-articles.xml.bz2
> bzcat itwiki-latest-pages-articles.xml.bz2 | WikiExtractor.py -cb 250K -o extracted
```

In order to combine the whole extracted text into a single file, one can issue:

```shell
> find extracted -name '*bz2' -exec bunzip2 -c {} \; > wiki_parsed.xml
> rm -rf extracted
```

STEP 2
--------------------------------
Remove any untouched incomplete tags:

```shell
> sed -i.bak -e 's/<[^doc>/!][^ >][^>]*>//g;s/<\/[^doc>][^>]*>//g' wiki_parsed.xml
```

STEP 3: Generate a valid XML to be parsed.
-----------------------------------------
Prepend a head tag so XML parsers see a single valid root:

```shell
> echo "<head>" | cat - yourfile > /tmp/out && mv /tmp/out yourfile
```

OR

```shell
> sed -i '1s/^/<head> /' file
```

Check whether the last doc tag is closed; if not, close it. Then close the head tag:

```shell
> echo "</head>" >> file
```

STEP 4: Split the large XML into different files, each file being an article.
-------------------------------------------------------------------------------------------
In case we want to split the big XML into smaller files, one for each article:

```shell
> awk '/<doc/{x="F"++i;}{print > x;}' wiki_parsed.xml
```

OR use https://github.com/seamusabshere/xml_split:

```shell
> xml_split wiki_parsed.xml doc splitted/mon
```

More: http://www.theunixschool.com/2012/06/awk-10-examples-to-split-file-into.html

STEP 5: Run the rake task to get the data out and dump it into the database.
-------------------------------------------------------------------------------------------
xecutioner revised this gist
Mar 17, 2014 (1 changed file with 3 additions and 1 deletion)
Check whether the last doc tag is closed; if not, close it. Then close the head tag:

```shell
> echo "</head>" >> file
```

In case we want to split the big XML into smaller files, one for each article:

```shell
> awk '/<doc/{x="F"++i;}{print > x;}' wiki_parsed.xml
```

OR use https://github.com/seamusabshere/xml_split

More: http://www.theunixschool.com/2012/06/awk-10-examples-to-split-file-into.html
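The one-file-per-article awk split above can be sketched end to end on a toy input. The output names `F1`, `F2` come from the one-liner itself; the demo input file and article titles are invented:

```shell
# Tiny stand-in for wiki_parsed.xml: two articles, several lines each.
cat > /tmp/wiki_awk_demo.xml <<'EOF'
<doc id="1" title="Alpha">
Alpha body text.
</doc>
<doc id="2" title="Beta">
Beta body text.
</doc>
EOF

# Start a new output file F<i> at every <doc ...> opening tag;
# every line (tag, body, closing tag) goes to the current file.
cd /tmp && awk '/<doc/{x="F"++i;}{print > x;}' wiki_awk_demo.xml

grep -l 'Alpha body' F1   # article 1 landed in F1
grep -l 'Beta body'  F2   # article 2 landed in F2
```

One caveat worth knowing before running this on the full dump: the one-liner keeps every output file open, so with millions of articles you can hit the open-file limit; the xml_split route avoids that.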
xecutioner created this gist
Mar 17, 2014
Use medialab's extractor tool: http://medialab.di.unipi.it/wiki/Wikipedia_Extractor

```shell
> wget http://download.wikimedia.org/itwiki/latest/itwiki-latest-pages-articles.xml.bz2
> bzcat itwiki-latest-pages-articles.xml.bz2 | WikiExtractor.py -cb 250K -o extracted
```

In order to combine the whole extracted text into a single file, one can issue:

```shell
> find extracted -name '*bz2' -exec bunzip2 -c {} \; > wiki_parsed.xml
> rm -rf extracted
```

Remove any untouched incomplete tags:

```shell
> sed -i.bak -e 's/<[^doc>/!][^ >][^>]*>//g;s/<\/[^doc>][^>]*>//g' wiki_parsed.xml
```

Prepend a head tag so XML parsers see a single valid root:

```shell
> echo "<head>" | cat - yourfile > /tmp/out && mv /tmp/out yourfile
```

OR

```shell
> sed -i '1s/^/<head> /' file
```

Check whether the last doc tag is closed; if not, close it. Then close the head tag:

```shell
> echo "</head>" >> file
```

In case we want to split the big XML into smaller files, one for each article:

```shell
> awk '/<doc/{x="F"++i;}{print > x;}' wiki_parsed.xml
```

More: http://www.theunixschool.com/2012/06/awk-10-examples-to-split-file-into.html
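The pair of sed expressions in the tag-cleanup step is worth seeing in action: the first strips stray opening tags, the second strips stray closing tags, while tags like `<doc ...>` and `</doc>` survive because `d` is excluded in the guarded position. A small check, assuming GNU sed; the sample markup is invented:

```shell
# A line with leftover HTML wrapped around the doc markup we want to keep.
printf '<html><doc id="1">\nkeep this text\n</doc></html>\n' > /tmp/wiki_sed_demo.xml

# Same two expressions as the guide; -i.bak leaves a .bak backup behind.
sed -i.bak -e 's/<[^doc>/!][^ >][^>]*>//g;s/<\/[^doc>][^>]*>//g' /tmp/wiki_sed_demo.xml

cat /tmp/wiki_sed_demo.xml   # <html>/</html> are gone; <doc ...> and </doc> remain
```

Note the patterns are heuristic, not a real HTML parser: very short tags such as `<b>` slip through the first expression, which is part of why the later revisions also wrap the file in a `<head>` root before parsing.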