
@xecutioner
Last active August 29, 2015 13:57
# STEP 1: Use media labs tool to generate doc based xml from the wikipedia dump.
----------------------------

- Get the latest copy of the articles from the wikipedia download page.

```shell

> wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

```
- Use medialab's extractor tool http://medialab.di.unipi.it/wiki/Wikipedia_Extractor
- Download and save the python script as WikiExtractor.py from http://medialab.di.unipi.it/Project/SemaWiki/Tools/WikiExtractor.py
- Run the python script to extract the articles, with the wikipedia markup removed, into doc xml nodes.
- This might take some time depending on the processing capacity of your computer.

    ```shell

> bzcat enwiki-latest-pages-articles.xml.bz2 | WikiExtractor.py -cb 250K -o extracted

    ```
- To combine the whole extracted text into a single file, one can issue:

```shell

> find extracted -name '*bz2' -exec bunzip2 -c {} \; > wiki_parsed.xml
> rm -rf extracted

```


# STEP 2 (preparing for extraction)
--------------------------------
- Remove any untouched incomplete tags (a few missing or malformed html tags may be passed through unconverted by the python script above).
- The command below also creates a backup of the original; to skip the backup, pass -i with an empty '' argument.

```shell

> sed -i.bak -e 's/<[^d][^o>]*[^c>]*>//g' wiki_parsed.xml

```
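As a sanity check, here is a minimal sketch of what that sed pass does on a toy input (sample.xml and its contents are made up for illustration; only the doc markers survive):

```shell
# Create a tiny file mixing doc markers with leftover html tags.
printf '<doc id="1" url="u" title="A">\nHello <b>world</b> and <br/> more.\n</doc>\n' > sample.xml
# Strip every tag that does not start with "d" (i.e. everything but <doc ...>/</doc>).
sed -i.bak -e 's/<[^d][^o>]*[^c>]*>//g' sample.xml
cat sample.xml
```

The `<b>`, `</b>` and `<br/>` tags are removed while the opening and closing doc tags are left intact.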


    # STEP 3 : Generate a valid xml to be parsed.
    -----------------------------------------
- Prepend a head tag so the file becomes valid xml that xml parsers can handle:
echo "<head>" | cat - wiki_parsed.xml > /tmp/out && mv /tmp/out wiki_parsed.xml
OR

    ```shell

> sed -i '1s/^/<head> /' wiki_parsed.xml

    ```

- Check whether the last doc tag is closed; if it is not, close it.
- You can inspect the last lines with the following command:

    ```shell

    > tail -200 wiki_parsed.xml

    ```

    - Close the head tag

    ```shell
    > echo "</head>" >> wiki_parsed.xml
    ```
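Put together, the STEP 3 wrapping can be tried end-to-end on a toy file first (toy.xml and its contents are illustrative):

```shell
# Two single-line doc nodes standing in for the real parsed dump.
printf '<doc id="1">first</doc>\n<doc id="2">second</doc>\n' > toy.xml
sed -i '1s/^/<head> /' toy.xml   # prepend the opening tag to line 1
echo "</head>" >> toy.xml        # append the closing tag at the end
cat toy.xml
```

The first line now starts with the head tag and the last line closes it, so the file parses as a single xml document.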

    # STEP 4 : Split the large xml into different files with each file being an article.
    -------------------------------------------------------------------------------------------
- In case we want to split the big xml into smaller files, one for each article:
awk '/<doc/{x="F"++i;}{print > x;}' wiki_parsed.xml
- More examples: http://www.theunixschool.com/2012/06/awk-10-examples-to-split-file-into.html
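The one-liner can be tried on a toy input first (toy2.xml and the generated F1/F2 names are illustrative; in general the command writes one file per doc block it encounters):

```shell
# Two single-line doc nodes standing in for two articles.
printf '<doc id="1">alpha</doc>\n<doc id="2">beta</doc>\n' > toy2.xml
# Each line matching <doc starts a new output file F1, F2, ...;
# every line is then printed into the current file.
awk '/<doc/{x="F"++i;}{print > x;}' toy2.xml
cat F1 F2
```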

OR

> mkdir -p splitted
> awk 'BEGIN { system("mkdir -p splitted/sub"++j) }
  /<doc/ { x = "F"++i".xml" }
  {
    if (i % 1995 == 0) {
      ++i
      system("mkdir -p splitted/sub"++j"/")
    } else {
      print >> ("splitted/sub"j"/"x)
      close("splitted/sub"j"/"x)
    }
  }' wiki_parsed.xml

- https://github.com/seamusabshere/xml_split (preferred).
- Requires sgrep or sgrep2, so on OS X: brew install sgrep.
- Assuming you want to store the split article files in a directory called 'splitted', with file names prefixed by 'monday':

    ```shell
    > gem install xml_split
    > xml_split wiki_parsed.xml doc splitted/monday
    ```
- This can take a while on the full dump, but it is fast for the file size since it uses sgrep.

# STEP 5: Run the rake task to get the data out and dump into the database.
-------------------------------------------------------------------------------------------
Considering the split docs are in the 'splitted' directory:

```shell
> bundle exec rake 'wiki_extractor:extract_to_db[splitted]'
```

And we should have all the articles populated into the wiki_articles table. Phew, this is easy once we figure out the proper tools.