# Wikimedia article extractions
# STEP 1: Use the Medialab tool to generate doc-based XML from the Wikipedia dump
----------------------------
- Get the latest copy of the articles from the Wikipedia download page:
```shell
> wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
```
- Use Medialab's extractor tool: http://medialab.di.unipi.it/wiki/Wikipedia_Extractor
- Download and save the Python script as WikiExtractor.py: http://medialab.di.unipi.it/Project/SemaWiki/Tools/WikiExtractor.py
- Run the Python script to extract the articles, with the Wikipedia markup removed, into doc XML nodes.
- This might take some time, depending on the processing capacity of your computer.
```shell
> bzcat enwiki-latest-pages-articles.xml.bz2 |
  WikiExtractor.py -cb 250K -o extracted
```
- To combine all of the extracted text into a single file, run:
```shell
> find extracted -name '*bz2' -exec bunzip2 -c {} \; > wiki_parsed.xml
> rm -rf extracted
```
# STEP 2: Prepare for extraction
--------------------------------
- Remove any leftover incomplete tags (a few missing or incomplete HTML tags may be passed through unconverted by the Python script above).
- This also creates a backup of the original file; if you want to skip the backup, pass the -i flag with an empty '' argument instead of -i.bak.
```shell
> sed -i.bak -e 's/<[^doc>/!][^ >][^>]*>//g;s/<\/[^doc>][^>]*>//g' wiki_parsed.xml
```
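As a quick sanity check, here is a minimal illustration (an assumption: GNU sed) of what that expression does on a one-line sample: stray tags such as `<span>` are stripped, while the `<doc>` wrappers survive because tags beginning with d, o, or c are excluded from the match.

```shell
# Stray <span> tags are removed; <doc> and </doc> are kept intact.
echo '<doc id="1"><span class="x">text</span></doc>' \
  | sed -e 's/<[^doc>/!][^ >][^>]*>//g;s/<\/[^doc>][^>]*>//g'
# -> <doc id="1">text</doc>
```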
# STEP 3: Generate a valid XML to be parsed
-----------------------------------------
- Prepend a head tag so the file becomes proper XML that XML parsers will accept:
```shell
> echo "<head>" | cat - wiki_parsed.xml > /tmp/out && mv /tmp/out wiki_parsed.xml
```
OR
```shell
> sed -i '1s/^/<head> /' wiki_parsed.xml
```
- Check whether the last doc tag is closed; if it is not, close it.
- You can inspect the last lines with the following command:
```shell
> tail -200 wiki_parsed.xml
```
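Rather than eyeballing the tail and closing a truncated final article by hand, a small sketch (assuming, as above, that the file is wiki_parsed.xml and that a complete file ends with a bare `</doc>` line) that appends the closing tag only when it is missing:

```shell
# Append a closing </doc> only if the file does not already end with one.
if [ "$(tail -n 1 wiki_parsed.xml)" != "</doc>" ]; then
  echo "</doc>" >> wiki_parsed.xml
fi
```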
- Close the head tag:
```shell
> echo "</head>" >> wiki_parsed.xml
```
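Optionally, if xmllint (from libxml2) is available, you can confirm the file is now well-formed before splitting. Treat this as a best-effort check: unescaped characters such as stray `&` left in the article text can still trip a strict parser.

```shell
# Exits non-zero and reports the offending line if the XML is not well-formed.
xmllint --noout wiki_parsed.xml && echo "wiki_parsed.xml is well-formed"
```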
# STEP 4: Split the large XML into separate files, one per article
-------------------------------------------------------------------------------------------
- In case we want to split the big XML into smaller files, one for each article:
```shell
> awk '/<doc/{x="F"++i;}{print > x;}' wiki_parsed.xml
```
- More examples: http://www.theunixschool.com/2012/06/awk-10-examples-to-split-file-into.html
OR
- Use https://github.com/seamusabshere/xml_split (PREFERRED).
- Requires sgrep or sgrep2 (brew install sgrep).
- Assuming you want to store the split article files in a directory called 'splitted', with file names prefixed by monday:
```shell
> xml_split wiki_parsed.xml doc splitted/monday
```
- Might take some time, but should be quite fast since it uses sgrep.
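As a rough sanity check after splitting (a sketch, assuming the file and directory names used above), the number of files produced should match the number of `<doc ...>` openings in the source:

```shell
# Compare <doc ...> openings in the source with the number of split files.
docs=$(grep -c '<doc' wiki_parsed.xml 2>/dev/null || true)
files=$(ls splitted 2>/dev/null | wc -l)
echo "docs: $docs, files: $files"
```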
# STEP 5: Run the rake task to extract the data and dump it into the database
-------------------------------------------------------------------------------------------
Assuming the split docs are in the 'splitted' directory:
```shell
> bundle exec rake 'wiki_extractor:extract_to_db[splitted]'
```
And we should have all the articles populated into the wiki_articles table.
Phew! Now, this is easy once we figure out the proper tools.