
@xecutioner
Last active August 29, 2015 13:57
# STEP 1: Use media labs tool to generate doc based xml from the wikipedia dump.
----------------------------

- Get the latest copy of the articles from the wikipedia download page.

```shell

> wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

```
- Use medialab's extractor tool http://medialab.di.unipi.it/wiki/Wikipedia_Extractor
- Download and save the python script as WikiExtractor.py from http://medialab.di.unipi.it/Project/SemaWiki/Tools/WikiExtractor.py
- Run the python script to extract the articles, with the wikipedia markup removed, into doc xml nodes.
- This might take some time depending on the processing capacity of your computer.

    ```shell

> bzcat enwiki-latest-pages-articles.xml.bz2 | WikiExtractor.py -cb 250K -o extracted

    ```
- To combine the whole extracted text into a single file, one can issue:

```shell

> find extracted -name '*bz2' -exec bunzip2 -c {} \; > wiki_parsed.xml
> rm -rf extracted

```


# STEP 2 (preparing for extraction)
--------------------------------
- Remove any untouched incomplete tags (a few missing or malformed html tags may be passed through unconverted by the python script above).
- The command below also creates a backup of the original; to skip the backup, pass -i with an empty '' argument.

```shell

> sed -i.bak -e 's/<[^d][^o>]*[^c>]*>//g' wiki_parsed.xml

```
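As a sanity check, here is a minimal sketch of what that sed pass does on a toy input (sample.xml and its contents are made up for illustration; only the doc markers survive):

```shell
# Create a tiny file mixing doc markers with leftover html tags.
printf '<doc id="1" url="u" title="A">\nHello <b>world</b> and <br/> more.\n</doc>\n' > sample.xml
# Strip every tag that does not start with "d" (i.e. everything but <doc ...>/</doc>).
sed -i.bak -e 's/<[^d][^o>]*[^c>]*>//g' sample.xml
cat sample.xml
```

The `<b>`, `</b>` and `<br/>` tags are removed while the opening and closing doc tags are left intact.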


    # STEP 3 : Generate a valid xml to be parsed.
    -----------------------------------------
- Prepend a head tag so the file becomes valid xml that xml parsers can handle:
echo "<head>" | cat - wiki_parsed.xml > /tmp/out && mv /tmp/out wiki_parsed.xml
OR

    ```shell

> sed -i '1s/^/<head> /' wiki_parsed.xml

    ```

- Check whether the last doc tag is closed; if it is not, close it.
- You can inspect the last lines with the following command:

    ```shell

    > tail -200 wiki_parsed.xml

    ```

    - Close the head tag

    ```shell
    > echo "</head>" >> wiki_parsed.xml
    ```
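Put together, the STEP 3 wrapping can be tried end-to-end on a toy file first (toy.xml and its contents are illustrative):

```shell
# Two single-line doc nodes standing in for the real parsed dump.
printf '<doc id="1">first</doc>\n<doc id="2">second</doc>\n' > toy.xml
sed -i '1s/^/<head> /' toy.xml   # prepend the opening tag to line 1
echo "</head>" >> toy.xml        # append the closing tag at the end
cat toy.xml
```

The first line now starts with the head tag and the last line closes it, so the file parses as a single xml document.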

    # STEP 4 : Split the large xml into different files with each file being an article.
    -------------------------------------------------------------------------------------------
- In case we want to split the big xml into smaller files, one for each article:
awk '/<doc/{x="F"++i;}{print > x;}' wiki_parsed.xml
- More examples: http://www.theunixschool.com/2012/06/awk-10-examples-to-split-file-into.html
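The one-liner can be tried on a toy input first (toy2.xml and the generated F1/F2 names are illustrative; in general the command writes one file per doc block it encounters):

```shell
# Two single-line doc nodes standing in for two articles.
printf '<doc id="1">alpha</doc>\n<doc id="2">beta</doc>\n' > toy2.xml
# Each line matching <doc starts a new output file F1, F2, ...;
# every line is then printed into the current file.
awk '/<doc/{x="F"++i;}{print > x;}' toy2.xml
cat F1 F2
```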

OR

> mkdir -p splitted
> awk 'BEGIN { system("mkdir -p splitted/sub"++j) }
  /<doc/ { x = "F"++i".xml" }
  {
    if (i % 1995 == 0) {
      ++i
      system("mkdir -p splitted/sub"++j"/")
    } else {
      print >> ("splitted/sub"j"/"x)
      close("splitted/sub"j"/"x)
    }
  }' wiki_parsed.xml

- https://github.com/seamusabshere/xml_split (preferred).
- Requires sgrep or sgrep2, so on OS X: brew install sgrep.
- Assuming you want to store the split article files in a directory called 'splitted', with file names prefixed by 'monday':

    ```shell
    > gem install xml_split
    > xml_split wiki_parsed.xml doc splitted/monday
    ```
- This can take a while on the full dump, but it is fast for the file size since it uses sgrep.

# STEP 5: Run the rake task to get the data out and dump into the database.
-------------------------------------------------------------------------------------------
Considering the split docs are in the 'splitted' directory:

```shell
> bundle exec rake 'wiki_extractor:extract_to_db[splitted]'
```

And we should have all the articles populated into the wiki_articles table. Phew, this is easy once we figure out the proper tools.