Last active: August 8, 2021 17:17
Revisions
SandieIJ revised this gist
Apr 23, 2020 · 1 changed file with 21 additions and 6 deletions.
This revision adds explanatory text to the markdown cells:

**TEXT DATA PREPROCESSING FOR NATURAL LANGUAGE PROCESSING USING SPACY AND GENSIM**

The code in this gist corresponds to a Medium article in which I present a tutorial on how to preprocess text data for natural language processing modeling. The tutorial can be found by following this link:

https://medium.com/@calmscientist/text-data-preprocessing-for-nlp-using-gensim-and-spacy-23347fcf3648

**CLEANING AND FORMATTING THE DATA**

Before we begin the preprocessing steps, we format the data, which contains only game descriptions, as a list, with each item in the list corresponding to a single description.

**REMOVING NEWLINE CHARACTERS AND NONLETTER CHARACTERS**

When processing text, newline characters and nonletter characters add no valuable information; they only add to the size of the text. It is therefore considered best practice to remove these characters from your data.

**GENERATING N-GRAMS**

For our models to infer the correct meanings of words, it is important to identify n-grams in the text data the model is trained on.

**LEMMATIZATION**

Lemmatization is the process of identifying and structuring the relationships in a tokenized document in order to accurately recover each word's lemma, the dictionary form of the word, for nouns, adjectives, verbs, and adverbs.

**CREATING A DICTIONARY OF WORD MAPPINGS AND A BAG-OF-WORDS**

Creating a dictionary sounds complex, but it simply means assigning an integer value to each word in the document for modeling purposes.

Once we have the dictionary, we create a bag-of-words: a list containing, for each token, the identification number assigned during dictionary creation and the frequency with which that token appears in the document, i.e. (token_id, frequency).

The revision also adds a blank line after the comment in the final cell:

```python
# Human readable format of corpus (term-frequency)

[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]
```
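To make the dictionary and bag-of-words idea concrete, here is a minimal sketch using gensim's Dictionary and doc2bow; the two toy documents are invented examples, not data from the gist:

```python
from gensim.corpora import Dictionary

# Two tiny tokenized "documents" (made-up example data)
docs = [["free", "fire", "game", "fire"],
        ["word", "puzzle", "game"]]

# The dictionary assigns an integer id to every unique token
id2word = Dictionary(docs)

# doc2bow turns a token list into (token_id, frequency) pairs
bow = id2word.doc2bow(docs[0])
print(bow)                                # e.g. [(0, 2), (1, 1), (2, 1)]
print([(id2word[i], f) for i, f in bow])  # e.g. [('fire', 2), ('free', 1), ('game', 1)]
```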
SandieIJ renamed this gist
Apr 23, 2020 · 1 changed file with 122 additions and 45 deletions.
The notebook after this revision:

**TEXT DATA PREPROCESSING FOR NATURAL LANGUAGE PROCESSING USING SPACY AND GENSIM**

The code in this gist corresponds to a Medium article in which I present a tutorial on how to preprocess text data for natural language processing modeling. The tutorial can be found by following this link: https://medium.com/@calmscientist/text-data-preprocessing-for-nlp-using-gensim-and-spacy-23347fcf3648

```python
# install the required modules
!pip install spacy
!pip install gensim
!pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz
```

```python
# import all required packages
import spacy
import re
import pandas as pd
import numpy as np
import gensim
from gensim.utils import simple_preprocess
import gensim.corpora as corpora
from pprint import pprint
```

**CLEANING AND FORMATTING THE DATA**

```python
# Read/load the data
data = pd.read_csv("https://raw.githubusercontent.com/SandieIJ/Capstone/master/data/sandra_csv_results-20190723-155508.csv")

# Sample of the output
data.head(10)
```

Output (first ten rows, name and description columns):

```
   name                                               description
0  Legend Fire Squad survival: Free Fire Battlegr...  Ready to play an amazing and exciting best sho...
1  Ambulance Game                                     You must be a fan of the driving games. We ass...
2  Beam Drive NG Death Stair Car Crash Simulator      Beam Drive NG Death Stair Car Crash Accidents ...
3  Kelime İncileri                                    Yeni Kelime Bulmaca Oyununuz! Kelime Arama ve ...
4  Word Blocks                                        Word Blocks is a new kind of word search puzzl...
5  Free Fire Commando - Counter Attack FPS 2019       Free Fire Commando - Counter Attack FPS 2019 i...
6  Fall Race 3D                                       The most exciting sky race!Run through the sky...
7  Math School Game Basic: Crazy Principal            Your school principal went crazy and locked yo...
8  Jump Cube                                          Jump Cube is an addictive game, tap the right ...
9  Tien Len Offline                                   Một tựa game cũng như cách chơi ko thể quen th...
```

```python
# convert the descriptions from a data frame column into a list
descriptions = data.description.values.tolist()

# sample of the output
print(descriptions[10])
```

Output:

```
Brick Breaker 3D is a single-tap hyper casual game that will keep you hooked for hours!Hold the screen to aim, swipe the ball to the brick and break all the bricks easily!The game features unlimited levels and 20 beautiful color balls.
```

**REMOVING NEWLINE CHARACTERS AND NONLETTER CHARACTERS**

```python
# Remove newline characters
no_new_lines = [re.sub('\s+', ' ', sent) for sent in descriptions]

# Remove nonletter characters
non_letters = [re.sub('[^a-zA-Z]', ' ', no_new_line) for no_new_line in no_new_lines]

# Remove distracting single quotes
no_quotes = [re.sub("\'", '', non_letter) for non_letter in non_letters]
```

**GENERATING N-GRAMS**

```python
# break sentences down into words
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

data_words = list(sent_to_words(no_quotes))

# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

# Form bigrams
data_words_bigrams = make_bigrams(data_words)
```

```python
# preview of a description with bigrams identified
print(data_words_bigrams[10])
```

Output:

```
['brick_breaker', 'is', 'single', 'tap', 'hyper_casual', 'game', 'that', 'will', 'keep', 'you', 'hooked', 'for', 'hours', 'hold', 'the', 'screen', 'to', 'aim', 'swipe', 'the', 'ball', 'to', 'the', 'brick', 'and', 'break', 'all', 'the', 'bricks', 'easily', 'the', 'game', 'features', 'unlimited', 'levels', 'and', 'beautiful', 'color', 'balls']
```

**LEMMATIZATION**

```python
# Initialize spaCy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

# Perform lemmatization, keeping only nouns, adjectives, verbs, and adverbs
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

# preview of the lemmatized data
print(data_lemmatized[:1])
```

**CREATING A DICTIONARY OF WORD MAPPINGS AND A BAG-OF-WORDS**

```python
# a mapping between words and their corresponding integer values
id2word = corpora.Dictionary(data_lemmatized)

# Term-document frequency: gensim creates a unique id for each word in the document
corpus = [id2word.doc2bow(text) for text in data_lemmatized]

# The corpus is a mapping of (word_id, word_frequency)
print(corpus[:1])
```

**TEXT REPRESENTED AS A BAG OF WORDS DISREGARDING GRAMMAR**

```python
# Human readable format of corpus (term-frequency)

[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]
```

Output (abridged; the full listing appears in the earlier revision below):

```
[[('action', 1), ('ai', 1), ('amazing', 1), ..., ('war', 6), ('weapon', 4), ('where', 2), ('world', 2)]]
```
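As an aside on how the Phrases/Phraser pair in the n-gram step behaves, here is a minimal, self-contained sketch; the toy sentences and the loosened min_count/threshold values are assumptions for illustration, not the gist's settings:

```python
import gensim

# Toy corpus in which "new" and "york" co-occur frequently (made-up data)
sentences = [["new", "york", "is", "big"],
             ["i", "love", "new", "york"],
             ["new", "york", "pizza"],
             ["new", "york", "weather"],
             ["visit", "new", "york"]]

# Loose settings so the toy phrase is detected (the gist uses min_count=5, threshold=100)
bigram = gensim.models.Phrases(sentences, min_count=2, threshold=1)
bigram_mod = gensim.models.phrases.Phraser(bigram)

# Frequent word pairs are joined with an underscore
print(bigram_mod[["i", "love", "new", "york"]])   # ['i', 'love', 'new_york']
```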
SandieIJ renamed this gist
Apr 23, 2020 · 1 changed file with 0 additions and 0 deletions.
File renamed without changes.
SandieIJ revised this gist
Apr 23, 2020 · 2 changed files with 394 additions and 377 deletions.
The previous 377-line file was deleted and replaced with this 394-line version:

```python
!pip install spacy
!pip install gensim
!pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz
```

```python
import spacy
import re
import pandas as pd
import gensim
from gensim.utils import simple_preprocess
import gensim.corpora as corpora
from pprint import pprint
```

```python
# Read/load the data
data = pd.read_csv("https://raw.githubusercontent.com/SandieIJ/Capstone/master/data/sandra_csv_results-20190723-155508.csv")

data.head(10)
```

Output: the same ten-row name/description preview shown in the revision above.

```python
descriptions = data.description.values.tolist()
```

```python
# Remove newline characters
no_new_lines = [re.sub('\s+', ' ', sent) for sent in descriptions]

# Remove nonletter characters
non_letters = [re.sub('[^a-zA-Z]', ' ', no_new_line) for no_new_line in no_new_lines]

# Remove distracting single quotes
no_quotes = [re.sub("\'", '', non_letter) for non_letter in non_letters]

# break sentences down into words
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

data_words = list(sent_to_words(descriptions))
```

```python
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
```

```python
# Initialize spaCy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

# Form bigrams
data_words_bigrams = make_bigrams(data_words)
```

```python
# Perform lemmatization, keeping only nouns, adjectives, verbs, and adverbs
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1])
```

Output (first lemmatized description, abridged):

```
[['ready', 'play', 'amazing', 'exciting', 'good', 'shooting', 'game', 'fire', 'shoot', 'game', 'war', 'shooting', 'game', 'free', 'unknown', 'battle', 'strike', 'free', 'survival', 'mission', ..., 'download', 'play', 'store', 'good', 'legend', 'free', 'fire', 'totally', 'free']]
```

```python
texts = data_lemmatized

# a mapping between words and their corresponding integer values
id2word = corpora.Dictionary(texts)

# Term-document frequency: gensim creates a unique id for each word in the document
corpus = [id2word.doc2bow(text) for text in texts]

# The corpus is a mapping of (word_id, word_frequency)
print(corpus[:1])
```

Output (abridged):

```
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 2), (5, 1), (6, 3), (7, 1), (8, 9), (9, 1), (10, 1), (11, 5), (12, 2), ..., (75, 6), (76, 4), (77, 2), (78, 2)]]
```

```python
# Human readable format of corpus (term-frequency)
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]
```

Output:

```
[[('action', 1), ('ai', 1), ('amazing', 1), ('arena', 1), ('army', 2), ('available', 1),
  ('battle', 3), ('battlefield', 1), ('battleground', 9), ('best', 1), ('challenging', 1),
  ('commando', 5), ('control', 2), ('crazy', 3), ('cross', 1), ('dangerous', 2), ('death', 3),
  ('depend', 1), ('detect', 1), ('download', 1), ('enemy', 4), ('environment', 1),
  ('exciting', 2), ('face', 3), ('feature', 1), ('feel', 1), ('fierce', 1), ('fill', 1),
  ('fire', 16), ('firing_squad', 3), ('free', 14), ('game', 15), ('good', 5), ('graphic', 1),
  ('gun', 4), ('journey', 2), ('last', 1), ('legend', 3), ('lot', 1), ('mind_blowing', 1),
  ('mission', 11), ('missionsdozen', 1), ('modern', 2), ('offline', 1), ('other', 1),
  ('play', 2), ('player', 1), ('position', 1), ('ready', 1), ('real', 1), ('see', 1),
  ('shoot', 10), ('shooting', 8), ('show', 1), ('skill', 5), ('smooth', 1), ('sniper', 3),
  ('soldier', 1), ('squad', 4), ('squadreal', 1), ('start', 1), ('store', 1), ('strike', 5),
  ('strikesimple', 1), ('surgical', 1), ('surgical_strike', 1), ('survival', 10),
  ('system', 1), ('terrorist', 1), ('totally', 1), ('training', 5), ('unknown', 5),
  ('unknown_battleground', 1), ('variety', 1), ('wait', 1), ('war', 6), ('weapon', 4),
  ('where', 2), ('world', 2)]]
```
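For intuition about what the lemmatization function returns, here is a minimal sketch; the sample sentence is made up, and it assumes en_core_web_sm is installed:

```python
import spacy

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

doc = nlp("the players quickly ran through exciting new levels")

# Each token carries a lemma and a coarse part-of-speech tag
print([(t.text, t.lemma_, t.pos_) for t in doc])

# Keeping only the POS tags used in the gist
print([t.lemma_ for t in doc if t.pos_ in ['NOUN', 'ADJ', 'VERB', 'ADV']])
# e.g. ['player', 'quickly', 'run', 'exciting', 'new', 'level']
```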
SandieIJ revised this gist
Apr 23, 2020 · 2 changed files with 377 additions and 273 deletions.
A new 377-line notebook was added and the previous 273-line file was deleted. The new notebook:

```python
!pip install spacy
!pip install gensim
!pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz
```

Output (abridged): pip reports every requirement for spacy 2.2.4 and gensim 3.8.1 already satisfied, downloads en_core_web_sm-2.2.0, and builds and installs its wheel successfully.

```python
import spacy
import re
import pandas as pd
import gensim
from gensim.utils import simple_preprocess
import gensim.corpora as corpora
from pprint import pprint
```

```python
# Read/load the data
data = pd.read_csv("https://raw.githubusercontent.com/SandieIJ/Capstone/master/data/sandra_csv_results-20190723-155508.csv")

data.head(10)
```

Output: the same ten-row name/description preview shown in the revisions above.

```python
descriptions = data.description.values.tolist()
```

```python
# Remove newline characters
no_new_lines = [re.sub('\s+', ' ', sent) for sent in descriptions]

# Remove nonletter characters
non_letters = [re.sub('[^a-zA-Z]', ' ', no_new_line) for no_new_line in no_new_lines]

# Remove distracting single quotes
no_quotes = [re.sub("\'", '', non_letter) for non_letter in non_letters]

# break sentences down into words
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

data_words = list(sent_to_words(descriptions))
```

The remaining cells (the bigram/trigram models, spaCy initialization and lemmatization, dictionary and corpus creation, and the human-readable corpus) contain the same code as the later revision above, but not yet executed.
SandieIJ revised this gist
Apr 23, 2020 · 1 changed file with 0 additions and 1 deletion.
One line was deleted from the notebook JSON around the cell that prints the human-readable corpus:

```
@@ -243,7 +243,6 @@
 "source": [
   "# Human readable format of corpus (term-frequency)\n",
   "[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]"
 },
 {
   "cell_type": "markdown",
```
SandieIJ revised this gist
Apr 23, 2020 · 1 changed file with 0 additions and 324 deletions.
324 lines were deleted starting at line 249 of the notebook JSON, removing the LDA modeling cells that followed the preprocessing section:

```
@@ -249,330 +249,6 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 ],
 "metadata": {
   "kernelspec": {
```
SandieIJ revised this gist
Apr 23, 2020 · 1 changed file with 0 additions and 598 deletions.
The entire 598-line notebook file was deleted (@@ -1,598 +0,0 @@).
SandieIJ revised this gist
Apr 23, 2020 · 2 changed files with 0 additions and 598 deletions.
One 598-line file was deleted (@@ -1,598 +0,0 @@); the other file was renamed without changes.
SandieIJ revised this gist
Apr 23, 2020 · 1 changed file with 598 additions and 0 deletions.
The full 598-line notebook was re-added:

**Install SpaCy and Gensim, as well as a preferred SpaCy model; here I use the English model**

```python
pip install -U spacy
```

```python
pip install pyLDAvis
```

```python
pip install osqp
```

```python
pip install gensim
```

```python
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz
```

**To carry out the preprocessing we will use the following libraries and modules**

```python
import spacy
import re
import pandas as pd
import gensim
from gensim.utils import simple_preprocess
import gensim.corpora as corpora
from pprint import pprint
```

```python
# Load the data
data = pd.read_csv("https://raw.githubusercontent.com/SandieIJ/Capstone/master/data/sandra_csv_results-20190723-155508.csv")
```

```python
# Preview the data
data.head(10)
```

```python
# Isolate the descriptions
descriptions = data.description.values.tolist()
```

```python
# Remove newline characters
no_new_lines = [re.sub('\s+', ' ', sent) for sent in descriptions]

# Remove nonletter characters
non_letters = [re.sub('[^a-zA-Z]', ' ', no_new_line) for no_new_line in no_new_lines]

# Remove distracting single quotes
no_quotes = [re.sub("\'", '', non_letter) for non_letter in non_letters]
```

```python
# Preview the data
pprint(descriptions[:1])
```

```python
# Break sentences down into words
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)
```

```python
data_words = list(sent_to_words(descriptions))

print(data_words[:1])
```

```python
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
```

```python
# Initialize spaCy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
```

```python
def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out
```

```python
# Form bigrams
data_words_bigrams = make_bigrams(data_words)

# Perform lemmatization, keeping only nouns, adjectives, verbs, and adverbs
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1])
```

```python
texts = data_lemmatized

# Create a mapping between words and their corresponding integer values
id2word = corpora.Dictionary(texts)

# Term-document frequency: gensim creates a unique id for each word in the document
corpus = [id2word.doc2bow(text) for text in texts]

# The corpus is a mapping of (word_id, word_frequency)
print(corpus[:1])
```

```python
# Human readable format of corpus (term-frequency)
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]
```

**Import the packages needed to run the LDA model on the preprocessed data**

```python
from gensim.models.wrappers import LdaMallet
from gensim.models import CoherenceModel
import pyLDAvis
import pyLDAvis.gensim
import osqp
import time
import matplotlib.pyplot as plt
```

```python
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=20,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=8,
                                            alpha='auto',
                                            per_word_topics=True)

# Print the keywords in the topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]
```

```python
# Compute perplexity: a measure of how good the model is (lower is better)
print('\nPerplexity: ', lda_model.log_perplexity(corpus))

# Compute coherence score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
```

**Visualize the model without any parameter tuning**

```python
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word, mds='tsne')
vis
```

**Download http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip. Mallet is a package containing tools for document classification; here we use it for topic modeling, since our data is unlabeled, to infer topic distributions on new, unseen documents**

```python
mallet_path = '/Users/sandiejirongo/Mallet/bin/mallet'
```

```python
def compute_coherence_values(dictionary, corpus, texts, limit, start, step):
    """
    Compute c_v coherence for various numbers of topics.

    Parameters
    ----------
    dictionary : Gensim dictionary (created above)
    corpus : Gensim corpus (the (word_id, word_frequency) mapping created after lemmatization)
    texts : list of input texts (the data to be analyzed)
    limit : maximum number of topics (an educated guess for now; we optimize this value below)

    Returns
    -------
    model_list : list of LDA topic models
    coherence_values : coherence values corresponding to the LDA model with the respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=id2word)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
    return model_list, coherence_values
```

```python
model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=data_lemmatized,
                                                        limit=40, start=2, step=4)
```

```python
# This graph shows how many topics we should use: beyond which number of
# topics does the model's coherence stop improving?
limit = 40; start = 2; step = 4

x = range(start, limit, step)

plt.plot(x, coherence_values)
plt.xlabel("Number of Topics")
plt.ylabel("Coherence score")
plt.legend(["coherence_values"], loc='best')
plt.show()
```

```python
for m, cv in zip(x, coherence_values):
    print("Num Topics =", m, " has Coherence Value of", round(cv, 4))
```

```python
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=14, id2word=id2word)
```

```python
# Compute coherence score
coherence_model_ldamallet = CoherenceModel(model=ldamallet, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet)
```

```python
lda_model2 = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=14,
                                             random_state=100, update_every=1, chunksize=100, passes=10,
                                             alpha='auto', per_word_topics=True)
```

```python
pyLDAvis.enable_notebook()
visualization = pyLDAvis.gensim.prepare(lda_model2, corpus, id2word)
visualization
```

**Format and preview the output so that it is easily readable and transferable to any database**

```python
def format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=data, name=data):
    sent_topics_df = pd.DataFrame()
    # Get the main topic in each document
    for i, row in enumerate(ldamodel[corpus]):
        row = sorted(row[0], key=lambda x: x[1], reverse=True)
        # Get the dominant topic, percentage contribution and keywords for each document
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:  # => dominant topic
                wp = ldamodel.show_topic(topic_num)
                topic_keywords = ", ".join([word for word, prop in wp])
                sent_topics_df = sent_topics_df.append(
                    pd.Series([int(topic_num), round(prop_topic, 4), topic_keywords]),
                    ignore_index=True)
            else:
                break
    sent_topics_df.columns = ['Dominant Topic', 'Percentage Contribution', 'Topic Keywords']
    # Add the original text and game name to the end of the output
    contents = pd.Series(texts)
    game_name = pd.Series(name)
    sent_topics_df = pd.concat([sent_topics_df, contents, game_name], axis=1)
    return sent_topics_df
```

```python
# titles holds the game names, e.g. titles = data.name.values.tolist()
df_topic_sents_keywords = format_topics_sentences(ldamodel=lda_model2, corpus=corpus, texts=descriptions, name=titles)
```

```python
df_dominant_topic = df_topic_sents_keywords.reset_index()

df_dominant_topic.columns = ['Document No.', 'Dominant Topic', 'Topic Percentage Contribution',
                             'Keywords', 'Game Description', 'Game Name']
```

```python
df_dominant_topic.head(10)
```

```python
# Group the top sentence under each topic
sent_topics_sorteddf_mallet = pd.DataFrame()

sent_topics_outdf_grpd = df_topic_sents_keywords.groupby('Dominant Topic')

for i, grp in sent_topics_outdf_grpd:
    sent_topics_sorteddf_mallet = pd.concat(
        [sent_topics_sorteddf_mallet,
         grp.sort_values(['Percentage Contribution'], ascending=[0]).head(1)],
        axis=0)

# Reset the index
sent_topics_sorteddf_mallet.reset_index(drop=True, inplace=True)

sent_topics_sorteddf_mallet.columns = ['Topic Number', 'Topic Percentage Contribution', 'Keywords',
                                       'Game Description', 'Game Name']
# Show
sent_topics_sorteddf_mallet.head()
```

```python
# Number of documents for each topic
topic_counts = df_topic_sents_keywords['Dominant Topic'].value_counts()

# Percentage of documents for each topic
topic_contribution = round(topic_counts / topic_counts.sum(), 4)

# Topic number and keywords
topic_num_keywords = df_topic_sents_keywords[['Dominant Topic', 'Topic Keywords']]

# Concatenate column-wise
df_dominant_topics = pd.concat([topic_num_keywords, topic_counts, topic_contribution], axis=1)

# Change column names
df_dominant_topics.columns = ['Dominant Topic', 'Topic Keywords', 'Number of Documents', 'Percentage of Documents']

# Show
df_dominant_topics.head(10)
```