
Mufaddal Bhanpurawala bhanpuramufaddal

  • Valuence INC
  • Tokyo, remote
bhanpuramufaddal / HexToStringEn.py
Created July 11, 2023 08:05
Convert escaped hex/unicode sequences in a string back to characters
def convert_hex_char(string):
    # interpret backslash escapes (e.g. '\\u3042', '\\x41') as the characters they encode
    decoded_string = bytes(string, 'utf-8').decode('unicode-escape')
    return decoded_string
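A quick usage sketch of the helper above (note: `unicode-escape` is only reliable when the input itself is ASCII; non-ASCII input can be mangled by the intermediate bytes round-trip):

```python
def convert_hex_char(string):
    return bytes(string, 'utf-8').decode('unicode-escape')

# '\\u3042' is the escaped form of the hiragana 'あ'
print(convert_hex_char('\\u3042'))      # あ
print(convert_hex_char('\\x41\\x42'))   # AB
```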
bhanpuramufaddal / TokenizeLemmatizeMecab.py
Created June 20, 2023 12:50
Tokenization and Lemmatization using Mecab and Fugashi
from os import getenv
from typing import Callable, Iterable, Text

from MeCab import Tagger

def _get_tagger() -> Tagger:
    opts = getenv('MECAB_OPTS', '-d /usr/local/Cellar/mecab-ipadic/2.7.0-20070801/lib/mecab/dic/ipadic')
    tagger = Tagger(opts)
    # for some reason the first request to the tagger doesn't produce output
bhanpuramufaddal / stopwords.py
Created June 18, 2023 07:09
Stopwords in different languages
# Reference
# https://advertools.readthedocs.io/en/master/advertools.stopwords.html
# install using 'pip install advertools'
import advertools as adv

# list the supported languages
adv.stopwords.keys()

# stopwords in Japanese
stopwords = adv.stopwords['japanese']
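To show how such a stopword collection is typically used, here is a minimal stand-in (the tiny hand-picked set below is an assumption for illustration; the real one comes from adv.stopwords['japanese']):

```python
# tiny stand-in for adv.stopwords['japanese']
japanese_stopwords = {'の', 'は', 'に', 'を', 'です'}

def remove_stopwords(tokens):
    # keep only tokens that are not in the stopword set
    return [t for t in tokens if t not in japanese_stopwords]

tokens = ['猫', 'は', '魚', 'を', '食べる']
print(remove_stopwords(tokens))  # ['猫', '魚', '食べる']
```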
from os import getenv
from fugashi import GenericTagger as Tagger

tagger = Tagger('-r /dev/null -d /usr/local/Cellar/mecab-ipadic/2.7.0-20070801/lib/mecab/dic/ipadic')

def tokenize_lemmatize(text, remove_stopwords=True, lemmatize=False):
    tokens = tagger.parseToNodeList(text)
    if remove_stopwords:
        # is_stopword checks the token surface against the stopword set above
        tokens = filter(lambda token: not is_stopword(token.surface.strip()), tokens)
    if lemmatize:
        ...

from MeCab import Tagger

def _get_tagger() -> Tagger:
    opts = getenv('MECAB_OPTS', '-d /usr/local/Cellar/mecab-ipadic/2.7.0-20070801/lib/mecab/dic/ipadic')
    tagger = Tagger(opts)
    # for some reason the first request to the tagger doesn't produce output,
    # so pre-warm it here once to avoid serving daft results later
    parsed = tagger.parseToNode('サザエさんは走った')
    while parsed:
        parsed = parsed.next
    return tagger
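The tokenize/lemmatize pipeline can be exercised without MeCab installed by stubbing the tagger output. The sketch below is an assumption about shape only: it mimics a token with a surface form and a lemma (fugashi exposes the lemma inside token.feature; it is flattened to one attribute here), purely to show the filtering and lemmatization flow:

```python
from dataclasses import dataclass

@dataclass
class FakeToken:
    surface: str
    lemma: str  # stand-in for the lemma field inside fugashi's token.feature

STOPWORDS = {'は', 'を'}

def is_stopword(surface):
    return surface in STOPWORDS

def tokenize_lemmatize(tokens, remove_stopwords=True, lemmatize=False):
    # `tokens` stands in for tagger.parseToNodeList(text)
    if remove_stopwords:
        tokens = [t for t in tokens if not is_stopword(t.surface.strip())]
    if lemmatize:
        return [t.lemma for t in tokens]
    return [t.surface for t in tokens]

toks = [FakeToken('猫', '猫'), FakeToken('は', 'は'), FakeToken('走った', '走る')]
print(tokenize_lemmatize(toks))                   # ['猫', '走った']
print(tokenize_lemmatize(toks, lemmatize=True))   # ['猫', '走る']
```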
bhanpuramufaddal / load_dotenv.sh
Created May 31, 2023 11:02 — forked from mihow/load_dotenv.sh
Load environment variables from dotenv / .env file in Bash
if [ -f .env ]
then
    export $(cat .env | xargs)
fi
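For code already running in Python, the same dotenv idea can be sketched in a few lines. This is a minimal stand-in, not the python-dotenv package: it reads KEY=VALUE lines into os.environ and skips blanks and comments.

```python
import os

def load_dotenv(path='.env'):
    """Load KEY=VALUE pairs from a dotenv file into os.environ."""
    if not os.path.isfile(path):
        return
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            # skip blank lines, comments, and lines without '='
            if not line or line.startswith('#') or '=' not in line:
                continue
            key, _, value = line.partition('=')
            # drop surrounding single/double quotes from the value
            os.environ[key.strip()] = value.strip().strip('"\'')
```

Values with embedded spaces survive here, which the bare `export $(cat .env | xargs)` form does not handle.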
bhanpuramufaddal / jupyter-ipykernel-virtualenv.txt
Last active May 27, 2023 10:23 — forked from swedishmike/gist:902fb27d627313c31a95e31c44e302ac
Adding and removing virtual environments to Jupyter notebook
## Create the virtual environment
conda create -n 'environment_name'
## Activate the virtual environment
conda activate 'environment_name'
## Make sure that ipykernel is installed
pip install --user ipykernel
## Add the new virtual environment to Jupyter
bhanpuramufaddal / KonohaLangchainChunking.md
Last active May 23, 2023 07:08
Use Konoha and Langchain's CharacterTextSplitter to generate chunks of a specific size

Installation

pip install 'konoha[SentenceTokenizer]'
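The idea can be sketched without either library: split the text into sentences, then greedily pack sentences into chunks no larger than a target size. The regex sentence splitter below is a naive stand-in for Konoha's SentenceTokenizer, and the packer for Langchain's character splitter:

```python
import re

def split_sentences(text):
    # naive stand-in for konoha's SentenceTokenizer:
    # split after Japanese or ASCII sentence terminators
    parts = re.split(r'(?<=[。！？.!?])\s*', text)
    return [p for p in parts if p]

def chunk_sentences(sentences, max_chars=100):
    # greedily pack whole sentences into chunks of at most max_chars
    chunks, current = [], ''
    for sent in sentences:
        if current and len(current) + len(sent) > max_chars:
            chunks.append(current)
            current = sent
        else:
            current += sent
    if current:
        chunks.append(current)
    return chunks
```

A single sentence longer than `max_chars` still becomes its own oversized chunk; the real splitter has options for that case.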
bhanpuramufaddal / SpacyJapaneseChunking.md
Last active May 22, 2023 06:22
Split Text into Sentences
bhanpuramufaddal / unicodeNormalization.py
Created May 21, 2023 16:56
Unicode normalization to get rid of "\u3000" (ideographic space)
import unicodedata

jText = "あ　32"  # the space here is U+3000 (ideographic space)
jTextNormal = unicodedata.normalize('NFKD', jText)  # compatibility decomposition folds it to a plain space
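NFKC (the composed form) is often the handier choice for this cleanup: it folds the ideographic space U+3000 to a plain ASCII space and full-width digits to their ASCII counterparts in one pass:

```python
import unicodedata

j_text = 'あ\u3000３２'  # ideographic space + full-width digits
print(unicodedata.normalize('NFKC', j_text))  # あ 32
```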