
Mufaddal Bhanpurawala bhanpuramufaddal

  • Valuence INC
  • Tokyo, remote
bhanpuramufaddal / HexToStringEn.py
Created July 11, 2023 08:05
Convert escaped hex/unicode sequences in a string back to characters
def convert_hex_char(string):
    # interpret backslash escapes (e.g. '\\u3042', '\\x41') as the characters they encode
    decoded_string = bytes(string, 'utf-8').decode('unicode-escape')
    return decoded_string
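A quick usage sketch of the helper above (note: `unicode-escape` is only reliable when the input itself is ASCII; non-ASCII input can be mangled by the intermediate bytes round-trip):

```python
def convert_hex_char(string):
    return bytes(string, 'utf-8').decode('unicode-escape')

# '\\u3042' is the escaped form of the hiragana 'あ'
print(convert_hex_char('\\u3042'))      # あ
print(convert_hex_char('\\x41\\x42'))   # AB
```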
bhanpuramufaddal / TokenizeLemmatizeMecab.py
Created June 20, 2023 12:50
Tokenization and Lemmatization using Mecab and Fugashi
from os import getenv
from typing import Callable, Iterable, Text

from MeCab import Tagger

def _get_tagger() -> Tagger:
    opts = getenv('MECAB_OPTS', '-d /usr/local/Cellar/mecab-ipadic/2.7.0-20070801/lib/mecab/dic/ipadic')
    tagger = Tagger(opts)
    # for some reason the first request to the tagger doesn't produce output
bhanpuramufaddal / stopwords.py
Created June 18, 2023 07:09
Stopwords in different languages
# Reference
# https://advertools.readthedocs.io/en/master/advertools.stopwords.html
# install using 'pip install advertools'
import advertools as adv

# list the supported languages
adv.stopwords.keys()

# stopwords in Japanese
stopwords = adv.stopwords['japanese']
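To show how such a stopword collection is typically used, here is a minimal stand-in (the tiny hand-picked set below is an assumption for illustration; the real one comes from adv.stopwords['japanese']):

```python
# tiny stand-in for adv.stopwords['japanese']
japanese_stopwords = {'の', 'は', 'に', 'を', 'です'}

def remove_stopwords(tokens):
    # keep only tokens that are not in the stopword set
    return [t for t in tokens if t not in japanese_stopwords]

tokens = ['猫', 'は', '魚', 'を', '食べる']
print(remove_stopwords(tokens))  # ['猫', '魚', '食べる']
```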
from os import getenv
from fugashi import GenericTagger as Tagger

tagger = Tagger('-r /dev/null -d /usr/local/Cellar/mecab-ipadic/2.7.0-20070801/lib/mecab/dic/ipadic')

def tokenize_lemmatize(text, remove_stopwords=True, lemmatize=False):
    tokens = tagger.parseToNodeList(text)
    if remove_stopwords:
        # is_stopword checks the token surface against the stopword set above
        tokens = filter(lambda token: not is_stopword(token.surface.strip()), tokens)
    if lemmatize:
        ...

from MeCab import Tagger

def _get_tagger() -> Tagger:
    opts = getenv('MECAB_OPTS', '-d /usr/local/Cellar/mecab-ipadic/2.7.0-20070801/lib/mecab/dic/ipadic')
    tagger = Tagger(opts)
    # for some reason the first request to the tagger doesn't produce output,
    # so pre-warm it here once to avoid serving daft results later
    parsed = tagger.parseToNode('サザエさんは走った')
    while parsed:
        parsed = parsed.next
    return tagger
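The tokenize/lemmatize pipeline can be exercised without MeCab installed by stubbing the tagger output. The sketch below is an assumption about shape only: it mimics a token with a surface form and a lemma (fugashi exposes the lemma inside token.feature; it is flattened to one attribute here), purely to show the filtering and lemmatization flow:

```python
from dataclasses import dataclass

@dataclass
class FakeToken:
    surface: str
    lemma: str  # stand-in for the lemma field inside fugashi's token.feature

STOPWORDS = {'は', 'を'}

def is_stopword(surface):
    return surface in STOPWORDS

def tokenize_lemmatize(tokens, remove_stopwords=True, lemmatize=False):
    # `tokens` stands in for tagger.parseToNodeList(text)
    if remove_stopwords:
        tokens = [t for t in tokens if not is_stopword(t.surface.strip())]
    if lemmatize:
        return [t.lemma for t in tokens]
    return [t.surface for t in tokens]

toks = [FakeToken('猫', '猫'), FakeToken('は', 'は'), FakeToken('走った', '走る')]
print(tokenize_lemmatize(toks))                   # ['猫', '走った']
print(tokenize_lemmatize(toks, lemmatize=True))   # ['猫', '走る']
```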
bhanpuramufaddal / load_dotenv.sh
Created May 31, 2023 11:02 — forked from mihow/load_dotenv.sh
Load environment variables from dotenv / .env file in Bash
if [ -f .env ]
then
    export $(cat .env | xargs)
fi
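For code already running in Python, the same dotenv idea can be sketched in a few lines. This is a minimal stand-in, not the python-dotenv package: it reads KEY=VALUE lines into os.environ and skips blanks and comments.

```python
import os

def load_dotenv(path='.env'):
    """Load KEY=VALUE pairs from a dotenv file into os.environ."""
    if not os.path.isfile(path):
        return
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            # skip blank lines, comments, and lines without '='
            if not line or line.startswith('#') or '=' not in line:
                continue
            key, _, value = line.partition('=')
            # drop surrounding single/double quotes from the value
            os.environ[key.strip()] = value.strip().strip('"\'')
```

Values with embedded spaces survive here, which the bare `export $(cat .env | xargs)` form does not handle.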
bhanpuramufaddal / jupyter-ipykernel-virtualenv.txt
Last active May 27, 2023 10:23 — forked from swedishmike/gist:902fb27d627313c31a95e31c44e302ac
Adding and removing virtual environments to Jupyter notebook
## Create the virtual environment
conda create -n 'environment_name'
## Activate the virtual environment
conda activate 'environment_name'
## Make sure that ipykernel is installed
pip install --user ipykernel
## Add the new virtual environment to Jupyter
bhanpuramufaddal / KonohaLangchainChunking.md
Last active May 23, 2023 07:08
Use Konoha and Langchain's CharacterTextSplitter to generate chunks of a specific size

Installation

pip install 'konoha[SentenceTokenizer]'
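The idea can be sketched without either library: split the text into sentences, then greedily pack sentences into chunks no larger than a target size. The regex sentence splitter below is a naive stand-in for Konoha's SentenceTokenizer, and the packer for Langchain's character splitter:

```python
import re

def split_sentences(text):
    # naive stand-in for konoha's SentenceTokenizer:
    # split after Japanese or ASCII sentence terminators
    parts = re.split(r'(?<=[。！？.!?])\s*', text)
    return [p for p in parts if p]

def chunk_sentences(sentences, max_chars=100):
    # greedily pack whole sentences into chunks of at most max_chars
    chunks, current = [], ''
    for sent in sentences:
        if current and len(current) + len(sent) > max_chars:
            chunks.append(current)
            current = sent
        else:
            current += sent
    if current:
        chunks.append(current)
    return chunks
```

A single sentence longer than `max_chars` still becomes its own oversized chunk; the real splitter has options for that case.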
bhanpuramufaddal / SpacyJapaneseChunking.md
Last active May 22, 2023 06:22
Split Text into Sentences
bhanpuramufaddal / unicodeNormalization.py
Created May 21, 2023 16:56
Unicode normalization to get rid of "\u3000" (ideographic space)
import unicodedata

jText = "あ　32"  # the space here is U+3000 (ideographic space)
jTextNormal = unicodedata.normalize('NFKD', jText)  # compatibility decomposition folds it to a plain space
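NFKC (the composed form) is often the handier choice for this cleanup: it folds the ideographic space U+3000 to a plain ASCII space and full-width digits to their ASCII counterparts in one pass:

```python
import unicodedata

j_text = 'あ\u3000３２'  # ideographic space + full-width digits
print(unicodedata.normalize('NFKC', j_text))  # あ 32
```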