@bhanpuramufaddal
Created June 18, 2023 06:56
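from fugashi import GenericTagger as Tagger

# The dictionary path is machine-specific (Homebrew mecab-ipadic); adjust it for your system.
tagger = Tagger('-r /dev/null -d /usr/local/Cellar/mecab-ipadic/2.7.0-20070801/lib/mecab/dic/ipadic')

def tokenize_lemmatize(text, remove_stopwords=True, lemmatize=False):
    """Tokenize Japanese text and return the tokens joined by single spaces."""
    tokens = tagger.parseToNodeList(text)
    if remove_stopwords:
        # NOTE: is_stopword is not defined in this gist; it must be provided elsewhere.
        tokens = [token for token in tokens if not is_stopword(token.surface.strip())]
    if lemmatize:
        # In IPAdic, feature[6] is the base (dictionary) form; '*' means none is available.
        tokens = (token.feature[6] if token.feature[6] != '*' else token.surface.strip() for token in tokens)
    else:
        tokens = (token.surface.strip() for token in tokens)
    return " ".join(tokens)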
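
The snippet above calls is_stopword, which the gist never defines. A minimal sketch of one possible helper follows. The STOPWORDS set is a hypothetical, illustration-only list of common particles and auxiliaries, not the author's list, and treating an empty surface as a stopword is likewise an assumption.

```python
# Hypothetical stopword list: a few common Japanese particles and auxiliaries,
# chosen only for illustration. Swap in a real list for actual use.
STOPWORDS = frozenset({"の", "は", "が", "を", "に", "で", "と", "も", "です", "ます"})


def is_stopword(surface):
    """Return True if the surface form is empty or in the illustrative set."""
    # Assumption: empty strings (e.g. whitespace-only tokens after strip())
    # are treated as stopwords so they never reach the joined output.
    return surface == "" or surface in STOPWORDS
```

Defining this helper before tokenize_lemmatize is called should be enough for the remove_stopwords=True path to run. A frozenset keeps each lookup constant-time.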