@bhanpuramufaddal
Created June 20, 2023 12:50
Tokenization and Lemmatization using Mecab and Fugashi
from os import getenv
from typing import Iterable, Text

from MeCab import Tagger


def _get_tagger() -> Tagger:
    opts = getenv('MECAB_OPTS', '-d /usr/local/Cellar/mecab-ipadic/2.7.0-20070801/lib/mecab/dic/ipadic')
    tagger = Tagger(opts)
    # for some reason the first request to a fresh tagger doesn't produce
    # output, so pre-warm it here once to avoid serving daft results later
    parsed = tagger.parseToNode('サザエさんは走った')
    while parsed:
        parsed = parsed.next
    return tagger


def _tokenize(sentence: Text) -> Iterable[Text]:
    parsed = _get_tagger().parseToNode(sentence)
    while parsed:
        token = parsed.surface.strip()
        if token:  # skip BOS/EOS nodes, whose surface is empty
            yield token
        parsed = parsed.next