@dbuos
Created December 27, 2023 21:27
Detect Language Text Dataset
from datasets import load_dataset
from transformers import pipeline
import torch

# Run on the first GPU when available, otherwise fall back to CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
pipe = pipeline("text-classification", model="papluca/xlm-roberta-base-language-detection", device=device)


def classify_lang(elem):
    # With batched=True, elem['text'] is a list of strings; the pipeline returns one result per string.
    results = pipe(elem['text'], truncation=True)
    return {'lang': [r['label'] for r in results]}


oass = load_dataset("OpenAssistant/oasst_top1_2023-08-25", split="train")
oass = oass.map(classify_lang, batched=True, batch_size=32)

# Switch the output format to pandas so that slicing returns a DataFrame.
oass.set_format(type='pandas')
oass_df = oass[:]
# Number of rows per detected language.
print(oass_df['lang'].value_counts())