
@myedibleenso
Last active August 29, 2025 00:19
A summary of how to automatically transcribe English-language mp3 files using Wav2vec2.

Overview

This short README details the process I followed to perform automatic speech recognition on a 48+ minute audio interview.

1. Convert m4a to mp3

I converted the m4a source to mp3 using ffmpeg. Assuming you have an audio file named interview.m4a to process:

ffmpeg -i interview.m4a -codec:a libmp3lame -qscale:a 1 interview.mp3

2. ASR using Wav2vec2 with an n-gram language model

For this experiment, I used a version of wav2vec2 to perform automatic speech recognition (ASR) over the audio in overlapping 10-second chunks. Wav2vec2 is a deep neural network architecture for ASR that relies on Connectionist Temporal Classification (CTC). Introduced in 2020, variants of wav2vec2 remain state of the art on many public benchmarks.
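To illustrate the overlapping-chunk idea (the script below relies on the chunking built into the transformers pipeline; the function and numbers here are my own simplified sketch), consecutive windows share audio on both sides of each cut so predictions near the boundaries can be discarded and the rest stitched together:

```python
def chunk_bounds(total_s: float, chunk_s: float = 10.0,
                 stride_left_s: float = 4.0, stride_right_s: float = 2.0):
    """Yield (start, end) windows in seconds; each window overlaps its
    neighbors, so only the middle of each window needs to be trusted."""
    # seconds of genuinely "new" audio contributed by each window
    step = chunk_s - stride_left_s - stride_right_s
    start = 0.0
    while start < total_s:
        yield (start, min(start + chunk_s, total_s))
        start += step

# a 25-second clip tiled into overlapping 10-second windows
windows = list(chunk_bounds(25.0))
```

With a 10-second chunk and (4, 2) strides, each window advances only 4 seconds, so every boundary is covered by the interior of a neighboring window.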

To further improve performance, this experiment pairs wav2vec2 with a 5-gram language model originating from a 05/01/2020 dump of the English Wikipedia.
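To give a feel for why the n-gram model helps (a toy illustration only; pyctcdecode actually runs a beam search over CTC frame probabilities, and every name and number below is invented), the decoder combines the acoustic score of a candidate transcript with the language model's score, so acoustically similar candidates are disambiguated by which word sequence is more plausible:

```python
import math

# toy unigram "LM": log-probabilities standing in for a Wikipedia-trained model
lm_logp = {
    "recognize": math.log(0.6), "speech": math.log(0.3),
    "wreck": math.log(0.05), "a": math.log(0.02),
    "nice": math.log(0.02), "beach": math.log(0.01),
}

def score(candidate: str, acoustic_logp: float, alpha: float = 0.5) -> float:
    """Combine acoustic and LM log-scores; alpha weights the LM's influence."""
    lm = sum(lm_logp.get(w, math.log(1e-6)) for w in candidate.split())
    return acoustic_logp + alpha * lm

# two acoustically near-identical candidates
c1 = score("recognize speech", acoustic_logp=-1.00)
c2 = score("wreck a nice beach", acoustic_logp=-0.95)
best = "recognize speech" if c1 > c2 else "wreck a nice beach"
```

Even though the second candidate scores slightly better acoustically, the LM term overwhelmingly favors the first, which is the behavior the 5-gram model contributes here.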

I wrote a simple script for generating a transcript for an audio file (see stt.py). The dependencies can be installed using conda and pip:

# create a virtual environment
conda create -n asr python=3.9 ipython -y
conda activate asr
# install Rust (required to build some of the Python dependencies)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
pip install -r requirements.txt

Once the project dependencies are all installed, the script can be run against mp3 files using the following command:

python stt.py -i interview.mp3 -o transcript.txt

Available options for running the script can be reviewed by running the following command:

python stt.py -h

NOTE: The gxbag/wav2vec2-large-960h-lv60-self-with-wikipedia-lm pre-trained model is not optimized for production (e.g., no quantization), and its peak RAM use is quite high (around 17 GB).
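If you want to check peak memory on your own machine, a minimal sketch using only the standard library (Unix-only; note that ru_maxrss is reported in kilobytes on Linux but bytes on macOS):

```python
import resource
import sys

def peak_rss_gb() -> float:
    """Return this process's peak resident set size in GB."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports kilobytes; macOS reports bytes
    divisor = 1024 ** 2 if sys.platform != "darwin" else 1024 ** 3
    return rss / divisor

# call after the pipeline returns to see the transcription's memory footprint
print(f"peak RSS: {peak_rss_gb():.2f} GB")
```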

Future improvements

  • Before performing ASR, segment the audio into speaker-specific segments using speaker diarization.
  • Train a domain-specific language model using a large in-domain corpus.
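For the second item, the realistic route is KenLM (whose ARPA output is what pyctcdecode consumes). As a toy illustration of what such a model stores (the function and corpus below are invented for the example), an n-gram model is at heart a table of n-gram counts turned into smoothed conditional probabilities:

```python
from collections import Counter

def ngram_counts(tokens: list[str], n: int = 3) -> Counter:
    """Count all n-grams in a token sequence; an ARPA-style model is
    essentially these counts converted into smoothed probabilities."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# tiny stand-in for a domain-specific corpus
corpus = "the model saves the transcript and the model saves the file".split()
counts = ngram_counts(corpus, n=3)
```

A domain corpus (e.g., interview transcripts) would shift these counts toward the vocabulary and phrasing the decoder actually needs to prefer.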
requirements.txt

ffmpeg==1.4
tokenizers==0.12.1
transformers==4.20.1
datasets==2.3.2
speechbrain==0.5.12
torchaudio~=0.14.0
pyctcdecode==0.3.0
kenlm @ https://github.com/kpu/kenlm/archive/master.zip
stt.py

#!/usr/bin/python
from typing import Text
import argparse
import sys
import os

from transformers import (
    pipeline,
    Pipeline,
    AutoProcessor,
    Wav2Vec2ProcessorWithLM
)


def create_pipeline(model_name: Text) -> Pipeline:
    processor = AutoProcessor.from_pretrained(model_name)
    vocab_dict = processor.tokenizer.get_vocab()
    # see https://github.com/huggingface/transformers/issues/16759
    processor_with_lm = Wav2Vec2ProcessorWithLM.from_pretrained(model_name)
    return pipeline(
        task="automatic-speech-recognition",
        model=model_name,
        tokenizer=processor_with_lm,
        feature_extractor=processor_with_lm.feature_extractor,
        framework="pt",
        decoder=processor_with_lm.decoder
    )


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Perform ASR over the provided mp3 file.")
    # see https://huggingface.co/gxbag/wav2vec2-large-960h-lv60-self-with-wikipedia-lm
    DEFAULT_MODEL_NAME = "gxbag/wav2vec2-large-960h-lv60-self-with-wikipedia-lm"
    parser.add_argument(
        "-i",
        "--input",
        dest="input_file",
        type=str,
        default="interview.mp3",
        help="mp3 file to process."
    )
    parser.add_argument(
        "-m",
        "--model",
        dest="model_name",
        type=str,
        default=DEFAULT_MODEL_NAME,
        help="Huggingface model to use. See https://huggingface.co/models?pipeline_tag=automatic-speech-recognition"
    )
    parser.add_argument(
        "-o",
        "--out",
        dest="output_file",
        type=str,
        default="transcript.txt",
        help="Output file (transcript)."
    )
    args = parser.parse_args()
    # ensure the input exists
    if not os.path.exists(args.input_file):
        print(f"{args.input_file} does not exist.")
        sys.exit(-1)
    # ensure we're processing MP3s
    if not args.input_file.lower().endswith("mp3"):
        print(f"{args.input_file} must be an mp3 file.")
        sys.exit(-1)
    pipe = create_pipeline(model_name=args.model_name)
    res = pipe(
        args.input_file,
        # avoid OOM errors
        chunk_length_s=10,
        # see https://huggingface.co/blog/asr-chunking
        stride_length_s=(4, 2)
    )
    with open(args.output_file, "w") as out:
        out.write(res["text"])