This short README details the process I followed to perform automatic speech recognition on a 48+ minute audio interview.
First, I converted the m4a file to mp3 using ffmpeg. Assuming you have an audio file named interview.m4a to process:
ffmpeg -i interview.m4a -codec:a libmp3lame -qscale:a 1 interview.mp3

For this experiment, I used a version of wav2vec2 to perform automatic speech recognition (ASR) over the audio in overlapping 10-second chunks. Wav2vec2 is a deep neural network architecture for ASR that relies on Connectionist Temporal Classification (CTC). Introduced in 2020, variants of wav2vec2 remain state of the art on many public benchmarks.
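As a reference point, here is a minimal sketch of overlapping chunked inference using the Hugging Face pipeline API. The chunk and stride values below are illustrative assumptions, not necessarily the exact settings used in stt.py:

import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="gxbag/wav2vec2-large-960h-lv60-self-with-wikipedia-lm",
    device=0 if torch.cuda.is_available() else -1,
)

# chunk_length_s splits the long recording into 10-second windows, and
# stride_length_s keeps overlap on each side so words falling on chunk
# boundaries are not cut off.
result = asr("interview.mp3", chunk_length_s=10, stride_length_s=(2, 2))
print(result["text"])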
To further improve performance, this experiment pairs wav2vec2 with a 5-gram language model originating from a 05/01/2020 dump of the English Wikipedia.
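Under the hood, LM-boosted decoding can be done with transformers' Wav2Vec2ProcessorWithLM, which runs a beam search over the CTC logits and rescores hypotheses with the n-gram model. A minimal sketch, assuming the checkpoint bundles the language-model files and using librosa (an assumption on my part) for audio loading:

import torch
import librosa
from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM

model_id = "gxbag/wav2vec2-large-960h-lv60-self-with-wikipedia-lm"
processor = Wav2Vec2ProcessorWithLM.from_pretrained(model_id)
model = AutoModelForCTC.from_pretrained(model_id)

# Load a short clip at the 16 kHz sample rate the model expects.
speech, _ = librosa.load("interview.mp3", sr=16_000, duration=10.0)
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# batch_decode beam-searches the logits, rescoring with the 5-gram LM.
transcript = processor.batch_decode(logits.numpy()).text[0]
print(transcript)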
I wrote a simple script for generating a transcript for an audio file (see stt.py). The dependencies can be installed via Anaconda:
# create a virtual environment
conda create -n asr python=3.9 ipython -y
conda activate asr
# install the Rust toolchain (needed to build some of the Python dependencies)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# install the project's Python dependencies
pip install -r requirements.txt

Once the project dependencies are all installed, the script can be run against mp3 files using the following command:
python stt.py -i interview.mp3 -o transcript.txt

Available options for running the script can be reviewed by running the following command:
python stt.py -h

NOTE: The gxbag/wav2vec2-large-960h-lv60-self-with-wikipedia-lm pre-trained model is not optimized for production (e.g., no quantization) and its peak RAM use is quite high (roughly 17 GB).
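If the memory footprint is a problem, one mitigation (not part of this experiment, just a sketch) is PyTorch dynamic quantization, which converts the model's linear-layer weights to int8 at load time in exchange for a small accuracy hit:

import torch
from transformers import AutoModelForCTC

model = AutoModelForCTC.from_pretrained(
    "gxbag/wav2vec2-large-960h-lv60-self-with-wikipedia-lm"
)

# Quantize only the Linear modules; activations stay in float.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)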
Some ideas for further improving the results:

- Before performing ASR, segment the audio into speaker-specific segments using speaker diarization (see the sketch after this list).
- Train a domain-specific language model using a large corpus.
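As a sketch of the diarization idea, pyannote.audio (not one of this project's dependencies, so an assumption here; its pretrained pipeline also requires a Hugging Face access token) could be used like this:

from pyannote.audio import Pipeline

# Hypothetical diarization pass; "hf_..." stands in for a real token.
diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization", use_auth_token="hf_..."
)
diarization = diarizer("interview.wav")

# Each turn is a time span with a speaker label; ASR could then be run
# per turn instead of over the whole file.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s-{turn.end:.1f}s {speaker}")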