This guide provides a comprehensive overview of audio processing in Python, covering the basics: reading and writing audio files, audio manipulation, and audio analysis.
I recommend using either pydub AudioSegment or librosa to read and write audio files in Python. Pydub is a simple and easy-to-use library for audio manipulation in Python, while librosa is a more advanced library that provides additional functionality for audio analysis.
from pydub import AudioSegment
# Read an audio file
audio = AudioSegment.from_file("audio.wav")
# convert to different sample rate, channels, and sample width
audio = audio.set_channels(1) # Convert stereo to mono
audio = audio.set_frame_rate(16000) # Set sample rate to 16 kHz
audio = audio.set_sample_width(2) # Set sample width to 16 bits
# Write an audio file
audio.export("output.mp3", format="mp3")

Pydub provides a wide range of audio manipulation functions, including trimming, fading, and overlaying audio files. Here are some examples of common audio manipulation tasks using pydub:
from pydub import AudioSegment
# Read an audio file
audio = AudioSegment.from_file("audio.wav")
# Trim the audio file
trimmed_audio = audio[1000:5000] # Trim from 1s to 5s
# Fade in and out
faded_audio = trimmed_audio.fade_in(1000).fade_out(1000) # Fade in and out over 1s
# Overlay audio files
overlay_audio = trimmed_audio.overlay(faded_audio) # Overlay the trimmed and faded audio

You can use the functions below to convert pydub audio to numpy arrays and vice versa. This is useful for further audio processing and analysis using numpy. Make sure to convert to 1-channel, 16 kHz, 2-byte (16-bit) audio before converting to a numpy array. Most ASR systems are trained on 16 kHz audio, so 16 kHz should be the default sampling rate.
import numpy as np
from pydub import AudioSegment
def numpy_to_audio_segment(array, sampling_rate):
    # Scale float32 samples in [-1, 1] up to int16
    array = np.array(np.array(array, dtype=np.float32) * 2**15, dtype=np.int16)
    return AudioSegment(
        array.tobytes(), frame_rate=sampling_rate, sample_width=2, channels=1
    )
def audio_segment_to_numpy(audio):
    samples = [audio.get_array_of_samples()]
    fp_arr = np.array(samples).T.astype(np.float32)
    fp_arr /= np.iinfo(samples[0].typecode).max
    return np.array(fp_arr, dtype=np.float32).squeeze()
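The conversion above boils down to int16 ↔ float32 scaling; here is a numpy-only sketch of what happens to each sample (the sample values are arbitrary examples):

```python
import numpy as np

# float32 samples in [-1, 1), as audio_segment_to_numpy would produce
float_samples = np.array([0.0, 0.5, -0.5, 0.99], dtype=np.float32)

# Scale up to int16, as numpy_to_audio_segment does
int_samples = (float_samples * 2**15).astype(np.int16)

# Scale back down; note np.iinfo(np.int16).max == 32767, so the
# round trip is close but not bit-exact (2**15 vs 2**15 - 1)
restored = int_samples.astype(np.float32) / np.iinfo(np.int16).max
```

The tiny asymmetry between the 2**15 scale-up and the 32767 scale-down is inaudible, but worth knowing about if you compare arrays bit-for-bit.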
from pydub import AudioSegment
# Read an audio file
audio = AudioSegment.from_file("audio.wav")
# convert to different sample rate, channels, and sample width
audio = audio.set_channels(1) # Convert stereo to mono
audio = audio.set_frame_rate(16000) # Set sample rate to 16 kHz
audio = audio.set_sample_width(2) # Set sample width to 16 bits
np_audio = audio_segment_to_numpy(audio)
audio = numpy_to_audio_segment(np_audio, 16000)
# Write an audio file
audio.export("output.mp3", format="mp3")

Since you are implementing a real-time application, you should be aware of transcoding latency: the time it takes to convert audio from one format to another. This latency varies with the complexity of the audio and the target format. To minimize transcoding latency, use a simple and efficient audio format like WAV.
audio.export("output.wav", format="wav")

Pydub has some basic audio analysis functions, such as getting the duration and RMS of an audio file. Here are some examples of audio analysis using pydub:
from pydub import AudioSegment
# Read an audio file
audio = AudioSegment.from_file("audio.wav")
# Get the duration of the audio file
duration = len(audio) / 1000 # Duration in seconds
# Get the root mean square (RMS) of the audio file
rms = audio.rms # RMS value

Depending on the choice of ASR system, you might need to perform voice activity detection (VAD): detecting the presence of human speech in an audio signal. You can perform voice activity detection on your side; there are multiple ways to do it:
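The simplest approach is a hand-rolled energy threshold over short frames; here is a numpy sketch where the frame length (10 ms at 16 kHz) and threshold are arbitrary illustrative values, tested on synthetic input:

```python
import numpy as np

def energy_vad(samples, frame_len=160, threshold=0.01):
    """Mark each 10 ms frame (at 16 kHz) as speech if its RMS exceeds the threshold."""
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames.astype(np.float32) ** 2).mean(axis=1))
    return rms > threshold

# Quiet first half, louder second half
quiet = np.zeros(1600, dtype=np.float32)
loud = 0.5 * np.sin(2 * np.pi * 440 * np.arange(1600) / 16000).astype(np.float32)
flags = energy_vad(np.concatenate([quiet, loud]))
```

Pure energy thresholding is fragile under background noise, which is why the options below exist.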
Pydub provides a simple silence detection function that can be used to perform voice activity detection. Here is an example of voice activity detection using pydub:
from pydub import AudioSegment
from pydub.silence import detect_nonsilent
audio = AudioSegment.from_file("audio.wav")
# Detect non-silent parts of the audio
nonsilent = detect_nonsilent(audio, min_silence_len=100, silence_thresh=-40)
# Get the start and end times of the non-silent parts
vad = [(start, end) for start, end in nonsilent]
print(vad)

Sometimes pydub silence detection may not work as expected due to background noise. In such cases, you can denoise the audio before performing voice activity detection to improve its accuracy. For example, you can use https://pypi.org/project/denoiser/ for denoising:
from denoiser import pretrained
import torch
denoise_model = None
def init_denoiser():
    global denoise_model
    if denoise_model is not None:
        return denoise_model
    denoise_model = pretrained.master64().cuda()
    # disable autograd
    for param in denoise_model.parameters():
        param.requires_grad = False
    return denoise_model
def denoise_audio(audio):
    with torch.no_grad():
        wav = torch.from_numpy(audio).t().cuda()
        denoise_model = init_denoiser()
        denoised = denoise_model(wav[None])[0]
        result = denoised.cpu().numpy()
    return result
from pydub import AudioSegment
audio = AudioSegment.from_file("audio.wav")
# convert to different sample rate, channels, and sample width
audio = audio.set_channels(1) # Convert stereo to mono
audio = audio.set_frame_rate(16000) # Set sample rate to 16 kHz
audio = audio.set_sample_width(2) # Set sample width to 16 bits
np_audio = audio_segment_to_numpy(audio)
denoised_audio = denoise_audio(np_audio)
audio = numpy_to_audio_segment(denoised_audio, 16000)
audio.export("denoised_audio.wav", format="wav")

If CUDA or PyTorch isn't available for you, you can try simple denoising with https://pypi.org/project/noisereduce/
import noisereduce as nr
def rough_denoise(audio):
    np_audio = audio_segment_to_numpy(audio)
    np_audio = nr.reduce_noise(y=np_audio, sr=audio.frame_rate)
    audio = numpy_to_audio_segment(np_audio, audio.frame_rate)
    return audio
audio = rough_denoise(audio)

The downsides of the machine learning approach:
- It can be slow on some devices
- It may be inaccurate
Here is an example of VAD using the pyannote/segmentation model from Hugging Face.
from torch import Tensor
from pyannote.audio.pipelines import VoiceActivityDetection
from pyannote.audio import Model
_model = None
_pipeline = None
def init_pipeline(device="cuda"):
    global _model
    global _pipeline
    if _model is None:
        _model = Model.from_pretrained(
            "pyannote/segmentation",
            use_auth_token="<TOKEN FROM HUGGING FACE>",
        )
        _model.to(device)
        _pipeline = VoiceActivityDetection(segmentation=_model)
        _pipeline.instantiate(
            {
                "onset": 0.710,
                "offset": 0.581,
                "min_duration_on": 0.055,
                "min_duration_off": 0.098,
            }
        )
    return _pipeline
def vad(np_audio, device="cuda"):
    pipeline = init_pipeline(device)
    return [
        (round(turn.start * 1000), round(turn.end * 1000))
        for turn, _, label in pipeline(
            {
                "waveform": Tensor(np_audio).to(device),
                "sample_rate": 16000,
            }
        ).itertracks(yield_label=True)
    ]
from pydub import AudioSegment
audio = AudioSegment.from_file("audio.wav")
# convert to different sample rate, channels, and sample width
audio = audio.set_channels(1) # Convert stereo to mono
audio = audio.set_frame_rate(16000) # Set sample rate to 16 kHz
audio = audio.set_sample_width(2) # Set sample width to 16 bits
np_audio = audio_segment_to_numpy(audio)
print(vad([np_audio])) # wrap in a list so the waveform tensor has shape (channel, time)
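Whichever method produced them, raw VAD segments often need post-processing before being fed to an ASR system. Here is a sketch that merges segments separated by less than a gap threshold; the 300 ms default and the example segments are arbitrary illustrative values:

```python
def merge_segments(segments, max_gap_ms=300):
    """Merge (start_ms, end_ms) segments whose gap is below max_gap_ms."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap_ms:
            # Gap is small: extend the previous segment
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

merged = merge_segments([(0, 500), (650, 1200), (2000, 2400)])
```

Merging short gaps avoids chopping a single utterance into fragments at every brief pause.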