This guide provides a comprehensive overview of audio processing in Python, covering the basics: reading and writing audio files, audio manipulation, and audio analysis.
I recommend using either pydub AudioSegment or librosa to read and write audio files in Python. Pydub is a simple and easy-to-use library for audio manipulation in Python, while librosa is a more advanced library that provides additional functionality for audio analysis.
from pydub import AudioSegment
# Read an audio file
audio = AudioSegment.from_file("audio.wav")
# convert to different sample rate, channels, and sample width
audio = audio.set_channels(1) # Convert stereo to mono
audio = audio.set_frame_rate(16000) # Set sample rate to 16 kHz
audio = audio.set_sample_width(2) # Set sample width to 16 bits
# Write an audio file
audio.export("output.mp3", format="mp3")

Pydub provides a wide range of audio manipulation functions, including trimming, fading, and overlaying audio files. Here are some examples of common audio manipulation tasks using pydub:
from pydub import AudioSegment
# Read an audio file
audio = AudioSegment.from_file("audio.wav")
# Trim the audio file
trimmed_audio = audio[1000:5000] # Trim from 1s to 5s
# Fade in and out
faded_audio = trimmed_audio.fade_in(1000).fade_out(1000) # Fade in and out over 1s
# Overlay audio files
overlay_audio = trimmed_audio.overlay(faded_audio) # Overlay the trimmed and faded audio

You can use the functions below to convert pydub audio to numpy arrays and vice versa. This is useful for further audio processing and analysis using numpy. Make sure to convert to 1-channel, 16 kHz, 2-byte (16-bit) audio before converting to a numpy array. Most ASR systems are trained on 16 kHz audio, so 16 kHz should be the default sampling rate.
import numpy as np
from pydub import AudioSegment
def numpy_to_audio_segment(array, sampling_rate):
    # Scale float32 samples in [-1, 1] up to int16
    array = np.array(np.array(array, dtype=np.float32) * 2**15, dtype=np.int16)
    return AudioSegment(
        array.tobytes(), frame_rate=sampling_rate, sample_width=2, channels=1
    )
def audio_segment_to_numpy(audio):
    samples = [audio.get_array_of_samples()]
    fp_arr = np.array(samples).T.astype(np.float32)
    fp_arr /= np.iinfo(samples[0].typecode).max
    return np.array(fp_arr, dtype=np.float32).squeeze()
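The conversion above boils down to int16 ↔ float32 scaling; here is a numpy-only sketch of what happens to each sample (the sample values are arbitrary examples):

```python
import numpy as np

# float32 samples in [-1, 1), as audio_segment_to_numpy would produce
float_samples = np.array([0.0, 0.5, -0.5, 0.99], dtype=np.float32)

# Scale up to int16, as numpy_to_audio_segment does
int_samples = (float_samples * 2**15).astype(np.int16)

# Scale back down; note np.iinfo(np.int16).max == 32767, so the
# round trip is close but not bit-exact (2**15 vs 2**15 - 1)
restored = int_samples.astype(np.float32) / np.iinfo(np.int16).max
```

The tiny asymmetry between the 2**15 scale-up and the 32767 scale-down is inaudible, but worth knowing about if you compare arrays bit-for-bit.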
from pydub import AudioSegment
# Read an audio file
audio = AudioSegment.from_file("audio.wav")
# convert to different sample rate, channels, and sample width
audio = audio.set_channels(1) # Convert stereo to mono
audio = audio.set_frame_rate(16000) # Set sample rate to 16 kHz
audio = audio.set_sample_width(2) # Set sample width to 16 bits
np_audio = audio_segment_to_numpy(audio)
audio = numpy_to_audio_segment(np_audio, 16000)
# Write an audio file
audio.export("output.mp3", format="mp3")

Since you are implementing a real-time application, you should be aware of transcoding latency: the time it takes to convert audio from one format to another. This latency varies with the complexity of the audio and the target format. To minimize transcoding latency, use a simple and efficient audio format like WAV.
audio.export("output.wav", format="wav")

Pydub has some basic audio analysis functions, such as getting the duration and RMS of an audio file. Here are some examples of audio analysis using pydub:
from pydub import AudioSegment
# Read an audio file
audio = AudioSegment.from_file("audio.wav")
# Get the duration of the audio file
duration = len(audio) / 1000 # Duration in seconds
# Get the root mean square (RMS) of the audio file
rms = audio.rms # RMS value

Depending on the choice of ASR system, you might need to perform voice activity detection (VAD): detecting the presence of human speech in an audio signal. You can perform voice activity detection on your side; there are multiple ways to do it:
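The simplest approach is a hand-rolled energy threshold over short frames; here is a numpy sketch where the frame length (10 ms at 16 kHz) and threshold are arbitrary illustrative values, tested on synthetic input:

```python
import numpy as np

def energy_vad(samples, frame_len=160, threshold=0.01):
    """Mark each 10 ms frame (at 16 kHz) as speech if its RMS exceeds the threshold."""
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames.astype(np.float32) ** 2).mean(axis=1))
    return rms > threshold

# Quiet first half, louder second half
quiet = np.zeros(1600, dtype=np.float32)
loud = 0.5 * np.sin(2 * np.pi * 440 * np.arange(1600) / 16000).astype(np.float32)
flags = energy_vad(np.concatenate([quiet, loud]))
```

Pure energy thresholding is fragile under background noise, which is why the options below exist.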
Pydub provides a simple silence detection function that can be used to perform voice activity detection. Here is an example of voice activity detection using pydub:
from pydub import AudioSegment
from pydub.silence import detect_nonsilent
audio = AudioSegment.from_file("audio.wav")
# Detect non-silent parts of the audio
nonsilent = detect_nonsilent(audio, min_silence_len=100, silence_thresh=-40)
# Get the start and end times of the non-silent parts
vad = [(start, end) for start, end in nonsilent]
print(vad)

Sometimes pydub silence detection may not work as expected due to background noise. In such cases, you can denoise the audio before performing voice activity detection to improve its accuracy. For example, you can use https://pypi.org/project/denoiser/ for denoising:
from denoiser import pretrained
import torch
denoise_model = None
def init_denoiser():
    global denoise_model
    if denoise_model is not None:
        return denoise_model
    denoise_model = pretrained.master64().cuda()
    # disable autograd
    for param in denoise_model.parameters():
        param.requires_grad = False
    return denoise_model
def denoise_audio(audio):
    with torch.no_grad():
        wav = torch.from_numpy(audio).t().cuda()
        denoise_model = init_denoiser()
        denoised = denoise_model(wav[None])[0]
        result = denoised.cpu().numpy()
    return result
from pydub import AudioSegment
audio = AudioSegment.from_file("audio.wav")
# convert to different sample rate, channels, and sample width
audio = audio.set_channels(1) # Convert stereo to mono
audio = audio.set_frame_rate(16000) # Set sample rate to 16 kHz
audio = audio.set_sample_width(2) # Set sample width to 16 bits
np_audio = audio_segment_to_numpy(audio)
denoised_audio = denoise_audio(np_audio)
audio = numpy_to_audio_segment(denoised_audio, 16000)
audio.export("denoised_audio.wav", format="wav")

If CUDA or PyTorch isn't available for you, you can try simple denoising with https://pypi.org/project/noisereduce/
import noisereduce as nr
def rough_denoise(audio):
    np_audio = audio_segment_to_numpy(audio)
    np_audio = nr.reduce_noise(y=np_audio, sr=audio.frame_rate)
    audio = numpy_to_audio_segment(np_audio, audio.frame_rate)
    return audio
audio = rough_denoise(audio)

The downsides of the machine learning approach:
- It can be slow on some devices
- It may be inaccurate
Here is an example of VAD using the pyannote/segmentation model from Hugging Face.
from torch import Tensor
from pyannote.audio.pipelines import VoiceActivityDetection
from pyannote.audio import Model
_model = None
_pipeline = None
def init_pipeline(device="cuda"):
    global _model
    global _pipeline
    if _model is None:
        _model = Model.from_pretrained(
            "pyannote/segmentation",
            use_auth_token="<TOKEN FROM HUGGING FACE>",
        )
        _model.to(device)
        _pipeline = VoiceActivityDetection(segmentation=_model)
        _pipeline.instantiate(
            {
                "onset": 0.710,
                "offset": 0.581,
                "min_duration_on": 0.055,
                "min_duration_off": 0.098,
            }
        )
    return _pipeline
def vad(np_audio, device="cuda"):
    pipeline = init_pipeline(device)
    return [
        (round(turn.start * 1000), round(turn.end * 1000))
        for turn, _, label in pipeline(
            {
                "waveform": Tensor(np_audio).to(device),
                "sample_rate": 16000,
            }
        ).itertracks(yield_label=True)
    ]
from pydub import AudioSegment
audio = AudioSegment.from_file("audio.wav")
# convert to different sample rate, channels, and sample width
audio = audio.set_channels(1) # Convert stereo to mono
audio = audio.set_frame_rate(16000) # Set sample rate to 16 kHz
audio = audio.set_sample_width(2) # Set sample width to 16 bits
np_audio = audio_segment_to_numpy(audio)
print(vad([np_audio])) # wrap in a list so the waveform tensor has shape (channel, time)
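Whichever method produced them, raw VAD segments often need post-processing before being fed to an ASR system. Here is a sketch that merges segments separated by less than a gap threshold; the 300 ms default and the example segments are arbitrary illustrative values:

```python
def merge_segments(segments, max_gap_ms=300):
    """Merge (start_ms, end_ms) segments whose gap is below max_gap_ms."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap_ms:
            # Gap is small: extend the previous segment
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

merged = merge_segments([(0, 500), (650, 1200), (2000, 2400)])
```

Merging short gaps avoids chopping a single utterance into fragments at every brief pause.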