Building a speech-to-text model using the Transformers library from huggingface

Written by: Keerti Banweer

An application that translates audio to text can be valuable for tasks as simple as transcribing a class lecture to generate text data for machine learning applications. As a student, taking notes is critical for learning in a classroom, and one might not be able to write or type fast enough without missing key concepts. An audio-to-text translator can therefore be an essential asset. Many data scientists also need this kind of tool to collect vast amounts of data for different machine learning applications. Transcriptions of audio or video files can also be used to create subtitles, helping online tools reach wider audiences. With the popularity of deep learning models, many existing libraries allow us to use pre-trained speech-to-text models to build a simple application that translates audio or video.
This blog will use the huggingface interface because it is straightforward and can help us load any state-of-the-art model for our application. Huggingface provides an excellent collection of pre-trained models that convert audio/video files to text. The transformers library in the huggingface interface provides many architectures for Natural Language Processing (NLP) and Natural Language Generation (NLG).

We will use the Wav2Vec2 model to translate the audio files. This model was proposed in the paper wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. Using the huggingface transformers library, we can load the pre-trained Wav2Vec2 model with the identifier facebook/wav2vec2-base-960h.

To explore more architectures in the huggingface transformers library, see the huggingface documentation.

We will use the librosa library to load the audio files. Here is the documentation for the librosa library.

Install the required libraries

To use the pre-trained models from huggingface, we need to install transformers. To load the audio files, we will install the librosa library. The following commands install transformers and librosa:

    #Installing transformers
    !pip install -q transformers
    #Installing librosa to manage the audio file
    !pip install librosa

Import the libraries

In the next step, we will import the required libraries for managing and listening to the audio files and for loading the pre-trained models. The libraries are pytorch, librosa, IPython.display, and transformers.

    import torch
    import librosa as lb
    import IPython.display as listen
    from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

Load the pre-trained models

We are using the pre-trained Wav2Vec2 model, whose output has to be decoded using Wav2Vec2Processor. According to the documentation on huggingface, the Wav2Vec2-Base-960h model by Facebook is pre-trained and fine-tuned on 960 hours of Librispeech speech audio sampled at 16000Hz. We will initialize the processor using the pre-trained Wav2Vec2Processor, and the model can be loaded using Wav2Vec2ForCTC.

    # Initialize the processor
    processor = Wav2Vec2Processor.from_pretrained('facebook/wav2vec2-base-960h')

    # Initialize the model
    model = Wav2Vec2ForCTC.from_pretrained('facebook/wav2vec2-base-960h')
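As a quick sanity check, we can read the expected sampling rate from the processor's feature extractor (this assumes the standard Wav2Vec2Processor attributes; the value should be 16000 for this checkpoint):

    # The sampling rate the pre-trained model expects (16000 Hz for this checkpoint)
    print(processor.feature_extractor.sampling_rate)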

Sample audio files

We will read a sound file from the Librispeech dataset available at this link. I have picked one random audio file from the test-clean folder of the collection; the details about the Librispeech dataset can be found here. I converted the flac file from this dataset to a wav file. This dataset provides examples of clean audio files without any background noise. I am also using a famous line from the Lord of the Rings movie trilogy. This audio file is not clean and contains background sound, as in the movie. The one-ring file is one of the famous lines delivered by Ian McKellen as Gandalf the Grey. Both audio files are in .wav format, and we can listen to them using IPython.display: the Audio function opens a .wav file and plays the audio. This way we can compare what we hear with what the model translates.

    # Listen to the clean data files from Librispeech
    listen.Audio("librispeech-audio-sample.wav", autoplay=True)
    # Listen to the audio with background noise
    listen.Audio("one-ring.wav", autoplay=True)

Load the audio files as waveforms

We will load the audio files using librosa.load, which loads each audio file as a waveform and returns the sampling rate. For the Wav2Vec2 model, we will use a sampling rate of 16000Hz because the model is pre-trained on this frequency. When using different models from the transformers library, set the sampling rate to the appropriate value for that model when loading the audio file.

    # Loading the clean audio file
    wf_clean, sr_clean = lb.load('librispeech-audio-sample.wav', sr = 16000)

    # Loading the audio file with background sound
    wf_noise, sr_noise = lb.load('one-ring.wav', sr = 16000)
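Optionally, we can confirm that both files loaded at the requested rate and check their lengths; this is a small sanity check using librosa's get_duration:

    # Confirm the sampling rate and duration (in seconds) of each waveform
    print(sr_clean, lb.get_duration(y=wf_clean, sr=sr_clean))
    print(sr_noise, lb.get_duration(y=wf_noise, sr=sr_noise))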

Each waveform is then passed to the processor, which converts it into input tensors in the PyTorch format via the argument return_tensors='pt'. Passing the sampling rate along lets the processor check that the audio matches the rate the model expects.

    # Process the waveform generated from the clean audio using librosa
    input_clean = processor(wf_clean, sampling_rate=sr_clean, return_tensors='pt').input_values

    # Process the waveform generated from the noisy audio using librosa
    input_noise = processor(wf_noise, sampling_rate=sr_noise, return_tensors='pt').input_values

The tensors are then fed into the model to generate the non-normalized predicted values (the logits).

    # Retrieve the logits from the model; gradients are not needed for inference
    with torch.no_grad():
        logts_clean = model(input_clean).logits
        logts_noise = model(input_noise).logits
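Each logits tensor holds frame-level scores with shape (batch_size, time_steps, vocab_size), where the vocabulary is the model's character set. An optional check of the shapes makes the next step easier to follow:

    # Each logits tensor has shape (batch_size, time_steps, vocab_size)
    print(logts_clean.shape)
    print(logts_noise.shape)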

We then take the argmax of the logits to get the predicted token IDs and decode them into text.

    # We will take the argmax value from the logits retrieved from the model
    pred_clean = torch.argmax(logts_clean, dim=-1)
    transcription_clean = processor.batch_decode(pred_clean)

    pred_noise = torch.argmax(logts_noise, dim=-1)
    transcription_noise = processor.batch_decode(pred_noise)
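Putting the steps above together, here is a minimal sketch of a helper that goes from a .wav file on disk to a transcription. The function name transcribe is just for illustration; it reuses the processor and model loaded earlier:

    def transcribe(audio_path):
        # Load the audio as a 16000 Hz waveform, as expected by the model
        waveform, sampling_rate = lb.load(audio_path, sr=16000)
        # Convert the waveform into PyTorch input tensors
        inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors='pt').input_values
        # Run the model without tracking gradients and keep the raw logits
        with torch.no_grad():
            logits = model(inputs).logits
        # Pick the most likely token at each time step and decode to text
        predicted_ids = torch.argmax(logits, dim=-1)
        return processor.batch_decode(predicted_ids)[0]

    # Example usage with the clean Librispeech sample
    print(transcribe('librispeech-audio-sample.wav'))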

We now print the translated text along with the original transcription of the audio.

    # Print the transcription text for the clean audio file
    print(transcription_clean)

    # The original transcription of the audio
    print("['THE STROLLERS TOOK THEIR PART IN IT WITH HEARTY ZEST NOW THAT THEY HAD SOME CHANCE OF BEATING OFF THEIR FOES']")

The model's prediction for the audio file with background sound is not accurate.

    # Print the transcription text for the noisy audio file
    print(transcription_noise)

    # The original transcription of the audio
    print("['One Ring to rule them all, One Ring to find them, One Ring to bring them all, and in the darkness bind them']")

The translated output is shown below. For each audio file, the first text is the translation by the model and the second is the original text:

    # Translated text (clean audio)
    ['THE STROLLERS TOOK THEIR PART IN IT WITH HEARTY ZEST NOW THAT THEY HAD SOME CHANCE OF BEATING OFF THEIR FOES']

    # Original text
    ['THE STROLLERS TOOK THEIR PART IN IT WITH HEARTY ZEST NOW THAT THEY HAD SOME CHANCE OF BEATING OFF THEIR FOES']

    # Translated text (noisy audio)
    ['ONE RING   TO WHOM T WHOMONE RING TO FINDONE RING  ING THE MORE AAADARA SWITE']

    # Original text
    ['One Ring to rule them all, One Ring to find them, One Ring to bring them all, and in the darkness bind them']

Building a speech-to-text model

When comparing the translated text from the clean vs. noisy audio files, the model is not able to transcribe the noisy audio accurately. When librosa loads the audio into a waveform, it includes the background noise in the waveform, which confuses the Wav2Vec2 model. In the next section, we will talk about evaluating speech-to-text models on raw data such as audio files from podcasts or televised interviews. We will also talk about the soundfile library for loading the audio files.
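As a preview, here is a minimal sketch of loading an audio file with soundfile instead of librosa. It assumes the soundfile package is installed and that the .wav file is already mono and sampled at 16000 Hz, since soundfile reads the file as-is and does not resample:

    import soundfile as sf

    # soundfile returns the raw samples and the file's native sampling rate
    wf_sf, sr_sf = sf.read('librispeech-audio-sample.wav')
    print(wf_sf.shape, sr_sf)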