Since the previous method cost money (Google Speech-to-Text, https://reneelin2019.medium.com/convert-video-to-srt-subtitle-file-with-google-speech-to-text-service-cloud-storage-and-jupyter-bc32bbd01746), I'd like to study an open-source library for extracting text from videos or audio. Mozilla DeepSpeech is a widely used one. Training our own model might be super difficult, but using the pre-trained models is relatively simple. The Jupyter notebook is here: https://github.com/reneelin1712/autoTranslation/blob/main/deepspeech_audio2text.ipynb
Prepare the audio file
First, we need to extract the audio from the video, and ffmpeg is a useful tool for this. The audio needs to be a WAV file at 16kHz (or 8kHz), mono, because the pre-trained model expects this format.
!pip3 install ffmpeg

# convert video to audio
!ffmpeg -i dogs.mp4 -b:a 64K -ar 16000 -ac 1 -vn dogs_audio.wav

# if you have audio but not in the right format
!ffmpeg -i dogs.wav -vn -ar 16000 -ac 1 dogs_audio.wav
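To confirm the conversion produced what DeepSpeech expects, a quick sanity check with Python's standard wave module can help. This is a minimal sketch I added for illustration; the file name dogs_audio.wav matches the command above.

import wave

# sanity check: the pre-trained model expects 16kHz, mono, 16-bit PCM
with wave.open('dogs_audio.wav', 'rb') as fin:
    print('sample rate:', fin.getframerate())   # should be 16000
    print('channels:', fin.getnchannels())      # should be 1 (mono)
    print('sample width:', fin.getsampwidth())  # should be 2 bytes (16-bit)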
Install DeepSpeech & download pre-trained model
!pip3 install deepspeech
!curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
!curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer
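As a side note, the deepspeech package also installs a command-line client, so we can try a transcription without writing any Python. A minimal sketch, assuming the model files above are in the working directory:

!deepspeech --model deepspeech-0.9.3-models.pbmm --scorer deepspeech-0.9.3-models.scorer --audio dogs_audio.wav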
Extract text
from deepspeech import Model
import wave
import numpy as np

# load the acoustic model and enable the external scorer (language model)
model = Model('./deepspeech-0.9.3-models.pbmm')
model.enableExternalScorer('./deepspeech-0.9.3-models.scorer')

# read the 16kHz mono WAV file into a 16-bit integer buffer
audio_name = 'dogs_audio.wav'
fin = wave.open(audio_name, 'rb')
audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)
fin.close()

# perform inference
inferred_text = model.stt(audio)
print(inferred_text)
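Since the original goal was generating SRT subtitles, it is worth noting that the model can also return character-level timing via sttWithMetadata. Below is a rough sketch of pulling out per-word start times; it reuses the model and audio variables from above, and the word-grouping logic is my own assumption, not part of the notebook.

# request the top transcript with character-level timing metadata
metadata = model.sttWithMetadata(audio, 1)
tokens = metadata.transcripts[0].tokens

# group characters into words and record each word's start time (in seconds)
words, current, start = [], '', None
for token in tokens:
    if token.text == ' ':
        if current:
            words.append((start, current))
        current, start = '', None
    else:
        if start is None:
            start = token.start_time
        current += token.text
if current:
    words.append((start, current))

for start, word in words:
    print(f'{start:.2f}s  {word}')

These timestamps could then be grouped into subtitle segments for an SRT file.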
The transcribed text does not seem quite accurate, but at least it is free. I'd like to study how to transcribe other languages and how to utilize the GPU in Colab in the future.