Convert Video/Audio to Text with Mozilla DeepSpeech

Renee LIN
2 min read · Oct 27, 2021


Since the previous method (Google Speech-to-Text) costs money, I'd like to study an open-source library for extracting text from video or audio. Mozilla DeepSpeech is a widely used one. Training our own model might be very difficult, but using the pre-trained models is relatively simple. The Jupyter notebook is here

Prepare the audio file

First, we need to extract the audio from the video; ffmpeg is a useful tool for this. The audio needs to be a WAV file sampled at 16 kHz or 8 kHz, because that is the format the pre-trained model was trained on.

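# ffmpeg is a system tool, not a pip package (Colab usually has it preinstalled)
!apt-get install -y ffmpeg
# convert video to audio: 16 kHz sample rate, mono, no video stream
!ffmpeg -i dogs.mp4 -b:a 64K -ar 16000 -ac 1 -vn dogs_audio.wav
# if you already have audio but not in the right format
!ffmpeg -i dogs.wav -vn -ar 16000 -ac 1 dogs_audio.wav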
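Before moving on, it can help to confirm that the converted file really has the expected format. The following is a minimal sketch using only Python's standard `wave` module; the helper names `describe_wav`, `matches_target`, and `write_silence` are hypothetical and introduced here for illustration. It assumes ffmpeg wrote 16-bit PCM, which is its usual default for WAV output.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Return the header fields of a WAV file that matter for speech-to-text."""
    with wave.open(path, "rb") as wav:
        return {
            "sample_rate": wav.getframerate(),
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "duration_seconds": wav.getnframes() / wav.getframerate(),
        }


def matches_target(info, sample_rate=16000, channels=1, sample_width_bytes=2):
    """True when the file is mono, 16-bit, and at the requested sample rate."""
    return (
        info["sample_rate"] == sample_rate
        and info["channels"] == channels
        and info["sample_width_bytes"] == sample_width_bytes
    )


def write_silence(path, seconds=1, sample_rate=16000):
    """Write a silent mono 16-bit WAV, used here only as a stand-in file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * sample_rate * seconds)


if __name__ == "__main__":
    # Demo on a generated file; point describe_wav at dogs_audio.wav instead
    # to check the real output of the ffmpeg step.
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
    write_silence(demo_path)
    info = describe_wav(demo_path)
    print(info, matches_target(info))
```

If `matches_target` returns False for the real file, re-running the ffmpeg command with `-ar 16000 -ac 1` is the likely fix.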

Install DeepSpeech & download pre-trained model

!pip3 install deepspeech
!curl -LO
!curl -LO

Extract text

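from deepspeech import Model
import wave
import numpy as np

model = Model('./deepspeech-0.9.3-models.pbmm')
# open the WAV file produced by the ffmpeg step
fin = wave.open('dogs_audio.wav', 'rb')
# read all frames as 16-bit signed integers, the input stt() expects
audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)
fin.close()
# Perform inference
inferred_text = model.stt(audio)
print(inferred_text)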
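The loading and inference steps can also be wrapped in one function that refuses audio the model was not built for. This is only a sketch: `transcribe` is a hypothetical helper name, and it assumes the model object exposes `sampleRate()` and `stt()`, as `deepspeech.Model` does in the 0.9 releases.

```python
import wave

import numpy as np


def transcribe(model, wav_path):
    """Run speech-to-text on a WAV file after checking it fits the model.

    `model` is expected to be a deepspeech.Model, or any object that
    provides sampleRate() and stt().
    """
    with wave.open(wav_path, "rb") as wav:
        rate = wav.getframerate()
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit audio")
        if rate != model.sampleRate():
            raise ValueError(
                f"audio is {rate} Hz but the model expects {model.sampleRate()} Hz"
            )
        frames = wav.readframes(wav.getnframes())
    # Interpret the raw bytes as 16-bit signed samples
    audio = np.frombuffer(frames, dtype=np.int16)
    return model.stt(audio)


# Usage (assumes deepspeech is installed and the model file is present):
# from deepspeech import Model
# model = Model('./deepspeech-0.9.3-models.pbmm')
# print(transcribe(model, 'dogs_audio.wav'))
```

Failing early with a clear message is a design choice here: audio at the wrong sample rate does not necessarily raise an error on its own and may just produce a poor transcript.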

The text does not seem very accurate, but it is free anyway. In the future, I'd like to study how to transcribe other languages and how to use the GPU in Colab.


