Since the previous method cost money (Google Speech-to-Text, https://reneelin2019.medium.com/convert-video-to-srt-subtitle-file-with-google-speech-to-text-service-cloud-storage-and-jupyter-bc32bbd01746), I'd like to study an open-source library for extracting text from videos or audio. Mozilla DeepSpeech is a widely used one. Training our own model might be super difficult, but using the pre-trained models is relatively simple. The Jupyter notebook is here: https://github.com/reneelin1712/autoTranslation/blob/main/deepspeech_audio2text.ipynb
Prepare the audio file
First, we need to extract the audio from the video, and ffmpeg is a useful tool for this. The audio needs to be a WAV file at 16kHz (or 8kHz), mono, because the pre-trained model expects this format.
!pip3 install ffmpeg

# convert video to audio
!ffmpeg -i dogs.mp4 -b:a 64K -ar 16000 -ac 1 -vn dogs_audio.wav

# if you have audio but not in the right format
!ffmpeg -i dogs.wav -vn -ar 16000 -ac 1 dogs_audio.wav
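To confirm the conversion produced what DeepSpeech expects, a quick sanity check with Python's standard wave module can help. This is a minimal sketch I added for illustration; the file name dogs_audio.wav matches the command above.

import wave

# sanity check: the pre-trained model expects 16kHz, mono, 16-bit PCM
with wave.open('dogs_audio.wav', 'rb') as fin:
    print('sample rate:', fin.getframerate())   # should be 16000
    print('channels:', fin.getnchannels())      # should be 1 (mono)
    print('sample width:', fin.getsampwidth())  # should be 2 bytes (16-bit)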
Install DeepSpeech & download pre-trained model
!pip3 install deepspeech
!curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
!curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer
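As a side note, the deepspeech package also installs a command-line client, so we can try a transcription without writing any Python. A minimal sketch, assuming the model files above are in the working directory:

!deepspeech --model deepspeech-0.9.3-models.pbmm --scorer deepspeech-0.9.3-models.scorer --audio dogs_audio.wav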
Extract text
from deepspeech import Model
import wave
import numpy as np

# load the acoustic model and enable the external scorer (language model)
model = Model('./deepspeech-0.9.3-models.pbmm')
model.enableExternalScorer('./deepspeech-0.9.3-models.scorer')

# read the 16kHz mono WAV file into a 16-bit integer buffer
audio_name = 'dogs_audio.wav'
fin = wave.open(audio_name, 'rb')
audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)
fin.close()

# perform inference
inferred_text = model.stt(audio)
print(inferred_text)
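Since the original goal was generating SRT subtitles, it is worth noting that the model can also return character-level timing via sttWithMetadata. Below is a rough sketch of pulling out per-word start times; it reuses the model and audio variables from above, and the word-grouping logic is my own assumption, not part of the notebook.

# request the top transcript with character-level timing metadata
metadata = model.sttWithMetadata(audio, 1)
tokens = metadata.transcripts[0].tokens

# group characters into words and record each word's start time (in seconds)
words, current, start = [], '', None
for token in tokens:
    if token.text == ' ':
        if current:
            words.append((start, current))
        current, start = '', None
    else:
        if start is None:
            start = token.start_time
        current += token.text
if current:
    words.append((start, current))

for start, word in words:
    print(f'{start:.2f}s  {word}')

These timestamps could then be grouped into subtitle segments for an SRT file.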
The transcribed text does not seem quite accurate, but at least it is free. I'd like to study how to transcribe other languages and how to utilize the GPU in Colab in the future.