Convert Video to srt/subtitle File with Google Speech-to-Text Service, Cloud Storage and Jupyter Notebook

Renee LIN
Oct 20, 2021


I used the SpeechRecognition library to turn speech into text, but found that the length of audio it accepts is limited by Google (SpeechRecognition uses Google's API). Since I want to transcribe longer videos, I decided to use Google's API directly. Besides, this way you get a plain text file as well as an srt file.
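The Speech-to-Text service works on audio, so the video's sound track has to be extracted first. The post does not show that step; a minimal sketch, assuming ffmpeg is installed and using a hypothetical input name (`sage.mp4`), could build the command like this. The sample rate, channel count and 16-bit PCM encoding are chosen to match the parameters used later in the notebook.

```python
import os
import shutil
import subprocess

def build_ffmpeg_command(video_path, wav_path, sample_rate=44100, channels=2):
    """Build an ffmpeg command that extracts 16-bit PCM WAV audio from a video."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", video_path,         # input video
        "-vn",                    # drop the video stream
        "-acodec", "pcm_s16le",   # 16-bit little-endian PCM (LINEAR16)
        "-ar", str(sample_rate),  # sample rate in Hz
        "-ac", str(channels),     # number of audio channels
        wav_path,
    ]

# Hypothetical input name, for illustration only.
cmd = build_ffmpeg_command("sage.mp4", "converted-sage.wav")
print(" ".join(cmd))

# Run the conversion only when ffmpeg is on the PATH and the input exists.
if shutil.which("ffmpeg") and os.path.exists("sage.mp4"):
    subprocess.run(cmd, check=True)
```

The resulting WAV file is what gets uploaded to the bucket.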


1. Open a Google Cloud account and set up Cloud Storage; the audio file is saved there.

2. Enable the Speech-to-Text API and generate a key (a JSON file). You might need to watch Google's official tutorial to figure it out: Level Up - Automated Subtitles with AI

3. I use Colab too, so I save my API key and Jupyter notebook on Google Drive.

%pip install --upgrade google-cloud-speech
%pip install --upgrade google-cloud-storage

# This is because I put the API JSON key on my Google Drive;
# skip this if you use a local environment.
from google.colab import drive
drive.mount('/content/gdrive')

# Ensure the file is accessible.
!ls /content/gdrive/'My Drive'/'Colab Notebooks'/temp

# Refer to the Google tutorial to check how to save the key.
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/content/gdrive/My Drive/Colab Notebooks/temp/temp_speech.json"

# Ensure the path is set correctly.
print(os.environ["GOOGLE_APPLICATION_CREDENTIALS"])
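A wrong path or a damaged key file usually only shows up later as an authentication error, so it can help to check the file up front. The following is a minimal sketch, assuming the key is a service account key; `check_key_file` is a hypothetical helper, not part of any Google library.

```python
import json
import os

def check_key_file(path):
    """Return a list of problems found with a service account key file."""
    if not os.path.isfile(path):
        return ["file not found: " + path]
    try:
        with open(path) as f:
            key = json.load(f)
    except json.JSONDecodeError:
        return ["file is not valid JSON"]
    problems = []
    # Fields normally present in a service account key file.
    for field in ("type", "project_id", "private_key", "client_email"):
        if field not in key:
            problems.append("missing field: " + field)
    return problems

key_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS", "")
print(check_key_file(key_path) or "key file looks fine")
```

An empty list means the file exists, parses as JSON and contains the usual fields; it does not prove that the key is still valid on Google's side.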

Connect to the cloud storage

from google.colab import auth
from google.cloud import storage

# Sign in so that gcloud and gsutil can act on your behalf.
auth.authenticate_user()

project_id = 'prac-259701'
!gcloud config set project {project_id}
!gsutil ls
bucket_name = 'sage_ff14'

Test whether it is connected

# Test: check the connection to Cloud Storage.
from google.cloud import storage

project_id = 'prac-259701'
bucket_name = 'sage_ff14'

def list_blobs(bucket_name):
    """Lists all the blobs in the bucket."""
    # Note: Client.list_blobs requires at least package version 1.17.0.
    storage_client = storage.Client(project_id)
    blobs = storage_client.list_blobs(bucket_name)
    for blob in blobs:
        print(blob.name)

list_blobs(bucket_name)
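The post assumes the audio file is already in the bucket. If it is not, it can be uploaded from the notebook as well. Here is a minimal sketch; `upload_audio` is a hypothetical helper that takes an existing `storage.Client` and relies on the client library's `bucket()`, `blob()` and `upload_from_filename()` calls.

```python
def upload_audio(storage_client, bucket_name, local_path, blob_name):
    """Upload a local file to a bucket and return its gs:// URI."""
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(local_path)
    return "gs://{}/{}".format(bucket_name, blob_name)

# Usage in the notebook (needs valid credentials, so it is left commented out):
# from google.cloud import storage
# uri = upload_audio(storage.Client(project_id), bucket_name,
#                    "converted-sage.wav", "converted-sage.wav")
```

The returned URI has the same `gs://[BUCKET]/[FILE]` form that the transcription function expects.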

Now we can start using the service

!pip install srt

import srt
from google.cloud import speech

# Parameters about the audio file.
sample_rate_hertz = 44100
language_code = "en-US"
audio_channel_count = 2
encoding = 'LINEAR16'
out_file = "subtitle"   # base name of the output files
max_chars = 40          # maximum characters per subtitle line
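These parameters have to describe the actual audio file; if the sample rate or channel count does not match, the request may be rejected or give poor results. For a WAV file the values can be read from the header with Python's standard `wave` module. A minimal sketch follows; `read_wav_params` is a hypothetical helper, and the demo file is one second of silence standing in for the real recording.

```python
import os
import tempfile
import wave

def read_wav_params(path):
    """Read the values the recognition config needs from a WAV header."""
    with wave.open(path, "rb") as wav:
        return {
            "sample_rate_hertz": wav.getframerate(),
            "audio_channel_count": wav.getnchannels(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_seconds": wav.getnframes() / wav.getframerate(),
        }

# Write one second of stereo silence as a stand-in for the real file.
demo_path = os.path.join(tempfile.gettempdir(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(2)
    wav.setsampwidth(2)        # 2 bytes per sample = 16-bit PCM
    wav.setframerate(44100)
    wav.writeframes(b"\x00" * (44100 * 2 * 2))

print(read_wav_params(demo_path))
```

A 16-bit PCM file corresponds to the `LINEAR16` encoding; the duration is also handy for estimating how much of the free quota a file will use.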

Below is Google's official code for requesting the Speech-to-Text service; I only made a few adjustments.

1. Function to call the service
def long_running_recognize(storage_uri):
    """
    Transcribe long audio file from Cloud Storage using asynchronous speech
    recognition.

    storage_uri: URI for audio file in GCS, e.g. gs://[BUCKET]/[FILE]
    """
    # print("Transcribing {} ...".format(args.storage_uri))
    client = speech.SpeechClient()

    # Describe the audio being sent and the features we want back.
    operation = client.long_running_recognize(
        config={
            "enable_word_time_offsets": True,
            "enable_automatic_punctuation": True,
            "sample_rate_hertz": sample_rate_hertz,
            "language_code": language_code,
            "audio_channel_count": audio_channel_count,
            "encoding": encoding,
        },
        audio={"uri": storage_uri},
    )
    # Block until the asynchronous job has finished.
    response = operation.result()

    subs = []
    for result in response.results:
        # First alternative is the most probable result
        subs = break_sentences(subs, result.alternatives[0])

    print("Transcribing finished")
    return subs

2. Break the long text into sentences, because the srt file needs them

def break_sentences(subs, alternative):
    firstword = True
    charcount = 0
    idx = len(subs) + 1
    content = ""

    for w in alternative.words:
        if firstword:
            # first word in sentence, record start time
            # start = w.start_time.ToTimedelta()
            start = w.start_time

        charcount += len(w.word)
        content += " " + w.word.strip()

        if ("." in w.word or "!" in w.word or "?" in w.word or
                charcount > max_chars or
                ("," in w.word and not firstword)):
            # break sentence at: . ! ? or line length exceeded
            # also break if , and not first word
            subs.append(srt.Subtitle(index=idx,
                                     start=start,
                                     # end=w.end_time.ToTimedelta(),
                                     end=w.end_time,
                                     content=srt.make_legal_content(content)))
            firstword = True
            idx += 1
            content = ""
            charcount = 0
        else:
            firstword = False
    return subs
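To see how the breaking rule behaves without calling the API, here is a standalone sketch of the same idea on made-up data. `Word` and `split_into_cues` are hypothetical names; the word timings (one word per second) are invented for illustration.

```python
from collections import namedtuple
from datetime import timedelta

Word = namedtuple("Word", "word start_time end_time")

def split_into_cues(words, max_chars=40):
    """Group timed words into (start, end, text) cues."""
    cues = []
    content, charcount, firstword, start = "", 0, True, None
    for w in words:
        if firstword:
            start = w.start_time          # first word of a cue sets its start
        charcount += len(w.word)
        content += " " + w.word.strip()
        ends_sentence = any(p in w.word for p in ".!?")
        if ends_sentence or charcount > max_chars or ("," in w.word and not firstword):
            # Close the cue at the end time of the current word.
            cues.append((start, w.end_time, content.strip()))
            content, charcount, firstword = "", 0, True
        else:
            firstword = False
    return cues

# Made-up words, one per second.
text = "Hello there, this is a test. Bye!"
words = [Word(t, timedelta(seconds=i), timedelta(seconds=i + 1))
         for i, t in enumerate(text.split())]
for cue in split_into_cues(words):
    print(cue)
```

In this sketch, trailing words that never reach a break condition are not emitted; flushing the leftover content after the loop would be one way to keep them.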

3. Save the files

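def write_srt(subs):
    srt_file = out_file + ".srt"
    print("Writing {} subtitles to: {}".format(language_code, srt_file))
    # srt.compose turns the list of Subtitle objects into srt-formatted text
    with open(srt_file, 'w') as f:
        f.writelines(srt.compose(subs))

def write_txt(subs):
    txt_file = out_file + ".txt"
    print("Writing text to: {}".format(txt_file))
    # one subtitle per line, without timestamps
    with open(txt_file, 'w') as f:
        for s in subs:
            f.write(s.content.strip() + "\n")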
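For readers who have not looked inside an srt file: each entry is an index, a time range written as `HH:MM:SS,mmm --> HH:MM:SS,mmm`, the text, and a blank line. The sketch below builds one entry by hand to illustrate the layout; `srt_timestamp` and `srt_block` are hypothetical helpers, and in the notebook the srt library takes care of this.

```python
from datetime import timedelta

def srt_timestamp(td):
    """Format a timedelta as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = td // timedelta(milliseconds=1)   # whole milliseconds
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return "{:02d}:{:02d}:{:02d},{:03d}".format(hours, minutes, seconds, millis)

def srt_block(index, start, end, text):
    """Build one SRT entry: index, time range, text, then a blank line."""
    return "{}\n{} --> {}\n{}\n\n".format(
        index, srt_timestamp(start), srt_timestamp(end), text)

print(srt_block(1, timedelta(seconds=2),
                timedelta(seconds=6, milliseconds=500),
                "this is a test."), end="")
```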

Put in the file address, and the files we need are generated:

storage_uri = 'gs://' + 'sage_ff14' + '/' + 'converted-sage.wav'
subs = long_running_recognize(storage_uri)
write_srt(subs)
write_txt(subs)

The complete notebook is here

I need to check out Mozilla DeepSpeech in the future, since the Google API is not free: it only allows 60 minutes of free audio recognition.


