Generating Word-By-Word Captions w/ Python

The Gist

Here's a link to a GitHub Gist containing the code; I'll also paste it below for ease of access. This post is based on a YouTube video I posted a while back, so feel free to check it out if you prefer video over text!

from moviepy.editor import VideoFileClip, TextClip, CompositeVideoClip
import whisper_timestamped as whisper

filename = "example.mp4"

screen_width = 1080
screen_height = 1920

def get_transcribed_text(filename):
    audio = whisper.load_audio(filename)
    model = whisper.load_model("small", device="cpu")
    results = whisper.transcribe(model, audio, language="en")

    return results["segments"]

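def get_text_clips(text, fontsize):
    text_clips = []
    for segment in text:
        for word in segment["words"]:
            # One caption per word, visible only while that word is spoken.
            # The styling arguments are placeholders; tweak them to taste.
            clip = TextClip(word["text"], fontsize=fontsize, color="white")
            clip = clip.set_start(word["start"]).set_end(word["end"])
            text_clips.append(clip.set_position("center"))
    return text_clips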

# Loading the video as a VideoFileClip
original_clip = VideoFileClip(filename)

# Load the audio in the video to transcribe it and get transcribed text
transcribed_text = get_transcribed_text(filename)
# Generate text elements for video using transcribed text
text_clip_list = get_text_clips(text=transcribed_text, fontsize=90)
# Create a CompositeVideoClip that we write to a file
final_clip = CompositeVideoClip([original_clip] + text_clip_list)

final_clip.write_videofile("final.mp4", codec="libx264")

The Idea

Not too long ago I started looking at auto-generating short-form content for platforms like TikTok and YouTube Shorts. I started off by making a simple generator for "Your Month, Your X" slideshows and showed it to some of my homies. One of my CompSci buddies from UCF, Andrew Proneks, came up with the ingenious idea of creating a Daily Horoscope Video Generator in Python.

We thought this would be a pretty fun project to play with in our free time. As I was researching how to generate word-by-word captions (like the ones on TikTok narration-style videos), I found that there weren't all that many video resources describing how to emulate these. I coded up a simple implementation and made a YouTube video for anyone looking to do something similar. This post serves as the written version of that video :D

Installing Dependencies

We'll need Python 3.7 or greater, and pip for installing moviepy and whisper-timestamped. MoviePy is a handy Python library for programmatically editing videos, while whisper-timestamped will transcribe the audio in our video so we can generate captions.

Another dependency we'll need, which does some work under the hood for both MoviePy and Whisper, is ffmpeg. It's used for loading media in a way that MoviePy and Whisper can understand, as well as for writing the resulting output to a file. You can learn how to download it on the official ffmpeg website.
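If you're not sure whether ffmpeg is set up, here's a quick sanity check. It's a minimal sketch that only looks for an executable named ffmpeg on your PATH; the helper name ffmpeg_available is made up for this post, and a passing check doesn't guarantee every library will pick up that same binary.

```python
import shutil


def ffmpeg_available(binary="ffmpeg"):
    """Return True if an executable with this name is found on PATH."""
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if ffmpeg_available():
        print("ffmpeg found at", shutil.which("ffmpeg"))
    else:
        print("ffmpeg not found on PATH; install it before continuing")
```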

Once we have ffmpeg ready, we're good to start pip installing everything!

# depending on your system, you might need to install ez_setup for
# moviepy to handle setting up ImageMagick in your system. This
# is used to help with adding captions on a video :D
pip install ez_setup

pip install moviepy

pip install whisper-timestamped

Setting Up Our Script

Let's create our Python script and start off by importing everything we need from MoviePy and whisper-timestamped. Let's also store the path to the video we'll be generating captions for in a variable called filename.

from moviepy.editor import VideoFileClip, TextClip, CompositeVideoClip
import whisper_timestamped as whisper

filename = "example.mp4"

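At a high level, we want our script to do the following:

1. Load the video into MoviePy
2. Transcribe the audio in the video to get word-level timestamps
3. Create a TextClip for each transcribed word
4. Composite the captions over the video and write the result to a file

Loading the video into MoviePy is simple: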

# Loading the video as a VideoFileClip
original_clip = VideoFileClip(filename)

Transcribing text

Next, let's take care of loading our model and transcribing our text. We'll abstract it out to a function get_transcribed_text. All we need to do is load our audio, load our model, and call whisper.transcribe to get a structure representing our text.

def get_transcribed_text(filename):
    audio = whisper.load_audio(filename)
    model = whisper.load_model("small", device="cpu")
    results = whisper.transcribe(model, audio, language="en")

    return results["segments"]
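Transcribing on the CPU can take a while, so if you're going to rerun the script a bunch while tweaking how the captions look, it might be worth caching the segments. Here's a rough sketch of one way to do that. The helper name load_or_transcribe and the cache file name are made up for this post, and it assumes the segments can be dumped to JSON as-is.

```python
import json
import os


def load_or_transcribe(cache_path, transcribe):
    """Return cached segments if cache_path exists; otherwise call
    transcribe(), save its result as JSON, and return it."""
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            return json.load(f)

    segments = transcribe()
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(segments, f, ensure_ascii=False, indent=2)
    return segments


# Hypothetical usage with the function from this post:
# transcribed_text = load_or_transcribe(
#     "example.segments.json", lambda: get_transcribed_text(filename)
# )
```

Delete the cache file whenever you change the video or the model, otherwise you'll keep getting the old transcription back.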

Generating our Captions

Running whisper.transcribe returns a dictionary with a couple of useful keys. I'll paste an example output from the whisper-timestamped GitHub repository (https://github.com/linto-ai/whisper-timestamped) below.

  "text": " Bonjour! Est-ce que vous allez bien?",
  "segments": [
      "id": 0,
      "seek": 0,
      "start": 0.5,
      "end": 1.2,
      "text": " Bonjour!",
      "tokens": [ 25431, 2298 ],
      "temperature": 0.0,
      "avg_logprob": -0.6674491882324218,
      "compression_ratio": 0.8181818181818182,
      "no_speech_prob": 0.10241222381591797,
      "confidence": 0.51,
      "words": [
          "text": "Bonjour!",
          "start": 0.5,
          "end": 1.2,
          "confidence": 0.51
      "id": 1,
      "seek": 200,
      "start": 2.02,
      "end": 4.48,
      "text": " Est-ce que vous allez bien?",
      "tokens": [ 50364, 4410, 12, 384, 631, 2630, 18146, 3610, 2506, 50464 ],
      "temperature": 0.0,
      "avg_logprob": -0.43492694334550336,
      "compression_ratio": 0.7714285714285715,
      "no_speech_prob": 0.06502953916788101,
      "confidence": 0.595,
      "words": [
          "text": "Est-ce",
          "start": 2.02,
          "end": 3.78,
          "confidence": 0.441
          "text": "que",
          "start": 3.78,
          "end": 3.84,
          "confidence": 0.948
          "text": "vous",
          "start": 3.84,
          "end": 4.0,
          "confidence": 0.935
          "text": "allez",
          "start": 4.0,
          "end": 4.14,
          "confidence": 0.347
          "text": "bien?",
          "start": 4.14,
          "end": 4.48,
          "confidence": 0.998
  "language": "fr"

The attribute we care about the most is segments. This contains a list of segment dictionaries, which themselves have a property words that stores a list of word objects. Each word object has the text representation of the word, the start timestamp of the word, and the end timestamp.
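To get a feel for that nesting before touching MoviePy, here's a small sketch that flattens the segments into plain (text, start, end) tuples. The helper name flatten_words is just something I'm using for illustration, and the sample data is a trimmed-down copy of the example output above.

```python
def flatten_words(segments):
    """Return (text, start, end) tuples for every word, in spoken order."""
    return [
        (word["text"], word["start"], word["end"])
        for segment in segments
        for word in segment["words"]
    ]


# Trimmed-down version of the example output above
segments = [
    {"words": [{"text": "Bonjour!", "start": 0.5, "end": 1.2}]},
    {"words": [
        {"text": "Est-ce", "start": 2.02, "end": 3.78},
        {"text": "que", "start": 3.78, "end": 3.84},
    ]},
]

for text, start, end in flatten_words(segments):
    print(f"{start:>5.2f} -> {end:>5.2f}  {text}")
```

Printing the timestamps like this is a cheap way to check that the transcription lines up with the audio before rendering anything.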

We want to loop over all the words and create a TextClip/caption element for each of these words. Let's create a function that makes a TextClip for each word and returns them all in an array.

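def get_text_clips(text, fontsize):
    text_clips = []
    for segment in text:
        for word in segment["words"]:
            # One caption per word, visible only while that word is spoken.
            # The styling arguments are placeholders; tweak them to taste.
            clip = TextClip(word["text"], fontsize=fontsize, color="white")
            clip = clip.set_start(word["start"]).set_end(word["end"])
            text_clips.append(clip.set_position("center"))
    return text_clips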

Feel free to play around with the parameters passed into TextClip based on how you want your text to look!

Putting it all together

We can use the functions we've created to put everything together. We'll use the array of TextClip elements to create a CompositeVideoClip we can write to a file 🥳

# Load the audio in the video to transcribe it and get transcribed text
transcribed_text = get_transcribed_text(filename)
# Generate text elements for video using transcribed text
text_clip_list = get_text_clips(text=transcribed_text, fontsize=90)
# Create a CompositeVideoClip that we write to a file
final_clip = CompositeVideoClip([original_clip] + text_clip_list)

final_clip.write_videofile("final.mp4", codec="libx264")
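If you'd rather not hard-code the input path, output path, and font size, one option is to read them from the command line. This is only a sketch using argparse from the standard library; the --output and --fontsize flag names are made up for this post, and their defaults mirror the values used in the script above.

```python
import argparse


def parse_args(argv=None):
    """Parse the command-line options for the caption script."""
    parser = argparse.ArgumentParser(
        description="Add word-by-word captions to a video"
    )
    parser.add_argument("filename", help="path to the input video")
    parser.add_argument("--output", default="final.mp4",
                        help="where to write the captioned video")
    parser.add_argument("--fontsize", type=int, default=90,
                        help="caption font size")
    return parser.parse_args(argv)


# Passing a list here is just for demonstration; call parse_args() with no
# arguments in a real script to read from sys.argv instead.
args = parse_args(["example.mp4", "--fontsize", "72"])
print(args.filename, args.output, args.fontsize)
```

From there you'd pass args.filename, args.fontsize, and args.output into the calls above in place of the hard-coded values.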

Congrats, you made it!

Wooooo we did it 🥳 We now have a pretty simple word-by-word caption generator. Here's a scuffed example video I made a while back using this script with a few changes (namely the font).

Keep in mind I was using the small version of Whisper. You can change the parameters on whisper.load_model() to use a heavier model like medium or large. You can even pull a compatible model from HuggingFace! Check out this link for more details on how to do this.

Thanks for reading this far! Hope this helps you produce some PEAK content on your favorite platform 👁️👅👁️

What I'm BUMPIN Today

It's been forever!! Here's a couple of tracks I've been listening to as of late.