The Gist
Here's a link to a GitHub Gist containing the code; I'll also paste it below for ease of access. This post is based on a YouTube video I posted a while back, so feel free to check it out if you prefer video over text!
```python
from moviepy.editor import VideoFileClip, TextClip, CompositeVideoClip
import whisper_timestamped as whisper

filename = "example.mp4"
screen_width = 1080
screen_height = 1920

def get_transcribed_text(filename):
    audio = whisper.load_audio(filename)
    model = whisper.load_model("small", device="cpu")
    results = whisper.transcribe(model, audio, language="en")
    return results["segments"]

def get_text_clips(text, fontsize):
    text_clips = []
    for segment in text:
        for word in segment["words"]:
            text_clips.append(
                TextClip(word["text"],
                         fontsize=fontsize,
                         method='caption',
                         stroke_width=5,
                         stroke_color="white",
                         font="Arial-Bold",
                         color="white")
                .set_start(word["start"])
                .set_end(word["end"])
                .set_position("center")
            )
    return text_clips

# Loading the video as a VideoFileClip
original_clip = VideoFileClip(filename)

# Load the audio in the video to transcribe it and get transcribed text
transcribed_text = get_transcribed_text(filename)

# Generate text elements for the video using the transcribed text
text_clip_list = get_text_clips(text=transcribed_text, fontsize=90)

# Create a CompositeVideoClip that we write to a file
final_clip = CompositeVideoClip([original_clip] + text_clip_list)
final_clip.write_videofile("final.mp4", codec="libx264")
```
The Idea
Not too long ago I started looking at auto-generating short-form content for platforms like TikTok and YouTube Shorts. I started off by making a simple generator for "Your Month, Your X" slideshows and showed it to some of my homies. One of my CompSci buddies from UCF, Andrew Proneks, came up with the ingenious idea of creating a Daily Horoscope Video Generator in Python. We thought this would be a pretty fun project to play with in our free time. As I was researching how to generate word-by-word captions (like the ones on TikTok narration-style videos), I found that there weren't all that many video resources describing how to emulate them. I coded up a simple implementation and made a YouTube video for anyone looking to do something similar. This post serves as the written version of that video :D
Installing Dependencies
We'll need Python 3.7 or greater, and pip for installing MoviePy and whisper-timestamped. MoviePy is a handy Python library for programmatically editing videos, while whisper-timestamped will help us transcribe the audio in our video so we can generate captions.
Another dependency we'll need, which does some work under the hood for both MoviePy and Whisper, is ffmpeg. It's used for loading media in a way that MoviePy and Whisper can understand, as well as for writing the resulting output to a file. You can learn how to download it on the official ffmpeg website. Once we have ffmpeg ready, we're good to start pip installing everything!
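If you're not sure whether ffmpeg is already installed, here's a tiny helper (my own convenience sketch, not part of the original script) that checks whether an ffmpeg binary is on your PATH:

```python
import shutil

def ffmpeg_available():
    """Return True if an ffmpeg binary can be found on the system PATH."""
    return shutil.which("ffmpeg") is not None

print(ffmpeg_available())
```

If this prints `False`, install ffmpeg before running anything below.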
```bash
# depending on your system, you might need to install ez_setup for
# moviepy to handle setting up ImageMagick in your system. This
# is used to help with adding captions on a video :D
pip install ez_setup
pip install moviepy
pip install whisper-timestamped
```
Setting Up Our Script
Let's create our Python script and start off by importing everything we need from MoviePy and whisper-timestamped. Let's also store the path to the video we'll be generating captions for in a variable called `filename`.
```python
from moviepy.editor import VideoFileClip, TextClip, CompositeVideoClip
import whisper_timestamped as whisper

filename = "example.mp4"
```
At a high level, we want our script to do the following:
- Load our original video in a way MoviePy can understand
- Use Whisper to transcribe our video's audio and generate timestamps for each word
- Use the transcribed text and timestamps to create a text/caption element for each word
- Combine our original video and all our captions to create our final video and write it to a file
Loading the video into MoviePy is simple; we can just do the following:
```python
# Loading the video as a VideoFileClip
original_clip = VideoFileClip(filename)
```
Transcribing text
Next, let's take care of loading our model and transcribing our audio. We'll abstract it out into a function, `get_transcribed_text`. All we need to do is load our audio, load our model, and call `whisper.transcribe` to get a structure representing our text.
```python
def get_transcribed_text(filename):
    audio = whisper.load_audio(filename)
    model = whisper.load_model("small", device="cpu")
    results = whisper.transcribe(model, audio, language="en")
    return results["segments"]
```
Generating our Captions
Running `whisper.transcribe` returns a dictionary with a couple of useful attributes. I'll paste an example output from the [whisper-timestamped GitHub](https://github.com/linto-ai/whisper-timestamped) below.
```json
{
  "text": " Bonjour! Est-ce que vous allez bien?",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.5,
      "end": 1.2,
      "text": " Bonjour!",
      "tokens": [ 25431, 2298 ],
      "temperature": 0.0,
      "avg_logprob": -0.6674491882324218,
      "compression_ratio": 0.8181818181818182,
      "no_speech_prob": 0.10241222381591797,
      "confidence": 0.51,
      "words": [
        {
          "text": "Bonjour!",
          "start": 0.5,
          "end": 1.2,
          "confidence": 0.51
        }
      ]
    },
    {
      "id": 1,
      "seek": 200,
      "start": 2.02,
      "end": 4.48,
      "text": " Est-ce que vous allez bien?",
      "tokens": [ 50364, 4410, 12, 384, 631, 2630, 18146, 3610, 2506, 50464 ],
      "temperature": 0.0,
      "avg_logprob": -0.43492694334550336,
      "compression_ratio": 0.7714285714285715,
      "no_speech_prob": 0.06502953916788101,
      "confidence": 0.595,
      "words": [
        {
          "text": "Est-ce",
          "start": 2.02,
          "end": 3.78,
          "confidence": 0.441
        },
        {
          "text": "que",
          "start": 3.78,
          "end": 3.84,
          "confidence": 0.948
        },
        {
          "text": "vous",
          "start": 3.84,
          "end": 4.0,
          "confidence": 0.935
        },
        {
          "text": "allez",
          "start": 4.0,
          "end": 4.14,
          "confidence": 0.347
        },
        {
          "text": "bien?",
          "start": 4.14,
          "end": 4.48,
          "confidence": 0.998
        }
      ]
    }
  ],
  "language": "fr"
}
```
The attribute we care about the most is `segments`. It contains a list of segment dictionaries, each of which has a `words` property storing a list of word objects. Each word object has the `text` representation of the word, the `start` timestamp of the word, and the `end` timestamp.
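To make that shape concrete, here's a small standalone sketch (using a hand-made `segments` list in the same shape as the output above, not a real transcription) that flattens the nested structure into simple `(text, start, end)` tuples:

```python
# A hand-made stand-in for results["segments"], mirroring the shape above
segments = [
    {"words": [{"text": "Bonjour!", "start": 0.5, "end": 1.2}]},
    {"words": [{"text": "Est-ce", "start": 2.02, "end": 3.78},
               {"text": "que", "start": 3.78, "end": 3.84}]},
]

# Flatten every segment's words into (text, start, end) tuples
words = [(w["text"], w["start"], w["end"])
         for segment in segments
         for w in segment["words"]]

print(words)
# → [('Bonjour!', 0.5, 1.2), ('Est-ce', 2.02, 3.78), ('que', 3.78, 3.84)]
```

This is exactly the traversal `get_text_clips` below performs, except there each word becomes a TextClip instead of a tuple.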
We want to loop over all the words and create a TextClip/caption element for each of these words. Let's create a function that makes a TextClip for each word and returns them all in an array.
```python
def get_text_clips(text, fontsize):
    text_clips = []
    for segment in text:
        for word in segment["words"]:
            text_clips.append(
                TextClip(word["text"],
                         fontsize=fontsize,
                         stroke_width=5,
                         stroke_color="white",
                         font="Arial-Bold",
                         color="white")
                .set_start(word["start"])
                .set_end(word["end"])
                .set_position("center")
            )
    return text_clips
```
Feel free to play around with the parameters passed into TextClip based on how you want your text to look!
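One easy way to experiment is to pull the styling into a dictionary and unpack it into the TextClip call. The specific values below are just an illustration (a hypothetical yellow-on-black look), not part of the original script:

```python
# Hypothetical alternative look: yellow text with a thinner black outline.
# Any keyword argument TextClip accepts can go in this dict.
caption_style = {
    "fontsize": 100,
    "font": "Arial-Bold",
    "color": "yellow",
    "stroke_color": "black",
    "stroke_width": 3,
}

# Inside get_text_clips, unpack the dict instead of hardcoding each kwarg:
# TextClip(word["text"], **caption_style)
```

Keeping the style in one place makes it painless to try different looks without touching the loop itself.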
Putting it all together
We can use the functions we've created to put everything together. We'll use the array of TextClip elements to create a CompositeVideoClip we can write to a file 🥳
```python
# Load the audio in the video to transcribe it and get transcribed text
transcribed_text = get_transcribed_text(filename)

# Generate text elements for the video using the transcribed text
text_clip_list = get_text_clips(text=transcribed_text, fontsize=90)

# Create a CompositeVideoClip that we write to a file
final_clip = CompositeVideoClip([original_clip] + text_clip_list)
final_clip.write_videofile("final.mp4", codec="libx264")
```
Congrats, you made it!
Wooooo we did it 🥳 We now have a pretty simple word-by-word caption generator. Here's a scuffed example video I made a while back using this script with a few changes (namely the font).
Keep in mind I was using the `tiny` version of Whisper; you can change the parameters on `whisper.load_model()` to use a heavier model like `medium` or `large`. You can even pull a compatible model from Hugging Face! Check out this link for more details on how to do this.
Thanks for reading this far! Hope this helps you produce some PEAK content on your favorite platform 👁️👅👁️