OpenAI is on everyone's lips these days, but this post is not about their recent chatbot. It's about Whisper, the speech recognition model for transcribing audio they released back in September. This post shows how to apply it to YouTube videos to generate a full transcript of the spoken words.
Install Dependencies
Install the Python packages for Whisper, PyTube and Pandas. Whisper should be installed from GitHub to pick up the latest commit. PyTube is available on PyPI, but it has a lot of open issues and pull requests, so installing it from GitHub allows us to cherry-pick some PRs if needed later on.
pip install git+https://github.com/openai/whisper.git
pip install git+https://github.com/pytube/pytube.git
pip install pandas
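Note that Whisper additionally requires the ffmpeg command-line tool to be installed on your system to decode the audio. If it's missing, install it with your package manager, e.g. sudo apt install ffmpeg on Ubuntu or brew install ffmpeg on macOS.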
Download the YouTube Video
We will use PyTube's YouTube class to download the given video_url as an audio file locally. The URL must be a valid watch URL. I would suggest using a short video (around 5 minutes) so you don't have to wait too long for the results.
from pytube import YouTube

video_url = "https://www.youtube.com/watch?v=oHWuv1Aqrzk"

# download the audio-only stream of the video to a local file named audio.mp4
audio_file = YouTube(video_url).streams.filter(only_audio=True).first().download(filename="audio.mp4")
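If you are curious which streams PyTube actually found, you can list them before downloading. This is just an optional sketch: first() picks the first matching stream and returns None if the filter matched nothing, so a quick look at the candidates can help when a download fails.

from pytube import YouTube

yt = YouTube(video_url)
# list all audio-only streams with their format and bitrate
for stream in yt.streams.filter(only_audio=True):
    print(stream.itag, stream.mime_type, stream.abr)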
Load the Whisper Model
This will load the tiny Whisper model. It's a multi-lingual model that is relatively fast. It's also available as an English-only model, tiny.en. There are more models available that are larger and more accurate.
import whisper
whisper_model = whisper.load_model("tiny")
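If you want to see all model sizes that ship with the package, whisper.available_models() lists them. As a rule of thumb: the larger the model, the better the accuracy and the slower the transcription.

import whisper

# prints all bundled model names: tiny, base, small, medium, large and the .en variants
print(whisper.available_models())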
Transcribe the Video
This will run the model on the downloaded audio file. Depending on the model size and the length of the audio, this can take a while.
transcription = whisper_model.transcribe(audio_file)
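transcribe() also accepts decoding options as keyword arguments. Two that are often useful are shown in this optional sketch; the values are assumptions for an English video on a CPU-only machine:

# optional: skip language auto-detection and run in full precision
transcription = whisper_model.transcribe(
    audio_file,
    language="en",  # the spoken language, if you already know it
    fp16=False,     # disable half-precision, e.g. when running on CPU
)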
Display the Transcription
This will display the transcription result in segments with start and end times. The full concatenated string is available as transcription['text'].
import pandas as pd
# print as DataFrame
df = pd.DataFrame(transcription['segments'], columns=['start', 'end', 'text'])
print(df)
# or, print as String
print(transcription['text'])
This will print the following table:
index | start (s) | end (s) | text |
---|---|---|---|
0 | 0.0 | 9.7 | Is there cool small projects like archive sanity and so on that you're thinking about the |
1 | 9.7 | 12.96 | world, the ML world can anticipate? |
2 | 12.96 | 16.32 | There's some always like some fun side projects. |
3 | 16.32 | 17.72 | Archive sanity is one. |
4 | 17.72 | 21.8 | Basically like there's way too many archive papers, how can I organize it and recommend |
5 | 21.8 | 23.2 | papers and so on. |
6 | 23.2 | 25.8 | I transcribed all of your podcasts. |
7 | 25.8 | 29.92 | What did you learn from that experience from transcribing the process? |
8 | 29.92 | 33.92 | Like you like consuming audiobooks and podcasts and so on. |
9 | 33.92 | 39.92 | Here's a process that achieves closer to human level performance and annotation. |
10 | 39.92 | 40.92 | Yeah. |
11 | 40.92 | 45.92 | Well I definitely was surprised that transcription with opening as whisper was working so well. |
12 | 45.92 | 50.56 | Compared to what I'm familiar with from Siri and like a few other systems I guess, it works |
13 | 50.56 | 51.56 | so well. |
14 | 51.56 | 56.2 | And that's what gave me some energy to like try it out and I thought it could be fun to |
15 | 56.2 | 57.56 | run on podcasts. |
16 | 57.56 | 62.04 | It's kind of not obvious to me why whisper is so much better compared to anything else |
17 | 62.04 | 64.76 | because I feel like there should be a lot of incentive for a lot of companies to produce |
18 | 64.76 | 67.72 | transcription systems and that they've done so over a long time. |
19 | 67.72 | 69.36 | Whisper is not a super exotic model. |
20 | 69.36 | 71.16 | It's a transformer. |
21 | 71.16 | 75.08 | It takes smell spectrograms and you know it just outputs tokens of text. |
22 | 75.08 | 76.56 | It's not crazy. |
23 | 76.56 | 79.24 | The model and everything has been around for a long time. |
24 | 79.24 | 80.56 | I'm not actually 100% sure why. |
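If you want to keep the transcript around, you can also turn the second-based timestamps into readable ones and write the segments to a file. This is a small optional sketch using pandas; the file name transcript.csv is just a placeholder:

import datetime
import pandas as pd

df = pd.DataFrame(transcription['segments'], columns=['start', 'end', 'text'])
# convert the start/end seconds into H:MM:SS strings
for col in ['start', 'end']:
    df[col] = df[col].apply(lambda s: str(datetime.timedelta(seconds=round(s))))
df.to_csv("transcript.csv", index=False)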
How to Run It
I put all this code into an interactive Jupyter Notebook on Colab, so you can try it out without having to install any of this.
The complete code is also available as a GitHub repository, so you can simply clone it and run it locally. The script takes the YouTube video URL via the --video option:
python3 main.py --video "https://www.youtube.com/watch?v=oHWuv1Aqrzk"