Export text from the video with Python

#python #tutorial #showdev #productivity

In today's post, I will show you how to export text from a video. We are going to use SpeechRecognition, a library for performing speech recognition, used here with the Google Speech Recognition API.
We will also use the moviepy library. MoviePy is a Python library for video editing: cutting, concatenations, title insertions, video compositing (a.k.a. non-linear editing), video processing, and creation of custom effects. MoviePy can read and write all the most common audio and video formats, including GIF, and runs on Windows/Mac/Linux, with Python 2.7+ and 3 (or only Python 3.4+ from v.1.0). The code in this post imports moviepy.editor, which exists in MoviePy 1.x; MoviePy 2.0 removed that module, and there VideoFileClip is imported directly from moviepy.
Let's start

import speech_recognition as sr
import moviepy.editor as me

We need to specify the video file, the output audio file, and the output text file:

VIDEO_FILE = "video.mp4"
OUTPUT_AUDIO_FILE = "converted.wav"
OUTPUT_TEXT_FILE = "recognized.txt"

The concept is this: the script converts the mp4 file to a wav file, and then produces a text file from that wav file.
Let's do that, starting with extracting the audio from the video:

video_clip = me.VideoFileClip(VIDEO_FILE)
video_clip.audio.write_audiofile(OUTPUT_AUDIO_FILE)
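The extraction step assumes the video actually has an audio track. A minimal sketch of a more defensive version follows; the helper name extract_audio is hypothetical and not part of the original script. It relies on the audio attribute of a clip being None when the video has no sound, and it tries both import layouts because moviepy.editor only exists in MoviePy 1.x.

```python
def extract_audio(video_path, wav_path):
    """Write the audio track of video_path to wav_path.

    Hypothetical helper. Returns False when the video has no audio track.
    """
    # Imported inside the function so that either MoviePy layout works.
    try:
        from moviepy import VideoFileClip  # MoviePy 2.x
    except ImportError:
        from moviepy.editor import VideoFileClip  # MoviePy 1.x

    video_clip = VideoFileClip(video_path)
    try:
        if video_clip.audio is None:
            # No audio track, so there is nothing to transcribe.
            return False
        video_clip.audio.write_audiofile(wav_path)
        return True
    finally:
        # Release the file handles held by the clip's readers.
        video_clip.close()
```

With the constants defined earlier, it could be called as extract_audio(VIDEO_FILE, OUTPUT_AUDIO_FILE), and the script can stop early when the result is False.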

The next thing we need to do is define the recognizer.

recognizer = sr.Recognizer()

Next, we load the audio file for recognition:

audio_clip = sr.AudioFile(OUTPUT_AUDIO_FILE)

Now the magic begins - we will start the conversion to text

try:
    with audio_clip as source:
        audio_file = recognizer.record(source)
    print("Please wait ...")

    result = recognizer.recognize_google(audio_file)

    with open(OUTPUT_TEXT_FILE, 'w') as file:
        file.write(result)
        print("Speech to text conversion successful.")

except Exception as e:
    print("Attempt failed -- ", e)
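Catching Exception reports every failure the same way. As a rough sketch, the recognition step could tell apart the two errors that SpeechRecognition raises most often here: sr.UnknownValueError when the speech cannot be understood, and sr.RequestError when the API cannot be reached. The function name transcribe_wav and the explicit language argument are illustrative assumptions, not part of the original script.

```python
def transcribe_wav(wav_path, language="en-US"):
    """Return the recognized text for wav_path, or None if recognition fails.

    Hypothetical helper, not part of the original script.
    """
    # Imported here so the function can be defined before the package is installed.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        # Read the whole file into memory as audio data.
        audio_data = recognizer.record(source)

    try:
        return recognizer.recognize_google(audio_data, language=language)
    except sr.UnknownValueError:
        print("The speech could not be understood.")
    except sr.RequestError as error:
        print("The recognition service could not be reached:", error)
    return None
```

Returning None instead of raising lets the caller decide whether to write an empty text file or skip writing altogether.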

This is the whole code:

import speech_recognition as sr
import moviepy.editor as me

VIDEO_FILE = "video.mp4"
OUTPUT_AUDIO_FILE = "converted.wav"
OUTPUT_TEXT_FILE = "recognized.txt"
try:
    # Extract the audio track from the video and save it as a wav file
    video_clip = me.VideoFileClip(VIDEO_FILE)
    video_clip.audio.write_audiofile(OUTPUT_AUDIO_FILE)
    # Load the wav file and read it into memory
    recognizer = sr.Recognizer()
    audio_clip = sr.AudioFile(OUTPUT_AUDIO_FILE)
    with audio_clip as source:
        audio_file = recognizer.record(source)
    print("Please wait ...")
    # Send the audio to the Google Speech Recognition API
    result = recognizer.recognize_google(audio_file)
    # Save the recognized text
    with open(OUTPUT_TEXT_FILE, 'w') as file:
        file.write(result)
        print("Speech to text conversion successful.")
except Exception as e:
    print("Attempt failed -- ", e)

Note
For longer videos, you can split audio data into chunks.
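One way to do the chunking is sketched below, under a few assumptions: the helper names chunk_windows and transcribe_in_chunks and the 30-second default are made up for illustration, and the file is reopened for every chunk so that each call to record() can use its offset and duration arguments independently. The clip length is read from the DURATION attribute that AudioFile exposes while the file is open.

```python
def chunk_windows(total_seconds, chunk_seconds=30):
    """Return (offset, duration) pairs that cover total_seconds in order."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    windows = []
    offset = 0
    while offset < total_seconds:
        # The last window is shortened so it does not run past the end.
        duration = min(chunk_seconds, total_seconds - offset)
        windows.append((offset, duration))
        offset += chunk_seconds
    return windows


def transcribe_in_chunks(wav_path, chunk_seconds=30):
    """Recognize a wav file chunk by chunk and join the recognized text."""
    # Imported here so chunk_windows can be used without the package installed.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        total_seconds = source.DURATION  # clip length in seconds

    pieces = []
    for offset, duration in chunk_windows(total_seconds, chunk_seconds):
        # Reopen the file so offset is always measured from the start.
        with sr.AudioFile(wav_path) as source:
            audio_data = recognizer.record(source, offset=offset, duration=duration)
        if not audio_data.frame_data:
            # A very short final window can come back empty; skip it.
            continue
        try:
            pieces.append(recognizer.recognize_google(audio_data))
        except sr.UnknownValueError:
            # Nothing recognizable in this chunk (for example silence).
            continue
    return " ".join(pieces)
```

Fixed-length windows can cut a word in half at a boundary, so the joined text may contain small errors around each split. Each chunk is also sent as a separate request to the API.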

This is the video that I used for testing: video.
The video was originally uploaded to YouTube, and you can find it here: YouTube link.

Thank you all.

Top comments (1)

DSNR • Sep 23 '21

Hey! awesome post. Works brilliantly and helped clear some things up for me with how it works.

How would I track where each word is by some timestamp, to the nearest second?

I would like to return live timestamps for each word along with the transcription.

For clarity, my end goal is the ability to search for a word, find all instances of it within a clip, and then output them selectively. Essentially, that would give me 5 files of the word in audio as individual clips, labelled accordingly, etc.

Thanks for the great post!
