In today's post, I will show you how to extract text from a video. We are going to use SpeechRecognition, a library for performing speech recognition; here we use it with the Google Speech Recognition API.
We will also use the moviepy library. MoviePy is a Python library for video editing: cutting, concatenations, title insertions, video compositing (a.k.a. non-linear editing), video processing, and creation of custom effects. MoviePy can read and write all the most common audio and video formats, including GIF, and runs on Windows/Mac/Linux, with Python 2.7+ and 3 (or only Python 3.4+ from v.1.0).
Let's start
import speech_recognition as sr
import moviepy.editor as me
We need to specify the video file, the output audio file, and the output text file:
VIDEO_FILE = "video.mp4"
OUTPUT_AUDIO_FILE = "converted.wav"
OUTPUT_TEXT_FILE = "recognized.txt"
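If you would rather not hard-code all three names, the two output names can be derived from the video name. This is only a sketch: `derive_output_paths` is a hypothetical helper, not part of either library, and it assumes you want the outputs to share the video's base name. The file name `my_video.mp4` is just an example.

```python
from pathlib import Path


def derive_output_paths(video_file):
    """Return (audio_path, text_path) that share the video's base name.

    Hypothetical helper for illustration; not part of moviepy or SpeechRecognition.
    """
    video_path = Path(video_file)
    audio_path = video_path.with_suffix(".wav")  # e.g. talk.mp4 -> talk.wav
    text_path = video_path.with_suffix(".txt")   # e.g. talk.mp4 -> talk.txt
    return str(audio_path), str(text_path)


# Example file name, chosen only for this sketch.
VIDEO_FILE = "my_video.mp4"
OUTPUT_AUDIO_FILE, OUTPUT_TEXT_FILE = derive_output_paths(VIDEO_FILE)
print(OUTPUT_AUDIO_FILE, OUTPUT_TEXT_FILE)
```

The rest of the script stays the same, because it only refers to the three constants.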
The concept is this: the script converts the mp4 file to a wav file, and then produces a text file from that audio.
Let's do that. First, we open the video and extract its audio:
video_clip = me.VideoFileClip(r"{}".format(VIDEO_FILE))
video_clip.audio.write_audiofile(r"{}".format(OUTPUT_AUDIO_FILE))
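One thing to watch for: a video without an audio track has `video_clip.audio` set to `None`, so the `write_audiofile` call would fail with an `AttributeError`. Here is a minimal sketch of a guard. It assumes the `moviepy.editor` import used in this post, which exists in MoviePy 1.x; as far as I know, MoviePy 2.x removed that module and exposes `VideoFileClip` from the top-level `moviepy` package, so the sketch tries both. `open_video`, `extract_audio`, and `video_to_wav` are hypothetical helper names.

```python
def open_video(video_file):
    """Open a video with whichever MoviePy import path is available."""
    try:
        from moviepy.editor import VideoFileClip  # MoviePy 1.x, as in this post
    except ImportError:
        from moviepy import VideoFileClip  # assumption: MoviePy 2.x layout
    return VideoFileClip(video_file)


def extract_audio(video_clip, output_audio_file):
    """Write the clip's audio track to output_audio_file and return that path."""
    if video_clip.audio is None:
        # MoviePy leaves .audio as None when the video has no audio track.
        raise ValueError("The video has no audio track to extract.")
    video_clip.audio.write_audiofile(output_audio_file)
    return output_audio_file


def video_to_wav(video_file, output_audio_file):
    """Extract the audio and always release the underlying file handles."""
    video_clip = open_video(video_file)
    try:
        return extract_audio(video_clip, output_audio_file)
    finally:
        video_clip.close()
```

With these helpers, the extraction step becomes `video_to_wav(VIDEO_FILE, OUTPUT_AUDIO_FILE)`.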
The next thing we need to do is define the recognizer.
recognizer = sr.Recognizer()
We need to load the audio file for recognition:
audio_clip = sr.AudioFile("{}".format(OUTPUT_AUDIO_FILE))
Now the magic begins: we start the conversion to text.
try:
    with audio_clip as source:
        audio_file = recognizer.record(source)
    print("Please wait ...")
    result = recognizer.recognize_google(audio_file)
    with open(OUTPUT_TEXT_FILE, 'w') as file:
        file.write(result)
    print("Speech to text conversion successful.")
except Exception as e:
    print("Attempt failed -- ", e)
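Catching `Exception` works, but it hides why the attempt failed. SpeechRecognition raises `sr.UnknownValueError` when no speech could be understood and `sr.RequestError` when the Google API could not be reached or returned an error, so the two cases can be reported separately. The sketch below shows one way to do that. `recognize_or_report` is a hypothetical helper, and `language="en-US"` is just an example value for the `language` parameter of `recognize_google`.

```python
def recognize_or_report(recognizer, audio, language="en-US"):
    """Return the transcript, or None after printing why recognition failed.

    Hypothetical helper: `recognizer` is an sr.Recognizer and `audio` is the
    AudioData returned by recognizer.record().
    """
    # Imported here so the helper only needs the package when it is called.
    import speech_recognition as sr

    try:
        return recognizer.recognize_google(audio, language=language)
    except sr.UnknownValueError:
        # The service answered, but found no intelligible speech.
        print("The speech could not be understood.")
    except sr.RequestError as error:
        # Network problem or an error response from the API.
        print("The recognition service could not be reached:", error)
    return None
```

You would call it as `result = recognize_or_report(recognizer, audio_file)` and write the text file only when `result` is not `None`.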
This is the whole code:
import speech_recognition as sr
import moviepy.editor as me
VIDEO_FILE = "video.mp4"
OUTPUT_AUDIO_FILE = "converted.wav"
OUTPUT_TEXT_FILE = "recognized.txt"
try:
    video_clip = me.VideoFileClip(r"{}".format(VIDEO_FILE))
    video_clip.audio.write_audiofile(r"{}".format(OUTPUT_AUDIO_FILE))
    recognizer = sr.Recognizer()
    audio_clip = sr.AudioFile("{}".format(OUTPUT_AUDIO_FILE))
    with audio_clip as source:
        audio_file = recognizer.record(source)
    print("Please wait ...")
    result = recognizer.recognize_google(audio_file)
    with open(OUTPUT_TEXT_FILE, 'w') as file:
        file.write(result)
    print("Speech to text conversion successful.")
except Exception as e:
    print("Attempt failed -- ", e)
Note
For longer videos, you can split the audio data into chunks and transcribe each chunk separately.
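Here is a minimal sketch of that chunking idea, not a tuned implementation. It relies on the `duration` parameter of `recognizer.record()`: each call continues reading where the previous one stopped. It also assumes that `record()` returns audio with empty `frame_data` once the end of the file is reached. `transcribe_in_chunks` is a hypothetical helper, and the 30-second chunk length is an arbitrary example value.

```python
def transcribe_in_chunks(wav_file, chunk_seconds=30):
    """Transcribe a WAV file piece by piece and join the recognized text."""
    # Imported here so the helper only needs the package when it is called.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    pieces = []
    with sr.AudioFile(wav_file) as source:
        while True:
            # Each record() call continues where the previous one stopped.
            audio_chunk = recognizer.record(source, duration=chunk_seconds)
            if not audio_chunk.frame_data:
                # Assumption: empty audio means the end of the file was reached.
                break
            try:
                pieces.append(recognizer.recognize_google(audio_chunk))
            except sr.UnknownValueError:
                # No recognizable speech in this chunk (e.g. silence); skip it.
                continue
    return " ".join(pieces)
```

You could then write `transcribe_in_chunks(OUTPUT_AUDIO_FILE)` to the text file instead of `result`. Keep in mind that fixed-length chunks can cut a word in half at a boundary, so the joined text may contain small errors there.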
This is the video that I used for testing: video.
The video was originally uploaded to YouTube, and you can find it here: YouTube link.
Thank you all.
Top comments (1)
Hey! Awesome post. Works brilliantly and helped clear some things up for me about how it works.
How would I track where each word is by some timestamp, to the nearest second?
I would like to return live timestamps for each word along with the transcription.
For clarity, my end goal is the ability to search for a word, find all instances of it within a clip, and then output them selectively. Essentially, that would give me 5 files of the word in audio as individual clips, labelled accordingly, etc.
Thanks for the great post!