Alan Allard for Eyevinn Video Dev-Team Blog

Exploring OCR and text-to-speech in FFMPEG...

...in which the author pursues a hare-brained idea to create a system to recognise text in any given image, reproduce the text, place it next to the original image and have the text be spoken aloud - all within ffmpeg, in order to explore several techniques and features within said tool. How far did the author get with all this madness? Read on to find out...

While spending an unhealthy amount of time browsing through the various filters available in ffmpeg, I discovered a couple that I really wanted to investigate. The ocr filter (as in Optical Character Recognition) is not documented in great detail - as is the case with several parts of the ffmpeg documentation. (On the other hand, some of the filters are extremely well-documented, with several intriguing examples. But not this one!) Nonetheless, I wanted to try it out. Also, while writing my earlier blogs about audio in ffmpeg, I had come across the flite filter. I recognised the name from my days working with the open-source audio software Pure Data as that of a text-to-speech synthesis engine. This was also interesting to me, so I started experimenting...

Using the ocr filter

The ocr filter in ffmpeg is powered by the Tesseract library. As you will often find in ffmpeg, the integration exposes only a subset of the original library's functionality - at least, for the moment. There's always the possibility of the API being expanded in later ffmpeg releases. And it is open source of course, so there's the option of instigating those changes yourself - or using the original library in conjunction with ffmpeg if that suits your needs better.
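If you do want the full Tesseract feature set, one approach (just a sketch - it assumes you have the standalone tesseract command-line tool installed, and input.mov and frame.png are hypothetical file names) is to let ffmpeg extract a frame and hand it to the tool directly:

# grab the first frame of a video with ffmpeg (hypothetical input name)
ffmpeg -i input.mov -frames:v 1 frame.png
# run the standalone Tesseract CLI on it; the recognised text ends up in out.txt
tesseract frame.png out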

To use the ocr filter, we need to be sure that ffmpeg was built with Tesseract enabled. You can check this by running ffmpeg without any options. If you see --enable-libtesseract in the configuration line in the first few lines, like this:

 % ffmpeg
ffmpeg version N-109469-g62da0b4a74 Copyright (c) 2000-2023 the FFmpeg developers
  built with Apple clang version 14.0.0 (clang-1400.0.29.202)
  configuration: --enable-libx264 --enable-gpl --enable-lv2 --enable-libfreetype --enable-libflite --enable-cross-compile --enable-libtesseract --enable-libfontconfig --enable-libfribidi

then you don't need to do anything. Otherwise, add --enable-libtesseract to your ./configure command, run it, then run make and make install.
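For reference, the rebuild boils down to something like this (a sketch - keep whatever flags your existing build already uses, and make sure the Tesseract library and headers are installed on your system first):

# re-run configure with Tesseract enabled, plus whatever flags you already use
./configure --enable-libtesseract --enable-libfreetype
make
make install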

To test the ocr filter, a video containing text would be useful. We can generate that fairly easily in ffmpeg:

ffmpeg -f lavfi -i color=c=red:size=400x400:duration=5 -filter_complex \
drawtext=fontfile='fonts/lazenby-computer/LazenbyCompSmooth.ttf'\
:text='SYNTHESIZER':fontsize=40:x=25:y=20:fontcolor=white,\
drawtext=fontfile='fonts/lazenby-computer/LazenbyCompSmooth.ttf'\
:text='DRUM MACHINE':fontsize=40:x=25:y=150:fontcolor=white,\
drawtext=fontfile='fonts/lazenby-computer/LazenbyCompSmooth.ttf'\
:text='REVERB':fontsize=40:x=25:y=310:fontcolor=white\
 ocrTest1.mov

This is probably fairly self-explanatory. It:

  1. generates a single colour video of dimensions 400x400 for 5 seconds
  2. draws three texts over this, using a specified font (one that I downloaded to a local folder). We do this with the drawtext filter.

This was done using only the most basic font support available in ffmpeg - this installation was built with only --enable-libfreetype configured. If we also configure --enable-libfontconfig and --enable-libfribidi, things get a little easier. Among other things, we can then use the default font, meaning that we don't have to pass a long, unwieldy font path to the drawtext filter (unless we have specific font requirements):

ffmpeg -f lavfi -i color=c=red:size=400x400:duration=5 -filter_complex \
drawtext=text='SYNTHESIZER':fontsize=40:x=25:y=20:fontcolor=white,\
drawtext=text='DRUM MACHINE':fontsize=40:x=25:y=150:fontcolor=white,\
drawtext=text='REVERB':fontsize=40:x=25:y=310:fontcolor=white\
 ocrTest2.mov

So we have a video with text in it, in the default font:

Now it would be nice to feed that through the ocr filter and see what text it recognises. Before we do that, let's also stagger the display of the texts over time but in the same position, which will allow us to demonstrate the OCR output clearly by printing it below the source text:

ffmpeg -f lavfi -i color=c=red:size=400x400:duration=9.5 -filter_complex \
"[0]drawtext=text='SYNTHESIZER':fontsize=40:x=(w-text_w)/2:y=20:\
fontcolor=white:enable='between(t,1,3)'[1];\
[1]drawtext=text='DRUM MACHINE':fontsize=40:x=(w-text_w)/2:y=20:\
fontcolor=white:enable='between(t,4,6)'[2];\
[2]drawtext=text='REVERB':fontsize=40:x=(w-text_w)/2:y=20:\
fontcolor=white:enable='between(t,7,9)'"\
 ocrTest3.mov

Notice that we centred the texts horizontally (using the text width text_w, which is available within the drawtext filter). Also, if you take a moment to run ffmpeg -filters, you will see a list of all the available filters with three characters preceding each filter. Here is the one for drawtext:

T.C drawtext          V->V       Draw text on top of video frames using libfreetype library.

The T is an indication that drawtext has timeline support. This means that we can enable the filter based on the time we are at in the video. There are several expressions available for evaluating this and we are using a common one, the between expression. For example, the first drawtext instance is only enabled between 1 and 3 seconds in the video. This is how it looks for all three:

So now we have something that our ocr filter can respond to. This is how we arrange that:

ffmpeg -i ocrTest3.mov -filter_complex \
"ocr,drawtext=fontsize=40:fontcolor=black:x=(w-text_w)/2:y=(h-text_h)/2:\
text='%{metadata\:lavfi.ocr.text}'" ocrOutput1.mov

resulting in this video:

Here we are using drawtext to display the data that the ocr filter writes to the metadata accompanying each frame of the video. This is why the OCR output changes; it's being updated per frame. (We will examine this metadata more closely shortly.)
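If you just want a quick peek at that per-frame metadata right away, one option (a quick sketch, using the ocrTest3.mov we generated above) is to follow the ocr filter with the metadata filter in print mode, which writes matching keys to the log:

# prints the recognised text for every frame to the log; -f null - discards the video
ffmpeg -i ocrTest3.mov -vf "ocr,metadata=mode=print:key=lavfi.ocr.text" -f null -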

To explore things further, let's use a text pulled from the internet as a source. We will then position it next to the ocr output in an output image. Unfortunately for this particular (surely atypical) use case, the ocr filter passes its input through unchanged, which means our recognised text would be drawn on top of the existing image. So let's cover the stream with a white colour overlay before drawing the ocr results on top using drawtext:

ffmpeg -i test.png -f lavfi -i "color=white" -filter_complex \
"split[textOut][origOut];\
[textOut]ocr[ocrOut];\
[1:v][ocrOut]scale2ref=w=iw:h=ih[textScaled][colorScaled];\
[colorScaled][textScaled]overlay[overlayOut];\
[overlayOut]drawtext=fontsize=40:fontcolor=black:x=(w-text_w)/2:y=(h-text_h)/2:\
text='%{metadata\:lavfi.ocr.text}'[drawtextOut];\
[origOut][drawtextOut]hstack" -t 1 -update 1 ocrOutput2.png

Useful things to note here are that we resize the white color image to the original .png using the scale2ref filter with the (source video) input height and input width constants ih and iw. Then we overlay the OCR output text onto that white background. Finally, we line up the two video streams horizontally using the hstack filter. If we didn't add the -t 1, this stream would continue until we shut it down manually. And the -update 1 is necessary for outputting a single image instead of a video, video being what ffmpeg is primarily designed to work with, of course. Here is ocrOutput2.png:

Image ocr test

I was curious as to how the ocr filter handled angled text - and, if it didn't handle it well, how much of an angle it takes to distort the recognised text:

ffmpeg -i test.png -f lavfi -i "color=red" -filter_complex \
"rotate='PI/24:ow=hypot(iw,ih):oh=ow'[rotOut];\
[rotOut]split[textOut][origOut];\
[textOut]ocr[ocrOut];\
[1:v][ocrOut]scale2ref=w=oh*mdar:h=ih[textScaled][colorScaled];\
[colorScaled][textScaled]overlay[overlayOut];\
[overlayOut]drawtext=fontsize=40:fontcolor=black:x=(w-text_w)/2:y=(h-text_h)/2:\
text='%{metadata\:lavfi.ocr.text}'[drawtextOut];\
[origOut][drawtextOut]hstack" -t 1 -update 1 ocrOutput2.png

What's new here is the rotate filter, which rotates the test image by 7.5 degrees (expressed in radians). The parameters after the angle expression - ow=hypot(iw,ih):oh=ow - control the output width and height and ensure that the image stays contained within its frame when we rotate it (this will be particularly useful when we animate the rotation shortly). I coloured the text underlay red for clarity, giving us this for ocrOutput2.png:

Image ocr test 2

And rotating by PI/12 (15 degrees) gives us:

Image description

And here you can see, perhaps unsurprisingly, that the OCR results are starting to break up a bit. 30 degrees looks like this:

Image ocr test 3

Now let's test with a couple of real-world images. First this (the image shown here is much smaller than the large output file I got):

Image ocr real world test 1

That's quite impressive - as well as recognising the neon sign, it did actually manage to capture some of the text on the synthesizers too!

Now this (same here, much-reduced version of the resulting image):

Image ocr real world test 2

Pretty extreme: the texture of the book and/or the resulting pixels got recognised as characters.

Image ocr real world test 3

Much better, especially considering that the ocr filter defaults to interpreting English (this text is Swedish). The language can be changed using the language parameter, though.
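For example, pointing the filter at Swedish might look something like this (a sketch - it assumes the Swedish Tesseract data, swe.traineddata, is installed where Tesseract can find it, and swedishText.png is a hypothetical input image):

# hypothetical input image; requires the Swedish traineddata for Tesseract
ffmpeg -i swedishText.png -vf "ocr=language=swe,metadata=mode=print:key=lavfi.ocr.text" -f null -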

For a quick overview of how rotating the text affects the ocr output - and mostly because it's instructive - let's animate the rotation in a 12-second video:

ffmpeg -loop 1 -i test.png -f lavfi -i "color=white" -filter_complex \
"rotate='PI/6*t:ow=hypot(iw,ih):oh=ow'[rotOut];\
[rotOut]split[textOut][origOut];\
[textOut]ocr[ocrOut];\
[1:v][ocrOut]scale2ref=w=oh:h=ih[textScaled][colorScaled];\
[colorScaled][textScaled]overlay[overlayOut];\
[overlayOut]drawtext=fontsize=40:fontcolor=black:\
x=(w-text_w)/2:y=(h-text_h)/2:text='%{metadata\:lavfi.ocr.text\:NA}'[drawtextOut];\
[origOut][drawtextOut]hstack" -t 12 ocrOutput8.mov

Here we are animating the rotation by multiplying it by the time t. Otherwise most of this is the same as before, except that we now supply a default value of "NA" for when the lavfi.ocr.text data is empty or non-existent. It's quite clear that most of the OCR output is junk, but the effect is quite entertaining (I think so, in any case). If you did want to use this text in some way, you can analyse the output video with ffprobe like so:

ffprobe -show_entries frame_tags=lavfi.ocr.text,lavfi.ocr.confidence -f lavfi -i "movie=ocrOutput8.mov,ocr"  > ocrText.txt

Here are a couple of the frame outputs from that text file:

[FRAME]
TAG:lavfi.ocr.text=It was the best of

times, it was the worst Fe REG oT
of times, it was the age St eee
of wisdom, jas the age of foolishness...
age of foolishness...

TAG:lavfi.ocr.confidence=96 96 96 96 96 96 96 95 96 96 31 4 5 97 96 96 96 96 96 1 0 96 96 21 96 96 96 96 95 95 96 
[SIDE_DATA]
[/SIDE_DATA]
[/FRAME]
[FRAME]
TAG:lavfi.ocr.text=It was the best of

times, it was the worst times it was the worst
of times, it was the age oF Wego twas they.
of wisdom, eae fie age of foolishness...
age of foolishness...

TAG:lavfi.ocr.confidence=96 96 96 96 96 96 95 96 96 96 70 80 80 96 96 96 96 96 96 96 96 73 27 26 23 96 96 41 46 95 95 96 96 97 95 
[/FRAME]

This is quite early in the rotation and most of the text is intact. The ocr filter also outputs a confidence rating for each word, which is fairly high at this stage. Later on it drops to 0 at worst:

[FRAME]
TAG:lavfi.ocr.text=aa QB
s oO a omMmMs 4
2 m3 gf =o a0 F
man &
% 24Gb & 04,9
4.3 2 a 04.520
oC n ae Cofnae
mm OY 4 oO m OY a)
a oS a oO OA2e4Cc
a> 327% 6 $3 2% 9
a. Se % 5 DP
Ce eS a,
Rae et Eee &S
O
S40 & S470 &
Lo) eo & 4 Bec4
%% 2 4 %& 2S
Pah GY Paet %
, TP we , De we
@ 9 aj "00 &

TAG:lavfi.ocr.confidence=59 95 0 8 74 90 96 67 37 95 57 57 75 64 74 79 16 6 47 48 58 47 79 0 0 67 90 11 51 94 26 92 96 96 43 22 53 53 90 42 45 36 89 89 96 42 9 0 96 95 18 27 49 47 41 93 92 0 0 0 91 91 0 38 27 89 90 13 61 81 36 90 0 37 91 92 37 56 56 90 95 95 48 32 43 81 81 
[/FRAME]
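Those per-word confidence values can also be put to work. As a rough sketch (it assumes the ocrText.txt produced by the ffprobe command above, and ordinary grep and awk), this prints the average confidence for each frame, which you could use to throw away frames whose recognition is mostly junk:

# average per-word confidence for each frame found in ocrText.txt
grep 'lavfi.ocr.confidence=' ocrText.txt | awk -F'=' \
'{ n = split($2, c, " "); s = 0; for (i = 1; i <= n; i++) s += c[i];
   if (n > 0) printf "frame %d: average confidence %.1f\n", NR, s / n }'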

So now we have a good idea of how the ocr filter works and can move on to...speech!

Introducing the flite filter

As I mentioned, it is the flite filter that can generate speech in ffmpeg. The name flite derives from the fact that the library is designed as a lightweight and portable version of the Festival speech synthesis library. In the ffmpeg implementation there are six speech synthesis voices available. Flite itself has support for downloading voices on the fly, but this is currently unavailable within ffmpeg. Here is an example of text being spoken in the flite filter:

ffmpeg -filter_complex \
"flite=text='This is the slt voice':\
voice=slt" fliteOut1.wav

Text can also be spoken from a text file:

ffmpeg -filter_complex \
"flite=textfile=speech.txt:\
voice=rms" fliteOut2.wav
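By the way, if you are not sure which voices your particular build includes, the filter has a list_voices option that prints the available voice names in the log (a sketch - the exact set depends on how flite and ffmpeg were built):

# lists the available flite voice names in the log; no audio is produced
ffmpeg -filter_complex "flite=list_voices=1" -f null -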

NB: You may need to re-configure ffmpeg in order to use the flite filter - in that case, we also need to clone and build the flite library first:

  • fetch the repo: git clone https://github.com/festvox/flite.git
  • move to the directory: cd flite/
  • configure, make and install:
    • ./configure
    • make
    • sudo make install

Then in ffmpeg:

  • ./configure --enable-libflite --enable-cross-compile
  • make install

(These are macOS Monterey instructions, others are here)

Combining OCR and text-to-speech...

So could we build an ffmpeg command that takes in an image of text, recognises the text and then speaks it aloud? Well, there are limitations. For example, if the flite filter had the ability to handle text expansion (the %{metadata...} syntax we used with drawtext), we could simply build it into the filtergraph in the same way as the drawtext filter. Instead we are going to have to read the text from a file, which means resorting to some sort of compound command. Here's one possibility:

ffmpeg -y -i test.png -f lavfi -i "color=red" -filter_complex \
"[0]split[textOut][origOut];\
[textOut]ocr[ocrOut];\
[1:v][ocrOut]scale2ref=w=iw*mdar:h=ih[textScaled][colorScaled];\
[colorScaled][textScaled]overlay[overlayOut];\
[overlayOut]metadata=mode=print:key=lavfi.ocr.text:file=metadata.txt:direct=1[metadataOut];\
[metadataOut]drawtext=fontsize=40:fontcolor=black:x=(w-text_w)/2:y=(h-text_h)/2:\
text='%{metadata\:lavfi.ocr.text}'[drawtextOut];\
[origOut][drawtextOut]hstack" -update 1 -frames:v 1 textOutSpeak.png \
&& ffmpeg -y -loop 1 -i textOutSpeak.png -filter_complex \
"flite=textfile=metadata.txt:\
voice=rms[speakOut]" -t 15 -map 0 -map "[speakOut]" ocrAndFliteOutput1.mov

and another way to do this would be to pipe the output from the first ffmpeg command to the next:

ffmpeg -y -i test.png -f lavfi -i "color=red" -filter_complex \
"[0]split[textOut][origOut];\
[textOut]ocr[ocrOut];\
[1:v][ocrOut]scale2ref=w=iw*mdar:h=ih[textScaled][colorScaled];\
[colorScaled][textScaled]overlay[overlayOut];\
[overlayOut]metadata=mode=print:key=lavfi.ocr.text:file=metadata.txt:direct=1[metadataOut];\
[metadataOut]drawtext=fontsize=40:fontcolor=black:x=(w-text_w)/2:y=(h-text_h)/2:\
text='%{metadata\:lavfi.ocr.text}'[drawtextOut];\
[origOut][drawtextOut]hstack" -update 1 -frames:v 1 -f image2 - \
| ffmpeg -y -loop 1 -i pipe:0 -filter_complex \
"flite=textfile=metadata.txt:\
voice=rms[speakOut]" -t 15 -map 0 -map "[speakOut]" ocrAndFliteOutput2.mov

Both of these will give us the following, which sort of works:

The problem is that flite is reading the whole of metadata.txt rather than just the text itself. We need to trim this data before it is read. One way is to use grep and tail in the following way:

grep -A50 '=' metadata.txt | tail -c+16 > massagedData.txt

This will work up to a fairly reasonable maximum of 50 lines of recognised text. It trims away the metadata key prefix (the first 15 characters, lavfi.ocr.text=), leaving just the readable text. And so, finally, we have the following three-part compound command that takes in an image and produces OCR output and text-to-speech, tidily packaged in a short video:

ffmpeg -y -i test.png -f lavfi -i "color=red" -filter_complex \
"[0]split[textOut][origOut];\
[textOut]ocr[ocrOut];\
[1:v][ocrOut]scale2ref=w=iw*mdar:h=ih[textScaled][colorScaled];\
[colorScaled][textScaled]overlay[overlayOut];\
[overlayOut]metadata=mode=print:key=lavfi.ocr.text:file=metadata.txt:direct=1[metadataOut];\
[metadataOut]drawtext=fontsize=40:fontcolor=black:x=(w-text_w)/2:y=(h-text_h)/2:\
text='%{metadata\:lavfi.ocr.text}'[drawtextOut];\
[origOut][drawtextOut]hstack" -update 1 -frames:v 1 textOutSpeak.png \
&& grep -A50 '=' metadata.txt | tail -c+16 > massagedData.txt \
&& ffmpeg -y -loop 1 -i textOutSpeak.png -filter_complex \
"flite=textfile=massagedData.txt:\
voice=rms[speakOut]" -t 15 -map 0 -map "[speakOut]" ocrAndFliteOutput3.mov

And the result, a tidied-up version of the last video:

Obviously, this is a rather cumbersome and fragile end result. We can summarise by stating that there are better ways to do this. However, it has been quite an educational investigation into some useful tools within ffmpeg.

About the cover image

The cover image of this post was also generated with ffmpeg. I used two GLSL shaders (captured to .png files) as input. These are the shaders used for the design of two cards from the Pixelspirit Deck. The first card is called "Vision" and the second card is called "Judgement". They seemed appropriate ;)

I used the ffmpeg filter minterpolate to morph from the "Vision" image to the "Judgement" image (the morph effect is pretty basic, but it works):

ffmpeg -y -r 0.3 -stream_loop 1 -i 33_vision.png -r 0.3 -stream_loop 2 -i 31_Judgement.png -filter_complex "[0][1]concat=n=2:v=1:a=0[v];[v]minterpolate=fps=24:scd=none,trim=3:7,setpts=PTS-STARTPTS" -pix_fmt yuv420p visionToJudgement.mp4

I first captured the output images from visionToJudgement.mp4:

ffmpeg -i visionToJudgement.mp4 -r 1 morphtest1/test-%09d.png

and tiled them horizontally (new filter alert: tile) at the right size to create the header image:

ffmpeg -i morphtest1/test-%09d.png -filter_complex "scale=594:1496,tile=6x1" -update 1 -frames:v 1 output.png

Alan Allard is a developer at Eyevinn Technology, the leading European independent consultancy firm specializing in video technology and media distribution.

If you need assistance in the development and implementation of this, our team of video developers is happy to help out. If you have any questions or comments, just drop us a line in the comments section of this post.
