Preprocessing and OCR
When we preprocess an image, we transform it to make it more OCR-friendly. OCR engines are usually trained on image data resembling print, so the closer the text in your image is to print, the better the OCR will perform. In this post, we will apply several preprocessing methods to improve our OCR accuracy.
Methods of Preprocessing
- Binarization
- Skew Correction
- Noise Removal
- Thinning and Skeletonization
You can find detailed information on each of these methods in this article. Here we will focus on working with dialogue text from video games.
Quick Setup
In my last post, I talked about how to snip screenshots from videos and run OCR in the browser with tesseract.js. We can reuse our code for this demonstration.
Using HTML Canvas to Snip Screenshots of Your Video
Mathew Chan ・ Nov 11 '20
To get started, you can download the HTML file and open it in your browser. It will prompt you to select a window for sharing. After that, click and drag over your video to snip an image for OCR.
Binarization
To binarize an image means to convert the pixels of an image to either black or white. To determine whether a pixel becomes black or white, we define a threshold value. Pixels whose grayscale value is at or above the threshold become white; all others become black.
Applying a threshold filter removes a lot of unwanted information from the image.
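To make the idea concrete, here's a rough sketch that binarizes two hand-picked pixels. The function name binarizePixel and the sample colors are made up for this example; the luminance weights and the "at or above the threshold becomes white" convention match the filter code used in this post.

```javascript
// Hypothetical helper for illustration only: binarize a single [r, g, b] pixel.
function binarizePixel([r, g, b], level = 0.5) {
  const thresh = Math.floor(level * 255); // 0.5 -> 127
  // Weighted sum approximating perceived brightness.
  const gray = 0.2126 * r + 0.7152 * g + 0.0722 * b;
  return gray >= thresh ? 255 : 0;
}

const lightGray = binarizePixel([200, 200, 200]); // gray is about 200 -> white
const darkBlue = binarizePixel([20, 30, 120]);    // gray is about 34.4 -> black
console.log(lightGray, darkBlue); // 255 0
```

Green carries most of the weight, which is why the fairly saturated blue pixel still counts as dark.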
Let's add two functions: preprocessImage and thresholdFilter. preprocessImage takes the canvas and reads its pixel data with ctx.getImageData().data, and thresholdFilter works on that pixel array. For every pixel we calculate its grayscale value from its [r,g,b] values and compare it to our threshold level to set it to either black or white.
function preprocessImage(canvas) {
  const processedImageData = canvas.getContext('2d').getImageData(0, 0, canvas.width, canvas.height);
  // Pass the level positionally; writing level=0.5 here would assign to an undeclared global.
  thresholdFilter(processedImageData.data, 0.5);
  return processedImageData;
}
// from https://github.com/processing/p5.js/blob/main/src/image/filters.js
function thresholdFilter(pixels, level) {
  if (level === undefined) {
    level = 0.5;
  }
  const thresh = Math.floor(level * 255);
  // Pixels are stored as [r, g, b, a, r, g, b, a, ...], so step by 4.
  for (let i = 0; i < pixels.length; i += 4) {
    const r = pixels[i];
    const g = pixels[i + 1];
    const b = pixels[i + 2];
    const gray = 0.2126 * r + 0.7152 * g + 0.0722 * b;
    let val;
    if (gray >= thresh) {
      val = 255; // at or above the threshold: white
    } else {
      val = 0; // below the threshold: black
    }
    pixels[i] = pixels[i + 1] = pixels[i + 2] = val;
  }
}
Then call our new function in the VideoToCroppedImage function after we are done snipping the image with drawImage. We can apply the processed image to the canvas with putImageData.
function VideoToCroppedImage({width, height, x, y}) {
  ..
  ctx2.drawImage(videoElement, x*aspectRatioX, y*aspectRatioY, width*aspectRatioX, height*aspectRatioY, 0, 0, cv2.width, cv2.height);
  ctx2.putImageData(preprocessImage(cv2), 0, 0);
  const dataURI = cv2.toDataURL('image/jpeg');
  recognize_image(dataURI);
}
Here's what it looks like before and after the threshold filter.
OCR Results:
The filter removed the gray patterns behind the text. Now our OCR result has one fewer error!
Here's a more challenging image.
OCR Results:
As you can see, the background strokes are creating noise. Simply applying the threshold filter would worsen the OCR result.
Let's find out how to remove noise.
Noise Removal
We can remove patches of high intensity in an image by blurring it. Box blur and Gaussian blur are two of the many blurring methods.
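Before getting to the Gaussian blur, here's a minimal sketch of the simpler box blur to show why blurring helps. It works on a single-channel grayscale array rather than canvas RGBA data, and the function name boxBlurGray is made up for this example; it is not the p5.js implementation used in the rest of this post.

```javascript
// Hypothetical sketch: 3x3 box blur over a grayscale image stored row by row
// in a flat array. Each output pixel is the average of its in-bounds neighbours.
function boxBlurGray(gray, width, height) {
  const out = new Uint8ClampedArray(gray.length);
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      let sum = 0;
      let count = 0;
      for (let dy = -1; dy <= 1; dy++) {
        for (let dx = -1; dx <= 1; dx++) {
          const nx = x + dx;
          const ny = y + dy;
          if (nx >= 0 && nx < width && ny >= 0 && ny < height) {
            sum += gray[ny * width + nx];
            count++;
          }
        }
      }
      out[y * width + x] = sum / count;
    }
  }
  return out;
}

// A single bright speck (255) on a black 3x3 background.
const speck = Uint8ClampedArray.from([
  0, 0, 0,
  0, 255, 0,
  0, 0, 0,
]);
const blurred = boxBlurGray(speck, 3, 3);
console.log(blurred[4]); // centre: 255 / 9 is about 28.3, stored as 28
```

After the blur, the isolated speck is far below a typical threshold, so it no longer stands out from the dark background. Real text strokes are wider than one pixel and survive much better.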
Insert two helper functions getARGB and setPixels.
// Pack the RGBA bytes of pixel i into one 32-bit integer laid out as ARGB.
function getARGB(data, i) {
  const offset = i * 4;
  return (
    ((data[offset + 3] << 24) & 0xff000000) |
    ((data[offset] << 16) & 0x00ff0000) |
    ((data[offset + 1] << 8) & 0x0000ff00) |
    (data[offset + 2] & 0x000000ff)
  );
}
// Unpack an array of ARGB integers back into the RGBA byte array.
function setPixels(pixels, data) {
  let offset = 0;
  // Loop over the packed pixels (one entry per pixel), not over the RGBA bytes.
  for (let i = 0, al = data.length; i < al; i++) {
    offset = i * 4;
    pixels[offset + 0] = (data[i] & 0x00ff0000) >>> 16;
    pixels[offset + 1] = (data[i] & 0x0000ff00) >>> 8;
    pixels[offset + 2] = data[i] & 0x000000ff;
    pixels[offset + 3] = (data[i] & 0xff000000) >>> 24;
  }
}
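If the bit shifting looks opaque, here's a single-pixel sketch of the same layout. The names packARGB and unpackARGB are made up for this example and only mirror what the two helpers do for one pixel at a time.

```javascript
// Hypothetical single-pixel versions of the packing used by the helpers above.
function packARGB(r, g, b, a) {
  // Alpha in the top byte, then red, green, blue.
  return (a << 24) | (r << 16) | (g << 8) | b;
}

function unpackARGB(argb) {
  return {
    r: (argb >>> 16) & 0xff,
    g: (argb >>> 8) & 0xff,
    b: argb & 0xff,
    a: (argb >>> 24) & 0xff,
  };
}

const packed = packARGB(255, 128, 0, 255); // opaque orange
// Negative as a signed 32-bit int because the alpha byte sets the sign bit.
console.log(packed, (packed >>> 0).toString(16)); // -32768 ffff8000
console.log(unpackARGB(packed)); // { r: 255, g: 128, b: 0, a: 255 }
```

Packing a pixel into one integer lets the blur and dilate filters move whole pixels around with a single array read or write.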
For the Gaussian blur, add two functions buildBlurKernel and blurARGB.
// internal kernel stuff for the gaussian blur filter
let blurRadius;
let blurKernelSize;
let blurKernel;
let blurMult;
// from https://github.com/processing/p5.js/blob/main/src/image/filters.js
function buildBlurKernel(r) {
let radius = (r * 3.5) | 0;
radius = radius < 1 ? 1 : radius < 248 ? radius : 248;
if (blurRadius !== radius) {
blurRadius = radius;
blurKernelSize = (1 + blurRadius) << 1;
blurKernel = new Int32Array(blurKernelSize);
blurMult = new Array(blurKernelSize);
for (let l = 0; l < blurKernelSize; l++) {
blurMult[l] = new Int32Array(256);
}
let bk, bki;
let bm, bmi;
for (let i = 1, radiusi = radius - 1; i < radius; i++) {
blurKernel[radius + i] = blurKernel[radiusi] = bki = radiusi * radiusi;
bm = blurMult[radius + i];
bmi = blurMult[radiusi--];
for (let j = 0; j < 256; j++) {
bm[j] = bmi[j] = bki * j;
}
}
bk = blurKernel[radius] = radius * radius;
bm = blurMult[radius];
for (let k = 0; k < 256; k++) {
bm[k] = bk * k;
}
}
}
// from https://github.com/processing/p5.js/blob/main/src/image/filters.js
function blurARGB(pixels, canvas, radius) {
const width = canvas.width;
const height = canvas.height;
const numPackedPixels = width * height;
const argb = new Int32Array(numPackedPixels);
for (let j = 0; j < numPackedPixels; j++) {
argb[j] = getARGB(pixels, j);
}
let sum, cr, cg, cb, ca;
let read, ri, ym, ymi, bk0;
const a2 = new Int32Array(numPackedPixels);
const r2 = new Int32Array(numPackedPixels);
const g2 = new Int32Array(numPackedPixels);
const b2 = new Int32Array(numPackedPixels);
let yi = 0;
buildBlurKernel(radius);
let x, y, i;
let bm;
for (y = 0; y < height; y++) {
for (x = 0; x < width; x++) {
cb = cg = cr = ca = sum = 0;
read = x - blurRadius;
if (read < 0) {
bk0 = -read;
read = 0;
} else {
if (read >= width) {
break;
}
bk0 = 0;
}
for (i = bk0; i < blurKernelSize; i++) {
if (read >= width) {
break;
}
const c = argb[read + yi];
bm = blurMult[i];
ca += bm[(c & -16777216) >>> 24];
cr += bm[(c & 16711680) >> 16];
cg += bm[(c & 65280) >> 8];
cb += bm[c & 255];
sum += blurKernel[i];
read++;
}
ri = yi + x;
a2[ri] = ca / sum;
r2[ri] = cr / sum;
g2[ri] = cg / sum;
b2[ri] = cb / sum;
}
yi += width;
}
yi = 0;
ym = -blurRadius;
ymi = ym * width;
for (y = 0; y < height; y++) {
for (x = 0; x < width; x++) {
cb = cg = cr = ca = sum = 0;
if (ym < 0) {
bk0 = ri = -ym;
read = x;
} else {
if (ym >= height) {
break;
}
bk0 = 0;
ri = ym;
read = x + ymi;
}
for (i = bk0; i < blurKernelSize; i++) {
if (ri >= height) {
break;
}
bm = blurMult[i];
ca += bm[a2[read]];
cr += bm[r2[read]];
cg += bm[g2[read]];
cb += bm[b2[read]];
sum += blurKernel[i];
ri++;
read += width;
}
argb[x + yi] =
((ca / sum) << 24) |
((cr / sum) << 16) |
((cg / sum) << 8) |
(cb / sum);
}
yi += width;
ymi += width;
ym++;
}
setPixels(pixels, argb);
}
For this example, we also need two more functions:
- invertColors: inverts the colors of the pixels.
- dilate: increases light areas of the image.
function invertColors(pixels) {
  for (let i = 0; i < pixels.length; i += 4) {
    pixels[i] = pixels[i] ^ 255; // Invert Red
    pixels[i + 1] = pixels[i + 1] ^ 255; // Invert Green
    pixels[i + 2] = pixels[i + 2] ^ 255; // Invert Blue
  }
}
// from https://github.com/processing/p5.js/blob/main/src/image/filters.js
function dilate(pixels, canvas) {
let currIdx = 0;
const maxIdx = pixels.length ? pixels.length / 4 : 0;
const out = new Int32Array(maxIdx);
let currRowIdx, maxRowIdx, colOrig, colOut, currLum;
let idxRight, idxLeft, idxUp, idxDown;
let colRight, colLeft, colUp, colDown;
let lumRight, lumLeft, lumUp, lumDown;
while (currIdx < maxIdx) {
currRowIdx = currIdx;
maxRowIdx = currIdx + canvas.width;
while (currIdx < maxRowIdx) {
colOrig = colOut = getARGB(pixels, currIdx);
idxLeft = currIdx - 1;
idxRight = currIdx + 1;
idxUp = currIdx - canvas.width;
idxDown = currIdx + canvas.width;
if (idxLeft < currRowIdx) {
idxLeft = currIdx;
}
if (idxRight >= maxRowIdx) {
idxRight = currIdx;
}
if (idxUp < 0) {
idxUp = 0;
}
if (idxDown >= maxIdx) {
idxDown = currIdx;
}
colUp = getARGB(pixels, idxUp);
colLeft = getARGB(pixels, idxLeft);
colDown = getARGB(pixels, idxDown);
colRight = getARGB(pixels, idxRight);
//compute luminance
currLum =
77 * ((colOrig >> 16) & 0xff) +
151 * ((colOrig >> 8) & 0xff) +
28 * (colOrig & 0xff);
lumLeft =
77 * ((colLeft >> 16) & 0xff) +
151 * ((colLeft >> 8) & 0xff) +
28 * (colLeft & 0xff);
lumRight =
77 * ((colRight >> 16) & 0xff) +
151 * ((colRight >> 8) & 0xff) +
28 * (colRight & 0xff);
lumUp =
77 * ((colUp >> 16) & 0xff) +
151 * ((colUp >> 8) & 0xff) +
28 * (colUp & 0xff);
lumDown =
77 * ((colDown >> 16) & 0xff) +
151 * ((colDown >> 8) & 0xff) +
28 * (colDown & 0xff);
if (lumLeft > currLum) {
colOut = colLeft;
currLum = lumLeft;
}
if (lumRight > currLum) {
colOut = colRight;
currLum = lumRight;
}
if (lumUp > currLum) {
colOut = colUp;
currLum = lumUp;
}
if (lumDown > currLum) {
colOut = colDown;
currLum = lumDown;
}
out[currIdx++] = colOut;
}
}
setPixels(pixels, out);
};
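To see what dilation does without the ARGB bookkeeping, here's a simplified grayscale sketch. The name dilateGray is made up for this example, and its edge handling (out-of-bounds neighbours are just skipped) differs slightly from the p5.js code above.

```javascript
// Hypothetical grayscale sketch of dilation: each pixel takes the brightest
// value among itself and its left/right/up/down neighbours.
function dilateGray(gray, width, height) {
  const out = new Uint8ClampedArray(gray.length);
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const i = y * width + x;
      let max = gray[i];
      if (x > 0) max = Math.max(max, gray[i - 1]);
      if (x < width - 1) max = Math.max(max, gray[i + 1]);
      if (y > 0) max = Math.max(max, gray[i - width]);
      if (y < height - 1) max = Math.max(max, gray[i + width]);
      out[i] = max;
    }
  }
  return out;
}

// A one-pixel-wide vertical stroke on a 5x3 black background.
const stroke = Uint8ClampedArray.from([
  0, 0, 255, 0, 0,
  0, 0, 255, 0, 0,
  0, 0, 255, 0, 0,
]);
const thick = dilateGray(stroke, 5, 3);
console.log(Array.from(thick.slice(0, 5))); // [0, 255, 255, 255, 0]
```

The stroke grows from one pixel wide to three, which is how dilation restores thin bright glyphs that the blur has softened.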
Finally, call these newly created filters in the preprocessing function. The order of these filters is significant as you will see later.
function preprocessImage(canvas) {
  const processedImageData = canvas.getContext('2d').getImageData(0, 0, canvas.width, canvas.height);
  // Arguments are positional; radius=1 or level=0.4 here would assign to undeclared globals.
  blurARGB(processedImageData.data, canvas, 1); // blur radius
  dilate(processedImageData.data, canvas);
  invertColors(processedImageData.data);
  thresholdFilter(processedImageData.data, 0.4); // threshold level
  return processedImageData;
}
Here's what the image looks like after every filter is applied.
OCR Results:
After a series of filters, our image looks a lot more like printed text and the result is nearly perfect!
Let's go through what each filter does to the image.
- Gaussian Blur: Smooths the image to remove random areas of high intensity.
- Dilation: Expands the bright regions, thickening the white text.
- Color Inversion: Makes the bright text dark and the dark background light.
- Threshold Filter: Turns light pixels, including the background, white and turns the dark text black.
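The last two steps are easy to check by hand. In this small sketch, the two sample values are assumptions picked to stand in for a white glyph pixel and a dark backdrop pixel, and the one-line helpers are simplified grayscale stand-ins for invertColors and thresholdFilter.

```javascript
// Hypothetical grayscale walk-through of inversion followed by thresholding.
const invert = (v) => v ^ 255; // same XOR trick as invertColors
const threshold = (v, level) => (v >= Math.floor(level * 255) ? 255 : 0);

const brightText = 230;    // assumed sample value for a white glyph pixel
const darkBackground = 40; // assumed sample value for the backdrop

// With level 0.4 the cutoff is floor(0.4 * 255) = 102.
const textOut = threshold(invert(brightText), 0.4);           // 25 < 102   -> 0
const backgroundOut = threshold(invert(darkBackground), 0.4); // 215 >= 102 -> 255
console.log(textOut, backgroundOut); // 0 255
```

Swapping the order would give the opposite result (white text on black), which is one reason the order of the filters matters.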
Note: There is no need to reinvent the wheel by writing your own filter algorithms. I borrowed these algorithms from the p5.js repository and this article so I can use the functions that I need without having to import an entire image processing library like OpenCV.
Wrapping it up
When it comes to OCR, data quality and data cleansing can be even more important to the end result than data training.
There are many more methods to preprocess data, and you will have to decide which ones to use. Alternatively, to expand on this project, you can employ adaptive processing or set rules, such as inverting colors when the text is white or applying the threshold filter only when the background is light.
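As one possible starting point for such a rule, the sketch below guesses whether an image has light text on a dark background by looking at its average brightness. The function names meanLuminance and shouldInvert and the cutoff of 127 are all assumptions made for this example, not something tested against real game screenshots.

```javascript
// Hypothetical rule: if the image is dark on average, assume light text on a
// dark background and invert before thresholding.
function meanLuminance(pixels) {
  let total = 0;
  // Pixels are RGBA bytes, so step by 4 and ignore alpha.
  for (let i = 0; i < pixels.length; i += 4) {
    total += 0.2126 * pixels[i] + 0.7152 * pixels[i + 1] + 0.0722 * pixels[i + 2];
  }
  return total / (pixels.length / 4);
}

function shouldInvert(pixels, cutoff = 127) {
  return meanLuminance(pixels) < cutoff;
}

// Two tiny RGBA buffers standing in for the data from ctx.getImageData().
const mostlyDark = Uint8ClampedArray.from([
  10, 10, 10, 255, 20, 20, 20, 255, 250, 250, 250, 255,
]);
const mostlyLight = Uint8ClampedArray.from([
  240, 240, 240, 255, 230, 230, 230, 255, 15, 15, 15, 255,
]);
console.log(shouldInvert(mostlyDark), shouldInvert(mostlyLight)); // true false
```

In preprocessImage, you could then call invertColors only when shouldInvert returns true. A global average is a crude signal, so treat it as a first experiment rather than a reliable detector.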
Let me know if you found this post helpful. :)
Top comments (7)
Thank you, your tutorial did improve my tesseract.js OCR operations!
Recently I started experimenting with scanning smaller images, like 300px x 100px, and this logic isn't helping anymore. Judging by the generated images, I think the filters lose efficacy if the sample is too small. But I'm not able to really learn the inner workings of these filters one by one. Do you know what type of changes I should try first in order to achieve better results on small images?
I want to use this code in a React app. Can you give the whole code as one function that takes in an image and returns one?
Thanks for the nice tutorial. Can you provide a python project that outputs same result?
Sure, I'll let you know when I write it up. It's going to cover OpenCV, and I was also thinking of touching on OpenCV's adaptive thresholding. Lately I feel like opencv-python is relatively small compared to the other libraries one might need for a Python project anyway.
Hi, I am working on a similar project. Is this code available in a GitHub repo?
You should be able to implement this by starting with the base code shown in the last post.
The final product is here, but there is a lot of other code.
Can this battery of filters improve all kinds of images, or are there some that require other filters?
In the second case, how do I know automatically what filters to apply?