Priyansh Jain

How I developed a captcha cracker for my University's website

Hello again!

Consider this a spinoff of my original article. I had some requests from the readers to explain how I developed the parser, and hence I decided to share the story of my first (significant?) project with you guys.
Repository Link

Let's start!

When I developed this set of scripts, I had zero knowledge of image processing or the algorithms used in it. I worked on this in my first (fresher) year.

The basic ideas I had in mind when I started:

  • An image is basically a matrix, with pixels as individual cells.
  • A color image has a tuple of (Red, Green, Blue) values for every pixel, while a grayscale image has a single value per pixel. In a typical 8-bit image, each value ranges from 0 to 255.
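
To make these two ideas concrete, here is a minimal sketch in plain Python. The nested-list layout and the `to_gray` helper are illustrative assumptions, not code from the project. The weights are the ITU-R 601-2 luma coefficients that Pillow documents for `convert("L")`; Pillow's exact rounding may differ.

```python
# A grayscale image as a matrix: one integer in the range 0..255 per pixel.
gray_image = [
    [255, 255, 0],
    [255, 0, 255],
]

# A color image stores a (red, green, blue) tuple per pixel instead.
color_image = [
    [(255, 0, 0), (0, 255, 0), (0, 0, 255)],
    [(0, 0, 0), (128, 128, 128), (255, 255, 255)],
]


def to_gray(rgb):
    """Collapse an (R, G, B) tuple to one gray value using ITU-R 601-2 luma weights."""
    r, g, b = rgb
    return (r * 299 + g * 587 + b * 114) // 1000


# Convert the whole color matrix, pixel by pixel.
gray_from_color = [[to_gray(pixel) for pixel in row] for row in color_image]
```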

So the student login portal in my college looks like this:
captcha

To begin with, I had made some very useful observations about the image.
captchaimage

  • The number of characters in the captcha is always 6, and it is a grayscale image.
  • The spacing between the characters looks constant.
  • Each character is completely defined.
  • The image has many stray dark pixels, and lines passing through the image.

So I downloaded one such image and, using this tool, visualized it in binary (0 for a black pixel and 1 for a white pixel).
captchabinary
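
The same kind of view can be produced without an external tool. The sketch below is an assumption about how such a dump might look, not the tool linked above: the hypothetical `to_binary_text` helper prints 0 for pure black and 1 for everything else.

```python
def to_binary_text(matrix):
    """Render a grayscale matrix as rows of '0' (black) and '1' (anything else)."""
    return "\n".join(
        "".join("0" if value == 0 else "1" for value in row)
        for row in matrix
    )


# A tiny 3x3 stand-in for a captcha: a black plus sign on a white background.
sample = [
    [255, 0, 255],
    [0, 0, 0],
    [255, 0, 255],
]
print(to_binary_text(sample))
```

Printed this way, character shapes and evenly spaced gaps between them become easy to spot by eye.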

My observation was right: the image is 180x45 pixels (width x height), and each character is allotted a 30-pixel-wide slot, which makes them evenly spaced.
That gave me step 1:

  • Crop any image you get into 6 different parts, each having a width of 30 pixels.

I chose Python as my prototyping language, as its libraries are easy to use.
A quick search led me to the PIL library (these days maintained as Pillow). I decided to use the Image module, as all I needed was to crop the image and load it as a matrix.
According to the documentation, the syntax for cropping an image is:

from PIL import Image
image = Image.open("filename.xyz")
cropped_image = image.crop((left, upper, right, lower))

In my case, to crop just the first character:

from PIL import Image
image = Image.open("captcha.png").convert("L") # Grayscale conversion
cropped_image = image.crop((0, 0, 30, 45))
cropped_image.save("cropped_image.png")

The image that got saved:
cropped_image

I wrapped this in a loop, wrote a simple script that fetches 500 captcha images from the site, and saved all the cropped characters into a folder.
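
The loop itself boils down to computing six crop boxes. Here is a rough sketch of that step; the `character_boxes` helper and its defaults are hypothetical names chosen for illustration, based on the 180x45 image and 30-pixel slots described above.

```python
def character_boxes(image_width=180, image_height=45, count=6):
    """Return one (left, upper, right, lower) crop box per character slot."""
    slot_width = image_width // count
    return [
        (index * slot_width, 0, (index + 1) * slot_width, image_height)
        for index in range(count)
    ]


boxes = character_boxes()
# Each box can be passed to Pillow's Image.crop(), e.g. image.crop(boxes[0]).
```

Keeping the box arithmetic in one place means the same boxes can be reused later when solving new captchas.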

Moving on to the third observation: every character is well defined.
To "clean" a cropped character (remove the unnecessary lines and dots), I used the following method.

  • All the pixels in a character are pure black (0). I used simple logic: if a pixel is not completely black, it's white. Hence every pixel with a value greater than 0 is reassigned to 255. The load() function exposes the image's pixels as a matrix-like object indexed as [x, y] (30x45 for a cropped character, 180x45 for the full captcha), which is then processed in place.
pixel_matrix = cropped_image.load()  # pixel access object, indexed as [x, y]
for y in range(cropped_image.height):
    for x in range(cropped_image.width):
        if pixel_matrix[x, y] != 0:  # anything that is not pure black...
            pixel_matrix[x, y] = 255  # ...becomes white
cropped_image.save("thresholded_image.png")

For clarity's sake, I applied the code to the original image.
Original:
original
Modified:
thresholded-full
As you can see, all the pixels that weren't completely dark have been removed, including the line that passed through the image.
It was only later, after the project was completed, that I learnt that this method is called thresholding in image processing.
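
The same rule can be written without Pillow. This is only a sketch on a nested-list matrix; the `threshold` helper is a hypothetical name and returns a copy instead of editing in place.

```python
def threshold(matrix):
    """Return a copy in which every pixel that is not pure black (0) becomes white (255)."""
    return [[0 if value == 0 else 255 for value in row] for row in matrix]


noisy = [
    [0, 37, 255],
    [120, 0, 1],
]
clean = threshold(noisy)
```

General thresholding compares each pixel against a chosen cutoff; the rule used here is the special case where the cutoff sits right above 0.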

Moving on to the fourth observation: there are many stray pixels in the image.
I looped through the image matrix: if a dark pixel has white pixels on two opposite sides (above and below, or left and right), I made it white.

# pixel_matrix is indexed as [x, y]; border pixels are skipped.
for y in range(1, image.height - 1):
    for x in range(1, image.width - 1):
        # Dark pixel with white directly above and below it
        if pixel_matrix[x, y] == 0 \
            and pixel_matrix[x, y - 1] == 255 and pixel_matrix[x, y + 1] == 255:
            pixel_matrix[x, y] = 255
        # Dark pixel with white directly to its left and right
        if pixel_matrix[x, y] == 0 \
            and pixel_matrix[x - 1, y] == 255 and pixel_matrix[x + 1, y] == 255:
            pixel_matrix[x, y] = 255

Output:
nostray_image

So you see, the image has been reduced to the individual characters themselves! Even though it may look like some characters have lost their base pixels, they serve as very good skeletons for other images to compare with. After all, the main reason we're doing so many changes is to generate a proper image for every possible character.
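
To see why some characters lose thin strokes, it helps to run the stray-pixel rule on a tiny example. The sketch below restates the rule on a nested-list matrix (rows first); `remove_stray_pixels` is a hypothetical helper, not the project's code.

```python
WHITE, BLACK = 255, 0


def remove_stray_pixels(matrix):
    """Whiten dark pixels whose two vertical or two horizontal neighbors are white.

    Works in place, like the loop above, so earlier changes affect later checks.
    Border pixels are skipped because they lack a full set of neighbors.
    """
    height, width = len(matrix), len(matrix[0])
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            if matrix[y][x] != BLACK:
                continue
            vertical_gap = matrix[y - 1][x] == WHITE and matrix[y + 1][x] == WHITE
            horizontal_gap = matrix[y][x - 1] == WHITE and matrix[y][x + 1] == WHITE
            if vertical_gap or horizontal_gap:
                matrix[y][x] = WHITE
    return matrix


# A one-pixel-thick horizontal line on a white background.
thin_line = [
    [255, 255, 255, 255, 255],
    [0, 0, 0, 0, 0],
    [255, 255, 255, 255, 255],
]
remove_stray_pixels(thin_line)
```

The interior of the thin line is erased because every pixel on it has white above and below, which is also what happens to one-pixel-wide strokes of a character. Strokes at least two pixels thick survive.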

I applied the above algorithm to all the cropped characters and stored them in a new folder. The next task was to name at least one sample for each character belonging to "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789". This step was like the "training" step, where I manually selected a character image for each one and renamed them.

With this step complete, I had a skeleton image for every character!
skeleton

I ran a couple of other scripts to pick the best image among all images of a character. For example, if there were 20 'A' character images, the one with the fewest dark pixels was the one with the least noise and hence the best fit for the skeleton image. So there were two scripts:

  • One to group similar images sorted by character (constraints: no. of dark pixels, and similarity >= 90 - 95 %)
  • One to get the best images from every grouped character

With that, the library images were generated. I converted them to pixel matrices and stored the "bitmaps" as a JSON file.
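
As a rough idea of what such a file can hold, here is a sketch that round-trips a tiny bitmap dictionary through JSON. The layout (a character key mapped to a list of rows) is an assumption inferred from how the solver indexes `bitmap[char][y][x]`; the matrices are made-up toy data.

```python
import json


def bitmaps_to_json(skeletons):
    """Serialize {character: matrix} to a JSON string; matrices are lists of rows."""
    return json.dumps(skeletons)


skeletons = {
    "a": [[255, 0, 255], [0, 255, 0]],
    "b": [[0, 0, 255], [0, 255, 0]],
}
encoded = bitmaps_to_json(skeletons)
restored = json.loads(encoded)
# Rows come first, so a pixel at column x, row y is restored[char][y][x].
```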

Finally, here is the algorithm that solves any new captcha image:

  • Reduce the unnecessary noise in the new image using the same algorithms.
  • For every character in the new captcha image, brute-force through the JSON bitmaps generated earlier. The similarity is calculated from how many corresponding dark pixels match.
    • This means that if a pixel is dark at position (4, 8) in the captcha character and the pixel at the same position in the skeleton image/bitmap is also dark, the count is incremented by 1.
    • This count is divided by the number of dark pixels in the skeleton image to get the percentage match. The percentage and the character it was calculated for are stored in a dictionary.
  • The character with the highest percentage match is selected.
    import json

    # `image` is the cleaned (thresholded, de-noised) grayscale captcha from the steps above.
    characters = "123456789abcdefghijklmnpqrstuvwxyz"
    captcha = ""
    with open("bitmaps.json", "r") as f:
        bitmap = json.load(f)

    char_width = image.width // 6  # 30 pixels per character; integer division so range() works
    for j in range(char_width, image.width + 1, char_width):
        # Keep only rows 12..43 of each 30-pixel-wide character slot
        character_image = image.crop((j - char_width, 12, j, 44))
        character_matrix = character_image.load()
        matches = {}
        for char in characters:
            match = 0
            black = 0
            bitmap_matrix = bitmap[char]
            for y in range(0, 32):
                for x in range(0, 30):
                    if character_matrix[x, y] == bitmap_matrix[y][x] and bitmap_matrix[y][x] == 0:
                        match += 1
                    if bitmap_matrix[y][x] == 0:
                        black += 1
            perc = float(match) / float(black)
            matches.update({perc: char[0].upper()})
        try:
            captcha += matches[max(matches.keys())]
        except ValueError:
            print("failed captcha")
            captcha += "0"
    print(captcha)
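
The scoring step is easier to follow on a toy example. The sketch below is a simplified restatement on nested-list matrices, not the extension's code: `match_percentage` and `best_match` are hypothetical helpers, and scores are keyed by character so that two characters with the same percentage do not overwrite each other.

```python
def match_percentage(candidate, skeleton):
    """Fraction of the skeleton's dark pixels that are also dark in the candidate."""
    matched = 0
    dark = 0
    for candidate_row, skeleton_row in zip(candidate, skeleton):
        for candidate_value, skeleton_value in zip(candidate_row, skeleton_row):
            if skeleton_value == 0:
                dark += 1
                if candidate_value == 0:
                    matched += 1
    # A skeleton with no dark pixels cannot match anything.
    return matched / dark if dark else 0.0


def best_match(candidate, bitmaps):
    """Return the (character, score) pair with the highest match percentage."""
    scores = {char: match_percentage(candidate, m) for char, m in bitmaps.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Two toy 3x2 skeletons and a slightly noisy candidate.
bitmaps = {
    "i": [[255, 0, 255], [255, 0, 255]],
    "l": [[0, 255, 255], [0, 0, 255]],
}
candidate = [[255, 0, 255], [255, 0, 0]]
character, score = best_match(candidate, bitmaps)
```

Here both dark pixels of the "i" skeleton are dark in the candidate, while only one of the three dark pixels of "l" is, so "i" wins. Extra dark pixels in the candidate do not lower the score, which is why leftover noise is tolerated.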

And the final result we get is:
Final

which is Z5M3MQ - The captcha has been solved successfully!

So that's pretty much how I did it. It was a great learning experience. I also developed a Chrome extension that uses the algorithm, and it has 1800+ users!

Would love to see your opinions and suggestions!
The above code is hosted here

Top comments (32)

Peter Kim Frank

This was a really fun and entertaining read! I really appreciated the step-by-step breakdown for how you thought about the problem, identified some key insights, and then implemented a solution. Thank you for sharing :)

Priyansh Jain

Thank you!

Lauri Elias

Good to see you didn't run to TensorFlow right away. Sometimes you just don't need a billion samples and a 16-layer neural net.

Rafal Pienkowski

Very nice article.
When I was studying I really liked my classes about signal processing. When I was reading your article I felt as I went back in time. 😁
I'm not sure if you have heard about salt-and-pepper noise? It refers to additional dots on an image. There are several techniques for removing such noise, like the median filter. Maybe it would give even better results.

Thank you for your post.

Priyansh Jain

Thanks!
Yes, I know about the median filter. Very recently had a semester long course in image processing. I had implemented region growing with this, to get great noise free skeletons. Will try with the median filter!

Priyansh Jain • Edited

Median
Got this result, with 2px radius.

Rafal Pienkowski

I thought it would be better 🤔 Thanks for your time, I appreciate it.

balaji radhakrishnan

Image processing is a beautiful subject. With no knowledge about it I did my final year project on it. The fun part is that the algo you build will already exist and gives you the good feeling that we are doing something awesome. Good luck :)

Priyansh Jain

So true.

Mārtiņš Briedis

After the clean-up step, did you try a simple OCR approach?

Greg Chabala

I appreciate the detail in the article, but that captcha looks so trivial I bet off the shelf OCR libraries could handle it without any preprocessing.

Priyansh Jain

Sorry for the late reply, but yeah they most definitely would. I just wanted to do something real new cause I was new to programming and this was actually something that wasn't taught in class. Felt nice.

Mark Johnson 👔

Thanks for writing this up! I enjoyed reading about how you solved the problem, and your explanation was super clear and helpful :) I'm amazed how easy this was to accomplish... I would have expected clearing the lines to be more difficult. What do you think you would have done if they were pure black like the letters?

Priyansh Jain • Edited

Actually, it's cases like these where getting the best skeleton out of all the skeletons comes in handy.

Priyansh Jain

I would have eliminated single pixel thickness lines, by checking the top and bottom pixels I guess

Hussain Tamboli

Nicely explained! I was also trying to do similar thing some time ago using tesseract-ocr. It was not that accurate though :)

Sean Killeen

Hey! I've noticed that in this post you use "guys" as a reference to the entire community, which is not made up of only guys but a variety of community members.

I'm running an experiment and hope you'll participate. Would you consider changing "guys" to a more inclusive term? If you're open to that, please let me know when you've changed it and I'll delete this comment.

For more information and some alternate suggestions, see dev.to/seankilleen/a-quick-experim....

Thanks for considering!

Ash • Edited

This is why you use Google's reCAPTCHA :) the old traditional captcha is worthless nowadays, might as well not have it, it won't make much of a difference.

5bentz

You may be surprised at the effectiveness of such simple CAPTCHA's.
On a website of mine, a CAPTCHA has neutralized spam messages. The CAPTCHA is more elaborate than this one, but still...Bad CAPTCHA's are better than nothing ;)
