Neel

Posted on Jun 29, 2023 • Edited on Jul 25, 2023

Convert a PDF document into images using Python

#python #programming #tutorial #beginners

I was working on a Python project recently (for a DEV hackathon, of course) and I needed a library to extract the pages from a huge PDF document as images. The primary use case was to extract the content from the PDF document and doing this in a blocking way was expensive. So I was looking for options to split the document's pages into multiple images which would let me concurrently do the content extraction.

As any rad developer would do, I asked ChatGPT for a solution and got a pretty easy-to-integrate option to achieve my goal.

The solution is to use the Python pdf2image package and write just 2 lines of code to extract the pages of the PDF document as images.

If you are not interested in the specifics of the process and want to get straight to the code, you can jump to the pdf2image notebook.

Setup

Even though the solution is pretty simple, you need to set up a utility called Poppler before you start writing your Python code. Poppler is the C++ based PDF rendering library that pdf2image uses behind the scenes to render the PDF document.

macOS

Setting up Poppler on a macOS device is as simple as running brew install poppler (assuming you have Homebrew installed on your machine).

Linux

If you are on a Linux machine, which is very common if you are running your application as a container, then the process of installing Poppler may differ based on the distro you are using.

If it's an Ubuntu-based machine or container, you can install Poppler with apt-get: sudo apt-get install poppler-utils

Windows

The process is slightly more elaborate on Windows (just like everything else on Windows 😋). The latest releases for the Windows binary can be found here.

Extract the zip file and add the <extracted folder>/Library/bin to the PATH environment variable
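Before moving on, it can help to confirm that the Poppler command-line tools are actually reachable from Python. The snippet below is a minimal sketch, not part of pdf2image itself: the helper name missing_poppler_tools is made up for this example, and it assumes that pdftoppm and pdfinfo are the binaries that need to be on PATH.

```python
import shutil

# Assumption: these are the Poppler command-line tools pdf2image relies on.
POPPLER_TOOLS = ("pdftoppm", "pdfinfo")


def missing_poppler_tools(tools=POPPLER_TOOLS):
    """Return the names of the given tools that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_poppler_tools()
if missing:
    print("Poppler tools not found on PATH:", ", ".join(missing))
else:
    print("Poppler looks ready to use")
```

If you would rather not touch PATH on Windows, the pdf2image conversion functions also accept a poppler_path argument that points at the folder containing these binaries.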

Time for some code

Once the setup is done, you can install the pdf2image package with pip and get on with your coding.

pip install pdf2image

Convert from path

If the PDF file you want to convert is nicely located in a directory, then the job is much easier. You can directly import the convert_from_path function from the pdf2image package to extract the pages as images.

from pdf2image import convert_from_path
import os

# Checks if the output folder exists and if not it creates a new one
if not os.path.exists("extracted_images"):
    os.mkdir("extracted_images")

convert_from_path("top_secret_document.pdf",
                  output_folder="extracted_images", fmt="jpeg")

The above code picks up the PDF document named top_secret_document.pdf from the current directory, extracts every page in the document as a JPEG image, and stores the images in the output folder named extracted_images.

The function expects the output folder to already exist, and if it does not, an exception will be raised. To avoid this, you can check for the existence of the folder and create it if it is not present.

The good thing about this function is that it assigns a UUID to the output images and appends a page counter to the file name to make the file names unique. If you want a common prefix instead of a UUID, you can use the output_file parameter accepted by the function.
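To make the output_file option a bit more concrete, here is a rough sketch. The helper name convert_with_prefix and the "page" prefix in the usage line are made up for illustration; output_file and paths_only are existing pdf2image parameters, but check the library docs for the exact file naming it produces.

```python
import os


def convert_with_prefix(pdf_path, output_folder, prefix):
    """Save each page as a JPEG named after `prefix` and return the file paths."""
    # Imported here so the helper can be defined even before pdf2image is installed.
    from pdf2image import convert_from_path

    # Create the folder if needed; exist_ok avoids an error when it already exists.
    os.makedirs(output_folder, exist_ok=True)

    return convert_from_path(
        pdf_path,
        output_folder=output_folder,
        fmt="jpeg",
        output_file=prefix,  # used in place of the generated UUID
        paths_only=True,     # return file paths instead of loaded PIL images
    )
```

You would call it as convert_with_prefix("top_secret_document.pdf", "extracted_images", "page"). Returning only the paths keeps memory usage low when all you need is the files on disk.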

Convert from bytes

In the project I was working on, the use case goes like this.

The users are asked to upload the document, and the uploaded file is submitted to a Go-based microservice. This service does some validations and other stuff behind the scenes. Once all the processing is done, it stores the document in a MinIO bucket and publishes an event to a Kafka topic.

The Python service acts as a consumer. Once the event is received, it fetches the PDF document from the bucket and starts the page extraction process. This means the document is not available as a file on the file system; it is available only as a byte stream.

This is not an unusual scenario for the pdf2image library, and it comes with a convert_from_bytes function to tackle such use cases.

from pdf2image import convert_from_bytes

# The get_object function fetches the PDF document stored in the object store
# (minio_client and bucket_name are assumed to be configured elsewhere)
response = minio_client.get_object(bucket_name, "top_secret_document.pdf")

document_as_bytes = response.read()

images = convert_from_bytes(document_as_bytes, fmt="jpeg", output_folder="extracted_images")

The above code will read the PDF document as bytes and store the pages into image files in the output directory.

This function is not limited to PDF documents downloaded from external sources. You can also read a file from the file system as bytes and use those bytes to extract the images.

from pdf2image import convert_from_bytes

# Open the file in binary mode; the with block closes it automatically
with open("./top_secret_document.pdf", "rb") as f:
    document_bytes = f.read()

images = convert_from_bytes(
    document_bytes, fmt="jpeg", output_folder="extracted_images")

Given that we already have the convert_from_path function to abstract all this boilerplate, I highly doubt anyone is ever going to use convert_from_bytes for a local PDF file. Still, I just wanted to show that it's possible.
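Coming back to the original motivation: once every page is its own image file, the per-page work can be fanned out to a pool of workers. This is only a sketch using the standard library. The describe_page function is a hypothetical stand-in for whatever content extraction you run on a page, and the file names are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def process_pages(page_paths, worker, max_workers=4):
    """Apply `worker` to every page path concurrently; results keep page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as the inputs.
        return list(pool.map(worker, page_paths))


def describe_page(path):
    """Hypothetical stand-in for the real per-page content extraction."""
    return f"processed {path}"


results = process_pages(["page-1.jpg", "page-2.jpg", "page-3.jpg"], describe_page)
print(results)
```

Threads are a reasonable fit when the per-page work mostly waits on I/O, such as calling an external service. For CPU-heavy extraction, ProcessPoolExecutor from the same module offers the same map interface.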

Conclusion

That's all there is to it. The pdf2image library is not limited to small PDF documents; it can work with large documents too. Both convert_from_path and convert_from_bytes accept a variety of arguments that can be used to control how the PDF document is rendered and how the images are extracted.
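As one illustration of those arguments, a large document can be converted a few pages at a time using first_page and last_page. Treat this as a sketch under assumptions rather than a recommended recipe: the helper names, the chunk size, and the dpi value are arbitrary choices for the example, and the output folder is assumed to exist already.

```python
def page_ranges(total_pages, chunk_size):
    """Yield (first_page, last_page) pairs, 1-based and inclusive."""
    for first in range(1, total_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, total_pages)


def convert_in_chunks(pdf_path, output_folder, chunk_size=25, dpi=150):
    """Convert a large PDF a few pages at a time and return the image paths."""
    # Imported here so page_ranges() stays usable without pdf2image installed.
    from pdf2image import convert_from_path, pdfinfo_from_path

    # pdfinfo_from_path reads the document metadata, including the page count.
    total_pages = pdfinfo_from_path(pdf_path)["Pages"]
    paths = []
    for first, last in page_ranges(total_pages, chunk_size):
        paths += convert_from_path(
            pdf_path,
            dpi=dpi,
            first_page=first,
            last_page=last,
            output_folder=output_folder,  # assumed to exist already
            fmt="jpeg",
            paths_only=True,
        )
    return paths


print(list(page_ranges(10, 4)))  # [(1, 4), (5, 8), (9, 10)]
```

Each chunk is an independent call, so the same page ranges could also be handed to separate workers if you want the rendering itself to run concurrently.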

I have attached some useful links in the references section below, which you can use to explore the pdf2image library in depth.

The code used above can be found in the following Colab notebook: pdf2image notebook

🐍!Happy coding!🐍

References

Official library docs

Poppler
