I was working on a Python project recently (for a DEV hackathon, of course) and I needed a library to extract the pages of a huge PDF document as images. The primary use case was extracting the content from the PDF document, and doing that in a blocking way was expensive. So I was looking for a way to split the document's pages into multiple images, which would let me run the content extraction concurrently.
As any rad developer would do, I asked ChatGPT for a solution and got a pretty easy-to-integrate option to achieve my goal.
The solution is to use the Python `pdf2image` package and write just two lines of code to get the pages of the PDF document extracted as images.
If you are not interested in the specifics of the process and want to get straight into the code, you can jump directly to the pdf2image notebook.
Setup
Even though the solution is pretty simple, you need to set up a utility called poppler before you start writing your Python code. Poppler is the underlying C++ based PDF rendering library that `pdf2image` uses to render the PDF document behind the scenes.
macOS
Setting up poppler on a macOS machine is as simple as running `brew install poppler` (assuming you have Homebrew installed on your machine).
Linux
If you are on a Linux machine, which is very common if you are running your application as a container, then the process of installing poppler may differ based on the distro you are using.
If it's an Ubuntu-based machine/container, you can install poppler using apt-get:
```bash
sudo apt-get install poppler-utils
```
Windows
The process is slightly more involved on Windows (just like everything else on Windows 😋). The latest releases of the Windows binaries can be found here.
Extract the zip file and add `<extracted folder>/Library/bin` to the `PATH` environment variable.
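If you'd rather not touch the `PATH` variable, `pdf2image` also accepts a `poppler_path` argument that points directly at the poppler binaries. A minimal sketch, assuming the archive was extracted to `C:\poppler` (the path is just a placeholder for wherever you unzipped it):
```python
from pdf2image import convert_from_path

# poppler_path tells pdf2image where to find the poppler binaries,
# so editing the PATH environment variable is not required
images = convert_from_path(
    "top_secret_document.pdf",
    poppler_path=r"C:\poppler\Library\bin",
)
```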
Time for some code
Once the setup is done, you can install the `pdf2image` package from pip and get on with your coding:
```bash
pip install pdf2image
```
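If you want a quick sanity check that poppler is reachable before converting anything, `pdf2image` also ships a `pdfinfo_from_path` helper (the file name below is just the sample document used throughout this post):
```python
from pdf2image import pdfinfo_from_path

# Raises an error if poppler is not installed or not on the PATH
info = pdfinfo_from_path("top_secret_document.pdf")
print(info["Pages"])  # total number of pages in the document
```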
Convert from path
If the PDF file you want to convert is nicely located in a directory, then the job is much easier. You can directly import the `convert_from_path` function from the `pdf2image` package to extract the pages as images:
```python
from pdf2image import convert_from_path
import os

# Checks if the output folder exists and if not it creates a new one
if not os.path.exists("extracted_images"):
    os.mkdir("extracted_images")

convert_from_path("top_secret_document.pdf",
                  output_folder="extracted_images", fmt="jpeg")
```
The above code picks the PDF document named `top_secret_document.pdf` from the current path, extracts every single page of the document as a JPEG image, and stores them in the output folder named `extracted_images`.
The function expects the output folder to already exist, and if it does not, an exception will be raised. To avoid this, you can conditionally check for the existence of the folder and create it if it is not present, as done in the snippet above.
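A slightly more compact way to achieve the same thing (a minor variation on the snippet above, not something `pdf2image` itself requires) is `os.makedirs` with `exist_ok`:
```python
import os

# Creates the folder if it doesn't exist and does nothing if it already does
os.makedirs("extracted_images", exist_ok=True)
```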
The good thing about this function is that it assigns a UUID to the output images and appends a page counter to the file name to make the file names unique. If you want to add a common prefix instead of going with a UUID, you can make use of the `output_file` parameter accepted by the function.
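For example, passing `output_file` makes the pages come out with a predictable prefix instead of a random UUID (the prefix `page` below is just an illustration):
```python
from pdf2image import convert_from_path

# The extracted pages are written with the given prefix plus a page counter,
# instead of a randomly generated UUID
convert_from_path(
    "top_secret_document.pdf",
    output_folder="extracted_images",
    fmt="jpeg",
    output_file="page",
)
```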
Convert from bytes
In the project I was working on, the use case went like this.
The users are asked to upload a document, and the uploaded file is submitted to a Golang-based microservice. This service does some validations and other processing behind the scenes. Once all the processing is done, it stores the document in a MinIO bucket and publishes an event to a Kafka topic.
The Python service acts as a consumer: once the event is received, it fetches the PDF document from the bucket and starts the page extraction process. This means the document will not be readily available on the file system; it will only be available as a byte stream.
This is not an esoteric use case for the `pdf2image` library either; it readily comes with a `convert_from_bytes` function to tackle exactly this scenario:
```python
from pdf2image import convert_from_bytes

# minio_client is an already-initialised MinIO client;
# the get_object call fetches the PDF document stored in the object store
response = minio_client.get_object(bucket_name, "top_secret_document.pdf")
document_as_bytes = response.read()

images = convert_from_bytes(document_as_bytes, fmt="jpeg", output_folder="extracted_images")
```
The above code will read the PDF document as bytes and store the pages into image files in the output directory.
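The return value is worth noting too: both `convert_from_path` and `convert_from_bytes` return the pages as PIL `Image` objects, so you can also work with them in memory (in my case, fan out the content extraction) instead of only reading them back from disk:
```python
# Each entry in `images` is a PIL Image, one per page of the document
for page_number, page in enumerate(images, start=1):
    print(f"page {page_number}: {page.size[0]}x{page.size[1]} pixels")
```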
This function is not limited to PDF documents downloaded from external sources; you can also use it to read a file from the file system as bytes and extract the images from that:
```python
from pdf2image import convert_from_bytes

# Opening the file as a binary file
with open('./top_secret_document.pdf', 'rb') as f:
    document_bytes = f.read()

images = convert_from_bytes(
    document_bytes, fmt="jpeg", output_folder="extracted_images")
```
Given that we already have the `convert_from_path` function to abstract all this boilerplate, I highly doubt anyone is ever going to use `convert_from_bytes` for working with a local PDF file, but I just wanted to show that it's possible.
Conclusion
That's all there is to it. The `pdf2image` library is not limited to lightweight PDF documents; it can work with bulky documents too. Both the `convert_from_path` and `convert_from_bytes` functions accept a variety of arguments that can be used to control how the PDF document is rendered and how the images are extracted.
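As a rough sketch of what is available (the values below are just examples; check the library's documentation for the full list of options), here are a few of the arguments I found handy:
```python
from pdf2image import convert_from_path

images = convert_from_path(
    "top_secret_document.pdf",
    dpi=300,           # rendering resolution: higher means sharper (and larger) images
    first_page=1,      # only convert a range of pages
    last_page=10,
    thread_count=4,    # render several pages in parallel
    fmt="jpeg",
    output_folder="extracted_images",
)
```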
I have attached some useful links in the reference section below, which can be used to explore the `pdf2image` library in depth.
The code used above can be found in the following Colab notebook: pdf2image notebook.
🐍 Happy coding! 🐍