Introduction
PyMuPDF is a Python library for reading and modifying PDF documents. It can extract text and images, add annotations, and merge or rewrite files programmatically. In this chapter, we explore these capabilities and demonstrate them through practical examples.
Topics
- Installation and setup of PyMuPDF
- Text extraction from PDF documents
- Image extraction from PDF documents
- PDF manipulation and modification
Installation and Setup of PyMuPDF
To use PyMuPDF, you first need to install the library. You can install it via pip:
pip install PyMuPDF
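Once installed, you can import the library into your Python scripts. Note that the module is named fitz rather than PyMuPDF (recent releases also accept import pymupdf):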
import fitz
Text Extraction from PDF Documents
Iterating over an open document yields its pages, and page.get_text() returns the plain text of each page. Here's a simple example:
Contents of example.pdf:
Test document PDF
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla est purus, ultrices in porttitor in, accumsan non quam. Nam consectetur porttitor rhoncus. Curabitur eu est et leo feugiat auctor vel quis lorem. Ut et ligula dolor, sit amet consequat lorem. Aliquam porta eros sed velit imperdiet egestas. Maecenas tempus eros ut diam ullamcorper id dictum libero tempor. Donec quis augue quis magna condimentum lobortis.
import fitz  # PyMuPDF


def extract_text_from_pdf(filename: str) -> str:
    """Return the plain text of every page, concatenated in page order."""
    with fitz.open(filename) as doc:  # the document is closed on exit
        return "".join(page.get_text() for page in doc)


extracted_text = extract_text_from_pdf(filename="example.pdf")
print(extracted_text)
Output:
Test document PDF
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla est purus, ultrices in porttitor in, accumsan non quam. Nam consectetur porttitor rhoncus. Curabitur eu est et leo feugiat auctor vel quis lorem. Ut et ligula dolor, sit amet consequat lorem. Aliquam porta eros sed velit imperdiet egestas. Maecenas tempus eros ut diam ullamcorper id dictum libero tempor. Donec quis augue quis magna condimentum lobortis.
Image Extraction from PDF Documents
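In addition to text, PyMuPDF enables you to extract images from PDF documents. page.get_images() lists the images a page references, and doc.extract_image() returns the data for one of them: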
import fitz  # PyMuPDF


def extract_images_from_pdf(filename: str) -> list:
    """Return the raw bytes of every image referenced by the document's pages."""
    images = []
    with fitz.open(filename) as doc:
        for page in doc:
            for img in page.get_images():
                xref = img[0]  # first item is the image's cross-reference number
                base_image = doc.extract_image(xref)
                images.append(base_image["image"])
    return images


extracted_images = extract_images_from_pdf(filename="example.pdf")
print("Number of images extracted:", len(extracted_images))
Output:
Number of images extracted: 1
PDF Manipulation and Modification
PyMuPDF can also modify PDF documents, for example by adding annotations or merging files. The following example adds a text annotation to the first page:
import fitz  # PyMuPDF


def add_annotation_to_pdf(in_filename: str, annotation: str, out_filename: str) -> None:
    """Add a red sticky-note annotation to the first page and save a copy."""
    with fitz.open(in_filename) as doc:
        page = doc[0]  # Add annotation to the first page
        annot = page.add_text_annot((100, 100), annotation)
        annot.set_colors(stroke=(1, 0, 0))  # Red; colors are RGB floats in 0..1
        annot.update()  # Apply the color change to the annotation's appearance
        doc.save(out_filename)


in_filename = "example.pdf"
annotation = "This is an annotation added using PyMuPDF."
out_filename = "example2.pdf"
add_annotation_to_pdf(in_filename=in_filename, annotation=annotation, out_filename=out_filename)
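This code adds a sticky-note annotation with the text "This is an annotation added using PyMuPDF." at point (100, 100) on the first page and saves the result as example2.pdf, leaving example.pdf unchanged.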
Conclusion
PyMuPDF covers the PDF tasks a Python developer is most likely to meet: extracting text and images, annotating pages, and modifying or merging files. The examples in this chapter are a starting point for handling these tasks programmatically.