Seraph★776

Posted on Jul 21, 2022 • Edited on Jul 24, 2022

Extract Text from PDF Using Python

#python #tutorial #programming #productivity

Introduction

This article shows how to extract text from a PDF using Python. To complete this task we'll use the PyPDF2 module. PyPDF2 is a free and open-source Python library that can split, merge, crop, and encrypt PDFs, add custom data to them, and retrieve their text.

The PDF Sample File

The sample PDF we will extract text from is The Raven by Edgar Allan Poe.

Directory Structure

This is the directory structure prior to executing script.py

Python Project/
├── app/
│   ├── script.py
│   ├── the_raven.pdf
│

Implementation

1. Open PDF and Extract Text
2. Save Text to File

Open PDF and Extract Text

import PyPDF2


def extract_text_from_pdf(pdf_filename: str) -> str:
    text_output = ''
    with open(pdf_filename, 'rb') as pdf_object:
        # PdfReader replaces PdfFileReader, which no longer works in PyPDF2 3.0+
        pdf_reader = PyPDF2.PdfReader(pdf_object)
        for page_obj in pdf_reader.pages:
            # extract_text() returns the text of a single page
            text_output += page_obj.extract_text()
    return text_output

The extract_text_from_pdf() function takes one parameter, pdf_filename, which is the filename of the PDF from which the text will be extracted.
pdf_filename is opened in rb mode (binary read mode) as pdf_object, which is then passed to PyPDF2.PdfReader to create a reader object named pdf_reader.
We then iterate over pdf_reader.pages, which yields a page_obj for each page, and extract the text from each page_obj using the extract_text() method.
Finally, we concatenate each page's text to our text_output string and return the result.
Note: PyPDF2 releases before 3.0 exposed these as PdfFileReader, numPages, getPage(), and extractText(); those names no longer work in PyPDF2 3.0 and later.
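
As a variation on the function above, here is a minimal sketch that collects each page's text in a list and joins it once at the end, so a separator sits between pages. The helper names join_page_texts and extract_text_by_page are hypothetical and not part of the original script. The sketch assumes a PyPDF2 version that provides PdfReader, pages, and extract_text().

```python
from typing import Iterable


def join_page_texts(pages: Iterable, separator: str = "\n") -> str:
    """Join the text of page-like objects that expose extract_text()."""
    chunks = []
    for page in pages:
        # Treat a page with no extractable text as an empty string.
        chunks.append(page.extract_text() or "")
    return separator.join(chunks)


def extract_text_by_page(pdf_filename: str) -> str:
    """Open a PDF and return its text with one separator between pages."""
    # Imported here so join_page_texts() can be used without PyPDF2 installed.
    import PyPDF2

    with open(pdf_filename, "rb") as pdf_object:
        pdf_reader = PyPDF2.PdfReader(pdf_object)
        return join_page_texts(pdf_reader.pages)
```

Keeping the joining logic separate from the file handling also makes it possible to try the helper with simple stand-in page objects before pointing it at a real PDF.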

Save Text to File.

def save_converted_text(text_file: str, filename: str) -> None:
    with open(filename, 'w+', encoding='utf8') as file_obj:
        file_obj.write(text_file)
    print(f'{filename} has been successfully saved.')

The save_converted_text() function takes two parameters: text_file, which is the extracted text from the PDF, and filename, which is the name the output file will be saved under. The file is opened in w+ mode (write + read) using 'utf8' encoding as file_obj.
The contents of text_file are then written to file_obj. A message naming the saved file is printed once the write completes.
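
For comparison, here is a small sketch of the same step written with pathlib. The name save_text is hypothetical, and the choice of plain 'w' mode is an assumption based on the file only being written, never read back.

```python
from pathlib import Path


def save_text(text: str, filename: str) -> Path:
    """Write text to filename as UTF-8 and return the resulting path."""
    output_path = Path(filename)
    # Plain 'w' mode is enough here because the file is only written to.
    with output_path.open("w", encoding="utf-8") as file_obj:
        file_obj.write(text)
    print(f"{output_path.name} has been successfully saved.")
    return output_path
```

Returning the path lets the caller check or reuse the output location, for example save_text(text_from_pdf, 'the_raven.txt').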

What is Encoding?

Applications often use internationalized messages to display output in a variety of user-selected languages such as English, French, Japanese, Hebrew, or Russian. Web content can be written in any of these languages and can also include a variety of emoji symbols. Python's string type uses the Unicode Standard for representing characters, which lets Python programs work with all these different possible characters. An encoding such as UTF-8 defines how those characters are converted to bytes when text is written to a file. Note that when encoding is not passed to open(), the default depends on the platform and locale and is not guaranteed to be UTF-8, which is why we set it explicitly. Read the Official Python Documentation to learn more about encoding.
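
To make the idea concrete, here is a short illustration using only built-in string and bytes methods. The sample string is made up for this example. It shows that text decoded with the codec it was encoded with comes back intact, while a mismatched codec produces garbled characters.

```python
original = "Café ☕"

# Encoding turns a str into bytes; UTF-8 uses several bytes for non-ASCII characters.
utf8_bytes = original.encode("utf-8")

# Decoding with the same codec restores the original text.
round_trip = utf8_bytes.decode("utf-8")

# Decoding with a different codec (Latin-1 here) yields garbled characters instead.
garbled = utf8_bytes.decode("latin-1")

print(round_trip == original)  # True
print(garbled == original)     # False
```

This kind of mismatch is the usual cause of unrecognizable characters in a saved text file.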

Full Code

import PyPDF2


# STEP 1: Open PDF and Extract Text
def extract_text_from_pdf(pdf_filename: str) -> str:
    text_output = ''
    with open(pdf_filename, 'rb') as pdf_object:
        # PdfReader replaces PdfFileReader, which no longer works in PyPDF2 3.0+
        pdf_reader = PyPDF2.PdfReader(pdf_object)
        for page_obj in pdf_reader.pages:
            # extract_text() returns the text of a single page
            text_output += page_obj.extract_text()
    return text_output


# STEP 2: Save Text to File
def save_converted_text(text_file: str, filename: str) -> None:
    with open(filename, 'w+', encoding='utf8') as file_obj:
        file_obj.write(text_file)
    print(f'{filename} has been successfully saved.')


if __name__ == '__main__':
    # extract text from PDF
    text_from_pdf = extract_text_from_pdf('the_raven.pdf')
    # save extracted text
    save_converted_text(text_from_pdf, 'the_raven.txt')
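
The script above hard-codes both filenames. As an optional refinement, the output name could be derived from the input name. This is a minimal sketch, and the helper name text_path_for is hypothetical.

```python
from pathlib import Path


def text_path_for(pdf_filename: str) -> Path:
    """Return the PDF's path with its extension swapped for .txt."""
    return Path(pdf_filename).with_suffix(".txt")


print(text_path_for("the_raven.pdf"))  # the_raven.txt
```

The result can be passed to save_converted_text() in place of 'the_raven.txt', since open() accepts a Path as well as a string.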

Directory Structure

This is the directory structure after executing script.py

Python Project/
├── app/
│   ├── script.py
│   ├── the_raven.pdf
│   ├── the_raven.txt
│

Conclusion

After reading this article you should now be able to extract text from a PDF using Python's PyPDF2 library. Remember, if the extracted text contains unrecognizable characters, make sure you are saving it with the correct encoding. If you found this article helpful, please like, follow, and leave a comment!

🔗 Resource Links

GitHub Source Code

Top comments (1)

Kayla M • Dec 21 '22

Great tutorial! is there a particular reason why you use PyPDF2 over PyMuPDF or pdfminer?
