Introduction
This article discusses how to extract text from a PDF using Python. To complete this task we'll use the PyPDF2 module. PyPDF2 is a free and open-source Python library capable of many tasks such as splitting, merging, cropping, adding custom data, encrypting, and retrieving text from PDFs.
The PDF Sample File
The sample PDF used in this article is The Raven by Edgar Allan Poe.
Directory Structure
This is the directory structure prior to executing script.py
Python Project/
├── app/
│ ├── script.py
│ ├── the_raven.pdf
│
Implementation
- Open PDF and Extract Text
- Save Text to File.
Open PDF and Extract Text
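import PyPDF2


def extract_text_from_pdf(pdf_filename: str) -> str:
    # Note: PdfFileReader, numPages, getPage() and extractText() are the legacy
    # PyPDF2 names; PyPDF2 3.0 removed them in favour of PdfReader, pages and extract_text().
    text_output = ''
    with open(pdf_filename, 'rb') as pdf_object:
        pdf_reader = PyPDF2.PdfFileReader(pdf_object)
        for i in range(0, pdf_reader.numPages):
            page_obj = pdf_reader.getPage(i)
            text_output += page_obj.extractText()
    return text_output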
- The extract_text_from_pdf() function takes one parameter, pdf_filename, which is the filename of the PDF from which the text will be extracted.
- pdf_filename is opened in rb mode (which opens the file in binary format for reading) as pdf_object, which is then passed to PyPDF2.PdfFileReader to create a reader object named pdf_reader.
- We then iterate over all pages of the PDF using the range() function, with the numPages attribute as the upper bound of the range.
- We retrieve a page_obj for each page with getPage() and extract its text using the extractText() method.
- Finally, we append each page's text to the text_output string and return the result.
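The camelCase names used above belong to older PyPDF2 releases; PyPDF2 3.0 removed them in favour of PdfReader, the pages sequence, and extract_text(). As a hedged sketch of the same loop for a newer install, the version below assumes PyPDF2 3.x is available. The helper names join_page_text() and extract_text_from_pdf_v3() are illustrative and not part of the original script.

```python
from typing import Iterable


def join_page_text(pages: Iterable) -> str:
    """Concatenate the text of every page object that offers extract_text()."""
    # A page without a text layer (for example a scanned image) may yield
    # nothing useful, so "or ''" keeps the join safe.
    return ''.join(page.extract_text() or '' for page in pages)


def extract_text_from_pdf_v3(pdf_filename: str) -> str:
    """Same job as extract_text_from_pdf(), written against the PyPDF2 3.x names."""
    # Imported here so join_page_text() can be tried without PyPDF2 installed.
    from PyPDF2 import PdfReader

    with open(pdf_filename, 'rb') as pdf_object:
        pdf_reader = PdfReader(pdf_object)
        # pdf_reader.pages replaces the numPages / getPage(i) pair.
        return join_page_text(pdf_reader.pages)
```

Splitting the join into its own function is only a design choice for this sketch: it lets the page loop be exercised with any objects that expose extract_text().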
Save Text to File.
def save_converted_text(text_file: str, filename: str) -> None:
    with open(filename, 'w+', encoding='utf8') as file_obj:
        file_obj.write(text_file)
    # Report the output filename rather than echoing the whole text
    print(f'{filename} has been successfully saved.')
- The save_converted_text() function takes two parameters: text_file, which is the extracted text from the PDF, and filename, which is the name of the file the text will be saved to.
- filename is opened in w+ mode (write + read, creating the file or overwriting it if it already exists) using 'utf8' encoding as file_obj.
- The contents of text_file are then written to file_obj. A message is printed once the write completes.
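To check that the text really landed on disk, one option is to read the file back with the same encoding and compare. The sketch below is one way to do that; save_and_read_back() is a hypothetical helper, not part of the original script, and it passes newline='' (an assumption of this sketch) so that line endings are written and read without translation.

```python
def save_and_read_back(text: str, filename: str) -> str:
    """Write text as UTF-8, then return what is read back from the file."""
    # newline='' disables newline translation so the comparison is exact.
    with open(filename, 'w', encoding='utf8', newline='') as file_obj:
        file_obj.write(text)
    with open(filename, 'r', encoding='utf8', newline='') as file_obj:
        return file_obj.read()
```

If the returned string equals the text that was passed in, the write and the encoding round-tripped cleanly.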
What is Encoding?
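Applications often use internationalized messages to display output in a variety of user-selected languages such as English, French, Japanese, Hebrew, or Russian. Web content can be written in any of these languages and can also include a variety of emoji symbols. Python’s string type uses the Unicode Standard for representing characters, which lets Python programs work with all these different possible characters. An encoding is the set of rules for converting those characters to bytes when text is written to a file, and back again when it is read. If encoding is not specified, open() falls back to a default that depends on the platform and Python version and is not guaranteed to be UTF-8, so it is safest to pass the encoding explicitly, as save_converted_text() does. Read the official Python documentation to learn more about encoding.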
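A small illustration may help here. The sketch below encodes a string to bytes and decodes it again, first with matching encodings and then with mismatched ones; round_trip() and the sample string are made up for this example.

```python
def round_trip(text: str, write_encoding: str, read_encoding: str) -> str:
    """Encode text with one codec and decode the resulting bytes with another."""
    return text.encode(write_encoding).decode(read_encoding)


sample = 'café'

# The same string becomes different bytes under different encodings.
utf8_bytes = sample.encode('utf-8')      # b'caf\xc3\xa9'
latin1_bytes = sample.encode('latin-1')  # b'caf\xe9'

# Matching encodings give the text back unchanged.
same = round_trip(sample, 'utf-8', 'utf-8')        # 'café'

# Mismatched encodings produce garbled, "unrecognizable" text.
garbled = round_trip(sample, 'utf-8', 'latin-1')   # 'cafÃ©'
```

This is why the file is both written and later read with the same encoding.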
Full Code
import PyPDF2


# STEP 1: Open PDF and convert to text
# (legacy PyPDF2 names; PyPDF2 3.0 replaced them with PdfReader, pages and extract_text())
def extract_text_from_pdf(pdf_filename: str) -> str:
    text_output = ''
    with open(pdf_filename, 'rb') as pdf_object:
        pdf_reader = PyPDF2.PdfFileReader(pdf_object)
        for i in range(0, pdf_reader.numPages):
            page_obj = pdf_reader.getPage(i)
            text_output += page_obj.extractText()
    return text_output


# STEP 2: Save Text to File
def save_converted_text(text_file: str, filename: str) -> None:
    with open(filename, 'w+', encoding='utf8') as file_obj:
        file_obj.write(text_file)
    # Report the output filename rather than echoing the whole text
    print(f'{filename} has been successfully saved.')


if __name__ == '__main__':
    # extract text from PDF
    text_from_pdf = extract_text_from_pdf('the_raven.pdf')
    # save extracted text
    save_converted_text(text_from_pdf, 'the_raven.txt')
Directory Structure
This is the directory structure after executing script.py
Python Project/
├── app/
│ ├── script.py
│ ├── the_raven.pdf
│ ├── the_raven.txt
│
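Note that relative filenames such as 'the_raven.pdf' are resolved against the directory the script is run from, so the layout above assumes script.py is launched from inside app/. If it may be launched from elsewhere, one hedged option is to build paths relative to the script itself. beside_script() is a hypothetical helper, not part of the original code.

```python
from pathlib import Path


def beside_script(name: str, script_file: str) -> Path:
    """Return the path of `name` in the same directory as `script_file`."""
    return Path(script_file).resolve().parent / name


# Inside script.py, __file__ would be passed as script_file, for example:
# pdf_path = beside_script('the_raven.pdf', __file__)
# txt_path = beside_script('the_raven.txt', __file__)
# text_from_pdf = extract_text_from_pdf(str(pdf_path))
# save_converted_text(text_from_pdf, str(txt_path))
```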
Conclusion
After reading this article you should now be able to extract text from a PDF using Python's PyPDF2 library. Remember, if you extract text and encounter unrecognizable characters, make sure you are using the correct string encoding. If you found this article helpful, please like, follow, and leave a comment!