Rishab Dugar
PDF Extraction: Retrieving Text and Tables together using Python🐍

Extracting both text and tables from PDF files can be challenging because of their complex structure. The pdfplumber library offers a practical solution, and this article walks through a method for combining text and table extraction with it. Special thanks to Karl Genockey a.k.a. cmdlineuser and other contributors for the approach discussed here.

Understanding the Approach

The method involves extracting table objects and text lines separately and then combining them based on their positional values. This ensures that the extracted data maintains the correct order and structure as it appears in the PDF. Let’s break down the code and logic step-by-step.
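
To make the idea concrete, here is a minimal pure-Python sketch of position-based merging. The `items` list, its coordinates, and the `merge_by_position` helper are made up for illustration and are not part of pdfplumber; the real code further down lets pdfplumber's `extract_text` do the ordering.

```python
# Toy illustration: text lines and a table, each tagged with a vertical position.
# The coordinates are invented; in a real PDF they come from the extracted objects.
items = [
    {"top": 300.0, "text": "Text after the table"},
    {"top": 50.0, "text": "Hello"},
    {"top": 120.0, "text": "| Name | Age |\n|:-----|----:|\n| Eli  |  23 |"},
    {"top": 70.0, "text": "World"},
]


def merge_by_position(items):
    """Return the items' text joined in top-to-bottom order."""
    ordered = sorted(items, key=lambda item: item["top"])
    return "\n".join(item["text"] for item in ordered)


print(merge_by_position(items))
```

The table lands between "World" and the trailing text because its `top` value falls between theirs, which is the property the full method relies on.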

As an example, we will use a sample PDF, sample_pdf.pdf, containing tables and text across multiple pages.

Preview of a page from the sample pdf containing text and tables

Prerequisites

Before running the code, make sure the necessary libraries are installed. Besides pdfplumber and pandas, we also need the tabulate library, which pandas uses to convert DataFrame objects to Markdown. This conversion keeps the structure of the extracted table data readable.

Installing Required Libraries

You can install these libraries using pip by running the following command:

pip install pdfplumber pandas tabulate

Step-by-Step Explanation

  1. Import Libraries: We start by importing the necessary libraries.
  • pdfplumber is used for extracting text and tables from PDFs.
  • pandas is used for handling and manipulating data.
  • extract_text, get_bbox_overlap, and obj_to_bbox are utility functions from pdfplumber.utils.
  • tabulate is what pandas relies on for DataFrame.to_markdown; importing it here fails early if it is missing.
import pdfplumber
import pandas as pd
from pdfplumber.utils import extract_text, get_bbox_overlap, obj_to_bbox
import tabulate
  2. Function Definition and PDF Opening:
  • The function process_pdf takes pdf_path as an argument, which is the path to the PDF file.
  • pdfplumber.open(pdf_path) opens the PDF file.
  • all_text is initialized as an empty list to store the extracted text from all pages.
def process_pdf(pdf_path):
    pdf = pdfplumber.open(pdf_path)
    all_text = []
  3. Iterate Over Pages:
  • for page in pdf.pages: the loop iterates over each page in the PDF.
  • filtered_page is initially set to the current page.
  • chars captures all characters on filtered_page.
    for page in pdf.pages:
        filtered_page = page
        chars = filtered_page.chars
  4. Table Detection and Filtering:
  • for table in page.find_tables(): the loop iterates over each table found on the page.
  • first_table_char stores the first character inside the table's bounding box; its position later anchors the table's Markdown.
  • filtered_page is updated by filtering out objects that overlap the table's bounding box, using get_bbox_overlap and obj_to_bbox.
  • chars is refreshed from the updated filtered_page.
        for table in page.find_tables():
            first_table_char = page.crop(table.bbox).chars[0]
            filtered_page = filtered_page.filter(lambda obj: 
                get_bbox_overlap(obj_to_bbox(obj), table.bbox) is None
            )
            chars = filtered_page.chars
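
The overlap test in this step is easier to see with plain tuples. The sketch below is a simplified stand-in written for illustration, not pdfplumber's own `get_bbox_overlap`; the `bbox_overlap` helper and the sample coordinates are assumptions. Boxes are `(x0, top, x1, bottom)` tuples, the same layout as `table.bbox`.

```python
def bbox_overlap(a, b):
    """Return the intersection of two (x0, top, x1, bottom) boxes, or None."""
    x0 = max(a[0], b[0])
    top = max(a[1], b[1])
    x1 = min(a[2], b[2])
    bottom = min(a[3], b[3])
    # Simplification: boxes that merely touch along an edge count as not overlapping.
    if x1 <= x0 or bottom <= top:
        return None
    return (x0, top, x1, bottom)


table_bbox = (50, 100, 300, 200)
chars = [
    {"text": "H", "x0": 50, "top": 40, "x1": 58, "bottom": 52},    # above the table
    {"text": "N", "x0": 60, "top": 110, "x1": 68, "bottom": 122},  # inside the table
]

# Keep only the characters that do not overlap the table, as the filter step does.
outside = [
    c for c in chars
    if bbox_overlap((c["x0"], c["top"], c["x1"], c["bottom"]), table_bbox) is None
]
print([c["text"] for c in outside])
```

Only the character above the table survives the filter; the one inside the table is dropped so that its text is not emitted twice once the Markdown is added.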
  5. Extract and Convert Table to Markdown:
  • table.extract() extracts the table content.
  • A DataFrame df is created from the extracted table data.
  • The first row is set as the header using df.columns = df.iloc[0].
  • The rest of the DataFrame is converted to Markdown format and stored in markdown.
            df = pd.DataFrame(table.extract())
            df.columns = df.iloc[0]
            markdown = df.drop(0).to_markdown(index=False)
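
To see what this step produces without installing pandas, the sketch below builds a similar Markdown table by hand. `rows_to_markdown` is a hypothetical helper, not a pandas or pdfplumber function, and its output is plainer than that of `to_markdown` (no column alignment or padding). It also replaces `None` cells, which `table.extract()` can return for missing cells, with empty strings.

```python
def rows_to_markdown(rows):
    """Render a list of rows as a Markdown table, using the first row as the header."""
    # Normalise cells: None becomes "", and embedded newlines become spaces.
    clean = [
        ["" if cell is None else str(cell).replace("\n", " ") for cell in row]
        for row in rows
    ]
    header, body = clean[0], clean[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join("---" for _ in header) + "|",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# Sample data shaped like the list of lists returned by table.extract().
rows = [
    ["First name", "Last name", "Age"],
    ["Nobita", "Nobi", "15"],
    ["Eli", None, "23"],
]
print(rows_to_markdown(rows))
```

The header/body split mirrors what `df.columns = df.iloc[0]` followed by `df.drop(0)` does in the pandas version.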
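  6. Append Markdown to Characters:
  • A copy of first_table_char with its text replaced by the Markdown is appended to chars, so the table is placed at its original position on the page. The dict merge operator | requires Python 3.9 or later.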
chars.append(first_table_char | {"text": markdown})
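
The `|` operator used here merges two dictionaries into a new one. The sketch below uses a made-up character dictionary with only a few of the keys a real pdfplumber char carries; it shows that the position is kept, the text is swapped, and the original dictionary is left untouched.

```python
# A made-up character dict; real pdfplumber chars carry many more keys.
first_table_char = {"text": "F", "x0": 72.0, "top": 120.0, "x1": 78.5, "bottom": 132.0}
markdown = "| First name | Age |\n|:-----------|----:|\n| Nobita     |  15 |"

# Dict merge (Python 3.9+): keys from the right-hand side win.
table_char = first_table_char | {"text": markdown}

# Equivalent spelling that also works on older Python 3 versions.
table_char_compat = {**first_table_char, "text": markdown}

print(table_char["top"], table_char["text"].splitlines()[0])
print(first_table_char["text"])  # the original dict is not modified
```

Because the synthetic character keeps the coordinates of the table's first character, the layout step treats the whole Markdown table as if it were a single glyph at that spot.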
  7. Extract Page Text:
  • extract_text(chars, layout=True) extracts the text from the filtered characters with layout preservation.
  • The extracted text page_text is appended to all_text.
        page_text = extract_text(chars, layout=True)
        all_text.append(page_text)
  8. Close PDF and Return Text:
  • The PDF file is closed using pdf.close().
  • The extracted text from all pages is joined into a single string with newline characters and returned.
    pdf.close()
    return "\n".join(all_text)
  9. Execute Function and Print Result:
  • The path to the PDF file is defined in pdf_path.
  • process_pdf(pdf_path) is called to process the PDF and extract text.
  • The extracted text is printed.
# Path to your PDF file
pdf_path = r"sample_pdf.pdf"
extracted_text = process_pdf(pdf_path)
print(extracted_text)

Complete Code

Here is the complete script for extracting text and tables as markdown from a PDF:

import pdfplumber
import pandas as pd
from pdfplumber.utils import extract_text, get_bbox_overlap, obj_to_bbox


def process_pdf(pdf_path):
    pdf = pdfplumber.open(pdf_path)
    all_text = []
    for page in pdf.pages:
        filtered_page = page
        chars = filtered_page.chars
        for table in page.find_tables():
            # Remember the first character inside the table; its position anchors the Markdown.
            first_table_char = page.crop(table.bbox).chars[0]
            # Drop every object that overlaps the table's bounding box.
            filtered_page = filtered_page.filter(lambda obj:
                get_bbox_overlap(obj_to_bbox(obj), table.bbox) is None
            )
            chars = filtered_page.chars
            # Use the first row as the header and render the rest as Markdown
            # (to_markdown needs the tabulate package).
            df = pd.DataFrame(table.extract())
            df.columns = df.iloc[0]
            markdown = df.drop(0).to_markdown(index=False)
            # Re-insert the table as one synthetic character (dict merge needs Python 3.9+).
            chars.append(first_table_char | {"text": markdown})
        # Lay out the remaining text and the table placeholders by position.
        page_text = extract_text(chars, layout=True)
        all_text.append(page_text)
    pdf.close()
    return "\n".join(all_text)


# Path to your PDF file
pdf_path = r"sample_pdf.pdf"
extracted_text = process_pdf(pdf_path)
print(extracted_text)

Output:

Hello
World
| First name   | Last name   |   Age | City        |
|:-------------|:------------|------:|:------------|
| Nobita       | Nobi        |    15 | Tokyo       |
| Eli          | Shane       |    23 | Orlando     |
| Rahul        | Jain        |    22 | Los Angeles |
| Lucy         | Carlyle     |    17 | London      |
| Anthony      | Lockwood    |    19 | Leicester   |
Loreum  ipsum
dolor sit amet,
consectetur
adipiscing
Hello
Python
| First name   | Last name   | Address             |
|:-------------|:------------|:--------------------|
| James        | Watson      | 221 B, Baker Street |
| Mycroft      | Holmes      | Diogenes Club       |
| Irene        | Adler       | 21 New Jersey       |
| Lucy         | Carlyle     | 33 Claremont Square |
| Anthony      | Lockwood    | 35 Portland Row     |
Neque  porro
quisquam  est qui
            dolorem
      ipsum     quia
      dolor sit amet,
consectetur, adipisci
velit..."
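
One caveat: `page.crop(table.bbox)` raises a `ValueError` when a detected table's bounding box extends past the page, as in the traceback reported in the comments below, and `.chars[0]` raises an `IndexError` when the cropped region holds no characters. A minimal sketch of one possible guard is to clamp the box to the page first. `clamp_bbox` is a hypothetical helper, not part of pdfplumber, and the numbers are taken from that traceback.

```python
def clamp_bbox(bbox, page_bbox):
    """Shrink an (x0, top, x1, bottom) box so it lies within page_bbox."""
    x0, top, x1, bottom = bbox
    px0, ptop, px1, pbottom = page_bbox
    return (max(x0, px0), max(top, ptop), min(x1, px1), min(bottom, pbottom))


# Values from the error message: the table box is wider than the 792-point page.
table_bbox = (19.448275862068964, 154.38000000000005, 1183.5160975609742, 553.6492307692307)
page_bbox = (0, 0, 792, 612)

safe_bbox = clamp_bbox(table_bbox, page_bbox)
print(safe_bbox)
```

Inside the loop, this could be used as `cropped = page.crop(clamp_bbox(table.bbox, page.bbox))`, followed by a check such as `if not cropped.chars: continue` before reading `cropped.chars[0]`. Whether clamping is appropriate depends on why the box exceeds the page; a rotated page or an unusual page box may need different handling.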

Conclusion

This approach provides a systematic way to extract and combine text and tables from PDFs using pdfplumber. By using the positions of tables and text lines, we keep the content in the same order as in the original document. Credits to cmdlineuser and jsvine for their insightful discussion and innovative solution to the problem!

That’s all for now! Hope this tutorial was helpful. Feel free to explore and adapt this method to fit your specific needs.

Top comments (1)

Jeff Stone

Hi Rishab,
Nice post that addresses a vexing problem. I tried your code on a complex .PDF that I have but got the following error:
File c:\users\js.spyder-py3\temp.py:12 in process_pdf
    first_table_char = page.crop(table.bbox).chars[0]
File C:\ProgramData\anaconda3\Lib\site-packages\pdfplumber\page.py:535 in crop
    return CroppedPage(self, bbox, relative=relative, strict=strict)
File C:\ProgramData\anaconda3\Lib\site-packages\pdfplumber\page.py:677 in __init__
    test_proposed_bbox(crop_bbox, parent_page.bbox)
File C:\ProgramData\anaconda3\Lib\site-packages\pdfplumber\page.py:656 in test_proposed_bbox
    raise ValueError(
ValueError: Bounding box (19.448275862068964, 154.38000000000005, 1183.5160975609742, 553.6492307692307) is not fully within parent page bounding box (0, 0, 792, 612)

Do you have any idea how to adjust for this situation?

Thanks,

Jeff