In this guide, I'll show you how to extract structured data from PDFs using vision-language models (VLMs) like Gemini Flash or GPT-4o.
Gemini, Google's latest series of vision-language models, has shown state-of-the-art performance in text and image understanding. Its improved multimodal capability and long context window make it particularly useful for processing visually complex PDF data that traditional extraction models struggle with, such as figures, charts, tables, and diagrams.
With these models, you can build your own data extraction tool for visually rich files and web content. Here's how:
Setting Up Your Environment
Before we dive into extraction, let's set up our development environment. This guide assumes you have Python installed on your system. If not, download and install it from https://www.python.org/downloads/.
⚠️ Note that if you don't want to use Python, you can use the cloud platform at thepi.pe to upload your files and download your results as a CSV without writing any code.
Install Required Libraries
Open your terminal or command prompt and run the following commands:
pip install git+https://github.com/emcf/thepipe
pip install pandas
For those new to Python, pip is the package installer for Python, and these commands will download and install the necessary libraries.
Set Up Your API Key
To use thepipe, you need an API key.
Disclaimer: While thepi.pe is a free and open-source tool, the API has a cost, roughly $0.00002 per token. If you want to avoid such costs, check out the local setup instructions on GitHub. Note that you will still have to pay your LLM provider of choice.
Here's how to get and set it up:
- Visit https://thepi.pe/platform/
- Create an account or log in
- Find your API key in the settings page
Now, copy the API key from the settings page and set it as an environment variable. The process varies depending on your operating system:
For Windows:
- Search for "Environment Variables" in the Start menu
- Click "Edit the system environment variables"
- Click the "Environment Variables" button
- Under "User variables", click "New"
- Set the variable name as THEPIPE_API_KEY and the value as your API key
- Click "OK" to save
For macOS and Linux:
Open your terminal and add this line to your shell configuration file (e.g., ~/.bashrc or ~/.zshrc):
export THEPIPE_API_KEY=your_api_key_here
Then, reload your configuration:
source ~/.bashrc # or ~/.zshrc
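Before calling the API, it's worth confirming the key is actually visible to your Python process. Here's a minimal sketch (the get_api_key helper is illustrative, not part of thepipe, which reads the variable itself):

```python
import os

def get_api_key(var: str = "THEPIPE_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error."""
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(
            f"{var} is not set; add it to your shell profile and reload."
        )
    return key
```

If this raises, your shell configuration file wasn't reloaded or the variable name has a typo.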
Defining Your Extraction Schema
The key to successful extraction is defining a clear schema for the data you want to pull out. Let's say we're extracting data from a Bill of Quantity document:
An example of a page from the Bill of Quantity document. The data on each page is independent of the other pages, so we do our extraction per page. There are multiple pieces of data to extract per page, so we set multiple_extractions=True.
Looking at the column names, we might want to extract a schema like this:
schema = {
"item": "string",
"unit": "string",
"quantity": "int",
}
You can modify the schema to your liking on thepi.pe Platform. Clicking "View Schema" will give you a schema you can copy and paste for use with the Python API.
Extracting Data from PDFs
Now, let's use extract_from_file to pull data from a PDF:
from thepipe.extract import extract_from_file

results = extract_from_file(
    file_path="bill_of_quantity.pdf",
    schema=schema,
    ai_model="google/gemini-flash-1.5",
    chunking_method="chunk_by_page",
    multiple_extractions=True,
)
Here, we set chunking_method="chunk_by_page" because we want to send each page to the AI model individually (the PDF is too large to feed in all at once). We also set multiple_extractions=True because each PDF page contains multiple rows of data.
The results of the extraction for the Bill of Quantity PDF as viewed on thepi.pe Platform
Processing the Results
The extraction results are returned as a list of dictionaries. We can process these results to create a pandas DataFrame:
import pandas as pd
df = pd.DataFrame(results)
# Display the first few rows of the DataFrame
print(df.head())
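Even with an "int" field in the schema, numeric values can come back as strings depending on the model. A defensive cleanup step is cheap; this sketch uses a made-up results list for illustration:

```python
import pandas as pd

# Illustrative sample of what per-row extraction results might look like
results = [
    {"item": "Concrete", "unit": "m3", "quantity": "12"},
    {"item": "Rebar", "unit": "kg", "quantity": "850"},
]

df = pd.DataFrame(results)
# Coerce quantity to a numeric dtype; unparseable values become NaN instead of raising
df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce")
```

With errors="coerce", a stray value like "N/A" won't crash the pipeline; you can filter NaN rows afterwards.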
This creates a DataFrame with all the extracted information, including textual content and descriptions of visual elements like figures and tables.
Exporting to Different Formats
Now that we have our data in a DataFrame, we can easily export it to various formats. Here are some options:
Exporting to Excel
df.to_excel("extracted_boq_data.xlsx", index=False, sheet_name="Bill of Quantity")
This creates an Excel file named "extracted_boq_data.xlsx" with a sheet named "Bill of Quantity". The index=False
parameter prevents the DataFrame index from being included as a separate column.
Exporting to CSV
If you prefer a simpler format, you can export to CSV:
df.to_csv("extracted_boq_data.csv", index=False)
This creates a CSV file that can be opened in Excel or any text editor.
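If you want a quick sanity check that nothing is lost in the export, you can round-trip the DataFrame through CSV. This sketch uses an in-memory buffer so it doesn't touch the filesystem:

```python
import io

import pandas as pd

df = pd.DataFrame({"item": ["Concrete"], "unit": ["m3"], "quantity": [12]})

# Round-trip through an in-memory CSV buffer to confirm the data survives export
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
df_back = pd.read_csv(buf)
```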
Ending Notes
The key to successful extraction lies in defining a clear schema and utilizing the AI model's multimodal capabilities. As you become more comfortable with these techniques, you can explore more advanced features like custom chunking methods, custom extraction prompts, and integrating the extraction process into larger data pipelines.