Java Extract data from PDF table

#java #pdf #table #data

Tables are among the most common elements in PDF files, and they often hold data that is useful for analysis, reporting, or data entry. Financial reports are a typical case: the figures you need usually sit in a PDF table. Spire.PDF for Java supports extracting table data from PDF files and converting it to other file formats such as TXT or Excel, where the data can be analyzed more easily. This article demonstrates how to extract data from PDF tables in the following two parts:

Extract Data from PDF Table in Java
Extract Table Data from PDF to Excel

Install Spire.Office for Java

This scenario uses Spire.PDF for Java to extract tables from PDF and Spire.XLS for Java to generate Excel files. To use both in the same project, add the Spire.Office.jar file as a dependency of your Java program. The JAR file can be downloaded from this link. If you use Maven, you can import it by adding the following configuration to your project's pom.xml file.

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>

<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.office</artifactId>
        <version>8.3.6</version>
    </dependency>
</dependencies>

Extract Data from PDF Table in Java

Spire.PDF for Java offers the PdfTable.getText() method, which returns the text of a table cell identified by its row and column index. The steps for extracting data from the tables in a PDF are as follows.

Create a PdfDocument instance and load a sample PDF file. The sample below passes the file path to the PdfDocument constructor; the PdfDocument.loadFromFile() method can be used instead.
Create a StringBuilder and a PdfTableExtractor object.
Loop through all the PDF pages, extract the tables on each page using the PdfTableExtractor.extractTable() method, and store them in a PdfTable[] array.
Loop through all the tables to get their row and column counts, then use the PdfTable.getText() method to obtain the text of each cell.
Write the extracted data to a TXT document using the Writer.write() method.

import com.spire.pdf.PdfDocument;
import com.spire.pdf.utilities.PdfTable;
import com.spire.pdf.utilities.PdfTableExtractor;

import java.io.FileWriter;

public class extractPDFtable {
    public static void main(String[] args) throws Exception {

        //Load a sample PDF document
        PdfDocument pdf = new PdfDocument("Sample.pdf");

        //Create a StringBuilder instance
        StringBuilder builder = new StringBuilder();

        //Create a PdfTableExtractor object
        PdfTableExtractor extractor = new PdfTableExtractor(pdf);

        //Traverse every page of PDF
        for (int page = 0; page < pdf.getPages().getCount(); page++)
        {
            //Extract tables from the PDF page and store them in the PdfTable[] array
            PdfTable[] tableLists = extractor.extractTable(page);
            if (tableLists != null && tableLists.length > 0)
            {
                //Traverse every table
                for (PdfTable table : tableLists)
                {
                    //Get the table rows
                    int row = table.getRowCount();
                    //Get the table columns
                    int column = table.getColumnCount();
                    for (int i = 0; i < row; i++)
                    {
                        for (int j = 0; j < column; j++)
                        {
                            //Get the text from table
                            String text = table.getText(i, j);

                            //Write the obtained text into a StringBuilder container
                            builder.append(text+" ");
                        }
                        builder.append("\r\n");
                    }
                }
            }
        }

        //Write to txt; try-with-resources closes the writer even if write() fails
        try (FileWriter fileWriter = new FileWriter("ExtractedTable.txt")) {
            fileWriter.write(builder.toString());
        }
    }
}
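
The example above joins cells with a single space, so a cell that itself contains spaces can no longer be told apart from its neighbors in the TXT output. If column boundaries matter, one option is to write comma-separated lines instead. The following is a minimal sketch that uses only the Java standard library. CsvRows, escape, and toLine are hypothetical names introduced here and are not part of Spire.PDF, and the sample strings stand in for the values that table.getText(i, j) would return.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Spire.PDF: turns one row of cell text into a CSV line.
class CsvRows {

    // Quote a cell when it contains a comma, a double quote, or a line break,
    // and double any embedded quotes (the usual CSV convention).
    static String escape(String cell) {
        if (cell == null) {
            return "";
        }
        boolean needsQuotes = cell.contains(",") || cell.contains("\"")
                || cell.contains("\n") || cell.contains("\r");
        String escaped = cell.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    // Join the cells of one table row into a single CSV line.
    static String toLine(List<String> cells) {
        List<String> out = new ArrayList<>();
        for (String cell : cells) {
            out.add(escape(cell));
        }
        return String.join(",", out);
    }

    public static void main(String[] args) {
        // Sample values standing in for text extracted from a PDF table row.
        List<String> row = Arrays.asList("Name", "Net income, USD", "Q1 \"final\"");
        System.out.println(toLine(row));
    }
}
```

In the extraction loop, each row's cell texts could be collected into a list and passed to toLine() before being appended to the StringBuilder.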

Extract Table Data from PDF to Excel

The following are the main steps to extract all tables from a certain page and save each of them as an individual worksheet in an Excel document.

Create a PdfDocument instance and load a sample PDF file. The sample below passes the file path to the PdfDocument constructor; the PdfDocument.loadFromFile() method can be used instead.
Create a PdfTableExtractor object, and call the extractTable() method to extract all tables on the first page.
Create a Workbook instance.
Loop through the tables in the PdfTable[] array, and get each one by its index.
Add a worksheet to the workbook using the Workbook.getWorksheets().add() method.
Loop through the cells in the PDF table, and get the value of each cell using the PdfTable.getText() method. Then insert the value into the worksheet using Worksheet.get(row, column).setText(). Worksheet rows and columns are 1-based, which is why the sample adds 1 to each PDF table index.
Save the workbook to an Excel document using the Workbook.saveToFile() method.

import com.spire.pdf.PdfDocument;
import com.spire.pdf.utilities.PdfTable;
import com.spire.pdf.utilities.PdfTableExtractor;
import com.spire.xls.ExcelVersion;
import com.spire.xls.Workbook;
import com.spire.xls.Worksheet;

public class extractPDFtable {
    public static void main(String[] args) throws Exception {

        //Load a sample PDF document
        PdfDocument pdf = new PdfDocument("Sample7.pdf");

        //Create a PdfTableExtractor instance
        PdfTableExtractor extractor = new PdfTableExtractor(pdf);
        //Extract the table from the first page of PDF
        PdfTable[] pdfTables  = extractor.extractTable(0);

        //Create a Workbook object and clear the default worksheets
        Workbook wb = new Workbook();
        wb.getWorksheets().clear();

        //If any tables are found
        if (pdfTables != null && pdfTables.length > 0) {

            //Loop through the tables
            for (int tableNum = 0; tableNum < pdfTables.length; tableNum++) {
                //Add a worksheet to workbook
                String sheetName = String.format("Table - %d", tableNum + 1);
                Worksheet sheet = wb.getWorksheets().add(sheetName);
                //Loop through the rows in the current table
                for (int rowNum = 0; rowNum < pdfTables[tableNum].getRowCount(); rowNum++) {
                    //Loop through the columns in the current table
                    for (int colNum = 0; colNum < pdfTables[tableNum].getColumnCount(); colNum++) {
                        //Extract data from the current table cell
                        String text = pdfTables[tableNum].getText(rowNum, colNum);
                        //Insert data into a specific cell
                        sheet.get(rowNum + 1, colNum + 1).setText(text);
                    }
                }
                //Auto fit column width
                for (int sheetColNum = 0; sheetColNum < sheet.getColumns().length; sheetColNum++) {
                    sheet.autoFitColumn(sheetColNum + 1);
                }
            }
        }
        //Save the workbook to an Excel file
        wb.saveToFile("ExportTableToExcel.xlsx", ExcelVersion.Version2016);
    }
}
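
The example above stores every cell as text, which is safe but means that figures such as "1,234.50" arrive in Excel as strings. If the data is meant for analysis, it may help to detect numeric cells first. The following is a minimal sketch using only the Java standard library. CellNumbers and parse are hypothetical names introduced here, and the sketch assumes English-style formatting (comma as thousands separator, dot as decimal point), which may not match your documents.

```java
import java.util.OptionalDouble;

// Hypothetical helper, not part of Spire.PDF or Spire.XLS:
// decides whether a table cell's text is a plain decimal number.
class CellNumbers {

    // Returns the numeric value of the cell text, or an empty result if it is not a number.
    // Assumption: commas are thousands separators and the dot is the decimal point.
    static OptionalDouble parse(String text) {
        if (text == null) {
            return OptionalDouble.empty();
        }
        String cleaned = text.trim().replace(",", "");
        // Accept only an optional minus sign, digits, and an optional fraction,
        // so that strings like "NaN" or "5d" are not treated as numbers.
        if (!cleaned.matches("-?\\d+(\\.\\d+)?")) {
            return OptionalDouble.empty();
        }
        return OptionalDouble.of(Double.parseDouble(cleaned));
    }

    public static void main(String[] args) {
        // Sample values standing in for text extracted from PDF table cells.
        System.out.println(parse("1,234.50").isPresent());
        System.out.println(parse("Total").isPresent());
    }
}
```

In the cell loop, a value for which parse() returns a result could be written as a number, with setText() kept for everything else; check the Spire.XLS documentation for the appropriate numeric setter.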

Conclusion

In this article, we demonstrated how to extract text from tables in PDF pages using Java. Spire.PDF for Java can also extract all the text and images from a PDF file for other scenarios. You can check the PDF forum for more features for working with PDF files.
