Tables are one of the main elements of a PDF file, and they often contain data that is useful for analysis, reporting, or data entry. When you work with financial reports, for example, you frequently need to extract the data from PDF tables. Spire.PDF for Java supports extracting table data from PDF files and converting it to other file formats such as TXT or Excel, where the data can be analyzed more easily. This article demonstrates two ways to do this: extracting table data to a TXT file, and extracting table data to an Excel workbook.
Install Spire.Office for Java
This scenario uses Spire.PDF for Java to extract tables from PDF files and Spire.XLS for Java to generate Excel files. To use both libraries in the same project, add the Spire.Office.jar file as a dependency in your Java program. The JAR file can be downloaded from this link. If you use Maven, you can import the JAR file into your application by adding the following code to your project's pom.xml file.
<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.office</artifactId>
        <version>8.3.6</version>
    </dependency>
</dependencies>
Extract Data from PDF Table in Java
Spire.PDF for Java offers the PdfTable.getText() method to read the text of each cell in a PDF table. The steps for extracting data from the tables in a PDF are as follows.
- Create a PdfDocument instance and load a sample PDF file. The code below passes the file path to the constructor; the PdfDocument.loadFromFile() method can be used instead.
- Create a StringBuilder and a PdfTableExtractor object.
- Loop through all the PDF pages, extract the tables on each page, and store them in a PdfTable[] array.
- Loop through the tables to get the row and column counts of each one, then use the PdfTable.getText() method to obtain the text of every cell.
- Write the extracted data to a TXT document using the FileWriter.write() method.
import com.spire.pdf.PdfDocument;
import com.spire.pdf.utilities.PdfTable;
import com.spire.pdf.utilities.PdfTableExtractor;

import java.io.FileWriter;

public class ExtractPdfTableToTxt {
    public static void main(String[] args) throws Exception {
        //Load a sample PDF document
        PdfDocument pdf = new PdfDocument("Sample.pdf");

        //Create a StringBuilder instance to collect the extracted text
        StringBuilder builder = new StringBuilder();

        //Create a PdfTableExtractor object
        PdfTableExtractor extractor = new PdfTableExtractor(pdf);

        //Traverse every page of the PDF
        for (int page = 0; page < pdf.getPages().getCount(); page++) {
            //Extract tables from the current page and store them in a PdfTable[] array
            PdfTable[] tableLists = extractor.extractTable(page);
            if (tableLists != null && tableLists.length > 0) {
                //Traverse every table on the page
                for (PdfTable table : tableLists) {
                    //Get the number of rows
                    int row = table.getRowCount();
                    //Get the number of columns
                    int column = table.getColumnCount();
                    for (int i = 0; i < row; i++) {
                        for (int j = 0; j < column; j++) {
                            //Get the text of the current cell
                            String text = table.getText(i, j);
                            //Append the cell text, followed by a space, to the StringBuilder
                            builder.append(text).append(" ");
                        }
                        //End the current table row with a line break
                        builder.append("\r\n");
                    }
                }
            }
        }

        //Write to a TXT file; try-with-resources closes the writer even if writing fails
        try (FileWriter fileWriter = new FileWriter("ExtractedTable.txt")) {
            fileWriter.write(builder.toString());
        }
    }
}
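The example above separates cells with a single space, so a cell whose text itself contains spaces cannot be told apart from two neighboring cells once the file is written. If the TXT output is meant to be parsed later, one option is to write each row as a CSV line instead. The following is a minimal sketch of that idea in plain Java. The CsvRowFormatter class and its methods are hypothetical helpers, not part of Spire.PDF, and the sample row is made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper (not part of Spire.PDF): formats one table row as a CSV line.
final class CsvRowFormatter {

    // Wraps a cell in double quotes when it contains a comma, a double quote,
    // or a line break, and doubles any embedded double quotes (RFC 4180 style).
    static String escape(String cell) {
        if (cell == null) {
            return "";
        }
        boolean needsQuotes = cell.contains(",") || cell.contains("\"")
                || cell.contains("\n") || cell.contains("\r");
        String escaped = cell.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    // Joins the escaped cells of one row with commas.
    static String toCsvLine(List<String> cells) {
        return cells.stream()
                .map(CsvRowFormatter::escape)
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // Made-up sample row; in the extraction loop these values would come
        // from table.getText(i, j).
        List<String> row = Arrays.asList("Net income", "1,250", "Q1 \"restated\"");
        System.out.println(toCsvLine(row));
    }
}
```

In the extraction loop, the cell texts of one row could be collected into a list and passed to toCsvLine() before being appended to the StringBuilder, with the output file named with a .csv extension.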
Extract Table Data from PDF to Excel
The following are the main steps to extract all tables from a specific page and save each of them as an individual worksheet in an Excel document.
- Create a PdfDocument instance and load a sample PDF file. The code below passes the file path to the constructor; the PdfDocument.loadFromFile() method can be used instead.
- Create a PdfTableExtractor object, and call the extractTable() method to extract all tables on the first page.
- Create a Workbook instance.
- Loop through the tables in the PdfTable[] array, and get each one by its index.
- Add a worksheet to the workbook for each table using the Workbook.getWorksheets().add() method.
- Loop through the cells in the PDF table, and get the value of each cell using the PdfTable.getText() method. Then insert the value into the worksheet using the Worksheet.get().setText() method.
- Save the workbook to an Excel document using the Workbook.saveToFile() method.
import com.spire.pdf.PdfDocument;
import com.spire.pdf.utilities.PdfTable;
import com.spire.pdf.utilities.PdfTableExtractor;
import com.spire.xls.ExcelVersion;
import com.spire.xls.Workbook;
import com.spire.xls.Worksheet;

public class ExtractPdfTableToExcel {
    public static void main(String[] args) throws Exception {
        //Load a sample PDF document
        PdfDocument pdf = new PdfDocument("Sample7.pdf");

        //Create a PdfTableExtractor instance
        PdfTableExtractor extractor = new PdfTableExtractor(pdf);

        //Extract the tables from the first page of the PDF (page index 0)
        PdfTable[] pdfTables = extractor.extractTable(0);

        //Create a Workbook object and clear the default worksheets
        Workbook wb = new Workbook();
        wb.getWorksheets().clear();

        //If any tables are found
        if (pdfTables != null && pdfTables.length > 0) {
            //Loop through the tables
            for (int tableNum = 0; tableNum < pdfTables.length; tableNum++) {
                PdfTable table = pdfTables[tableNum];

                //Add a worksheet named after the table's position, e.g. "Table - 1"
                String sheetName = String.format("Table - %d", tableNum + 1);
                Worksheet sheet = wb.getWorksheets().add(sheetName);

                //Loop through the rows in the current table
                for (int rowNum = 0; rowNum < table.getRowCount(); rowNum++) {
                    //Loop through the columns in the current table
                    for (int colNum = 0; colNum < table.getColumnCount(); colNum++) {
                        //Extract data from the current table cell
                        String text = table.getText(rowNum, colNum);
                        //Insert data into the matching worksheet cell (worksheet indexes start at 1)
                        sheet.get(rowNum + 1, colNum + 1).setText(text);
                    }
                }

                //Auto fit column width
                for (int sheetColNum = 0; sheetColNum < sheet.getColumns().length; sheetColNum++) {
                    sheet.autoFitColumn(sheetColNum + 1);
                }
            }
        }

        //Save the workbook to an Excel file
        wb.saveToFile("ExportTableToExcel.xlsx", ExcelVersion.Version2016);
    }
}
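Because the example above inserts every value with setText(), figures such as amounts in a financial report end up in the worksheet as text rather than as numbers, which can get in the way of sums and other calculations. One way to prepare for that is to check whether a cell's text looks numeric before inserting it. The following is a minimal sketch in plain Java. NumericCellParser is a hypothetical helper, not part of Spire.PDF or Spire.XLS, and it assumes US-style formatting: commas as thousands separators, a period as the decimal mark, and either a leading minus sign or surrounding parentheses for negative values.

```java
import java.util.OptionalDouble;

// Hypothetical helper (not part of Spire.PDF or Spire.XLS): decides whether the
// text of a table cell can be treated as a number.
final class NumericCellParser {

    // Returns the numeric value of the text, or an empty result if the text is
    // not a plain number. Assumes US-style formatting; commas are removed
    // leniently without checking that they sit at thousands positions.
    static OptionalDouble parse(String text) {
        if (text == null) {
            return OptionalDouble.empty();
        }
        String s = text.trim();
        boolean negative = false;
        if (s.length() > 2 && s.startsWith("(") && s.endsWith(")")) {
            // Accounting notation: (300) means -300
            negative = true;
            s = s.substring(1, s.length() - 1);
        } else if (s.startsWith("-")) {
            negative = true;
            s = s.substring(1);
        }
        s = s.replace(",", "");
        // Accept only digits with an optional decimal part
        if (!s.matches("\\d+(\\.\\d+)?")) {
            return OptionalDouble.empty();
        }
        double value = Double.parseDouble(s);
        return OptionalDouble.of(negative ? -value : value);
    }

    public static void main(String[] args) {
        // Made-up sample cell values
        String[] samples = {"1,250.50", "(300)", "-42", "Total"};
        for (String sample : samples) {
            OptionalDouble parsed = parse(sample);
            System.out.println(sample + " -> "
                    + (parsed.isPresent() ? String.valueOf(parsed.getAsDouble()) : "text"));
        }
    }
}
```

When parse() returns a value, the cell could be written with a numeric setter instead of setText(). Spire.XLS appears to provide one on its cell range type, but check the exact method name against the Spire.XLS for Java API reference for your version before relying on it.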
Conclusion
In this article, we have demonstrated how to extract text from tables in PDF pages using Java. With Spire.PDF for Java, you can also extract all the text and images from a PDF file for other scenarios. You can check the PDF forum for more features for working with PDF files.