Introduction
In this article, we explore various methods for extracting data from documents, comparing OCR+LLM with Claude 3 Vision, and delving into fast OCR transformers and cloud-native OCRs. We also provide a code example for implementing OCR as a simple API using docTR and discuss how Groq can be leveraged to achieve the best inference speed for LLMs.
The Use Case: Document Scanning to Save Time
Imagine a SaaS platform that helps register invoices for a company. Speed and convenience are paramount, and while some errors are tolerable, the goal is to minimize them. This scenario highlights the need for rapid and reliable data extraction.
Claude 3 Vision vs OCR+LLM
Claude 3 Vision
Claude 3 Vision is known for its speed and cost-efficiency. However, it has limitations, including a tendency to hallucinate, that is, to return plausible-looking values that do not actually appear in the document. It's suitable for simple tasks but may fall short in more complex scenarios.
OCR+LLM
OCR+LLM combines Optical Character Recognition (OCR) with Large Language Models (LLMs) to extract and analyze text. This approach offers a balance between accuracy and speed, making it ideal for more detailed data extraction tasks.
Testing Limits with Claude 3 Vision
Using an example invoice, we can define a protocol for our application:
invoice_number: "string"
invoice_date: "string" # YYYY-MM-DD
due_date: "string" # YYYY-MM-DD
seller_details:
  seller_name: "string"
  seller_address:
    street_number_and_name: "string"
    city_or_town: "string"
    country: "string"
buyer_details:
  buyer_name: "string"
  buyer_address:
    street_number_and_name: "string"
    city_or_town: "string"
    country: "string"
  buyer_email: "string"
  buyer_phone_number: "string"
products_services:
  - item_number: number
    description: "string"
    quantity: number
    unit_price: number
    total_price: number
sub_total: number
total: number
This pseudo-YAML format outlines the fields we want to extract from an invoice. Testing with Claude 3 Vision yielded response times of about 1 second, which is slower than desired.
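Since some errors are tolerable but should be minimized, it can help to sanity-check whatever the model returns against the contract. The sketch below is one way to do that with the standard library only; the function name validate_invoice, the tolerance value, and the choice of checks (date format and arithmetic consistency) are assumptions for illustration, not part of the original application.

```python
from datetime import datetime


def validate_invoice(invoice: dict, tolerance: float = 0.01) -> list[str]:
    """Return a list of problems found in an extracted invoice (empty if none)."""
    problems = []

    # Dates must follow the YYYY-MM-DD format stated in the contract
    for key in ("invoice_date", "due_date"):
        value = invoice.get(key)
        try:
            datetime.strptime(value, "%Y-%m-%d")
        except (TypeError, ValueError):
            problems.append(f"{key} is not a YYYY-MM-DD date: {value!r}")

    # Each line total should equal quantity * unit_price
    items = invoice.get("products_services", [])
    for item in items:
        expected = item["quantity"] * item["unit_price"]
        if abs(expected - item["total_price"]) > tolerance:
            problems.append(
                f"item {item['item_number']}: total_price does not match quantity * unit_price"
            )

    # The subtotal should equal the sum of the line totals
    if "sub_total" in invoice:
        line_sum = sum(item["total_price"] for item in items)
        if abs(line_sum - invoice["sub_total"]) > tolerance:
            problems.append("sub_total does not match the sum of line totals")

    return problems
```

A non-empty result does not prove the model hallucinated, but it is a cheap signal that a document deserves a second look.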
OCR Transformers Designed for Speed
Notable OCR Tools
- docTR: Optimized for high-speed performance on both CPU and GPU, requiring only three lines of code to implement.
- TrOCR: Transformer-based OCR models from Microsoft, available as several pre-trained checkpoints.
- PaddleOCR: Known for its speed, capable of processing large volumes of images in real-time.
- MMOCR: OpenMMLab's OCR toolbox, another fast option.
- Surya: Highly efficient and fast.
Performance testing showed that these OCR tools could achieve processing times as low as 20ms on a GPU.
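Latency figures like these depend heavily on hardware, image size, and whether the model is already loaded, so it is worth measuring on your own setup. The helper below is a minimal, library-agnostic sketch; the name measure_latency_ms and its parameters are assumptions, and it simply times any zero-argument callable.

```python
import statistics
import time
from typing import Callable


def measure_latency_ms(fn: Callable[[], object], runs: int = 20, warmup: int = 3) -> dict:
    """Time fn() over several runs and report latency statistics in milliseconds."""
    # Warm-up runs absorb one-off costs such as model loading or GPU initialization
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)

    return {
        "median_ms": statistics.median(samples),
        "min_ms": min(samples),
        "max_ms": max(samples),
        "runs": runs,
    }
```

With an OCR model, fn would typically wrap a single inference call on a fixed test image. On a GPU, keep in mind that some frameworks run kernels asynchronously, so the callable should only return once the result is actually available.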
Cloud-Native OCRs
- Azure Form Recognizer: Best performance time around 3 seconds.
- Amazon Textract: Processes documents in 3-4 seconds per page.
- Google Cloud Vision API and Document AI: Highly efficient and similar to Azure and Amazon.
- ABBYY Cloud OCR: Faster than the other alternatives and offers detailed page representations.
These cloud AI services used to be the go-to solutions but are now often replaced by LLMs due to cost and flexibility advantages.
Implementing OCR with docTR
Setting Up the OCR API
Here’s a simple example using docTR:
from fastapi import FastAPI, File, UploadFile
from fastapi.responses import JSONResponse
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

app = FastAPI(title="OCR Service using docTR")

# Load the pretrained model once at startup rather than on every request
model = ocr_predictor(pretrained=True)


@app.post("/ocr/")
async def perform_ocr(file: UploadFile = File(...)):
    # Read the uploaded image as raw bytes and wrap it as a docTR document
    image_data = await file.read()
    doc = DocumentFile.from_images(image_data)
    result = model(doc)

    # Walk pages -> blocks -> lines and join each line's words into one string
    extracted_texts = []
    for page in result.pages:
        for block in page.blocks:
            for line in block.lines:
                line_text = " ".join(word.value for word in line.words)
                extracted_texts.append(line_text)

    return JSONResponse(content={"ExtractedText": extracted_texts})


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8001)
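The endpoint above returns only the text. Since the goal is to minimize errors, one possible extension is to also use the per-word confidence scores that docTR reports. The sketch below works on a plain dictionary and assumes the nested pages/blocks/lines/words layout, with value and confidence keys on each word, that docTR's result.export() produces; the function name lines_from_export and the 0.5 threshold are illustrative assumptions, and the key names should be checked against the installed docTR version.

```python
def lines_from_export(exported: dict, min_confidence: float = 0.5) -> list[dict]:
    """Flatten an exported OCR result into lines, flagging low-confidence ones."""
    lines = []
    for page_index, page in enumerate(exported.get("pages", [])):
        for block in page.get("blocks", []):
            for line in block.get("lines", []):
                words = line.get("words", [])
                if not words:
                    continue  # skip empty lines
                # A line is only as trustworthy as its least confident word
                lowest = min(word.get("confidence", 0.0) for word in words)
                lines.append({
                    "page": page_index,
                    "text": " ".join(word["value"] for word in words),
                    "needs_review": lowest < min_confidence,
                })
    return lines
```

Lines marked needs_review could be highlighted for the user or excluded from automatic processing, depending on how much error the application can tolerate.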
Docker Setup for the OCR API
# Use the official TensorFlow image as the parent image
FROM tensorflow/tensorflow:latest
# Set the working directory in the container
WORKDIR /app
# Install system dependencies required for OpenCV and WeasyPrint
RUN apt-get update && apt-get install -y \
    libgl1-mesa-glx \
    libpango-1.0-0 \
    libpangocairo-1.0-0 \
    libgdk-pixbuf2.0-0 \
    libffi-dev \
    shared-mime-info
# Install FastAPI and Uvicorn
RUN pip install fastapi uvicorn python-multipart aiofiles Pillow
# Copy the local directory contents into the container
COPY . /app
# Install `doctr` with TensorFlow support
RUN pip install python-doctr[tf]
# Expose the port FastAPI will run on
EXPOSE 8001
# Command to run the FastAPI server on container start
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8001", "--workers", "4"]
Groq: The King of Speed
Groq serves open LLMs on its own inference hardware and is positioned around performance and cost efficiency, with claimed speeds roughly three times faster at half the cost compared to traditional inference setups.
Using Groq with OCR
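import os

from groq import Groq

# Read the API key from the environment; the variable name GROQ_API_KEY is an assumption
API_KEY_GROQ = os.environ["GROQ_API_KEY"]


def send_request_to_groq(content: str) -> str:
    """Send the OCR text to Groq and return the model's JSON reply as a string."""
    client = Groq(api_key=API_KEY_GROQ)
    completion = client.chat.completions.create(
        model="gemma-7b-it",
        messages=[
            {
                "role": "system",
                "content": "You are an API server that receives content from a document and returns a JSON with the defined protocol",
            },
            {
                "role": "user",
                "content": content,
            },
        ],
        temperature=1,
        max_tokens=1024,
        top_p=1,
        stream=False,
        # Ask the model to return a single valid JSON object
        response_format={"type": "json_object"},
        stop=None,
    )
    return completion.choices[0].message.content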
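The response_format parameter is particularly noteworthy: setting it to {"type": "json_object"} asks the model to return a single, syntactically valid JSON object, which removes much of the fragile post-processing otherwise needed to pull JSON out of free-form text. Not every provider offers an equivalent option.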
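Even when response_format requests a JSON object, the reply can still be valid JSON that lacks the fields the application expects. A small guard such as the following can catch that before the data reaches the rest of the system. This is a sketch: the name parse_json_response and the default required keys are assumptions chosen to match the invoice contract above.

```python
import json


def parse_json_response(raw: str, required_keys: tuple[str, ...] = ("invoice_number", "total")) -> dict:
    """Parse a model reply as JSON and confirm the expected top-level keys exist."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model reply is not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise ValueError("model reply is valid JSON but not an object")

    missing = [key for key in required_keys if key not in data]
    if missing:
        raise ValueError(f"model reply is missing keys: {missing}")

    return data
```

Raising a single exception type keeps the calling code simple: it can catch ValueError and decide whether to retry the request or flag the document for manual entry.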
Final Implementation
Controller Code
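import json
import os
import shutil
import tempfile
import time
from fastapi import File, Form, UploadFile

# convert_pdf_to_images, extract_text_with_pytesseract, remove_json_format,
# systemMessage and jsonContentStarter are defined elsewhere in the application.

@app.post("/extract_fast")
async def extract_text(file: UploadFile = File(...), extraction_contract: str = Form(...)):
    # Save the upload to disk; leaving the with-block flushes and closes the file
    with tempfile.NamedTemporaryFile(delete=False) as temp_file:
        shutil.copyfileobj(file.file, temp_file)
        file_path = temp_file.name
    try:
        images = convert_pdf_to_images(file_path)
        extracted_text = extract_text_with_pytesseract(images)
    finally:
        os.unlink(file_path)  # delete=False leaves cleanup to us
    # Assemble the prompt: instructions, OCR content, contract, JSON starter
    extracted_text = "\n new page --- \n".join(extracted_text)
    extracted_text = systemMessage + "\n####Content\n\n" + extracted_text
    extracted_text = extracted_text + "\n####Structure of the JSON output file\n\n" + extraction_contract
    extracted_text = extracted_text + "\n#### JSON Response\n\n" + jsonContentStarter

    start_time = time.time()
    content = send_request_to_groq(extracted_text)
    elapsed_time = time.time() - start_time
    print(f"send_request_to_groq took {elapsed_time} seconds")
    content = remove_json_format(content)
    return json.loads(content)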
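The controller relies on several helpers that are not shown here. As one plausible reading of remove_json_format, the sketch below strips an optional Markdown code fence that some models wrap around their JSON. This is an assumption about what the helper does, not the original implementation.

```python
# Markdown code-fence marker (three backticks), built programmatically for readability
FENCE = "`" * 3


def remove_json_format(content: str) -> str:
    """Strip an optional Markdown code fence (with or without a 'json' tag) from a reply."""
    text = content.strip()
    if text.startswith(FENCE) and text.endswith(FENCE) and len(text) >= 2 * len(FENCE):
        # Drop the opening and closing fence markers
        text = text[len(FENCE):-len(FENCE)]
        # Drop the optional language tag that follows the opening fence
        if text.lower().startswith("json"):
            text = text[len("json"):]
    return text.strip()
```

A reply that is already bare JSON passes through unchanged apart from whitespace trimming, so the helper is safe to apply unconditionally before json.loads.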
Conclusion
For document scanning and data extraction, combining a fast OCR model running on a GPU with an LLM served by Groq provides superior speed and efficiency. This approach is especially beneficial for processing invoices and other documents captured via mobile devices.