Artificial Intelligence has found use cases in every possible industry! Many complicated problems we face in our day to day are now being solved using AI. Some solutions might not yet give results up to human standards, but with improvements in the underlying algorithms and optimizations we are progressing towards that goal. In this article we will look at one such important problem: text extraction from documents. For many years, companies have worked on this problem using manual techniques, rule-based methods, or customized OCR, all of which are time consuming and complicated.
One important point here is documents are important! How? Let's see!
Documents are the primary tools for keeping records. Large amounts of data are stored in structured or unstructured documents. They are also important for communicating, collaborating, and transacting data across industries like medicine, law, business management, finance, education, tax management, and many more.
What are the types of documents we are looking at?
We are looking at scanned documents, digital documents, forms, tables, contracts, and many others.
I mentioned above some classical techniques we have been using. What is the problem with those? The major problems with these manual techniques are that they are too expensive, error prone, and time consuming, as they involve human intervention.
Let's see problems with each of the technique:
1. Manual processing (humans):
When we depend on humans to process the docs, we run into issues like:
- Variable output
- Inconsistent results
- Reviews for consensus
In the example below, humans can process and interpret these blocks differently, depending on a variety of factors.
2. Customized OCR is a better solution than manual extraction, but it has its own problems:
- Paragraph detection (You can code this, but manual intervention comes in again. You can annotate a sample set and train an ML model on it to get separated paragraphs, and there are also some unsupervised methods, but either way ML comes into play.)
- No rotated or stylized text detection
- No multi-column detection
- No table extraction
You can obviously add these features, but if you want to do it without ML you have to maintain a separate code template for each document (and templates are brittle), which is time consuming. Consider the tax forms of any country: there are different variations for different job categories, and you would have to maintain different templates and rule sets for all of them, which is a nightmare.
So how can we avoid complicating our lives further and still build a robust text extraction solution? Amazon Textract comes in handy and solves many of the problems we have seen! Its tagline says: extract text and data from virtually any document!
Let's jump into details!
What can Amazon Textract do?
Let's first list some things you can achieve using Amazon Textract and then see the core features in detail:
- Text detection from documents
- Multi-column detection and reading order
- Natural language processing and document classification
- Natural language processing for medical documents
- Document translation
- Search and discovery
- Form extraction and processing
- Compliance control with document redaction
- Table extraction and processing
- PDF document processing
How does Textract work?
The Amazon Textract API accepts a document stored in S3 and uses built-in ML models to extract text, tables, or any fields of interest from docs. We then get the option to either store this extracted data in some other format or stack other services on top for further processing of the output. We can use a service like Elasticsearch to index the data and build a search application around it, or we can use Amazon Comprehend to apply natural language processing to our data.
We can also use a service like Amazon Comprehend Medical, which uses advanced machine learning models to accurately and quickly identify medical information, such as medical conditions and medications, and determine their relationships to each other, for instance medicine dosage and strength. Amazon Comprehend Medical can also link the detected information to medical ontologies such as ICD-10-CM or RxNorm. And if you are not interested in all this fancy stuff, you can just store your data in a database with a pre-defined schema and use it in your application! The self-explanatory diagram from the documentation above should make things a little easier to understand.
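As a minimal sketch of that chaining idea, here is one way the text in a Textract response could be collected and handed to Amazon Comprehend. The helper names and the truncated sample response are my own illustrations, not part of the Textract API:

```python
def lines_from_textract(response):
    """Join the text of all LINE blocks in a Textract response dict."""
    return "\n".join(
        b["Text"] for b in response.get("Blocks", []) if b["BlockType"] == "LINE"
    )

def detect_entities(text, region="us-east-1"):
    """Send extracted text to Amazon Comprehend for entity detection.

    Requires AWS credentials to be configured; boto3 is imported lazily
    so the pure-Python helper above stays usable without it.
    """
    import boto3
    comprehend = boto3.client("comprehend", region_name=region)
    return comprehend.detect_entities(Text=text, LanguageCode="en")

# A tiny hand-written stand-in for a real Textract response:
sample = {"Blocks": [
    {"BlockType": "LINE", "Text": "Patient: John Doe"},
    {"BlockType": "WORD", "Text": "Patient:"},
    {"BlockType": "LINE", "Text": "Dosage: 20 mg"},
]}
print(lines_from_textract(sample))  # "Patient: John Doe" then "Dosage: 20 mg"
```

Only LINE blocks are joined here; WORD blocks repeat the same content at a finer granularity, so including both would duplicate text.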
Before going ahead let's just see request and response format of Textract API.
1. Request Syntax:
{
    "Document": {
        "Bytes": blob,
        "S3Object": {
            "Bucket": "string",
            "Name": "string",
            "Version": "string"
        }
    },
    "FeatureTypes": [ "string" ],
    "HumanLoopConfig": {
        "DataAttributes": {
            "ContentClassifiers": [ "string" ]
        },
        "FlowDefinitionArn": "string",
        "HumanLoopName": "string"
    }
}
Here, Document is the input document, supplied either as base64-encoded bytes or as an Amazon S3 object; it is required. FeatureTypes is the list of features you want to extract, such as tables or forms; it is also required. HumanLoopConfig lets you set up a human reviewer and is optional.
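To make the request shape concrete, here is a small sketch that assembles an AnalyzeDocument request body for an S3-hosted document. The bucket and object names are placeholders, and the actual boto3 call is left commented out since it needs AWS credentials:

```python
def build_analyze_request(bucket, key, feature_types=("TABLES", "FORMS")):
    """Assemble the AnalyzeDocument request body for an S3-hosted document."""
    return {
        "Document": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": list(feature_types),
    }

request = build_analyze_request("my-bucket", "scanned-form.png")
print(request["FeatureTypes"])  # ['TABLES', 'FORMS']

# With credentials configured, the call itself would be:
# import boto3
# textract = boto3.client("textract")
# response = textract.analyze_document(**request)
```

Note that Document takes either Bytes or S3Object, not both; this sketch uses the S3 variant.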
2. Response Syntax:
{
    "AnalyzeDocumentModelVersion": "string",
    "Blocks": [
        {
            "BlockType": "string",
            "ColumnIndex": number,
            "ColumnSpan": number,
            "Confidence": number,
            "EntityTypes": [ "string" ],
            "Geometry": {
                "BoundingBox": {
                    "Height": number,
                    "Left": number,
                    "Top": number,
                    "Width": number
                },
                "Polygon": [
                    {
                        "X": number,
                        "Y": number
                    }
                ]
            },
            "Id": "string",
            "Page": number,
            "Relationships": [
                {
                    "Ids": [ "string" ],
                    "Type": "string"
                }
            ],
            "RowIndex": number,
            "RowSpan": number,
            "SelectionStatus": "string",
            "Text": "string"
        }
    ],
    "DocumentMetadata": {
        "Pages": number
    },
    "HumanLoopActivationOutput": {
        "HumanLoopActivationConditionsEvaluationResults": "string",
        "HumanLoopActivationReasons": [ "string" ],
        "HumanLoopArn": "string"
    }
}
Here, AnalyzeDocumentModelVersion tells you the version of the model used, and Blocks contains all the detected items. DocumentMetadata gives additional information about the document, and HumanLoopActivationOutput gives the results of evaluation by a human reviewer.
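As a sketch of working with that response shape, here is a small helper that pulls every detected line of text (with its confidence score) out of the Blocks list. The sample response dict is hand-written for illustration:

```python
def extract_lines(response, min_confidence=0.0):
    """Return (text, confidence) for every LINE block above a threshold."""
    return [
        (b["Text"], b["Confidence"])
        for b in response.get("Blocks", [])
        if b["BlockType"] == "LINE" and b.get("Confidence", 0) >= min_confidence
    ]

# A truncated stand-in for a real AnalyzeDocument response:
sample = {
    "DocumentMetadata": {"Pages": 1},
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {"BlockType": "LINE", "Id": "l1", "Text": "Invoice #123", "Confidence": 99.1},
        {"BlockType": "LINE", "Id": "l2", "Text": "Total: $40", "Confidence": 62.5},
    ],
}
print(extract_lines(sample, min_confidence=90))  # [('Invoice #123', 99.1)]
```

Filtering on Confidence like this is also how you would implement an adjustable confidence threshold, one of the capabilities mentioned later.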
Now that we know what Textract can do and how it works, let's see the core features and capabilities Textract provides in detail:
Core Features:
You can try all of this directly from the Amazon Textract console!
1. Table Extraction:
Amazon Textract can extract tables from a given document and provide them in any format we want, including CSV or a spreadsheet, and we can even automatically load the extracted data into a database using a pre-defined schema.
Let's consider one document and see how Textract works for that!
Here are the results which are really promising!
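To show roughly how a table comes back, here is a sketch that rebuilds a table from the CELL blocks in an AnalyzeDocument response (TABLES feature) and writes it as CSV. In the real response, CELL blocks reference their WORD children by id rather than carrying text directly; the tiny block list below is a hand-made illustration of that structure:

```python
import csv
import io

def table_to_csv(blocks):
    """Rebuild a table from Textract CELL blocks into CSV text."""
    by_id = {b["Id"]: b for b in blocks}

    def cell_text(cell):
        # CELL blocks point at their WORD children via CHILD relationships
        words = []
        for rel in cell.get("Relationships", []):
            if rel["Type"] == "CHILD":
                words += [by_id[i]["Text"] for i in rel["Ids"]]
        return " ".join(words)

    # Group cell text by (RowIndex, ColumnIndex)
    rows = {}
    for b in blocks:
        if b["BlockType"] == "CELL":
            rows.setdefault(b["RowIndex"], {})[b["ColumnIndex"]] = cell_text(b)

    out = io.StringIO()
    writer = csv.writer(out)
    for r in sorted(rows):
        writer.writerow([rows[r][c] for c in sorted(rows[r])])
    return out.getvalue()

blocks = [
    {"Id": "w1", "BlockType": "WORD", "Text": "Name"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Alice"},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
]
print(table_to_csv(blocks))  # Name,Alice
```

This ignores merged cells (RowSpan/ColumnSpan greater than 1), which a production version would need to handle.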
2. Form Extraction:
Amazon Textract can extract data from forms as key-value pairs, which we can use for various applications. For example, if you want to set up an automated process that accepts a scanned bank account opening application, fills the required data into a system, and creates the account, you can do that using Amazon Textract form extraction.
Let's try this on the document below:
Here are the results:
Let's see harder problem with document like this:
Here's what we got:
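Under the hood, form results arrive as KEY_VALUE_SET blocks: each KEY block links to a VALUE block, and both link to their WORD children. Here is a sketch that pairs them up into a plain dict; the sample block list is a hand-made illustration of that structure:

```python
def form_fields(blocks):
    """Pair KEY and VALUE blocks from a Textract forms response into a dict."""
    by_id = {b["Id"]: b for b in blocks}

    def child_text(block):
        # Collect the WORD children a KEY or VALUE block points at
        words = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                words += [by_id[i]["Text"] for i in rel["Ids"]]
        return " ".join(words)

    fields = {}
    for b in blocks:
        if b["BlockType"] == "KEY_VALUE_SET" and "KEY" in b.get("EntityTypes", []):
            value_ids = [i for rel in b.get("Relationships", [])
                         if rel["Type"] == "VALUE" for i in rel["Ids"]]
            value = " ".join(child_text(by_id[i]) for i in value_ids)
            fields[child_text(b)] = value
    return fields

blocks = [
    {"Id": "w1", "BlockType": "WORD", "Text": "Name:"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Alice"},
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
]
print(form_fields(blocks))  # {'Name:': 'Alice'}
```

A dict like this is exactly what you would map onto a database schema in the account-opening example above.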
3. Text Extraction:
Amazon Textract uses an improved take on OCR that combines ML with traditional OCR (some people like to call it OCR++) to detect printed text and numbers in a scan or rendering of a document. This can be used for medical reports and financial reports, or for applications like clause extraction in legal documents when paired with Amazon Comprehend.
Let's try to extract text from this document:
Here are the results:
Along with these three core features, Textract also provides a bunch of capabilities like bounding boxes, adjustable confidence thresholds, and a built-in human review workflow.
So, how can we use the Textract API with Python?
Let's build a very simplified upload-and-analyze pipeline based on Amazon Textractor.
- Pipeline:
First, we will upload the document to S3 and then use Amazon Textractor to extract the fields we want from the document.
import argparse
import subprocess

from s3_upload import upload

def run_pipeline(source_file, bucket_name, object_key, flags):
    upload(source_file, bucket_name, object_key)
    url = f"s3://{bucket_name}/{object_key}"
    # Run the Textractor CLI against the uploaded document
    subprocess.run(["python", "textractor.py", "--documents", url, *flags.split()])

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('source_file', help='The path and name of the source file to upload.')
    parser.add_argument('bucket_name', help='The name of the destination bucket.')
    parser.add_argument('object_key', help='The key of the destination object.')
    parser.add_argument('flags', help='At least one of the flags (--text, --forms and --tables) is required. You can use a combination of all three.')
    args = parser.parse_args()
    run_pipeline(args.source_file, args.bucket_name, args.object_key, args.flags)

if __name__ == "__main__":
    main()
Here, we provide the local file path, the S3 bucket we want to upload the file to, and the object name, along with flags for what we want to extract.
- Upload file to S3:
Uploading a file to S3 is really easy:
import boto3

def upload(source_file, bucket_name, object_key):
    s3 = boto3.resource('s3')
    try:
        s3.Bucket(bucket_name).upload_file(source_file, object_key)
    except Exception as e:
        print(e)
- Textractor:
Textractor is a ready-to-use solution from Amazon that helps speed up PoCs. It can convert output into different formats, including raw JSON, JSON for each page in the document, text, text in reading order, key/values exported as CSV, and tables exported as CSV. It can also generate insights or translate detected text using Amazon Comprehend, Amazon Comprehend Medical, and Amazon Translate.
This is how Textractor uses the response parser library, which helps process the JSON returned from Amazon Textract. See the repo and documentation for more details.
# Call Amazon Textract and get JSON response
docproc = DocumentProcessor(bucketName, filePath, awsRegion, detectText, detectForms, tables)
response = docproc.run()

# Get DOM
doc = Document(response)

# Iterate over elements in the document
for page in doc.pages:
    # Print lines and words
    for line in page.lines:
        print("Line: {}--{}".format(line.text, line.confidence))
        for word in line.words:
            print("Word: {}--{}".format(word.text, word.confidence))
    # Print tables
    for table in page.tables:
        for r, row in enumerate(table.rows):
            for c, cell in enumerate(row.cells):
                print("Table[{}][{}] = {}-{}".format(r, c, cell.text, cell.confidence))
    # Print fields
    for field in page.form.fields:
        print("Field: Key: {}, Value: {}".format(field.key.text, field.value.text))
    # Get field by key
    key = "Phone Number:"
    field = page.form.getFieldByKey(key)
    if field:
        print("Field: Key: {}, Value: {}".format(field.key, field.value))
    # Search fields by key
    key = "address"
    fields = page.form.searchFieldsByKey(key)
    for field in fields:
        print("Field: Key: {}, Value: {}".format(field.key, field.value))
This is what the output looks like!
What's next
We went through the various features and capabilities Textract provides! It is a ready-to-use solution that can simplify some very complicated problems we face while building business applications around documents. It is not 100% accurate and directly usable for every case, but some small tweaks here and there should make it usable for most use cases. In the next article, we will see how to use this in some business applications, and we will also try to build an end-to-end pipeline using various AWS services.
Until then, let me know in the comments if you have use cases where you are already using Amazon Textract or are planning to use it. If you have any questions or want to discuss any use cases, ping me on Twitter.
Stay safe!
References:
- Amazon Textract : https://aws.amazon.com/textract/
- Amazon Textract Console: https://console.aws.amazon.com/textract/home?region=us-east-1#/
- Amazon Blogs: https://aws.amazon.com/blogs/machine-learning/automatically-extract-text-and-structured-data-from-documents-with-amazon-textract/
- Amazon Textract Documentation: https://docs.aws.amazon.com/textract/latest/dg/what-is.html
- Amazon Textract Textractor