Enhancing Data Extraction Techniques with Amazon Textract and VGT2 | AWS Partner Network (APN) Blog

In Amazon Machine Learning, Amazon Textract, Artificial Intelligence, AWS Partner Network, Customer Solutions, Intermediate (200)

Manually extracting data from various sources is often tedious, prone to errors, and can hinder business processes. VGT2 developed a solution leveraging Amazon Textract to enhance the accuracy of data extraction, decrease processing times, and significantly improve productivity, ultimately leading to better operational efficiencies. This solution was implemented for a client who aimed to automate financial data extraction from unstructured documents.

Our Amazon Textract-based Red-Ex (Redaction-Extraction) solution is capable of processing a diverse array of documents, including financial reports, medical records, and tax forms. This solution is scalable, customizable, and minimizes errors. VGT2 is an esteemed AWS Partner Network (APN) Advanced Consulting Partner with competencies in DevOps, Migration, and Financial Services, specializing in cloud, application modernization, and data services.

In this article, I will outline our data extraction and redaction solution, detailing how a pre-processing phase for inputs and a post-processing phase for outputs improve extraction quality in our Red-Ex solution built on Amazon Textract.

Pre-Processing the Input for Amazon Textract

Information extraction (IE) refers to the automated retrieval of specific data related to a selected topic from structured, unstructured, or semi-structured documents. While extracting data from structured documents is less challenging, unstructured or semi-structured documents present more difficulties. Amazon Textract employs machine learning (ML) models to extract text from PDF files or scanned images with remarkable accuracy, surpassing traditional optical character recognition (OCR) systems.

It intelligently identifies key-value pairs (the position of each key and its corresponding values) and extracts table information. Before submitting an image to Amazon Textract, we pre-process it to improve output accuracy, and we apply the same pre-processing techniques during our testing phase before processing the image or PDF with Amazon Textract.

Our pre-processing techniques include:

  1. Image Enhancement
  2. Thresholding
  3. Resize and Crop

Image Enhancement

Similar to how we would zoom in on unclear text, our solution automatically enhances images before they are processed by Amazon Textract. We utilize Python to convert files to images with a resolution of 300 DPI, ensuring clarity.
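As a rough illustration of this step, the sketch below (using Pillow; the `enhance_image` helper and its 72 DPI fallback are assumptions for illustration, not our production code) upscales an image to an effective 300 DPI before OCR:

```python
from PIL import Image

TARGET_DPI = 300

def enhance_image(src_path, out_path):
    """Upscale an image to an effective 300 DPI before OCR (hypothetical helper)."""
    img = Image.open(src_path)
    # Assume 72 DPI when the file carries no resolution metadata
    dpi = img.info.get("dpi", (72, 72))[0]
    if dpi < TARGET_DPI:
        scale = TARGET_DPI / dpi
        img = img.resize(
            (int(img.width * scale), int(img.height * scale)),
            Image.LANCZOS)  # high-quality resampling to keep glyph edges sharp
    img.save(out_path, dpi=(TARGET_DPI, TARGET_DPI))
```

A PDF page would first be rasterized to an image (for example with a PDF rendering library) before this step.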

Thresholding

This powerful image technique removes unnecessary color channels, resulting in clearer text. For any scanned PDF image, VGT2 produces outputs with black text on a white background through a process called binarization, which includes converting the input image to grayscale and applying thresholding techniques. Among various methods, Otsu binarization yields optimal results for financial documents, automatically determining the threshold value.
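For illustration, Otsu's method can be sketched in plain NumPy (a simplified stand-in for the binarization call an image library would normally provide; the function name is ours):

```python
import numpy as np

def otsu_binarize(gray):
    """Binarize a grayscale uint8 array with Otsu's method (illustrative sketch)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    sum_all = np.dot(np.arange(256), hist)
    best_t, best_var = 0, 0.0
    w_bg, sum_bg = 0.0, 0.0
    for t in range(256):
        w_bg += hist[t]            # background pixel count at threshold t
        if w_bg == 0:
            continue
        w_fg = total - w_bg        # foreground pixel count
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Maximize between-class variance
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return np.where(gray > best_t, 255, 0).astype(np.uint8)
```

The key property, as noted above, is that the threshold is computed from the image's own histogram rather than hard-coded.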

Resize and Crop

Many financial documents contain pertinent information only in specific sections, making it necessary to crop the document to focus on essential content. We do this by summing pixel values for each row to identify areas containing text versus white space. Our solution then marks the x, y coordinates of the first and last line of text, drawing a contour around it.
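The row-sum idea can be sketched as follows (a simplified illustration assuming a binarized page with black text on a white background; `crop_text_rows` is a hypothetical helper, and the same logic applied along columns gives the x coordinates):

```python
import numpy as np

def crop_text_rows(binary, white=255):
    """Crop a binarized page to the rows that contain text (illustrative sketch)."""
    # A row of pure white pixels sums to width * 255; anything less has ink
    row_sums = binary.sum(axis=1)
    has_text = row_sums < binary.shape[1] * white
    idx = np.where(has_text)[0]
    if idx.size == 0:
        return binary  # blank page: nothing to crop
    top, bottom = idx[0], idx[-1]  # first and last line of text
    return binary[top:bottom + 1, :]
```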

Post-Processing the Output from Amazon Textract

Different processes are required for extracting information from non-standard documents compared to standard documents. For structured documents, which typically follow a fixed format, analysis and processing are relatively straightforward, as seen with tax forms and insurance claims.

For structured documents, we utilize Amazon Textract for the extraction of raw text, forms, and tables, resulting in higher accuracy. Consider a use case involving the extraction of data from standard U.S. tax forms. These forms feature three primary components: form data, checkboxes, and tables.

Form Data

Amazon Textract extracts form data as key-value pairs. For instance, in the following field, it recognizes “Name” as the key with “John Samuel” as its value.

Name: John Samuel

The code snippet below illustrates how we call the Amazon Textract Form API to extract key-value pairs:

import boto3

session = boto3.Session()

# Read the input file as bytes
with open(file_location, 'rb') as file:
    img_test = file.read()
    bytes_test = bytearray(img_test)

# Analyze the document with the FORMS feature to extract key-value pairs
client = session.client('textract')
response = client.analyze_document(
    Document={'Bytes': bytes_test},
    FeatureTypes=['FORMS'])

An example of a key-value pair block from the Amazon Textract Form API response is depicted below:

{
  "BlockType": "KEY_VALUE_SET",
  "Confidence": 78.43612670898438,
  "Geometry": {
    "BoundingBox": {
      "Width": 0.049202654510736465,
      "Height": 0.010872081853449345,
      "Left": 0.22027049958705902,
      "Top": 0.4176536798477173
    },
    "Polygon": [
      {
        "X": 0.22027049958705902,
        "Y": 0.4176536798477173
      },
      {
        "X": 0.269473135471344,
        "Y": 0.4176536798477173
      },
      {
        "X": 0.269473135471344,
        "Y": 0.42852574586868286
      },
      {
        "X": 0.22027049958705902,
        "Y": 0.42852574586868286
      }
    ]
  }
}
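To turn these blocks into usable key-value pairs, the response is post-processed: KEY blocks are linked to their VALUE blocks through `Relationships`, and each block's text is assembled from its child WORD blocks. A simplified sketch of such a parser (our naming; production code would also keep the confidence scores and geometry shown above) might look like:

```python
def extract_key_values(response):
    """Pair KEY and VALUE blocks from an analyze_document FORMS response
    (simplified sketch; duplicate keys overwrite earlier ones)."""
    blocks = {b["Id"]: b for b in response["Blocks"]}

    def block_text(block):
        # Concatenate the text of all child WORD blocks
        words = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                for cid in rel["Ids"]:
                    child = blocks[cid]
                    if child["BlockType"] == "WORD":
                        words.append(child["Text"])
        return " ".join(words)

    pairs = {}
    for block in blocks.values():
        if (block["BlockType"] == "KEY_VALUE_SET"
                and "KEY" in block.get("EntityTypes", [])):
            key_text = block_text(block)
            value_text = ""
            # Follow the VALUE relationship to the paired VALUE block
            for rel in block.get("Relationships", []):
                if rel["Type"] == "VALUE":
                    for vid in rel["Ids"]:
                        value_text = block_text(blocks[vid])
            pairs[key_text] = value_text
    return pairs
```

Applied to the example above, this would yield `{"Name": "John Samuel"}`.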
