Amazon Onboarding with Learning Manager Chanci Turner

Many organizations extract information from scanned documents, such as PDFs, that contain tables and forms. Common examples include audit reports, tax documents, whitepapers, and customer feedback forms. From customer feedback, you might extract content such as product reviews or movie critiques. Analyzing the extracted text then reveals both individual and collective sentiment.

Manual data entry is one way to extract the data, but it is slow, costly, and error-prone. Simple optical character recognition (OCR) is another, but it needs manual configuration and must be adjusted whenever the input varies. Turning the extracted data into meaningful information also takes considerable time and expertise in data science, machine learning (ML), and natural language processing (NLP).

To address these challenges, AWS offers AI services such as Amazon Textract and Amazon Comprehend. These pre-trained services add built-in intelligence to your applications and workflows. They are based on the same deep learning technology that powers Amazon.com, and their APIs are continuously improved. They also require no prior ML experience.

Amazon Textract uses ML to extract printed text, handwriting, forms, and tables from documents without manual intervention or custom code. It returns the complete text of each document along with metadata such as page numbers and bounding boxes.

To derive more granular insights from a document, you may need to segment its headers and paragraphs into logical sections rather than treat all the extracted text as one block. Amazon Textract reports a bounding box for each piece of detected text. The height of the box hints at the font size and its left offset gives the indentation, and both help segment the text into paragraphs.
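As a rough illustration of the data involved, the sketch below pulls the text and bounding box of every LINE block out of a Textract-style response. The sample response is hand-made for this example (its values are illustrative, not real output), and the helper name extract_lines is hypothetical. The BoundingBox fields Left, Top, Width, and Height are ratios of the page dimensions.

```python
from typing import Any, Dict, List


def extract_lines(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect the text and bounding-box fields of every LINE block."""
    lines = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE and WORD blocks
        box = block["Geometry"]["BoundingBox"]
        lines.append({
            "text": block["Text"],
            "page": block.get("Page", 1),
            "left": box["Left"],      # indentation
            "top": box["Top"],        # vertical position
            "width": box["Width"],
            "height": box["Height"],  # rough proxy for font size
        })
    # Put the lines in reading order: by page, then top to bottom.
    lines.sort(key=lambda line: (line["page"], line["top"]))
    return lines


# Hand-made sample in the shape of a Textract response (illustrative values).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Page": 1,
         "Geometry": {"BoundingBox": {"Left": 0.0, "Top": 0.0, "Width": 1.0, "Height": 1.0}}},
        {"BlockType": "LINE", "Page": 1, "Text": "Ann: A tense and clever film.",
         "Geometry": {"BoundingBox": {"Left": 0.10, "Top": 0.15, "Width": 0.40, "Height": 0.015}}},
        {"BlockType": "LINE", "Page": 1, "Text": "Sample Movie",
         "Geometry": {"BoundingBox": {"Left": 0.10, "Top": 0.10, "Width": 0.25, "Height": 0.030}}},
        {"BlockType": "WORD", "Page": 1, "Text": "Sample",
         "Geometry": {"BoundingBox": {"Left": 0.10, "Top": 0.10, "Width": 0.12, "Height": 0.030}}},
    ]
}

lines = extract_lines(sample_response)
for line in lines:
    print(line["text"], line["left"], line["height"])
```

The flat dictionaries keep only the fields that the segmentation techniques need, which makes the later steps easier to reason about than the nested response.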

In this article, we explore key techniques for segmenting paragraphs by postprocessing Amazon Textract responses. We also use Amazon Comprehend to extract insights such as sentiment and named entities.

Techniques for Paragraph Segmentation

The techniques we will cover include:

Identifying paragraphs based on font sizes by postprocessing the Amazon Textract response.
Segmenting paragraphs using indentation via bounding box data.
Dividing document sections based on line spacing.
Recognizing paragraphs or statements based on full stops.
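The four techniques above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the code from the article's repository: the input is a list of dictionaries with text, left, top, and height keys in reading order, the function names are hypothetical, and the thresholds (ratio, indent_tol, gap_factor) are illustrative guesses that would need tuning per document layout.

```python
import re
from statistics import median
from typing import Any, Dict, List

Line = Dict[str, Any]  # keys used here: text, left, top, height


def group_by_font_size(lines: List[Line], ratio: float = 1.3) -> List[Dict[str, Any]]:
    """Start a new section at each line clearly taller than the median line."""
    if not lines:
        return []
    body_height = median(line["height"] for line in lines)
    sections: List[Dict[str, Any]] = []
    for line in lines:
        if line["height"] > ratio * body_height:
            sections.append({"header": line["text"], "lines": []})
        else:
            if not sections:  # body text that appears before any header
                sections.append({"header": None, "lines": []})
            sections[-1]["lines"].append(line)
    return sections


def split_paragraphs(lines: List[Line], indent_tol: float = 0.01,
                     gap_factor: float = 1.0) -> List[str]:
    """Start a new paragraph on an indented line or a large vertical gap."""
    if not lines:
        return []
    margin = min(line["left"] for line in lines)  # assumed left margin
    paragraphs = [[lines[0]["text"]]]
    for prev, cur in zip(lines, lines[1:]):
        gap = cur["top"] - (prev["top"] + prev["height"])
        indented = cur["left"] - margin > indent_tol   # indentation
        spaced = gap > gap_factor * prev["height"]     # line spacing
        if indented or spaced:
            paragraphs.append([cur["text"]])
        else:
            paragraphs[-1].append(cur["text"])
    return [" ".join(parts) for parts in paragraphs]


def split_sentences(text: str) -> List[str]:
    """Split a paragraph into statements after each full stop."""
    return [s.strip() for s in re.split(r"(?<=\.)\s+", text) if s.strip()]


# Illustrative input: one header followed by two reviews.
sample_lines = [
    {"text": "Sample Movie", "left": 0.10, "top": 0.10, "height": 0.030},
    {"text": "Ann: A tense and clever film.", "left": 0.10, "top": 0.15, "height": 0.015},
    {"text": "I would watch it again.", "left": 0.10, "top": 0.17, "height": 0.015},
    {"text": "Bob: Too long. The ending drags.", "left": 0.10, "top": 0.22, "height": 0.015},
]

sections = group_by_font_size(sample_lines)
paragraphs = split_paragraphs(sections[0]["lines"])
sentences = split_sentences(paragraphs[1])
print(sections[0]["header"], paragraphs, sentences)
```

Comparing each line against the median height, rather than a fixed value, is a design choice that lets the same threshold work across pages scanned at different sizes. The indentation check assumes a first-line indent convention and would need to change for layouts where all body text is indented relative to the header.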

Once you have segmented the paragraphs using these methods, you can gain further insights from the segmented text using Amazon Comprehend for various applications, such as:

Detecting key phrases in technical documents: For documents like whitepapers and proposals, you can segment the text into paragraphs using the provided library and then use Amazon Comprehend to identify key phrases.
Recognizing named entities in financial and legal documents: In specific scenarios, you may want to pinpoint key entities related to headings and subheadings. For instance, legal and financial documents can be segmented by headings and paragraphs to extract named entities using Amazon Comprehend.
Conducting sentiment analysis on product or movie reviews: Amazon Comprehend can help you track sentiment changes within paragraphs of product review documents and enable timely action if sentiments turn negative.

This article addresses the sentiment analysis use case. We use two sample movie review PDFs, available on GitHub. These documents feature movie titles as headers and reviews as paragraph content. Evaluating an entire page as a single unit is not ideal for capturing sentiment. Instead, we extract the text, identify reviewer names and comments, and calculate the sentiment of each review. From these results, we determine the overall sentiment for each movie.
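A minimal sketch of the per-review analysis follows. It assumes each segmented paragraph has the form "Reviewer: comment", which is a guess about the sample PDFs rather than a documented format, and it takes the overall sentiment of a movie to be the most frequent label among its reviews, which is only one simple way to aggregate. The client is passed in so that the sketch runs offline: StubComprehend is a hypothetical stand-in with canned replies, and in AWS you would pass boto3.client("comprehend") instead, whose detect_sentiment call returns Sentiment and SentimentScore in the same shape.

```python
from collections import Counter
from typing import Any, Dict, List


def analyze_reviews(comprehend: Any, paragraphs: List[str]) -> List[Dict[str, Any]]:
    """Split each paragraph into reviewer and comment, then score the comment."""
    results = []
    for paragraph in paragraphs:
        reviewer, sep, comment = paragraph.partition(":")
        if not sep:  # no "Reviewer:" prefix, so treat the whole text as the comment
            reviewer, comment = "", paragraph
        reply = comprehend.detect_sentiment(Text=comment.strip(), LanguageCode="en")
        results.append({
            "reviewer": reviewer.strip(),
            "comment": comment.strip(),
            "sentiment": reply["Sentiment"],
            "scores": reply["SentimentScore"],
        })
    return results


def overall_sentiment(results: List[Dict[str, Any]]) -> str:
    """Return the most frequent sentiment label across the reviews."""
    counts = Counter(result["sentiment"] for result in results)
    return counts.most_common(1)[0][0]


class StubComprehend:
    """Offline stand-in for the Comprehend client; replies are canned, not a model."""

    def detect_sentiment(self, Text: str, LanguageCode: str) -> Dict[str, Any]:
        negative = "drags" in Text
        return {
            "Sentiment": "NEGATIVE" if negative else "POSITIVE",
            "SentimentScore": {
                "Positive": 0.1 if negative else 0.9,
                "Negative": 0.9 if negative else 0.1,
                "Neutral": 0.0,
                "Mixed": 0.0,
            },
        }


reviews = [
    "Ann: A tense and clever film. I would watch it again.",
    "Bob: Too long. The ending drags.",
    "Cho: Beautifully shot.",
]
results = analyze_reviews(StubComprehend(), reviews)
print([(r["reviewer"], r["sentiment"]) for r in results])
print(overall_sentiment(results))
```

Scoring each review separately keeps a single strongly worded review from dominating the result for the whole page.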

Solution Overview

This solution leverages various AI services, serverless technologies, and managed services to create a scalable and cost-effective architecture:

Amazon Comprehend: An NLP service that uses ML to discover insights and relationships within text.
Amazon DynamoDB: A key-value and document database that offers single-digit millisecond performance at scale.
AWS Lambda: Runs code in response to triggers such as data changes or user actions. Because Amazon S3 can trigger a Lambda function directly, you can build a variety of real-time serverless data-processing systems.
Amazon Simple Notification Service (SNS): A fully managed messaging service that Amazon Textract uses to announce that an extraction job has finished.
Amazon Simple Storage Service (S3): Serves as an object store for documents, allowing for centralized management with precise access controls.
Amazon Textract: Utilizes ML to extract text and data from scanned documents in PDF, JPEG, or PNG formats.

The architecture of this solution is illustrated in the following diagram.

Workflow Steps

Our workflow comprises these steps:

1. A movie review document is uploaded to a designated S3 bucket.
2. The upload triggers a Lambda function through Amazon S3 Event Notifications.
3. The Lambda function starts an asynchronous Amazon Textract job to extract text from the uploaded document. The extraction runs in the background.
4. When the job finishes, Amazon Textract publishes an SNS notification that includes the job ID and status. The code for Steps 3 and 4 is in the file textraction-invocation.py.
5. A Lambda function receives the SNS notification and calls Amazon Textract to retrieve the complete extracted text. It uses both the text and the bounding box data that Amazon Textract returns. The code that extracts the bounding box data is in lambda-helper.py.
6. The Lambda function uses the bounding box data to identify headers and paragraphs. We discuss two document formats: one that varies left indentation and another that varies font size. The Lambda code for left indentation is in blog-code-format2.py, and the code for font size variation is in blog-code-format1.py.
7. After identifying the headers and paragraphs, Lambda calls Amazon Comprehend to analyze sentiment and then stores the results in DynamoDB.
8. DynamoDB holds the extracted information and insights for each document, with the document name as the key and the insights and paragraphs as the values.
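The Amazon Textract part of this workflow (starting the job from the S3 event, then collecting the results after the SNS notification) might look roughly like the sketch below. It is not the contents of textraction-invocation.py or lambda-helper.py. The function names, the ARNs, and the bucket and key values are hypothetical placeholders, and StubTextract is an offline stand-in so the sketch runs without AWS credentials. In a Lambda function you would pass boto3.client("textract"), whose start_document_text_detection and get_document_text_detection calls take the same keyword arguments.

```python
import json
from typing import Any, Dict, List
from urllib.parse import unquote_plus


def start_extraction(textract: Any, s3_event: Dict[str, Any],
                     topic_arn: str, role_arn: str) -> str:
    """Start an asynchronous text-detection job for the uploaded object."""
    record = s3_event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = unquote_plus(record["object"]["key"])  # keys in S3 events are URL-encoded
    reply = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        NotificationChannel={"SNSTopicArn": topic_arn, "RoleArn": role_arn},
    )
    return reply["JobId"]


def collect_blocks(textract: Any, sns_event: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Read the job ID from the SNS message and fetch every page of results."""
    message = json.loads(sns_event["Records"][0]["Sns"]["Message"])
    if message["Status"] != "SUCCEEDED":
        raise RuntimeError(f"Textract job {message['JobId']} ended as {message['Status']}")
    blocks: List[Dict[str, Any]] = []
    token = None
    while True:
        kwargs = {"JobId": message["JobId"]}
        if token:
            kwargs["NextToken"] = token
        page = textract.get_document_text_detection(**kwargs)
        blocks.extend(page.get("Blocks", []))
        token = page.get("NextToken")
        if not token:  # no more result pages
            return blocks


class StubTextract:
    """Offline stand-in for the Textract client that serves two canned result pages."""

    def start_document_text_detection(self, DocumentLocation, NotificationChannel):
        self.last_location = DocumentLocation
        return {"JobId": "job-123"}

    def get_document_text_detection(self, JobId, NextToken=None):
        if NextToken is None:
            return {"Blocks": [{"BlockType": "LINE", "Text": "first"}], "NextToken": "page-2"}
        return {"Blocks": [{"BlockType": "LINE", "Text": "second"}]}


# Hypothetical events and ARNs, trimmed to the fields the functions read.
s3_event = {"Records": [{"s3": {"bucket": {"name": "textract-demo-bucket"},
                                "object": {"key": "reviews/movie+review.pdf"}}}]}
sns_event = {"Records": [{"Sns": {"Message": json.dumps(
    {"JobId": "job-123", "Status": "SUCCEEDED"})}}]}

stub = StubTextract()
job_id = start_extraction(stub, s3_event,
                          "arn:aws:sns:us-east-1:111122223333:example-topic",
                          "arn:aws:iam::111122223333:role/example-role")
blocks = collect_blocks(stub, sns_event)
print(job_id, [block["Text"] for block in blocks])
```

Looping on NextToken matters because Amazon Textract returns long results in several pages, and reading only the first response would silently drop text.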

Deploying the Architecture with AWS CloudFormation

You can deploy an AWS CloudFormation template to provision the necessary AWS Identity and Access Management (IAM) roles, services, and components of the solution, including Amazon S3, Lambda, Amazon Textract, and Amazon Comprehend.

To launch the CloudFormation template, follow these instructions and specify the US East (N. Virginia) Region:

For BucketName, enter a name that starts with textract-demo- followed by a date (adding a date as a suffix makes the bucket name unique).
