Organizations within the energy and utilities sector frequently handle PDF documents for various applications. Below are two illustrative scenarios we’ve explored:
Scenario 1
In upstream and midstream operations, service providers such as Apex Inspection Services create PDF reports during meter evaluations. These reports are sent to clients as email attachments. Clients often want to extract data points from these reports digitally so they can build business analytics over time. We use a sample report from Apex to demonstrate how this can be accomplished.
Scenario 2
In commodity and energy trading, counterparties generate PDF reports for contract confirmations and trade validations and send them via email. After receiving these emails and attachments, clients may manually type a 30- to 50-character alphanumeric confirmation code from the PDF to start the trade validation process. Data entry errors at this step can cause delays and complications.
Currently, a typical customer workflow may consist of the following steps:
- External vendors or clients send PDF attachments through emails.
- A team member manually downloads the PDF attachments.
- These attachments are uploaded manually to a centralized location for cross-team collaboration and record-keeping. This location could be a repository or a shared folder.
- Relevant teams access, read, and analyze these reports.
- Information is manually entered into various digital systems, such as databases, for further processing, including analytics reporting or business transactions.
In this blog post, we present an architecture for the intelligent processing of PDF reports using Amazon Textract. With Amazon Textract, customers can extract text, data points, locations, grades, or other significant information from PDF documents. The key benefits of this architecture are the automation of the manual workflow steps listed above, enhanced analytics capabilities, and the transformation of business processes. It also removes the need for cumbersome PDF parsing libraries and the exception handling they require.
Time to read: 7 minutes
Learning level: Advanced (300)
Services used: Amazon Textract, Amazon S3, Amazon SES, Amazon Route 53, AWS Lambda, Amazon DynamoDB, Amazon Athena, Amazon QuickSight
Overview of Solution
The architecture diagram illustrates three core capabilities of this solution:
- Automate the extraction of PDF attachments from emails and store them in Amazon S3.
- Extract information from PDF reports using Amazon Textract and save it in Amazon DynamoDB.
- Generate analytical reports with Amazon Athena and Amazon QuickSight as new data arrives.
Amazon Textract can extract all or selected values from reports, which are then stored in Amazon DynamoDB, a key-value NoSQL database. Customers can create a dashboard in Amazon QuickSight to visualize these values for trend analysis and anomaly detection. For more details, see this related blog post.
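As a rough illustration of the storage step, the sketch below turns the field and value pairs extracted from one report into DynamoDB-style items. The table name `MeterReports` and the attribute names `report_id`, `field_name`, `field_value`, and `received_at` are hypothetical and not part of the original solution. `save_items` accepts any object that exposes `put_item(Item=...)`, which is the shape of a boto3 DynamoDB `Table` resource.

```python
from datetime import datetime, timezone


def build_items(report_id, extracted, received_at=None):
    """Turn {field: value} pairs from one report into DynamoDB-style items."""
    received_at = received_at or datetime.now(timezone.utc).isoformat()
    return [
        {
            "report_id": report_id,     # hypothetical partition key
            "field_name": field,        # hypothetical sort key
            "field_value": str(value),  # stored as a string; boto3 rejects Python floats
            "received_at": received_at,
        }
        for field, value in extracted.items()
    ]


def save_items(table, items):
    """Write items through any object exposing put_item(Item=...)."""
    for item in items:
        table.put_item(Item=item)
    return len(items)


def get_table(table_name="MeterReports"):
    """Return a boto3 Table resource (table name is an assumption).

    boto3 is imported lazily so the helpers above work without AWS access.
    """
    import boto3
    return boto3.resource("dynamodb").Table(table_name)
```

In a Lambda function this might be called as `save_items(get_table(), build_items(report_id, extracted))`, where `extracted` holds the values returned by Amazon Textract.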
The capabilities in this solution are modular, allowing customers to select the most relevant components for their needs. For instance, some customers are already familiar with Amazon Textract but express interest in email processing capabilities, while others have established analytics pipelines and reporting tools but require a method to digitally extract information from documents. The flexibility to combine and enhance this solution is a significant advantage.
Walkthrough
Step 1: Processing Email Attachments
An email receiving pipeline, as depicted in the corresponding diagram, can automate email handling. Customers can set up a custom domain using AWS services such as Amazon SES and Amazon Route 53. For detailed implementation instructions, refer to the guide. Incoming emails are delivered to a customer-managed Amazon Simple Storage Service (Amazon S3) bucket. We recommend encrypting data at rest, for example by using AWS Key Management Service (AWS KMS) with customer managed keys to encrypt the objects in the S3 bucket.
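One way to apply the encryption recommendation is to set default server-side encryption on the bucket. The sketch below builds the configuration for the S3 `put_bucket_encryption` call. The helper names are hypothetical, and the bucket name and key ARN are placeholders you would supply.

```python
def default_encryption_config(kms_key_arn):
    """Build a default-encryption configuration that uses SSE-KMS."""
    return {
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": kms_key_arn,
                },
                # S3 Bucket Keys reduce the number of requests made to AWS KMS
                "BucketKeyEnabled": True,
            }
        ]
    }


def enable_default_encryption(s3_client, bucket, kms_key_arn):
    """Apply the configuration using a boto3 S3 client (or a compatible object)."""
    s3_client.put_bucket_encryption(
        Bucket=bucket,
        ServerSideEncryptionConfiguration=default_encryption_config(kms_key_arn),
    )
```

A call might look like `enable_default_encryption(boto3.client("s3"), "<bucket_name>", "<kms_key_arn>")`. With SSE-KMS, the Lambda execution role that later reads the objects also needs `kms:Decrypt` permission on the key.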
AWS Lambda will then process the email as a JSON message and store the PDF document attachment in the S3 bucket. Below is a sample code snippet:
```python
import email
import json
import os

import boto3


def lambda_handler(event, context):
    # Initiate boto3 client
    s3 = boto3.client('s3')
    # Get the S3 object contents (the raw email delivered to the bucket)
    objectData = s3.get_object(Bucket='<bucket_name>', Key='<item_name>')
    emailContent = objectData['Body'].read().decode("utf-8")
    # Extract the message content and attachment
    message = email.message_from_string(emailContent)
    try:
        # Assumes the PDF is the second MIME part of the message
        attachment = message.get_payload()[1]
        # Write the attachment to a temporary location
        with open('/tmp/<file>.pdf', 'wb') as pdf_file:
            pdf_file.write(attachment.get_payload(decode=True))
        # Upload the file to the destination S3 bucket
        try:
            s3.upload_file('/tmp/<file>.pdf', '<bucket_name>', '<key_name>')
        except FileNotFoundError:
            print("<failure message>")
        # Clean up the temporary file
        os.remove('/tmp/<file>.pdf')
        return {
            'statusCode': 200,
            'body': json.dumps('<success message>')
        }
    except Exception:
        # Handle exception: log it and re-raise so the invocation is marked as failed
        print("<failure message>")
        raise
```
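The sample above assumes the PDF is always the second part of the message. As a possible alternative, the sketch below walks every MIME part and picks out PDFs by content type or file name, using only the Python standard library. The function name `extract_pdf_attachments` and the fallback name `attachment.pdf` are hypothetical.

```python
import email
from email import policy


def extract_pdf_attachments(raw_email_bytes):
    """Return (filename, content_bytes) for every PDF part in a raw email."""
    message = email.message_from_bytes(raw_email_bytes, policy=policy.default)
    pdfs = []
    for part in message.walk():
        # Container parts only hold other parts, so skip them
        if part.is_multipart():
            continue
        filename = part.get_filename() or ""
        is_pdf = (
            part.get_content_type() == "application/pdf"
            or filename.lower().endswith(".pdf")
        )
        if is_pdf:
            # decode=True reverses the base64 or quoted-printable transfer encoding
            content = part.get_payload(decode=True)
            pdfs.append((filename or "attachment.pdf", content))
    return pdfs
```

Reading the S3 object as bytes and passing it straight to this function also avoids the UTF-8 decoding step, which can fail on emails that use other character sets.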
Step 2: Extracting Information from PDF Reports
There are two methods for initiating the processing of PDF documents. One is when customers receive emails with PDF attachments, which are processed and stored in an S3 bucket as previously described. The alternative is for customers who already possess PDF reports stored elsewhere; they can batch upload these reports to an S3 bucket.
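For the batch upload path, a minimal sketch might look like the following. It maps each PDF in a local folder to an S3 key and uploads it with the boto3 `upload_file` call. The `incoming/` key prefix and the helper names are assumptions made for illustration.

```python
from pathlib import Path


def plan_uploads(local_dir, key_prefix="incoming/"):
    """Map each PDF under local_dir to the S3 key it would be uploaded to."""
    root = Path(local_dir)
    return [
        (str(path), key_prefix + path.relative_to(root).as_posix())
        # The "*.pdf" match is case-sensitive on Linux file systems
        for path in sorted(root.rglob("*.pdf"))
    ]


def upload_reports(s3_client, bucket, local_dir, key_prefix="incoming/"):
    """Upload every planned file with a boto3 S3 client (or a compatible object)."""
    uploaded = []
    for filename, key in plan_uploads(local_dir, key_prefix):
        s3_client.upload_file(filename, bucket, key)
        uploaded.append(key)
    return uploaded
```

Each uploaded object then raises the same S3 object creation event as a PDF that arrived by email, so both entry points can share the downstream processing.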
Before extraction, each PDF is converted to images. We use a sample report from Apex for testing to illustrate the fields and values that are extracted and stored.
When PDF attachments are processed from emails and stored in the S3 bucket, an S3 object creation event is triggered. This event activates the AWS Lambda function. The Lambda function utilizes the pdfium library to convert each page of the PDF document into an image, which is saved temporarily. Subsequently, the Lambda function invokes the Textract API to process the image and returns the extracted values. For further guidance on synchronous and asynchronous use cases, please refer to the conclusion section. Image conversion from PDF can also accommodate edge cases, such as extracting 30- to 50-character alphanumeric strings. If customers have multi-page PDFs, the Lambda function will need to iterate through all pages until completion.
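The event handling and text extraction described above can be sketched as follows. The helpers read the bucket and key from an S3 event notification, call the synchronous Textract `detect_document_text` API on one page image, and search the returned lines for a confirmation code. The regular expression is an assumption about the code format (30 to 50 consecutive letters or digits), and all function names are hypothetical. The PDF-to-image conversion is omitted here.

```python
import re
from urllib.parse import unquote_plus

# Assumed format: 30 to 50 consecutive letters or digits, not part of a longer run
CODE_PATTERN = re.compile(r"(?<![A-Za-z0-9])[A-Za-z0-9]{30,50}(?![A-Za-z0-9])")


def objects_from_s3_event(event):
    """Yield (bucket, key) for each record in an S3 event notification."""
    for record in event.get("Records", []):
        s3_info = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications
        yield s3_info["bucket"]["name"], unquote_plus(s3_info["object"]["key"])


def lines_from_textract(response):
    """Collect the text of every LINE block in a Textract response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def find_confirmation_codes(lines):
    """Return every substring that matches the assumed confirmation code format."""
    codes = []
    for line in lines:
        codes.extend(CODE_PATTERN.findall(line))
    return codes


def extract_lines_from_image(textract_client, image_bytes):
    """Run a synchronous Textract call on one page image (PNG or JPEG bytes)."""
    response = textract_client.detect_document_text(Document={"Bytes": image_bytes})
    return lines_from_textract(response)
```

For a multi-page PDF, the Lambda function would call `extract_lines_from_image` once per page image and combine the results before searching for codes or storing values.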