Efficient Processing of Energy Sector PDF Reports with Amazon Textract

In the energy and utilities sector, organizations frequently handle PDF documents for various purposes. Below are two scenarios we’ve explored:

Scenario 1

In upstream and midstream operations, service firms like XYZ Inspection Services produce PDF reports during meter inspections, sending them to clients as email attachments. Clients often seek to extract data from these reports digitally to develop long-term business analytics. We will illustrate this process using a sample report from XYZ.

Scenario 2

In the realm of commodity and energy trading, counterparties generate PDF reports for contract confirmations and trade validations, which are subsequently emailed. Upon receiving these attachments, clients may manually enter a 30- to 50-character alphanumeric confirmation code from the PDF to kick off the trade validation process. Manual data entry can lead to errors, causing delays and complications.

Current Customer Workflow

External vendors or clients send PDF attachments via email.
A team member manually downloads these PDF files.
The PDFs are then uploaded to a central repository for team collaboration, record-keeping, and storage.
Relevant teams access these reports for analysis and insights.
Information is manually input into other digital systems, such as databases, for further processing like analytics and transactions.

This blog post will present an architecture for efficiently processing PDF reports using Amazon Textract. With Amazon Textract, organizations can extract text, data points, geographic information, grades, or other critical data from PDF documents. The primary benefits of our architecture include automating manual workflow steps, enhancing analytics capabilities, and transforming business processes. Additionally, this approach removes the need for cumbersome PDF parsing libraries and the management of exceptions.

Reading Time: 7 minutes
Learning Level: Advanced (300)
Services Utilized: Amazon Textract, Amazon S3, Amazon SES, Amazon Route 53, AWS Lambda, Amazon DynamoDB, Amazon Athena, Amazon QuickSight

Solution Overview

The architecture outlined in Diagram 1 includes three key functionalities:

Automatically extract PDF attachments from emails and store them in Amazon S3.
Extract information from PDF reports using Amazon Textract and save it in Amazon DynamoDB.
Generate analytical reports with Amazon Athena and Amazon QuickSight as new data comes in.

Amazon Textract can extract all or selected values from reports, and the solution stores those values in Amazon DynamoDB, a NoSQL database. Organizations can then build a dashboard in Amazon QuickSight to visualize the values and identify trends and anomalies.
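As a minimal sketch of the storage step, the function below turns a dictionary of extracted fields into an item that could be written to DynamoDB. The function name `to_dynamodb_item`, the `report_id` partition key, and the field names are hypothetical and are not taken from the original solution.

```python
from decimal import Decimal, InvalidOperation


def to_dynamodb_item(report_id, fields):
    """Build an item dict from extracted fields (all values arrive as text)."""
    item = {'report_id': report_id}
    for name, raw in fields.items():
        value = raw.strip()
        try:
            # boto3's DynamoDB resource API expects Decimal, not float, for numbers
            number = Decimal(value)
        except InvalidOperation:
            # Not numeric: keep the cleaned-up text
            item[name] = value
            continue
        # NaN and Infinity parse as Decimal but are not valid DynamoDB numbers
        item[name] = number if number.is_finite() else value
    return item
```

The resulting item could then be written with a call such as `boto3.resource('dynamodb').Table('<table_name>').put_item(Item=item)`, assuming a table whose partition key is named `report_id`.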

Diagram 1 – Reference Architecture: Store and Visualize Information from PDFs using Amazon Textract, Amazon DynamoDB, and Amazon QuickSight.

The capabilities of this solution can be decoupled, allowing customers to select the components most relevant to their needs. For instance, some clients are already familiar with Amazon Textract but seek to process emails and attachments. Others might have existing analytics pipelines and reporting tools but require a method to digitally extract information from documents. This flexibility allows clients to customize and enhance their solutions.

Walkthrough

Step 1. Process Email Attachments

An email-receiving pipeline can be implemented as shown in Diagram 2 to automate email handling. Customers can establish a custom domain using AWS services such as Amazon SES and Amazon Route 53. For more insights, check out this blog post. Emails arrive in a customer-owned Amazon Simple Storage Service (Amazon S3) bucket, and we strongly recommend that customers encrypt data at rest. One way to do this is to enable default encryption on the S3 bucket with AWS Key Management Service (AWS KMS) using a customer managed key.

Diagram 2 – Reference Architecture: Set Up an Email-Receiving Pipeline

AWS Lambda will then process the email as a JSON message, storing the PDF document attachment in the S3 bucket. Here’s a snippet of sample code:

import email
import json
import os

import boto3

# Create the client once so warm Lambda invocations can reuse it
s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Get the S3 object that holds the raw email delivered by Amazon SES
    objectData = s3.get_object(Bucket='<bucket_name>', Key='<item_name>')
    emailContent = objectData['Body'].read().decode('utf-8')
    # Parse the raw email with the standard-library email package
    message = email.message_from_string(emailContent)
    try:
        # Assumes a multipart message whose second part is the PDF attachment
        attachment = message.get_payload()[1]
        # Write the attachment to a temporary location
        with open('/tmp/<file>.pdf', 'wb') as pdf_file:
            pdf_file.write(attachment.get_payload(decode=True))
        # Upload the file from the temporary location to the destination S3 bucket
        s3.upload_file('/tmp/<file>.pdf', '<bucket_name>', '<key_name>')
        # Clean up the temporary file
        os.remove('/tmp/<file>.pdf')
        return {
            'statusCode': 200,
            'body': json.dumps('<success message>')
        }
    except Exception as error:
        # Log the failure and report it to the caller
        print('<failure message>', error)
        return {
            'statusCode': 500,
            'body': json.dumps('<failure message>')
        }
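
The handler above assumes the PDF is always the second part of the message. As a more defensive sketch (not the original solution's code), the standard-library email package can walk every MIME part and pick out PDFs by content type or file name. The function name `extract_pdf_attachments` is hypothetical.

```python
import email
from email import policy


def extract_pdf_attachments(raw_email: bytes):
    """Return a list of (filename, pdf_bytes) for every PDF part in a raw email."""
    message = email.message_from_bytes(raw_email, policy=policy.default)
    attachments = []
    for part in message.walk():
        # Container parts (multipart/*) carry no payload of their own
        if part.is_multipart():
            continue
        filename = part.get_filename() or ''
        is_pdf = (part.get_content_type() == 'application/pdf'
                  or filename.lower().endswith('.pdf'))
        if is_pdf:
            # decode=True reverses the base64 transfer encoding
            attachments.append((filename, part.get_payload(decode=True)))
    return attachments
```

In the Lambda function, the bytes returned by `objectData['Body'].read()` could be passed to this function directly, without decoding them to a string first.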

Step 2. Extract Information from PDF Reports

The solution can initiate PDF document processing in two ways. One scenario involves customers receiving emails with PDF attachments, which are then stored in an S3 bucket as previously described. Alternatively, customers may already have PDF reports stored elsewhere and can batch upload them to an S3 bucket.
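
For the batch-upload path, a minimal sketch might look like the following. The function name `upload_pdfs`, the `reports/` prefix, and the folder layout are assumptions for illustration. The S3 client is passed in as a parameter, and `upload_file` is the standard boto3 S3 client method.

```python
from pathlib import Path


def upload_pdfs(s3_client, folder, bucket, prefix='reports/'):
    """Upload every .pdf file under folder to the bucket and return the keys used."""
    uploaded = []
    for path in sorted(Path(folder).rglob('*.pdf')):
        # Keep the sub-folder structure in the object key
        key = prefix + path.relative_to(folder).as_posix()
        s3_client.upload_file(str(path), bucket, key)
        uploaded.append(key)
    return uploaded
```

A call such as `upload_pdfs(boto3.client('s3'), './archive', '<bucket_name>')` would upload an existing local archive. If the bucket is configured with the object-creation trigger described in this post, each uploaded file should start the same extraction flow as an emailed attachment.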

Convert PDF to Image

Using a sample report from XYZ for testing, we will highlight the fields and values extracted and stored.

When PDF attachments are processed from email and stored in an S3 bucket, an S3 object creation event is triggered. This event will activate the AWS Lambda function seen in Diagram 1. The pdfium library will convert each page of the PDF document into an image stored temporarily. The Lambda function will then call the synchronous Textract APIs to process the image and return the values. For more details on synchronous and asynchronous use cases and patterns, refer to this authoritative source.
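
As a rough sketch of the two ends of that Lambda function, the code below reads the bucket and key from an S3 object-creation event and turns a Textract response into a plain dictionary of form fields. The function names are hypothetical, and the parsing assumes the response came from `analyze_document` called with `FeatureTypes=['FORMS']`; the post does not state which synchronous API or feature types the solution uses.

```python
from urllib.parse import unquote_plus


def objects_from_s3_event(event):
    """Return (bucket, key) pairs from an S3 object-creation event."""
    pairs = []
    for record in event.get('Records', []):
        bucket = record['s3']['bucket']['name']
        # Object keys arrive URL-encoded in S3 event notifications
        key = unquote_plus(record['s3']['object']['key'])
        pairs.append((bucket, key))
    return pairs


def _text_of(block, blocks_by_id):
    """Join the WORD children of a block into one string."""
    words = []
    for rel in block.get('Relationships', []):
        if rel['Type'] == 'CHILD':
            for child_id in rel['Ids']:
                child = blocks_by_id[child_id]
                if child['BlockType'] == 'WORD':
                    words.append(child['Text'])
    return ' '.join(words)


def form_fields(response):
    """Map each detected form key to the text of its value."""
    blocks_by_id = {b['Id']: b for b in response['Blocks']}
    fields = {}
    for block in response['Blocks']:
        is_key = (block['BlockType'] == 'KEY_VALUE_SET'
                  and 'KEY' in block.get('EntityTypes', []))
        if not is_key:
            continue
        value_text = ''
        for rel in block.get('Relationships', []):
            # A KEY block points at its VALUE block(s) through a VALUE relationship
            if rel['Type'] == 'VALUE':
                value_text = ' '.join(
                    _text_of(blocks_by_id[v], blocks_by_id) for v in rel['Ids'])
        fields[_text_of(block, blocks_by_id)] = value_text
    return fields
```

In the Lambda function, `response` would come from a call such as `boto3.client('textract').analyze_document(Document={'Bytes': image_bytes}, FeatureTypes=['FORMS'])`, made once per page image, where `image_bytes` is a placeholder for the converted page.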

By integrating this architecture, organizations can streamline their PDF reporting processes, which can be particularly beneficial in the energy sector. For additional insights on the topic, you may find this resource helpful.
