Extract, transform, and load (ETL) orchestration is a widely used method for constructing big data pipelines. Effective orchestration for parallel ETL processes necessitates the integration of various tools to execute distinct operations. To streamline this orchestration, AWS Glue workflows can be utilized. This article illustrates how to achieve parallel ETL orchestration through AWS Glue workflows and triggers. Additionally, we will showcase how to apply custom classifiers with AWS Glue crawlers to categorize fixed-width data files.
AWS Glue workflows offer both visual and programmatic avenues to design data pipelines by incorporating AWS Glue crawlers for schema discovery, along with AWS Glue Spark and Python shell jobs to transform data. A workflow is made up of one or more task nodes arranged in a graphical representation. You can define relationships and pass parameters between task nodes, allowing you to construct pipelines of varying complexity. Workflows can be triggered either on a schedule or on-demand, and you can monitor the progress of individual nodes or the entire workflow, simplifying the troubleshooting of your pipelines.
If you aim to automatically create a table definition for data that does not conform to AWS Glue’s built-in classifiers, a custom classifier must be defined. For instance, if your data originates from a mainframe system utilizing a COBOL copybook data structure, defining a custom classifier during the crawling process is essential for extracting the schema. AWS Glue crawlers allow you to implement a custom classifier to categorize your data. This classifier can be created using a Grok pattern, an XML tag, JSON, or CSV. When initiated, the crawler invokes a custom classifier, and if it recognizes the data, it stores the classification and schema in the AWS Glue Data Catalog.
Use Case
For this article, we examine the ingestion of Automated Clearing House (ACH) and check payments data. ACH is a computer-based electronic network for processing transactions, while a check is a negotiable instrument drawn against deposited funds that directs payment of a specific amount to the recipient on demand. Both the ACH and check payments data files arrive in fixed-width format and must be ingested incrementally into the data lake over time. During ingestion, the two data types are merged to provide a unified view of all payments: ACH and check payment records are consolidated into a single table that can be queried for business analytics with Amazon Athena.
Solution Overview
We define an AWS Glue crawler with a custom classifier for each file or data type. An AWS Glue workflow orchestrates the entire process, triggering crawlers to run concurrently. Once the crawlers complete their tasks, the workflow initiates an AWS Glue ETL job to process the input data files. The workflow monitors the completion of the ETL job, which performs data transformation and updates the table metadata in the AWS Glue Data Catalog.
The accompanying diagram illustrates a typical workflow for ETL workloads. This post includes an AWS CloudFormation template that sets up the resources described by the AWS Glue workflow architecture. AWS CloudFormation allows you to model, provision, and manage AWS resources by treating infrastructure as code.
The CloudFormation template generates the following resources:
- An AWS Glue workflow trigger that starts manually. This trigger initiates two crawlers simultaneously for processing the ACH and check payments data files.
- Custom classifiers for parsing incoming fixed-width files containing ACH and check data.
- AWS Glue crawlers:
  - A crawler to classify ACH payments in the RAW database. This crawler uses the custom classifier defined for the ACH payments raw data and creates a table named ach in the Data Catalog's RAW database.
  - A crawler to classify check payments. It uses the custom classifier defined for the check payments raw data and creates a table named check in the Data Catalog's RAW database.
- An AWS Glue ETL job that runs once both crawlers are complete. This ETL job reads the ach and check tables, performs transformations using PySpark DataFrames, writes the output to a target Amazon Simple Storage Service (Amazon S3) location, and updates the Data Catalog for the processed payment table with new hourly partitions (a sketch of this job appears after the list).
- S3 buckets designated as RawDataBucket, ProcessedBucket, and ETLBucket. RawDataBucket stores the raw payment data received from the source system, while ProcessedBucket holds the output after AWS Glue transformations have been applied. This data is suitable for consumption by end-users via Athena. ETLBucket contains the AWS Glue ETL code utilized for data processing within the workflow.
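The ETL step can be pictured with the following PySpark sketch. It is a minimal illustration, not the exact job shipped in the template: the database, table, and column names (glue-database-raw, ach, check, glue-database-processed, processedpayment, acct_num, orig_pmt_date, pymt_type) come from this walkthrough, while the target_path job argument and the run_hour partition column are hypothetical.

import sys
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME", "target_path"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the tables the crawlers created in the raw database
ach_df = glue_context.create_dynamic_frame.from_catalog(
    database="glue-database-raw", table_name="ach").toDF()
check_df = glue_context.create_dynamic_frame.from_catalog(
    database="glue-database-raw", table_name="check").toDF()

# Tag each record with its payment type and merge the two sources
payments = (
    ach_df.select("acct_num", "orig_pmt_date").withColumn("pymt_type", F.lit("ACH"))
    .unionByName(
        check_df.select("acct_num", "orig_pmt_date").withColumn("pymt_type", F.lit("CHECK")))
    # run_hour is a hypothetical hourly partition column derived from the run time
    .withColumn("run_hour", F.date_format(F.current_timestamp(), "yyyy-MM-dd-HH"))
)

# Write to the processed bucket and register new partitions in the Data Catalog
sink = glue_context.getSink(
    connection_type="s3",
    path=args["target_path"],
    enableUpdateCatalog=True,
    partitionKeys=["run_hour"])
sink.setCatalogInfo(catalogDatabase="glue-database-processed",
                    catalogTableName="processedpayment")
sink.setFormat("glueparquet")
sink.writeFrame(DynamicFrame.fromDF(payments, glue_context, "payments"))

job.commit()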
Create Resources with AWS CloudFormation
To create your resources using the CloudFormation template, follow these steps in the console (a programmatic alternative follows the list):
- Select “Launch Stack.”
- Click “Next.”
- Click “Next” again.
- On the Review page, check the box indicating you acknowledge that AWS CloudFormation may create IAM resources.
- Click “Create stack.”
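If you prefer to script stack creation instead of clicking through the console, a boto3 sketch such as the following achieves the same result. The stack name and template URL here are placeholders; replace them with the values behind the Launch Stack button.

import boto3

cloudformation = boto3.client("cloudformation")

# Placeholders: substitute your stack name and the template URL from the Launch Stack link
cloudformation.create_stack(
    StackName="glue-parallel-etl-demo",
    TemplateURL="https://example-bucket.s3.amazonaws.com/glue-workflow-template.yaml",
    Capabilities=["CAPABILITY_IAM"],  # same acknowledgement as the checkbox on the Review page
)

# Block until every resource in the stack has been created
cloudformation.get_waiter("stack_create_complete").wait(StackName="glue-parallel-etl-demo")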
Examine Custom Classifiers for Fixed Width Files
Now, let’s review the definition of the custom classifier.
- On the AWS Glue console, select “Crawlers.”
- Choose the crawler ach-crawler.
- Select the RawACHClassifier classifier and review the Grok pattern.
This pattern assumes that the first 16 characters in the fixed-width file are designated for acct_num, and the subsequent 10 characters are reserved for orig_pmt_date. When a crawler identifies a matching classifier, the classification string and schema are utilized in the definition of tables recorded in your Data Catalog.
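To see what such a classifier looks like when defined programmatically, here is a hedged boto3 sketch. The classification label, custom pattern names, and the trailing GREEDYDATA capture are illustrative rather than the exact definition the CloudFormation template deploys.

import boto3

glue = boto3.client("glue")

# Illustrative fixed-width layout: first 16 characters -> acct_num, next 10 -> orig_pmt_date
glue.create_classifier(
    GrokClassifier={
        "Name": "RawACHClassifier",
        "Classification": "fixedwidth",  # arbitrary label recorded on matched tables
        "GrokPattern": "%{ACCT_NUM:acct_num}%{ORIG_PMT_DATE:orig_pmt_date}%{GREEDYDATA:remaining_fields}",
        "CustomPatterns": "ACCT_NUM .{16}\nORIG_PMT_DATE .{10}",
    }
)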
Run the Workflow
To execute your workflow, complete the following:
- On the AWS Glue console, choose the workflow created by the CloudFormation template.
- From the Actions menu, select “Run.”
This will initiate the workflow.
When the workflow concludes, navigate to the History tab and select “View run details.”
You can review a graph illustrating the workflow.
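You can also start and monitor the same workflow without the console. Here is a minimal boto3 sketch, assuming a workflow name of glue-payments-workflow, which is a placeholder; use the name shown on your Workflows page.

import time
import boto3

glue = boto3.client("glue")
workflow_name = "glue-payments-workflow"  # placeholder; use the name created by the stack

# Start the workflow (equivalent to Actions -> Run in the console)
run_id = glue.start_workflow_run(Name=workflow_name)["RunId"]

# Poll until the crawlers and the ETL job have all finished
while True:
    status = glue.get_workflow_run(Name=workflow_name, RunId=run_id)["Run"]["Status"]
    if status in ("COMPLETED", "STOPPED", "ERROR"):
        break
    time.sleep(30)

print(f"Workflow run {run_id} finished with status {status}")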
Examine the Tables
On the AWS Glue console, under Databases, locate the database named glue-database-raw, which contains two tables named ach and check. These tables were created by the respective AWS Glue crawlers using the custom classification patterns defined earlier.
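If you would rather verify the tables programmatically, a small boto3 sketch follows; the database name is the one shown in this walkthrough, so adjust it if your stack uses a different name.

import boto3

glue = boto3.client("glue")

# Expect the two tables the crawlers created: ach and check
tables = glue.get_tables(DatabaseName="glue-database-raw")["TableList"]
print([table["Name"] for table in tables])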
Query Processed Data
To query your data, follow these steps:
- On the AWS Glue console, select the database glue-database-processed.
- From the Action menu, choose “View data.”
The Athena console opens. If this is your first time using Athena, you must configure an S3 bucket to store query results before running a query.
In the query editor, run the following query:
select acct_num, pymt_type, count(pymt_type)
from glue_database_processed.processedpayment
group by acct_num, pymt_type;
You will see the count of payment types for each account displayed from the processedpayment table.
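The same query can also be run outside the console. Here is a hedged boto3 sketch; the results bucket in OutputLocation is a placeholder you must supply.

import time
import boto3

athena = boto3.client("athena")

query = """
select acct_num, pymt_type, count(pymt_type)
from glue_database_processed.processedpayment
group by acct_num, pymt_type
"""

# OutputLocation is a placeholder; point it at a bucket you own for query results
execution_id = athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={"OutputLocation": "s3://your-athena-results-bucket/"},
)["QueryExecutionId"]

# Wait for the query to finish, then fetch the result rows
while True:
    state = athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

for row in athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]:
    print([col.get("VarCharValue") for col in row["Data"]])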
Clean Up
To prevent ongoing charges, clean up your infrastructure by deleting the CloudFormation stack. You must empty the S3 buckets first, because stack deletion fails if the buckets still contain objects (a scripted alternative to these steps follows the list).
- In the Amazon S3 console, select each bucket created by the CloudFormation stack.
- Choose “Empty.”
- In the AWS CloudFormation console, select the stack you created.
- Click “Delete.”
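The cleanup can also be scripted. The following boto3 sketch assumes placeholder bucket and stack names; replace them with the values from your stack's Resources tab.

import boto3

s3 = boto3.resource("s3")
cloudformation = boto3.client("cloudformation")

# Bucket names are placeholders; use the buckets listed in the stack's Resources tab
for bucket_name in ["raw-data-bucket", "processed-bucket", "etl-bucket"]:
    s3.Bucket(bucket_name).objects.all().delete()  # buckets must be empty before stack deletion

# Delete the stack and wait for the deletion to complete
cloudformation.delete_stack(StackName="glue-parallel-etl-demo")
cloudformation.get_waiter("stack_delete_complete").wait(StackName="glue-parallel-etl-demo")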
Conclusion
In this article, we explored how AWS Glue workflows, triggers, and crawlers with custom classifiers streamline the orchestration of parallel ETL processes, from classifying fixed-width payment files to producing a consolidated payments table you can query with Amazon Athena.