Orchestrating extract, transform, and load (ETL) processes is a widely used approach to building big data pipelines, and efficient parallel ETL processing typically requires multiple tools, each executing a different task. AWS Glue workflows can streamline this orchestration. This article illustrates how to achieve parallel ETL orchestration with AWS Glue workflows and triggers. Additionally, we explore how to integrate custom classifiers with AWS Glue crawlers to classify fixed-width data files.
AWS Glue workflows offer both visual and programmatic capabilities to create data pipelines by integrating AWS Glue crawlers for schema discovery along with AWS Glue Spark and Python shell jobs for data transformation. A workflow consists of one or more task nodes arranged in a graph. You can define the relationships between these nodes and pass parameters between them to construct pipelines of varying complexity. Workflows can be triggered either on a schedule or on demand, and you can monitor the progress of individual nodes or the entire workflow, which makes troubleshooting easier.
To automatically generate a table definition for data that does not align with AWS Glue’s built-in classifiers, you need to define a custom classifier. For instance, when dealing with data from a mainframe system that employs a COBOL copybook data structure, a custom classifier is essential during the crawling process to extract the schema. AWS Glue crawlers allow for the inclusion of a custom classifier that can identify your data. You can create this classifier using a Grok pattern, an XML tag, JSON, or CSV. When initiated, the crawler invokes the custom classifier; if the classifier identifies the data, it saves the classification and schema to the AWS Glue Data Catalog.
Use Case
In this article, we use automated clearing house (ACH) and check payments data ingestion as a case study. ACH is a computer-based electronic network for processing transactions, while check payments are transactions drawn against deposited funds to pay a specified amount to the recipient on demand. Both ACH and check payments data files, formatted as fixed-width, require incremental ingestion into the data lake over time. During the ingestion process, these two data types must be merged to create a consolidated view of all payments. The records from ACH and check payments are combined into a table that is useful for business analytics with Amazon Athena.
Solution Overview
We establish an AWS Glue crawler equipped with a custom classifier for each file or data type. An AWS Glue workflow orchestrates the entire process, triggering the crawlers to run simultaneously. Upon completion of the crawlers, the workflow initiates an AWS Glue ETL job to process the input data files. This workflow monitors the completion of the ETL job, which transforms the data and updates the table metadata in the AWS Glue Data Catalog.
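The CloudFormation template described later creates this orchestration for you, but the same graph can also be wired up programmatically. The following is a minimal boto3 sketch of the trigger wiring, not the template's actual definitions; the workflow name, the check crawler name, and the ETL job name are placeholders (only ach-crawler appears later in this post).

import boto3

glue = boto3.client("glue")

# Create the workflow container (name is a placeholder).
glue.create_workflow(Name="payments-workflow")

# On-demand trigger that starts both crawlers in parallel.
glue.create_trigger(
    Name="start-crawlers",
    WorkflowName="payments-workflow",
    Type="ON_DEMAND",
    Actions=[
        {"CrawlerName": "ach-crawler"},
        {"CrawlerName": "check-crawler"},   # placeholder name
    ],
)

# Conditional trigger that starts the ETL job only after both crawlers succeed.
glue.create_trigger(
    Name="start-etl-after-crawlers",
    WorkflowName="payments-workflow",
    Type="CONDITIONAL",
    StartOnCreation=True,
    Predicate={
        "Logical": "AND",
        "Conditions": [
            {"LogicalOperator": "EQUALS", "CrawlerName": "ach-crawler", "CrawlState": "SUCCEEDED"},
            {"LogicalOperator": "EQUALS", "CrawlerName": "check-crawler", "CrawlState": "SUCCEEDED"},
        ],
    },
    Actions=[{"JobName": "process-payments-job"}],   # placeholder job name
)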
The accompanying diagram illustrates a typical workflow for ETL workloads.
This article includes an AWS CloudFormation template that creates the resources described in the AWS Glue workflow architecture. AWS CloudFormation allows you to model, provision, and manage AWS resources, treating infrastructure as code.
The CloudFormation template generates the following resources:
- An AWS Glue workflow trigger initiated manually, which simultaneously starts two crawlers for processing the data files associated with ACH and check payments.
- Custom classifiers for parsing incoming fixed-width files containing ACH and check data.
- AWS Glue crawlers:
- One crawler to classify ACH payments in the RAW database, utilizing the custom classifier defined for ACH payments. This crawler creates a table named ach in the Data Catalog’s RAW database.
- Another crawler to classify check payments, using its respective custom classifier, which creates a table named check in the Data Catalog’s RAW database.
- An AWS Glue ETL job that executes once both crawlers have finished. This ETL job reads from the ach and check tables, applies transformations using PySpark DataFrames, writes the output to a designated Amazon Simple Storage Service (Amazon S3) location, and updates the Data Catalog with new hourly partitions for the processed payment table (a PySpark sketch of this logic follows this list).
- S3 buckets labeled as RawDataBucket, ProcessedBucket, and ETLBucket. The RawDataBucket retains the raw payment data received from the source system, while the ProcessedBucket stores the output post-transformation. This data is suitable for end-user access via Athena. The ETLBucket contains the AWS Glue ETL code utilized in the workflow.
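The ETL job itself is provided by the template; the following PySpark sketch only approximates its logic. Database and table names are taken from this post where stated (glue-database-raw, ach, check, processedpayment), while the payment-type tagging, the hourly partition column, and the processed_bucket job parameter are assumptions for illustration.

import sys
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

# processed_bucket is an assumed job parameter pointing at the ProcessedBucket.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "processed_bucket"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the tables the crawlers created in the RAW database.
ach = glue_context.create_dynamic_frame.from_catalog(
    database="glue-database-raw", table_name="ach").toDF()
check = glue_context.create_dynamic_frame.from_catalog(
    database="glue-database-raw", table_name="check").toDF()

# Tag each record with its payment type and merge the two sources
# (assumes the two schemas largely align; requires Glue 3.0+ / Spark 3.1+).
payments = (ach.withColumn("pymt_type", F.lit("ACH"))
            .unionByName(check.withColumn("pymt_type", F.lit("CHECK")),
                         allowMissingColumns=True))

# Hourly partition column, assumed to be derived from processing time.
payments = payments.withColumn(
    "ingest_hour", F.date_format(F.current_timestamp(), "yyyyMMddHH"))

# Write Parquet to S3 and register new partitions in the Data Catalog.
sink = glue_context.getSink(
    connection_type="s3",
    path=f"s3://{args['processed_bucket']}/processedpayment/",
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["ingest_hour"])
sink.setFormat("glueparquet")
sink.setCatalogInfo(catalogDatabase="glue_database_processed",
                    catalogTableName="processedpayment")
sink.writeFrame(DynamicFrame.fromDF(payments, glue_context, "payments"))

job.commit()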
Creating Resources with AWS CloudFormation
To set up your resources using the CloudFormation template, follow these steps:
- Click on Launch Stack.
- Click Next.
- Click Next again.
- On the Review page, check the box indicating that you acknowledge AWS CloudFormation may create IAM resources.
- Click Create stack.
Examining Custom Classifiers for Fixed-Width Files
Let’s take a closer look at the custom classifier definition.
- Go to the AWS Glue console and select Crawlers.
- Click on the crawler named ach-crawler.
- Select the RawACHClassifier classifier and review the Grok pattern.
This pattern presumes that the first 16 characters in the fixed-width file are designated for acct_num, while the next 10 characters are reserved for orig_pmt_date. When a crawler identifies a matching classifier for the data, the classification string and schema are utilized in the definitions of the tables written to your Data Catalog.
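A Grok classifier for this layout can also be created programmatically. The sketch below is a minimal boto3 example and is not the template's exact definition: it covers only the two leading fields described above, and the classification string and the handling of the remaining characters are assumptions.

import boto3

glue = boto3.client("glue")

glue.create_classifier(
    GrokClassifier={
        "Name": "RawACHClassifier",
        "Classification": "fixedwidth_ach",   # assumed classification string
        # First 16 characters -> acct_num, next 10 -> orig_pmt_date,
        # everything after that captured as a single remaining field.
        "GrokPattern": "%{ACCT_NUM:acct_num}%{ORIG_PMT_DATE:orig_pmt_date}%{GREEDYDATA:rest}",
        "CustomPatterns": "ACCT_NUM .{16}\nORIG_PMT_DATE .{10}",
    }
)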
Running the Workflow
To execute your workflow, complete the following steps:
- In the AWS Glue console, select the workflow created by the CloudFormation template.
- From the Actions menu, select Run.
This action will initiate the workflow.
Once the workflow is finished, navigate to the History tab and select View run details.
The run details page displays a graph of the workflow and the status of each node.
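You can also start and monitor the run programmatically. A minimal boto3 sketch, assuming a placeholder workflow name (substitute the name created by the stack), might look like this:

import time
import boto3

glue = boto3.client("glue")
workflow_name = "payments-workflow"  # placeholder; use the name created by the stack

run_id = glue.start_workflow_run(Name=workflow_name)["RunId"]

# Poll until the workflow run reaches a terminal state.
while True:
    run = glue.get_workflow_run(Name=workflow_name, RunId=run_id)["Run"]
    status = run["Status"]
    print(status, run.get("Statistics", {}))
    if status in ("COMPLETED", "STOPPED", "ERROR"):
        break
    time.sleep(30)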
Examining the Tables
In the Databases section of the AWS Glue console, you will find a database named glue-database-raw, which contains two tables named ach and check. These tables are created by their respective AWS Glue crawlers using the specified custom classification pattern.
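You can confirm the crawler output programmatically as well; a short boto3 call lists the tables in the RAW database:

import boto3

glue = boto3.client("glue")

# List the tables the crawlers created in the RAW database.
tables = glue.get_tables(DatabaseName="glue-database-raw")["TableList"]
print([t["Name"] for t in tables])  # expected to include 'ach' and 'check'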
Querying Processed Data
To query your data, follow these steps:
- In the AWS Glue console, select the database glue-database-processed.
- From the Action menu, choose View data.
This will open the Athena console. If it’s your first time using Athena, you will need to set up the S3 bucket for storing query results.
In the query editor, execute the following query:
select acct_num, pymt_type, count(pymt_type)
from glue_database_processed.processedpayment
group by acct_num, pymt_type;
This query will display the count of payment types associated with each account from the processed payment table.
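The same query can also be run outside the console. The sketch below uses boto3 and assumes an S3 location you own for query results (the bucket name shown is a placeholder):

import time
import boto3

athena = boto3.client("athena")

query = """
select acct_num, pymt_type, count(pymt_type)
from glue_database_processed.processedpayment
group by acct_num, pymt_type
"""

# The results location is an assumption; point it at a bucket you own.
execution_id = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "glue_database_processed"},
    ResultConfiguration={"OutputLocation": "s3://your-athena-results-bucket/queries/"},
)["QueryExecutionId"]

# Wait for the query to finish, then print the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])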
Clean Up
To prevent ongoing charges, make sure to clean up your infrastructure by deleting the CloudFormation stack. However, you must first empty your S3 buckets.
- In the Amazon S3 console, select each bucket created by the CloudFormation stack.
- Click Empty.
- In the AWS CloudFormation console, select the stack you created.
- Click Delete.
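These cleanup steps can also be scripted. A minimal boto3 sketch follows; the bucket and stack names are placeholders for the ones your stack actually created.

import boto3

s3 = boto3.resource("s3")
cloudformation = boto3.client("cloudformation")

# Bucket names below are placeholders; use the ones created by your stack.
for bucket_name in ("raw-data-bucket", "processed-bucket", "etl-bucket"):
    s3.Bucket(bucket_name).objects.all().delete()  # buckets must be empty before stack deletion

cloudformation.delete_stack(StackName="glue-workflow-stack")  # placeholder stack name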
Conclusion
In this article, we explored how AWS Glue workflows can facilitate effective parallel ETL orchestration, a critical component in big data processing, and how custom classifiers enable AWS Glue crawlers to parse fixed-width data files.