Amazon IXD – VGT2 Las Vegas: Developing Advanced Workflows with Amazon MWAA, AWS Step Functions, AWS Glue, and Amazon EMR

Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a fully managed service that simplifies the execution of open-source versions of Apache Airflow on AWS, enabling the construction of workflows for your extract, transform, and load (ETL) jobs and data pipelines. AWS Step Functions can act as a serverless function orchestrator to create scalable big data pipelines, utilizing services like Amazon EMR for running Apache Spark and other open-source applications in a cost-effective manner. AWS Glue offers a serverless environment for preparing (extracting and transforming) and loading substantial datasets from diverse sources for analytics and data processing through Apache Spark ETL jobs.

In production environments, a typical scenario involves reading data from various sources. This data must undergo transformation to extract business insights before it is sent to downstream applications, including machine learning algorithms, analytics dashboards, and business reports.

This article illustrates how to leverage Amazon MWAA as the principal workflow management service to create and execute complex workflows while extending the directed acyclic graph (DAG) to initiate and monitor a state machine crafted with Step Functions. In Airflow, a DAG comprises all tasks you intend to run, organized to reflect their relationships and dependencies.
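In Airflow terms, even a minimal DAG follows this pattern. The sketch below uses hypothetical task names and the Airflow 1.10-style import path that Amazon MWAA supported at launch:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def extract():
    print("read data from the source systems")


def transform():
    print("derive business insights for downstream consumers")


# Tasks declared inside the DAG context become nodes in the graph
with DAG(
    dag_id="minimal_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # run only when triggered manually
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # The >> operator records the dependency: extract must finish first
    extract_task >> transform_task
```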

Architectural Overview

The diagram below provides an architectural overview of the components engaged in orchestrating the workflow. This workflow employs Amazon EMR to preprocess data and initiates a Step Functions state machine. The state machine is responsible for transforming data using AWS Glue.

The workflow consists of the following core components:

  • The Airflow Scheduler triggers the DAG according to a schedule or manually.
  • The DAG employs PythonOperator to create an EMR cluster and awaits the completion of the cluster creation process.
  • A custom operator, EmrSubmitAndMonitorStepOperator, is utilized to submit and monitor the Amazon EMR step.
  • The DAG incorporates a PythonOperator to shut down the EMR cluster once preprocessing tasks are finished.
  • Finally, the DAG initiates a Step Functions state machine and monitors it to completion using a PythonOperator.

You can independently construct intricate ETL pipelines with Step Functions and trigger them from an Airflow DAG.
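To make the task relationships concrete, here is a simplified, runnable skeleton of such a DAG. The callables are stubs and the task and dataset names are illustrative; the real implementations, including the custom EmrSubmitAndMonitorStepOperator, are in the GitHub repository:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


# Stub callables: the real implementations (Boto3 calls and the custom
# EmrSubmitAndMonitorStepOperator) are in the GitHub repository.
def create_emr_cluster(**context):
    pass


def wait_for_cluster_ready(**context):
    pass


def preprocess(dataset, **context):
    pass


def terminate_emr_cluster(**context):
    pass


def start_and_monitor_state_machine(**context):
    pass


with DAG(
    dag_id="mwaa_movielens_demo_skeleton",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    concurrency=3,  # at most three tasks run at the same time
    catchup=False,
) as dag:
    create = PythonOperator(
        task_id="create_emr_cluster",
        python_callable=create_emr_cluster,
        provide_context=True,
    )
    wait = PythonOperator(
        task_id="wait_for_cluster",
        python_callable=wait_for_cluster_ready,
        provide_context=True,
    )

    # In the real DAG these are EmrSubmitAndMonitorStepOperator tasks
    steps = [
        PythonOperator(
            task_id="preprocess_{}".format(name),
            python_callable=preprocess,
            op_kwargs={"dataset": name},
            provide_context=True,
        )
        for name in ("movies", "ratings", "tags")  # illustrative dataset names
    ]

    terminate = PythonOperator(
        task_id="terminate_emr_cluster",
        python_callable=terminate_emr_cluster,
        provide_context=True,
    )
    run_sfn = PythonOperator(
        task_id="run_state_machine",
        python_callable=start_and_monitor_state_machine,
        provide_context=True,
    )

    create >> wait >> steps >> terminate >> run_sfn
```

With concurrency set to 3, the preprocessing tasks run in parallel once the cluster is ready.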

Prerequisites

Before commencing, establish an Amazon MWAA environment. If this is your first time with Amazon MWAA, refer to Introducing Amazon Managed Workflows for Apache Airflow (MWAA).

Note the Amazon Simple Storage Service (Amazon S3) bucket that houses the DAGs, found on the environment details page of the Amazon MWAA console. Also, take note of the AWS Identity and Access Management (IAM) execution role. This role must be modified to permit Amazon MWAA to read from and write to your S3 bucket, submit an Amazon EMR step, start a Step Functions state machine, and read from the AWS Systems Manager Parameter Store. The IAM role can be found in the Permissions section of the environment details.

The solution refers to Systems Manager parameters in an AWS CloudFormation template and scripts. For guidance on adding or removing IAM identity permissions, consult Adding and Removing IAM Identity Permissions. A sample IAM policy is also available in the GitHub repository amazon-mwaa-complex-workflow-using-step-functions.
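For illustration only (the repository's sample policy is authoritative), the additional statements granted to the execution role might resemble the following. Several Resource values are left as wildcards or placeholders; scope them to your bucket, cluster, state machine, and parameter path in practice:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::YOUR_DAG_BUCKET", "arn:aws:s3:::YOUR_DAG_BUCKET/*"]
    },
    {
      "Effect": "Allow",
      "Action": [
        "elasticmapreduce:RunJobFlow",
        "elasticmapreduce:AddJobFlowSteps",
        "elasticmapreduce:DescribeCluster",
        "elasticmapreduce:DescribeStep",
        "elasticmapreduce:TerminateJobFlows"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": [
        "arn:aws:iam::ACCOUNT_ID:role/EMR_DefaultRole",
        "arn:aws:iam::ACCOUNT_ID:role/EMR_EC2_DefaultRole"
      ]
    },
    {
      "Effect": "Allow",
      "Action": ["states:StartExecution", "states:DescribeExecution"],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["ssm:GetParameter", "ssm:GetParameters"],
      "Resource": "*"
    }
  ]
}
```

The iam:PassRole statement is needed because creating a cluster passes the EMR service and instance profile roles to Amazon EMR.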

In this post, we utilize the MovieLens dataset. We concurrently convert the MovieLens CSV files to Parquet format and store them in Amazon S3 as part of the preprocessing phase.
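The preprocessing scripts live in the repository and run as EMR steps; conceptually, each one is a small PySpark job along these lines (the paths are placeholders passed in as step arguments):

```python
import sys

from pyspark.sql import SparkSession

# Hypothetical step arguments, e.g.:
#   s3://YOUR_BUCKET/input/ratings.csv  s3://YOUR_BUCKET/output/ratings/
input_path, output_path = sys.argv[1], sys.argv[2]

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read a MovieLens CSV with a header row, inferring column types
df = spark.read.csv(input_path, header=True, inferSchema=True)

# Write the same data back to Amazon S3 as Parquet
df.write.mode("overwrite").parquet(output_path)

spark.stop()
```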

Setting Up the State Machine Using Step Functions

Our solution enhances the ETL pipeline to execute a Step Functions state machine from the Airflow DAG. Step Functions allows for the creation of visual workflows that facilitate the rapid translation of business needs into technical specifications. With Step Functions, you can manage dependencies and handle failures using a JSON-based template. A workflow consists of a series of steps, including tasks, choices, parallel runs, and timeouts, with the output of one step serving as the input for the next. For additional use cases, refer to AWS Step Functions Use Cases.

The diagram below outlines the ETL process established via a Step Functions state machine.

In the workflow, the Process Data step executes an AWS Glue job, while the Get Job Status step periodically checks for job completion. The AWS Glue job reads the input datasets and generates output data for the most popular and top-rated movies. After the job concludes, the Run Glue Crawler step executes an AWS Glue crawler to catalog the data. This workflow also enables monitoring and response to failures at any stage.
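Expressed in Amazon States Language, the core of such a state machine could be sketched as follows. The job, Lambda function, and state names are illustrative (the deployed definition comes from the CloudFormation template), and the status-checking Lambda is assumed to echo the JobRunId alongside the JobRunState so the polling loop can continue:

```json
{
  "Comment": "Illustrative sketch of the ETL state machine",
  "StartAt": "Process Data",
  "States": {
    "Process Data": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun",
      "Parameters": {"JobName": "mwaa-movielens-glue-job"},
      "Next": "Wait"
    },
    "Wait": {
      "Type": "Wait",
      "Seconds": 30,
      "Next": "Get Job Status"
    },
    "Get Job Status": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:check-glue-job-status",
      "Next": "Job Complete?"
    },
    "Job Complete?": {
      "Type": "Choice",
      "Choices": [
        {"Variable": "$.JobRunState", "StringEquals": "SUCCEEDED", "Next": "Run Glue Crawler"},
        {"Variable": "$.JobRunState", "StringEquals": "FAILED", "Next": "Job Failed"}
      ],
      "Default": "Wait"
    },
    "Run Glue Crawler": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:start-glue-crawler",
      "End": true
    },
    "Job Failed": {
      "Type": "Fail",
      "Cause": "AWS Glue job did not succeed"
    }
  }
}
```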

Creating Resources

Establish your resources by following the installation instructions laid out in the amazon-mwaa-complex-workflow-using-step-functions README.md.

Executing the ETL Workflow

To execute your ETL workflow, follow these steps:

  1. In the Amazon MWAA console, select Open Airflow UI.
  2. Locate the mwaa_movielens_demo DAG.
  3. Activate the DAG.
  4. Choose the mwaa_movielens_demo DAG and click Graph View.

This action reveals the overall ETL pipeline managed by Airflow.

  5. To view the DAG code, click Code.

The code for the custom operator can be found in the amazon-mwaa-complex-workflow-using-step-functions GitHub repository.

  6. From the Airflow UI, select the mwaa_movielens_demo DAG and click Trigger DAG.
  7. Leave the Optional Configuration JSON box empty.

When the Airflow DAG executes, the first task uses PythonOperator to create an EMR cluster via Boto3. Boto3 is the AWS SDK for Python, enabling developers to create, configure, and manage AWS services such as Amazon Elastic Compute Cloud (Amazon EC2) and Amazon S3. Boto3 provides both an object-oriented API and low-level access to AWS services.
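Fleshing out the create_emr_cluster stub from the earlier skeleton, the callable might look roughly like this; the region, cluster name, instance types, and roles are placeholders, and the demo's version reads its settings from Parameter Store:

```python
import boto3


def create_emr_cluster(**context):
    """Create a transient EMR cluster and share its ID via XCom."""
    emr = boto3.client("emr", region_name="us-east-1")  # placeholder region
    response = emr.run_job_flow(
        Name="mwaa-movielens-demo",        # placeholder cluster name
        ReleaseLabel="emr-5.32.0",         # any Spark-capable release works
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # Keep the cluster alive (Waiting) between submitted steps
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    # Requires provide_context=True so "ti" is in the kwargs
    context["ti"].xcom_push(key="cluster_id", value=response["JobFlowId"])
```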

The second task waits until the EMR cluster is up and in the Waiting state. Once the cluster is ready, the data load task runs, followed by the data preprocessing tasks, which are launched in parallel using EmrSubmitAndMonitorStepOperator. The DAG's concurrency is set to 3, so up to three tasks run simultaneously; you can also raise the EMR step concurrency so the cluster executes multiple steps at once.
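One way to implement the wait task is Boto3's built-in EMR waiter, which polls DescribeCluster until the cluster reaches the RUNNING or WAITING state. A sketch, again with a placeholder region:

```python
import boto3


def wait_for_cluster_ready(**context):
    """Block until the EMR cluster leaves STARTING/BOOTSTRAPPING."""
    cluster_id = context["ti"].xcom_pull(
        key="cluster_id", task_ids="create_emr_cluster"
    )
    emr = boto3.client("emr", region_name="us-east-1")  # placeholder region

    # The built-in waiter polls DescribeCluster until RUNNING or WAITING
    waiter = emr.get_waiter("cluster_running")
    waiter.wait(
        ClusterId=cluster_id,
        WaiterConfig={"Delay": 30, "MaxAttempts": 60},  # up to ~30 minutes
    )
```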

After the data preprocessing tasks are completed, the EMR cluster is shut down, and the DAG initiates the Step Functions state machine to commence data transformation. The final task in the DAG oversees the Step Functions state machine’s completion.
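A sketch of that final callable, assuming a placeholder state machine ARN (the demo resolves the real one from Parameter Store):

```python
import json
import time

import boto3


def start_and_monitor_state_machine(**context):
    """Start the Step Functions execution and poll until it finishes."""
    sfn = boto3.client("stepfunctions", region_name="us-east-1")

    # Placeholder ARN; substitute your own state machine
    execution = sfn.start_execution(
        stateMachineArn=(
            "arn:aws:states:us-east-1:ACCOUNT_ID:"
            "stateMachine:mwaa-movielens-demo"
        ),
        input=json.dumps({}),
    )

    # Poll every 30 seconds until the execution leaves RUNNING
    while True:
        status = sfn.describe_execution(
            executionArn=execution["executionArn"]
        )["status"]
        if status == "RUNNING":
            time.sleep(30)
            continue
        if status != "SUCCEEDED":
            raise RuntimeError("State machine ended with status " + status)
        break
```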

The DAG run is expected to conclude in approximately 10 minutes.

Verifying the DAG Run

While the DAG is executing, you can inspect the task logs.

  1. From Graph View, select any task and click View Log.
  2. When the DAG starts the Step Functions state machine, verify the status on the Step Functions console.
  3. You can also track ETL process completion from the Airflow UI.
  4. In the Airflow UI, confirm completion from the log entries.

Querying the Data

After the successful execution of the Airflow DAG, two tables are created in the AWS Glue Data Catalog. To query the data with Amazon Athena, follow these steps:

  1. Access the Athena console and select Databases.
  2. Click on the mwaa-movielens-demo-db database.

You should see the two tables available.
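You can also run the query programmatically with Boto3. In the sketch below, the table name and results location are illustrative; substitute the table names you actually see in the database:

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")  # placeholder region

# Table name and output location are illustrative; substitute your own
response = athena.start_query_execution(
    QueryString=(
        'SELECT * FROM "mwaa-movielens-demo-db"."most_popular_movies" LIMIT 10'
    ),
    QueryExecutionContext={"Database": "mwaa-movielens-demo-db"},
    ResultConfiguration={"OutputLocation": "s3://YOUR_ATHENA_RESULTS_BUCKET/"},
)
print(response["QueryExecutionId"])
```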
