Data preprocessing is a crucial component of any machine learning (ML) workflow, encompassing activities such as data cleaning, exploration, and transformation. AWS Glue DataBrew, introduced at AWS re:Invent 2020, is a visual data preparation tool that lets users build standard data processing workflows without writing code or installing any software.
In this post, we will illustrate how to seamlessly integrate essential data preparation processes with the training of an ML model and the inference on a pre-trained model utilizing DataBrew and AWS Step Functions. Our example showcases an ML pipeline that analyzes the publicly available Air Quality Dataset to forecast CO levels in New York City.
Solution Overview
The architecture diagram below presents a comprehensive view of the ML workflow, leveraging DataBrew for data preparation and job scheduling, while AWS Lambda and Step Functions manage the orchestration of ML model training and inference through the AWS Step Functions Data Science SDK. We utilize Amazon EventBridge to trigger the Step Functions state machine upon the completion of the DataBrew job.
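To make the hand-off concrete, here is a minimal sketch of what such a Lambda function might look like; the handler name and the STATE_MACHINE_ARN environment variable are assumptions for illustration, not the exact code deployed by the CloudFormation stack:

import json
import os

import boto3

sfn = boto3.client("stepfunctions")


def handler(event, context):
    # Invoked by the EventBridge rule when the DataBrew job finishes.
    # STATE_MACHINE_ARN is assumed to be set as an environment variable
    # pointing at the training or inference state machine.
    response = sfn.start_execution(
        stateMachineArn=os.environ["STATE_MACHINE_ARN"],
        input=json.dumps(event),
    )
    return {"executionArn": response["executionArn"]}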
Key steps in this solution include:
- Importing your dataset into Amazon Simple Storage Service (Amazon S3).
- Deploying the AWS CloudFormation stack, which sets up:
  - DataBrew recipes for the training and inference data.
  - The schedule for the training and inference DataBrew jobs.
  - An EventBridge rule (a sketch of the rule pattern follows this list).
  - A Lambda function that initiates the Step Functions state machine.
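The EventBridge rule that wires the DataBrew job to the Lambda function can be expressed roughly as follows; the rule name and target ID are hypothetical, and the source and detail-type values reflect DataBrew's EventBridge integration, so verify them against the events emitted in your account:

import json

import boto3

events = boto3.client("events")

# Match successful runs of the training-features DataBrew job.
events.put_rule(
    Name="databrew-trainingfeatures-succeeded",
    EventPattern=json.dumps({
        "source": ["aws.databrew"],
        "detail-type": ["DataBrew Job State Change"],
        "detail": {
            "jobName": ["km-mlframework-trainingfeatures-job"],
            "state": ["SUCCEEDED"],
        },
    }),
)

# Point the rule at the Lambda function that starts the state machine.
events.put_targets(
    Rule="databrew-trainingfeatures-succeeded",
    Targets=[{"Id": "start-training-pipeline", "Arn": "<lambda-function-arn>"}],
)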
The training phase comprises the following actions (a sketch using the Data Science SDK follows the list):
- Running a SageMaker processing job to remove column headers.
- Running a SageMaker training job.
- Storing the trained model in an S3 bucket.
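For reference, a training step defined with the Step Functions Data Science SDK might look like the sketch below; the XGBoost container, role ARNs, and job name are assumptions, and the header-stripping processing step is omitted for brevity:

import sagemaker
from sagemaker.estimator import Estimator
from stepfunctions.steps import Chain, TrainingStep
from stepfunctions.workflow import Workflow

# Hypothetical bucket name and role ARNs, for illustration only.
artifact_bucket = "<artifactbucket>"
workflow_role = "arn:aws:iam::<account-id>:role/StepFunctionsWorkflowRole"
sagemaker_role = "arn:aws:iam::<account-id>:role/SageMakerExecutionRole"

estimator = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", "us-east-1", "1.2-1"),
    role=sagemaker_role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=f"s3://{artifact_bucket}/artifact-repo/model/",
)

training_step = TrainingStep(
    "Train model",
    estimator=estimator,
    data=f"s3://{artifact_bucket}/train_features/",
    job_name="air-quality-training-job",
)

workflow = Workflow(
    name="TrainingPipeline",
    definition=Chain([training_step]),
    role=workflow_role,
)
# workflow.create()   # register the state machine
# workflow.execute()  # start an execution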
The inference phase comprises the following actions (a sketch of the batch transform step follows the list):
- Running a SageMaker processing job to remove column headers.
- Performing a SageMaker batch transformation.
- Storing predictions in an S3 bucket.
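Similarly, the batch transform step of the inference branch can be sketched with the Data Science SDK as follows; the model name, job name, and instance type are placeholders rather than the values created by the stack:

from sagemaker.transformer import Transformer
from stepfunctions.steps import TransformStep

# Hypothetical model name and paths; in the actual pipeline these come from
# the training branch and the CloudFormation stack.
transformer = Transformer(
    model_name="air-quality-model",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://<artifactbucket>/predictions/",
)

transform_step = TransformStep(
    "Batch transform",
    transformer=transformer,
    job_name="air-quality-batch-transform",
    model_name="air-quality-model",
    data="s3://<artifactbucket>/inference_features/",
    content_type="text/csv",
)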
Prerequisites
To implement this solution, ensure you have the following:
- An AWS account.
- AWS Identity and Access Management (IAM) role permissions.
- An S3 bucket for storing data and model artifacts.
- Access to public datasets.
- Python 3.7+ with pandas installed.
Loading the Dataset to Amazon S3
Begin by uploading the air quality dataset to Amazon S3. Download the Outdoor Air Quality Dataset, limiting your selection to:
- Pollutant: CO
- Geographic Area: New York
- Monitor Site: All Sites
Organize the data by year, month, and day, using the 2018–2019 data for model training and the 2020 data for inference. Execute the following script to store the output in the NY_XXXX folders (NY_2018, NY_2019, and NY_2020):
import os

import pandas as pd


def split_data(root_folder, df):
    # Derive year/month/day partitions from the Date column.
    df["year"] = pd.DatetimeIndex(df["Date"]).year
    df["month"] = pd.DatetimeIndex(df["Date"]).month
    df["day"] = pd.DatetimeIndex(df["Date"]).day
    os.makedirs(root_folder, exist_ok=True)
    # Write one CSV per day under <root_folder>/<month>/<day>/.
    for m, month_df in df.groupby("month"):
        month_dir = os.path.join(root_folder, "{:02}".format(m))
        os.makedirs(month_dir, exist_ok=True)
        for d, day_df in month_df.groupby("day"):
            day_dir = os.path.join(month_dir, "{:02}".format(d))
            os.makedirs(day_dir, exist_ok=True)
            day_df.to_csv(os.path.join(day_dir, "{:02}.csv".format(d)), index=False)


ny_data_2018 = pd.read_csv("<path to downloaded 2018 data file>")
ny_data_2019 = pd.read_csv("<path to downloaded 2019 data file>")
ny_data_2020 = pd.read_csv("<path to downloaded 2020 data file>")

split_data("NY_2018", ny_data_2018)
split_data("NY_2019", ny_data_2019)
split_data("NY_2020", ny_data_2020)
Create an S3 bucket in the us-east-1 Region and upload the NY_2018 and NY_2019 directories to s3://<artifactbucket>/train_raw_data/. Upload the NY_2020 folder to s3://<artifactbucket>/inference_raw_data/.
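If you prefer to script the upload instead of using the Amazon S3 console, a boto3 sketch along these lines works; <artifactbucket> is a placeholder for your bucket name:

import os

import boto3

s3 = boto3.client("s3")


def upload_folder(local_dir, bucket, prefix):
    # Walk the local directory and mirror it under the given S3 prefix.
    for root, _, files in os.walk(local_dir):
        for name in files:
            local_path = os.path.join(root, name)
            key = os.path.join(prefix, os.path.relpath(local_path)).replace(os.sep, "/")
            s3.upload_file(local_path, bucket, key)


for folder in ("NY_2018", "NY_2019"):
    upload_folder(folder, "<artifactbucket>", "train_raw_data")
upload_folder("NY_2020", "<artifactbucket>", "inference_raw_data")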
Deploying Your Resources
For a quick start, deploy the provided AWS CloudFormation stack. In the us-east-1 Region, the stack creates all the resources needed for an end-to-end ML pipeline on the specified S3 bucket: DataBrew datasets, projects, recipes, and jobs; the Step Functions state machines for training and inference (covering SageMaker processing, model training, and batch transform jobs); an EventBridge rule; and the Lambda function.
Launch the stack and configure it as follows (a programmatic equivalent is sketched after this list):
- For ArtifactBucket, enter the name of the S3 bucket you created earlier.
- Select the three acknowledgement check boxes.
- Choose Create stack.
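If you would rather deploy with the SDK, the console steps above map to a create_stack call roughly like this; the stack name is hypothetical and <template-url> stands in for the template provided with this post:

import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

# The three acknowledgement check boxes correspond to the capabilities below.
cfn.create_stack(
    StackName="databrew-ml-framework",
    TemplateURL="<template-url>",
    Parameters=[{"ParameterKey": "ArtifactBucket", "ParameterValue": "<artifactbucket>"}],
    Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM", "CAPABILITY_AUTO_EXPAND"],
)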
Testing the Solution
As part of the CloudFormation template, the DataBrew job km-mlframework-trainingfeatures-job was created and scheduled to run every Monday at 10:00 AM UTC. This job generates the features required for model training. After the template deployment is complete, you can trigger the training pipeline manually: on the DataBrew console, select the job km-mlframework-trainingfeatures-job and choose Run job.
The job writes the features to s3://<artifactbucket>/train_features/. When it finishes, an EventBridge rule invokes the Lambda function, which orchestrates the SageMaker training jobs through Step Functions. When training completes, the model output is saved to s3://<artifactbucket>/artifact-repo/model/.
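You can also start the job programmatically instead of using the console; this minimal boto3 sketch assumes only the job name created by the stack:

import boto3

databrew = boto3.client("databrew")

# Kick off the training-features job on demand instead of waiting for the schedule.
run = databrew.start_job_run(Name="km-mlframework-trainingfeatures-job")
print(run["RunId"])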
Next, trigger the DataBrew job km-mlframework-inferencefeatures-job, which is scheduled to run every Tuesday at 10:00 AM UTC. This job generates the inference features for the trained model; you can also trigger the inference pipeline manually by selecting the job on the DataBrew console and choosing Run job. The features are written to s3://<artifactbucket>/inference_features/. When the job completes, an EventBridge rule invokes the Lambda function, which orchestrates the SageMaker batch transform job through Step Functions. Predictions are stored in s3://<artifactbucket>/predictions/.
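To confirm the pipeline produced output, you can list the prediction files with boto3; the bucket name is a placeholder:

import boto3

s3 = boto3.client("s3")

# List the prediction files written by the batch transform job.
resp = s3.list_objects_v2(Bucket="<artifactbucket>", Prefix="predictions/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])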
For further details on DataBrew steps and recipe construction, refer to the post on preparing data for ML models using AWS Glue DataBrew in a Jupyter notebook.
Cleanup
To prevent future charges, take the following actions:
- Ensure all ongoing activities are finished, or stop them manually (DataBrew, Step Functions, SageMaker).
- Delete the scheduled DataBrew jobs km-mlframework-trainingfeatures-job and km-mlframework-inferencefeatures-job to halt automatic scheduling (a scripted cleanup sketch follows this list).
- Remove the S3 bucket created for data and model artifacts.
- Delete the CloudFormation stack deployed earlier.
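If you prefer to script the cleanup, a boto3 sketch along these lines covers the DataBrew jobs and the stack; the stack name is the hypothetical one used earlier, and the S3 bucket must be emptied before it can be deleted:

import boto3

databrew = boto3.client("databrew")
cfn = boto3.client("cloudformation", region_name="us-east-1")

# Stop the schedules by deleting the two DataBrew jobs.
for job in ("km-mlframework-trainingfeatures-job", "km-mlframework-inferencefeatures-job"):
    databrew.delete_job(Name=job)

# Delete the CloudFormation stack (hypothetical stack name).
cfn.delete_stack(StackName="databrew-ml-framework")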
Conclusion
DataBrew is designed to help data engineers and data scientists experiment with data preparation steps through a user-friendly visual interface. Combined with Step Functions, the Data Science SDK, and EventBridge, it lets you automate data preparation, model training, and inference in a single, scheduled ML pipeline.