Deploy and Manage Machine Learning Pipelines with Terraform and Amazon SageMaker
In the realm of cloud infrastructure, customers are increasingly turning to Infrastructure as Code (IaC) to design, develop, and manage their resources. IaC makes infrastructure and services consistent, scalable, and reproducible, in line with DevOps best practices. One powerful tool for managing AWS infrastructure through IaC is Terraform, which lets developers organize their infrastructure into reusable code modules.

This methodology is becoming particularly significant in the field of machine learning (ML). By using Terraform to develop and manage ML pipelines, including training and inference, you can easily scale to multiple ML applications or AWS Regions without building the infrastructure from the ground up. It also ensures consistency in infrastructure configuration, such as the instance types and sizes used for training and inference, across different implementations of the ML pipeline. This makes it possible, for example, to route requests to different Amazon SageMaker endpoints.
In this article, we will guide you through the deployment and management of ML pipelines using Terraform and Amazon SageMaker.
Overview of the Solution
This guide includes code snippets and outlines the steps required to set up AWS infrastructure for ML pipelines using Terraform, specifically for model training and inference with Amazon SageMaker. The ML pipeline will be orchestrated through AWS Step Functions, which manages the various stages of the pipeline as depicted in the following diagram.
AWS Step Functions initiates an AWS Lambda function that generates a unique job ID, which is then utilized to start a SageMaker training job. Step Functions also handles the creation of models, endpoint configurations, and endpoints used for inference. Additional resources include:
- AWS Identity and Access Management (IAM) roles and policies linked to the resources to facilitate interaction with other components.
- Amazon Simple Storage Service (Amazon S3) buckets for training data and model outputs.
- An Amazon Elastic Container Registry (Amazon ECR) repository for the Docker image containing the training and inference logic.
The ML-related code for training and inference with a Docker image primarily draws from existing work found in a GitHub repository.
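To make the architecture more concrete, the training step that Step Functions triggers corresponds to the SageMaker CreateTrainingJob API. The following Boto3 sketch only illustrates the kind of parameters the pipeline wires together (the ECR image, the IAM role, the S3 buckets, and the instance type); all names and values shown are placeholders, and in the actual pipeline they are supplied by the Terraform variables and the Lambda-generated job ID rather than by a hand-written script.

import boto3

sagemaker = boto3.client('sagemaker')

# Illustrative only: the state machine passes equivalent parameters to CreateTrainingJob.
sagemaker.create_training_job(
    TrainingJobName='example-job-id',  # Unique ID generated by the Lambda function.
    AlgorithmSpecification={
        'TrainingImage': '<account_number>.dkr.ecr.eu-west-1.amazonaws.com/<ecr_repository_name>:latest',
        'TrainingInputMode': 'File',
    },
    RoleArn='arn:aws:iam::<account_number>:role/<sagemaker_execution_role>',  # IAM role created by Terraform.
    InputDataConfig=[{
        'ChannelName': 'train',
        'DataSource': {'S3DataSource': {
            'S3DataType': 'S3Prefix',
            'S3Uri': 's3://<training_data_bucket>/data/',
            'S3DataDistributionType': 'FullyReplicated',
        }},
    }],
    OutputDataConfig={'S3OutputPath': 's3://<model_output_bucket>/models/'},
    ResourceConfig={'InstanceType': 'ml.m5.large', 'InstanceCount': 1, 'VolumeSizeInGB': 30},
    StoppingCondition={'MaxRuntimeInSeconds': 3600},
)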
High-Level Steps
Here’s a summary of the major steps we will cover:
- Deploy your AWS infrastructure with Terraform.
- Push your Docker image to Amazon ECR.
- Execute the ML pipeline.
- Invoke your endpoint.
Repository Structure
The repository containing the code and data referenced in this article can be found in the designated GitHub repository. It comprises the following directories:
- /terraform – Contains:
  - ./infrastructure – Includes the main.tf file that calls the ML pipeline module, along with variable declarations for infrastructure deployment.
  - ./ml-pipeline-module – Houses the reusable Terraform ML pipeline module.
- /src – Contains:
  - ./container – Houses example code for training and inference, including the Docker image definitions.
  - ./lambda_function – Contains the Python code for the Lambda function that generates configurations, including a unique job ID for the SageMaker training job (see the sketch after this list).
- /data – Contains:
  - ./iris.csv – The dataset used for training the ML model.
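The Lambda function under ./lambda_function is what injects a unique job ID into the training configuration. The following is only a minimal sketch of that idea; the actual handler in the repository may differ in structure and in the fields it returns.

import uuid
from datetime import datetime, timezone

def lambda_handler(event, context):
    # Generate a unique, timestamped job ID so every pipeline run creates a fresh SageMaker training job.
    timestamp = datetime.now(timezone.utc).strftime('%Y-%m-%d-%H-%M-%S')
    job_id = f"ml-training-{timestamp}-{uuid.uuid4().hex[:8]}"
    # Hand the ID (and any derived resource names) to the downstream Step Functions states.
    return {
        "TrainingJobName": job_id,
        "ModelName": job_id,
        "EndpointConfigName": job_id,
        "EndpointName": job_id,
    }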
Prerequisites
Before you begin, ensure that you have the following prerequisites:
- An AWS account.
- Terraform version 0.13.5 or higher.
- AWS Command Line Interface (AWS CLI) version 2.
- Python version 3.7 or higher.
- Docker.
Deploying Your AWS Infrastructure with Terraform
To set up the ML pipeline, you will need to modify a few variables and names to meet your requirements. The necessary code is located in the /terraform directory. When initializing for the first time, open the file terraform/infrastructure/terraform.tfvars and adjust the variable project_name to reflect your project’s name and change the variable region if you wish to deploy in a different region. You can also modify additional variables such as instance types for training and inference.
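As a rough illustration, an edited terraform/infrastructure/terraform.tfvars might look like the following; the values are placeholders, and only project_name and region are the variables named in this walkthrough.

project_name = "my-ml-pipeline"
region       = "eu-west-1"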
Next, execute the following commands to deploy the infrastructure via Terraform:
export AWS_PROFILE=<your_aws_cli_profile_name>
cd terraform/infrastructure
terraform init
terraform plan
terraform apply
Review the output to ensure that the planned resources appear correct, and confirm with ‘yes’ during the apply stage if everything looks good. Subsequently, navigate to the Amazon ECR console (or review the Terraform output in the terminal) to obtain the URL for the ECR repository you created with Terraform.
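If you prefer not to use the console, the following Boto3 snippet lists the ECR repositories in your account along with their URIs, which is another way to retrieve the URL of the repository Terraform created. It assumes your AWS credentials are configured and that you deployed to eu-west-1; adjust the region if needed.

import boto3

# List ECR repositories and their URIs; the repository created by Terraform will appear here.
ecr = boto3.client('ecr', region_name='eu-west-1')
for repo in ecr.describe_repositories()['repositories']:
    print(repo['repositoryName'], repo['repositoryUri'])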
Pushing Your Docker Image to Amazon ECR
For the ML pipeline and SageMaker to train a model and provision a SageMaker endpoint for inference, you need to provide a Docker image and store it in Amazon ECR. An example is included in the src/container directory. If you have already applied the AWS infrastructure from the previous step, you can push the Docker image as described here. After building your Docker image, run the following commands to push it to Amazon ECR (adjust the ECR URL to match your account and repository):
cd src/container
export AWS_PROFILE=<your_aws_cli_profile_name>
aws ecr get-login-password --region eu-west-1 | docker login --username AWS --password-stdin <account_number>.dkr.ecr.eu-west-1.amazonaws.com
docker build -t ml-training .
docker tag ml-training:latest <account_number>.dkr.ecr.eu-west-1.amazonaws.com/<ecr_repository_name>:latest
docker push <account_number>.dkr.ecr.eu-west-1.amazonaws.com/<ecr_repository_name>:latest
If you have already deployed the AWS infrastructure via Terraform, you can directly push the updates of your code and Docker image to Amazon ECR without redeploying through Terraform.
Running the ML Pipeline
To train the model and run the ML pipeline, go to the Step Functions console and start a new execution of the state machine that Terraform created. You can monitor the progress of each step in the state machine visualization. You can also check the training job's progress in SageMaker and the status of your SageMaker endpoint.
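Alternatively, you can start the pipeline programmatically with Boto3. The sketch below assumes you know the ARN of the state machine created by Terraform (for example, from the Terraform output or the Step Functions console); the ARN shown is a placeholder.

import boto3

sfn = boto3.client('stepfunctions', region_name='eu-west-1')

# Start a new execution of the ML pipeline state machine.
response = sfn.start_execution(
    stateMachineArn='arn:aws:states:eu-west-1:<account_number>:stateMachine:<state_machine_name>',
    input='{}',  # Pass JSON here if your pipeline expects input parameters.
)
print(response['executionArn'])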
Once the state machine in Step Functions runs successfully, you will see that the SageMaker endpoint has been established. In the SageMaker console, select Inference from the navigation pane and then Endpoints. Ensure that you wait for the status to change to InService.
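If you would rather block until the endpoint is ready instead of refreshing the console, Boto3 provides a waiter for this; the endpoint name below is a placeholder.

import boto3

sm = boto3.client('sagemaker', region_name='eu-west-1')

# Wait until the SageMaker endpoint reaches the InService status.
waiter = sm.get_waiter('endpoint_in_service')
waiter.wait(EndpointName='<your_endpoint_name>')
print(sm.describe_endpoint(EndpointName='<your_endpoint_name>')['EndpointStatus'])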
Invoking Your Endpoint
To invoke your endpoint (using the iris dataset as an example), you can utilize the following Python script with the AWS SDK for Python (Boto3). This can be executed from a SageMaker notebook or embedded within a Lambda function:
import boto3
from io import StringIO
import pandas as pd
client = boto3.client('sagemaker-runtime')
endpoint_name = 'Your endpoint name'  # Replace with the name of your SageMaker endpoint.
content_type = "text/csv"  # The MIME type of the input data in the request body.
payload = pd.DataFrame([[1.5, 0.2, 4.4, 2.6]])  # Example record with four feature values from the iris dataset.
# Serialize the payload to CSV and invoke the endpoint.
csv_buffer = StringIO()
payload.to_csv(csv_buffer, sep=",", header=False, index=False)
response = client.invoke_endpoint(EndpointName=endpoint_name, ContentType=content_type, Body=csv_buffer.getvalue())
print(response['Body'].read().decode('utf-8'))  # Prints the model's prediction for the submitted record.