The increasing integration of Machine Learning (ML) across various sectors has led to a heightened need for efficient and rapid ML inference at scale. Applications like defect detection in manufacturing, demand forecasting, fraud monitoring, and others often necessitate handling vast datasets, which can include images, videos, documents, and more. Such inference scenarios typically require scaling to thousands of parallel processing units. AWS serverless solutions provide a straightforward and automated scaling environment, making them an ideal option for executing ML inference at scale. With serverless architecture, inferences can be performed without the need for server management, and you pay only for the compute time consumed. ML practitioners can seamlessly bring their own models and inference code to AWS using containerization.
This article outlines how to implement and scale ML inference utilizing AWS serverless technologies, specifically AWS Lambda and AWS Fargate.
Solution Overview
The architecture for both batch and real-time inference is illustrated in the accompanying diagram. This solution is exemplified through a sample image classification case, with the source code available on GitHub.
- AWS Fargate: Facilitates batch inference at scale through serverless container technology. Fargate tasks load the container image containing the inference code for image classification.
- AWS Batch: Manages job orchestration for batch inference by dynamically allocating Fargate containers based on job requirements.
- AWS Lambda: Supports real-time ML inference at scale. The Lambda function executes the inference code for image classification and also submits batch inference jobs.
- Amazon API Gateway: Offers a REST API endpoint for the inference Lambda function.
- Amazon Simple Storage Service (S3): Serves as storage for input images and inference results associated with batch processing.
- Amazon Elastic Container Registry (ECR): Houses the container image with inference code for Fargate instances.
Deploying the Solution
An AWS Cloud Development Kit (CDK) template has been developed to define and configure the necessary resources for the sample solution. The CDK simplifies infrastructure provisioning and builds the deployment packages for both the Lambda function and the Fargate container; these packages bundle commonly used ML libraries such as Apache MXNet, along with their Python dependencies. The sample runs inference code using a ResNet-50 model trained on the ImageNet dataset, which can classify images into 1,000 categories spanning various objects and animals. The inference process downloads an input image and predicts the top five classes with their associated probabilities.
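For illustration, here is a minimal sketch of what such inference code can look like, using the pre-trained ResNet-50 from the MXNet Gluon model zoo. The file name and the exact preprocessing steps are assumptions for this sketch, not necessarily what the sample's own code does:

import mxnet as mx
from mxnet.gluon.model_zoo import vision

# Load a ResNet-50 pre-trained on ImageNet (weights download on first use)
net = vision.resnet50_v2(pretrained=True)

# Read the input image and preprocess it to the 224x224 NCHW layout the model expects
img = mx.image.imread("input.jpg")
img = mx.image.imresize(img, 224, 224).astype("float32") / 255
img = mx.image.color_normalize(img,
                               mean=mx.nd.array([0.485, 0.456, 0.406]),
                               std=mx.nd.array([0.229, 0.224, 0.225]))
img = img.transpose((2, 0, 1)).expand_dims(axis=0)

# Run the forward pass and report the five most probable classes
prob = mx.nd.softmax(net(img))[0]
for i in mx.nd.topk(prob, k=5).astype("int32").asnumpy():
    # Mapping a class index to a human-readable label requires an ImageNet synset file (not shown)
    print(f"class {int(i)}: probability {prob[int(i)].asscalar():.4f}")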
To follow this guide and execute the solution, you will need:
- An AWS account
- A terminal equipped with AWS Command Line Interface (CLI), CDK, Docker, Git, and Python. You can utilize a local terminal or an AWS Cloud9 environment.
To deploy the solution, execute these steps in your terminal:
$ git clone https://github.com/aws-samples/aws-serverless-for-machine-learning-inference
$ cd aws-serverless-for-machine-learning-inference
$ ./install.sh
or
$ ./cloud9_install.sh #If using AWS Cloud9
Confirm by entering Y to proceed with the deployment. This process will set up the required resources in your AWS account and may take approximately 30 minutes for the first deployment. Future deployments will typically complete in a few minutes.
The deployment steps include:
- Creating a CloudFormation stack (“MLServerlessStack”).
- Building a container image from the Dockerfile and inference code for batch processing.
- Establishing an ECR repository and publishing the container image.
- Setting up a Lambda function with the inference code for real-time processing.
- Configuring a batch job with a Fargate compute environment in AWS Batch.
- Creating an S3 bucket for storing input images and inference results.
- Setting up a Lambda function to submit batch jobs triggered by image uploads to the S3 bucket.
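As a rough sketch, the job-submitting Lambda function might look like the following. The job queue and job definition names here are illustrative placeholders, not the names the stack actually creates:

import boto3

batch = boto3.client("batch")

def handler(event, context):
    # Each record describes an object uploaded to the S3 input prefix
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        batch.submit_job(
            jobName="ml-serverless-batch-inference",       # placeholder
            jobQueue="ml-serverless-job-queue",            # placeholder
            jobDefinition="ml-serverless-job-definition",  # placeholder
            containerOverrides={
                "environment": [
                    {"name": "INPUT_BUCKET", "value": bucket},
                    {"name": "INPUT_KEY", "value": key},
                ]
            },
        )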
Running Inference
The sample solution enables predictions through either batch inference for multiple images or real-time predictions for individual images. Follow these steps for each scenario.
Batch Inference
To obtain batch predictions, upload image files to Amazon S3.
Using the Amazon S3 console or AWS CLI, upload one or more image files to the designated S3 bucket:
$ aws s3 cp <path to jpeg files> s3://ml-serverless-bucket-<acct-id>-<aws-region>/input/ --recursive
This action will initiate a batch job, launching Fargate tasks to execute the inference. Monitor the job status in the AWS Batch console. Upon completion, inference results will be accessible from the output path in the S3 bucket.
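If you prefer to retrieve the results programmatically, a small boto3 sketch such as the following would work; the output/ prefix is assumed from the bucket layout described above:

import boto3

s3 = boto3.client("s3")
bucket = "ml-serverless-bucket-<acct-id>-<aws-region>"  # substitute your bucket name

# List result objects under the output/ prefix and print their contents
resp = s3.list_objects_v2(Bucket=bucket, Prefix="output/")
for obj in resp.get("Contents", []):
    body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
    print(obj["Key"], body.decode("utf-8"))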
Real-Time Inference
For real-time predictions, invoke the REST API endpoint with an image payload.
Access the CloudFormation console to locate the API endpoint URL (httpAPIUrl) from the stack output. Utilize an API client like Postman or the curl command to send a POST request to the /predict API endpoint with an image file payload:
$ curl --request POST -H "Content-Type: image/jpeg" --data-binary @<your jpg file name> <your-api-endpoint-url>/predict
The inference results will be returned in the API response.
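Behind the endpoint, the Lambda function receives the image from API Gateway and returns the predictions. Here is a minimal handler sketch, assuming a proxy integration that delivers the binary payload base64-encoded and a hypothetical predict() helper that wraps the model shown earlier:

import base64
import json

def handler(event, context):
    # API Gateway proxy integrations deliver binary payloads base64-encoded
    image_bytes = base64.b64decode(event["body"])
    top5 = predict(image_bytes)  # hypothetical helper wrapping the ResNet-50 model
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(top5),
    }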
Additional Recommendations and Tips
- Scaling: Adjust AWS Service Quotas in your account and region to accommodate your scaling and concurrency requirements (see the sketch after this list). For instance, if your use case exceeds the default Lambda concurrency limit, you will need to request an increase to achieve the necessary concurrency. Size your VPC and subnets appropriately to allow for the required number of Fargate tasks.
- Performance: Conduct load testing and tune performance across each layer to meet your requirements.
- Use of Container Images with Lambda: Utilizing containers with both AWS Lambda and AWS Fargate can simplify source code management and packaging.
- Batch Inferences with Lambda: Lambda functions can also be employed for batch inferences if the storage and processing times fall within Lambda limits.
- Fargate Spot: This option allows for the execution of interruption-tolerant tasks at reduced rates compared to standard Fargate pricing, thus lowering compute resource costs.
- Amazon ECS with EC2 Instances: For scenarios requiring specific compute types, consider using EC2 instances instead of Fargate.
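As a companion to the scaling tip above, you can inspect your current Lambda concurrency quota programmatically before requesting an increase. A short boto3 sketch; L-B99A9384 is the quota code for Lambda concurrent executions at the time of writing, so verify it in your account:

import boto3

quotas = boto3.client("service-quotas")

# Look up the Lambda "concurrent executions" quota (code assumed; verify in your account)
resp = quotas.get_service_quota(ServiceCode="lambda", QuotaCode="L-B99A9384")
print("Current concurrency limit:", resp["Quota"]["Value"])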
When you are finished, navigate to the project directory in your terminal and run the cleanup commands provided with the sample to delete the resources created by the deployment and avoid ongoing charges.