Amazon VGT2 Las Vegas: Leveraging AWS Serverless for Scalable Machine Learning Inference

As industries increasingly integrate machine learning (ML) into their operations, the demand for efficient, scalable ML inference solutions continues to grow. Applications such as manufacturing defect detection, demand forecasting, and fraud monitoring often process vast datasets of images, videos, and documents, and typically require workloads that scale to thousands of parallel processing units. With their simplicity and automatic scaling, AWS serverless services are an ideal choice for running ML inference at scale: you can run inference without provisioning or managing servers, and pay only for the compute time you actually use. ML developers can deploy their models and inference code on AWS using containers.

This article details how to utilize AWS serverless services, specifically AWS Lambda and AWS Fargate, to perform scalable ML inference.

Solution Overview

The architecture for both batch and real-time inference is outlined below. Our example uses an image classification task, with the source code available on GitHub.

  • AWS Fargate: Facilitates batch inference at scale through serverless containers. The Fargate task loads the container image containing the inference code for image classification.
  • AWS Batch: Orchestrates jobs for batch inference by dynamically provisioning Fargate containers based on job requirements.
  • AWS Lambda: Enables real-time ML inference at scale, loading inference code for image classification. Lambda functions also handle the submission of batch inference jobs.
  • Amazon API Gateway: Offers a REST API endpoint for the inference Lambda function.
  • Amazon Simple Storage Service (S3): Stores input images and inference results for batch processing.
  • Amazon Elastic Container Registry (ECR): Houses the container image containing the inference code for Fargate.

Deploying the Solution

We have developed an AWS Cloud Development Kit (CDK) template to define and provision the resources for this sample solution. The CDK application provisions the infrastructure and builds the deployment packages for both the Lambda function and the Fargate container, incorporating commonly used ML libraries such as Apache MXNet and Python along with their dependencies. The solution employs a ResNet-50 model trained on the ImageNet dataset, which classifies images into 1,000 categories of objects and animals. The inference code loads the input image and predicts its top five classes with their associated probabilities.
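
For illustration, here is a minimal sketch of what such inference code might look like, using MXNet's Gluon model zoo. The predict_top5 helper and the preprocessing details are assumptions for this sketch; the actual implementation in the repository may differ:

    import mxnet as mx
    from mxnet.gluon.model_zoo import vision

    # Load a ResNet-50 pretrained on ImageNet (weights download on first use).
    net = vision.resnet50_v2(pretrained=True)

    def predict_top5(image_path):
        # Decode the image and resize it to the 224x224 input ResNet-50 expects.
        img = mx.image.imread(image_path)
        img = mx.image.imresize(img, 224, 224)
        img = mx.nd.transpose(img, axes=(2, 0, 1))  # HWC -> CHW
        img = img.expand_dims(axis=0).astype('float32') / 255.0
        # Normalize with the standard ImageNet channel statistics.
        mean = mx.nd.array([0.485, 0.456, 0.406]).reshape((1, 3, 1, 1))
        std = mx.nd.array([0.229, 0.224, 0.225]).reshape((1, 3, 1, 1))
        img = (img - mean) / std
        # Forward pass, then convert logits to probabilities.
        probs = mx.nd.softmax(net(img))[0]
        top5 = probs.topk(k=5).astype('int32').asnumpy()
        # Return (class index, probability) pairs; mapping indices to readable
        # labels requires the ImageNet synset file, omitted here for brevity.
        return [(int(i), float(probs[int(i)].asscalar())) for i in top5]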

To get started, you will need access to:

  • An AWS account
  • A terminal equipped with AWS Command Line Interface (CLI), CDK, Docker, git, and Python. You can use your local machine’s terminal or an AWS Cloud9 environment.

To deploy the solution, follow these steps in your terminal:

  1. Clone the GitHub repository:
    git clone https://github.com/aws-samples/aws-serverless-for-machine-learning-inference
  2. Navigate to the project directory and deploy the CDK application:
    ./install.sh

    or

    ./cloud9_install.sh # If you are using AWS Cloud9
  3. Confirm the deployment by entering Y when prompted. This process may take approximately 30 minutes initially, as it builds the Docker image and other artifacts. Subsequent deployments generally complete in a few minutes.

The deployment creates the following resources:

  • A CloudFormation stack for the solution
  • A container image, built from the Dockerfile with the inference code for batch processing, and an ECR repository to store it
  • A Lambda function for real-time inference
  • A batch job configuration in AWS Batch
  • An S3 bucket for storing input images and inference results
  • A Lambda function that submits batch jobs whenever images are uploaded to the S3 bucket
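
That job-submitting function might look roughly like the sketch below; the job queue and job definition names are placeholders, not the names the CDK template actually generates:

    import boto3

    batch = boto3.client('batch')

    def handler(event, context):
        # Each S3 upload event record triggers one batch inference job.
        for record in event['Records']:
            bucket = record['s3']['bucket']['name']
            key = record['s3']['object']['key']
            batch.submit_job(
                jobName='ml-batch-inference',
                jobQueue='ml-serverless-job-queue',      # placeholder name
                jobDefinition='ml-serverless-job-def',   # placeholder name
                containerOverrides={'environment': [
                    {'name': 'S3_BUCKET', 'value': bucket},
                    {'name': 'S3_KEY', 'value': key},
                ]},
            )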

Running Inference

The sample solution allows you to obtain predictions for multiple images using batch inference or for a single image through a real-time API endpoint. Follow these steps for each scenario:

Batch Inference

To get batch predictions, upload image files to Amazon S3.

Using the Amazon S3 console or AWS CLI, upload one or more image files to the specified S3 bucket path:

aws s3 cp <path to jpeg files> s3://ml-serverless-bucket-<acct-id>-<aws-region>/input/ --recursive

This action will trigger the batch job, launching Fargate tasks to perform the inference. Monitor the job status in the AWS Batch console. Once the job completes, the inference results will be available in the output path of the S3 bucket.
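
If you prefer to check status and fetch results programmatically, a boto3 sketch along these lines works; the job queue name is a placeholder, and the bucket name follows the pattern shown above:

    import boto3

    batch = boto3.client('batch')
    s3 = boto3.client('s3')

    # List jobs that completed successfully on the queue (placeholder name).
    done = batch.list_jobs(jobQueue='ml-serverless-job-queue',
                           jobStatus='SUCCEEDED')
    print([job['jobName'] for job in done['jobSummaryList']])

    # Inference results land under the output/ prefix of the solution's bucket.
    results = s3.list_objects_v2(
        Bucket='ml-serverless-bucket-<acct-id>-<aws-region>', Prefix='output/')
    for obj in results.get('Contents', []):
        print(obj['Key'])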

Real-time Inference

For real-time predictions, invoke the REST API endpoint with an image payload.

Locate the API endpoint URL (httpAPIUrl) in the CloudFormation console output. Use an API client, such as Postman or curl, to send a POST request to the /predict API endpoint:

curl --request POST -H "Content-Type: application/jpeg" --data-binary @<your jpg file name> <your-api-endpoint-url>/predict

The inference results will be returned in the API response.
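
Behind this endpoint, the real-time inference handler might look roughly like the following sketch. It assumes an API Gateway proxy integration that delivers binary payloads base64-encoded, and it reuses the hypothetical predict_top5 helper sketched earlier; the repository's actual handler may differ:

    import base64
    import json

    def handler(event, context):
        # API Gateway delivers binary request bodies base64-encoded.
        body = event.get('body', '')
        image_bytes = base64.b64decode(body) if event.get('isBase64Encoded') else body.encode()
        # /tmp is the only writable path inside a Lambda function.
        with open('/tmp/input.jpg', 'wb') as f:
            f.write(image_bytes)
        results = predict_top5('/tmp/input.jpg')  # hypothetical helper from above
        return {
            'statusCode': 200,
            'headers': {'Content-Type': 'application/json'},
            'body': json.dumps([{'class': c, 'probability': p}
                                for c, p in results]),
        }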

Additional Recommendations and Tips

Here are some suggestions to optimize the sample solution to fit your needs:

  • Scaling: Adjust AWS Service Quotas in your account and region to accommodate your scaling and concurrency requirements. If your use case demands scaling beyond the default Lambda concurrent executions limit, it is crucial to increase this limit. Additionally, ensure your VPC and subnets have a sufficiently large IP address range to handle the required concurrency for Fargate tasks.
  • Performance: Conduct load testing and refine performance across each layer to meet your specific needs.
  • Utilize Container Images with Lambda: This approach simplifies source code management and packaging, allowing you to use containers with both AWS Lambda and AWS Fargate.
  • Employ AWS Lambda for Batch Inferences: Lambda functions can also be used for batch inferences if the storage and processing times fall within Lambda’s limits.
  • Consider Fargate Spot: This option lets you run interruption-tolerant tasks at a discount compared to standard Fargate pricing, which can reduce your compute costs; see the sketch after this list.
  • Use Amazon ECS with Amazon EC2: For use cases requiring specific compute types, consider leveraging EC2 instances instead of Fargate.
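
As a rough illustration of the Fargate Spot option above, an AWS Batch compute environment requests Spot capacity through its compute resource type. All names, subnets, and security groups below are placeholders:

    import boto3

    batch = boto3.client('batch')

    # Create a managed compute environment backed by Fargate Spot capacity.
    batch.create_compute_environment(
        computeEnvironmentName='ml-inference-spot',        # placeholder name
        type='MANAGED',
        computeResources={
            'type': 'FARGATE_SPOT',  # interruption-tolerant, discounted capacity
            'maxvCpus': 256,
            'subnets': ['subnet-0123456789abcdef0'],       # placeholder
            'securityGroupIds': ['sg-0123456789abcdef0'],  # placeholder
        },
    )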

To clean up resources and avoid further charges, navigate to the project directory and run the stack cleanup command; for a CDK application such as this one, that is typically cdk destroy.
