Model Hosting Patterns in Amazon SageMaker, Part 2: A Guide to Deploying Real-Time Models on SageMaker

Amazon SageMaker is a fully managed platform that empowers developers and data scientists to quickly build, train, and deploy machine learning (ML) models at scale. Inference is where an ML model delivers value, and SageMaker provides four distinct inference options:

  1. Real-Time Inference
  2. Serverless Inference
  3. Asynchronous Inference
  4. Batch Transform

These four options can be categorized into Online and Batch inference types. Online inference requires requests to be processed as they arrive, with the consuming application expecting a prompt response after each request. This can occur either synchronously (Real-Time Inference, Serverless Inference) or asynchronously (Asynchronous Inference). In synchronous patterns, the application is blocked until a response is received. Such workloads are typical of real-time applications, like online credit card fraud detection, where responses are needed within milliseconds to seconds and the request payloads are small (a few MB). In asynchronous patterns, the application experience is not blocked (for instance, when submitting an insurance claim via a mobile app), and payload sizes and/or processing times are often larger. Offline (batch) inference processes an aggregation of requests together, providing responses only after the entire batch is complete. These workloads tend to be less sensitive to latency, handle large volumes (multiple GBs) of data, and run on a regular schedule, such as running object detection on security camera footage at the end of the day or processing payroll data monthly.

At its core, SageMaker Real-Time Inference comprises one or more models, the framework/container in use, and the infrastructure/instances backing your deployed endpoint. This post delves into how to create and invoke a Single Model Endpoint.

Selecting a Model Deployment Option

Choosing the appropriate inference type can be challenging, but this simple guide can assist you. It’s not a rigid flow chart, so feel free to choose the option that best suits your needs. Real-Time Inference is optimal for scenarios requiring low and consistent latency (milliseconds to seconds) and for throughput-sensitive workloads. You control the instance type and count for your endpoint, and you can configure AutoScaling policies to manage traffic, as shown in the sketch after this paragraph. Asynchronous Inference is suitable when payload sizes are large and latency requirements are near real time, which is particularly beneficial for NLP and Computer Vision models that demand longer preprocessing times. Serverless Inference is ideal for sporadic traffic without the hassle of managing infrastructure scaling. The procedure for creating an endpoint remains consistent regardless of the chosen inference type; while this post focuses on setting up a real-time instance-based endpoint, it can easily be adapted to the other inference options based on your specific use case. Finally, Batch inference occurs offline: you submit a dataset for inference processing, and SageMaker handles the job. This option is also instance-based, allowing you to choose the optimal instance for your workload, and you pay only for the duration of the job, making it suitable for processing large amounts of data over extended periods. Built-in features make it easier to work with structured data, and optimizations automatically distribute large datasets. Example use cases include propensity modeling, predictive maintenance, and churn prediction, all of which can be performed in bulk because they don’t need to respond to specific events.
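
For instance, here is a minimal sketch, assuming a hypothetical endpoint named my-realtime-endpoint with a variant named AllTraffic, of attaching a target-tracking AutoScaling policy to a real-time endpoint through the Application Auto Scaling API:

import boto3

# Register the endpoint variant's instance count as a scalable target,
# then scale on invocations per instance (a built-in SageMaker metric).
# The endpoint and variant names are placeholders for illustration.
autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/my-realtime-endpoint/variant/AllTraffic"

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

autoscaling.put_scaling_policy(
    PolicyName="InvocationsTargetTracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,  # target invocations per instance per minute
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)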

Hosting a Model on SageMaker Endpoints

At its foundation, a SageMaker Real-Time Endpoint consists of a model and the infrastructure you select to support the endpoint. SageMaker uses containers for model hosting, so each model needs a container that properly configures the environment for the framework it uses. For instance, if you work with an SKlearn model, you must provide your model scripts/data within a container that is compatible with SKlearn. Fortunately, SageMaker offers managed images for popular frameworks such as TensorFlow, PyTorch, SKlearn, and HuggingFace. These images can be retrieved with the high-level SageMaker Python SDK, as the short sketch below shows, allowing you to inject your model scripts and data into these containers. If SageMaker does not have a suitable container, you can also Build Your Own Container and upload your custom image, installing any necessary dependencies for your model.
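
As a quick illustration, here is a hedged sketch of looking up a managed SKlearn inference image with the SageMaker Python SDK; the framework version is an assumption, so check what is available in your Region:

import sagemaker
from sagemaker import image_uris

# Look up the ECR URI of the SageMaker-managed SKlearn inference image
session = sagemaker.Session()
sklearn_image = image_uris.retrieve(
    framework="sklearn",
    region=session.boto_region_name,
    version="1.0-1",          # assumed framework version
    image_scope="inference",
)
print(sklearn_image)

The same call works for the other managed frameworks by swapping the framework and version arguments.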

SageMaker accommodates both trained and pre-trained models. When referencing model scripts/data, you can either mount a script on your container, or if you possess a pre-trained model artifact (for instance, model.joblib for SKlearn), you can submit this alongside your image to SageMaker. Understanding SageMaker Inference involves three primary entities created during Endpoint creation:

  1. SageMaker Model Entity – This is where you input your trained model data/script and the image you’re working with, whether it is an AWS image or one you’ve built yourself.
  2. Endpoint Configuration Creation – Here, you define your infrastructure, including instance type, count, and more.
  3. Endpoint Creation – This is the REST Endpoint that hosts your model, which you invoke to receive responses.

Now, let’s examine how to utilize a managed SageMaker Image as well as your custom-built image to deploy an endpoint. First, the Boto3 sketch below shows how these three entities map to low-level API calls.
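
This is a hedged sketch only; the image URI, S3 model path, IAM role ARN, and resource names are placeholders you would replace with your own:

import boto3

sm = boto3.client("sagemaker")

# 1. SageMaker Model: inference image + model artifacts + execution role
sm.create_model(
    ModelName="demo-model",
    PrimaryContainer={
        "Image": "<account>.dkr.ecr.<region>.amazonaws.com/<inference-image>:<tag>",
        "ModelDataUrl": "s3://<bucket>/model/model.tar.gz",
    },
    ExecutionRoleArn="arn:aws:iam::<account>:role/<sagemaker-execution-role>",
)

# 2. Endpoint Configuration: the instances backing the endpoint
sm.create_endpoint_config(
    EndpointConfigName="demo-endpoint-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "demo-model",
            "InstanceType": "ml.c5.xlarge",
            "InitialInstanceCount": 1,
        }
    ],
)

# 3. Endpoint: the REST endpoint you invoke for predictions
sm.create_endpoint(
    EndpointName="demo-endpoint",
    EndpointConfigName="demo-endpoint-config",
)

# Once the endpoint shows InService (check with describe_endpoint),
# invoke it through the runtime client
runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="demo-endpoint",
    ContentType="text/csv",
    Body="1.0,2.0,3.0,4.0",
)
print(response["Body"].read())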

Real-Time Endpoint Requirements

Before you create an Endpoint, it’s essential to identify the type of Model you wish to host. If it’s a Framework model such as TensorFlow, PyTorch, or MXNet, you can use one of the pre-built Framework images. If it’s a custom model, or you desire complete flexibility in crafting the container that SageMaker will execute for inference, then you can build your own container.

SageMaker Endpoints consist of a SageMaker Model and an Endpoint Configuration. If you are using Boto3, you must create both objects. However, if you use the SageMaker Python SDK, the Endpoint Configuration is created automatically when you invoke the .deploy() method, as in the sketch below.
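
For example, here is a minimal sketch using the SageMaker Python SDK with a pre-trained SKlearn model; the S3 path, role ARN, entry point script, and framework version are assumptions for illustration:

from sagemaker.sklearn.model import SKLearnModel

# Wrap a pre-trained model.joblib (packaged as model.tar.gz) in a SageMaker Model
model = SKLearnModel(
    model_data="s3://<bucket>/model/model.tar.gz",
    role="arn:aws:iam::<account>:role/<sagemaker-execution-role>",
    entry_point="inference.py",      # your inference script
    framework_version="1.0-1",       # assumed framework version
)

# deploy() creates the Endpoint Configuration and Endpoint for you
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.c5.xlarge",
)

print(predictor.predict([[1.0, 2.0, 3.0, 4.0]]))

When you are done experimenting, predictor.delete_endpoint() tears the endpoint down so you stop paying for the instance.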

SageMaker Entities:

  • SageMaker Model: This includes the details of the inference image, the location of the model artifacts in Amazon S3, network configuration, and the AWS Identity and Access Management (IAM) role to be utilized by the Endpoint.

SageMaker requires your model artifacts to be compressed into a .tar.gz file. The platform automatically extracts this .tar.gz file into the /opt/ml/model/ directory in your container. If you are using one of the framework containers, such as TensorFlow, PyTorch, or MXNet, your TAR structure should be as follows:

TensorFlow
model.tar.gz/
|--[model_version_number]/
|   |--variables
|   |--saved_model.pb
|--code/
    |--inference.py
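
As a rough sketch, that structure could be packaged from a matching local directory layout; the folder names below (a version folder named 1 and a code folder next to it) are assumptions about where your files live:

import tarfile

# Package the SavedModel and inference script into model.tar.gz
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("1", arcname="1")        # [model_version_number]/ with saved_model.pb and variables
    tar.add("code", arcname="code")  # code/inference.py

The resulting archive is then uploaded to Amazon S3 (for example, with sagemaker.Session().upload_data) and referenced as the model data location when you create the SageMaker Model.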


