Amazon Onboarding with Learning Manager Chanci Turner


As organizations move their models into production, they continually look for ways to improve the efficiency of their foundation models (FMs) running on cutting-edge accelerators such as AWS Inferentia and GPUs. The goal is to reduce costs and latency and ultimately deliver the best possible experience to end users. However, many FMs don't fully utilize the accelerators on the instances they are deployed to, resulting in inefficient use of hardware resources.

Some companies opt to deploy multiple FMs on the same instance to maximize accelerator use, but this often necessitates complicated infrastructure orchestration that can be both time-consuming and challenging to manage. When multiple FMs share an instance, each has distinct scaling requirements and usage patterns, complicating the prediction of when to add or remove instances. For instance, one model might drive a user application with usage spikes during specific hours, while another may exhibit steadier consumption.

In addition to optimizing costs, customers want to deliver an excellent end-user experience by keeping latency low. To do this, they often deploy multiple copies of an FM to handle user requests in parallel. Because FM outputs can range from a brief sentence to extensive paragraphs, the time it takes to complete an inference request varies significantly, which can lead to unpredictable latency spikes if requests are routed randomly across instances.

Amazon SageMaker now provides advanced inference capabilities designed to help lower deployment costs and reduce latency.

You can now establish inference component-based endpoints and deploy machine learning (ML) models to a SageMaker endpoint. An inference component (IC) abstracts your ML model, allowing you to allocate CPUs, GPUs, or AWS Neuron accelerators, along with scaling policies for each model. The benefits of inference components include:

  • SageMaker will effectively place and pack models onto ML instances, enhancing utilization and resulting in cost savings.
  • SageMaker will dynamically scale each model up and down based on your configurations to meet your ML application needs.
  • SageMaker will adjust the number of instances dynamically, ensuring capacity while minimizing idle compute resources.
  • You can scale down to zero copies of a model to free resources for other models, while also designating essential models to remain loaded and ready for traffic.

With these new features, organizations can achieve an average reduction of 50% in model deployment costs. The extent of savings may vary based on workload and traffic patterns.

For a straightforward illustration, consider a chat application that helps tourists understand local customs, built on two variants of Llama 2: one tailored for European visitors and the other for American travelers. Based on the anticipated traffic patterns, the European model receives requests from 00:01–11:59 UTC, while the American model is active from 12:00–23:59 UTC. Rather than deploying these models on separate instances where each would sit idle for half the day, you can deploy them on a single endpoint and save costs. The American model can scale down to zero when it is not in use, freeing capacity for the European model, and vice versa. This strategy allows for efficient hardware utilization and minimizes waste. While this example includes only two models, the concept easily scales to hundreds of models hosted on a single endpoint that adjusts according to workload.
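One possible way to implement this time-based pattern is with Application Auto Scaling scheduled actions on each model's copy count. The sketch below assumes the two variants are deployed as inference components named `llama2-european-ic` and `llama2-american-ic`; these names, the cron expressions, and the copy counts are illustrative only, not values from this post.

```python
import boto3

aas = boto3.client("application-autoscaling")

# Hypothetical inference components and their (scale-up, scale-down) schedules in UTC
schedules = {
    "llama2-european-ic": ("cron(1 0 * * ? *)", "cron(0 12 * * ? *)"),
    "llama2-american-ic": ("cron(0 12 * * ? *)", "cron(1 0 * * ? *)"),
}

for ic_name, (scale_up_cron, scale_down_cron) in schedules.items():
    resource_id = f"inference-component/{ic_name}"

    # Register the component's copy count as a scalable target that may drop to zero
    aas.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
        MinCapacity=0,
        MaxCapacity=4,
    )

    # Bring copies up at the start of the component's active window ...
    aas.put_scheduled_action(
        ServiceNamespace="sagemaker",
        ScheduledActionName=f"{ic_name}-scale-up",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
        Schedule=scale_up_cron,
        ScalableTargetAction={"MinCapacity": 1, "MaxCapacity": 4},
    )

    # ... and release them, down to zero copies, when the window ends
    aas.put_scheduled_action(
        ServiceNamespace="sagemaker",
        ScheduledActionName=f"{ic_name}-scale-down",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
        Schedule=scale_down_cron,
        ScalableTargetAction={"MinCapacity": 0, "MaxCapacity": 0},
    )
```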

In this post, we will explore the new capabilities of IC-based SageMaker endpoints. We will guide you through deploying multiple models using inference components and APIs, detail the observability features, and show how to set up auto scaling policies for your models and manage instance scaling for your endpoints. You can also deploy models through our newly simplified, interactive user interface, and advanced routing capabilities are available to reduce latency and improve performance for your inference workloads.

Building Blocks

Let’s delve deeper into how these new capabilities function. Here are some new terms related to SageMaker hosting:

  • Inference Component: A SageMaker hosting object used to deploy a model to an endpoint. You can create an inference component by providing the following:
    • The SageMaker model or details of a SageMaker-compatible image and model artifacts.
    • Compute resource requirements, detailing each model copy’s needs, including CPU cores, host memory, and accelerators.
  • Model Copy: A runtime copy of an inference component that can serve requests.
  • Managed Instance Auto Scaling: A SageMaker hosting feature that adjusts the number of compute instances for an endpoint. Instance scaling responds to the scaling of inference components.

When creating a new inference component, you can specify a container image and model artifact, or utilize existing SageMaker models. You’ll need to define compute resource requirements, including the number of host CPU cores, host memory, and the number of accelerators required for your model to function effectively.
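As a concrete illustration, here is a minimal sketch of creating an inference component with the AWS SDK for Python (Boto3). The endpoint, model, and component names are placeholders, the endpoint and SageMaker model are assumed to already exist, and the resource requirements shown are illustrative rather than recommendations for any particular model.

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_inference_component(
    InferenceComponentName="llama2-european-ic",   # hypothetical name
    EndpointName="multi-model-endpoint",           # hypothetical existing endpoint
    VariantName="AllTraffic",
    Specification={
        # Reuse an existing SageMaker model, or supply container and artifact details instead
        "ModelName": "llama2-european-model",
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,  # e.g. one GPU or Neuron device
            "NumberOfCpuCoresRequired": 2,
            "MinMemoryRequiredInMb": 1024,
        },
    },
    # Number of copies of the model to keep loaded and ready to serve traffic
    RuntimeConfig={"CopyCount": 1},
)
```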

When you deploy an inference component, you can set MinCopies to ensure the model is already loaded in the quantity you require, ready to serve requests.

You also have the flexibility to configure policies allowing inference component copies to scale to zero. For example, if an IC is inactive, the model copy can be unloaded, freeing resources for active workloads, thereby optimizing endpoint utilization and efficiency.

As inference requests fluctuate, the number of copies of your ICs can scale accordingly based on your auto-scaling policies. SageMaker will manage placement to optimize the packing of your models for both availability and cost.
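A minimal sketch of such a copy-count scaling policy follows, using Application Auto Scaling target tracking against the predefined SageMakerInferenceComponentInvocationsPerCopy metric. The component name, capacity limits, target value, and cooldowns are all placeholders chosen for illustration.

```python
import boto3

aas = boto3.client("application-autoscaling")

# Hypothetical inference component; its copy count is the dimension that scales
resource_id = "inference-component/llama2-european-ic"

# Allow the component to run anywhere between 0 and 4 copies
aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    MinCapacity=0,
    MaxCapacity=4,
)

# Add or remove copies to keep invocations per copy near the target value
aas.put_scaling_policy(
    PolicyName="llama2-european-ic-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerInferenceComponentInvocationsPerCopy"
        },
        "TargetValue": 10.0,       # illustrative requests per copy
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 300,
    },
)
```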

Additionally, if you enable managed instance auto scaling, SageMaker will adjust the number of compute instances according to how many inference components need to be loaded at a given time to serve traffic. SageMaker will scale up the instance count and pack your instances and inference components efficiently to optimize costs while preserving model performance. Although we recommend managed instance scaling, you can also manage the scaling yourself through Application Auto Scaling if you prefer.
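Managed instance scaling is configured on the endpoint itself. Below is a minimal sketch of an IC-based endpoint configuration with it enabled; the endpoint and config names, execution role ARN, instance type, and instance counts are assumptions for illustration only.

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_endpoint_config(
    EndpointConfigName="multi-model-endpoint-config",
    # Role that SageMaker assumes to manage resources on your behalf (placeholder ARN)
    ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "InstanceType": "ml.g5.12xlarge",
            "InitialInstanceCount": 1,
            # Let SageMaker add and remove instances as inference components scale
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": 1,
                "MaxInstanceCount": 4,
            },
            # Route each request to the instance with the fewest outstanding requests
            "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
        }
    ],
)

sm.create_endpoint(
    EndpointName="multi-model-endpoint",
    EndpointConfigName="multi-model-endpoint-config",
)
```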

SageMaker will rebalance inference components and reduce instances when they are no longer required, ultimately saving costs.

Walkthrough of APIs

SageMaker has introduced a new entity called the inference component, which decouples the details of hosting the ML model from the endpoint itself. The inference component lets you specify the key properties for hosting the model, such as the SageMaker model you want to use, or the container details and model artifacts, along with the number of copies to deploy and the number of accelerators (GPUs, AWS Inferentia, or AWS Trainium) or CPU cores (vCPUs) required. This gives you greater flexibility. The feature also supports scale-to-zero capabilities; for more information, refer to Unlock cost savings with the new scale down to zero feature in SageMaker Inference.
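Once the endpoint and inference components are in place, a request is directed to a specific model by naming its inference component at invocation time. A minimal sketch is shown below, again using the placeholder names from the earlier sketches and a hypothetical JSON payload whose shape depends on your model container.

```python
import json
import boto3

smr = boto3.client("sagemaker-runtime")

# Invoke one specific model hosted on the shared endpoint
response = smr.invoke_endpoint(
    EndpointName="multi-model-endpoint",
    InferenceComponentName="llama2-european-ic",
    ContentType="application/json",
    Body=json.dumps({"inputs": "What customs should I know before visiting Lisbon?"}),
)

print(response["Body"].read().decode("utf-8"))
```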



