Enhance Inference Performance for LLMs with New Amazon SageMaker Containers


Amazon SageMaker has released an update (0.25.0) to its Large Model Inference (LMI) Deep Learning Containers (DLCs), which now support NVIDIA's TensorRT-LLM library. This update gives you seamless access to state-of-the-art tools for optimizing large language models (LLMs) on SageMaker and delivers significant price-performance improvements: the SageMaker LMI TensorRT-LLM DLC reduces latency by 33% on average and improves throughput by 60% on average for models such as Llama2-70B, Falcon-40B, and CodeLlama-34B, compared to previous versions.

The popularity of LLMs has surged across numerous applications, yet their size often means they cannot fit on a single accelerator or GPU device, which complicates low-latency inference and scalability. SageMaker addresses these challenges with LMI DLCs, which maximize resource utilization and enhance performance.

The latest LMI DLCs facilitate continuous batching of inference requests to boost throughput, employ efficient collective operations to reduce latency, use PagedAttention V2 for better performance with long sequence lengths, and incorporate NVIDIA's latest TensorRT-LLM library to optimize GPU performance. With a low-code interface that simplifies TensorRT-LLM compilation (only the model ID and optional parameters are required), the LMI DLC handles the complexities of building optimized models and creating model repositories. You can also leverage the latest quantization techniques, including GPTQ, AWQ, and SmoothQuant. As a result, using LMI DLCs on SageMaker shortens time-to-value for generative AI applications and lets you optimize LLMs for your chosen hardware to achieve the best price-performance.

In this article, we will explore the new features introduced in the latest LMI DLCs, present performance benchmarks, and outline the necessary steps to deploy LLMs effectively using these containers to enhance performance and minimize costs.

New Features in SageMaker LMI DLCs

Here, we will discuss three significant features of the SageMaker LMI DLCs.

Support for TensorRT-LLM

The latest LMI DLC release (0.25.0) now integrates NVIDIA’s TensorRT-LLM, enabling advanced optimizations such as SmoothQuant, FP8, and continuous batching for LLMs on NVIDIA GPUs. TensorRT-LLM facilitates ultra-low latency performance, significantly enhancing overall efficiency. This SDK supports deployments from single-GPU to multi-GPU configurations, with further performance improvements achievable through tensor parallelism. To utilize the TensorRT-LLM library, select the TensorRT-LLM DLC from the available options and configure settings like engine=MPI, along with option.model_id. For further insights into this topic, check out this blog post.
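As a rough illustration, a minimal serving.properties for the TensorRT-LLM DLC might look like the following. The model ID and tensor parallel degree are placeholder values for a sketch, not a recommendation:

engine=MPI
# Hugging Face model ID (placeholder; a supported model ID or S3 path can be used)
option.model_id=meta-llama/Llama-2-13b-hf
# Number of GPUs to shard the model across (placeholder value)
option.tensor_parallel_degree=4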

Efficient Inference Collective Operations

In typical LLM implementations, model parameters are distributed across multiple accelerators to manage the size of large models. This configuration allows for parallel processing of partial calculations, which enhances inference speed. A collective operation is subsequently employed to aggregate these partial results and redistribute them among the accelerators. The latest LMI DLCs on P4D instance types introduce an optimized collective operation that accelerates communication between GPUs, resulting in lower latency and higher throughput. This feature is natively supported by LMI DLCs, requiring no additional configuration, making it a unique offering within Amazon SageMaker.

Quantization Support

SageMaker LMI DLCs integrate the latest quantization techniques: pre-quantized models that use GPTQ or Activation-aware Weight Quantization (AWQ), and just-in-time quantization with SmoothQuant. GPTQ lets you run popular INT3 and INT4 models from Hugging Face, reducing model size for single- or multi-GPU setups. AWQ inference is faster, while SmoothQuant enables INT8 quantization that cuts memory usage and computational cost with negligible accuracy loss. SmoothQuant models can be converted just-in-time with no extra steps, whereas GPTQ and AWQ models must be quantized with a calibration dataset before they can be used with LMI DLCs. More detail on these techniques is available in the SageMaker large model inference documentation.
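As a hedged sketch of how just-in-time quantization can be requested, the snippet below assumes the option.quantize property described in the LMI documentation; the model ID and property value are illustrative and may differ for your model and container version:

engine=MPI
option.model_id=meta-llama/Llama-2-13b-hf
# Request just-in-time SmoothQuant quantization (assumed property value; verify against the LMI docs)
option.quantize=smoothquant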

Deploying with SageMaker LMI DLCs

You can seamlessly deploy your LLMs on SageMaker using the new LMI DLC version 0.25.0 without modifying your existing code. The LMI DLCs utilize DJL serving to facilitate inference. To begin, create a configuration file that outlines settings, such as model parallelization and the inference optimization libraries to be employed. For detailed instructions and tutorials, see the resources on Model parallelism and large model inference.
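The snippet below is a minimal deployment sketch using the SageMaker Python SDK. The image URI, S3 locations, instance type, and endpoint name are placeholders rather than the exact values used in this post, and it assumes your serving.properties is packaged in the model artifact referenced by model_data:

import sagemaker
from sagemaker.model import Model

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # IAM role with SageMaker permissions

# Placeholders: use the LMI DLC image URI for your Region and your own S3 artifact
# containing serving.properties (and any custom inference code).
image_uri = "<LMI DLC 0.25.0 image URI for your Region>"
model_data = "s3://<your-bucket>/llama2-70b/code/model.tar.gz"

model = Model(
    image_uri=image_uri,
    model_data=model_data,
    role=role,
    sagemaker_session=session,
)

model.deploy(
    initial_instance_count=1,
    instance_type="ml.p4d.24xlarge",  # multi-GPU instance; choose based on model size
    endpoint_name="lmi-llama2-70b",   # hypothetical endpoint name
    container_startup_health_check_timeout=900,  # allow time for model download and compilation
)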

The DeepSpeed container incorporates the LMI Distributed Inference Library (LMI-Dist), which optimizes large model inference by leveraging various open-source libraries, including vLLM, Text-Generation-Inference, FasterTransformer, and DeepSpeed. This library harnesses popular technologies like FlashAttention, PagedAttention, FusedKernel, and efficient GPU communication kernels to enhance model performance while minimizing memory consumption.

TensorRT-LLM is an open-source library released by NVIDIA in October 2023 and optimized for inference acceleration. It includes a toolkit that simplifies the user experience with just-in-time model conversion, so you can provide a Hugging Face model ID and get end-to-end deployment. Continuous batching with streaming is also supported. Expect compilation times of roughly 1–2 minutes for the Llama-2 7B and 13B models, and around 7 minutes for the 70B model. To avoid compilation delays when setting up and scaling SageMaker endpoints, consider pre-compiling your models using our AOT tutorial. Additionally, any TensorRT-LLM model built for Triton Server can be used with LMI DLCs.
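Once an endpoint is running, responses can be streamed token by token. The sketch below uses the boto3 invoke_endpoint_with_response_stream API; the endpoint name matches the hypothetical deployment sketch above, the request schema assumes the Hugging Face-style payload accepted by the LMI default handlers, and streaming may need to be enabled in your container configuration:

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Stream generated tokens as the endpoint produces them.
response = runtime.invoke_endpoint_with_response_stream(
    EndpointName="lmi-llama2-70b",  # hypothetical endpoint name from the deployment sketch
    ContentType="application/json",
    Body=json.dumps({
        "inputs": "What is Amazon SageMaker?",
        "parameters": {"max_new_tokens": 128},
    }),
)

for event in response["Body"]:
    chunk = event.get("PayloadPart", {}).get("Bytes")
    if chunk:
        print(chunk.decode("utf-8"), end="")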

Performance Benchmarking Results

We evaluated the performance of the new SageMaker LMI DLC version (0.25.0) against the previous version (0.23.0), conducting tests on Llama-2 70B, Falcon 40B, and CodeLlama 34B models to illustrate the performance enhancements from TensorRT-LLM and efficient inference collective operations available on SageMaker.

SageMaker LMI containers come equipped with a default handler script for model loading and hosting, offering a low-code solution. Alternatively, users can implement their own scripts for customized model loading processes. Essential parameters must be included in a serving.properties file, which details the necessary configurations for the Deep Java Library (DJL) model server to download and host the model. Below is the serving.properties used for our deployment and benchmarking:

engine=MPI
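Only the engine selection is reproduced above. Purely as an illustration (not the exact configuration used for these benchmarks), a complete file for a Llama-2 70B deployment on a p4d.24xlarge instance could look like the following; the model ID, parallelism degree, and batch size are assumed values:

engine=MPI
# Assumed values for illustration only
option.model_id=meta-llama/Llama-2-70b-hf
option.tensor_parallel_degree=8
option.max_rolling_batch_size=64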

For those new to the onboarding process, the Model parallelism and large model inference resources referenced earlier are a good starting point.

