Deploy BLOOM-176B and OPT-30B on Amazon SageMaker with large model inference Deep Learning Containers and DeepSpeed
In recent years, the field of deep learning has advanced at a remarkable pace. Yet even with better hardware, including the latest accelerators from NVIDIA and Amazon, machine learning practitioners regularly run into challenges when deploying large deep learning models for applications such as natural language processing (NLP).
In a previous blog post, we explored various capabilities and configurable settings in Amazon SageMaker model deployment that simplify the inference process for large models. Today, we are excited to introduce a new Amazon SageMaker Deep Learning Container (DLC) that enables you to initiate large model inference in just a few minutes. This DLC bundles some of the most widely-used open-source libraries for model parallel inference, including DeepSpeed and Hugging Face Accelerate.
In this article, we use the SageMaker large model inference DLC to deploy two prominent large NLP models: BigScience’s BLOOM-176B and Meta’s OPT-30B from the Hugging Face repository. Specifically, we use Deep Java Library (DJL) Serving and the tensor parallelism technique from DeepSpeed to achieve 0.1-second per-token latency in a text generation use case. You can find our complete example notebooks in our GitHub repository.
Techniques for Large Model Inference
Recently, language models have surged in both size and popularity. With easy access to model zoos like Hugging Face and enhanced accuracy in NLP tasks such as classification and text generation, practitioners are increasingly inclined to use these large models. However, due to their sheer size, these models often cannot fit within the memory constraints of a single accelerator. For instance, the BLOOM-176B model may require over 350 gigabytes of accelerator memory, greatly surpassing the capabilities of current hardware. This situation necessitates the employment of model parallel techniques from libraries such as DeepSpeed and Hugging Face Accelerate to distribute the model across multiple accelerators during inference. In this post, we will utilize the SageMaker large model inference container to assess and compare latency and throughput performance using these two open-source libraries.
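To make the memory pressure concrete, the following back-of-the-envelope calculation sketches why a model of this size must be sharded. The byte counts and the tensor parallel degree of 8 are illustrative assumptions, not measurements.

```python
# Rough estimate of why BLOOM-176B cannot fit on a single accelerator.
# All figures are illustrative assumptions, not measured values.
num_parameters = 176e9      # BLOOM-176B parameter count
bytes_per_param = 2         # FP16/BF16 weights
weights_gb = num_parameters * bytes_per_param / 1e9
print(f"Weights alone: ~{weights_gb:.0f} GB")  # ~352 GB, before activations and KV cache

# No single GPU offers this much memory, so the weights must be
# partitioned across several devices.
tensor_parallel_degree = 8  # hypothetical sharding degree
print(f"Per-device share: ~{weights_gb / tensor_parallel_degree:.0f} GB")
```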
DeepSpeed and Accelerate adopt different strategies to optimize large language models for inference. The primary distinction lies in DeepSpeed’s use of optimized kernels, which can substantially enhance inference latency by alleviating bottlenecks within the model’s computation graph. Developing optimized kernels can be complex and is typically tailored to specific model architectures. DeepSpeed supports popular large models, including OPT and BLOOM, with these optimized kernels. In contrast, the Hugging Face Accelerate library does not include optimized kernels at this time. As we will outline in our results section, this variance accounts for much of the performance advantage DeepSpeed holds over Accelerate.
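As a minimal sketch of how kernel injection is typically enabled in DeepSpeed (the model name and parallel degree are placeholders, and argument names can vary between DeepSpeed versions):

```python
import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-30b"   # placeholder; any architecture DeepSpeed supports
tensor_parallel_degree = 4        # placeholder value

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# init_inference shards the model with tensor parallelism and, with
# replace_with_kernel_inject=True, swaps supported modules for
# DeepSpeed's fused inference kernels.
engine = deepspeed.init_inference(
    model,
    mp_size=tensor_parallel_degree,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)
model = engine.module
```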
Another key difference between DeepSpeed and Accelerate is the type of model parallelism they employ. Accelerate utilizes pipeline parallelism to divide a model across its hidden layers, while DeepSpeed leverages tensor parallelism to partition the layers themselves. Pipeline parallelism offers flexibility that accommodates various model types and enhances throughput with larger batch sizes. Conversely, tensor parallelism requires increased communication between GPUs since model layers are distributed across multiple devices but can improve inference latency by utilizing multiple GPUs simultaneously. For more insights on parallelism techniques, refer to the resources in Introduction to Model Parallelism and Model Parallelism.
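For comparison, Accelerate’s layer-wise placement is commonly driven through the device_map argument in Hugging Face Transformers. Again, the model name is a placeholder and this is only a sketch of the pattern:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-30b"   # placeholder

# With device_map="auto", Accelerate assigns contiguous groups of layers to
# the available GPUs (spilling to CPU or disk if necessary); a request then
# passes through the devices stage by stage rather than in parallel.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```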
Overview of the Solution
To effectively host large language models, we require features and support in several critical areas:
- Building and Testing Solutions: Given the iterative nature of ML development, it’s essential to rapidly build, iterate, and test the behavior of the inference endpoint when hosting these models, including the capacity to fail quickly. Typically, hosting such models necessitates larger instances like p4dn or g5, and due to their size, launching an inference instance and executing test iterations can be time-consuming. Local testing often has limitations since similar instance sizes are required, and these models may be challenging to obtain.
- Deploying and Running at Scale: Loading model files onto inference instances presents a significant challenge due to their size. For BLOOM-176B, for example, creating the model tarball takes approximately one hour, and un-tarring it at load time takes roughly another hour. We need alternative methods that provide easier access to these model files.
- Singleton Model Loading: For a multi-worker process, it is critical to ensure the model is loaded only once to avoid race conditions and unnecessary resource expenditure. In this post, we illustrate a method to load directly from Amazon Simple Storage Service (Amazon S3); see the configuration sketch after this list. However, this approach is viable only if we adhere to the default settings of DJL. Additionally, any scaling of endpoints must be able to initialize within a few minutes, which calls for a reevaluation of how models are loaded and distributed.
- Sharding Frameworks: These models often require sharding, typically through tensor parallelism mechanisms or pipeline sharding, along with advanced concepts like ZeRO sharding built upon tensor sharding. For further information on sharding techniques, refer to Model Parallelism. We can utilize various combinations and frameworks from NVIDIA, DeepSpeed, and others, requiring the ability to test BYOC or utilize 1P containers, iterate over solutions, and conduct benchmarking tests. It may also be beneficial to explore different hosting options such as asynchronous and serverless configurations.
- Hardware Selection: Your hardware selection is influenced by all the previously mentioned factors, in addition to traffic patterns, use case requirements, and model sizes.
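The configuration sketch referenced above shows how the S3 loading and sharding requirements can be expressed in a DJL Serving serving.properties file. The bucket path, parallel degree, and exact option names are assumptions and may differ across large model inference container versions.

```python
# Write an illustrative serving.properties for DJL Serving.
# Values and option names are assumptions; consult the container
# documentation for the exact keys supported by your version.
config = """\
engine=DeepSpeed
option.s3url=s3://my-bucket/bloom-176b/
option.tensor_parallel_degree=8
option.dtype=fp16
"""
with open("serving.properties", "w") as f:
    f.write(config)
```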
In this article, we deploy BLOOM-176B and OPT-30B on SageMaker using DeepSpeed’s optimized kernels and tensor parallelism techniques. We also compare results from Accelerate to highlight the performance advantages of optimized kernels and tensor parallelism. For more details on DeepSpeed and Accelerate, visit DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale and Incredibly Fast BLOOM Inference with DeepSpeed and Accelerate.
For the deployment, we will utilize DJLServing as the model serving solution. DJLServing is a high-performance and universal model serving solution powered by the Deep Java Library (DJL), which is programming language agnostic. To learn more about DJL and DJLServing, please refer to Deploy large models on Amazon SageMaker using DJLServing and DeepSpeed model parallel inference.
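To illustrate what a DJL Serving Python entry point can look like, here is a condensed, hypothetical model.py. The handle function and the Input/Output classes come from the djl_python package; the property names and generation parameters are placeholders rather than the exact code from our notebooks.

```python
# model.py -- condensed, illustrative DJL Serving entry point
import deepspeed
import torch
from djl_python import Input, Output
from transformers import AutoModelForCausalLM, AutoTokenizer

model = None
tokenizer = None


def load_model(properties):
    # tensor_parallel_degree and the model location are passed through
    # from serving.properties by DJL Serving.
    tp_degree = int(properties.get("tensor_parallel_degree", 1))
    model_location = properties.get("model_dir")
    tok = AutoTokenizer.from_pretrained(model_location)
    mdl = AutoModelForCausalLM.from_pretrained(model_location, torch_dtype=torch.float16)
    engine = deepspeed.init_inference(
        mdl,
        mp_size=tp_degree,
        dtype=torch.float16,
        replace_with_kernel_inject=True,
    )
    return engine.module, tok


def handle(inputs: Input) -> Output:
    global model, tokenizer
    if model is None:
        # Load once per worker process to avoid duplicate copies in memory.
        model, tokenizer = load_model(inputs.get_properties())
    if inputs.is_empty():
        return None  # warm-up request from the model server
    prompt = inputs.get_as_json()["text"]
    tokens = tokenizer(prompt, return_tensors="pt").to(torch.cuda.current_device())
    output_ids = model.generate(**tokens, max_new_tokens=64)
    return Output().add_as_json({"generated_text": tokenizer.decode(output_ids[0])})
```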
It’s important to note that optimized kernels may lead to changes in precision and the computation graph, which could theoretically influence model behavior. Although such variations might occasionally alter inference outcomes, we do not anticipate these differences will significantly affect the fundamental evaluation metrics of a model. Nevertheless, practitioners are encouraged to verify that model outputs align with expectations when utilizing these kernels.
The following steps illustrate how to deploy a BLOOM-176B model in SageMaker using DJLServing and a SageMaker large model inference container. The complete example is also available on our GitHub repository.
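Before diving into the notebook, the condensed sketch below shows the overall deployment flow with the SageMaker Python SDK. The container version, S3 paths, instance type, and endpoint name are placeholders; check the SDK and your Region for the image versions actually available.

```python
import sagemaker
from sagemaker import image_uris, serializers
from sagemaker.model import Model
from sagemaker.predictor import Predictor

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Retrieve a SageMaker large model inference (DJL + DeepSpeed) container image.
image_uri = image_uris.retrieve(
    framework="djl-deepspeed",
    region=session.boto_region_name,
    version="0.21.0",  # placeholder version
)

# model_data packages serving.properties and model.py; the large model
# weights themselves are streamed from S3 via option.s3url.
model = Model(
    image_uri=image_uri,
    model_data="s3://my-bucket/code/bloom-176b-code.tar.gz",  # placeholder
    role=role,
    predictor_cls=Predictor,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.p4de.24xlarge",   # placeholder; size to your model
    endpoint_name="bloom-176b-djl-ds",  # placeholder
    serializer=serializers.JSONSerializer(),
)

# Illustrative invocation
print(predictor.predict({"text": "Large model inference on SageMaker is"}))
```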