Training large language models (LLMs) with billions of parameters presents significant challenges. Researchers must not only design model architectures but also implement state-of-the-art distributed training techniques, including mixed precision support, gradient accumulation, and checkpointing. As model sizes grow, the challenge compounds, because the memory capacity of a single accelerator device limits the size of models that can be trained with data parallelism alone, and implementing model parallel training requires further modifications to the training code. Libraries like DeepSpeed, an open-source optimization library for PyTorch, help mitigate these hurdles and accelerate model development and training.
In this article, we configure training on the Intel Habana Gaudi-based Amazon Elastic Compute Cloud (Amazon EC2) DL1 instances and evaluate the advantages of using a scaling framework like DeepSpeed. We showcase scaling results for an encoder-type transformer model (BERT with 340 million to 1.5 billion parameters). For the 1.5-billion-parameter model, we achieved a scaling efficiency of 82.7% across 128 accelerators (16 dl1.24xlarge instances) using DeepSpeed ZeRO stage 1 optimizations. DeepSpeed partitioned the optimizer states to enable large model training within the data parallel paradigm, which we have successfully extended to train a 5-billion-parameter model. We also utilized Gaudi’s native support for the BF16 data type, resulting in reduced memory usage and enhanced training performance when compared to the FP32 data type. Consequently, we reached model convergence in the pre-training phase (phase 1) within 16 hours for the BERT 1.5-billion-parameter model using the wikicorpus-en dataset.
Training Setup
Using AWS Batch, we provisioned a managed compute cluster consisting of 16 dl1.24xlarge instances. We created a workshop that guides users through the process of setting up a distributed training cluster with AWS Batch. Each dl1.24xlarge instance is equipped with eight Habana Gaudi accelerators, each featuring 32 GB of memory and a full mesh RoCE network between cards, for a bi-directional interconnect bandwidth of 700 Gbps per card (more details can be found in the Amazon EC2 DL1 instances Deep Dive). Each dl1.24xlarge instance also uses four AWS Elastic Fabric Adapters (EFA), providing a combined interconnect of 400 Gbps between nodes.
The workshop demonstrates how to establish a distributed training environment. It specifically highlights using AWS Batch’s multi-node parallel jobs feature to launch large-scale containerized training jobs on completely managed clusters. A fully managed AWS Batch compute environment is created with DL1 instances, where containers are automatically pulled from Amazon Elastic Container Registry (Amazon ECR) and deployed to instances based on the multi-node parallel job definition. The workshop concludes with a multi-node, multi-HPU data parallel training of a BERT model (ranging from 340 million to 1.5 billion parameters) using PyTorch and DeepSpeed.
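To make the multi-node parallel job mechanism concrete, the snippet below is a minimal sketch of how such a job definition might be registered with boto3. The job name, ECR image URI, node count, and resource sizes are placeholders rather than the workshop's actual values, and device-mapping details for the Gaudi accelerators are omitted.

```python
# Minimal sketch (not the workshop's actual job definition): registering an
# AWS Batch multi-node parallel job definition with boto3.
import boto3

batch = boto3.client("batch")

batch.register_job_definition(
    jobDefinitionName="bert-deepspeed-pretraining",  # hypothetical name
    type="multinode",                                # multi-node parallel job
    nodeProperties={
        "numNodes": 16,      # one node per dl1.24xlarge instance
        "mainNode": 0,       # node 0 coordinates the distributed launch
        "nodeRangeProperties": [
            {
                "targetNodes": "0:15",
                "container": {
                    # Placeholder image URI; the real image lives in Amazon ECR
                    "image": "<account>.dkr.ecr.<region>.amazonaws.com/bert-habana:latest",
                    "resourceRequirements": [
                        {"type": "VCPU", "value": "96"},
                        {"type": "MEMORY", "value": "760000"},  # MiB, illustrative
                    ],
                },
            }
        ],
    },
)
```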
BERT 1.5B Pre-training with DeepSpeed
Habana SynapseAI v1.5 and v1.6 support DeepSpeed ZeRO stage 1 optimizations. The Habana fork of the DeepSpeed GitHub repository includes the modifications necessary to support Gaudi accelerators. It offers full support for distributed data parallel training (multi-card, multi-instance), ZeRO stage 1 optimizations, and the BF16 data type.
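As a rough illustration of how these pieces fit together, the following sketch shows a DeepSpeed configuration that enables ZeRO stage 1 and BF16 and wraps a model with deepspeed.initialize. The stand-in model, optimizer choice, and batch settings are assumptions for illustration only; the reference repository ships its own configuration and model code.

```python
import torch
import deepspeed

# Stand-in for the BERT model; the reference repository uses a 48-layer encoder.
model = torch.nn.Linear(1024, 1024)

ds_config = {
    "train_micro_batch_size_per_gpu": 16,
    "gradient_accumulation_steps": 24,
    "zero_optimization": {"stage": 1},  # partition optimizer states across ranks
    "bf16": {"enabled": True},          # BF16 data type, natively supported by Gaudi
    "gradient_clipping": 1.0,
    "optimizer": {"type": "AdamW", "params": {"lr": 1.5e-3}},  # illustrative settings
}

# deepspeed.initialize returns a training engine that handles gradient
# accumulation, BF16 casting, and ZeRO stage 1 optimizer-state partitioning.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```

In a multi-node run, a script along these lines would be launched on each node (for example, with the deepspeed or mpirun launcher), and the resulting engine drives the forward, backward, and optimizer steps.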
These features are implemented in the BERT 1.5B model reference repository, which provides a 48-layer, 1600-hidden-dimension, 25-head bidirectional encoder model derived from an original BERT implementation. The repository also contains the baseline BERT Large implementation, a 24-layer, 1024-hidden, 16-head, 340-million-parameter architecture. The pre-training scripts are adapted from the NVIDIA Deep Learning Examples repository to download the wikicorpus_en data, preprocess it into tokens, and shard it into smaller h5 datasets for distributed data parallel training. You can use this generic approach to train your own custom PyTorch architectures on your own datasets on DL1 instances.
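To put the architecture numbers in perspective, the hypothetical snippet below approximates the 1.5B encoder with a Hugging Face BertConfig and checks the parameter count; this assumes the transformers library and is not the repository's own implementation.

```python
from transformers import BertConfig, BertForPreTraining

config = BertConfig(
    num_hidden_layers=48,      # 48 layers
    hidden_size=1600,          # hidden dimension of 1600
    num_attention_heads=25,    # 25 attention heads (head dimension 64)
    intermediate_size=6400,    # assumed 4x hidden size, as in standard BERT
)
model = BertForPreTraining(config)

num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e9:.2f}B parameters")  # roughly 1.5B
```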
Pre-training (Phase 1) Scaling Results
For large-scale pre-training of models, we concentrated on two aspects: training performance, measured by the time required to train, and the cost-effectiveness of achieving a fully converged solution. We will examine these metrics using BERT 1.5B pre-training as a case study.
Scaling Performance and Time to Train
We began by assessing the performance of the BERT Large implementation as a scalability baseline. The table below outlines the throughput of sequences per second across 1-8 dl1.24xlarge instances (with eight accelerator devices per instance). Using the single-instance throughput as a reference, we evaluated the efficiency of scaling across multiple instances, an essential factor for understanding price-performance training metrics.
| Number of Instances | Number of Accelerators | Sequences per Second | Sequences per Second per Accelerator | Scaling Efficiency |
|---|---|---|---|---|
| 1 | 8 | 1,379.76 | 172.47 | 100.0% |
| 2 | 16 | 2,705.57 | 169.10 | 98.04% |
| 4 | 32 | 5,291.58 | 165.36 | 95.88% |
| 8 | 64 | 9,977.54 | 155.90 | 90.39% |
The following figure illustrates the scaling efficiency.
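The scaling efficiency values in the table are simply the per-accelerator throughput relative to the single-instance baseline; a quick check in Python:

```python
# Scaling efficiency = per-accelerator throughput at N accelerators divided by
# the per-accelerator throughput of the single-instance (8-accelerator) baseline.
baseline_per_accel = 1379.76 / 8  # 172.47 sequences/second per accelerator

for accelerators, seq_per_sec in [(16, 2705.57), (32, 5291.58), (64, 9977.54)]:
    efficiency = (seq_per_sec / accelerators) / baseline_per_accel
    print(f"{accelerators:>3} accelerators: {efficiency:.2%}")
# Prints approximately 98.04%, 95.88%, and 90.39%, matching the table.
```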
For the BERT 1.5B model, hyperparameters were adjusted to ensure convergence. The effective batch size per accelerator was set to 384 for optimal memory utilization, composed of micro-batches of 16 per step and 24 steps of gradient accumulation. Learning rates of 0.0015 and 0.003 were used for 8 and 16 nodes, respectively. With these configurations, BERT 1.5B converged in approximately 25 hours on 8 dl1.24xlarge instances (64 accelerators) and 15 hours on 16 dl1.24xlarge instances (128 accelerators). We tracked the average training loss as a function of epochs while scaling the number of accelerators.
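The batch-size bookkeeping from the preceding paragraph can be sanity-checked as follows; the global batch sizes assume standard data parallelism with eight accelerators per instance.

```python
# Effective batch per accelerator = micro-batch size x gradient accumulation steps.
micro_batch = 16
grad_accum_steps = 24
per_accel_batch = micro_batch * grad_accum_steps   # 384, as stated above

# Under data parallelism, the global batch scales with the accelerator count
# (8 Gaudi accelerators per dl1.24xlarge instance).
for instances in (8, 16):
    accelerators = instances * 8
    print(f"{instances:>2} instances ({accelerators} accelerators): "
          f"global batch = {per_accel_batch * accelerators}")
# 8 instances  -> 24576 sequences per optimizer step
# 16 instances -> 49152 sequences per optimizer step
```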
Using this configuration, we achieved 85% strong scaling efficiency with 64 accelerators and 83% with 128 accelerators, relative to the baseline of 8 accelerators in a single instance. The results are summarized in the following table.
| Number of Instances | Number of Accelerators | Sequences per Second | Sequences per Second per Accelerator | Scaling Efficiency |
|---|---|---|---|---|
| 1 | 8 | 276.66 | 34.58 | 100.0% |
| 8 | 64 | 1,883.63 | 29.43 | 85.1% |
| 16 | 128 | 3,659.15 | 28.59 | 82.7% |
The following figure illustrates the scaling efficiency.
Conclusion
In this article, we explored the DeepSpeed support in Habana SynapseAI v1.5/v1.6 and its effectiveness in scaling LLM training on Habana Gaudi accelerators, with phase 1 pre-training of the 1.5-billion-parameter BERT model converging within 16 hours.