Amazon Onboarding with Learning Manager Chanci Turner

Two of the most widely used machine learning models today are BERT, which excels at natural language processing (NLP), and Mask R-CNN, a leading model for object detection and instance segmentation. In recent months, AWS has significantly enhanced its infrastructure, network, machine learning (ML) frameworks, and model code, resulting in record-breaking training times for these advanced models. We are thrilled to announce the fastest model training times to date on the cloud using TensorFlow, MXNet, and PyTorch. You can now leverage these hardware and software advancements to train your models with exceptional speed and efficiency.

The duration of model training directly influences your ability to iterate and enhance model accuracy swiftly. The most effective method to minimize training time is by distributing the workload across a large cluster of GPU instances. However, efficient distribution presents challenges. When a training job is spread across numerous workers, the overhead in communication often diminishes returns, as the extra GPU power is offset by the communication costs between instances.

BERT

BERT, or Bidirectional Encoder Representations from Transformers, is a leading NLP model that set the state of the art for numerous NLP tasks. Training BERT from scratch on a single Amazon EC2 P3dn.24xlarge instance, equipped with 8 NVIDIA V100 GPUs, typically takes several days using TensorFlow or PyTorch. However, we cut this training time from several days down to just over 60 minutes by scaling effectively across multiple P3dn.24xlarge instances, leveraging network enhancements via the Elastic Fabric Adapter (EFA), and optimizing convergence for this complex model on larger clusters. Currently, this represents the fastest cloud training time for BERT while achieving state-of-the-art accuracy (an F1 score of 90.5 or higher on the SQuAD v1.1 task after training on BooksCorpus and English Wikipedia).

With TensorFlow, we reached an unprecedented scale of 2,048 GPUs across 256 P3dn.24xlarge instances to train BERT in just 62 minutes. In PyTorch, we achieved a training time of 69 minutes by utilizing 1,536 GPUs on 192 P3dn.24xlarge instances. Our optimizations across the entire hardware and software stack for training BERT resulted in an 85% scaling efficiency, ensuring that the frameworks could effectively utilize the additional computing power from the GPUs when expanding to more P3dn.24xlarge nodes.
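
A quick way to read that 85% figure: scaling efficiency compares the throughput actually measured on N nodes with the throughput perfect linear scaling would give. The snippet below is only a minimal sketch of the calculation; the throughput values in it are made-up placeholders, not measurements from these runs.

```python
# Scaling efficiency = measured N-node throughput / (N * single-node throughput).
# The throughput values below are hypothetical, chosen only to illustrate the math.
single_node_throughput = 400.0    # e.g., sequences/sec on 1 P3dn.24xlarge (placeholder)
n_nodes = 32
measured_throughput = 10_880.0    # e.g., sequences/sec on 32 nodes (placeholder)

efficiency = measured_throughput / (n_nodes * single_node_throughput)
print(f"scaling efficiency: {efficiency:.0%}")   # prints "85%" for these example numbers
```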

| P3dn.24xlarge nodes | NVIDIA GPUs | Time to train (PyTorch) | Time to train (TensorFlow) |
|---------------------|-------------|-------------------------|----------------------------|
| 1                   | 8           | 6.4 days                | 7.5 days                   |
| 192                 | 1,536       | 69 min                  | –                          |
| 256                 | 2,048       | –                       | 62 min                     |

Mask R-CNN

Mask R-CNN is a prominent instance segmentation model used in applications such as autonomous driving and motion capture, which require precise object detection and segmentation. Training Mask R-CNN on a single P3dn.24xlarge instance (8 NVIDIA V100 GPUs) using MXNet, PyTorch, or TensorFlow typically takes around six hours. We reduced this time to approximately 25 minutes across all three ML frameworks by scaling the training to 24 P3dn.24xlarge instances, which provided 192 GPUs. This enables rapid iteration and allows multiple experiments to be conducted daily instead of waiting hours for each result. At present, this is the fastest cloud training time for Mask R-CNN while achieving state-of-the-art accuracy (0.377 box min AP and 0.339 mask min AP on the COCO2017 dataset).

| # of Nodes | # of GPUs | Time to train (MXNet) | Time to train (PyTorch) | Time to train (TensorFlow) |
|------------|-----------|-----------------------|-------------------------|----------------------------|
| 1          | 8         | 6.4 hrs               | 5.4 hrs                 | 6.2 hrs                    |
| 24         | 192       | 25 min                | 26 min                  | 27 min                     |

Technology Stack

Achieving these remarkable results necessitated optimizations across the underlying hardware, networking, and software stack. When training large models like BERT, communication among the multiple GPUs can become a bottleneck. In distributed computing, AllReduce is an operation that aggregates parameters from different workers (GPUs) and distributes the resultant array to all workers. GPUs collectively perform an AllReduce operation after each iteration, which consists of one forward and backward pass through the network.
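
To make that concrete, here is a minimal PyTorch sketch of where AllReduce sits in a data-parallel training step. This is not the code behind the results above; the model, loss function, and batch are placeholders, and it assumes a process group has already been initialized with one process per GPU.

```python
import torch.distributed as dist

def training_step(model, loss_fn, batch, targets):
    # Forward and backward pass: each worker computes gradients on its own data shard.
    loss = loss_fn(model(batch), targets)
    loss.backward()

    # AllReduce: sum gradients across all workers, then average, so every
    # worker applies the same update. Assumes dist.init_process_group(...)
    # was called earlier (e.g., with the NCCL backend, one process per GPU).
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad.div_(world_size)
    return loss
```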

The prevalent methods for executing AllReduce on GPUs involve using the NVIDIA Collective Communications Library (NCCL) or MPI libraries such as OpenMPI or Intel MPI Library. These libraries are typically designed for homogeneous clusters. In a homogeneous setup, each worker sends and receives data roughly twice the size of the model for every AllReduce operation. For example, the AllReduce operation for BERT, which encompasses 340 million parameters, entails sending approximately 650 MB of half-precision data twice, resulting in a significant communication overhead. This communication, which must occur after every iteration, quickly becomes a bottleneck during model training.
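
As a back-of-the-envelope check, the figures quoted above (roughly 340 million parameters, 2 bytes per half-precision gradient, and about twice the model size moved per worker per AllReduce) imply the following per-iteration traffic:

```python
# Per-worker communication for one BERT AllReduce, from the figures above.
params = 340e6                    # approximate BERT parameter count
bytes_per_param = 2               # FP16 / half-precision gradients
model_bytes = params * bytes_per_param

print(f"gradient payload: ~{model_bytes / 2**20:.0f} MiB")          # roughly 650 MiB
print(f"moved per AllReduce: ~{2 * model_bytes / 2**20:.0f} MiB")   # roughly 1,300 MiB sent and received
```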

The selection of an AllReduce algorithm is often determined by the network architecture. For instance, the Ring-AllReduce algorithm is optimal for networks where each node connects to two neighbors, forming a ring, while the Torus AllReduce algorithm suits networks with four connections per node, forming a two-dimensional lattice. AWS utilizes a more adaptable interconnect that permits any node to communicate with any other node at full bandwidth. For instance, in a cluster of 128 P3dn instances, any instance can communicate with any other instance at 100 Gbps.
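
To make the Ring-AllReduce mentioned above concrete, the following is a small single-process NumPy simulation of its two phases (reduce-scatter, then all-gather). Real implementations such as NCCL perform the same chunk exchanges as actual network transfers between neighboring GPUs; this sketch only illustrates the data movement pattern.

```python
import numpy as np

def ring_allreduce(worker_grads):
    """Single-process simulation of Ring-AllReduce (illustration only).

    Each element of worker_grads is one worker's gradient vector; every
    worker ends up with the elementwise sum after a reduce-scatter phase
    followed by an all-gather phase. In a real cluster, each chunk
    assignment below is a send/receive between ring neighbors.
    """
    n = len(worker_grads)
    # Each worker splits its gradient vector into n chunks.
    chunks = [list(np.array_split(np.asarray(g, dtype=np.float64), n))
              for g in worker_grads]

    # Phase 1: reduce-scatter. At step t, worker i receives chunk
    # (i - t - 1) mod n from its left neighbor and accumulates it.
    # After n - 1 steps, worker i holds the fully reduced chunk (i + 1) mod n.
    for t in range(n - 1):
        for i in range(n):
            left, idx = (i - 1) % n, (i - t - 1) % n
            chunks[i][idx] = chunks[i][idx] + chunks[left][idx]

    # Phase 2: all-gather. At step t, worker i receives the fully reduced
    # chunk (i - t) mod n from its left neighbor and overwrites its copy.
    for t in range(n - 1):
        for i in range(n):
            left, idx = (i - 1) % n, (i - t) % n
            chunks[i][idx] = chunks[left][idx].copy()

    return [np.concatenate(c) for c in chunks]

# Four toy "workers", each holding a 10-element gradient vector.
grads = [np.full(10, i + 1.0) for i in range(4)]
results = ring_allreduce(grads)
assert all(np.allclose(r, sum(grads)) for r in results)
```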

Additionally, this 100 Gbps interconnect is not exclusive to P3dn instances; you can integrate CPU-optimized C5n instances into the cluster without losing the 100 Gbps interconnect between any nodes. This flexibility of the AWS interconnect demands an AllReduce algorithm that fully exploits the unique capabilities of the AWS network. Consequently, we developed a custom AllReduce algorithm tailored for the AWS architecture. This specialized algorithm takes advantage of the 100 Gbps interconnect among nodes in a heterogeneous cluster, halving the data sent and received by each worker. The computation phase of the AllReduce operation is offloaded to compute-optimized C5 instances, allowing GPUs to compute gradients more rapidly. Since the GPU instances do not perform the reduction operation, sending and receiving gradients can occur simultaneously. The number of hops required for AllReduce gradients is minimized to just two, unlike homogeneous AllReduce algorithms, where the number of network hops increases with the number of nodes. The overall cost is also lowered since training is completed much faster compared to setups using only P3dn nodes.
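
AWS has not published the code for this custom algorithm, so the following is only a conceptual sketch of the two-hop pattern described above, with made-up names and toy data: GPU workers shard their gradients across CPU reducer nodes, each reducer sums its shard, and the summed shards are sent back, so each worker sends and receives roughly one model size per AllReduce instead of about two.

```python
import numpy as np

def two_hop_allreduce(worker_grads, num_reducers=2):
    """Conceptual sketch of a reducer-offloaded AllReduce (not AWS's actual code).

    Hop 1: every GPU worker shards its gradients and sends shard r to CPU
    reducer r. Hop 2: each reducer sums its shard across workers and sends
    the result back to every worker. Per worker, roughly one model size is
    sent and one received, versus about 2x for a homogeneous ring.
    """
    num_workers = len(worker_grads)

    # Hop 1: worker w splits its gradients into one shard per reducer.
    shards = [np.array_split(np.asarray(g, dtype=np.float64), num_reducers)
              for g in worker_grads]

    # The reduction runs on the CPU reducer nodes, not on the GPUs.
    reduced = [np.sum([shards[w][r] for w in range(num_workers)], axis=0)
               for r in range(num_reducers)]

    # Hop 2: every worker receives the summed shards and reassembles them.
    return [np.concatenate(reduced) for _ in range(num_workers)]

# Three toy "GPU workers", two toy "CPU reducers".
grads = [np.full(8, w + 1.0) for w in range(3)]
out = two_hop_allreduce(grads)
assert all(np.allclose(o, sum(grads)) for o in out)
```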

Conclusion

When tested with BERT and Mask R-CNN, our results demonstrated substantial improvements over single-node runs. Throughput scaled almost linearly as the number of P3dn nodes grew from 1 to 16, 32, 64, 128, 192, and finally 256 instances, dramatically reducing model training time.

