Distributed Training Using Horovod and Apache MXNet with AWS DL Containers and AWS Deep Learning AMIs
Distributed training of large deep learning models has become essential in computer vision (CV) and natural language processing (NLP) applications. Open-source frameworks such as Horovod provide distributed training support for Apache MXNet, PyTorch, and TensorFlow. Converting a non-distributed Apache MXNet training script to use distributed training with Horovod requires only 4-5 additional lines of code. Horovod is an open-source distributed deep learning framework created by Uber. It uses efficient inter-GPU and inter-node communication methods, such as the NVIDIA Collective Communications Library (NCCL) and Message Passing Interface (MPI), to distribute and aggregate model parameters across workers. The primary goal of Horovod is to make distributed deep learning fast and easy: it turns a single-GPU training script into one that trains across many GPUs in parallel. If you're new to Horovod and Apache MXNet for distributed training, we recommend reading our previous blog post on the subject before following this example.
MXNet integrates with Horovod through the distributed training APIs defined in Horovod, and you can convert a non-distributed training script into a Horovod-compatible one by following its higher-level code structure. This streamlined user experience requires only a few additional lines of code. However, other challenges can still get in the way of smooth distributed training. For example, you may need to install additional software and libraries and resolve compatibility issues to make distributed training work. Horovod requires a specific version of Open MPI, and if you want high-performance training on NVIDIA GPUs, you must also install the NCCL library. Another challenge arises when scaling the number of training nodes in a cluster: you must make sure that all the software and libraries on the new nodes are correctly installed and configured.
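To make that code structure concrete, the following minimal sketch shows those extra lines for a Gluon training script, using the horovod.mxnet API (hvd.init, hvd.DistributedOptimizer, and hvd.broadcast_parameters). The model definition, data loading, and training loop are omitted, and build_model is a hypothetical placeholder for your own network:

import mxnet as mx
import horovod.mxnet as hvd

hvd.init()                                    # initialize Horovod on every worker
context = mx.gpu(hvd.local_rank())            # pin each worker to one GPU

net = build_model()                           # hypothetical: your usual Gluon model
net.initialize(mx.init.Xavier(), ctx=context)
params = net.collect_params()

# Scale the learning rate by the number of workers and wrap the optimizer
opt = mx.optimizer.create('sgd', learning_rate=0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

# Start all workers from identical parameters broadcast from rank 0
hvd.broadcast_parameters(params, root_rank=0)

trainer = mx.gluon.Trainer(params, opt, kvstore=None)
# ...the training loop itself is unchanged from the single-GPU script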
AWS Deep Learning Containers (AWS DL Containers) have significantly simplified the process of launching new training instances within a cluster, and the latest release includes all necessary libraries to facilitate distributed training using MXNet alongside Horovod. The AWS Deep Learning AMIs (DLAMI) come pre-packaged with popular open-source deep learning frameworks along with pre-configured libraries such as CUDA, cuDNN, Open MPI, and NCCL.
In this post, we demonstrate how to run distributed training with Horovod and MXNet using AWS DL Containers and the DLAMI.
Getting Started with AWS DL Containers
AWS DL Containers are a collection of Docker images that come pre-installed with deep learning frameworks, allowing for quick deployment of custom machine learning (ML) environments. These containers provide optimized environments featuring various deep learning frameworks (MXNet, TensorFlow, PyTorch), NVIDIA CUDA (for GPU instances), and Intel MKL (for CPU instances) libraries, and are available in the Amazon Elastic Container Registry (Amazon ECR). You can deploy AWS DL Containers on Amazon Elastic Kubernetes Service (Amazon EKS), self-managed Kubernetes on Amazon Elastic Compute Cloud (Amazon EC2), and Amazon Elastic Container Service (Amazon ECS). For more details on launching AWS DL Containers, see the AWS Deep Learning Containers documentation.
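One practical detail worth noting: pulling DL Container images from Amazon ECR requires authenticating Docker to the registry first. As a sketch assuming AWS CLI version 2 and the us-east-1 registry used in the examples below, that step looks like the following (with version 1 of the CLI, the older aws ecr get-login command serves the same purpose):

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com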
Training an MXNet Model with Deep Learning Containers on Amazon EC2
The MXNet Deep Learning Container includes pre-installed libraries such as MXNet, Horovod, NCCL, MPI, CUDA, and cuDNN. The following diagram illustrates this architecture.
For instructions on setting up AWS DL Containers on an EC2 instance, see Train a Deep Learning model with AWS Deep Learning Containers on Amazon EC2. For a hands-on tutorial using a Horovod training script, complete steps 1-5 of that post, then use the following commands for step 6 with the MXNet framework:
CPU:
- Download the Docker image from the Amazon ECR repository and start a container:
docker run -it 763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-training:1.6.0-cpu-py27-ubuntu16.04
- Inside the container, clone the Horovod repository, which contains the example training scripts:
git clone --recursive https://github.com/horovod/horovod.git
- Run the MNIST example with a single worker process:
mpirun -np 1 -H localhost:1 --allow-run-as-root python horovod/examples/mxnet_mnist.py
GPU:
- Download the Docker image from the Amazon ECR repository and start a container with GPU access:
nvidia-docker run -it 763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-training:1.6.0-gpu-py27-cu101-ubuntu16.04
- Inside the container, clone the Horovod repository:
git clone --recursive https://github.com/horovod/horovod.git
- Run the MNIST example with four worker processes, one per GPU:
mpirun -np 4 -H localhost:4 --allow-run-as-root python horovod/examples/mxnet_mnist.py
If your final output resembles the following, you have successfully executed the training script:
[1,0]<stderr>:INFO:root:Epoch[4] Train: accuracy=0.987580 Validation: accuracy=0.988582
[1,0]<stderr>:INFO:root:Training finished with Validation Accuracy of 0.988582
To terminate the EC2 instances, follow step 7 of the previous post. You can apply the same steps outlined above for your own training scripts.
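The same mpirun pattern scales past a single machine. As a hedged sketch (the host names and slot counts are illustrative, and the hosts must be able to reach each other over passwordless SSH), training across two 4-GPU instances could look like the following, using the launch flags recommended in the Horovod documentation:

mpirun -np 8 -H server1:4,server2:4 --allow-run-as-root \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 -mca btl ^openib \
    python horovod/examples/mxnet_mnist.py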
Training an MXNet Model with Deep Learning Containers on Amazon EKS
Amazon EKS is a managed service that simplifies the process of running Kubernetes on AWS without the need to install, operate, and maintain your own Kubernetes control plane or nodes. Kubernetes is an open-source system that automates the deployment, scaling, and management of containerized applications. In this post, we will guide you through setting up a deep learning environment using Amazon EKS and AWS DL Containers. With Amazon EKS, you can scale a production-ready environment for multi-node training and inference using Kubernetes containers.
The following diagram illustrates this architecture:
For detailed instructions on establishing a deep learning environment with Amazon EKS and AWS DL Containers, consult the Amazon EKS Setup documentation. To configure an Amazon EKS cluster, use the open-source tool eksctl. It is advisable to run eksctl from an EC2 instance equipped with the latest DLAMI. Depending on your use case, you can launch either a GPU or a CPU cluster. For this post, follow the Amazon EKS Setup instructions up to the Manage Your Cluster section.
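For reference, creating a three-node GPU cluster like the one used in the next section can be done with a single eksctl command. This is a sketch rather than a canonical invocation: the cluster name, Region, timeout, and SSH key name are illustrative assumptions.

eksctl create cluster --name dl-cluster \
    --region us-east-1 \
    --nodes 3 \
    --node-type p3.8xlarge \
    --ssh-access --ssh-public-key my-key \
    --timeout 40m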
Once your Amazon EKS cluster is up and running, you can run Horovod MXNet training on it. For guidance, see MXNet with Horovod distributed GPU training, which uses a Docker image that already contains a Horovod training script and a three-node cluster with node-type=p3.8xlarge. That tutorial runs the Horovod example script for MXNet on the MNIST dataset. The Horovod examples directory also includes an ImageNet script that you can run on the same Amazon EKS cluster.
Getting Started with the AWS DLAMI
AWS DLAMIs are machine learning images pre-loaded with deep learning frameworks and their dependent libraries, such as NVIDIA CUDA, NVIDIA cuDNN, NCCL, and Intel MKL-DNN. The DLAMI is a one-stop shop for deep learning in the cloud. You can launch EC2 instances running either Ubuntu or Amazon Linux, with popular deep learning frameworks such as Apache MXNet, TensorFlow, Keras, and PyTorch pre-installed. It is an excellent resource for training custom models, experimenting with new deep learning algorithms, and sharpening your deep learning skills and techniques.
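As a quick, hedged illustration of what this looks like in practice, the following commands run the same Horovod MNIST example on a single GPU DLAMI instance. The conda environment name follows the DLAMI convention (for example, mxnet_p36 on recent Ubuntu DLAMIs; check conda env list on your instance), and horovodrun is Horovod's launcher that wraps mpirun. Adjust -np to match the number of GPUs on your instance.

# Activate the DLAMI's prebuilt MXNet conda environment (name assumed: mxnet_p36)
source activate mxnet_p36
# Fetch the Horovod example scripts
git clone --recursive https://github.com/horovod/horovod.git
# Launch four worker processes, one per GPU, on this instance
horovodrun -np 4 python horovod/examples/mxnet_mnist.py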