Distributed training has become essential for training large deep learning models, especially in fields like computer vision (CV) and natural language processing (NLP). Open-source frameworks such as Horovod facilitate distributed training for various platforms, including Apache MXNet, PyTorch, and TensorFlow. Developed by Uber, Horovod is an open-source framework that uses efficient communication protocols such as the NVIDIA Collective Communications Library (NCCL) and Message Passing Interface (MPI) to manage and synchronize model parameters across multiple workers. Its primary aim is to streamline the process of scaling a single-GPU training script to leverage many GPUs in parallel, so adapting a standard Apache MXNet training script for distributed training with Horovod requires only a few additional lines of code. If you're new to Horovod and MXNet, we recommend first reading an introductory blog post on the topic before diving into this tutorial.
Horovod integrates seamlessly with MXNet through common distributed training APIs, and transitioning from a non-distributed script to a Horovod-compatible one requires only minimal code changes. However, challenges can still arise during distributed training, such as installing additional software and libraries and resolving compatibility issues. For instance, Horovod requires a specific version of Open MPI, and high-performance training on NVIDIA GPUs requires the NCCL library. Another hurdle is scaling out the number of training nodes while ensuring that all the necessary software and libraries are correctly installed and configured on each new node.
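Concretely, the changes usually amount to four steps: initialize Horovod, pin each worker to its own device, wrap the optimizer so gradients are averaged across workers, and broadcast the initial parameters from rank 0. The following minimal Gluon-style sketch illustrates these steps; the model, learning rate, and optimizer are illustrative placeholders, not the tutorial's actual script:

import mxnet as mx
import horovod.mxnet as hvd
from mxnet import gluon

hvd.init()  # initialize Horovod
# pin this worker to one GPU, or fall back to CPU
ctx = mx.gpu(hvd.local_rank()) if mx.context.num_gpus() > 0 else mx.cpu()

net = gluon.nn.Dense(10)  # placeholder model
net.initialize(mx.init.Xavier(), ctx=ctx)
params = net.collect_params()

# a common convention: scale the learning rate by the number of workers
opt = mx.optimizer.create('sgd', learning_rate=0.01 * hvd.size())

# DistributedTrainer replaces gluon.Trainer and allreduces gradients
trainer = hvd.DistributedTrainer(params, opt)

# start every worker from the same initial weights
hvd.broadcast_parameters(params, root_rank=0)

A script structured this way is then launched with mpirun or horovodrun, as the examples later in this post show.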
AWS Deep Learning Containers (AWS DL Containers) significantly ease the deployment of training instances within a cluster. The latest version includes all the essential libraries required to conduct distributed training with MXNet using Horovod. Similarly, AWS Deep Learning AMIs (DLAMIs) come pre-loaded with popular open-source deep learning frameworks and pre-configured libraries such as CUDA, cuDNN, Open MPI, and NCCL.
This article outlines how to run distributed training with Horovod and MXNet through AWS DL Containers and DLAMIs.
Getting Started with AWS DL Containers
AWS DL Containers are a collection of Docker images equipped with deep learning frameworks, facilitating quick deployments of custom machine learning (ML) environments. These containers offer optimized settings for different frameworks (MXNet, TensorFlow, PyTorch) along with NVIDIA CUDA for GPU instances and Intel MKL for CPU instances. They can be launched on Amazon Elastic Kubernetes Service (Amazon EKS), self-managed Kubernetes on Amazon Elastic Compute Cloud (Amazon EC2), and Amazon Elastic Container Service (Amazon ECS). For more details on launching AWS DL Containers, see the AWS Deep Learning Containers documentation.
Training an MXNet Model with Deep Learning Containers on Amazon EC2
The MXNet Deep Learning Container includes pre-installed libraries like MXNet, Horovod, NCCL, MPI, CUDA, and cuDNN.
For instructions on configuring AWS DL Containers on an EC2 instance, refer to Train a Deep Learning model with AWS Deep Learning Containers on Amazon EC2. To run a hands-on tutorial with a Horovod training script, complete steps 1–5 from that post. For step 6, follow these instructions for the MXNet framework:
CPU:
- Pull and run the Docker image from the Amazon ECR repository:
docker run -it 763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-training:1.6.0-cpu-py27-ubuntu16.04
- In the container terminal, execute the following command to train the MNIST model:
git clone --recursive https://github.com/horovod/horovod.git
mpirun -np 1 -H localhost:1 --allow-run-as-root python horovod/examples/mxnet_mnist.py
GPU:
- Pull and run the Docker image from the Amazon ECR repository:
nvidia-docker run -it 763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-training:1.6.0-gpu-py27-cu101-ubuntu16.04
- In the container terminal, run the command to train the MNIST example:
git clone --recursive https://github.com/horovod/horovod.git
mpirun -np 4 -H localhost:4 --allow-run-as-root python horovod/examples/mxnet_mnist.py
If your final output resembles the following, you have successfully executed the training script:
[1,0]<stderr>:INFO:root:Epoch[4] Train: accuracy=0.987580 Validation: accuracy=0.988582
[1,0]<stderr>:INFO:root:Training finished with Validation Accuracy of 0.988582
For guidance on shutting down EC2 instances, refer to step 7 of the previous post. You can replicate the steps above for your own training script.
Training an MXNet Model with Deep Learning Containers on Amazon EKS
Amazon EKS is a managed service that simplifies running Kubernetes on AWS without the overhead of installing and maintaining your own control plane or nodes. Kubernetes automates the deployment, scaling, and management of containerized applications. This post demonstrates how to establish a deep learning environment using Amazon EKS and AWS DL Containers, enabling you to scale a production-ready environment for multi-node training and inference with Kubernetes containers.
For instructions on setting up a deep learning environment using Amazon EKS and AWS DL Containers, visit Amazon EKS Setup. To create an Amazon EKS cluster, use the open-source tool eksctl; it's advisable to run it from an EC2 instance with the latest DLAMI. You can create either a GPU or CPU cluster based on your needs. Follow the Amazon EKS Setup instructions up to the Manage Your Cluster section.
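For illustration, a three-node GPU cluster like the one used in the next section can be created with a single eksctl command similar to the following; the cluster name and Region are placeholder values:

eksctl create cluster --name dl-cluster --region us-east-1 --node-type p3.8xlarge --nodes 3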
Once your Amazon EKS cluster is up and running, you can run Horovod MXNet training on it. For specific instructions, refer to MXNet with Horovod distributed GPU training, which uses a Docker image that already contains a Horovod training script, and a three-node cluster with node-type=p3.8xlarge. The tutorial runs the Horovod example script for MXNet on an MNIST model. The Horovod examples directory also contains an ImageNet script that can be run on the same Amazon EKS cluster.
Getting Started with AWS DLAMI
AWS DLAMIs are machine learning images packed with deep learning frameworks and their dependencies, including NVIDIA CUDA, NVIDIA cuDNN, NCCL, and Intel MKL-DNN. The DLAMI is a comprehensive solution for deep learning in the cloud. You can launch EC2 instances running either Ubuntu or Amazon Linux with pre-installed frameworks such as Apache MXNet, TensorFlow, Keras, and PyTorch. These AMIs let you train custom models, experiment with new algorithms, and build your deep learning skills. For more information, see the AWS Deep Learning AMI documentation.
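For example, on a recent Ubuntu DLAMI you can activate the pre-built MXNet Conda environment and run the same Horovod MNIST example locally; the environment name and process count below are illustrative and may vary across DLAMI versions:

source activate mxnet_p36
git clone --recursive https://github.com/horovod/horovod.git
horovodrun -np 4 -H localhost:4 python horovod/examples/mxnet_mnist.py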