This article is co-authored by Laura Kim (CSO), Jessica Nguyen (CTO), and Michael Rivers (CEO) of Latent Space, and Alex Carter from AWS.
Latent space represents the hidden dimensions of abstract concepts that machine learning (ML) models learn to navigate. For instance, ideas like “cat,” “tree,” or “window” correspond to points in latent space. At Latent Space, we’re developing a system that enables users to explore this latent space using both linguistic and visual prompts. Our team combines expertise from two traditionally disconnected fields: graphics and natural language processing (NLP). Historically, images and text have been treated separately, each requiring intricate and costly feature engineering. Tasks in NLP such as document comprehension and question answering have had little in common with visual tasks like scene interpretation or rendering. However, this landscape is evolving rapidly.
The integration of these modalities within a unified latent space paves the way for innovative applications, ranging from gaming to enhanced document analysis. Yet, this convergence also introduces significant scaling challenges, as discussed in Richard Sutton’s “The Bitter Lesson,” alongside recent advancements regarding scaling laws. To address these challenges, Latent Space is conducting pioneering research to merge these modalities within a single model while also ensuring efficient scaling. This is where model parallelism comes into play.
Amazon SageMaker’s automated model partitioning and efficient pipelining capabilities have enabled us to adopt model parallelism with minimal engineering overhead. We successfully scaled our model training beyond 1 billion parameters on p4d.24xlarge instances (NVIDIA A100 GPUs), which is critical for our objectives. Notably, training on 16 nodes with eight GPUs each, we observed a 38% increase in efficiency compared to earlier training iterations.
Challenges with Training Large-Scale Transformers
At Latent Space, we are working on integrating language and vision within transformer models that feature billions of parameters, so that the models can handle “out of distribution” scenarios that stem from users’ creativity or from real-world occurrences absent from our training datasets. We are addressing the complexities of scaling to billions of parameters through two primary approaches:
- Retrieval-augmented generation
- The Amazon SageMaker model parallelism library
Information retrieval (IR) techniques have long been fundamental to search engines and QA tasks. Recently, there has been significant progress in merging classic IR methods with contemporary transformers, particularly for question answering tasks where models are trained alongside neural retrievers designed to fetch relevant documents. For a deeper dive, check out this blog post on the subject.
While retrieval-augmented methods enhance cost-effectiveness and operational efficiency, we still encounter the limitation of fitting our largest model onto a single GPU. Consequently, we must employ model parallelism for training. However, designing our model partitioning was complicated due to the interdependencies among retrieved contexts across training inputs. Furthermore, implementing model parallelism manually throughout our research and development lifecycle posed significant engineering challenges.
The SageMaker Model Parallelism Library
Model parallelism involves distributing a model across multiple devices or nodes (such as GPU-equipped instances) to create an efficient training pipeline that maximizes GPU utilization. The SageMaker model parallelism library simplifies this process by providing automated model splitting—also known as automated model partitioning—and advanced pipeline run scheduling. Its partitioning algorithms balance memory usage, minimize device communication, and optimize performance.
Automated Model Partitioning
For our PyTorch implementation, the model parallel library conducts an initial tracing step (during the first training iteration) to build the model graph and ascertain tensor and parameter dimensions. It then creates a tree representing the nested nn.Module objects within the model and includes additional data from tracing, such as the number of stored nn.Parameters and the runtime for each nn.Module.
The library traverses this tree from the root, applying a partitioning algorithm that balances computational workload and memory demands while minimizing inter-device communication. If multiple nn.Modules share identical nn.Parameters, these modules are allocated to the same device to avoid maintaining duplicate parameter versions. Once the partitioning is finalized, the designated modules and weights are transferred to their respective devices.
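To make the idea concrete, a greedy partitioner over a module tree might look like the following sketch. This is an illustration only, not the library’s actual algorithm; the module names and costs are made-up inputs standing in for the runtimes and parameter counts gathered during the tracing step:

```python
# Hypothetical sketch of cost-balanced model partitioning (illustration only;
# the SageMaker library's real algorithm also balances memory, minimizes
# communication, and co-locates modules that share parameters).

def partition_modules(module_costs, num_devices):
    """Greedily assign each (sub)module to the currently least-loaded device.

    module_costs: dict mapping module name -> measured cost
                  (e.g., runtime or parameter count from the tracing step).
    Returns a dict mapping module name -> device index.
    """
    loads = [0.0] * num_devices
    assignment = {}
    # Place the most expensive modules first for better balance.
    for name, cost in sorted(module_costs.items(), key=lambda kv: -kv[1]):
        device = loads.index(min(loads))  # pick the least-loaded device
        assignment[name] = device
        loads[device] += cost
    return assignment

# Toy costs, as if collected during the first (tracing) iteration.
costs = {"embed": 4.0, "encoder.0": 3.0, "encoder.1": 3.0, "head": 2.0}
placement = partition_modules(costs, num_devices=2)
```

After the assignment, each device ends up with a comparable share of the total cost, which is the balance property the real partitioner optimizes for alongside communication volume.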
Pipeline Run Scheduling
A key feature of the SageMaker distributed model parallel library is its pipelined runs, which dictate the sequence of computations and data processing across devices during model training. Pipelining works by segmenting a mini-batch into microbatches that enter the training pipeline sequentially, following a schedule defined by the library’s runtime.
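The scheduling idea can be sketched in a few lines: split a mini-batch into microbatches and feed them through the pipeline stages in a staggered order, so that once the pipeline fills, every stage (GPU) has work at every step. This is a simplified, hypothetical model of forward-pass pipelining, not the library’s actual scheduler:

```python
# Simplified sketch of microbatch pipelining (illustration only; the
# SageMaker runtime manages the real schedule, including backward passes).

def split_into_microbatches(minibatch, num_microbatches):
    """Split a list of samples into equal-sized microbatches."""
    size = len(minibatch) // num_microbatches
    return [minibatch[i * size:(i + 1) * size] for i in range(num_microbatches)]

def pipeline_schedule(num_stages, num_microbatches):
    """Return, per time step, the (microbatch, stage) pairs running in parallel.

    Microbatch m reaches stage s at time step m + s, so after the pipeline
    fills, all stages are busy simultaneously.
    """
    total_steps = num_stages + num_microbatches - 1
    schedule = []
    for t in range(total_steps):
        schedule.append([(m, t - m) for m in range(num_microbatches)
                         if 0 <= t - m < num_stages])
    return schedule

microbatches = split_into_microbatches(list(range(8)), num_microbatches=4)
steps = pipeline_schedule(num_stages=4, num_microbatches=4)
```

In this toy schedule, step 3 has all four stages active at once, which is the utilization property that makes pipelining worthwhile.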
The microbatch pipeline keeps all GPUs busy—a capability we would otherwise have to build ourselves. With the model parallelism library, this process is managed efficiently in the background. Additionally, we use Amazon FSx to ensure rapid read speeds, which is crucial given the volume of files accessed during the training of a multimodal model with retrieval.
Training Architecture
The diagram below illustrates our training architecture. Our primary goals were to enhance training speed and reduce costs. The image and language transformers we are training are extremely complex, with a vast number of layers and weights totaling billions of parameters, so they cannot fit within a single node’s memory. Each node holds a partition of the model, and activations and gradients flow between partitions as data moves through the training pipeline. We configured 16 p4d.24xlarge instances, each equipped with eight GPUs, using the following architectural representation:
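For reference, a SageMaker training job for a cluster of this shape is typically configured through the estimator’s `distribution` argument. The parameter values below are illustrative placeholders, not our production settings:

```python
# Hypothetical distribution settings for a SageMaker PyTorch estimator
# (illustrative values only). This dict would be passed as
# PyTorch(..., distribution=distribution,
#         instance_type="ml.p4d.24xlarge", instance_count=16).
distribution = {
    "smdistributed": {
        "modelparallel": {
            "enabled": True,
            "parameters": {
                "partitions": 8,            # model partitions per replica
                "microbatches": 4,          # microbatches per mini-batch
                "pipeline": "interleaved",  # pipeline schedule type
                "ddp": True,                # data parallelism across replicas
                "auto_partition": True,     # let the library split the model
            },
        }
    },
    "mpi": {
        "enabled": True,
        "processes_per_host": 8,  # one process per GPU on p4d.24xlarge
    },
}
```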
As we scale our models, a prevalent trend is to store all information within the network’s weights. However, for practical reasons, we aim to augment our models to seek relevant contexts that assist in rendering tasks. This strategy allows us to maintain reduced serving costs without sacrificing image quality. Employing a large transformer-based NLP model, we have again verified a 38% increase in training efficiency when utilizing the SageMaker model parallelism library, as evidenced by the following:
- For tensor-level parallelism, each computation requires an allreduce, which takes O(log2 n) parallel steps across n machines, for O(n log2 n) total operations.
- In the case of pipeline parallelism, we only require O(1) parallel steps for passing data through the pipeline.
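As a back-of-envelope check of these bounds, plugging in n = 128 GPUs (matching 16 nodes with eight GPUs each) and treating the hidden constants as 1:

```python
import math

# Back-of-envelope comparison of the two communication patterns above
# (constants dropped; n is the number of GPUs).
n = 128  # 16 nodes x 8 GPUs

# Tensor-level parallelism: an allreduce takes O(log2 n) parallel steps,
# and across n machines the total work is O(n log2 n).
allreduce_parallel_steps = math.log2(n)  # -> 7.0
allreduce_total_ops = n * math.log2(n)   # -> 896.0

# Pipeline parallelism: handing activations to the next stage is O(1)
# parallel steps per microbatch, independent of n.
pipeline_parallel_steps = 1
```

The gap between 896 total operations and a constant per-step cost is why pipelining the model across devices scales better than allreducing every computation as the cluster grows.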
For further insights, you can refer to this authoritative source on the topic, as well as this excellent resource to understand how guided training methodologies are implemented.