Large language models (LLMs) have significantly transformed the landscape of artificial intelligence (AI). With their remarkable generative capabilities, they are increasingly being utilized across diverse sectors for various applications, including content creation, sentiment analysis, chatbot development, and virtual assistant technologies. An example of such a model is Llama 2 from Meta, available through AWS. Llama 2 is an auto-regressive language model utilizing an optimized transformer architecture, aimed at commercial and research applications in English. It is available in several parameter sizes—7 billion, 13 billion, and 70 billion—as well as both pre-trained and fine-tuned versions. For more information on Llama 2 through AWS, check out the Llama 2 foundation models that are now available in Amazon SageMaker JumpStart.
Many practitioners choose to fine-tune or pre-train these Llama 2 models with their own text data to improve accuracy for their specific use case. However, a common challenge they face is the high cost of fine-tuning and training. As organizations aim to explore the full potential of LLMs, the need for cost-effective training solutions has become more critical than ever. In this post, we discuss how you can use the Neuron Distributed training library to fine-tune and continuously pre-train Llama 2 while reducing training costs using AWS Trainium instances on Amazon SageMaker.
AWS Trainium Instances for Training Workloads
SageMaker’s ml.trn1 and ml.trn1n instances, powered by Trainium accelerators, are purpose-built for high-performance deep learning training and offer up to 50% savings in training costs compared to similar training-optimized Amazon Elastic Compute Cloud (Amazon EC2) instances. This post implements a solution using the ml.trn1.32xlarge Trainium instance type, which is generally preferred for training large-scale models. There are also ml.trn1n instances, which offer twice the networking throughput (1,600 Gbps) via Amazon Elastic Fabric Adapter (EFAv2). ml.trn1 and ml.trn1n instances are available for SageMaker Training in the US East (N. Virginia) and US West (Oregon) AWS Regions, and were recently announced as generally available in the US East (Ohio) Region. In these Regions, the instances can be used as On-Demand, Reserved, or Spot Instances, or as part of a Savings Plan.
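To make this concrete, the following is a minimal sketch of launching a training job on an ml.trn1.32xlarge instance with the SageMaker Python SDK. The entry point, source directory, IAM role, S3 path, hyperparameters, and framework versions shown here are illustrative assumptions rather than values prescribed by this post.

```python
# Minimal sketch: launch a SageMaker training job on a Trainium instance.
# Script name, role ARN, bucket, and version strings are placeholders.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="run_llama_pretrain.py",   # hypothetical training script
    source_dir="scripts",                  # hypothetical source directory
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder role
    instance_type="ml.trn1.32xlarge",      # Trainium training instance
    instance_count=1,
    framework_version="1.13.1",            # assumed Neuron-compatible PyTorch version
    py_version="py39",
    distribution={"torch_distributed": {"enabled": True}},  # torchrun-based launch
    hyperparameters={"max_steps": 1000},   # illustrative hyperparameters
)

estimator.fit({"train": "s3://amzn-s3-demo-bucket/llama2-data/"})  # placeholder S3 URI
```

The `torch_distributed` distribution setting launches the training script with torchrun, which is how worker processes are typically started across the NeuronCores of a Trainium instance.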
For further details on Trainium Accelerator chips, refer to the article about achieving high performance with the lowest cost for generative AI inference using AWS Inferentia2 and AWS Trainium on Amazon SageMaker. Additionally, check out AWS Trainium Customers for insights from other users, or explore Amazon EC2 Trn1 Instances for High-Performance Model Training to delve into the accelerator features and specifications.
Utilizing the Neuron Distributed Library with SageMaker
Amazon SageMaker is a fully managed service that empowers developers, data scientists, and practitioners to build, train, and deploy machine learning (ML) models at scale. SageMaker Training includes features that enhance and simplify the ML training experience, such as managed infrastructure, deep learning images, automatic model tuning with hyperparameter optimization, and a pay-as-you-go billing model. This section highlights the benefits of using SageMaker for distributed training with the Neuron Distributed library, which is part of the AWS Neuron SDK designed for deep learning workloads on AWS Inferentia and Trainium-based instances. In particular, it focuses on managed infrastructure, reduced time-to-train, and the cost-effectiveness enabled by SageMaker’s resiliency and recovery features.
In high-performance computing (HPC) clusters, which are often used for deep learning model training, hardware resiliency issues may pose significant challenges. While hardware failures on a single instance are uncommon, issues leading to stalled training become more frequent as the cluster expands to include dozens or hundreds of instances. Regular checkpointing can help minimize wasted computational resources, but engineering teams managing their own infrastructure must vigilantly monitor workloads and be prepared to address failures at all times to reduce training downtime. The managed infrastructure provided by SageMaker Training encompasses several resiliency features that simplify the monitoring and recovery process:
- Cluster Health Checks: Before commencing a training job, SageMaker conducts health checks and verifies communication among the provisioned instances. It replaces any faulty instances as necessary, ensuring the training script operates on a healthy cluster.
- Automatic Checkpointing: Checkpoints from a local path (default is /opt/ml/checkpoints) are automatically copied to a user-specified Amazon Simple Storage Service (S3) location. When training resumes, SageMaker retrieves the saved checkpoints from S3 to the local directory, allowing the training script to load and continue from the last saved checkpoint.
- Monitoring and Tracking Training: In the event of a node failure, it is crucial to understand where the failure occurred. Utilizing PyTorch Neuron allows data scientists to monitor training progress in TensorBoard, capturing the training job’s loss and determining when to halt training to optimize model convergence.
- Built-in Retries and Cluster Repair: SageMaker can be configured to automatically retry training jobs that fail due to internal server errors (ISE). During the retry process, SageMaker replaces any instances that encountered unrecoverable errors with fresh ones, reboots healthy instances, and restarts the job. This leads to quicker restarts and completion of workloads (see the configuration sketch after this list).
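As a rough illustration of how the checkpointing and retry features above map onto the SageMaker Python SDK, the sketch below extends the earlier estimator example with resiliency-related settings. The S3 URI, role ARN, instance count, and retry count are placeholder assumptions, not recommendations.

```python
# Minimal sketch: resiliency-related settings on a SageMaker estimator.
# S3 URI, role ARN, and retry count are placeholders.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="run_llama_pretrain.py",   # hypothetical training script
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder role
    instance_type="ml.trn1.32xlarge",
    instance_count=4,
    framework_version="1.13.1",            # assumed Neuron-compatible PyTorch version
    py_version="py39",
    distribution={"torch_distributed": {"enabled": True}},
    # Checkpoints written to the local path are synced to S3 and restored when the job resumes.
    checkpoint_s3_uri="s3://amzn-s3-demo-bucket/llama2-checkpoints/",  # placeholder bucket
    checkpoint_local_path="/opt/ml/checkpoints",
    # Retry on internal server errors; SageMaker repairs or replaces faulty
    # instances before restarting the job.
    max_retry_attempts=3,
)
```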
For customers managing extensive clusters consisting of hundreds of instances for a training job, SageMaker Training’s resiliency and recovery features can reduce the total time required for a model to converge by up to 20% by decreasing failures and expediting recovery. This capability also spares engineering teams from having to monitor workloads and respond to failures around the clock. While SageMaker training jobs are suitable for general-purpose training scenarios, Amazon SageMaker HyperPod is specifically optimized for efficient and resilient training of foundation models at scale. For additional information on SageMaker HyperPod use cases, refer to the SageMaker HyperPod developer guide.
In this post, we will utilize the Neuron Distributed library to continuously pre-train a Llama 2 model employing tensor and pipeline parallelism during SageMaker training jobs. To learn more about the resiliency and recovery capabilities of SageMaker Training, please refer to the post on training large language models on Amazon SageMaker: Best practices.
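Inside the training script itself, model parallelism is set up through the Neuron Distributed (neuronx-distributed) library. The following is a minimal sketch of initializing tensor and pipeline parallel groups; the parallel sizes shown (8-way tensor parallelism and 4-way pipeline parallelism) are assumptions chosen for illustration, the pipeline argument assumes a neuronx-distributed version with pipeline parallelism support, and the full Llama 2 recipe in the AWS Neuron samples involves considerably more setup.

```python
# Minimal sketch: initialize tensor and pipeline parallel groups with
# neuronx_distributed inside the training script. Parallel sizes are
# illustrative assumptions, not a recommended configuration.
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend  # registers the "xla" process group backend
from neuronx_distributed.parallel_layers import parallel_state

# Initialize the distributed process group over the XLA backend.
torch.distributed.init_process_group(backend="xla")

# Split the available NeuronCores into tensor-parallel groups of 8 and
# pipeline-parallel groups of 4 (assumed sizes).
parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=8,
    pipeline_model_parallel_size=4,
)

device = xm.xla_device()
tp_rank = parallel_state.get_tensor_model_parallel_rank()
xm.master_print(f"Model parallel groups initialized; tensor parallel rank {tp_rank}")
```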