Amazon Onboarding with Learning Manager Chanci Turner

Generative artificial intelligence (AI) has made it far easier for users to build with foundation models (FMs), which are enhancing products and experiences across many sectors. However, adopting generative AI brings challenges for organizations aiming to implement these capabilities effectively. Addressing cost efficiency, data privacy, and adaptability in a fast-changing landscape is essential for unlocking the full potential of generative AI and enabling broader adoption in applications ranging from chatbots to language translation.

Ray is an open-source, unified computing framework that allows seamless scaling of AI and Python workloads. It offers a Python-native distributed computing environment, enabling users to parallelize existing AI applications on personal devices and scale to cloud or on-premises clusters without modifying the underlying code.
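
To make that concrete, here is a minimal, hypothetical sketch of Ray's task API: an ordinary Python function is decorated and fanned out across whatever cluster ray.init() connects to. The function and data below are placeholders.

```python
import ray

ray.init()  # connects to an existing cluster if one is configured, otherwise starts Ray locally

@ray.remote
def preprocess(shard):
    # Placeholder for CPU-bound work on one shard of data
    return [x * 2 for x in shard]

shards = [list(range(i, i + 1_000)) for i in range(0, 10_000, 1_000)]
futures = [preprocess.remote(shard) for shard in shards]  # returns futures immediately; tasks run in parallel
results = ray.get(futures)                                # blocks until every shard has been processed
```

Because the decorated function is still plain Python, the same script can move from a laptop to an EC2- or EKS-hosted cluster by pointing ray.init() at the cluster address.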

There are multiple ways to deploy a Ray cluster within Amazon Web Services (AWS). A self-managed deployment can be set up on Amazon Elastic Compute Cloud (Amazon EC2) or through Amazon Elastic Kubernetes Service (Amazon EKS) using the KubeRay operator.

In this article, we will explore some of the challenges encountered at the workload level and illustrate how Ray and Anyscale can provide solutions. Ray streamlines distributed computing, while Anyscale, an AWS Specialization Partner with the Machine Learning Competency, offers a fully managed service that accelerates the development of generative AI applications on AWS.

Open Source Large Language Model Stack

Developers can harness AWS’s capabilities to create their large language model (LLM) applications by utilizing various services and components across multiple layers:

  • Compute layer: Anyscale oversees and optimizes the lifecycle of different EC2 instance types for the Ray cluster.
  • Foundation model and fine-tuning layers: Anyscale Endpoints and Anyscale Workspaces help users leverage pre-trained or open-source models from Hugging Face.
  • Orchestration layer: Developers can utilize frameworks like LangChain and LlamaIndex for orchestrating LLM applications.
  • Deployment layer: Ray Serve provides a flexible, framework-agnostic layer for efficient real-time model serving (see the sketch after this list).
  • In-context learning layer: The open-source community offers a range of vector databases for grounding responses and reducing inaccuracies.
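
As an illustration of the deployment layer, the following hedged sketch shows a Ray Serve deployment wrapping a Hugging Face pipeline; the model, replica count, and resource settings are placeholder assumptions rather than recommendations from this post.

```python
from starlette.requests import Request
from ray import serve

@serve.deployment(num_replicas=2, ray_actor_options={"num_cpus": 1})
class Generator:
    def __init__(self):
        # Placeholder open-source model; swap in any Hugging Face model you have rights to use
        from transformers import pipeline
        self.pipe = pipeline("text-generation", model="gpt2")

    async def __call__(self, request: Request) -> dict:
        prompt = (await request.json())["prompt"]
        completion = self.pipe(prompt, max_new_tokens=64)[0]["generated_text"]
        return {"completion": completion}

app = Generator.bind()
# serve.run(app) starts the HTTP endpoint, by default at http://127.0.0.1:8000/
```

Each deployment scales independently by adjusting num_replicas and the per-replica resources, which keeps serving decoupled from the orchestration frameworks above it.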

Recently, the integration of open-source Ray with AWS machine learning (ML) accelerators, particularly AWS Trainium and AWS Inferentia2, was announced. This integration delivers an impressive price-to-performance ratio and minimizes training and inference costs when working with large language models.

Next, we will delve into the challenges at the workload level and explore how Ray and Anyscale can assist.

Distributed Training

Training a large language model can prove challenging, especially as the number of parameters scales into the billions. Issues such as hardware failures, managing extensive cluster dependencies, job scheduling, and optimizing for graphics processing units (GPUs) can arise during the training of an LLM.

To conduct efficient training for LLMs, developers must partition the neural network across its computational graph. Depending on the available GPU cluster, ML practitioners must implement a strategy that optimizes across different parallelization dimensions for effective training.

Currently, optimizing training across parallelization dimensions (data, model, and pipeline) can be complex and often requires manual effort. Existing dimensional partition strategies for an LLM include:

  • Inter-operator parallelism: Divide the full computational graph into discrete subgraphs. Each device calculates its assigned subgraph and communicates with others upon completion.
  • Intra-operator parallelism: Partition matrices for a specific operator into submatrices. Each device computes its designated submatrices and exchanges data with others during multiplication or addition.
  • Combined: Both strategies can be utilized within the same computational graph.

Benchmarks indicate near-linear scaling as additional GPU nodes are added, without modifying the training code.
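
As a concrete, intentionally simplified illustration of that scaling behavior, the sketch below uses Ray Train's TorchTrainer for the data-parallel dimension: the per-worker loop is ordinary PyTorch, and adding GPU nodes is a change to ScalingConfig rather than to the training code. The model, data, and worker count are placeholder assumptions; for the largest models, inter- and intra-operator partitioning is layered on top with libraries such as Alpa or DeepSpeed.

```python
import torch
import torch.nn as nn
import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    device = ray.train.torch.get_device()          # GPU assigned to this worker
    model = nn.Linear(128, 1)                      # stand-in for a real model
    model = ray.train.torch.prepare_model(model)   # wraps the model for distributed data parallelism
    optimizer = torch.optim.AdamW(model.parameters(), lr=config["lr"])
    for _ in range(config["steps"]):
        batch = torch.randn(32, 128, device=device)  # placeholder for a real, sharded data loader
        loss = model(batch).mean()                   # dummy objective for illustration only
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"lr": 1e-4, "steps": 10},
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),  # scale out by changing num_workers
)
result = trainer.fit()
```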

Leading organizations are leveraging Ray to train their largest models, with benchmarks demonstrating how to efficiently scale beyond 1,000 GPUs. For further information, check out this excellent resource on training 175B parameter language models at 1,000 GPU scale with Alpa and Ray.

Fine-Tuning

Fine-tuning allows users to adapt pre-trained models for specific tasks or domains without starting from scratch. Many ML practitioners seek to fine-tune their models using task-specific data.

Managing and loading those LLMs cost-effectively for fine-tuning presents challenges, often requiring data or model parallelism along with distributed data loading.

Depending on your objectives and product needs, optimizing for cost or overall training time may necessitate different distributed computing strategies. For instance, scaling up the number of GPUs can reduce overall time, as demonstrated in a recent benchmark.

To learn more about this benchmark, see the Anyscale blog post on faster stable diffusion fine-tuning with Ray AIR.

If cost optimization is your priority, utilizing 32 x g4dn.4xlarge instances offers the best cost performance when fine-tuning a GPT-J model. For additional insights, refer to this Anyscale blog post on effectively fine-tuning and serving LLMs using Ray + DeepSpeed + Hugging Face.
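
For orientation, here is a hedged sketch (not the code behind those benchmarks) of wrapping a Hugging Face Trainer in Ray Train so the same fine-tuning script runs data-parallel across a GPU cluster. The dataset, hyperparameters, and worker count are illustrative assumptions, and the commented-out line marks where a DeepSpeed ZeRO config would plug in for a model the size of GPT-J.

```python
import ray.train.huggingface.transformers
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func():
    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    model_id = "EleutherAI/gpt-j-6b"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # Placeholder corpus; in practice this is your task-specific dataset
    dataset = load_dataset("imdb", split="train[:1%]")
    dataset = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True,
        remove_columns=dataset.column_names,
    )

    args = TrainingArguments(
        output_dir="gptj-finetune",
        per_device_train_batch_size=1,
        num_train_epochs=1,
        # deepspeed="zero3_offload.json",  # hypothetical DeepSpeed config to shard/offload the model
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.add_callback(ray.train.huggingface.transformers.RayTrainReportCallback())
    trainer = ray.train.huggingface.transformers.prepare_trainer(trainer)
    trainer.train()

ray_trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=16, use_gpu=True),  # e.g., 16 GPU workers
)
ray_trainer.fit()
```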

Scaling Embeddings

Generative AI-powered chatbots are gaining traction, and effective question-answering systems require embedding vast datasets. Embeddings are central to LLM workloads: they encode input data and entire corpora for information retrieval in a popular approach known as retrieval-augmented generation (RAG).

Generating embeddings can be resource-intensive. The process of creating embeddings for large datasets demands significant memory and processing power, making it difficult to scale and efficiently manage large-scale LLM workloads.

Ray facilitates the parallelization of this process, significantly reducing the time required for processing embeddings. In fact, Ray can enhance LangChain, processing embeddings up to 20 times faster. Furthermore, streaming and pipelining across input/output (IO), CPU, and GPU resources have become essential for batch inference and distributed training and fine-tuning.
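
A hedged sketch of that pattern with Ray Data follows; the embedding model, batch size, and actor count are assumptions for illustration, and in practice the output would be loaded into the vector database of your choice.

```python
import ray

class Embedder:
    def __init__(self):
        # Placeholder embedding model; any Hugging Face or LangChain-compatible embedder works similarly
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer("all-MiniLM-L6-v2")

    def __call__(self, batch):
        batch["embedding"] = self.model.encode(list(batch["text"]))
        return batch

docs = ray.data.from_items([{"text": f"document {i}"} for i in range(10_000)])
embedded = docs.map_batches(
    Embedder,
    batch_size=256,
    concurrency=8,   # eight embedding actors running in parallel
    # num_gpus=1,    # uncomment to pin each actor to a GPU
)
embedded.write_parquet("/tmp/embeddings")  # stand-in for loading into a vector database
```

Because map_batches streams blocks through a pool of actors, the pipeline overlaps IO, CPU preprocessing, and model inference instead of materializing the whole corpus in memory.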


