Deploying Large Language Models on Amazon EKS with vLLM Deep Learning Containers

Organizations face significant hurdles when deploying large language models (LLMs) efficiently at scale. The main challenges include optimizing GPU utilization, managing network infrastructure, and providing reliable access to model weights. For distributed inference, teams must also orchestrate model execution across multiple nodes: partitioning the model across available GPUs, keeping communication between processing units efficient, and sustaining consistent performance with low latency and high throughput.

vLLM is an open source library for fast LLM inference and serving. The vLLM AWS Deep Learning Containers (DLCs) are built for customers running vLLM on Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Container Service (Amazon ECS), and Amazon Elastic Kubernetes Service (Amazon EKS), and they come at no additional cost. These containers provide a preconfigured, tested environment that works out of the box, including the drivers and libraries vLLM needs to run efficiently. They also include built-in support for Elastic Fabric Adapter (EFA), which is essential for high-performance multi-node inference workloads. Instead of building the inference environment from scratch, you can pull the vLLM DLC, which sets up the environment for you, and start deploying inference workloads at scale.
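As a quick illustration of that workflow, pulling the container is typically all it takes to get a ready-to-run vLLM environment on a GPU-enabled instance. The image URI below is a placeholder; substitute the current vLLM DLC URI from the AWS DLC release notes:

# Placeholder image URI; look up the current vLLM DLC URI in the AWS DLC release notes
VLLM_DLC_IMAGE=<vllm-dlc-image-uri>

# Authenticate to the DLC registry (the registry host is the part of the URI before the first "/")
aws ecr get-login-password --region us-west-2 | \
  docker login --username AWS --password-stdin ${VLLM_DLC_IMAGE%%/*}

# Pull the image; the GPU libraries, EFA support, and vLLM itself are already inside
docker pull ${VLLM_DLC_IMAGE}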

In this post, we show how to deploy the DeepSeek-R1-Distill-Qwen-32B model using the AWS DLCs for vLLM on Amazon EKS, demonstrating how these purpose-built containers simplify deployment of this powerful open source inference engine. The solution helps you address the infrastructure complexity of LLM deployment while maintaining performance and cost-effectiveness.

AWS DLCs

AWS DLCs give generative AI practitioners optimized Docker environments to train and deploy generative AI models throughout their workflows on Amazon EC2, Amazon EKS, and Amazon ECS. They are aimed at self-managed machine learning (ML) customers who want to build and operate their AI/ML environments themselves, keep instance-level control over their infrastructure, and manage their own training and inference workloads. The DLCs are available as Docker images for both training and inference and support frameworks such as PyTorch and TensorFlow. They are kept current with the latest frameworks and drivers, tested for compatibility and security, and offered at no additional cost. They can also be quickly customized by following our recipe guides. Using AWS DLCs as a building block for generative AI environments reduces the operational and infrastructure burden, lowers the total cost of ownership (TCO) for AI/ML infrastructure, accelerates development of generative AI products, and lets teams focus on extracting generative AI-powered insights from their data.

Solution Overview

The following diagram illustrates the interaction between Amazon EKS, GPU-enabled EC2 instances with EFA networking, and Amazon FSx for Lustre storage. Client requests are routed through an Application Load Balancer (ALB) to the vLLM server pods running on EKS nodes, which load model weights from FSx for Lustre. This architecture delivers a scalable, high-performance solution for serving LLM inference workloads while optimizing costs.
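After the deployment is in place, clients interact with the model through the ALB using vLLM's OpenAI-compatible API. A request might look like the following, where the ALB hostname is a placeholder:

# Example chat completion request against the vLLM OpenAI-compatible endpoint
# (replace <alb-dns-name> with the DNS name of the Application Load Balancer)
curl http://<alb-dns-name>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
        "messages": [{"role": "user", "content": "Explain tensor parallelism in one sentence."}],
        "max_tokens": 128
      }'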

The next diagram shows the DLC stack on AWS, from the EC2 instance foundation through the container runtime, GPU drivers, and ML frameworks such as PyTorch. The layered view shows how CUDA, NCCL, and other key components work together to support high-performance deep learning workloads.

The vLLM DLCs are built for high-performance inference, with built-in support for tensor parallelism and pipeline parallelism across multiple GPUs and nodes. This makes it possible to efficiently scale large models like DeepSeek-R1-Distill-Qwen-32B that would otherwise be difficult to deploy and manage. The containers also include optimized CUDA configurations and EFA drivers to deliver maximum throughput for distributed inference workloads. This solution uses the following AWS services and components:

  • AWS DLCs for vLLM – Preconfigured, optimized Docker images that simplify deployment and improve performance.
  • EKS cluster – Provides the Kubernetes control plane for container orchestration.
  • p4d.24xlarge instances – EC2 P4d instances with 8 NVIDIA A100 GPUs each, configured in a managed node group.
  • Elastic Fabric Adapter – A network interface that enables high-performance computing applications to scale effectively.
  • FSx for Lustre – A high-performance file system for model weight storage.
  • LeaderWorkerSet pattern – A custom Kubernetes resource for deploying vLLM in a distributed configuration.
  • AWS Load Balancer Controller – Oversees the ALB for external access.

By integrating these components, we establish an inference system that provides low-latency, high-throughput LLM serving capabilities with minimal operational overhead.
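To make the parallelism settings concrete, the following is a minimal sketch of how the vLLM server could be launched for this model. The flag values are illustrative assumptions (8 GPUs per p4d.24xlarge node, spread across two nodes); the actual arguments are defined in the LeaderWorkerSet manifest in the sample repository.

# Illustrative vLLM launch command; the actual arguments come from the
# LeaderWorkerSet manifest in the sample repository.
# --tensor-parallel-size 8 shards each layer across the 8 A100 GPUs in a node;
# --pipeline-parallel-size 2 splits the layers across two nodes for multi-node serving.
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2 \
  --port 8000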

Prerequisites

Before you begin, ensure you have the following prerequisites:

  • An AWS account with access to EC2 P4 instances (you may need to request a quota increase).
  • Access to a terminal equipped with the following tools:
    • AWS CLI version 2.11.0 or later
    • eksctl version 0.150.0 or later
    • kubectl version 1.27 or later
    • Helm version 3.12.0 or later
  • An AWS CLI profile (vllm-profile) configured with an IAM role or user with the following permissions:
    • Create, manage, and delete EKS clusters and node groups.
    • Create, manage, and delete EC2 resources, including VPCs, subnets, security groups, and internet gateways.
    • Create and manage IAM roles.
    • Create, update, and delete AWS CloudFormation stacks.
    • Create, delete, and describe FSx file systems.
    • Create and manage Elastic Load Balancers.

This solution can be deployed in AWS Regions where Amazon EKS, P4d instances, and FSx for Lustre are available. This walkthrough uses the us-west-2 Region. The full deployment typically takes 60-90 minutes.
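Before you continue, you can confirm that the required tools and the AWS CLI profile are in place:

# Confirm the tool versions meet the minimums listed above
aws --version
eksctl version
kubectl version --client
helm version --short

# Configure and verify the AWS CLI profile used throughout this walkthrough
aws configure --profile vllm-profile
aws sts get-caller-identity --profile vllm-profile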

Clone our GitHub repository containing the necessary configuration files:

# Clone the repository
git clone https://github.com/aws-samples/sample-aws-deep-learning-containers.git
cd sample-aws-deep-learning-containers/vllm-samples/deepseek/eks

Creating an EKS Cluster

To begin, we will create an EKS cluster in the us-west-2 Region using the provided configuration file. This sets up the Kubernetes control plane that will orchestrate our containers. The cluster is configured with a VPC, subnets, and security groups optimized for running GPU workloads.

# Update the region in eks-cluster.yaml if needed
sed -i "s|region: us-east-1|region: us-west-2|g" eks-cluster.yaml

# Create the EKS cluster
eksctl create cluster -f eks-cluster.yaml --profile vllm-profile

This process will take approximately 15-20 minutes to complete. During this time, eksctl creates a CloudFormation stack that provisions the necessary resources.
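For orientation, eks-cluster.yaml follows the standard eksctl ClusterConfig schema. The outline below is an illustrative sketch rather than the exact contents of the file in the repository; it highlights the pieces that matter for this workload: a GPU managed node group with EFA enabled.

# Illustrative ClusterConfig outline; the eks-cluster.yaml in the repository is authoritative
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: vllm-cluster          # example name
  region: us-west-2
managedNodeGroups:
  - name: gpu-nodes
    instanceType: p4d.24xlarge
    desiredCapacity: 2
    efaEnabled: true          # provisions EFA interfaces and a placement group for the node group
    availabilityZones: ["us-west-2a"]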
