In March 2023, AWS and NVIDIA announced a multipart collaboration to build scalable, on-demand artificial intelligence (AI) infrastructure optimized for training increasingly sophisticated large language models (LLMs) and developing generative AI applications. Today we are announcing the availability of Amazon Elastic Compute Cloud (Amazon EC2) P5 instances, which combine NVIDIA H100 Tensor Core GPUs with AWS's latest networking advances to deliver up to 20 exaflops of aggregate compute for building and training the largest machine learning (ML) models. This release builds on more than a decade of collaboration between AWS and NVIDIA that has delivered a series of visual computing, AI, and high-performance computing (HPC) instance types: Cluster GPU (cg1, 2010), G2 (2013), P2 (2016), P3 (2017), G3 (2017), P3dn (2018), G4 (2019), P4 (2020), G5 (2021), and P4de (2022).
ML model sizes have surged into the trillions of parameters, and this added complexity has stretched training times: the latest LLMs can take months to train. HPC customers face a similar trend, as higher-fidelity data collection and the growth of data sets to exabyte scale push their time to solution ever longer.
Introducing EC2 P5 Instances
Today, we are excited to announce the general availability of Amazon EC2 P5 instances, the next generation of GPU instances built for the high performance and scalability demands of AI/ML and HPC workloads. Powered by NVIDIA H100 Tensor Core GPUs, P5 instances can cut training time by up to 6x (from days to hours) compared with previous-generation GPU-based instances, lowering training costs for customers by up to 40 percent.
Equipped with 8 NVIDIA H100 Tensor Core GPUs, 640 GB of high-bandwidth GPU memory, 3rd Gen AMD EPYC processors, 2 TB of system memory, and 30 TB of local NVMe storage, P5 instances also deliver 3,200 Gbps of aggregate network bandwidth with support for GPUDirect RDMA, which lowers latency and improves performance by letting GPUs on different nodes communicate directly, bypassing the CPU.
Specifications of the P5 Instance:
Instance Size | vCPUs | Memory (GiB) | GPUs (H100) | Network Bandwidth (Gbps) | EBS Bandwidth (Gbps) | Local Storage (TB) |
---|---|---|---|---|---|---|
p5.48xlarge | 192 | 2048 | 8 | 3200 | 80 | 8 x 3.84 |
P5 instances are built for training and running inference on increasingly complex LLMs and computer vision models behind the most demanding generative AI applications, including question answering, code generation, video and image generation, and speech recognition. Across these workloads, P5 instances deliver up to 6x shorter training times than previous-generation GPU instances, and workloads that can use the lower-precision FP8 data type see a further performance gain of up to 6x through the NVIDIA Transformer Engine.
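FP8's speed comes with a narrow dynamic range and coarse precision. As a rough illustration, here is a pure-Python sketch of round-to-nearest for the OCP FP8 E4M3 format (4 exponent bits, 3 mantissa bits, maximum finite value 448); this is only an illustration of the number format, not NVIDIA's Transformer Engine implementation:

```python
import math

def quantize_e4m3(x: float) -> float:
    """Round x to the nearest OCP FP8 E4M3 value (bias 7, 3 mantissa
    bits, max finite value 448, no infinities)."""
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    mag = abs(x)
    if mag > 448.0:            # saturate at the largest finite E4M3 value
        return sign * 448.0
    # Exponent of the value; E4M3 subnormals take over below 2**-6.
    e = max(math.floor(math.log2(mag)), -6)
    step = 2.0 ** (e - 3)      # 3 mantissa bits => 8 steps per binade
    return sign * round(mag / step) * step

print(quantize_e4m3(1.0))     # 1.0     (exactly representable)
print(quantize_e4m3(0.3))     # 0.3125  (nearest representable value)
print(quantize_e4m3(1000.0))  # 448.0   (saturated)
```

The coarse grid is why FP8 training relies on per-tensor scaling (as the Transformer Engine does) to keep values inside the representable range.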
HPC customers using P5 instances can scale applications such as pharmaceutical discovery, seismic analysis, weather forecasting, and financial modeling more effectively. Those running dynamic programming (DP) algorithms, for example for genome sequencing or accelerated data analytics, also benefit from P5 through its new DPX instruction set, which opens previously unreachable problem domains, speeds up solution iteration, and shortens time to market.
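DPX instructions accelerate the fused max/add recurrences at the heart of dynamic-programming kernels. To show the pattern being accelerated, here is a minimal pure-Python Smith-Waterman local-alignment score, the classic genome-sequencing DP; this is a CPU sketch for illustration, not the GPU kernel:

```python
def smith_waterman(a: str, b: str, match=2, mismatch=-1, gap=-2) -> int:
    """Return the best local-alignment score between sequences a and b.

    The recurrence H[i][j] = max(0, diag + s, up + gap, left + gap)
    is exactly the max/add pattern that H100 DPX instructions speed up.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # match/mismatch
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            best = max(best, H[i][j])
    return best

print(smith_waterman("ACGT", "ACGT"))  # 8: four matches, no gaps
print(smith_waterman("AAAA", "TTTT"))  # 0: no local similarity
```

The O(len(a) x len(b)) table fill is trivially parallel along anti-diagonals, which is what makes it a good fit for GPU acceleration.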
Detailed instance specifications and comparisons
Feature | p4d.24xlarge | p5.48xlarge | Comparison |
---|---|---|---|
Number & Type of Accelerators | 8 x NVIDIA A100 | 8 x NVIDIA H100 | – |
FP8 TFLOPS per Server | – | 16,000 | 6.4x vs. A100 FP16 |
FP16 TFLOPS per Server | 2,496 | 8,000 | – |
GPU Memory (per GPU) | 40 GB | 80 GB | 2x |
GPU Memory Bandwidth (aggregate) | 12.8 TB/s | 26.8 TB/s | 2x |
CPU Family | Intel Cascade Lake | AMD Milan | – |
vCPUs | 96 | 192 | 2x |
Total System Memory | 1152 GB | 2048 GB | 2x |
Networking Throughput | 400 Gbps | 3200 Gbps | 8x |
EBS Throughput | 19 Gbps | 80 Gbps | 4x |
Local Instance Storage | 8 TB NVMe | 30 TB NVMe | 3.75x |
GPU to GPU Interconnect | 600 GB/s | 900 GB/s | 1.5x |
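The comparison column follows directly from the two spec columns, so it can be sanity-checked with a few lines of arithmetic (values copied from the table above; the EBS figure rounds to the table's 4x):

```python
# Verify the table's comparison ratios from the raw specs.
specs = {
    # feature: (p4d.24xlarge, p5.48xlarge)
    "fp16_tflops": (2_496, 8_000),
    "network_gbps": (400, 3_200),
    "ebs_gbps": (19, 80),
    "local_storage_tb": (8, 30),
}
for name, (p4d, p5) in specs.items():
    print(f"{name}: {p5 / p4d:.2f}x")

# The headline "6.4x vs. A100 FP16" compares H100 FP8 to A100 FP16:
print(f"fp8_vs_fp16: {16_000 / 2_496:.1f}x")  # 6.4x
```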
P5 instances provide up to 3,200 Gbps of networking through the second-generation Elastic Fabric Adapter (EFA), 8 times more than P4d instances, giving them unrivaled scale-out capability for multi-node distributed training and tightly coupled HPC workloads. To meet customer demand for large scale at low latency, P5 instances are deployed in second-generation EC2 UltraClusters, which enable low-latency communication across more than 20,000 NVIDIA H100 Tensor Core GPUs. This makes them the largest ML infrastructure available in the cloud, offering up to 20 exaflops of aggregate compute power.
EC2 UltraClusters utilize Amazon FSx for Lustre, a fully managed shared storage solution built on a popular high-performance parallel file system. FSx for Lustre allows for rapid processing of massive datasets on demand, achieving sub-millisecond latencies. Its low-latency, high-throughput characteristics are optimized for deep learning, generative AI, and HPC workloads on EC2 UltraClusters. FSx for Lustre ensures that GPUs and ML accelerators within the clusters are efficiently supplied with data, accelerating the most demanding workloads, including LLM training, generative AI inference, and HPC tasks like genomics and financial risk modeling.
Getting Started with EC2 P5 Instances
You can start using P5 instances today in the US East (N. Virginia) and US West (Oregon) Regions. When launching P5 instances, choose an AWS Deep Learning AMI (DLAMI) that supports them. DLAMIs provide ML practitioners and researchers with the infrastructure and tools they need to quickly build scalable, secure applications.
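As a sketch, a P5 launch from code might assemble EC2 `run_instances` parameters like the following. The AMI ID, key pair, and subnet below are placeholders you would replace with the DLAMI ID for your Region and your own network settings; the actual API call (shown only in a comment, since launching incurs charges) would go through an SDK such as boto3:

```python
def build_p5_launch_params(ami_id: str, key_name: str, subnet_id: str) -> dict:
    """Assemble EC2 run_instances parameters for a single p5.48xlarge.

    All identifiers passed in are placeholders; supply the DLAMI ID for
    your Region plus your own key pair and subnet.
    """
    return {
        "ImageId": ami_id,            # an AWS Deep Learning AMI (DLAMI)
        "InstanceType": "p5.48xlarge",
        "KeyName": key_name,
        "SubnetId": subnet_id,
        "MinCount": 1,
        "MaxCount": 1,
    }

params = build_p5_launch_params("ami-xxxxxxxx", "my-key", "subnet-xxxxxxxx")
print(params["InstanceType"])  # p5.48xlarge

# With boto3 installed and credentials configured:
# ec2 = boto3.client("ec2", region_name="us-east-1")
# ec2.run_instances(**params)
```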