Maximizing Price Performance for Numerical Weather Prediction Workloads on AWS

This article is presented by Chanci Turner, Solutions Architect, alongside colleagues Alex Johnson, Senior Solutions Engineer, Emily Carter, Principal Product Specialist, and Mark Lee, Senior Technical Consultant, all specializing in High-Performance Computing (HPC).

The imperative for precise weather and climate forecasting in the global scientific community has catalyzed advancements in HPC since the 1950s. Recently, the economic and societal threats posed by extreme weather and climate change have intensified the demand for high-resolution forecasts—both globally and regionally—across various sectors such as renewable energy, agriculture, and maritime operations. For an in-depth look at the challenges and opportunities within weather and climate science, consider exploring the World Meteorological Organization (WMO) whitepaper.

In this article, we examine Numerical Weather Prediction (NWP) workloads and the AWS HPC-optimized services available for them. We evaluate three widely used NWP codes: WRF, MPAS, and FV3GFS. The analysis and results presented here give insight into the performance, cost, and overall price performance of running NWP workloads on AWS HPC infrastructure.

Understanding Numerical Weather Prediction (NWP)

NWP, more commonly referred to as weather forecasting, encompasses a range of workloads that use mathematical models to analyze current weather data and predict future conditions, typically over a span of 24 hours to 10 days. The forecasts depend on current observations, which include temperature, precipitation, and numerous other meteorological elements. At their core, NWP models are structured as a three-dimensional grid of cells that simulates the Earth's systems, with each cell representing a variety of multi-physics processes. The results from these processes are shared with neighboring cells to model the transfer of matter and energy over time.
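The neighbor exchange described above can be illustrated with a toy example. The sketch below is not taken from WRF, MPAS, or FV3GFS; it assumes a small two-dimensional periodic grid and a simple explicit diffusion update, and the names `diffuse_step` and `alpha` are hypothetical. It only shows how each cell's new value depends on its own state and that of its four neighbors.

```python
def diffuse_step(grid, alpha=0.1):
    """Advance a 2D periodic grid by one explicit diffusion step.

    Toy illustration only: real NWP models use 3D grids and far richer
    physics. alpha is a hypothetical mixing coefficient (stable for
    alpha <= 0.25 in this scheme).
    """
    ny, nx = len(grid), len(grid[0])
    new = [[0.0] * nx for _ in range(ny)]
    for j in range(ny):
        for i in range(nx):
            # Sum of the four neighboring cells, wrapping at the edges.
            neighbors = (
                grid[(j - 1) % ny][i] + grid[(j + 1) % ny][i]
                + grid[j][(i - 1) % nx] + grid[j][(i + 1) % nx]
            )
            # Each cell exchanges a fraction of its value with its neighbors.
            new[j][i] = grid[j][i] + alpha * (neighbors - 4.0 * grid[j][i])
    return new


# A single "hot" cell spreads into its neighbors after one step.
grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 100.0
after = diffuse_step(grid)
total_after = sum(sum(row) for row in after)
```

In a distributed run, the grid is split across nodes, so the cells at the edge of each node's subdomain must be exchanged over the network at every step. This is one reason interconnect performance matters for NWP scaling.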

The resolution of NWP models is determined by two key factors: the grid cell size, which defines spatial resolution (measured in kilometers), and the time step, which dictates temporal resolution. Smaller grid cells and time steps yield more detailed and potentially accurate results. Consequently, the increasing demand for higher resolution NWP workloads necessitates robust, scalable, and reliable HPC infrastructure.
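To make the link between resolution and compute demand concrete, here is a rough sketch that counts grid cells and time steps for a regional domain. The domain size, number of vertical levels, and time steps are illustrative assumptions, not values from the benchmarks in this article, and `grid_work` is a hypothetical helper. It also assumes the time step is halved along with the grid spacing, as stability constraints on explicit schemes typically require.

```python
import math


def grid_work(domain_km_x, domain_km_y, dx_km, levels, dt_s, forecast_hours):
    """Return (cells, steps, cell_updates) for a hypothetical regional domain."""
    nx = math.ceil(domain_km_x / dx_km)   # cells along x
    ny = math.ceil(domain_km_y / dx_km)   # cells along y
    cells = nx * ny * levels              # total 3D grid cells
    steps = math.ceil(forecast_hours * 3600 / dt_s)  # number of time steps
    return cells, steps, cells * steps


# Illustrative values only: a 3000 km x 3000 km domain, 50 levels, 24 h forecast.
coarse = grid_work(3000, 3000, dx_km=12, levels=50, dt_s=72, forecast_hours=24)
fine = grid_work(3000, 3000, dx_km=6, levels=50, dt_s=36, forecast_hours=24)

# Ratio of total cell updates between the fine and coarse configurations.
work_ratio = fine[2] / coarse[2]
print(f"coarse: {coarse[0]:,} cells x {coarse[1]:,} steps")
print(f"fine:   {fine[0]:,} cells x {fine[1]:,} steps")
print(f"work ratio: {work_ratio:.1f}x")
```

Under these assumptions, halving the horizontal grid spacing quadruples the cell count and doubles the number of time steps, for roughly eight times the work. This ignores I/O, physics cost variations, and any change in vertical resolution.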

AWS HPC Solutions for NWP

NWP applications benefit significantly from high memory bandwidth, fast network interconnects, and access to fast parallel file systems, all of which improve scaling across many nodes. In January, AWS introduced the Amazon EC2 Hpc6a instance family. Hpc6a instances are powered by third-generation AMD EPYC™ (Milan) processors, provide 96 cores and 384 GB of RAM, and offer 100 Gbps networking via Elastic Fabric Adapter (EFA). These instances are built on the AWS Nitro System, which delivers the performance and security essential for computational tasks.

Establishing an HPC Cluster on AWS and Key Performance Variables

All benchmarks were conducted using AWS ParallelCluster, an open-source cluster orchestration tool supported by AWS; version 3.1.1 was used for this analysis. In addition to the EC2 instance types mentioned, we used Amazon FSx for Lustre, a fully managed high-performance Lustre file system that provides throughput of hundreds of GB/s and sub-millisecond latencies.

The tests were performed with simultaneous multithreading disabled on the instances. For detailed solution components and step-by-step setup instructions, refer to our NWP Workshop.

Two additional components that facilitate rapid HPC cluster creation and application management in the workshop are PCluster Manager and Spack.

PCluster Manager

PCluster Manager offers a web-based interface for creating clusters, monitoring jobs, and managing infrastructure. It streamlines tasks such as mounting existing file systems and troubleshooting cluster issues, which simplifies cluster management and gives visibility into operations. Built on the AWS ParallelCluster 3 API, it uses a low-cost serverless architecture. A template provided in the NWP workshop integrates with PCluster Manager to create a cluster optimized for NWP workloads. Users access the cluster via AWS Systems Manager Session Manager, which offers browser-based shell access without opening inbound SSH ports. After a job runs, results can be visualized using NICE DCV and NCL. To control costs, the cluster is deleted once the workflow concludes.

Spack

To streamline the installation of various NWP codes, we utilize Spack, a package manager tailored for HPC workflows. Spack enables users to customize software installations, such as compiling WRF 4.3.3 with the Intel compiler and Intel MPI. The installation command is as follows:

spack install wrf@4.3.3%intel build_type=dm+sm ^intel-oneapi-mpi+external-libfabric

To expedite installation times for NWP codes, we provide a Spack binary cache for WRF, MPAS, and FV3GFS, containing pre-built binaries optimized for Amazon EC2 Hpc6a instances. This approach reduces installation times from days to hours.

Scaling Performance and Cost Analysis

Next, we will examine the scale-up performance and cost metrics for WRF, MPAS, and FV3GFS. The metrics used in our analysis include Simulation Speed and Cost per Simulation, defined as follows:

  • Simulation Speed = Forecast Time (seconds) / Wall-clock Time (Compute + File I/O) (seconds)
  • Cost Per Simulation ($) = Wall-clock Time (hours) × EC2 On-Demand Compute Cost ($ per instance-hour, us-east-2 pricing) × Number of Instances

Please note that the Cost per Simulation does not account for additional services such as Amazon Elastic Block Store (EBS) and FSx for Lustre.
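As a worked illustration of the two metrics defined above, the following sketch computes them for a made-up run. The function names are hypothetical, and the hourly price, wall-clock time, and instance count are placeholders chosen for easy arithmetic; they are not actual Hpc6a pricing or benchmark results.

```python
def simulation_speed(forecast_seconds, wallclock_seconds):
    """Forecast time divided by wall-clock time (compute + file I/O)."""
    return forecast_seconds / wallclock_seconds


def cost_per_simulation(wallclock_seconds, hourly_price_usd, num_instances):
    """Compute-only cost: wall-clock hours x price per instance-hour x instances.

    Excludes storage and other services such as EBS and FSx for Lustre.
    """
    wallclock_hours = wallclock_seconds / 3600.0
    return wallclock_hours * hourly_price_usd * num_instances


# Placeholder inputs (assumptions, not measured values):
forecast_s = 24 * 3600   # a 24-hour forecast
wallclock_s = 1800       # 30 minutes of wall-clock time
price = 3.00             # hypothetical On-Demand price in $ per instance-hour
instances = 8

speed = simulation_speed(forecast_s, wallclock_s)
cost = cost_per_simulation(wallclock_s, price, instances)
print(f"Simulation speed: {speed:.1f}x real time")
print(f"Cost per simulation: ${cost:.2f}")
```

Evaluating both functions across a range of instance counts shows the usual trade-off: adding instances raises simulation speed, but once scaling efficiency drops, cost per simulation rises.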

Conclusion

In summary, the increasing need for accurate and high-resolution weather predictions necessitates robust HPC solutions. By leveraging AWS’s offerings, users can achieve optimal price performance for their NWP workloads.

