In the realm of high-performance computing (HPC), it is inspiring to see how our customers are putting Amazon EC2's Cluster Compute instance type to work. Below are two notable applications that showcase its capabilities:
MathWorks / MATLAB
The MATLAB team at MathWorks evaluated the performance scaling of the backslash ("\") matrix division operator used to solve the equation A*x = b. In their experiments, matrix A required an immense amount of memory (290 GB), far more than a typical high-end desktop can offer: such a machine usually has a quad-core processor and only 4-8 GB of RAM, delivering around 20 Gigaflops.
To tackle this challenge, they distributed the calculations across multiple machines. Every element of the array had to remain accessible even though the data was spread across different machines, a task that demanded considerable network communication, memory access, and CPU resources. By moving to a cluster of EC2 instances, they were able to work with much larger arrays and reach computational speeds of up to 1.3 Teraflops, a roughly 60-fold increase, all without modifying the application code.
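To put these figures in perspective, here is a back-of-the-envelope sketch in Python. The assumptions (double-precision 8-byte elements and the textbook (2/3)*n^3 flop count for a dense LU-based solve) are mine, not MathWorks'; the 290 GB, 20 Gigaflop, and 1.3 Teraflop numbers come from the text above.

```python
# Rough sizing of the distributed backslash benchmark.
# Assumptions (not from the original post): double-precision (8-byte)
# elements and the classic (2/3) * n^3 flop count for a dense LU solve.

BYTES_PER_DOUBLE = 8
matrix_bytes = 290e9                        # matrix A occupies ~290 GB

# A is an n-by-n matrix, so n is the square root of the element count.
n = int((matrix_bytes / BYTES_PER_DOUBLE) ** 0.5)
print(f"matrix dimension n ~ {n:,}")        # roughly 190,000 x 190,000

solve_flops = (2 / 3) * n**3                # flops to solve A*x = b via LU

for label, rate in [("desktop (~20 Gigaflops)", 20e9),
                    ("EC2 cluster (~1.3 Teraflops)", 1.3e12)]:
    hours = solve_flops / rate / 3600
    print(f"{label}: ~{hours:,.1f} hours")
```

Under these assumptions, the solve drops from roughly 64 hours to about one hour, and the ratio of the two rates (1.3 Teraflops to 20 Gigaflops, or 65x) lines up with the roughly 60-fold speedup reported.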
The chart below illustrates the near-linear scalability of the EC2 cluster for MATLAB's parallel backslash operator as matrix size and cluster size grow together.
Each Cluster Compute instance runs 8 workers (one per processor core on an 8-core instance). Each doubling of the worker count corresponds to a doubling of the number of Cluster Compute instances, from 1 up to 32. Overall throughput (measured in Gigaflops, on the y-axis) grew nearly linearly with matrix size (on the x-axis) as instances were added.
NASA JPL
At NASA's Jet Propulsion Laboratory (JPL), a team developed the ATHLETE robot. Each year, they conduct autonomous field tests as part of the D-RATS (Desert Research and Technology Studies) exercises, collaborating with autonomous robots from other NASA centers. Operators depend on high-resolution satellite imagery for situational awareness while navigating the robots. Recently, JPL engineers built an application that speeds up the processing of large (giga-pixel) images by exploiting the massively parallel nature of the workflow. The application is built on Polyphony, a flexible and modular workflow framework based on Amazon SQS and Eclipse Equinox.
JPL had previously used Polyphony to evaluate the effectiveness of cloud computing for processing hundreds of thousands of small images in an EC2-based environment, and has now adopted cluster compute environments for handling very large monolithic images. Recently, they processed a 3.2 giga-pixel image of the field site (provided by the USGS) in under two hours on a cluster of 30 Cluster Compute Instances, an order-of-magnitude improvement over prior implementations in non-HPC environments.
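Polyphony itself is a Java framework layered on Amazon SQS and Eclipse Equinox, and JPL's code is not shown in this post. The sketch below is a minimal Python/boto3 illustration of the general pattern implied here: split a giga-pixel image into tiles, post one task per tile to an SQS queue, and let every node in the cluster pull tasks until the queue drains. The queue name, tile size, and process_tile stub are all hypothetical.

```python
# Minimal sketch of a queue-driven tiling workflow in the spirit of
# Polyphony. JPL's framework is Java/Equinox-based; this Python version
# is illustrative only, and process_tile is a hypothetical stub.
import json
import boto3

sqs = boto3.client("sqs")
queue_url = sqs.create_queue(QueueName="image-tiles")["QueueUrl"]

TILE = 8192  # tile edge length in pixels (assumed, not from the post)

def enqueue_tiles(width, height):
    """Split a width-by-height image into tiles, one SQS task per tile."""
    for x in range(0, width, TILE):
        for y in range(0, height, TILE):
            sqs.send_message(
                QueueUrl=queue_url,
                MessageBody=json.dumps({"x": x, "y": y, "w": TILE, "h": TILE}),
            )

def process_tile(tile):
    """Placeholder: real work would read the tile from shared storage,
    process it, and write the result back."""
    print("processed tile at", tile["x"], tile["y"])

def worker():
    """Runs on each cluster node: pull tile tasks until the queue drains."""
    while True:
        resp = sqs.receive_message(QueueUrl=queue_url,
                                   MaxNumberOfMessages=1,
                                   WaitTimeSeconds=10)
        messages = resp.get("Messages", [])
        if not messages:
            break                          # queue is empty; worker exits
        msg = messages[0]
        process_tile(json.loads(msg["Body"]))
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=msg["ReceiptHandle"])
```

Because SQS decouples the producers from the consumers, scaling the job up is simply a matter of starting the worker loop on more Cluster Compute instances.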
It is gratifying to see MathWorks and JPL using Cluster Compute Instances so successfully, and it is exciting to see other customers scaling up to 128-node (1024-core) clusters with full bisection bandwidth. For more insight, check out this blog post, which delves deeper into the subject. If you have experiences of your own to share, send me an email or leave a comment.
— Alex