Amazon EMR Now Offers Lower Costs and Enhanced Performance for Spark Workloads Utilizing Graviton2-Powered Instances

Amazon EMR now supports M6g, C6g, and R6g instances with Amazon EMR versions 6.1.0, 5.31.0, and later. These instances use AWS Graviton2 processors, which AWS custom-built on 64-bit Arm Neoverse cores to optimize price performance for cloud workloads running on Amazon Elastic Compute Cloud (Amazon EC2). On Graviton2 instances, the Amazon EMR runtime for Apache Spark delivers up to 15% better performance at up to 30% lower cost than on previous-generation instances. In our TPC-DS 3 TB benchmark tests, queries ran up to 32 times faster with the Amazon EMR runtime for Apache Spark.
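As a rough illustration of requesting Graviton2 instances, the sketch below builds the keyword arguments for the boto3 EMR `run_job_flow` call without sending them. The cluster name, the instance type and count, and the use of the default EMR roles are assumptions made for this example, not settings taken from the benchmark.

```python
def build_graviton2_cluster_request(
    name="spark-graviton2-demo",      # hypothetical cluster name
    release_label="emr-6.1.0",        # a release that supports M6g/C6g/R6g
    instance_type="m6g.2xlarge",      # assumed Graviton2 instance type
    core_count=5,                     # assumed number of core nodes
):
    """Return keyword arguments for boto3's EMR run_job_flow call."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Primary",
                    "InstanceRole": "MASTER",
                    "InstanceType": instance_type,
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "InstanceType": instance_type,
                    "InstanceCount": core_count,
                },
            ],
            # Keeps the cluster alive with no steps; terminate it yourself.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumes the default EMR roles already exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_graviton2_cluster_request()

# To launch for real (needs boto3, AWS credentials, and the roles above):
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

Building the request as a plain dictionary makes it easy to switch between, say, `m5.2xlarge` and `m6g.2xlarge` when comparing the two instance families on the same workload.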

Apache Spark is versatile and can be applied to a wide range of analytics scenarios, from large-scale data transformations to streaming, data science, and machine learning. Amazon EMR offers the latest stable open-source innovations along with efficient storage solutions via Amazon S3, as well as unique cost-saving features like Spot Instances and Managed Scaling.

The Amazon EMR runtime for Apache Spark is an optimized runtime environment for Apache Spark, which is available and enabled by default on Amazon EMR release 5.28.0 and later. It provides 100% API compatibility with open-source Apache Spark, allowing your workloads to run faster and incur lower compute costs on Amazon EMR without requiring any modifications to your application.

Performance Improvements with AWS Graviton2 and Amazon EMR

To evaluate the performance enhancements, we executed TPC-DS 3 TB benchmark queries on Amazon EMR 5.30.1 using the Amazon EMR runtime for Apache Spark (compatible with Apache Spark version 2.4). We employed 5-10 node clusters of M6g instances with data stored in Amazon Simple Storage Service (Amazon S3) and compared these results to equivalent configurations using M5 instances. Performance improvements were measured by analyzing total query execution time and the geometric mean of query execution times across the 104 TPC-DS 3 TB benchmark queries.
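The two metrics named above can be computed from per-query runtimes as in the following minimal sketch. The query names and timings are invented for illustration and are not benchmark data.

```python
from statistics import geometric_mean


def summarize(runtimes_by_query):
    """Return (total runtime, geometric mean) for a {query: seconds} mapping."""
    times = list(runtimes_by_query.values())
    return sum(times), geometric_mean(times)


# Hypothetical per-query runtimes in seconds (not benchmark results).
sample = {"q1": 10.0, "q2": 40.0, "q3": 160.0}

total, geo = summarize(sample)
print(f"total = {total:.1f} s, geometric mean = {geo:.2f} s")
```

The total is dominated by the longest queries, while the geometric mean weights every query's relative speedup equally, which is why the two figures are reported side by side.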

The results showed that EMR clusters on M6g instances improved total query runtime by 11.61% to 15.61% over their M5 counterparts, and improved the geometric mean of query runtimes by 10.52% to 12.91%. In terms of cost, the M6g clusters had 21.58% to 30.58% lower instance-hour costs than the M5 clusters when running the 104 TPC-DS benchmark queries.

| Instance size | Core instances | Total query runtime on M5 (s) | Total query runtime on M6g (s) | Improvement with M6g | Geometric mean on M5 (s) | Geometric mean on M6g (s) | Geometric mean improvement with M6g |
|---|---|---|---|---|---|---|---|
| 16XL | 5 | 6157 | 5196 | 15.61% | 33 | 29 | 12.73% |
| 12XL | 5 | 6167 | 5389 | 12.63% | 34 | 30 | 10.79% |
| 8XL | 5 | 6857 | 6061 | 11.61% | 35 | 32 | 10.52% |
| 4XL | 5 | 10593 | 9313 | 12.08% | 47 | 41 | 12.91% |
| 2XL | 10 | 10676 | 9240 | 13.45% | 47 | 42 | 11.24% |
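The runtime improvement figures in the table above follow from the total runtimes, and an instance-hour cost reduction can be estimated the same way once hourly prices are known. The sketch below uses the 2XL row; the hourly prices are placeholders chosen for illustration (assumptions, not actual AWS pricing), so the cost figure it prints is not the article's.

```python
def percent_improvement(baseline, candidate):
    """Percentage by which candidate is lower than baseline."""
    return (baseline - candidate) / baseline * 100.0


def cost_reduction_percent(base_seconds, base_price, cand_seconds, cand_price, nodes=1):
    """Compare instance-hour cost: runtime (hours) x hourly price x node count."""
    base_cost = base_seconds / 3600 * base_price * nodes
    cand_cost = cand_seconds / 3600 * cand_price * nodes
    return percent_improvement(base_cost, cand_cost)


# Total runtimes in seconds from the 2XL row of the table (10 core instances).
m5_total, m6g_total = 10676, 9240

runtime_gain = percent_improvement(m5_total, m6g_total)
print(f"runtime improvement: {runtime_gain:.2f}%")  # matches the 13.45% in the table

# Placeholder hourly prices per instance; substitute real prices for your Region.
ASSUMED_M5_PRICE = 1.00
ASSUMED_M6G_PRICE = 0.80

cost_gain = cost_reduction_percent(
    m5_total, ASSUMED_M5_PRICE, m6g_total, ASSUMED_M6G_PRICE, nodes=10
)
print(f"estimated cost reduction: {cost_gain:.2f}%")
```

Because cost is runtime multiplied by price, a shorter runtime and a lower hourly price compound, which is how a runtime improvement of roughly 13% can translate into a considerably larger cost reduction.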

We also compared M6g 2XL instances with M5 2XL instances query by query, using the EMR runtime for Apache Spark across the 104 queries in the TPC-DS 3 TB benchmark. Of the 104 queries, 100 ran faster on M6g 2XL, while 4 regressed (q41, q20, q42, and q52), with a maximum regression of -20.99%. If you are considering migrating your EMR Spark workloads from M5 to M6g instances, we strongly recommend testing your own workloads to confirm that no queries are adversely affected.
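One simple way to run such a check on your own workload is to time each query on both instance families and flag the ones that got slower. The helper below is a hypothetical sketch; the function name, the tolerance parameter, and the sample timings are assumptions for illustration only.

```python
def find_regressions(baseline, candidate, tolerance_pct=0.0):
    """Return {query: percent change} for queries slower on the candidate.

    A negative percent change means the candidate run took longer than the
    baseline. Queries missing from the candidate run are skipped.
    """
    regressions = {}
    for query, base_time in baseline.items():
        if query not in candidate:
            continue
        change = (base_time - candidate[query]) / base_time * 100.0
        if change < -tolerance_pct:
            regressions[query] = change
    # Worst regression first.
    return dict(sorted(regressions.items(), key=lambda item: item[1]))


# Hypothetical runtimes in seconds (not benchmark data).
m5_times = {"q1": 100.0, "q2": 50.0, "q3": 20.0}
m6g_times = {"q1": 85.0, "q2": 45.0, "q3": 24.0}

slower = find_regressions(m5_times, m6g_times)
for query, change in slower.items():
    print(f"{query}: {change:.2f}%")
```

Setting `tolerance_pct` to a small positive value, such as 2.0, filters out run-to-run noise so that only meaningful slowdowns are reported.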

R6g instances showed similar gains over equivalent R5 instances when running Apache Spark workloads. In our tests, total query runtime improved by 14.27% to 21.50% across five instance sizes, and the geometric mean improved by 12.48% to 18.95%. In terms of cost, R6g EMR clusters had 23.26% to 31.66% lower instance-hour costs than R5 EMR clusters when running the 104 TPC-DS benchmark queries. However, 4 benchmark queries (q6, q21, q41, and q26) took longer on R6g clusters than on R5 clusters, with a maximum regression of -18.28%.

With the C6g instances, we noted performance improvements over C5 instances for Spark workloads across 2XL, 4XL, 12XL, and 16XL sizes. There was a slight regression of -0.38% in query execution performance for 8XL instances. We found that C6g instance EMR clusters provided between 16.84% and 24.15% lower instance hour costs compared to equivalent C5 EMR clusters for executing the 104 TPC-DS benchmark queries. Of the 104 TPC-DS queries, 73 improved with C6g 4XL, whereas performance for 31 queries regressed, with a maximum regression of -31.38% for q78.

Conclusion

Using Amazon EMR with M6g, C6g, and R6g instances powered by Graviton2 processors, we observed significant performance improvements and cost reductions when running the 104 TPC-DS benchmark queries. To stay up to date on future Apache Spark optimizations, subscribe to the Big Data blog's RSS feed.

