Amazon VGT2 Las Vegas: Accelerate Apache Spark Workloads with Amazon EMR


The Amazon EMR runtime for Apache Spark offers a performance-optimized environment that maintains complete API compatibility with open-source Apache Spark. As of Amazon EMR version 6.9.0, the runtime supports Apache Spark 3.3.0.

With Amazon EMR 6.9.0, you can run your Apache Spark 3.x applications faster and at lower cost, without modifying your existing applications. Performance benchmarks derived from TPC-DS tests conducted at a 3 TB scale show that the EMR runtime for Apache Spark 3.3.0 is, on average, 3.5 times faster than open-source Apache Spark 3.3.0.

In this article, we delve into the benchmarking results from running a TPC-DS application first on open-source Apache Spark and then on Amazon EMR 6.9, showcasing the benefits of the optimized Spark runtime. We also conduct a thorough cost analysis and provide detailed instructions for running the benchmark.

Performance Insights

To measure the performance enhancements, we utilized an open-source Spark performance testing utility based on the TPC-DS toolkit. The tests were executed on a seven-node EMR cluster (six core nodes and one primary node) using the EMR runtime for Apache Spark, and on a self-managed seven-node cluster on Amazon EC2 running the corresponding open-source Spark version. Both tests were performed with data stored in Amazon Simple Storage Service (Amazon S3).

Dynamic Resource Allocation (DRA) is an advantageous feature for fluctuating workloads. However, to ensure a fair performance comparison during our benchmarking, we opted to disable DRA in both open-source Spark and Amazon EMR, as our test data volume remained constant at 3 TB.
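Concretely, disabling DRA pins the executor count for the duration of the run, which is what makes the two clusters comparable. A minimal sketch of the relevant settings follows; the main class name is a placeholder, not a value from this post:

```shell
# Disable Dynamic Resource Allocation so the executor count stays fixed
# for the whole benchmark run (applied identically on both clusters).
# <benchmark-main-class> is a placeholder.
spark-submit \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.shuffle.service.enabled=false \
  --class <benchmark-main-class> \
  spark-benchmark-assembly-3.3.0.jar
```

The same pair of settings can alternatively be placed in spark-defaults.conf so every submission picks them up.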

The table below summarizes the total job runtime (in seconds) for all queries in the 3 TB dataset across Amazon EMR version 6.9.0 and open-source Spark version 3.3.0. Our findings revealed that the TPC-DS tests executed on Amazon EMR on EC2 were 3.5 times faster than those on an equivalent open-source Spark cluster.

The per-query speed improvements on Amazon EMR 6.9, as compared to the open-source runtime, are illustrated in the accompanying chart. The x-axis displays each query in the 3 TB benchmark while the y-axis indicates the speedup achieved by the EMR runtime. Noteworthy gains were observed, with some TPC-DS queries showing performance improvements exceeding 10 times, particularly for queries 24b, 72, 95, and 96.

Cost Analysis

The performance advancements of the EMR runtime for Apache Spark translate directly into reduced costs. We achieved a 67% cost reduction running the benchmark application on Amazon EMR compared to the expenses incurred using open-source Spark on Amazon EC2 with an equivalent cluster configuration, because the shorter runtime reduces both Amazon EMR and Amazon EC2 usage hours. Note that Amazon EMR pricing adds a charge for running EMR applications on EC2 clusters on top of the underlying EC2 compute and storage costs.

In the US East (N. Virginia) Region, the estimated cost of running the benchmark was $27.01 per execution for open-source Spark on EC2, while it was significantly lower at $8.82 per execution for Amazon EMR.

Benchmark Job Cost Breakdown

Cluster (1 primary, 6 core nodes)   Runtime (h)  Est. cost  EC2 instances  Total vCPU  Total memory (GiB)  Root device (Amazon EBS)
Open-source Spark on Amazon EC2     2.23         $27.01     7              252         504                 20 GiB gp2
Amazon EMR on Amazon EC2            0.63         $8.82      7              252         504                 20 GiB gp2
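As a sanity check, the 67% figure follows directly from the two per-run costs in the table:

```shell
# Percent cost reduction of Amazon EMR ($8.82/run) versus
# open-source Spark on EC2 ($27.01/run).
awk 'BEGIN { printf "%.0f%%\n", (27.01 - 8.82) / 27.01 * 100 }'
# prints 67%
```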

Setting Up Open-Source Spark Benchmarking

The following sections outline the steps to establish the benchmarking environment. For comprehensive instructions and examples, refer to this GitHub repository.

We utilized the open-source tool Flintrock to launch our Apache Spark cluster on Amazon EC2. Flintrock streamlines the process of deploying an Apache Spark cluster via the command line.
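Flintrock is distributed as a Python package, so installation is a single pip command (shown here as a sketch; it assumes Python and pip from the prerequisites below are already in place):

```shell
# Install Flintrock into the current Python environment and confirm
# the CLI is available on the path.
pip3 install flintrock
flintrock --version
```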

Prerequisites

Before proceeding, ensure that you have completed the following prerequisites:

  • Install Python 3.7.x or higher.
  • Install Pip3 version 22.2.2 or later.
  • Add the Python bin directory to your environment path.
  • Configure the AWS Command Line Interface (AWS CLI) by running aws configure so that it points to the benchmarking account; see the AWS CLI configuration documentation for guidance.
  • Obtain a key pair with restrictive file permissions for accessing the primary node of OSS Spark.
  • Create an S3 bucket in your test account if necessary.
  • Transfer the TPC-DS source data into your S3 bucket.
  • Build the benchmark application as per the provided instructions or download a pre-built spark-benchmark-assembly-3.3.0.jar for Spark 3.3.0.
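The AWS CLI-related prerequisites above can be sketched as follows; the bucket name, key pair file, region, and local data path are placeholders, not values from this post:

```shell
# Point the AWS CLI at the benchmarking account (interactive).
aws configure

# Restrictive file permissions on the key pair used to SSH into the
# OSS Spark primary node.
chmod 400 my-benchmark-keypair.pem

# Create the test bucket and upload the TPC-DS source data
# (names and paths are illustrative).
aws s3 mb s3://my-tpcds-benchmark-bucket --region us-east-1
aws s3 cp ./tpcds-source-data/ s3://my-tpcds-benchmark-bucket/tpcds/ --recursive
```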

Deploying the Spark Cluster and Running the Benchmark Job

Follow these steps to deploy the Spark cluster and execute the benchmark job:

  1. Install the Flintrock tool via pip as instructed in the setup guide.
  2. Run flintrock configure to generate a default configuration file.
  3. Adjust the default config.yaml file as per your requirements. Alternatively, copy the content from the provided configuration file and save it.
  4. Launch the seven-node Spark cluster on Amazon EC2 using Flintrock.
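For step 3, the configuration might look like the following sketch. The Hadoop version, instance type, AMI ID, and key names are illustrative assumptions, not values from this post; only the Spark version and worker count come from the benchmark setup described above:

```yaml
# Illustrative Flintrock config.yaml for a 1-primary / 6-worker cluster.
services:
  spark:
    version: 3.3.0          # match the Spark version used by EMR 6.9.0
  hdfs:
    version: 3.3.3          # Hadoop version is an assumption
provider: ec2
providers:
  ec2:
    key-name: my-benchmark-keypair              # placeholder
    identity-file: /path/to/my-benchmark-keypair.pem
    instance-type: r5d.4xlarge                  # illustrative choice
    region: us-east-1
    ami: ami-xxxxxxxxxxxxxxxxx                  # Amazon Linux 2 AMI ID (placeholder)
    user: ec2-user
launch:
  num-slaves: 6             # six workers plus the primary node = 7 nodes
  install-hdfs: True
  install-spark: True
```

With the file in place, step 4 is a single command such as flintrock launch tpcds-cluster.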

Upon successful creation, the cluster will consist of one primary node and six worker nodes. If you encounter any errors, review the configuration file values, particularly the Spark and Hadoop versions along with the attributes of the download-source and the AMI.

The OSS Spark cluster launched by Flintrock does not include a YARN resource manager. To enable one, download the yarn-site.xml and enable-yarn.sh files from the GitHub repository, and replace <private ip of primary node> with the actual private IP address of your Flintrock cluster’s primary node, which can be retrieved from the Amazon EC2 console.
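The key entry in that file is the ResourceManager hostname; a minimal sketch of the relevant fragment (the placeholder is kept as in the repository files and must be substituted with your primary node's private IP):

```xml
<!-- Fragment of yarn-site.xml: point the node managers at the
     ResourceManager running on the Flintrock primary node. -->
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value><private ip of primary node></value>
  </property>
</configuration>
```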


