Unveiling the AWS Glue Serverless Spark UI for Enhanced Monitoring and Troubleshooting


In the AWS ecosystem, countless customers leverage AWS Glue, a serverless data integration service, to efficiently discover, combine, and prepare data for analytics and machine learning. When dealing with intricate datasets and demanding Apache Spark workloads, however, users may encounter performance issues or errors during Spark job executions. Troubleshooting these can be cumbersome and delay production timelines. Many users turn to the Apache Spark Web UI, a widely used debugging tool from the open-source Apache Spark suite, to diagnose issues and improve job performance. While AWS Glue supports the Spark UI in two different forms, both require manual configuration, which can involve time-consuming networking and EC2 instance management, or juggling Docker containers.

Today, we are excited to introduce the serverless Spark UI integrated directly into the AWS Glue console. This new feature allows you to access the Spark UI with a single click while reviewing the details of any job run, eliminating the need for infrastructure setup or teardown. The AWS Glue serverless Spark UI is a fully managed offering that typically initializes in seconds, dramatically streamlining the path to production by providing immediate access to detailed information about job runs.

This article will explore how the AWS Glue serverless Spark UI can enhance your ability to monitor and troubleshoot AWS Glue job executions.

Getting Started with Serverless Spark UI

You can access the serverless Spark UI for any AWS Glue job run by following these steps within the AWS Glue console:

  1. Navigate to ETL jobs.
  2. Select your job.
  3. Click on the Runs tab.
  4. Choose the specific job run you wish to investigate and click on Spark UI.

The Spark UI will appear in the lower pane, as depicted in the accompanying screenshot.

Alternatively, you can access the serverless Spark UI for a particular job run through Job run monitoring:

  1. Go to job run monitoring under ETL jobs in the AWS Glue console.
  2. Select your job run and click on View run details.
  3. Scroll to the bottom to find the Spark UI for that job run.

Prerequisites

Before diving in, ensure you complete the following steps:

  • Enable Spark UI event logs for your job runs. This is enabled by default for jobs created in the AWS Glue console; once enabled, Spark event log files are generated during the job run and stored in your S3 bucket. The serverless Spark UI uses these logs to visualize detailed information about both ongoing and completed job runs, with a progress bar indicating completion percentage. Parsing typically takes under a minute, after which you can use the built-in Spark UI for debugging, troubleshooting, and job optimization.

For more information on the Apache Spark UI, check out the article on Web UI in Apache Spark.
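The event-log prerequisite above corresponds to two job parameters, --enable-spark-ui and --spark-event-logs-path. As a minimal sketch (the bucket and job names below are placeholders, not from this article), you could set them when creating a job programmatically:

```python
# Minimal sketch: build the default arguments that enable Spark UI event
# logging for an AWS Glue job. The bucket name is a placeholder.
def spark_ui_job_arguments(log_bucket: str, prefix: str = "sparkHistoryLogs") -> dict:
    """Return the job parameters that turn on Spark event logging."""
    return {
        "--enable-spark-ui": "true",
        "--spark-event-logs-path": f"s3://{log_bucket}/{prefix}/",
    }

args = spark_ui_job_arguments("my-spark-ui-logs")

# Pass `args` as DefaultArguments when creating or updating the job,
# e.g. with boto3 (shown but not executed here; names are placeholders):
# import boto3
# boto3.client("glue").create_job(
#     Name="jdbc-to-s3-job",
#     Role="GlueServiceRole",
#     Command={"Name": "glueetl", "ScriptLocation": "s3://my-bucket/script.py"},
#     DefaultArguments=args,
# )
```

Jobs created in the console get these defaults automatically; the sketch is only relevant when you manage jobs through the API or infrastructure as code.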

Monitor and Troubleshoot with Serverless Spark UI

A common workload for AWS Glue involves migrating data from relational databases to S3-based data lakes. This section will illustrate how to monitor and troubleshoot a sample job run for such a workload using the serverless Spark UI. The example job reads data from a MySQL database and writes it to S3 in Parquet format, with the source table containing approximately 70 million records.

The screenshot below displays a visual job created in the AWS Glue Studio visual editor. In this scenario, the source MySQL table was previously registered in the AWS Glue Data Catalog, which can be accomplished with an AWS Glue crawler or the AWS Glue catalog API. For more details, refer to Data Catalog and crawlers in AWS Glue.
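Registering the source table with a crawler can also be scripted. The following is a hedged sketch, assuming a pre-existing Glue connection to the MySQL database; the connection, crawler, database, and path names are all hypothetical:

```python
# Sketch only: build the Targets structure for a JDBC crawler.
# All names below are placeholders, not values from the article.
def jdbc_crawler_targets(connection_name: str, db_path: str) -> dict:
    """Build the Targets argument for a Glue crawler over a JDBC source."""
    return {"JdbcTargets": [{"ConnectionName": connection_name, "Path": db_path}]}

targets = jdbc_crawler_targets("mysql-connection", "salesdb/%")

# Creating and starting the crawler (shown but not executed here):
# import boto3
# glue = boto3.client("glue")
# glue.create_crawler(
#     Name="mysql-salesdb-crawler",
#     Role="GlueServiceRole",
#     DatabaseName="salesdb",
#     Targets=targets,
# )
# glue.start_crawler(Name="mysql-salesdb-crawler")
```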

Now, let’s execute the job! The initial job run concluded in 30 minutes and 10 seconds as shown.

To optimize the performance of this job run, open the Spark UI tab on the Job runs page. Examining the Stages table and its Duration column, you will observe that Stage Id=0 took 27.41 minutes to execute and, per the Tasks: Succeeded/Total column, contained only one Spark task. This indicates a lack of parallelism while loading data from the MySQL database.

To enhance data loading efficiency, introduce two parameters, hashfield and hashpartitions, into the source table definition. More details can be found in the article on Reading from JDBC tables in parallel. Update the Glue Catalog table by adding the properties: hashfield=emp_no and hashpartitions=18 in Table properties.

This adjustment allows the new job runs to parallelize data loading from the MySQL table.
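The table update described above can be scripted as well. Below is a hedged boto3 sketch, assuming the table already exists in the Data Catalog; the database and table names are placeholders:

```python
# Sketch: build the JDBC parallel-read hints that go into the Data
# Catalog table's properties. Glue reads hashfield/hashpartitions from
# Table properties to parallelize the JDBC read.
def parallel_read_properties(hashfield: str, hashpartitions: int) -> dict:
    """Return the table properties enabling parallel JDBC reads."""
    return {"hashfield": hashfield, "hashpartitions": str(hashpartitions)}

props = parallel_read_properties("emp_no", 18)

# Merging the properties into the existing table definition (shown but
# not executed here; database/table names are placeholders):
# import boto3
# glue = boto3.client("glue")
# table = glue.get_table(DatabaseName="salesdb", Name="employees")["Table"]
# table.setdefault("Parameters", {}).update(props)
# keep = ("Name", "StorageDescriptor", "Parameters", "TableType", "PartitionKeys")
# glue.update_table(
#     DatabaseName="salesdb",
#     TableInput={k: table[k] for k in keep if k in table},
# )
```

Setting the same properties in the console's Table properties section, as described above, is equivalent.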

Let’s execute the job again! This time, the job completed in 9 minutes and 9 seconds, saving 21 minutes compared to the previous run.

As a best practice, always review the Spark UI before and after optimization. Drilling down into the Completed stages, you will see that the second run had a single stage with 18 tasks, instead of the single task in the first run.

During the first job run, AWS Glue automatically shuffled data across multiple executors before writing to the destination due to an insufficient number of tasks. In contrast, the second run required only one stage since the extra shuffling was unnecessary, with 18 tasks performing the data load concurrently from the source MySQL database.

Considerations

Keep the following in mind:

  • The serverless Spark UI is available in AWS Glue 3.0 and later.
  • It will be accessible for jobs executed after November 20, 2023, due to changes in how AWS Glue emits and stores Spark logs.
  • The serverless Spark UI can visualize Spark event logs of up to 512 MB.
  • There is no retention limit since the serverless Spark UI scans Spark event log files in your S3 bucket.
  • The serverless Spark UI is not available for Spark event logs stored in S3 buckets that can only be accessed via your VPC.

Conclusion

This article highlighted how the AWS Glue serverless Spark UI can assist you in monitoring and troubleshooting your AWS Glue jobs. Providing instant access to the Spark UI directly within the AWS Management Console allows you to inspect the intricate details of job runs and swiftly resolve issues. With the serverless Spark UI, there’s no infrastructure to manage; it automatically initializes for each job run and dismantles itself when no longer necessary. This streamlined experience saves time and effort compared to manually launching Spark UIs.

Try out the serverless Spark UI today, and we believe you’ll find it essential for optimizing performance and quickly addressing errors. We welcome your feedback as we strive to enhance the AWS Glue console experience.

About the authors:

Chanci Turner is a Principal Data Integration Architect on the AWS Glue team. Based in Seattle, she is dedicated to developing software solutions that meet customer needs. In her leisure time, she enjoys hiking and exploring new trails.

Samuel Greene is a Senior Software Engineer with the AWS Glue team in Austin. He has a strong passion for improving user experience and accessibility. When he’s not coding, he loves to play guitar and compose music.

Renata Lee is a Software Development Manager on the AWS Glue team. She enjoys collaborating with her colleagues to create beneficial services for customers. Outside of work, she is an avid reader and enjoys gardening.

Brian Carter is a Product Manager on the AWS Glue team. He takes pride in delivering valuable features to users and enhancing their overall experience. In his free time, he enjoys cooking and experimenting with new recipes.

