Host the Spark UI on Amazon SageMaker Studio | Artificial Intelligence


Amazon SageMaker provides various methods to execute distributed data processing tasks using Apache Spark, a widely used framework for handling big data. Users can interactively run Spark applications via Amazon SageMaker Studio by linking SageMaker Studio notebooks and AWS Glue Interactive Sessions, which facilitates running Spark jobs on a serverless cluster. This interactive approach allows for seamless processing of large datasets without the need for cluster management.

Alternatively, for those requiring more environmental control, a pre-configured SageMaker Spark container can be utilized to execute Spark applications as batch jobs on a fully managed distributed cluster via Amazon SageMaker Processing. This option offers the flexibility to choose from different instance types (such as compute optimized or memory optimized), the number of nodes in the cluster, and the overall configuration, making it ideal for data processing and model training.

Moreover, Spark applications can also be executed by connecting Studio notebooks with Amazon EMR clusters or by running your Spark cluster on Amazon Elastic Compute Cloud (Amazon EC2). All these options enable the generation and storage of Spark event logs, which can be analyzed through the Spark UI—a web-based interface that runs a Spark History Server. This interface allows monitoring of Spark application progress, resource usage tracking, and error debugging.

In this article, we present a solution for setting up and running the Spark History Server on SageMaker Studio, allowing direct access to the Spark UI from the SageMaker Studio IDE. This setup facilitates the analysis of Spark logs produced by various AWS services (including AWS Glue Interactive Sessions, SageMaker Processing jobs, and Amazon EMR) that are stored in an Amazon S3 bucket.

Solution Overview

The solution integrates the Spark History Server into the Jupyter Server application within SageMaker Studio, enabling users to access Spark logs directly from the IDE. The integrated Spark History Server supports:

  • Access to logs from SageMaker Processing Spark jobs
  • Access to logs from AWS Glue Spark applications
  • Access to logs from self-managed Spark clusters and Amazon EMR

A command line interface (CLI) tool called sm-spark-cli is also included, allowing interaction with the Spark UI from the SageMaker Studio system terminal. This utility enables the management of the Spark History Server without leaving the SageMaker Studio environment.

The solution comprises shell scripts that carry out the following actions:

  1. Install Spark on the Jupyter Server for SageMaker Studio user profiles or shared spaces.
  2. Install the sm-spark-cli for a user profile or shared space.

Manual Installation of the Spark UI in a SageMaker Studio Domain

To host the Spark UI on SageMaker Studio, follow these steps:

  1. Open the System terminal from the SageMaker Studio launcher.
  2. Execute the following commands in the terminal:
curl -LO https://github.com/aws-samples/amazon-sagemaker-spark-ui/releases/download/v0.1.0/amazon-sagemaker-spark-ui-0.1.0.tar.gz
tar -xvzf amazon-sagemaker-spark-ui-0.1.0.tar.gz
cd amazon-sagemaker-spark-ui-0.1.0/install-scripts
chmod +x install-history-server.sh
./install-history-server.sh

The commands will take a few seconds to complete.

Once the installation is finished, you can launch the Spark UI using the sm-spark-cli and access it via a web browser with the following command:

sm-spark-cli start s3://DOC-EXAMPLE-BUCKET/<SPARK_EVENT_LOGS_LOCATION>

You can set up the S3 location for the event logs generated by SageMaker Processing, AWS Glue, or Amazon EMR during Spark application execution. For SageMaker Studio notebooks and AWS Glue Interactive Sessions, the Spark event log location can be configured directly from the notebook using the sparkmagic kernel. This kernel provides tools for interacting with remote Spark clusters through notebooks and includes magic commands (%spark, %sql) for running Spark code and configuring Spark settings like executor memory and cores.
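As a sketch of what this looks like in a notebook: with a Livy-backed sparkmagic kernel, the event log location can be set when the session starts using the %%configure cell magic. The bucket name and prefix below are placeholders, not values from this article:

```
%%configure -f
{
    "conf": {
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": "s3://DOC-EXAMPLE-BUCKET/spark-event-logs/"
    }
}
```

The -f flag forces the running session to restart with the new configuration; any Spark application run afterward writes its event logs to the given S3 location, where the Spark History Server can pick them up.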

You can configure the Spark event log location for a SageMaker Processing job directly via the SageMaker Python SDK.
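A minimal sketch of this configuration with the SageMaker Python SDK follows. The bucket name, role ARN, and script path are placeholder assumptions, not values from this article:

```python
# Sketch: pointing a SageMaker Processing Spark job at an S3 event log
# location, so the Spark UI hosted on Studio can replay the logs later.

def event_logs_uri(bucket: str, prefix: str = "spark-event-logs") -> str:
    """Build the S3 URI the Spark History Server will read from."""
    return f"s3://{bucket}/{prefix}"


def run_spark_job(role_arn: str, bucket: str) -> None:
    # Imported lazily so the helper above works without the SDK installed.
    from sagemaker.spark.processing import PySparkProcessor

    processor = PySparkProcessor(
        base_job_name="sm-spark",
        framework_version="3.1",       # Spark version of the prebuilt container
        role=role_arn,
        instance_count=2,
        instance_type="ml.m5.xlarge",
    )
    # spark_event_logs_s3_uri tells the job where to persist event logs.
    processor.run(
        submit_app="./preprocess.py",  # hypothetical local PySpark script
        spark_event_logs_s3_uri=event_logs_uri(bucket),
    )
```

The same S3 URI passed here is the one you would later hand to sm-spark-cli start.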

Refer to the AWS documentation for more information on configuring these services.

You can then use the generated URL to access the Spark UI.

To check the status of the Spark History Server, run the sm-spark-cli status command in the Studio system terminal. You can also stop the Spark History Server when it is no longer needed.

Automating Spark UI Installation for SageMaker Studio Users

As an IT administrator, you can automate the installation for users in a SageMaker Studio domain through a lifecycle configuration, applied either to all user profiles or only to specific ones. For more details, see Customize Amazon SageMaker Studio using Lifecycle Configurations. To do this, create a lifecycle configuration from the install-history-server.sh script and attach it to an existing SageMaker Studio domain; the script then runs for each user profile in the domain when its Jupyter Server app starts.

From a terminal configured with the AWS Command Line Interface (AWS CLI) and the necessary permissions, run the following commands:

curl -LO https://github.com/aws-samples/amazon-sagemaker-spark-ui/releases/download/v0.1.0/amazon-sagemaker-spark-ui-0.1.0.tar.gz
tar -xvzf amazon-sagemaker-spark-ui-0.1.0.tar.gz
cd amazon-sagemaker-spark-ui-0.1.0/install-scripts

LCC_CONTENT=`openssl base64 -A -in install-history-server.sh`

aws sagemaker create-studio-lifecycle-config \
	--studio-lifecycle-config-name install-spark-ui-on-jupyterserver \
	--studio-lifecycle-config-content "$LCC_CONTENT" \
	--studio-lifecycle-config-app-type JupyterServer \
	--query 'StudioLifecycleConfigArn'

aws sagemaker update-domain \
	--region {YOUR_AWS_REGION} \
	--domain-id {YOUR_STUDIO_DOMAIN_ID} \
	--default-user-settings \
	'{
	"JupyterServerAppSettings": {
	"DefaultResourceSpec": {
	"LifecycleConfigArn": "arn:aws:sagemaker:{YOUR_AWS_REGION}:{YOUR_AWS_ACCOUNT_ID}:studio-lifecycle-config/install-spark-ui-on-jupyterserver",
	"InstanceType": "system"
	},
	"LifecycleConfigArns": [
	"arn:aws:sagemaker:{YOUR_AWS_REGION}:{YOUR_AWS_ACCOUNT_ID}:studio-lifecycle-config/install-spark-ui-on-jupyterserver"
	]
	}}'

After the Jupyter Server restarts, the Spark UI and sm-spark-cli will be accessible in your SageMaker Studio environment.

Cleaning Up

To clean up the Spark UI in a SageMaker Studio domain, you can uninstall it either manually or automatically.

Manual Uninstallation of the Spark UI

To manually remove the Spark UI in SageMaker Studio, follow these steps:

  1. Open the System terminal in the SageMaker Studio launcher.
  2. Execute the following commands in the terminal:
cd amazon-sagemaker-spark-ui-0.1.0/install-scripts
chmod +x uninstall-history-server.sh
./uninstall-history-server.sh


