Customize Your Libraries and Application Dependencies in Spark and Hive on Amazon EMR Serverless with Custom Images

Amazon EMR Serverless provides a platform for running open-source big data frameworks like Apache Spark and Apache Hive without the need to manage clusters and servers. Many users of Spark and Hive wish to incorporate their own libraries and dependencies into the application runtime. For instance, you may want to add popular open-source extensions to Spark or implement a custom encryption-decryption module required by your application.

We are thrilled to introduce a new feature that allows you to tailor the runtime image utilized in EMR Serverless by incorporating custom libraries your applications require. This capability enables you to:

  • Manage a set of version-controlled libraries that can be reused and accessed across all your EMR Serverless jobs as part of the EMR Serverless runtime.
  • Integrate popular extensions to the open-source Spark and Hive frameworks, such as pandas, NumPy, matplotlib, and other libraries that your EMR Serverless application needs.
  • Employ established CI/CD processes to build, test, and deploy your customized extension libraries into the EMR Serverless runtime.
  • Implement well-defined security measures, such as image scanning, to fulfill compliance and governance requirements within your organization.
  • Utilize a different version of a runtime component (for example, the JDK runtime or the Python SDK runtime) than the default version provided with EMR Serverless.

In this article, we illustrate how to leverage this new feature.

Solution Overview

To use this functionality, build a custom image from the EMR Serverless base image and publish it to Amazon Elastic Container Registry (Amazon ECR), a fully managed container registry that simplifies sharing and deploying container images among your developers. Amazon ECR alleviates the need to manage your own container repositories or worry about scaling the underlying infrastructure. After the custom image is uploaded to the container registry, specify it when you create your EMR Serverless applications.
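
For example, after the image is pushed, you reference it when creating the application. The following AWS CLI call is a minimal sketch; the application name and image tag are illustrative placeholders:

$ aws emr-serverless create-application \
    --name spark-custom-image-app \
    --release-label emr-6.9.0 \
    --type SPARK \
    --image-configuration '{"imageUri": "<your AWS account ID>.dkr.ecr.<region>.amazonaws.com/emr-serverless-ci-examples:<tag>"}'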

The following diagram outlines the steps involved in using custom images for your EMR Serverless applications.

In subsequent sections, we demonstrate how to use custom images with Amazon EMR Serverless to address three common scenarios:

  1. Integrate popular open-source Python libraries into the EMR Serverless runtime image.
  2. Use a different or newer version of the Java runtime for the EMR Serverless application.
  3. Install a Prometheus agent and customize the Spark runtime to transmit Spark JMX metrics to Amazon Managed Service for Prometheus, and visualize these metrics on a Grafana dashboard.

General Prerequisites

Before proceeding, ensure you complete the following prerequisites to utilize custom images with EMR Serverless:

  • Create an AWS Identity and Access Management (IAM) role with permissions for Amazon EMR Serverless applications and Amazon ECR, as well as Amazon Simple Storage Service (Amazon S3) access to the aws-bigdata-blog bucket and any S3 bucket in your account where you will store application artifacts.
  • Install or update the AWS Command Line Interface (AWS CLI) to the latest version and set up the Docker service on an Amazon Linux 2-based Amazon Elastic Compute Cloud (Amazon EC2) instance. Attach the IAM role created in the previous step to this EC2 instance.
  • Choose a base EMR Serverless image from the public Amazon ECR repository. Run the following commands on the EC2 instance with Docker installed to confirm you can pull the base image from the public repository:
# Start the Docker service if it’s not already running
$ sudo service docker start 

# Confirm you can pull the latest EMR 6.9.0 runtime base image 
$ sudo docker pull public.ecr.aws/emr-serverless/spark/emr-6.9.0:latest 
  • Log in to Amazon ECR using the following commands and create a repository called emr-serverless-ci-examples, substituting your AWS account ID and Region:
$ aws ecr get-login-password --region <region> | sudo docker login --username AWS --password-stdin <your AWS account ID>.dkr.ecr.<region>.amazonaws.com

$ aws ecr create-repository --repository-name emr-serverless-ci-examples --region <region> 
  • Grant IAM permissions to the EMR Serverless service principal for the Amazon ECR repository:
    • In the Amazon ECR console, open the emr-serverless-ci-examples repository and choose Permissions in the navigation pane.
    • Choose Edit policy JSON.
    • Input the following JSON and save:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Emr Serverless Custom Image Support",
      "Effect": "Allow",
      "Principal": {
        "Service": "emr-serverless.amazonaws.com"
      },
      "Action": [
        "ecr:BatchGetImage",
        "ecr:DescribeImages",
        "ecr:GetDownloadUrlForLayer"
      ]
    }
  ]
}

Make sure the policy is updated in the Amazon ECR console. For production workloads, it’s advisable to add a condition to the Amazon ECR policy to restrict access to only authorized EMR Serverless applications. For further information, refer to how to allow EMR Serverless to access the custom image repository.
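
Alternatively, you can apply the same policy from the AWS CLI. The following sketch assumes the JSON above is saved locally as emr-serverless-ecr-policy.json (an illustrative file name):

$ aws ecr set-repository-policy \
    --repository-name emr-serverless-ci-examples \
    --policy-text file://emr-serverless-ecr-policy.json \
    --region <region>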

Next, we will create and utilize custom images for our EMR Serverless applications to address the three different use cases.

Use Case 1: Running Data Science Applications

A prevalent application of Spark on Amazon EMR is executing data science and machine learning (ML) applications at scale. For large datasets, Spark offers SparkML, which includes common ML algorithms to train models in a distributed manner. However, you often need to run many iterations of simple classifiers for hyperparameter tuning, ensembles, and multi-class solutions over small to medium-sized datasets (100,000 to 1 million records). Spark is an excellent engine for running multiple iterations of such classifiers in parallel. In this example, we use Spark to run multiple iterations of an XGBoost model to identify the best parameters. Including Python dependencies in the EMR Serverless image gives the application straightforward access to the libraries it needs (xgboost, sk-dist, pandas, numpy, and so on).
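
As an illustration of this pattern, the following minimal PySpark sketch uses the DistGridSearchCV class from sk-dist to distribute a scikit-learn-style grid search over Spark executors. The dataset, parameter grid, and scoring metric are placeholders, not the exact code used in this post:

from pyspark.sql import SparkSession
from sklearn.datasets import load_breast_cancer
from skdist.distribute.search import DistGridSearchCV
from xgboost import XGBClassifier

spark = SparkSession.builder.appName("xgboost-grid-search").getOrCreate()
sc = spark.sparkContext

# Small to medium-sized dataset; each parameter combination is fit in parallel on the executors
X, y = load_breast_cancer(return_X_y=True)
param_grid = {
    "max_depth": [3, 5, 7],
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
}

# DistGridSearchCV mirrors scikit-learn's GridSearchCV but parallelizes the fits with Spark
model = DistGridSearchCV(XGBClassifier(), param_grid, sc, cv=5, scoring="roc_auc")
model.fit(X, y)
print(model.best_params_, model.best_score_)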

Prerequisites

The EMR Serverless job runtime IAM role should be granted permissions to your S3 bucket where you will store your PySpark file and application logs:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AccessToS3Buckets",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<YOUR-BUCKET>",
                "arn:aws:s3:::<YOUR-BUCKET>/*"
            ]
        }
    ]
}
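
One way to attach this policy is with the AWS CLI. This sketch assumes the JSON above is saved locally as s3-access-policy.json and that your job runtime role is named EMRServerlessS3RuntimeRole (both names are illustrative):

$ aws iam put-role-policy \
    --role-name EMRServerlessS3RuntimeRole \
    --policy-name AccessToS3Buckets \
    --policy-document file://s3-access-policy.json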

Creating an Image to Install ML Dependencies

We will create a custom image from the base EMR Serverless image to install dependencies needed for the SparkML application. Create the following Dockerfile on your EC2 instance inside a new directory named datascience:

FROM public.ecr.aws/emr-serverless/spark/emr-6.9.0:latest

USER root

# Install Python packages
RUN pip3 install boto3 pandas numpy
RUN pip3 install -U scikit-learn==0.23.2 scipy 
RUN pip3 install sk-dist
RUN pip3 install xgboost

# EMR Serverless runs jobs as the hadoop user, so the image must switch back from root
USER hadoop:hadoop
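
With the Dockerfile in place, build the image and push it to the Amazon ECR repository you created earlier. The local image name and remote tag below are illustrative:

# Build the custom image from the datascience directory
$ sudo docker build -t local/emr-serverless-ci-ml /home/ec2-user/datascience/ --no-cache --pull

# Tag and push the image to the emr-serverless-ci-examples repository
$ sudo docker tag local/emr-serverless-ci-ml:latest <your AWS account ID>.dkr.ecr.<region>.amazonaws.com/emr-serverless-ci-examples:emr-serverless-ml
$ sudo docker push <your AWS account ID>.dkr.ecr.<region>.amazonaws.com/emr-serverless-ci-examples:emr-serverless-ml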

This approach allows you to tailor the EMR Serverless runtime image to the specific needs of your data science applications.

