Execute Advanced Queries on Extensive Datasets in Your Amazon DocumentDB Clusters Using Apache Spark on Amazon EMR


In this article, we illustrate how to configure Amazon EMR to execute complex queries on vast datasets housed in your Amazon DocumentDB (with MongoDB compatibility) clusters using Apache Spark. Amazon DocumentDB is a fully managed, native JSON document database that simplifies the operation of critical document workloads at virtually any scale without the need for infrastructure management. You can use application code written with MongoDB API (versions 3.6, 4.0, and 5.0) compatible drivers and tools to manage and scale workloads on Amazon DocumentDB without the burden of managing the underlying infrastructure. As a document database, Amazon DocumentDB makes it easy to store, query, and index JSON data.

Apache Spark is an open-source distributed processing system tailored for big data workloads. It employs in-memory caching and optimized query execution to perform rapid analytic queries on datasets of any size. Spark provides a programming interface for clusters that supports implicit data parallelism and fault tolerance.

Amazon EMR stands out as the leading cloud big data solution for petabyte-scale data processing, interactive analytics, and machine learning (ML), utilizing open-source frameworks like Apache Spark, Apache Hive, and Presto.

Depending on your cluster’s usage, your Amazon DocumentDB storage can expand up to 128 TiB. With Amazon DocumentDB elastic clusters, storage can scale up to 4 PiB. You may need to run analytical workloads on this extensive data stored in one or more Amazon DocumentDB clusters, often necessitating ad hoc queries or scans of large data volumes. Apache Spark is an excellent option for executing such queries on massive datasets, particularly when they involve memory-intensive filtering of non-indexed fields, multiple processing stages, fault tolerance, and joining various data sources. Since Amazon DocumentDB is compatible with the MongoDB API, you can execute these data jobs using the MongoDB Spark connector.

Use Cases for Apache Spark with Amazon DocumentDB

Here are some scenarios where Apache Spark can be employed to run data jobs against Amazon DocumentDB:

  • Data analytics teams may want to enhance ad-hoc reports with unstructured, unindexed data stored in DocumentDB without needing to replicate the data to a different storage system.
  • Data science teams might require complex data science/machine learning pipelines using Apache Spark, subsequently saving the results to downstream storage.
  • Data analytics teams could combine various data sources, such as live data from Amazon DocumentDB and archived data from Amazon Simple Storage Service (Amazon S3), to create comprehensive reports.

All these tasks share some common traits:

  • They necessitate computation over large volumes of data that cannot fit into the memory of a single machine.
  • They demand ad-hoc access patterns that may not always align with the indices established in Amazon DocumentDB for application use.
  • To run successfully within acceptable time frames, these tasks must be executed in a distributed environment equipped with sufficient compute capacity and recovery mechanisms.

Solution Overview

In the subsequent sections, we will guide you through creating an Amazon DocumentDB cluster, loading data, and configuring an EMR cluster with Apache Spark. Once the setup is complete, you can run a sample Spark application and execute a sample query.

Prerequisites

We assume you possess the following prerequisites:

  • Basic knowledge of operating and developing applications with Amazon DocumentDB and Amazon EMR.
  • The necessary permissions to create, delete, and modify the services and resources we will utilize in your AWS account.
  • An AWS Keypair.

This solution entails setting up and utilizing AWS resources, which will incur costs in your account. For further details, refer to AWS Pricing. We highly recommend setting this up in a non-production instance and conducting end-to-end validations before deploying this solution in a production environment.

Create an Amazon DocumentDB Cluster

You can use an existing instance-based cluster or create a new Amazon DocumentDB instance-based cluster. Additionally, Amazon DocumentDB elastic clusters can elastically scale to manage millions of reads and writes per second with petabytes of storage.

Load Data into the Amazon DocumentDB Cluster

To load data into the Amazon DocumentDB cluster, complete the following steps from an Amazon Elastic Compute Cloud (Amazon EC2) instance that can connect to Amazon DocumentDB and has the mongorestore utility installed (this data was generated using mongodump version 100.9.4):

  1. Download the Amazon DocumentDB Certificate Authority (CA) certificate required to authenticate to your cluster:

     wget https://truststore.pki.rds.amazonaws.com/global/global-bundle.pem

  2. Download the data.gz file from GitHub, which contains mock data generated using the steps to run NoSQLBench against Amazon DocumentDB.
  3. Run the mongorestore command to load the data into your Amazon DocumentDB cluster. This command loads 1 million documents:

     mongorestore --gzip --uri="<<documentdb_uri>>" --nsInclude="test_emr_db.test_emr_coll" --drop --numInsertionWorkersPerCollection=10 --archive=data.gz
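The `<<documentdb_uri>>` placeholder follows the standard Amazon DocumentDB connection string format. As a sketch of how to assemble it (the cluster endpoint, username, and password below are hypothetical placeholders you must replace with your own):

```shell
# Assemble an Amazon DocumentDB connection URI.
# All values below are hypothetical placeholders -- substitute your own
# cluster endpoint, username, and password.
DOCDB_HOST="mycluster.cluster-abc123xyz.us-east-1.docdb.amazonaws.com:27017"
DOCDB_USER="myuser"
DOCDB_PASS="mypassword"

# tlsCAFile points at the CA bundle downloaded in step 1.
# retryWrites=false is required because Amazon DocumentDB does not
# support retryable writes.
DOCDB_URI="mongodb://${DOCDB_USER}:${DOCDB_PASS}@${DOCDB_HOST}/?tls=true&tlsCAFile=global-bundle.pem&replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false"
echo "${DOCDB_URI}"
```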

If this command runs successfully, mongorestore prints a summary reporting the number of documents restored.

Sample Document Snippet

{
    "_id" : "18566647",
    "user_id" : "3dcc6cbb-4864-493b-a8cb-f35b06c09a0c",
    "created_on" : 1414491227,
    "gender" : "F",
    "full_name" : "Anthony Gammon",
    "married" : true,
    "address" : {
        "primary" : {
            "city" : "Port Gamble",
            "cc" : "VC"
        },
        "secondary" : {}
    }
}

Create IAM Roles for EMR

If you’re launching an EMR cluster for the first time, create the default IAM service roles required to launch it with the following AWS CLI command:

aws emr create-default-roles

This command creates the following two IAM roles:

  • EMR_DefaultRole
  • EMR_EC2_DefaultRole
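To confirm that both roles now exist in your account, you can query them by name (a quick sanity check; these are the default role names the CLI creates):

```shell
# Verify that the default EMR service roles were created.
aws iam get-role --role-name EMR_DefaultRole --query 'Role.RoleName' --output text
aws iam get-role --role-name EMR_EC2_DefaultRole --query 'Role.RoleName' --output text
```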

Create an EMR Cluster with Apache Spark

In this section, we will detail the steps to create and configure an EMR cluster with Apache Spark.

Choose the MongoDB Spark Connector

The MongoDB Spark connector version 10.2.x employs the $collStats operator to create default read partitions. Since Amazon DocumentDB does not support this operator as of this writing, you must utilize the MongoDB Spark connector version 3.0.2. You can also develop your custom partitioner class with a higher version of the Spark connector, but that topic goes beyond the scope of this article.
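When you submit a Spark job, you can pull in this connector version with the `--packages` flag rather than bundling it in your application JAR. A sketch of the submission (the application JAR name and main class are hypothetical placeholders):

```shell
# Submit a Spark job with the MongoDB Spark connector 3.0.2 resolved
# from Maven Central at launch time.
spark-submit \
  --master yarn \
  --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.2 \
  --class com.example.DocumentDBSparkJob \
  my-spark-app.jar
```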

Choose the Apache Spark Version

The MongoDB Spark connector version 3.0.2 is compatible with Spark 3.1.x, so select an Amazon EMR release that ships this version of Spark. According to the application versions chart for Amazon EMR 6.x releases, emr-6.5.0 is the latest release that comes installed with Spark 3.1.2.

Prepare and Upload the Bootstrap Script

When connecting to a TLS-enabled Amazon DocumentDB cluster from a Spark application on an EMR cluster, the JVM on every node must trust the Amazon DocumentDB Certificate Authority. A bootstrap action, which runs on each node during cluster provisioning, is a convenient place to build this truststore; upload the script to an S3 bucket your EMR cluster can access.
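The following is a sketch of such a bootstrap script, under the assumption that the truststore is built with keytool from the RDS CA bundle; the truststore path and password are placeholder choices you should change:

```shell
#!/bin/bash
# Bootstrap action: build a Java truststore containing the Amazon
# DocumentDB (RDS) Certificate Authority certificates on every node.
# TRUSTSTORE and STOREPASS below are placeholder values.
set -euo pipefail

TRUSTSTORE=/home/hadoop/certs/rds-truststore.jks
STOREPASS=changeit

mkdir -p "$(dirname "${TRUSTSTORE}")"
cd "$(dirname "${TRUSTSTORE}")"

# Download the combined CA bundle.
wget -q https://truststore.pki.rds.amazonaws.com/global/global-bundle.pem

# Split the bundle into individual PEM certificates.
awk 'split_after == 1 {n++; split_after=0}
     /-----END CERTIFICATE-----/ {split_after=1}
     {print > ("rds-ca-" n ".pem")}' < global-bundle.pem

# Import each certificate into the truststore under a unique alias.
for CERT in rds-ca-*.pem; do
  ALIAS=$(openssl x509 -noout -subject -in "${CERT}" | sed 's/.*CN=//; s/[ ,].*//')
  keytool -import -file "${CERT}" -alias "${CERT}-${ALIAS}" \
          -keystore "${TRUSTSTORE}" -storepass "${STOREPASS}" -noprompt
  rm "${CERT}"
done
```

Your Spark driver and executors can then be pointed at this truststore through the `javax.net.ssl.trustStore` and `javax.net.ssl.trustStorePassword` JVM system properties.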

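With the bootstrap script in S3, the EMR cluster can be launched from the CLI. A hedged sketch: the cluster name, key pair, subnet, instance sizes, and bootstrap script location below are placeholders you must replace:

```shell
# Launch an EMR cluster with Spark and the DocumentDB truststore
# bootstrap action. All resource identifiers are placeholders.
aws emr create-cluster \
  --name "docdb-spark-cluster" \
  --release-label emr-6.5.0 \
  --applications Name=Spark \
  --ec2-attributes KeyName=my-keypair,SubnetId=subnet-0123456789abcdef0 \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --bootstrap-actions Path=s3://my-bucket/docdb-truststore-bootstrap.sh \
  --use-default-roles
```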

