Execute Advanced Queries on Large Datasets in Your Amazon DocumentDB Clusters Using Apache Spark on Amazon EMR

In this article, we guide you through configuring Amazon EMR to run advanced queries on large datasets in your Amazon DocumentDB (with MongoDB compatibility) clusters using Apache Spark. Amazon DocumentDB is a fully managed, native JSON document database that makes it straightforward to operate critical document workloads at virtually any scale without managing infrastructure. You can keep using the application code you wrote with MongoDB API (versions 3.6, 4.0, and 5.0) compatible drivers and tools to run, manage, and scale workloads on Amazon DocumentDB. As a document database, Amazon DocumentDB makes it simple to store, query, and index JSON data.

Apache Spark is an open-source distributed processing system optimized for big data workloads. It uses in-memory caching and optimized query execution for fast analytics on datasets of any size, and it provides a framework for programming clusters with implicit data parallelism and fault tolerance.

Amazon EMR is a cloud big data solution for petabyte-scale data processing, interactive analytics, and machine learning (ML) using open-source frameworks such as Apache Spark, Apache Hive, and Presto.

Depending on your cluster’s usage, Amazon DocumentDB storage can grow up to 128 TiB for instance-based clusters and up to 4 PiB for elastic clusters. In many scenarios, you need to run analytics on this large volume of data stored across one or more Amazon DocumentDB clusters, often with ad hoc query patterns or scans over large amounts of data. Apache Spark is well suited to running ad hoc queries on such data, especially when they require memory-intensive filtering on non-indexed fields, multiple processing stages, fault tolerance between stages, or joins across different data sources. Because Amazon DocumentDB is compatible with the MongoDB API, you can use the MongoDB Spark connector for these data jobs.

Use Cases for Apache Spark with Amazon DocumentDB

Here are a few examples of using Apache Spark to process data jobs against Amazon DocumentDB:

  • Data analytics teams seeking to enrich ad hoc reports with unstructured, unindexed data stored in Amazon DocumentDB without duplicating the data elsewhere.
  • Data science teams aiming to run complex data science and machine learning (ML) pipelines with Apache Spark and write the results to other storage.
  • Data analytics teams that want to join multiple data sources, such as real-time data from Amazon DocumentDB and historical data from Amazon Simple Storage Service (Amazon S3), to create unified reports.

All these jobs share common characteristics:

  • They require computation over large datasets that exceed the memory capacity of a single machine.
  • They require ad hoc access patterns that are not always supported by the indexes created in Amazon DocumentDB for application use.
  • To complete within acceptable run-time limits, they must run in a distributed environment with sufficient compute resources and a reliable failure recovery mechanism.

Solution Overview

In the following sections, we will outline how to create an Amazon DocumentDB cluster, load data, and set up an EMR cluster with Apache Spark. Once the solution is established, you can execute a sample Spark application and run a sample query.

Prerequisites

We assume you have the following prerequisites:

  • Basic knowledge of operating and developing applications with Amazon DocumentDB and Amazon EMR.
  • The requisite permissions to create, delete, and modify the services and resources we will use in your AWS account.
  • An Amazon EC2 key pair.

This solution involves setting up and utilizing AWS resources, which will incur costs on your account. For more details, refer to AWS Pricing. It is advisable to implement this in a non-production environment and conduct comprehensive validations before deploying it in a production setting.

Create an Amazon DocumentDB Cluster

You can either use an existing instance-based cluster or create a new Amazon DocumentDB instance-based cluster. Alternatively, you can opt for Amazon DocumentDB elastic clusters that dynamically scale to accommodate millions of reads and writes per second with petabytes of storage.
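
If you create a new instance-based cluster, you can do so on the Amazon DocumentDB console or with the AWS CLI. The following is a minimal sketch using the AWS CLI; the cluster identifier, credentials, and instance class are placeholder values to replace with your own:

aws docdb create-db-cluster \
    --db-cluster-identifier sample-docdb-cluster \
    --engine docdb \
    --master-username <<master_username>> \
    --master-user-password <<master_password>>

aws docdb create-db-instance \
    --db-instance-identifier sample-docdb-instance \
    --db-instance-class db.r6g.large \
    --engine docdb \
    --db-cluster-identifier sample-docdb-cluster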

Load Data into the Amazon DocumentDB Cluster

To load data into the Amazon DocumentDB cluster, complete the following steps from an Amazon Elastic Compute Cloud (Amazon EC2) instance that can connect to Amazon DocumentDB and has the mongorestore utility installed (the sample data was generated using mongodump version 100.9.4):

  1. Download the Amazon DocumentDB Certificate Authority (CA) certificate required for authentication to your cluster:
    wget https://truststore.pki.rds.amazonaws.com/global/global-bundle.pem
  2. Download the data.gz file from GitHub. This file contains mock data generated by running NoSQLBench against Amazon DocumentDB.
  3. Execute the following mongorestore command to load data into your Amazon DocumentDB cluster, which will import 1 million documents:
    mongorestore --gzip --uri="<<documentdb_uri>>" --nsInclude="test_emr_db.test_emr_coll" --drop --numInsertionWorkersPerCollection=10 --archive=data.gz

You should see a message indicating successful execution.
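
In the preceding command, <<documentdb_uri>> is your cluster’s connection string. For a TLS-enabled cluster, it typically takes a form like the following (the user name, password, and cluster endpoint are placeholders for your own values):

mongodb://<<username>>:<<password>>@<<cluster_endpoint>>:27017/?tls=true&tlsCAFile=global-bundle.pem&replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false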

Sample Document Snippet

{
  "_id": "18566647",
  "user_id": "3dcc6cbb-4864-493b-a8cb-f35b06c09a0c",
  "created_on": 1414491227,
  "gender": "F",
  "full_name": "Anthony Gammon",
  "married": true,
  "address": {
      "primary": {
          "city": "Port Gamble",
          "cc": "VC"
      },
      "secondary": {}
  }
}

Create IAM Roles for EMR

If this is your first time launching an EMR cluster, you will need to create the default IAM service roles required to do so using the following AWS CLI command:
aws emr create-default-roles

This command generates two IAM roles:

  • EMR_DefaultRole
  • EMR_EC2_DefaultRole
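
If the roles already exist in your account, the command returns an error that you can safely ignore. As a quick check (not required for the walkthrough), you can confirm the roles exist with the following commands:

aws iam get-role --role-name EMR_DefaultRole
aws iam get-role --role-name EMR_EC2_DefaultRole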

Create an EMR Cluster with Apache Spark

This section details the steps for creating and configuring an EMR cluster with Apache Spark.

Select the MongoDB Spark Connector

The MongoDB Spark connector version 10.2.x uses the $collStats operator to create default read partitions. Because Amazon DocumentDB does not currently support this operator, use MongoDB Spark connector version 3.0.2 instead. Although you can write your own partitioner class for a newer version of the Spark connector, that topic is beyond the scope of this article.
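
When you later submit your Spark job, you can pull in this connector version through its Maven coordinates. The following spark-submit sketch assumes a Java or Scala application; the main class and application JAR are placeholders:

spark-submit \
    --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.2 \
    --class <<main_class>> \
    <<application_jar>>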

Choose the Apache Spark Version

The MongoDB Spark connector version 3.0.2 supports Spark 3.1.x, so you must select an Amazon EMR release that includes a Spark 3.1.x version. Amazon EMR 6.5.0 includes Spark 3.1.2, as shown in the chart in Application versions in Amazon EMR 6.x releases.

Prepare and Upload the Bootstrap Script

To connect to a TLS-enabled Amazon DocumentDB cluster from a Java Spark application on an EMR cluster, the Spark driver and executors must trust the Amazon DocumentDB CA certificates.
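
A common way to do this is a bootstrap action that builds a Java truststore containing those certificates on every cluster node. The following is a minimal sketch of such a script, based on the truststore approach described in the Amazon DocumentDB documentation; the truststore location and password are placeholders:

#!/bin/bash
# Build a JKS truststore containing the Amazon DocumentDB CA certificates
# so the JVM-based Spark driver and executors can open TLS connections.
mydir=/tmp/certs
truststore=${mydir}/rds-truststore.jks
storepassword=<<truststore_password>>

mkdir -p ${mydir}
cd ${mydir}

# Download the CA bundle
wget https://truststore.pki.rds.amazonaws.com/global/global-bundle.pem

# Split the bundle into individual certificates
awk 'split_after == 1 {n++; split_after = 0}
     /-----END CERTIFICATE-----/ {split_after = 1}
     {print > ("rds-ca-" n ".pem")}' < global-bundle.pem

# Import each certificate into the truststore under its CN alias
for CERT in rds-ca-*.pem; do
    alias=$(openssl x509 -noout -text -in $CERT | perl -ne 'next unless /Subject:/; s/.*(CN=|CN = )//; print')
    keytool -import -file ${CERT} -alias "${alias}" -storepass ${storepassword} -keystore ${truststore} -noprompt
done

Upload the script to an Amazon S3 bucket and reference it as a bootstrap action when you create the EMR cluster, for example (the bucket, key pair, subnet, and instance settings are placeholders):

aws emr create-cluster \
    --name "spark-docdb-cluster" \
    --release-label emr-6.5.0 \
    --applications Name=Spark \
    --use-default-roles \
    --ec2-attributes KeyName=<<keypair_name>>,SubnetId=<<subnet_id>> \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --bootstrap-actions Path=s3://<<bucket_name>>/bootstrap.sh

Your Spark job can then point the JVM at the truststore by setting -Djavax.net.ssl.trustStore=/tmp/certs/rds-truststore.jks and -Djavax.net.ssl.trustStorePassword=<<truststore_password>> through the spark.driver.extraJavaOptions and spark.executor.extraJavaOptions configuration properties.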
