Stream Data from Amazon DocumentDB to Amazon Kinesis Data Firehose via AWS Lambda

In this article, we explore how to build a data pipeline from Amazon DocumentDB (with MongoDB compatibility) to Amazon Kinesis Data Firehose so that changes are published to your target storage. Amazon DocumentDB is a robust, highly durable, fully managed database service for running critical JSON workloads in the enterprise. It simplifies your architecture by providing built-in security best practices, continuous backups, and native integrations with other AWS services.

Amazon Kinesis is a fully managed service for cost-effectively processing and analyzing streaming data at any scale. It lets you ingest real-time data such as video, audio, application logs, website clickstreams, and IoT telemetry data for machine learning (ML), analytics, and other applications. Amazon Kinesis Data Firehose is a streaming extract, transform, and load (ETL) service that reliably captures, transforms, and delivers streaming data to data lakes, data stores, and analytics tools.

Overview of the Solution

Kinesis Data Firehose can assist in loading streaming data from Amazon DocumentDB into various data stores and analytics tools. It captures, transforms, and loads streaming data into Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon OpenSearch Service, and Splunk, thereby facilitating near-real-time analytics with existing business intelligence (BI) tools and dashboards. The same change stream can be accessed by multiple consumers simultaneously without interference. The architecture diagram below illustrates this structure.

This post covers the use case of continuously archiving data from Amazon DocumentDB to Amazon S3 using Kinesis Data Firehose, with minimal performance impact and little custom code. The diagram shows the easily deployable architecture for this use case, in which all incoming changes are captured and archived in Amazon S3. The integration is handled through an AWS Lambda event source mapping (ESM). At a high level, the Lambda ESM reads change stream events from Amazon DocumentDB and passes them to your custom Lambda function. The function processes the change stream event data and forwards it to Kinesis Data Firehose, which writes it to Amazon S3. Consumers of this data, now stored in an S3 bucket, need to include logic in their applications to deduplicate and merge records that belong to the same logical unit in the source database. This flexibility lets you track how a specific entity has evolved over time, which can be extremely useful.
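
To make this concrete, the following is a minimal sketch of such a Lambda function in Python using boto3. The delivery stream name is a placeholder, and the event shape (an "events" list whose items carry the change stream document under an "event" key) should be verified against the payload your event source mapping actually delivers:

    import json
    import os

    import boto3

    # Placeholder stream name; replace with the Firehose delivery stream you created.
    FIREHOSE_STREAM = os.environ.get("FIREHOSE_STREAM_NAME", "documentdb-archive-stream")

    firehose = boto3.client("firehose")

    def lambda_handler(event, context):
        records = []
        for item in event.get("events", []):
            # Each item is assumed to wrap the change stream event under "event".
            change = item.get("event", item)
            # Firehose expects bytes; a trailing newline keeps the S3 objects
            # readable as newline-delimited JSON.
            records.append({"Data": (json.dumps(change, default=str) + "\n").encode("utf-8")})

        if records:
            # PutRecordBatch accepts up to 500 records per call, which is well
            # above the typical batch size delivered by the event source mapping.
            firehose.put_record_batch(DeliveryStreamName=FIREHOSE_STREAM, Records=records)

        return {"statusCode": 200, "recordsForwarded": len(records)}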

However, you often need to format data before delivering it to your target. Kinesis Data Firehose provides built-in data format conversion from raw or JSON data into formats such as Apache Parquet and Apache ORC that destination data stores require, so you don't have to build custom data processing pipelines. Moreover, if you use the data in Amazon S3 as a source for Amazon QuickSight reports (via Amazon Athena), you might find it helpful to flatten the structure for easier querying.

In this discussion, we will not be transforming the records or altering the format but will focus on archiving all changes from Amazon DocumentDB to Amazon S3 using Kinesis Data Firehose.
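
If you do need to reshape records before delivery, a Firehose data-transformation Lambda that flattens nested JSON might look roughly like the sketch below. The flattening helper is purely illustrative; the recordId/result/data response format is the contract Firehose expects from transformation functions:

    import base64
    import json

    def flatten(doc, parent_key="", sep="_"):
        # Recursively flatten nested objects, e.g. {"a": {"b": 1}} becomes {"a_b": 1}.
        items = {}
        for key, value in doc.items():
            new_key = f"{parent_key}{sep}{key}" if parent_key else key
            if isinstance(value, dict):
                items.update(flatten(value, new_key, sep))
            else:
                items[new_key] = value
        return items

    def lambda_handler(event, context):
        output = []
        for record in event["records"]:
            payload = json.loads(base64.b64decode(record["data"]))
            transformed = json.dumps(flatten(payload)) + "\n"
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
            })
        return {"records": output}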

Prerequisites

To follow along with this article, you will need to configure the following resources:

  1. Create an AWS Cloud9 environment.
  2. Set up an Amazon DocumentDB cluster.
  3. Install the mongo shell in Cloud9.
  4. Enable change streams on the Amazon DocumentDB cluster (see the sketch after this list).
  5. Create a secret in AWS Secrets Manager for Lambda to connect to Amazon DocumentDB.
  6. Establish a customer-managed permission policy for the Lambda function.
  7. Create a VPC endpoint for the Lambda handler.
  8. Create a VPC endpoint for Secrets Manager.
  9. Set up an S3 bucket.
  10. Create a Firehose delivery stream with the destination as the S3 bucket.

Note: Creating and using these resources will incur costs. Please refer to the pricing page for detailed information.
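
For reference, step 4 (enabling change streams) uses the modifyChangeStreams admin command, either per collection or with empty strings as wildcards for all databases and collections. A minimal sketch using pymongo is shown below; the endpoint, credentials, database, and collection names are placeholders:

    import pymongo

    # Placeholder endpoint and credentials; Amazon DocumentDB uses TLS by default,
    # so the CA bundle is passed explicitly.
    client = pymongo.MongoClient(
        host="mycluster.cluster-xxxxxxxx.us-east-1.docdb.amazonaws.com",
        port=27017,
        username="masteruser",
        password="masterpassword",
        tls=True,
        tlsCAFile="global-bundle.pem",
        retryWrites=False,
    )

    # Enable change streams for one collection; use "" for database and
    # collection to enable them for the whole cluster.
    client["admin"].command({
        "modifyChangeStreams": 1,
        "database": "docdbdemo",
        "collection": "products",
        "enable": True,
    })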

Creating the AWS Cloud9 Environment

  1. Open the Cloud9 console and select Create environment.
  2. Configure the environment with the following settings:
    Details:
    – Name – DocumentDBCloud9Environment
    – Environment type – New EC2 instance
    New EC2 instance:
    – Instance type – t2.micro (1 GiB RAM + 1 vCPU)
    – Platform – Amazon Linux 2
    – Timeout – 30 minutes
    Network settings:
    – Connection – AWS Systems Manager (SSM)
    – Expand the VPC settings dropdown.
    – Amazon Virtual Private Cloud (VPC) – Choose your default VPC.
    – Subnet – No preference
  3. Keep all other default settings and click Create. Provisioning your new AWS Cloud9 environment may take several minutes.
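
If you prefer to script this step rather than use the console, a boto3 sketch along these lines creates an equivalent environment. The environment name matches the walkthrough; the subnet ID is a placeholder for a subnet in your default VPC:

    import boto3

    cloud9 = boto3.client("cloud9")

    response = cloud9.create_environment_ec2(
        name="DocumentDBCloud9Environment",
        instanceType="t2.micro",
        imageId="amazonlinux-2-x86_64",       # Amazon Linux 2 platform
        connectionType="CONNECT_SSM",         # connect through AWS Systems Manager
        automaticStopTimeMinutes=30,          # 30-minute timeout
        subnetId="subnet-0123456789abcdef0",  # placeholder subnet in the default VPC
    )
    print(response["environmentId"])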

Adding an Inbound Rule to the Default Security Group for the Cloud9 Environment

  1. Open the EC2 console. Under Network and Security, select Security groups.
  2. Choose the default security group ID.
  3. Under Inbound Rules, select Edit inbound rules.
  4. Select Add rule and create a rule with the following configuration:
    – Type – Custom TCP
    – Port range – 27017
    – Source – Custom
    – In the search box next to Source, select the security group for the AWS Cloud9 environment you created earlier. To see the list of available security groups, enter "cloud9" in the search box, then choose the security group whose name begins with aws-cloud9-.
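
The same rule can be added programmatically. In the sketch below, both security group IDs are placeholders: one for the default security group and one for the group that Cloud9 created for your environment:

    import boto3

    ec2 = boto3.client("ec2")

    DEFAULT_SG = "sg-0aaaaaaaaaaaaaaaa"  # placeholder: default security group ID
    CLOUD9_SG = "sg-0bbbbbbbbbbbbbbbb"   # placeholder: aws-cloud9-* security group ID

    # Allow the Cloud9 environment to reach DocumentDB on port 27017.
    ec2.authorize_security_group_ingress(
        GroupId=DEFAULT_SG,
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 27017,
            "ToPort": 27017,
            "UserIdGroupPairs": [{"GroupId": CLOUD9_SG}],
        }],
    )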


Creating an Amazon DocumentDB Cluster

  1. Open the Amazon DocumentDB console. Under Clusters, select Create.
  2. Configure the cluster with the following settings:
    – For Cluster type, choose Instance Based Cluster.
    – Under Configuration:
    – Engine version – 4.0.0
    – Instance class – db.t3.medium
    – Number of instances – 1
    – Under Authentication:
    – Enter the Username and Password used to connect to your cluster, and confirm the password. Use these same credentials when you create the secret in AWS Secrets Manager.
    – Select Show advanced settings.
    – Under Network settings:
    – Virtual Private Cloud (VPC) – Choose your default VPC.
    – Subnet group – default
    – VPC security groups – default
  3. Keep all other default settings and click Create cluster. Provisioning your DocumentDB cluster may take several minutes.
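
The cluster and its instance can also be created with boto3, roughly as sketched below. The identifiers, credentials, and security group ID are placeholders; use the same username and password that go into the Secrets Manager secret:

    import boto3

    docdb = boto3.client("docdb")

    # Create the cluster (engine version 4.0.0, default VPC security group).
    docdb.create_db_cluster(
        DBClusterIdentifier="docdb-firehose-demo",
        Engine="docdb",
        EngineVersion="4.0.0",
        MasterUsername="masteruser",
        MasterUserPassword="masterpassword",
        VpcSecurityGroupIds=["sg-0aaaaaaaaaaaaaaaa"],  # placeholder security group ID
    )

    # The cluster needs at least one instance before it accepts connections.
    docdb.create_db_instance(
        DBInstanceIdentifier="docdb-firehose-demo-1",
        DBInstanceClass="db.t3.medium",
        Engine="docdb",
        DBClusterIdentifier="docdb-firehose-demo",
    )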

Installing the Mongo Shell

To install the mongo shell on your Cloud9 environment:

  1. Open the Cloud9 console. Next to the DocumentDBCloud9Environment you created earlier, select Open under the Cloud9 IDE column.
  2. Open a terminal window and create the MongoDB repository file using the command:
    echo -e "[mongodb-org-4.0]\nname=MongoDB Repository\nbaseurl=https://repo.mongodb.org/yum/amazon/2013.03/mongodb-org/4.0/x86_64/\ngpgcheck=1\nenabled=1\ngpgkey=https://www.mongodb.org/static/pgp/server-4.0.asc" | sudo tee /etc/yum.repos.d/mongodb-org-4.0.repo
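  3. Once the repository file is created, install the mongo shell (the package name below corresponds to the 4.0 repository configured above):
    sudo yum install -y mongodb-org-shell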
