Stream Data from Amazon DocumentDB to Amazon Kinesis Data Firehose Using AWS Lambda

on 07 JUL 2023

in Advanced (300), Amazon Data Firehose, Amazon DocumentDB, Technical How-to

In this post, we walk through setting up data pipelines from Amazon DocumentDB (with MongoDB compatibility) to Amazon Kinesis Data Firehose, so that changes to your data are published to your target storage. Amazon DocumentDB (with MongoDB compatibility) is a fully managed, highly durable database service for running mission-critical JSON workloads at enterprise scale. It simplifies your architecture with built-in security best practices, continuous backups, and native integrations with other AWS services.

Amazon Kinesis is a fully managed service that processes and analyzes streaming data at any scale. With Kinesis, you can ingest real-time data such as video, audio, application logs, website clickstreams, and Internet of Things (IoT) telemetry for machine learning (ML), analytics, and other applications. Amazon Kinesis Data Firehose is a streaming extract, transform, and load (ETL) service that reliably captures, transforms, and delivers streaming data to data lakes, data stores, and analytics services.

Solution Overview

Kinesis Data Firehose enables you to load streaming data from Amazon DocumentDB into various data stores and analytics tools. It captures, transforms, and loads streaming data into Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon OpenSearch Service, and Splunk, facilitating near-real-time analytics with existing business intelligence (BI) tools and dashboards. The same change stream can be accessed by multiple consumers without disruption. The following diagram illustrates this architecture.

For this walkthrough, we focus on continually archiving data from Amazon DocumentDB to Amazon S3 using Kinesis Data Firehose, with minimal performance impact and a low-code approach. The architecture for this use case is straightforward to deploy: we capture all incoming changes as they occur and archive them in Amazon S3. The integration is built on AWS Lambda event source mappings (ESM). In short, the Lambda ESM reads change stream events from Amazon DocumentDB and invokes your custom Lambda function with them; the function processes the change stream event data and sends it to Firehose, which in turn writes it to Amazon S3. The consumer of this data, now stored in an S3 bucket, must build intelligence into their application to deduplicate and merge records that belong to the same logical unit in the source database. This flexibility can be valuable because it lets you see how a specific entity has evolved over time.

However, you may need to format the data before delivering it to your target. Kinesis Data Firehose offers built-in data format conversion from raw or JSON data into formats like Apache Parquet and Apache ORC as required by your destination data stores, eliminating the need to construct your own data processing pipelines. Additionally, if you are leveraging the data stored in Amazon S3 as a source for your Amazon QuickSight reports (enabled by Amazon Athena), flattening the structure for simpler querying might be advantageous.

In this post, we are not transforming the records or altering the format; we are merely archiving all changes from Amazon DocumentDB to Amazon S3 via Kinesis Data Firehose.
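
The piece that connects the cluster to the function is a standard Lambda event source mapping. The following AWS CLI call is a minimal sketch of that wiring; the function name, cluster ARN, secret ARN, and database and collection names are placeholders to replace with your own values.

    # Create the event source mapping that reads the DocumentDB change stream
    # and invokes the Lambda function with batches of change events
    aws lambda create-event-source-mapping \
        --function-name docdb-to-firehose-function \
        --event-source-arn arn:aws:rds:us-east-1:111122223333:cluster:docdb-change-stream-cluster \
        --starting-position LATEST \
        --source-access-configurations '[{"Type":"BASIC_AUTH","URI":"arn:aws:secretsmanager:us-east-1:111122223333:secret:docdb-credentials"}]' \
        --document-db-event-source-config '{"DatabaseName":"docdbdemo","CollectionName":"products","FullDocument":"UpdateLookup"}'

Setting FullDocument to UpdateLookup tells the event source mapping to include the full current version of a modified document in each update event, rather than only the changed fields.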

Prerequisites

To follow along with this post, you will need to set up the following resources:

  • Create an AWS Cloud9 environment.
  • Create an Amazon DocumentDB cluster.
  • Install the mongo shell in Cloud9.
  • Enable change stream on the Amazon DocumentDB cluster.
  • Create a secret in AWS Secrets Manager for Lambda to connect to Amazon DocumentDB.
  • Create a customer managed permission policy for the Lambda function.
  • Create a VPC endpoint for AWS Lambda.
  • Create a VPC endpoint for Secrets Manager.
  • Create an S3 bucket.
  • Create a Firehose delivery stream that delivers to the S3 bucket (see the CLI sketch after the note below).

Note: These resources will incur costs associated with their creation and use. Please refer to the pricing page for further details.
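
The last two prerequisites can also be created from the AWS CLI. The following is a minimal sketch; the bucket name, delivery stream name, and IAM role ARN are placeholders, and the role must already allow Firehose to write to the bucket.

    # Create the destination bucket (add a LocationConstraint outside us-east-1)
    aws s3api create-bucket --bucket my-docdb-archive-bucket --region us-east-1

    # Create a DirectPut delivery stream that writes to the bucket
    aws firehose create-delivery-stream \
        --delivery-stream-name docdb-archive-stream \
        --delivery-stream-type DirectPut \
        --extended-s3-destination-configuration '{"RoleARN":"arn:aws:iam::111122223333:role/firehose-s3-delivery-role","BucketARN":"arn:aws:s3:::my-docdb-archive-bucket"}'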

Creating the AWS Cloud9 Environment

  1. Open the Cloud9 console and select “Create environment.”
  2. Configure the environment as follows:
    • Under Details:
      • Name – DocumentDBCloud9Environment
      • Environment type – New EC2 instance
    • Under New EC2 instance:
      • Instance type – t2.micro (1 GiB RAM + 1 vCPU)
      • Platform – Amazon Linux 2
      • Timeout – 30 minutes
    • Under Network settings:
      • Connection – AWS Systems Manager (SSM)
      • Expand the VPC settings dropdown.
      • Amazon VPC – Select your default VPC.
      • Subnet – No preference
    • Keep all other default settings.
  3. Click “Create.” Provisioning your AWS Cloud9 environment may take several minutes.
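
If you prefer the AWS CLI, the same environment can be created with a single call. This is a sketch of the settings above; the subnet ID is a placeholder for a subnet in your default VPC, and when you create an SSM-connected environment from the CLI, the AWSCloud9SSMAccessRole service role and AWSCloud9SSMInstanceProfile must already exist (the console creates them for you).

    aws cloud9 create-environment-ec2 \
        --name DocumentDBCloud9Environment \
        --instance-type t2.micro \
        --image-id amazonlinux-2-x86_64 \
        --connection-type CONNECT_SSM \
        --automatic-stop-time-minutes 30 \
        --subnet-id subnet-0123456789abcdef0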

Adding Inbound Rules for Cloud9 Environment to Default Security Group

  1. Open the EC2 console. Under Network and Security, select Security groups.
  2. Choose the default security group ID.
  3. Under Inbound Rules, select “Edit inbound rules.”
  4. Click “Add rule.” Create a rule with the following configuration:
    • Type – Custom TCP
    • Port range – 27017
    • Source – Custom
    • In the search box next to Source, select the security group for the AWS Cloud9 environment you created earlier. To see the available security groups, enter “cloud9” in the search box and select the group whose name begins with aws-cloud9-.
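
The same rule can be added with the AWS CLI. In the sketch below, the two security group IDs are placeholders: the first is the default security group, and the second is the Cloud9 environment's aws-cloud9-* group.

    aws ec2 authorize-security-group-ingress \
        --group-id sg-0123456789abcdef0 \
        --protocol tcp \
        --port 27017 \
        --source-group sg-0fedcba9876543210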

Creating an Amazon DocumentDB Cluster

  1. Open the Amazon DocumentDB console. Under Clusters, select “Create.”
  2. Create a cluster with the following configuration:
    • For Cluster type, select Instance Based Cluster.
    • Under Configuration:
      • Engine version – 4.0.0
      • Instance class – db.t3.medium
      • Number of instances – 1.
    • Under Authentication:
      • Enter the username and password needed to connect to your cluster (these are the same credentials you will use to create the secret in AWS Secrets Manager). Confirm your password.
      • Select “Show advanced settings.”
    • Under Network settings:
      • Amazon VPC – Select your default VPC.
      • Subnet group – default
      • VPC security groups – default
    • Keep all other default settings.
  3. Click “Create cluster.” Provisioning your DocumentDB cluster may take several minutes.
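
For reference, a minimal AWS CLI sketch of the same cluster and instance follows; the identifiers, credentials, and security group ID are placeholders to replace with your own values.

    # Create the cluster (use the same credentials as in the Authentication step)
    aws docdb create-db-cluster \
        --db-cluster-identifier docdb-change-stream-cluster \
        --engine docdb \
        --engine-version 4.0.0 \
        --master-username docdbadmin \
        --master-user-password ChangeMe1234 \
        --vpc-security-group-ids sg-0123456789abcdef0

    # Add a single db.t3.medium instance to the cluster
    aws docdb create-db-instance \
        --db-instance-identifier docdb-change-stream-instance \
        --db-instance-class db.t3.medium \
        --engine docdb \
        --db-cluster-identifier docdb-change-stream-cluster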

Installing the Mongo Shell

To install the mongo shell on your Cloud9 environment:

  1. Open the Cloud9 console. Next to the DocumentDBCloud9Environment environment you created previously, click “Open” under the Cloud9 IDE column.
  2. Open a terminal window and create the MongoDB repository file with the following command:
    echo -e "[mongodb-org-4.0] \nname=MongoDB Repository\nbaseurl=https://repo.mongodb.org/yum/amazon/2013.03/mongodb-org/4.0/x86_64/\ngpgcheck=1 \nenabled=1 \ngpgkey=https://www.mongodb.org/static/pgp/server-4.0.asc" | sudo tee /etc/yum.repos.d/mongodb-org-4.0.repo
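  3. Install the mongo shell and download the certificate bundle used for TLS connections to Amazon DocumentDB (a typical continuation of this step, using the repository configured above):
    sudo yum install -y mongodb-org-shell
    wget https://truststore.pki.rds.amazonaws.com/global/global-bundle.pem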
