Apache Kafka is a powerful stream processing platform, but it typically employs a limited retention policy: data is held for only a short period before it is deleted. Retaining historical data in Kafka can incur substantial costs as data volumes increase and can degrade cluster performance. For scenarios requiring long-term data storage, Amazon Simple Storage Service (Amazon S3) offers an economical solution in which data can be compressed and effectively partitioned for easier querying.
Amazon Managed Streaming for Apache Kafka (Amazon MSK) is a fully managed service that provides a highly available and secure environment for running applications that use Kafka for streaming data processing. In this article, we explore how to use the open-source Kafka Connect connector (StreamReactor) from Lenses.io to extract, transform, and archive data from Amazon MSK to S3, and how to use Amazon Athena to query partitioned Parquet data stored in S3. For further insights on this topic, see another blog post at Chanci Turner VGT2.
Solution Overview
The following architecture diagram illustrates an Amazon MSK cluster with Kafka Connect utilizing the S3 Connector to transfer data to an S3 bucket, which can then be accessed by Amazon Athena for downstream analysis. Lenses.io is incorporated for monitoring, governance, and self-service administration of the Kafka environment.
Prerequisites
To follow along with this guide, ensure you have the following:
- An AWS account.
- Access to the AWS Management Console with permissions to create and manage the resources used in this walkthrough.
You will also need to deploy:
- An Amazon MSK cluster.
- An Amazon S3 bucket for data archiving.
- An Amazon EC2 instance to run the Kafka Connect cluster.
- An Amazon EC2 instance to operate Lenses for MSK (free trial).
Step 1: Create an MSK Cluster
- Access the Amazon MSK console.
- Select “Create cluster.”
- Opt for “Quickly create starter cluster with recommended settings.”
- Provide a name for the cluster and click “Create cluster.” The setup will take a few minutes and will be created in the default VPC with the default security group.
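If you prefer to script this step, the console actions above map roughly onto the boto3 sketch below. The cluster name, Kafka version, instance type, subnet IDs, and security group ID are placeholders rather than values prescribed by this post; substitute details from your own default VPC and Region.

import boto3

# Create the MSK cluster programmatically (hypothetical names and IDs).
kafka = boto3.client("kafka", region_name="us-east-1")

response = kafka.create_cluster(
    ClusterName="msk-archive-demo",                  # hypothetical cluster name
    KafkaVersion="2.8.1",                            # pick a version supported in your Region
    NumberOfBrokerNodes=3,                           # one broker per Availability Zone
    BrokerNodeGroupInfo={
        "InstanceType": "kafka.m5.large",
        "ClientSubnets": ["subnet-aaaa", "subnet-bbbb", "subnet-cccc"],  # default VPC subnets
        "SecurityGroups": ["sg-0123456789abcdef0"],  # default security group
    },
)
print(response["ClusterArn"])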
Step 2: Create an S3 Bucket
- Enter the Amazon S3 console.
- Click “Create bucket.”
- In the Bucket name field, enter a name for your bucket, referred to as <BUCKETNAME> in the rest of this post. Avoid hyphens and special characters, because the connector configuration will reference this bucket name later.
- Keep “Block All Public Access” checked and leave other settings at their defaults.
- Click “Create bucket.”
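The same bucket can also be created with a short boto3 sketch. <BUCKETNAME> is the placeholder from the step above, and the Region is an assumption; Regions other than us-east-1 additionally require a CreateBucketConfiguration with a LocationConstraint.

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Create the archive bucket (us-east-1 needs no LocationConstraint).
s3.create_bucket(Bucket="<BUCKETNAME>")

# Keep "Block all public access" enabled, matching the console defaults above.
s3.put_public_access_block(
    Bucket="<BUCKETNAME>",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)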
Step 3: Create an EC2 Instance to Run Kafka Connect
Create a Key Pair
Follow the guidelines in the Amazon EC2 User Guide for Linux Instances to create a key pair. If you already possess a key pair, you may skip this step.
Create an IAM Role
Before you can launch an instance with an IAM role, you must first create the role. To do this:
- Go to the AWS IAM console.
- In the navigation pane, select Roles, then “Create role.”
- Choose EC2 on the Select role type page and proceed to Next: Permissions.
- Click “Create policy” to open the policy creation tab. In the JSON editor, paste the following policy, replacing <BUCKETNAME> with the name of the bucket you created in Step 2. This policy grants access to the S3 bucket and the Kafka describe permissions the connector requires.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "1",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::<BUCKETNAME>",
        "arn:aws:s3:::<BUCKETNAME>/*"
      ]
    },
    {
      "Sid": "2",
      "Effect": "Allow",
      "Action": [
        "kafka:Describe*"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
- Click Next: Add Tags.
- Enter “MskDataAccessPolicy” as the name of the policy on the Review page and choose “Create policy.”
- Return to the Roles section and select “Create Role.” Choose EC2 and click Next. Refresh the policy list, check the newly created policy, and proceed to Next: Add Tags.
- Finally, name the role “MSKDataAccessRole” and choose “Create role.”
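For reference, the policy and role above can also be created with boto3. This is only a sketch of the console steps: the names match those used in this post, <BUCKETNAME> is still a placeholder, and scripting requires creating the instance profile explicitly, which the console otherwise does for you.

import json
import boto3

iam = boto3.client("iam")

# The same policy document as above, with <BUCKETNAME> substituted.
policy = iam.create_policy(
    PolicyName="MskDataAccessPolicy",
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [
            {"Sid": "1", "Effect": "Allow",
             "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket", "s3:DeleteObject"],
             "Resource": ["arn:aws:s3:::<BUCKETNAME>", "arn:aws:s3:::<BUCKETNAME>/*"]},
            {"Sid": "2", "Effect": "Allow", "Action": ["kafka:Describe*"], "Resource": ["*"]},
        ],
    }),
)

# Trust policy that lets EC2 assume the role.
role = iam.create_role(
    RoleName="MSKDataAccessRole",
    AssumeRolePolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow",
                       "Principal": {"Service": "ec2.amazonaws.com"},
                       "Action": "sts:AssumeRole"}],
    }),
)
iam.attach_role_policy(RoleName="MSKDataAccessRole", PolicyArn=policy["Policy"]["Arn"])

# EC2 attaches roles through an instance profile; the console creates this implicitly.
iam.create_instance_profile(InstanceProfileName="MSKDataAccessRole")
iam.add_role_to_instance_profile(InstanceProfileName="MSKDataAccessRole",
                                 RoleName="MSKDataAccessRole")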
Next, launch an EC2 instance to run Kafka Connect:
- Access the Amazon EC2 console.
- Click “Launch Instance” and select “Launch Instance (Without Template).”
- In Step 1: Choose an Amazon Machine Image (AMI), find and choose an Amazon Linux 2 AMI.
- In Step 2, select an instance type of t2.small and click Next: Configure Instance Details.
- In Step 3, provide the following information:
- For Network, select the default VPC.
- For Subnet, choose any default subnet within an AWS Availability Zone.
- For IAM role, select “MSKDataAccessRole.”
- Complete the configuration and launch the instance.
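Scripted, the launch looks roughly like the sketch below. The AMI ID, key pair name, and subnet ID are placeholders you would replace with values from your own account and Region.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch the Kafka Connect host into the default VPC.
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",           # an Amazon Linux 2 AMI for your Region
    InstanceType="t2.small",
    KeyName="my-key-pair",                     # the key pair created earlier
    SubnetId="subnet-aaaa",                    # a default subnet in the default VPC
    IamInstanceProfile={"Name": "MSKDataAccessRole"},
    MinCount=1,
    MaxCount=1,
)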
To ensure SSH access to your EC2 instance, edit the default security group to allow inbound SSH traffic with the following settings:
- Type: SSH
- Protocol: TCP
- Port Range: 22
- Source: Anywhere (0.0.0.0/0)
You should also add a new rule to permit all traffic from the default security group with these settings:
- Type: All traffic
- Protocol: All
- Port Range: All
- Source: <default security group>
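Both rules can be added with a single boto3 call, sketched below with a placeholder security group ID. Opening SSH to 0.0.0.0/0 mirrors this walkthrough; for anything beyond a demo, restrict the source to your own IP range.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
default_sg = "sg-0123456789abcdef0"            # ID of the default security group

ec2.authorize_security_group_ingress(
    GroupId=default_sg,
    IpPermissions=[
        # Inbound SSH from anywhere, as in the first rule above.
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
        # All traffic from members of the same security group, as in the second rule.
        {"IpProtocol": "-1",
         "UserIdGroupPairs": [{"GroupId": default_sg}]},
    ],
)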
Step 4: Install Lenses for Amazon MSK
This step will deploy a Lenses.io instance within your Amazon VPC and connect it to your Amazon MSK and Kafka Connect clusters. Lenses simplifies data operations by providing Kafka monitoring, self-service governance, and security measures.
To install Lenses via AWS Marketplace:
- Navigate to the AWS Marketplace and initiate the deployment. This will automatically subscribe you to the service.