Streamlining Data Streaming Ingestion for Analytics with Amazon MSK and Amazon Redshift

In late 2022, Amazon Web Services (AWS) introduced real-time streaming ingestion for Amazon Redshift from Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK). This feature removes the need to stage streaming data in Amazon Simple Storage Service (Amazon S3) before ingesting it into Amazon Redshift.

Streaming data from Amazon MSK into Amazon Redshift enables real-time data processing and analytics. Amazon MSK is a fully managed, highly scalable service for Apache Kafka that makes it straightforward to collect and process large data streams. By ingesting streaming data directly into Amazon Redshift, organizations can run real-time analytics and make data-driven decisions based on the latest insights.

This integration delivers ingestion latency measured in seconds while handling hundreds of megabytes of streaming data per second into Amazon Redshift. Because no intermediate staging in Amazon S3 is required, latency drops and additional storage costs are avoided, ensuring that the most current information is always available for analysis.

Setting up streaming ingestion involves running SQL commands on a Redshift cluster to authenticate and connect to an MSK topic. This approach is particularly useful for data engineers who want to simplify data pipelines and reduce operational costs.
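
To make this concrete, the following is a minimal sketch of those SQL commands, issued through the Redshift Data API with boto3. The cluster identifier, database, user, IAM role ARN, MSK cluster ARN, and the "customer" topic name are illustrative placeholders, and the statements assume the CREATE EXTERNAL SCHEMA ... FROM MSK syntax with IAM authentication:

```python
import boto3

# Placeholder identifiers; substitute the values from your own account.
CLUSTER_ID = "redshift-cluster"        # hypothetical Redshift cluster identifier
DATABASE = "dev"                       # target database
DB_USER = "awsuser"                    # hypothetical database user
REDSHIFT_ROLE_ARN = "arn:aws:iam::111122223333:role/redshift-role"
MSK_CLUSTER_ARN = "arn:aws:kafka:us-east-1:111122223333:cluster/custom-msk-cluster/abcd1234"

SQL_STATEMENTS = [
    # Map the MSK cluster to an external schema, authenticating with IAM.
    f"""
    CREATE EXTERNAL SCHEMA msk_external_schema
    FROM MSK
    IAM_ROLE '{REDSHIFT_ROLE_ARN}'
    AUTHENTICATION iam
    CLUSTER_ARN '{MSK_CLUSTER_ARN}';
    """,
    # Materialize the "customer" topic; AUTO REFRESH keeps the view
    # current as new records arrive on the stream.
    """
    CREATE MATERIALIZED VIEW customer_mv AUTO REFRESH YES AS
    SELECT kafka_partition,
           kafka_offset,
           kafka_timestamp,
           JSON_PARSE(kafka_value) AS customer_data
    FROM msk_external_schema."customer"
    WHERE CAN_JSON_PARSE(kafka_value);
    """,
]

client = boto3.client("redshift-data")
for sql in SQL_STATEMENTS:
    # The Data API runs each statement asynchronously; poll
    # describe_statement with the returned Id to check completion.
    response = client.execute_statement(
        ClusterIdentifier=CLUSTER_ID, Database=DATABASE, DbUser=DB_USER, Sql=sql
    )
    print(response["Id"])
```

With the view in place, a plain SELECT against customer_mv returns the latest ingested records.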

This article will guide you through the process of configuring Amazon Redshift streaming ingestion from Amazon MSK.

Solution Overview

The following architecture diagram illustrates the AWS services and features involved in this process.

The workflow consists of several key steps:

  1. Set up an Amazon MSK Connect source connector that creates an MSK topic, generates mock customer data for this demonstration, and writes it to that topic.
  2. Next, establish a connection to a Redshift cluster via the Query Editor v2.
  3. Finally, configure an external schema and create a materialized view in Amazon Redshift to access data from the MSK topic. Notably, this solution does not rely on an MSK Connect sink connector for data export from Amazon MSK to Amazon Redshift.

The solution architecture diagram further details the configuration and integration of the AWS services involved. The workflow encompasses the following steps:

  • Deploy an MSK Connect source connector, an MSK cluster, and a Redshift cluster within the private subnets of a Virtual Private Cloud (VPC).
  • The MSK Connect source connector uses granular permissions defined in an AWS Identity and Access Management (IAM) inline policy attached to an IAM role, which allow it to interact with the MSK cluster.
  • Logs from the MSK Connect source connector are captured and directed to an Amazon CloudWatch log group.
  • The MSK cluster is configured with a custom MSK cluster configuration, enabling the MSK Connect connector to create topics.
  • Logs from the MSK cluster are also captured and sent to an Amazon CloudWatch log group.
  • The Redshift cluster likewise uses granular permissions via an IAM inline policy attached to an IAM role, allowing it to perform actions on the MSK cluster (a sketch of such a policy follows this list).
  • You can access the Redshift cluster using the Query Editor v2.
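
The following is a hedged sketch of what the Redshift role's inline policy might contain, attached with boto3. The ARNs and policy name are placeholders, and the kafka-cluster actions shown are the IAM access-control actions MSK uses for connecting to a cluster and reading a topic; the exact set your cluster requires can vary, so treat this as illustrative:

```python
import json
import boto3

# Placeholder ARN for the MSK cluster created by the stack.
MSK_CLUSTER_ARN = "arn:aws:kafka:us-east-1:111122223333:cluster/custom-msk-cluster/abcd1234"
# Topic and consumer-group ARNs share the cluster's name/UUID path.
MSK_TOPIC_ARNS = MSK_CLUSTER_ARN.replace(":cluster/", ":topic/") + "/*"
MSK_GROUP_ARNS = MSK_CLUSTER_ARN.replace(":cluster/", ":group/") + "/*"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Allow Redshift to connect over IAM auth and read topic data.
            "Effect": "Allow",
            "Action": [
                "kafka-cluster:Connect",
                "kafka-cluster:DescribeTopic",
                "kafka-cluster:ReadData",
                "kafka-cluster:DescribeGroup",
                "kafka-cluster:AlterGroup",
            ],
            "Resource": [MSK_CLUSTER_ARN, MSK_TOPIC_ARNS, MSK_GROUP_ARNS],
        }
    ],
}

boto3.client("iam").put_role_policy(
    RoleName="redshift-role",                  # role created by the stack
    PolicyName="redshift-msk-inline-policy",   # hypothetical policy name
    PolicyDocument=json.dumps(policy),
)
```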

Prerequisites

To streamline the provisioning and configuration of the necessary resources, you can use an AWS CloudFormation template.

When launching the stack, follow these steps:

  1. Enter a meaningful stack name, such as “prerequisites.”
  2. Click “Next.”
  3. Click “Next” again.
  4. Acknowledge that AWS CloudFormation may create IAM resources with custom names.
  5. Click “Submit.”

The CloudFormation stack will create the following resources:

  • A custom VPC named custom-vpc spanning three Availability Zones, featuring three public and three private subnets.
    • Public subnets are linked to a public route table, directing outbound traffic to an internet gateway.
    • Private subnets are associated with a private route table, routing outbound traffic through a NAT gateway.
  • An internet gateway attached to the VPC.
  • A NAT gateway linked to an elastic IP situated in one of the public subnets.
  • Three security groups:
    • msk-connect-sg, designated for the MSK Connect connector.
    • redshift-sg, assigned to the Redshift cluster.
    • msk-cluster-sg, associated with the MSK cluster, permitting inbound traffic from both msk-connect-sg and redshift-sg.
  • Two Amazon CloudWatch log groups:
    • msk-connect-logs, for MSK Connect logs.
    • msk-cluster-logs, for MSK cluster logs.
  • Two IAM Roles:
    • msk-connect-role, encompassing permissions for MSK Connect.
    • redshift-role, including permissions for Amazon Redshift.
  • A custom MSK cluster configuration that allows the MSK Connect connector to create topics (see the property sketch after this list).
  • An MSK cluster with three brokers deployed across the private subnets of custom-vpc. The msk-cluster-sg security group and custom-msk-cluster-configuration are applied, with broker logs sent to the msk-cluster-logs CloudWatch log group.
  • A Redshift cluster subnet group utilizing the three private subnets of custom-vpc.
  • A Redshift cluster consisting of a single node deployed in a private subnet within the Redshift cluster subnet group, with the redshift-sg security group and redshift-role IAM role applied.
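
The custom MSK cluster configuration mentioned above boils down to a single Kafka broker property: MSK disables automatic topic creation by default, and auto.create.topics.enable=true permits the MSK Connect connector to create its target topic. A sketch of creating that configuration with boto3, using the configuration name from the list above:

```python
import boto3

# MSK disables automatic topic creation by default; enabling this broker
# property lets the MSK Connect source connector create its topic on
# first write.
server_properties = b"auto.create.topics.enable=true\n"

boto3.client("kafka").create_configuration(
    Name="custom-msk-cluster-configuration",   # name used by the stack above
    Description="Allow MSK Connect connectors to auto-create topics",
    ServerProperties=server_properties,
)
```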

Creating an MSK Connect Custom Plugin

For this tutorial, we will use an Amazon MSK data generator deployed in MSK Connect to create mock customer data and write it to an MSK topic. Follow these steps:

  1. Download the Amazon MSK data generator JAR file along with its dependencies from GitHub.
  2. Upload the JAR file to an S3 bucket in your AWS account.
  3. In the Amazon MSK console, select “Custom plugins” under the MSK Connect section.
  4. Click “Create custom plugin.”
  5. Choose “Browse S3,” locate the Amazon MSK data generator JAR file you uploaded, and select it.
  6. Name the custom plugin “msk-datagen-plugin.”
  7. Click “Create custom plugin.”

Once the custom plugin is created, its status will be Active, allowing you to proceed to the next steps.
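
If you prefer the AWS SDK over the console, the same plugin registration looks roughly like the following. The S3 bucket ARN and object key are placeholders for wherever you uploaded the JAR file:

```python
import boto3

# Placeholder S3 location of the uploaded data generator JAR.
BUCKET_ARN = "arn:aws:s3:::my-plugin-bucket"   # hypothetical bucket
FILE_KEY = "msk-data-generator.jar"            # hypothetical object key

response = boto3.client("kafkaconnect").create_custom_plugin(
    name="msk-datagen-plugin",
    contentType="JAR",   # the generator ships as a single uber-JAR
    location={"s3Location": {"bucketArn": BUCKET_ARN, "fileKey": FILE_KEY}},
)
# Note the ARN and revision; the connector definition references both.
print(response["customPluginArn"], response["revision"])
```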

Creating an MSK Connect Connector

To create your connector, complete the following:

  1. In the Amazon MSK console, select “Connectors” under MSK Connect.
  2. Click “Create connector.”
  3. For “Custom plugin type,” select “Use existing plugin.”
  4. Choose “msk-datagen-plugin,” then click “Next.”
  5. For “Connector name,” enter “msk-datagen-connector.”
  6. For Cluster type, select “Self-managed Apache Kafka cluster.”
  7. For VPC, choose “custom-vpc.”
  8. For Subnet 1, select the private subnet in your first Availability Zone.
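
The remaining connector settings map onto the resources the CloudFormation stack created: the other two private subnets, the msk-connect-sg security group, the msk-connect-role IAM role, and the msk-connect-logs log group. For reference, here is a consolidated sketch of an equivalent connector definition through boto3. All ARNs, subnet and security group IDs, and the bootstrap broker string are placeholders, and the generator properties follow the genkp/genv syntax documented in the amazon-msk-data-generator project, so verify them against its README:

```python
import boto3

# Placeholders: plugin ARN/revision from the previous step, the MSK
# bootstrap string, and the VPC resources created by the stack.
PLUGIN_ARN = "arn:aws:kafkaconnect:us-east-1:111122223333:custom-plugin/msk-datagen-plugin/abcd1234"
BOOTSTRAP = "b-1.custommskcluster.example.us-east-1.amazonaws.com:9098"  # IAM auth port
SUBNETS = ["subnet-aaa", "subnet-bbb", "subnet-ccc"]   # the three private subnets
SECURITY_GROUPS = ["sg-mskconnect"]                    # msk-connect-sg
ROLE_ARN = "arn:aws:iam::111122223333:role/msk-connect-role"

# Generator settings: one "customer" topic keyed by a fake ISBN, with a
# few fake attribute fields per record.
connector_config = {
    "connector.class": "com.amazonaws.mskdatagen.GeneratorSourceConnector",
    "genkp.customer.with": "#{Code.isbn10}",
    "genv.customer.name.with": "#{Name.full_name}",
    "genv.customer.state.with": "#{Address.state}",
    "global.throttle.ms": "2000",
    "tasks.max": "1",
}

boto3.client("kafkaconnect").create_connector(
    connectorName="msk-datagen-connector",
    kafkaConnectVersion="2.7.1",
    capacity={"provisionedCapacity": {"mcuCount": 1, "workerCount": 1}},
    plugins=[{"customPlugin": {"customPluginArn": PLUGIN_ARN, "revision": 1}}],
    serviceExecutionRoleArn=ROLE_ARN,
    connectorConfiguration=connector_config,
    kafkaCluster={"apacheKafkaCluster": {
        "bootstrapServers": BOOTSTRAP,
        "vpc": {"subnets": SUBNETS, "securityGroups": SECURITY_GROUPS},
    }},
    kafkaClusterClientAuthentication={"authenticationType": "IAM"},
    kafkaClusterEncryptionInTransit={"encryptionType": "TLS"},
    logDelivery={"workerLogDelivery": {"cloudWatchLogs": {
        "enabled": True, "logGroup": "msk-connect-logs",
    }}},
)
```

Once the connector status is Running, the mock records land on the customer topic, and the materialized view defined earlier begins picking them up on refresh.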
