Stream Data from Relational Databases to Amazon Redshift with Upserts Using AWS Glue Streaming Jobs

In traditional setups, read replicas of relational databases have typically served as data sources for workloads outside of online transaction processing, such as reporting, business analysis, and customer service. However, with the explosive growth of data volumes, organizations are increasingly opting for data warehouses or data lakes to improve scalability and performance. In many practical scenarios, real-time data replication from a source relational database to a target system is crucial, and change data capture (CDC) has emerged as the prevalent design pattern for capturing changes in a source database and transmitting them to other data stores.

AWS provides a diverse array of purpose-built databases tailored to various needs. For analytic workloads like reporting and business analysis, Amazon Redshift stands out as a robust option. It allows users to query and combine vast amounts of structured and semi-structured data across data warehouses, operational databases, and data lakes using standard SQL.

To implement CDC from Amazon RDS or other relational databases to Amazon Redshift, the simplest solution involves creating an AWS Database Migration Service (AWS DMS) task from the database to Amazon Redshift. While this method is effective for straightforward data replication, we recommend utilizing Amazon Kinesis Data Streams and AWS Glue streaming jobs for more flexibility in denormalizing, transforming, and enriching the data. This article illustrates how this approach functions within a customer scenario.
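
For reference, when a Kinesis data stream is the AWS DMS target, each change is delivered as a JSON message that carries the row payload together with replication metadata such as the operation type and the source table. A change record for the fact table introduced in the next section would look roughly like the following (the field values are illustrative):

```json
{
  "data": {
    "ticket_id": 1,
    "purchased_by": 222,
    "created_at": "2021-08-15",
    "updated_at": "2021-08-15"
  },
  "metadata": {
    "timestamp": "2021-08-15T09:30:00.000000Z",
    "record-type": "data",
    "operation": "insert",
    "partition-key-type": "schema-table",
    "schema-name": "public",
    "table-name": "ticket_activity"
  }
}
```

The operation field (such as "insert", "update", or "delete") is what allows the downstream AWS Glue streaming job to apply each change as an upsert rather than a simple append.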

Example Use Case

In our example, we explore a database that stores information for a fictional organization hosting sports events. This organization has three dimension tables: sport_event, ticket, and customer, alongside one fact table: ticket_activity. The sport_event table holds details about sport types (like baseball and football), dates, and locations. The ticket table catalogs seating levels, locations, and ticket policies for each sport event. The customer table contains individual customer names, email addresses, and phone numbers, which require careful handling due to their sensitive nature. Each time a customer purchases a ticket, the transaction is recorded in the ticket_activity table, leading to continuous ingestion of new records. Updates to the records in ticket_activity occur only when necessary, such as during data maintenance by an administrator.

We envision a data analyst persona responsible for analyzing trends in sports activities derived from this ongoing data stream. To position Amazon Redshift as the primary data mart, the analyst must enrich and sanitize the data, enabling business analysts and other users to interpret and utilize the information effectively.

Here’s a glimpse of the data in each table:

Dimension Table: sport_event

| event_id | sport_type | start_date | location |
| --- | --- | --- | --- |
| 1 | Baseball | 9/1/2021 | Seattle, US |
| 2 | Baseball | 9/18/2021 | New York, US |
| 3 | Football | 10/5/2021 | San Francisco, US |

Dimension Table: ticket (event_id acts as a foreign key to sport_event)

| ticket_id | event_id | seat_level | seat_location | ticket_price |
| --- | --- | --- | --- | --- |
| 1 | 35 | Standard | S-1 | 100 |
| 2 | 36 | Standard | S-2 | 100 |
| 3 | 37 | Premium | P-1 | 300 |

Dimension Table: customer

| customer_id | name | email | phone |
| --- | --- | --- | --- |
| 1 | Teresa Stein | teresa@example.com | +1-296-605-8486 |
| 2 | Caleb Houston | caleb@example.com | 087-237-9316x2670 |
| 3 | Raymond Turner | raymond@example.net | +1-786-503-2802x2357 |

Fact Table: ticket_activity (purchased_by is a foreign key to customer)

| ticket_id | purchased_by | created_at | updated_at |
| --- | --- | --- | --- |
| 1 | 222 | 8/15/2021 | 8/15/2021 |
| 2 | 223 | 8/30/2021 | 8/30/2021 |
| 3 | 224 | 8/31/2021 | 8/31/2021 |

To facilitate easier analysis, the data analyst desires a single table encompassing all the necessary information instead of performing joins across the four tables for each analysis. Additionally, the analyst wants to mask the phone_number field and tokenize the email_address field to safeguard sensitive data. To address these requirements, we consolidate the four tables into one, denormalizing, tokenizing, and masking the data.
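
As a concrete illustration, here is a minimal sketch of the masking and tokenization logic, assuming SHA-256 hashing is used to tokenize email addresses and simple digit masking is used for phone numbers; the function names are illustrative rather than taken from the actual job:

```python
import hashlib
import re


def tokenize_email(email: str) -> str:
    """Replace an email address with a deterministic SHA-256 token."""
    return hashlib.sha256(email.encode("utf-8")).hexdigest()


def mask_phone(phone: str) -> str:
    """Mask every digit in a phone number while preserving its formatting."""
    return re.sub(r"\d", "*", phone)


# Example usage with values from the customer table above
print(tokenize_email("teresa@example.com"))  # 64-character hex token
print(mask_phone("+1-296-605-8486"))         # +*-***-***-****
```

Because the token is deterministic, the same email address always maps to the same value, so analysts can still count or group by customer without ever seeing the raw address.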

Here’s the resulting destination table for analysis, sport_event_activity:

| ticket_id | event_id | sport_type | start_date | location | seat_level | seat_location | ticket_price | purchased_by | name | email_address | phone_number | created_at | updated_at |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 35 | Baseball | 9/1/2021 | Seattle, USA | Standard | S-1 | 100 | 222 | Teresa Stein | 990d081b6a420d04fbe07dc822918c7ec3506b12cd7318df7eb3af6a8e8e0fd6 | +*-***-***-**** | 8/15/2021 | 8/15/2021 |
| 2 | 36 | Baseball | 9/18/2021 | New York, USA | Standard | S-2 | 100 | 223 | Caleb Houston | c196e9e58d1b9978e76953ffe0ee3ce206bf4b88e26a71d810735f0a2eb6186e | ***-***-****x**** | 8/30/2021 | 8/30/2021 |
| 3 | 37 | Football | 10/5/2021 | San Francisco, US | Premium | P-1 | 300 | 224 | Raymond Turner | 885ff2b56effa0efa10afec064e1c27d1cce297d9199a9d5da48e39df9816668 | +*-***-***-****x**** | 8/31/2021 | 8/31/2021 |

Solution Overview

The architecture of our solution, provisioned with AWS CloudFormation, works as follows: an AWS DMS task captures changes in the source RDS database, and a Kinesis data stream serves as the target for the AWS DMS CDC replication. An AWS Glue streaming job reads the change records from the Kinesis data stream, denormalizes and transforms them, and upserts the results into Amazon Redshift, keeping the destination table consistent with the source.
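
The following is a minimal sketch of what the AWS Glue streaming job could look like, written in PySpark. It assumes the change records arrive in the JSON shape shown earlier; the job argument names, the Glue connection, the S3 paths, and the staging table name are illustrative assumptions rather than details of the actual deployment. The upsert into Amazon Redshift uses the common staging-table pattern: each micro-batch is loaded into a staging table, and the postactions SQL then deletes matching rows from the target and inserts the new versions within a single transaction.

```python
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Job arguments (names are illustrative)
args = getResolvedOptions(
    sys.argv, ["JOB_NAME", "stream_arn", "redshift_connection", "temp_s3_path"]
)

sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# Read the AWS DMS change records from the Kinesis data stream
kinesis_frame = glue_context.create_data_frame.from_options(
    connection_type="kinesis",
    connection_options={
        "streamARN": args["stream_arn"],
        "classification": "json",
        "startingPosition": "TRIM_HORIZON",
        "inferSchema": "true",
    },
    transformation_ctx="kinesis_source",
)

# SQL executed after each staging load: delete-and-insert makes the load an upsert
UPSERT_SQL = """
    BEGIN;
    DELETE FROM public.sport_event_activity
      USING public.sport_event_activity_stage
      WHERE sport_event_activity.ticket_id = sport_event_activity_stage.ticket_id;
    INSERT INTO public.sport_event_activity
      SELECT * FROM public.sport_event_activity_stage;
    DROP TABLE public.sport_event_activity_stage;
    END;
"""


def process_batch(data_frame, batch_id):
    """Denormalize, mask, and upsert one micro-batch into Amazon Redshift."""
    if data_frame.count() == 0:
        return

    # Flatten the DMS envelope: keep the row payload plus the operation type
    records = data_frame.selectExpr("data.*", "metadata.operation AS cdc_operation")

    # Keep inserts and updates for the upsert path (deletes would be handled separately)
    upserts = records.filter(
        "cdc_operation IN ('load', 'insert', 'update')"
    ).drop("cdc_operation")

    # ... join with the dimension tables and apply mask_phone/tokenize_email here ...

    staged = DynamicFrame.fromDF(upserts, glue_context, "staged")

    # Load the batch into the staging table; postactions performs the upsert
    glue_context.write_dynamic_frame.from_jdbc_conf(
        frame=staged,
        catalog_connection=args["redshift_connection"],
        connection_options={
            "database": "dev",
            "dbtable": "public.sport_event_activity_stage",
            "postactions": UPSERT_SQL,
        },
        redshift_tmp_dir=args["temp_s3_path"],
    )


# Process the stream in 100-second micro-batches with checkpointing
glue_context.forEachBatch(
    frame=kinesis_frame,
    batch_function=process_batch,
    options={
        "windowSize": "100 seconds",
        "checkpointLocation": args["temp_s3_path"] + "/checkpoint/",
    },
)
```

The delete-and-insert approach is idempotent with respect to the key used for matching (ticket_id here), so replayed or late-arriving change records overwrite the existing row instead of creating duplicates.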
