Stream Data from Relational Databases to Amazon Redshift with Upserts Using AWS Glue Streaming Jobs

In traditional setups, read replicas of relational databases have typically served as data sources for workloads outside of online transaction processing, such as reporting, business analysis, and customer service. However, with the explosive growth of data volumes, organizations are increasingly opting for data warehouses or data lakes to improve scalability and performance. In many practical scenarios, real-time data replication from a source relational database to a target system is crucial, and change data capture (CDC) has emerged as the prevalent design pattern for capturing changes in a source database and transmitting them to other data stores.

AWS provides a diverse array of purpose-built databases tailored to various needs. For analytic workloads like reporting and business analysis, Amazon Redshift stands out as a robust option. It allows users to query and combine vast amounts of structured and semi-structured data across data warehouses, operational databases, and data lakes using standard SQL.

To implement CDC from Amazon RDS or other relational databases to Amazon Redshift, the simplest solution involves creating an AWS Database Migration Service (AWS DMS) task from the database to Amazon Redshift. While this method is effective for straightforward data replication, we recommend utilizing Amazon Kinesis Data Streams and AWS Glue streaming jobs for more flexibility in denormalizing, transforming, and enriching the data. This article illustrates how this approach functions within a customer scenario.
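
For reference, when a Kinesis data stream is the AWS DMS target, each change is delivered as a JSON message that carries the row payload together with replication metadata such as the operation type and the source table. A change record for the fact table introduced in the next section would look roughly like the following (the field values are illustrative):

```json
{
  "data": {
    "ticket_id": 1,
    "purchased_by": 222,
    "created_at": "2021-08-15",
    "updated_at": "2021-08-15"
  },
  "metadata": {
    "timestamp": "2021-08-15T09:30:00.000000Z",
    "record-type": "data",
    "operation": "insert",
    "partition-key-type": "schema-table",
    "schema-name": "public",
    "table-name": "ticket_activity"
  }
}
```

The operation field (such as "insert", "update", or "delete") is what allows the downstream AWS Glue streaming job to apply each change as an upsert rather than a simple append.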

Example Use Case

In our example, we explore a database that stores information for a fictional organization hosting sports events. This organization has three dimension tables: sport_event, ticket, and customer, alongside one fact table: ticket_activity. The sport_event table holds details about sport types (like baseball and football), dates, and locations. The ticket table catalogs seating levels, locations, and ticket policies for each sport event. The customer table contains individual customer names, email addresses, and phone numbers, which require careful handling due to their sensitive nature. Each time a customer purchases a ticket, the transaction is recorded in the ticket_activity table, leading to continuous ingestion of new records. Updates to the records in ticket_activity occur only when necessary, such as during data maintenance by an administrator.

We envision a data analyst persona responsible for analyzing trends in sports activities derived from this ongoing data stream. To position Amazon Redshift as the primary data mart, the analyst must enrich and sanitize the data, enabling business analysts and other users to interpret and utilize the information effectively.

Here’s a glimpse of the data in each table:

Dimension Table: sport_event

| event_id | sport_type | start_date | location |
| --- | --- | --- | --- |
| 1 | Baseball | 9/1/2021 | Seattle, US |
| 2 | Baseball | 9/18/2021 | New York, US |
| 3 | Football | 10/5/2021 | San Francisco, US |

Dimension Table: ticket (event_id acts as a foreign key to sport_event)

| ticket_id | event_id | seat_level | seat_location | ticket_price |
| --- | --- | --- | --- | --- |
| 1 | 35 | Standard | S-1 | 100 |
| 2 | 36 | Standard | S-2 | 100 |
| 3 | 37 | Premium | P-1 | 300 |

Dimension Table: customer

| customer_id | name | email | phone |
| --- | --- | --- | --- |
| 1 | Teresa Stein | teresa@example.com | +1-296-605-8486 |
| 2 | Caleb Houston | caleb@example.com | 087-237-9316x2670 |
| 3 | Raymond Turner | raymond@example.net | +1-786-503-2802x2357 |

Fact Table: ticket_activity (purchased_by is a foreign key to customer)

| ticket_id | purchased_by | created_at | updated_at |
| --- | --- | --- | --- |
| 1 | 222 | 8/15/2021 | 8/15/2021 |
| 2 | 223 | 8/30/2021 | 8/30/2021 |
| 3 | 224 | 8/31/2021 | 8/31/2021 |

To facilitate easier analysis, the data analyst desires a single table encompassing all the necessary information instead of performing joins across the four tables for each analysis. Additionally, the analyst wants to mask the phone_number field and tokenize the email_address field to safeguard sensitive data. To address these requirements, we consolidate the four tables into one, denormalizing, tokenizing, and masking the data.
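
As a concrete illustration, here is a minimal sketch of the masking and tokenization logic, assuming SHA-256 hashing is used to tokenize email addresses and simple digit masking is used for phone numbers; the function names are illustrative rather than taken from the actual job:

```python
import hashlib
import re


def tokenize_email(email: str) -> str:
    """Replace an email address with a deterministic SHA-256 token."""
    return hashlib.sha256(email.encode("utf-8")).hexdigest()


def mask_phone(phone: str) -> str:
    """Mask every digit in a phone number while preserving its formatting."""
    return re.sub(r"\d", "*", phone)


# Example usage with values from the customer table above
print(tokenize_email("teresa@example.com"))  # 64-character hex token
print(mask_phone("+1-296-605-8486"))         # +*-***-***-****
```

Because the token is deterministic, the same email address always maps to the same value, so analysts can still count or group by customer without ever seeing the raw address.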

Here’s the resulting destination table for analysis, sport_event_activity:

| ticket_id | event_id | sport_type | start_date | location | seat_level | seat_location | ticket_price | purchased_by | name | email_address | phone_number | created_at | updated_at |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 35 | Baseball | 9/1/2021 | Seattle, USA | Standard | S-1 | 100 | 222 | Teresa Stein | 990d081b6a420d04fbe07dc822918c7ec3506b12cd7318df7eb3af6a8e8e0fd6 | +*-***-***-**** | 8/15/2021 | 8/15/2021 |
| 2 | 36 | Baseball | 9/18/2021 | New York, USA | Standard | S-2 | 100 | 223 | Caleb Houston | c196e9e58d1b9978e76953ffe0ee3ce206bf4b88e26a71d810735f0a2eb6186e | ***-***-****x**** | 8/30/2021 | 8/30/2021 |
| 3 | 37 | Football | 10/5/2021 | San Francisco, US | Premium | P-1 | 300 | 224 | Raymond Turner | 885ff2b56effa0efa10afec064e1c27d1cce297d9199a9d5da48e39df9816668 | +*-***-***-****x**** | 8/31/2021 | 8/31/2021 |

Solution Overview

The architecture of our solution, provisioned with AWS CloudFormation, works as follows: an AWS DMS task captures changes in the source RDS database, and a Kinesis data stream serves as the target for the AWS DMS CDC replication. An AWS Glue streaming job reads the change records from the Kinesis data stream, denormalizes and transforms them, and upserts the results into Amazon Redshift, keeping the destination table consistent with the source.
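
The following is a minimal sketch of what the AWS Glue streaming job could look like, written in PySpark. It assumes the change records arrive in the JSON shape shown earlier; the job argument names, the Glue connection, the S3 paths, and the staging table name are illustrative assumptions rather than details of the actual deployment. The upsert into Amazon Redshift uses the common staging-table pattern: each micro-batch is loaded into a staging table, and the postactions SQL then deletes matching rows from the target and inserts the new versions within a single transaction.

```python
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Job arguments (names are illustrative)
args = getResolvedOptions(
    sys.argv, ["JOB_NAME", "stream_arn", "redshift_connection", "temp_s3_path"]
)

sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# Read the AWS DMS change records from the Kinesis data stream
kinesis_frame = glue_context.create_data_frame.from_options(
    connection_type="kinesis",
    connection_options={
        "streamARN": args["stream_arn"],
        "classification": "json",
        "startingPosition": "TRIM_HORIZON",
        "inferSchema": "true",
    },
    transformation_ctx="kinesis_source",
)

# SQL executed after each staging load: delete-and-insert makes the load an upsert
UPSERT_SQL = """
    BEGIN;
    DELETE FROM public.sport_event_activity
      USING public.sport_event_activity_stage
      WHERE sport_event_activity.ticket_id = sport_event_activity_stage.ticket_id;
    INSERT INTO public.sport_event_activity
      SELECT * FROM public.sport_event_activity_stage;
    DROP TABLE public.sport_event_activity_stage;
    END;
"""


def process_batch(data_frame, batch_id):
    """Denormalize, mask, and upsert one micro-batch into Amazon Redshift."""
    if data_frame.count() == 0:
        return

    # Flatten the DMS envelope: keep the row payload plus the operation type
    records = data_frame.selectExpr("data.*", "metadata.operation AS cdc_operation")

    # Keep inserts and updates for the upsert path (deletes would be handled separately)
    upserts = records.filter(
        "cdc_operation IN ('load', 'insert', 'update')"
    ).drop("cdc_operation")

    # ... join with the dimension tables and apply mask_phone/tokenize_email here ...

    staged = DynamicFrame.fromDF(upserts, glue_context, "staged")

    # Load the batch into the staging table; postactions performs the upsert
    glue_context.write_dynamic_frame.from_jdbc_conf(
        frame=staged,
        catalog_connection=args["redshift_connection"],
        connection_options={
            "database": "dev",
            "dbtable": "public.sport_event_activity_stage",
            "postactions": UPSERT_SQL,
        },
        redshift_tmp_dir=args["temp_s3_path"],
    )


# Process the stream in 100-second micro-batches with checkpointing
glue_context.forEachBatch(
    frame=kinesis_frame,
    batch_function=process_batch,
    options={
        "windowSize": "100 seconds",
        "checkpointLocation": args["temp_s3_path"] + "/checkpoint/",
    },
)
```

The delete-and-insert approach is idempotent with respect to the key used for matching (ticket_id here), so replayed or late-arriving change records overwrite the existing row instead of creating duplicates.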
