In traditional setups, read replicas of relational databases have typically served as data sources for workloads other than online transactions in web applications, such as reporting, business analysis, and customer service. However, with the explosive growth of data volumes, organizations are increasingly opting for data warehouses or data lakes to enhance scalability and performance. In many practical scenarios, real-time data replication from a source relational database to a target system is crucial. Change Data Capture (CDC) has emerged as a prevalent design pattern for capturing changes in source databases and transmitting them to other data stores.
AWS provides a diverse array of purpose-built databases tailored to various needs. For analytic workloads like reporting and business analysis, Amazon Redshift stands out as a robust option. It allows users to query and combine vast amounts of structured and semi-structured data across data warehouses, operational databases, and data lakes using standard SQL.
To implement CDC from Amazon RDS or other relational databases to Amazon Redshift, the simplest solution involves creating an AWS Database Migration Service (AWS DMS) task from the database to Amazon Redshift. While this method is effective for straightforward data replication, we recommend utilizing Amazon Kinesis Data Streams and AWS Glue streaming jobs for more flexibility in denormalizing, transforming, and enriching the data. This article illustrates how this approach functions within a customer scenario.
Example Use Case
In our example, we explore a database that stores information for a fictional organization hosting sports events. This organization has three dimension tables: sport_event, ticket, and customer, alongside one fact table: ticket_activity. The sport_event table holds details about sport types (like baseball and football), dates, and locations. The ticket table catalogs seating levels, locations, and ticket policies for each sport event. The customer table contains individual customer names, email addresses, and phone numbers, which require careful handling due to their sensitive nature. Each time a customer purchases a ticket, the transaction is recorded in the ticket_activity table, leading to continuous ingestion of new records. Updates to the records in ticket_activity occur only when necessary, such as during data maintenance by an administrator.
We envision a data analyst persona responsible for analyzing trends in sports activities derived from this ongoing data stream. To position Amazon Redshift as the primary data mart, the analyst must enrich and sanitize the data, enabling business analysts and other users to interpret and utilize the information effectively.
Here’s a glimpse of the data in each table:
Dimension Table: sport_event
event_id | sport_type | start_date | location |
---|---|---|---|
1 | Baseball | 9/1/2021 | Seattle, US |
2 | Baseball | 9/18/2021 | New York, US |
3 | Football | 10/5/2021 | San Francisco, US |
Dimension Table: ticket (event_id acts as a foreign key to sport_event)
ticket_id | event_id | seat_level | seat_location | ticket_price |
---|---|---|---|---|
1 | 35 | Standard | S-1 | 100 |
2 | 36 | Standard | S-2 | 100 |
3 | 37 | Premium | P-1 | 300 |
Dimension Table: customer
customer_id | name | email_address | phone |
---|---|---|---|
1 | Teresa Stein | teresa@example.com | +1-296-605-8486 |
2 | Caleb Houston | caleb@example.com | 087-237-9316×2670 |
3 | Raymond Turner | raymond@example.net | +1-786-503-2802×2357 |
Fact Table: ticket_activity (purchased_by is a foreign key to customer)
ticket_id | purchased_by | created_at | updated_at |
---|---|---|---|
1 | 222 | 8/15/2021 | 8/15/2021 |
2 | 223 | 8/30/2021 | 8/30/2021 |
3 | 224 | 8/31/2021 | 8/31/2021 |
To facilitate easier analysis, the data analyst desires a single table encompassing all the necessary information instead of performing joins across the four tables for each analysis. Additionally, the analyst wants to mask the phone_number field and tokenize the email_address field to safeguard sensitive data. To address these requirements, we consolidate the four tables into one, denormalizing, tokenizing, and masking the data.
Here’s the resulting destination table for analysis, sport_event_activity:
ticket_id | event_id | sport_type | start_date | location | seat_level | seat_location | ticket_price | purchased_by | name | email_address | phone_number | created_at | updated_at |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 35 | Baseball | 9/1/2021 | Seattle, US | Standard | S-1 | 100 | 222 | Teresa Stein | 990d081b6a420d04fbe07dc822918c7ec3506b12cd7318df7eb3af6a8e8e0fd6 | +*-***-***-**** | 8/15/2021 | 8/15/2021 |
2 | 36 | Baseball | 9/18/2021 | New York, US | Standard | S-2 | 100 | 223 | Caleb Houston | c196e9e58d1b9978e76953ffe0ee3ce206bf4b88e26a71d810735f0a2eb6186e | ***-***-****x**** | 8/30/2021 | 8/30/2021 |
3 | 37 | Football | 10/5/2021 | San Francisco, US | Premium | P-1 | 300 | 224 | Raymond Turner | 885ff2b56effa0efa10afec064e1c27d1cce297d9199a9d5da48e39df9816668 | +*-***-***-****x**** | 8/31/2021 | 8/31/2021 |
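The transformation described above can be sketched in plain Python. This is a minimal illustration, not the AWS Glue job itself: it assumes the email is tokenized with a SHA-256 hash and the phone number is masked by replacing every digit, which matches the shape of the sample output, and the lookup-dictionary structure is hypothetical.

```python
import hashlib
import re


def tokenize_email(email: str) -> str:
    """Replace an email address with a deterministic SHA-256 token (assumed scheme)."""
    return hashlib.sha256(email.encode("utf-8")).hexdigest()


def mask_phone(phone: str) -> str:
    """Mask every digit, keeping separators such as '+', '-', and 'x'."""
    return re.sub(r"\d", "*", phone)


def denormalize(activity: dict, tickets: dict, events: dict, customers: dict) -> dict:
    """Join one ticket_activity row with its ticket, sport_event, and customer rows."""
    ticket = tickets[activity["ticket_id"]]
    event = events[ticket["event_id"]]
    customer = customers[activity["purchased_by"]]
    return {
        "ticket_id": activity["ticket_id"],
        "event_id": ticket["event_id"],
        "sport_type": event["sport_type"],
        "start_date": event["start_date"],
        "location": event["location"],
        "seat_level": ticket["seat_level"],
        "seat_location": ticket["seat_location"],
        "ticket_price": ticket["ticket_price"],
        "purchased_by": activity["purchased_by"],
        "name": customer["name"],
        "email_address": tokenize_email(customer["email_address"]),
        "phone_number": mask_phone(customer["phone_number"]),
        "created_at": activity["created_at"],
        "updated_at": activity["updated_at"],
    }
```

Because the token is a one-way hash, the same email always yields the same token, so analysts can still count distinct customers without ever seeing the raw address.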
Solution Overview
The architecture of our solution, implemented using AWS CloudFormation, is illustrated below. An AWS DMS task captures changes in the source RDS instance, with Kinesis Data Streams serving as the destination for the AWS DMS CDC replication. An AWS Glue streaming job reads the change records from the Kinesis data stream, denormalizes, tokenizes, and masks them as described above, and loads the results into Amazon Redshift.
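To make the flow concrete, here is a sketch of how a streaming consumer might interpret one change record arriving on the Kinesis stream. The layout (a "data" payload plus a "metadata" envelope with an operation and table name) follows the JSON format AWS DMS uses for Kinesis targets; the specific field values below are invented for illustration.

```python
import json


def parse_cdc_record(raw: bytes) -> tuple:
    """Return (operation, table_name, row) for one DMS change record."""
    record = json.loads(raw)
    meta = record["metadata"]
    return meta["operation"], meta["table-name"], record["data"]


# A hypothetical insert event for the ticket_activity fact table.
sample = json.dumps({
    "data": {"ticket_id": 1, "purchased_by": 222},
    "metadata": {
        "operation": "insert",
        "schema-name": "dms_sample",
        "table-name": "ticket_activity",
        "timestamp": "2021-08-15T00:00:00Z",
    },
}).encode("utf-8")

op, table, row = parse_cdc_record(sample)
```

In the Glue streaming job, the operation field would decide whether the record is applied as an insert, update, or delete before the denormalized row is written to Amazon Redshift.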