With AWS Glue DataBrew, data analysts and data scientists can easily access and visually explore large amounts of data across their organization, directly from their Amazon Simple Storage Service (Amazon S3) data lake, Amazon Redshift data warehouse, and Amazon Aurora or Amazon Relational Database Service (Amazon RDS) databases. DataBrew offers over 250 built-in transformations to merge, pivot, and transpose data, all without writing code.
With the recent addition of support for JDBC-accessible databases, DataBrew now also connects to data stores such as PostgreSQL, MySQL, Oracle, and Microsoft SQL Server. In this article, we explore how to clean data from an RDS database, store the refined data in an S3 data lake, and generate a business intelligence (BI) report.
Use Case Overview
In our scenario, we will utilize three datasets:
- A school dataset that includes details like school ID and name.
- A student dataset that encompasses information such as student ID, name, and age.
- A dataset detailing student study habits, health, country, and more.
The diagram below illustrates the relationships between these tables.
This data is gathered by a survey organization after an annual exam and is updated in Amazon RDS for MySQL through a JavaScript-based frontend application. We will join these tables to create a unified view and aggregate the data through various preparation steps, enabling the business team to generate BI reports.
Solution Overview
The architecture of our solution is depicted in the diagram below. We will employ Amazon RDS for data storage, DataBrew for data preparation, Amazon Athena for analysis using standard SQL, and Amazon QuickSight for business reporting.
The workflow consists of the following steps:
- Establish a JDBC connection to RDS and create a DataBrew project. DataBrew will perform transformations to identify the top-performing students across the analyzed schools.
- The DataBrew job will output the finalized data to our S3 output bucket.
- After the output data is generated, we can create external tables on top of it using Athena’s create table statements and load partitions with MSCK REPAIR commands.
- Business users can utilize QuickSight for BI reporting, which retrieves data through Athena. Data analysts can also leverage Athena to examine the fully refreshed dataset.
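Step 3 of the workflow can be sketched as a pair of Athena SQL statements. The column list, table name, and S3 path below are illustrative assumptions, not values from the walkthrough; the real columns come from the DataBrew job output.

```python
# Hypothetical Athena DDL for the DataBrew job output. Replace the bucket,
# prefix, and column list with the ones your job actually produces.
create_table_sql = """
CREATE EXTERNAL TABLE IF NOT EXISTS top_performer_student (
    student_id   INT,
    student_name STRING,
    school_name  STRING,
    marks        INT
)
STORED AS PARQUET
LOCATION 's3://<your-output-bucket>/top-performer-student/'
"""

# If the job writes partitioned output, load the partitions afterwards.
repair_sql = "MSCK REPAIR TABLE top_performer_student"
```

Both statements can be run from the Athena query editor or via the Athena API once the job output lands in S3.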
Prerequisites
To implement this solution, you need an AWS account.
Pre-lab Setup
Before commencing this tutorial, ensure you have the necessary permissions to create the resources required for the solution. For our use case, we will utilize three mock datasets, which can be downloaded from GitHub.
- Create an RDS for MySQL instance to capture student health data.
- Ensure the correct security group is set up for Amazon RDS. More details can be found in Setting Up a VPC to Connect to JDBC Data Stores.
- Create the three tables: student_tbl, study_details_tbl, and school_tbl using DDL SQL.
- Upload the student.csv, study_details.csv, and school.csv files into their respective tables. Use student.sql, study_details.sql, and school.sql to insert data into the tables.
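Before loading the real CSV files, it can help to validate the shape of the three tables locally. The sketch below uses SQLite as a stand-in for RDS for MySQL; the exact column lists come from the DDL files on GitHub, so the columns shown here are illustrative assumptions based on the dataset overview above.

```python
import sqlite3

# SQLite stand-in for the three RDS tables; column names are assumptions.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE school_tbl        (school_id INTEGER PRIMARY KEY, school_name TEXT);
CREATE TABLE student_tbl       (student_id INTEGER PRIMARY KEY, school_id INTEGER,
                                first_name TEXT, last_name TEXT, age INTEGER);
CREATE TABLE study_details_tbl (student_id INTEGER, health TEXT, country TEXT,
                                study_time_hrs INTEGER, marks INTEGER);
""")

# One sample row per table, mimicking the insert scripts.
cur.execute("INSERT INTO school_tbl VALUES (1, 'Springfield High')")
cur.execute("INSERT INTO student_tbl VALUES (10, 1, 'Jane', 'Doe', 16)")
cur.execute("INSERT INTO study_details_tbl VALUES (10, 'good', 'US', 4, 72)")
conn.commit()
```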
Creating an Amazon RDS Connection
Follow these steps to establish your Amazon RDS connection:
- Access the DataBrew console and select Datasets.
- On the Connections tab, click Create connection.
- Enter a name for your connection (for example, student-db-conn).
- Choose JDBC for Connection type and MySQL for Database type.
- Input parameters such as the RDS endpoint, port, database name, and database credentials.
- In the Network options section, select the VPC, subnet, and security group for your RDS instance. Then, click Create connection.
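The console steps above can also be expressed programmatically. DataBrew creates a Glue connection under the hood, so the payload below follows my reading of the Glue CreateConnection API; verify the field names against the current boto3 documentation, and note that the endpoint, credentials, and network IDs are placeholders.

```python
# Sketch of the RDS connection as a Glue CreateConnection payload.
# All angle-bracketed values are placeholders, not values from the article.
connection_input = {
    "Name": "student-db-conn",
    "ConnectionType": "JDBC",
    "ConnectionProperties": {
        "JDBC_CONNECTION_URL": "jdbc:mysql://<rds-endpoint>:3306/<database-name>",
        "USERNAME": "<db-user>",
        "PASSWORD": "<db-password>",
    },
    "PhysicalConnectionRequirements": {
        "SubnetId": "<subnet-id>",
        "SecurityGroupIdList": ["<security-group-id>"],
        "AvailabilityZone": "<availability-zone>",
    },
}
# To actually create it:
# import boto3
# boto3.client("glue").create_connection(ConnectionInput=connection_input)
```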
Creating Your Datasets
There are three tables in Amazon RDS: school_tbl, student_tbl, and study_details_tbl. To utilize these tables, we first need to create a dataset for each.
To create the datasets, follow these steps (we will demonstrate creating the school dataset):
- Navigate to the Datasets page in the DataBrew console and select Connect new dataset.
- Name your dataset school-dataset.
- Choose the connection you established (AwsGlueDatabrew-student-db-conn).
- Enter school_tbl for the Table name and click Create dataset.
Repeat these steps for student_tbl and study_details_tbl, naming them student-dataset and study-detail-dataset, respectively. All three datasets will now be accessible on the Datasets page.
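The three dataset creations can be sketched as DataBrew CreateDataset requests. The field names follow my understanding of the boto3 `databrew` client and should be checked against the current API reference; the Glue connection name assumes DataBrew's `AwsGlueDatabrew-` prefix on the connection created earlier.

```python
# One CreateDataset payload per RDS table; nothing is sent to AWS here.
tables = {
    "school-dataset": "school_tbl",
    "student-dataset": "student_tbl",
    "study-detail-dataset": "study_details_tbl",
}

dataset_requests = [
    {
        "Name": dataset_name,
        "Input": {
            "DatabaseInputDefinition": {
                "GlueConnectionName": "AwsGlueDatabrew-student-db-conn",
                "DatabaseTableName": table_name,
            }
        },
    }
    for dataset_name, table_name in tables.items()
]
# To actually create them:
# import boto3
# databrew = boto3.client("databrew")
# for req in dataset_requests:
#     databrew.create_dataset(**req)
```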
Creating a Project Using the Datasets
To create your DataBrew project, follow these steps:
- In the DataBrew console, select Projects.
- Click Create project.
- Enter my-rds-proj for the Project Name.
- Choose Create new recipe for Attached recipe.
- The recipe name will be auto-populated.
- Select My datasets for Select a dataset, and choose study-detail-dataset.
- Select your AWS Identity and Access Management (IAM) role to use with DataBrew. Click Create project.
You will see a success message confirming that the project was created, along with the RDS study_details_tbl table containing 500 rows. Once the project opens, an interactive session is initiated, retrieving sample data based on your sampling configuration.
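The same project can be sketched as a DataBrew CreateProject request. The role ARN is a placeholder, and the recipe name mirrors the console's auto-populated value, which may differ in your account.

```python
# CreateProject payload sketch for the boto3 "databrew" client.
project_request = {
    "Name": "my-rds-proj",
    "DatasetName": "study-detail-dataset",
    "RecipeName": "my-rds-proj-recipe",          # console auto-populates this
    "RoleArn": "arn:aws:iam::<account-id>:role/<databrew-role>",
    "Sample": {"Type": "FIRST_N", "Size": 500},  # matches the 500-row sample
}
# To actually create it:
# import boto3
# boto3.client("databrew").create_project(**project_request)
```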
Opening an Amazon RDS Project and Developing a Transformation Recipe
Within a DataBrew interactive session, you can cleanse and normalize your data using over 250 built-in transformations. In this article, we use DataBrew to identify top-performing students by applying a few transformations, including filtering for students who scored 60 or higher in the last annual exam.
To begin, we will join all three RDS tables. The steps are as follows:
- Access the project you created.
- Click Join.
- For Select dataset, pick student-dataset and click Next.
- Choose Left join for Select join type. For Join keys, select student_id for both Table A and Table B. Click Finish.
Repeat these steps for the school-dataset based on the school_id key.
Next, merge first_name and last_name using the MERGE function, with a space as the separator. Click Apply.
Now, filter the rows based on marks greater than or equal to 60 by providing the source column and filter condition, then click Apply.
The final dataset will display the data of top-performing students who have marks greater than or equal to 60.
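The recipe steps above (two left joins, the name merge, and the marks filter) can be mirrored in pandas on a tiny in-memory sample. The column names here are assumptions based on the dataset overview; the real data comes from the three RDS tables.

```python
import pandas as pd

# Tiny stand-in frames for the three datasets; columns are assumptions.
study = pd.DataFrame({"student_id": [10, 11], "marks": [72, 45]})
student = pd.DataFrame({"student_id": [10, 11], "school_id": [1, 1],
                        "first_name": ["Jane", "John"], "last_name": ["Doe", "Roe"]})
school = pd.DataFrame({"school_id": [1], "school_name": ["Springfield High"]})

# Left join student-dataset, then school-dataset (the two Join steps).
df = (study.merge(student, on="student_id", how="left")
           .merge(school, on="school_id", how="left"))

# MERGE first_name and last_name with a space separator.
df["student_name"] = df["first_name"] + " " + df["last_name"]

# Filter rows with marks >= 60.
top = df[df["marks"] >= 60]
print(top["student_name"].tolist())  # → ['Jane Doe']
```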
Running the DataBrew Recipe Job on the Complete Dataset
Now that we have constructed the recipe, we can create and execute a DataBrew recipe job.
- On the project details page, click Create job.
- Enter top-performer-student for Job name.
- For Job output settings, choose Parquet as the output format, specify your S3 output bucket as the output location, and then click Create and run job.
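The job can also be defined through the API. The payload below follows my understanding of DataBrew's CreateRecipeJob operation; the bucket, key, and role ARN are placeholders.

```python
# CreateRecipeJob payload sketch for the boto3 "databrew" client.
job_request = {
    "Name": "top-performer-student",
    "ProjectName": "my-rds-proj",
    "RoleArn": "arn:aws:iam::<account-id>:role/<databrew-role>",
    "Outputs": [
        {
            "Location": {"Bucket": "<your-output-bucket>",
                         "Key": "top-performer-student/"},
            "Format": "PARQUET",
        }
    ],
}
# To actually create and run it:
# import boto3
# databrew = boto3.client("databrew")
# databrew.create_recipe_job(**job_request)
# databrew.start_job_run(Name="top-performer-student")
```

Once the run finishes, the Parquet output in S3 is what the Athena external table is created over.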