Streamlining Data Integration, Analysis, and Visualization with AWS Glue, Amazon Athena, and Amazon QuickSight

Have you ever encountered multiple data sources in various formats that need to be analyzed together for valuable insights? It’s crucial to unify your data into a single, coherent dataset, regardless of its original format or source. In this post, I will guide you through utilizing AWS Glue to create a query-optimized, standardized dataset on Amazon S3 from three different datasets and formats. We’ll then leverage Amazon Athena and Amazon QuickSight for quick and easy querying.

Overview of AWS Glue

AWS Glue is a fully managed extract, transform, and load (ETL) service that simplifies preparing and loading data for analysis. With just a few clicks in the AWS Management Console, you can create and run an ETL job. AWS Glue points to your data stored on AWS, where a crawler discovers and classifies it and stores the metadata in the AWS Glue Data Catalog. Once cataloged, your data is searchable, queryable, and ready for ETL. AWS Glue automatically generates customizable, reusable Python code to transform your data and load it into a target store for analysis.

The AWS Glue ETL engine generates Python code that is fully customizable and portable. You can edit this code with your preferred IDE or notebook and share it through GitHub. After your ETL job is configured, it can be scheduled to run in the fully managed, scalable Apache Spark environment provided by AWS Glue, which includes a flexible scheduler with job monitoring and alerting features.

AWS Glue operates in a serverless manner, automatically provisioning the necessary environment to complete jobs, with customers paying only for the compute resources utilized during ETL tasks. This allows for quick availability of data for analytical purposes.

Implementation Walkthrough

In this article, we’ll explore an example using the New York City Taxi Records dataset. While we will focus on January 2016 data, the same approach can be applied to the entire eight years of data. At the time of writing, AWS Glue is accessible in the US-East-1 (N. Virginia) region.

When you crawl the dataset, you will find that it arrives in different formats depending on the type of taxi. The next steps convert the data into a canonical format so it can be analyzed and visualized, all without launching any servers.

Data Discovery

To analyze all taxi rides from January 2016, start with the dataset stored in S3. In AWS Glue, create a new database for your project; a database is a logical grouping of associated table definitions. Note that database names in Athena are all lowercase.

You’ll then add a new crawler to infer the schemas, structures, and properties of your data.

  1. Under Crawlers, select “Add crawler.”
  2. Name it “NYCityTaxiCrawler.”
  3. Choose an IAM role for the crawler.
  4. For Data Store, select S3.
  5. For Crawl data in, choose “Specified path in another account.”
  6. Input the Include path as “s3://serverless-analytics/glue-blog.”
  7. Select “No” for adding another data store.
  8. Choose “On demand” for frequency. This allows for customization of the crawler’s running schedule.
  9. Configure the crawler’s output database and prefix:
    • For Database, select “nycitytaxianalysis.”
    • For Prefix added to tables (optional), enter “blog_.”
  10. Finish and run it immediately.

The crawler will run and indicate that it has identified three tables.
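If you prefer to script this setup rather than click through the console, the same database and crawler can be created with the AWS SDK for Python (boto3). The following is a minimal sketch using the names from the steps above; the IAM role ARN is a placeholder you must replace with a role that can read the S3 path and update the Data Catalog.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create the database that will hold the crawled table definitions.
glue.create_database(DatabaseInput = {"Name": "nycitytaxianalysis"})

# Create an on-demand crawler over the public sample data, writing tables
# prefixed with "blog_" into that database.
glue.create_crawler(
    Name = "NYCityTaxiCrawler",
    Role = "arn:aws:iam::<YOUR-ACCOUNT-ID>:role/<YOUR-GLUE-ROLE>",
    DatabaseName = "nycitytaxianalysis",
    TablePrefix = "blog_",
    Targets = {"S3Targets": [{"Path": "s3://serverless-analytics/glue-blog"}]}
)

# Run the crawler once, on demand.
glue.start_crawler(Name = "NYCityTaxiCrawler")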

By checking under Tables, you will find the three newly created tables in the "nycitytaxianalysis" database. The crawler used built-in classifiers to recognize the tables as CSV, deducing the columns and data types while collecting properties for each table. In reviewing the blog_yellow table, for example, you can see that it contains 8.7 million rows for January 2016, along with its S3 location and various columns.
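Because the crawled tables are now registered in the AWS Glue Data Catalog, they are immediately queryable from Amazon Athena. As a quick sanity check, you could issue a row count from Python with boto3; this is a minimal sketch, and the query-results location is a placeholder bucket you must replace with one you own.

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Count the rows the crawler cataloged in the yellow-taxi table.
response = athena.start_query_execution(
    QueryString = "SELECT COUNT(*) FROM blog_yellow",
    QueryExecutionContext = {"Database": "nycitytaxianalysis"},
    ResultConfiguration = {"OutputLocation": "s3://<YOUR-BUCKET>/athena-results/"}
)
print("Query execution ID:", response["QueryExecutionId"])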

Optimizing Queries and Standardizing Data

Next, create an ETL job to restructure this data into a query-optimized format. In a future post, I will explain how to partition the query-optimized data. The job converts the data to a columnar Parquet format and saves it to a bucket that you own.

  1. Create a new ETL job and name it “NYCityTaxiYellow.”
  2. Select an IAM role that has permissions for writing to a new S3 location in your bucket.
  3. Specify the script and temporary space location on S3.
  4. Choose the blog_yellow table as your data source.
  5. Indicate a new location (a prefix without existing objects) for storing results.
  6. In the transformation step, rename the pickup and dropoff date fields to standardized names and change their data types to timestamps. The updated table should reflect the following:
Old Name                Target Name    Target Data Type
tpep_pickup_datetime    pickup_date    timestamp
tpep_dropoff_datetime   dropoff_date   timestamp
  7. Click Next, then Finish.
  8. AWS Glue generates a script for you (a sketch of the key mapping step appears after this list). Alternatively, you can start with a blank script or provide your own.
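The generated script reads the cataloged table into a DynamicFrame and applies the mapping defined above. The exact code AWS Glue produces will differ (it maps every column in the table); the following is a minimal sketch of just the read and renaming steps, using the database, table, and field names from this walkthrough.

import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Standard job setup, as in the generated script.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the cataloged yellow-taxi table as a DynamicFrame.
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "nycitytaxianalysis",
    table_name = "blog_yellow",
    transformation_ctx = "datasource0"
)

# Rename the pickup/dropoff fields and cast them to timestamps.
# (Only the two renamed fields are shown here for brevity.)
applymapping1 = ApplyMapping.apply(
    frame = datasource0,
    mappings = [
        ("tpep_pickup_datetime", "string", "pickup_date", "timestamp"),
        ("tpep_dropoff_datetime", "string", "dropoff_date", "timestamp")
    ],
    transformation_ctx = "applymapping1"
)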

Since AWS Glue utilizes Apache Spark, you can seamlessly switch between an AWS Glue DynamicFrame and a Spark DataFrame for more advanced operations. You can easily convert back to continue using the transforms and tables from the catalog.

To illustrate this, convert to a DataFrame and add a new field indicating the taxi type. Implement the following in PySpark:

from pyspark.sql.functions import lit
from awsglue.dynamicframe import DynamicFrame 

# Convert to Spark DataFrame...
customDF = <DYNAMIC_FRAME_NAME>.toDF()

# Add a new column for "type"
customDF = customDF.withColumn("type", lit('yellow'))

# Convert back to DynamicFrame for further processing.
customDynamicFrame = DynamicFrame.fromDF(customDF, glueContext, "customDF_df")

Finally, modify the last data sink call so that it writes the new custom DynamicFrame created from the Spark DataFrame:

datasink4 = glueContext.write_dynamic_frame.from_options(
    frame = customDynamicFrame, 
    connection_type = "s3", 
    connection_options = {"path": "s3://<YOURBUCKET AND PREFIX/>"}, 
    format = "parquet", 
    transformation_ctx = "datasink4"
)

Finally, save and run the ETL job.
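If you would rather start the job programmatically than from the console, you can do so with boto3 as well; this is a minimal sketch, assuming the job name used above.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Start the ETL job created above and print the run ID for tracking.
run = glue.start_job_run(JobName = "NYCityTaxiYellow")
print("Job run ID:", run["JobRunId"])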
