Leveraging Data Warehouse Tables with Machine Learning and Amazon SageMaker Notebooks


Are you a data scientist eager to analyze data warehouse tables within your machine learning (ML) environment? If so, keep reading.

In this article, I will show you how to perform exploratory analysis, directly from an Amazon SageMaker notebook, on large datasets that live in your data warehouse and are cataloged in the AWS Glue Data Catalog. You will learn how to locate and inspect datasets in your corporate data warehouse from a Jupyter notebook running on Amazon SageMaker, extract the relevant subset from Amazon Redshift into Amazon EMR for further transformation, and then continue your analysis and visualization in the notebook without interrupting your workflow.

Overview of This Article

  • Estimated Reading Time: 25 minutes
  • Estimated Completion Time: 2 hours
  • Cost to Complete: Under $10
  • Learning Level: Expert (400)
  • AWS Services Utilized: Amazon Redshift, Amazon EMR, AWS Glue, Amazon SageMaker

Before diving in, you may want to review some prior posts, such as “Build Amazon SageMaker notebooks backed by Spark in Amazon EMR” or “Access Amazon S3 data managed by AWS Glue Data Catalog from Amazon SageMaker notebooks.” These resources can provide valuable context.

Understanding Amazon SageMaker

Amazon SageMaker is a fully managed ML service that enables data scientists and developers to quickly build, train, and deploy ML models into a production-ready environment. It features an integrated Jupyter authoring environment for initial data exploration, analysis, and model building.

However, the challenge lies in identifying the datasets that matter. When data is stored in a data warehouse, you need to extract the relevant subset and load it into your Jupyter notebook for in-depth exploration or modeling. As datasets grow larger and more numerous, extracting all potentially interesting datasets to load into your notebook becomes impractical and hampers productivity. In fact, this data combination and exploration can occupy up to 80% of a data scientist’s time, making it crucial to enhance efficiency to accelerate ML project completion.

Many organizations are adopting Amazon Redshift as their preferred data warehouse. It supports complex analytic queries against petabytes of structured data, using advanced query optimization, columnar storage on high-performance local disks, and massively parallel query execution. These capabilities attract data scientists seeking valuable insights. To run ML workloads, however, the relevant data must first be moved to an ML platform. Amazon Redshift is well suited to the first step: joining and filtering the source tables so that you extract only the subset needed for ML-specific transformations.

Often, large enterprises utilize AWS Glue to manage their data lakes. AWS Glue is a fully managed ETL (extract, transform, load) service that simplifies the categorization, cleaning, enrichment, and reliable movement of data across various stores. It includes a central metadata repository called the AWS Glue Data Catalog, which allows for the search and querying of enriched and categorized data in the data lake. This metadata helps identify the names, locations, content, and characteristics of datasets of interest.
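To make the catalog-discovery step concrete, here is a minimal sketch of browsing the Data Catalog from a notebook with boto3. The region and database name are hypothetical placeholders, and it assumes your notebook's IAM role allows glue:GetDatabases and glue:GetTables:

```python
import boto3

# Hypothetical region; use the one that hosts your Data Catalog.
glue = boto3.client("glue", region_name="us-east-1")

# List every database registered in the catalog.
for db in glue.get_databases()["DatabaseList"]:
    print("Database:", db["Name"])

# Inspect the tables in one (placeholder) database: names, storage
# locations, and column schemas. Large catalogs would need a paginator.
for table in glue.get_tables(DatabaseName="my_warehouse_db")["TableList"]:
    location = table["StorageDescriptor"].get("Location", "n/a")
    columns = [col["Name"] for col in table["StorageDescriptor"]["Columns"]]
    print(table["Name"], location, columns)
```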

Even after filtering and joining data in Amazon Redshift, the remaining dataset may still be too large for your notebook to handle efficiently. For working with extremely large datasets, Apache Spark on EMR is an ideal solution.

Apache Spark is a cluster-computing framework with built-in modules for analytics in several languages, including Python, Java, and Scala. Spark on EMR scales well, making it suitable for the large datasets often found in corporate data lakes. If your datasets are already cataloged in the AWS Glue Data Catalog, accessing them from EMR is straightforward, because the Data Catalog can serve as the cluster's external Apache Hive Metastore. And in Spark you can use Python and Spark MLlib to perform complex transformations that go beyond what SQL can express, making it an excellent platform for further processing.
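As an illustration of a transformation that is awkward in SQL but natural in Spark, here is a sketch using Spark MLlib's feature transformers. The toy DataFrame and its column names are placeholders standing in for the reduced dataset you would pull from Amazon Redshift:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for a reduced dataset extracted from the warehouse.
orders = spark.createDataFrame(
    [(1, 120.0, 3, 14.0), (2, 45.5, 1, 2.0), (3, 310.0, 7, 30.0)],
    ["customer_id", "order_total", "item_count", "days_since_last_order"])

# Assemble the numeric columns into one feature vector, then standardize
# it -- the kind of ML-specific step that SQL doesn't express well.
assembler = VectorAssembler(
    inputCols=["order_total", "item_count", "days_since_last_order"],
    outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features",
                        withMean=True, withStd=True)

assembled = assembler.transform(orders)
scaled = scaler.fit(assembled).transform(assembled)
scaled.select("customer_id", "features").show(truncate=False)
```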

In this approach, you will use Amazon Redshift to join and filter your source data. Next, you will iteratively transform the resulting reduced dataset using EMR for heavy processing. Throughout, your Amazon SageMaker notebook remains available for exploring and visualizing relevant data subsets. Each task—joining and filtering, complex transformation, and visualization—has been delegated to the service best suited for it.

Solution Overview

The first section involves querying the AWS Glue Data Catalog to find the desired database and review the tables and their definitions. These table declarations indicate the data location, which in this case is Amazon Redshift. The AWS Glue Data Catalog also provides the necessary information to build the Amazon Redshift connection string for data retrieval.
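Here is a sketch of that lookup, assuming the table of interest was crawled through an AWS Glue JDBC connection (in which case the crawler records the connection name in the table parameters). All of the names below are placeholders:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Fetch the table definition for a placeholder database and table.
table = glue.get_table(DatabaseName="my_warehouse_db", Name="orders")["Table"]

# JDBC-crawled tables carry the name of the Glue connection used to crawl
# them; that connection holds the pieces of the JDBC connection string.
conn_name = table["Parameters"]["connectionName"]
conn = glue.get_connection(Name=conn_name)["Connection"]
props = conn["ConnectionProperties"]

jdbc_url = props["JDBC_CONNECTION_URL"]  # e.g. jdbc:redshift://host:5439/dev
username = props["USERNAME"]
password = props["PASSWORD"]
print(jdbc_url)
```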

In the second part, if the dataset extracted from Amazon Redshift is too large to load directly into your notebook, you read it into EMR instead. The cluster-compute power of EMR provides the scalability you need.
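Here is a sketch of that read, executed on the EMR cluster (for example, through a SparkMagic/Livy session where a SparkSession named spark already exists). It assumes the Amazon Redshift JDBC driver jar is on the cluster's classpath; jdbc_url, username, and password come from the Glue connection lookup above, and the join-and-filter query is a placeholder:

```python
# A placeholder join-and-filter that Redshift executes; only the reduced
# result set travels over JDBC into the Spark DataFrame.
query = """
    SELECT c.customer_id, c.segment, o.order_total, o.item_count
    FROM sales.orders o
    JOIN sales.customers c ON o.customer_id = c.customer_id
    WHERE o.order_date >= '2018-01-01'
"""

df = (spark.read
      .format("jdbc")
      .option("url", jdbc_url)
      .option("driver", "com.amazon.redshift.jdbc42.Driver")
      .option("dbtable", "({}) AS src".format(query))
      .option("user", username)
      .option("password", password)
      .load())

df.cache()
print(df.count())
```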

If all of the following conditions apply, a simpler solution exists; for more details, see another post on the topic, the Amazon Redshift access demo sample notebook. A minimal sketch of that direct approach appears after the list.

  • You know the Amazon Redshift cluster containing the data of interest.
  • You have the Amazon Redshift connection information.
  • The data you’re extracting and exploring is small enough to transfer over a single JDBC connection.
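As promised, here is a minimal sketch of that direct path, pulling a modest result set straight into pandas in the notebook. The cluster endpoint, credentials, schema, and row limit are all placeholders, and it assumes a PostgreSQL-compatible driver such as psycopg2 is installed in the notebook environment:

```python
import pandas as pd
import psycopg2  # e.g., pip install psycopg2-binary in the notebook

# Placeholder connection details; substitute your own cluster endpoint.
conn = psycopg2.connect(
    host="my-cluster.xxxxxxxxxxxx.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="masteruser",
    password="****")

# Keep the result set small: everything returned has to fit in the
# notebook instance's memory after crossing a single connection.
df = pd.read_sql(
    "SELECT customer_id, order_total FROM sales.orders LIMIT 10000", conn)
conn.close()

df.describe()
```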

This solution utilizes four AWS services along with some open-source components:

  1. An Amazon SageMaker notebook instance, providing hosted Jupyter notebook IDEs for data exploration and preprocessing.
    • Jupyter notebooks
    • SparkMagic: A set of tools for interactively working with remote Spark clusters via Livy in Jupyter. SparkMagic includes magics for running Spark code in multiple languages.
  2. An EMR cluster running Apache Spark, which includes:
    • Apache Livy: A service that provides a REST interface to Spark on the EMR cluster, allowing your Jupyter notebook to submit Spark code and statements to the cluster remotely.
  3. The AWS Glue Data Catalog, serving as the central metadata repository and functioning as your external Hive Metastore for big data applications on EMR.
  4. Amazon Redshift, acting as your data warehouse.

In this setup, Spark on the EMR cluster reads from Amazon Redshift over a JDBC connection, while your Amazon SageMaker notebook drives the whole workflow remotely through Livy.
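To close the loop back in the notebook, here is a sketch of how SparkMagic brings a sample of the remote data home for visualization. It assumes a SparkMagic (PySpark) kernel, that the DataFrame df from the JDBC read was registered as a temporary view, and that the view name is a placeholder; the three pieces below are separate notebook cells:

```python
# Cell 1 -- runs on EMR via Livy: expose the Spark DataFrame to SQL.
df.createOrReplaceTempView("filtered_orders")

# Cell 2 -- SparkMagic's %%sql magic runs the query on the cluster, and
# -o copies up to -n rows of the result into a local pandas DataFrame.
%%sql -o local_sample -n 1000
SELECT customer_id, order_total FROM filtered_orders

# Cell 3 -- %%local runs in the notebook's own Python kernel, where
# local_sample is now an ordinary pandas DataFrame you can plot.
%%local
local_sample["order_total"].hist(bins=30)
```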


