
Data lakes have become an industry standard for storing essential business data. The main purpose of a data lake is to accommodate all varieties of data, from raw to preprocessed and postprocessed formats, spanning both structured and unstructured types. With a centralized repository for all of this data, big data applications can efficiently load, transform, and process whatever data they need. A key advantage is the ability to store data in its original form, without structuring or transforming it first. Most importantly, data lakes provide controlled access to data for diverse analytics and machine learning (ML) tasks, improving decision-making.

Various vendors have developed data lake architectures, including AWS Lake Formation, and open-source solutions are also available that let companies easily access, load, and share their data. One such open-source option that works with storage in the AWS Cloud is Delta Lake. The Delta Lake library supports reading and writing data in the open-source Apache Parquet file format, and provides ACID transactions, scalable metadata handling, and unified streaming and batch data processing. Delta Lake offers a storage layer API that lets you store data on top of an object store such as Amazon Simple Storage Service (Amazon S3).

Data is crucial for ML—training a traditional supervised model cannot be accomplished without access to high-quality historical data typically found in a data lake. Amazon SageMaker serves as a fully managed platform that offers a flexible environment for developing ML solutions, providing specialized tools for data ingestion, processing, model training, and hosting. Apache Spark acts as a robust engine for modern data processing, offering a comprehensive API for data loading and manipulation. SageMaker can handle data preparation at petabyte scale using Spark, which enables distributed ML workflows. This article demonstrates how to leverage the capabilities of Delta Lake through Amazon SageMaker Studio.

Solution Overview

In this discussion, we will explore how to utilize SageMaker Studio notebooks to seamlessly load and transform data stored in Delta Lake format. A standard Jupyter notebook will be employed to execute Apache Spark commands that read and write table data in CSV and Parquet formats. The open-source library delta-spark enables direct access to this data in its native format, allowing users to apply various API operations for data transformations, schema modifications, and time-travel or as-of-timestamp queries to retrieve specific data versions.
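To give a sense of what those version-aware reads look like, here is a minimal sketch (not the exact code from the sample notebook). It assumes an initialized SparkSession named spark, which we create later in this post, and an existing Delta table at an assumed path s3a_delta_table_uri:

# Sketch only: reading earlier versions of a Delta table with delta-spark.
# Assumes an initialized SparkSession (`spark`) and an existing Delta table
# at the assumed path `s3a_delta_table_uri`.

# Read the current version of the table
current_df = spark.read.format("delta").load(s3a_delta_table_uri)

# Time travel: read the table as it existed at a specific version
v1_df = (spark.read.format("delta")
    .option("versionAsOf", 1)
    .load(s3a_delta_table_uri))

# As-of-timestamp: read the table as it existed at a point in time
ts_df = (spark.read.format("delta")
    .option("timestampAsOf", "2022-01-01 00:00:00")
    .load(s3a_delta_table_uri))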

In our sample notebook, we will load raw data into a Spark DataFrame, create a Delta table, perform queries, display audit history, demonstrate schema evolution, and illustrate several methods for updating table data. The DataFrame API from the PySpark library will be used for ingesting and transforming dataset attributes. The delta-spark library will facilitate reading and writing data in Delta Lake format and manipulating the underlying table structure, known as the schema.
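As a rough sketch of those operations (again, not the exact notebook code), the following shows how a Delta table can be created from a DataFrame, how its audit history can be displayed, and how schema evolution and in-place updates can be performed. The added column, the update condition, and the path s3a_delta_table_uri are illustrative assumptions:

# Sketch only: Delta table creation, history, schema evolution, and updates.
# Assumes an initialized SparkSession (`spark`), a loaded DataFrame (`loans_df`),
# and an assumed target path `s3a_delta_table_uri`.
from delta.tables import DeltaTable
from pyspark.sql.functions import lit

# Create the Delta table by writing the DataFrame in Delta format
loans_df.write.format("delta").mode("overwrite").save(s3a_delta_table_uri)

# Display the audit history (one row per table version)
delta_table = DeltaTable.forPath(spark, s3a_delta_table_uri)
delta_table.history().show()

# Schema evolution: append data that carries an extra column by enabling mergeSchema
new_df = loans_df.withColumn("ingest_batch", lit("batch-2"))  # hypothetical new column
(new_df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save(s3a_delta_table_uri))

# Update existing rows in place (hypothetical column and condition)
delta_table.update(
    condition="loan_status = 'Fully Paid'",
    set={"loan_status": "'FullyPaid'"})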

We will be using SageMaker Studio, which is the integrated development environment (IDE) from SageMaker, to generate and execute Python code via a Jupyter notebook. A GitHub repository has been established, containing this notebook along with additional resources for you to run this sample independently. The notebook will showcase how to interact with data stored in Delta Lake format, which allows tables to be accessed directly without replicating data across different storage systems.

In this example, we will use a publicly available dataset from Lending Club that contains customer loan data. We downloaded the accepted-loans data file (accepted_2007_to_2018Q4.csv.gz) and selected a subset of the original attributes. This dataset is available under the Creative Commons (CC0) license.

Prerequisites

Before utilizing the delta-spark functionality, a few prerequisites need to be installed. To meet the required dependencies, certain libraries must be added to our Studio environment, which operates as a Dockerized container accessed through a Jupyter Gateway app:

  • OpenJDK for access to Java and related libraries
  • PySpark (Spark for Python) library
  • Delta Spark open-source library

These libraries can be installed using either conda or pip, which are publicly accessible via conda-forge, PyPI servers, or Maven repositories.

This notebook is designed to function within SageMaker Studio. After launching it in Studio, ensure you select the Python 3 (Data Science) kernel type. It is recommended to use an instance type with at least 16 GB of RAM (such as ml.g4dn.xlarge) to enhance the performance of PySpark commands. The following commands will install the necessary dependencies, which make up the initial cells of the notebook:

%conda install openjdk -q -y
%pip install pyspark==3.2.0
%pip install delta-spark==1.1.0
%pip install -U "sagemaker>2.72"

Once the installation commands are executed, we will be prepared to run the core logic in the notebook.

Implementing the Solution

To execute Apache Spark commands, it is essential to instantiate a SparkSession object. After including the requisite import commands, we will configure the SparkSession by setting additional parameters. The parameter with the key spark.jars.packages specifies the names of additional libraries used by Spark to execute delta commands. The initial lines of code will compile a list of packages using the conventional Maven coordinates (groupId:artifactId:version) for passing these additional libraries to the SparkSession.

Furthermore, the parameters with keys spark.sql.extensions and spark.sql.catalog.spark_catalog enable Spark to correctly manage Delta Lake functionality. The last configuration parameter with key fs.s3a.aws.credentials.provider adds the ContainerCredentialsProvider class, enabling Studio to retrieve AWS Identity and Access Management (IAM) role permissions available through the container environment. The code will create a properly initialized SparkSession object for the SageMaker Studio environment:

# Import SparkSession so we can configure and build the Spark entry point
from pyspark.sql import SparkSession

# Configure Spark to use additional library packages to satisfy dependencies

# Build list of package entries using Maven coordinates (groupId:artifactId:version)
pkg_list = []
pkg_list.append("io.delta:delta-core_2.12:1.1.0")
pkg_list.append("org.apache.hadoop:hadoop-aws:3.2.2")

packages = ",".join(pkg_list)
print('packages: ' + packages)

# Instantiate Spark via builder
# Note: we use the `ContainerCredentialsProvider` to give us access to underlying IAM role permissions

spark = (SparkSession
    .builder
    .appName("PySparkApp")
    .config("spark.jars.packages", packages)
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("fs.s3a.aws.credentials.provider", 
"com.amazonaws.auth.ContainerCredentialsProvider") 
    .getOrCreate()) 

sc = spark.sparkContext

print('Spark version: ' + str(sc.version))

In the subsequent section, we will upload a file containing a subset of the Lending Club consumer loans dataset to our default S3 bucket. The original dataset is quite large (over 600 MB), so we will provide a single representative file (2.6 MB) for use in this notebook. PySpark uses the s3a protocol to enable additional Hadoop library functionality. As a result, we will change each native S3 URI from the s3 protocol to use s3a in the cells throughout this notebook.
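A hedged sketch of this staging step is shown below, using the upload helper from the SageMaker Python SDK. The local file name and key prefix are assumptions for illustration:

# Sketch: upload the sample CSV to the default SageMaker bucket and build the s3a URI.
# The local file name and key prefix are illustrative assumptions.
import sagemaker

sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()

raw_s3_uri = sagemaker_session.upload_data(
    path="lending_club_subset.csv",      # assumed local file name
    bucket=bucket,
    key_prefix="delta-lake-demo/raw")

# PySpark's Hadoop integration expects the s3a scheme instead of s3
s3a_raw_csv = raw_s3_uri.replace("s3://", "s3a://")
print(s3a_raw_csv)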

We will employ Spark to read the raw data (with options for both CSV and Parquet files) using the following code, which returns a Spark DataFrame named loans_df:

loans_df = spark.read.csv(s3a_raw_csv, header=True)
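If the raw data were staged in Parquet instead, the equivalent read would use spark.read.parquet; a quick schema and sample check is also useful after ingestion. The Parquet path below is an assumption:

# Assumed alternative: read a Parquet copy of the raw data from an assumed s3a path
loans_parquet_df = spark.read.parquet(s3a_raw_parquet)

# Quick inspection of the ingested DataFrame
loans_df.printSchema()
loans_df.show(5)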
