Utilizing Amazon SageMaker for Enhancing Machine Learning Models in Data Analysis

By: Alex Martinez, AWS Solutions Architect at Cloud Innovations
On: 30 MAR 2020
In: Amazon Machine Learning, Amazon SageMaker, Analytics, Artificial Intelligence, Cloud Innovations, Case Study, Customer Solutions, Intermediate (200)

Enterprises need not just any data to enhance their decision-making processes; they require the right type of data. Crafting the appropriate questions to extract meaningful insights from vast datasets can be a formidable challenge.

Typically, data scientists resort to MapReduce tools like Hadoop for data storage, Pig or Hive for data retrieval, and programming languages like Python or Java for writing Spark and Hive applications. However, the diversity in data formats, methodologies, devices, and locations complicates the creation of algorithms that yield the desired outcomes.

This is where machine learning (ML) proves invaluable. Data scientists can specify the type of data they wish to generate, and through iterative processing, they train the ML algorithms to meet their needs.

In reality, developing machine learning models is a complex, costly, and iterative endeavor, often hindered by a lack of integrated tools. Consequently, data scientists must piece together various tools and workflows, which may introduce errors into their data and algorithms.

Amazon SageMaker addresses this challenge by offering a comprehensive toolkit for machine learning, streamlining the path to production with significantly reduced effort and costs. This platform includes preconfigured Python libraries for training ML algorithms and deploying custom data analysis models.

Cloud Innovations has partnered with high-profile clients, including a prominent luxury retailer, to build machine learning models that generate personalized recommendations from customers' previous browsing and shopping behavior.

In this article, we will detail our data modeling process, the tools utilized, and the outcomes achieved.

Our Data Modeling Approach

As an integral part of the AWS Cloud Center of Excellence (CCOE) team at the luxury retailer, Cloud Innovations played a key role in assisting their internal product teams with architecture reviews, best practices, governance compliance, and creating frameworks that drive business transformation using AWS.

We employed Amazon SageMaker to develop and train ML algorithms focused on recommendation, personalization, and forecasting models tailored to the retailer’s requirements.
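
To make the idea of a recommendation model concrete, here is a minimal sketch of an item co-occurrence recommender. It is not the retailer's model, which the article does not describe. The session data and the function names `build_cooccurrence` and `recommend` are invented for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(sessions):
    """Count how often each pair of items appears in the same session."""
    co = defaultdict(Counter)
    for items in sessions:
        for a, b in combinations(sorted(set(items)), 2):
            co[a][b] += 1
            co[b][a] += 1
    return co


def recommend(co, history, k=3):
    """Score unseen items by how often they co-occur with items in the history."""
    seen = set(history)
    scores = Counter()
    for item in seen:
        for other, count in co[item].items():
            if other not in seen:
                scores[other] += count
    # Sort by score (highest first), then by name so ties have a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]


# Hypothetical browsing sessions, invented for illustration only.
sessions = [
    ["handbag", "scarf", "wallet"],
    ["handbag", "wallet"],
    ["scarf", "gloves"],
    ["handbag", "scarf"],
]
co = build_cooccurrence(sessions)
print(recommend(co, ["handbag"]))
```

A production model would likely weight recency and purchase events differently from page views; this sketch treats every co-occurrence equally.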

Our data modeling process typically involves the following steps:

Data Cleaning and Transformation: Refining and preparing data for analysis.
Exploratory Data Analysis (EDA): Gathering summary statistics and insights from the transformed data.
Model Development and Testing: Creating and evaluating recommendation and predictive models.
Model Tuning and Enhancement: Optimizing data models to achieve business goals.
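
The four stages above can be sketched end to end on a toy forecasting problem. This is an illustrative outline only: the sales figures are made up, and the moving-average model stands in for whatever models the team actually built.

```python
from statistics import mean, pstdev


def clean(raw):
    """Data cleaning: drop missing and negative readings."""
    return [float(x) for x in raw if x is not None and x >= 0]


def summarize(series):
    """EDA: basic summary statistics of the cleaned series."""
    return {
        "n": len(series),
        "mean": mean(series),
        "stdev": pstdev(series),
        "min": min(series),
        "max": max(series),
    }


def forecast(series, window):
    """Model: predict the next value as the mean of the last `window` values."""
    return mean(series[-window:])


def backtest_mae(series, window):
    """Testing: mean absolute error of one-step-ahead forecasts."""
    errors = [
        abs(forecast(series[:t], window) - series[t])
        for t in range(window, len(series))
    ]
    return mean(errors)


def tune(series, windows):
    """Tuning: pick the window with the lowest backtest error."""
    return min(windows, key=lambda w: backtest_mae(series, w))


raw = [120, None, 130, 125, -1, 140, 150, 145, 160]  # made-up weekly sales
series = clean(raw)
print(summarize(series))
best = tune(series, windows=[1, 2, 3])
print(best, forecast(series, best))
```

Each function maps to one stage, so a stage can be swapped out (for example, a different model in `forecast`) without touching the others.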

When we enhanced our data modeling process with Amazon SageMaker, it evolved into these key steps:

Building the Data Lake.
Configuring Amazon SageMaker.
Scheduling Workloads.

Step 1: Building the Data Lake

We needed to establish a scalable data lake that would enable the retailer’s data scientists to efficiently collect, store, and query their existing data. Initially, the data was scattered across various on-premises servers, making it challenging to analyze from a single platform. Additionally, the limitations of on-premises storage hindered data analysis operations that could generate new datasets.

To create the data lake, we utilized a Snowflake data warehouse for extracting, transforming, and loading (ETL) data between databases, custom Hive tables for frequently accessed data, and Amazon Simple Storage Service (Amazon S3) as a unified storage endpoint for the retailer’s data science team.

The management capabilities of Amazon S3 allowed us to preserve existing data and any additional data we generated without the risk of overwriting or losing it. By utilizing S3’s storage tiers, we managed to cut costs effectively.
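
As a rough illustration of how storage tiers can be applied, the sketch below builds an S3 lifecycle configuration that moves objects to colder storage classes as they age. The prefix, the day thresholds, and the helper names are assumptions for this example; the article does not state which rules were actually used. The boto3 call runs only if you invoke `apply_lifecycle` with valid AWS credentials.

```python
def lifecycle_rules(prefix, ia_after_days=30, glacier_after_days=180):
    """Build a lifecycle configuration that moves objects to colder tiers."""
    if glacier_after_days <= ia_after_days:
        raise ValueError("glacier_after_days must exceed ia_after_days")
    rule_id = "tier-" + prefix.strip("/").replace("/", "-")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


def apply_lifecycle(bucket, config):
    """Send the configuration to S3 (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the rule builder works without boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )


# Hypothetical prefix, chosen for illustration.
config = lifecycle_rules("raw/clickstream/")
print(config["Rules"][0]["Transitions"])
```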

Step 2: Configuring Amazon SageMaker

One significant advantage of using AWS over traditional on-premises solutions is the ability to scale compute and storage resources according to demand, optimizing costs for customers.

To make the most of AWS's elastic resources, we used Amazon EMR, a managed cluster platform that runs frameworks such as Apache Hadoop and Apache Spark to process large volumes of data in parallel across distributed clusters.

With Amazon EMR, we ran our Spark and Hive jobs in parallel, reducing the time required for data retrieval tasks by 40%. The retailer aimed to consistently innovate and enhance customer experiences, and Amazon SageMaker facilitated the rapid development, deployment, and scaling of ML models to elevate these experiences.
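
One way to queue Spark and Hive jobs on the same cluster is to submit them as EMR steps. The sketch below builds the step definitions and, if called, submits them with boto3. The script locations, step names, and cluster ID are placeholders, and the article does not say the jobs were submitted this way. Steps run concurrently only when the cluster's step concurrency level is greater than 1.

```python
def spark_step(name, script_uri):
    """Step definition that runs a Spark application via spark-submit."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
        },
    }


def hive_step(name, script_uri):
    """Step definition that runs a Hive script stored in S3."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script", "--args", "-f", script_uri],
        },
    }


def submit_steps(cluster_id, steps):
    """Submit the steps to a running cluster (needs boto3 and credentials)."""
    import boto3  # imported here so the builders work without boto3

    emr = boto3.client("emr")
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=steps)
    return response["StepIds"]


# Placeholder script locations, invented for illustration.
steps = [
    spark_step("sessionize-clicks", "s3://example-bucket/jobs/sessionize.py"),
    hive_step("daily-orders", "s3://example-bucket/jobs/daily_orders.q"),
]
print([step["Name"] for step in steps])
```

Using `CONTINUE` for `ActionOnFailure` keeps one failed job from cancelling the others, which suits independent retrieval tasks.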

“In just a few clicks, data scientists can activate various instances to develop multiple models in parallel, and then ensemble those models to fulfill business needs, significantly shortening the time to market,” states Jamie Walker, Director of Data Science at the luxury retailer. “Cloud Innovations streamlined our migration and quickly brought us up to speed with AWS services, thereby enhancing our development team’s efficiency.”
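
Ensembling, as mentioned in the quote, can be as simple as averaging the scores that several models assign to the same items. The sketch below shows a weighted average; the model names and scores are invented, and the article does not say which ensembling method the retailer used.

```python
def ensemble_scores(model_scores, weights=None):
    """Weighted average of per-item scores from several models."""
    names = list(model_scores)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    items = set().union(*(model_scores[name].keys() for name in names))
    combined = {}
    for item in items:
        # A model that did not score an item contributes 0 for that item.
        weighted = sum(
            weights[name] * model_scores[name].get(item, 0.0) for name in names
        )
        combined[item] = weighted / total
    return combined


# Hypothetical outputs from two separately trained models.
scores = {
    "collaborative": {"scarf": 0.8, "wallet": 0.4},
    "content_based": {"scarf": 0.6, "gloves": 0.5},
}
combined = ensemble_scores(scores)
top = sorted(combined, key=lambda item: (-combined[item], item))
print(top)
```

Passing a `weights` dictionary lets a stronger model count for more, for example `{"collaborative": 2.0, "content_based": 1.0}`.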

Moreover, Amazon SageMaker allowed the data scientists direct access to the Amazon EMR clusters for their Hive and Spark jobs. To ensure security, all AWS resources were confined to a private subnet within an Amazon Virtual Private Cloud (VPC).
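
A notebook instance can be placed in a private subnet by passing the subnet and security groups when it is created and turning off direct internet access. The sketch below builds such a request; the instance name, role ARN, subnet ID, and security group ID are placeholders, not values from the retailer's environment. With direct internet access disabled, the notebook reaches other services only through the VPC, so a NAT gateway or VPC endpoints are typically needed.

```python
def private_notebook_request(name, role_arn, subnet_id, security_group_ids,
                             instance_type="ml.t3.medium"):
    """Build a create_notebook_instance request confined to a VPC subnet."""
    if not security_group_ids:
        raise ValueError("at least one security group is required with a subnet")
    return {
        "NotebookInstanceName": name,
        "InstanceType": instance_type,
        "RoleArn": role_arn,
        "SubnetId": subnet_id,
        "SecurityGroupIds": list(security_group_ids),
        # Traffic goes through the VPC instead of a SageMaker-managed route.
        "DirectInternetAccess": "Disabled",
    }


def create_notebook(request):
    """Create the notebook instance (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the builder works without boto3

    sagemaker = boto3.client("sagemaker")
    return sagemaker.create_notebook_instance(**request)


# Placeholder identifiers, invented for illustration.
request = private_notebook_request(
    name="ds-team-notebook",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    subnet_id="subnet-0123456789abcdef0",
    security_group_ids=["sg-0123456789abcdef0"],
)
print(request["DirectInternetAccess"])
```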

Data scientists saved their code in Jupyter notebooks on Amazon SageMaker, where it remained secure, shareable, editable, and manageable.

Step 3: Scheduling Workloads

The following diagram illustrates the flow of data between the data lake, the data scientists, and our machine learning models.

As highlighted earlier, we maintained the jobs created by the retailer’s data scientists in Amazon SageMaker, ensuring a structured workflow.
