Utilizing Deep Learning Frameworks in Amazon SageMaker Processing


on 23 DEC 2021

in Amazon SageMaker, Artificial Intelligence

In the past, users looking to employ a deep learning (DL) framework within Amazon SageMaker Processing encountered greater complexity than those utilizing scikit-learn or Apache Spark. This article highlights how SageMaker Processing has streamlined the execution of machine learning (ML) preprocessing and postprocessing tasks with popular frameworks like PyTorch, TensorFlow, Hugging Face, MXNet, and XGBoost.

Advantages of SageMaker Processing

Training an ML model consists of numerous steps, one of which is data preparation—essential for building an accurate ML model. Typical preprocessing tasks include:

  • Converting datasets into the input format required by your chosen ML algorithm.
  • Transforming existing features into more expressive representations, such as one-hot encoding for categorical features.
  • Rescaling or normalizing numerical features.
  • Engineering high-level features, like substituting mailing addresses with GPS coordinates.
  • Cleaning and tokenizing text for natural language processing (NLP) tasks.
  • Resizing, centering, or augmenting images for computer vision applications.

Additionally, postprocessing tasks (e.g., filtering or collating) and model evaluation jobs (scoring models against different test sets) are crucial components of the ML model development lifecycle. All these tasks require executing custom scripts on your dataset and saving the processed versions for future training jobs.

In 2019, we introduced SageMaker Processing, a feature of Amazon SageMaker that allows you to run preprocessing, postprocessing, and model evaluation workloads on fully managed infrastructure. It handles the heavy lifting by managing the compute needed to execute your custom scripts, automatically provisioning and releasing resources as necessary.

The SageMaker Python SDK includes a Processing library that enables you to:

  • Utilize scikit-learn data processing features via a built-in container image provided by SageMaker. You can create an instance of the SKLearnProcessor class in the SageMaker Python SDK and supply it with your scikit-learn script.
  • Leverage Apache Spark for distributed data processing using another built-in container image from SageMaker. Similarly, you can instantiate the PySparkProcessor class and input your PySpark script.
  • Bring your own container for tasks requiring libraries or frameworks beyond scikit-learn and PySpark. By packaging your custom code in a container, you can instantiate the ScriptProcessor class with your container image and provide your data processing script.
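
For example, a minimal Processing job using the built-in SKLearnProcessor might look like the following; the script name and S3 paths are placeholders, not values from this article:

from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

role = get_execution_role()

# Built-in scikit-learn container image; no custom container to build or maintain
sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

# preprocess.py and the S3 locations are placeholders for your own script and data
sklearn_processor.run(
    code="preprocess.py",
    inputs=[ProcessingInput(source="s3://my-bucket/raw-data", destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output", destination="s3://my-bucket/processed-data")],
)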

Prior to the release of version 2.52 of the SageMaker Python SDK, utilizing SageMaker Processing with popular ML frameworks such as PyTorch, TensorFlow, Hugging Face, MXNet, and XGBoost necessitated creating your own container. This meant building a container that included the relevant framework and all required dependencies. We aimed to simplify the process for data scientists by removing the need to build custom container images for these widely used frameworks, delivering the same seamless experience that users enjoyed with Processing when employing scikit-learn or Spark.

In the sections that follow, we will demonstrate how to natively utilize popular ML frameworks like PyTorch, TensorFlow, Hugging Face, or MXNet with SageMaker Processing without the need to construct any containers.

Implementing Machine Learning/Deep Learning Frameworks in SageMaker Processing

The introduction of the FrameworkProcessor in version 2.52 of the SageMaker Python SDK in August 2021 revolutionized this process. You can now seamlessly use SageMaker Processing with your preferred ML frameworks, including PyTorch, TensorFlow, Hugging Face, MXNet, and XGBoost. ML practitioners can concentrate on refining their data processing code rather than expending effort on maintaining the lifecycle of custom containers. Now, you can utilize one of the built-in containers and classes provided by SageMaker to access the data processing features of the aforementioned frameworks. In this article, we will focus on testing PyTorch; however, the same procedures can be applied to the other supported frameworks. The variations between frameworks lie in the FrameworkProcessor subclass used, the framework version, and the specific requirements of each framework for the data processing script.

The Dataset

To illustrate our approach, let’s consider a scenario where we aim to train a model to classify images of animals. We will utilize the publicly available COCO dataset, which contains images sourced from Flickr, representing real-world data that is not pre-formatted or resized specifically for deep learning. This dataset serves as an apt example for our scenario. Before proceeding to the training stage, we need to address the inconsistency in image shapes and sizes to ensure they do not adversely affect our model’s quality.

The COCO dataset provides an annotation file that includes information about each image, such as class, superclass, file name, and URL for downloading. For this example, we will limit our focus to animal images. The data necessary for image labels and file paths is organized under different headings in the annotations for both the training and validation sets. We will only utilize a small subset of the dataset, which is sufficient for our example.
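
To make this concrete, a small sketch of that filtering step might look like the following; the annotation file name is a placeholder and the keys follow the standard COCO instances format, so treat this as an illustration rather than the exact logic in our processing script:

import json

# Load a COCO-style annotation file (file name is a placeholder)
with open("instances_train2017.json") as f:
    coco = json.load(f)

# IDs of categories whose supercategory is "animal"
animal_category_ids = {
    c["id"] for c in coco["categories"] if c["supercategory"] == "animal"
}

# IDs of images that carry at least one animal annotation
animal_image_ids = {
    a["image_id"] for a in coco["annotations"] if a["category_id"] in animal_category_ids
}

# File names and download URLs for those images
animal_images = [
    (img["file_name"], img["coco_url"])
    for img in coco["images"] if img["id"] in animal_image_ids
]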

Processing Logic

Before training our model, it is essential that all image data maintains uniform dimensions for length, width, and channels. Algorithms typically operate on square formats with equal length and width. However, real-world datasets like ours often comprise images with various dimensions and ratios. To prepare our dataset for training, resizing and cropping images to a square format is necessary.

Additionally, we will randomly augment the images to enhance our training algorithm’s ability to generalize. Augmentation will only be applied to the training data—not the validation or test datasets—to ensure predictions are made on images presented in their usual form for inference.
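
As an illustration, the resizing, cropping, and augmentation described above could be expressed with torchvision transforms along these lines; the 224x224 target size and the specific augmentations are assumptions for this sketch, not necessarily the exact choices in our script:

from torchvision import transforms

# Training images: random crop and flip give the model more varied examples
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),   # resize and crop to a 224x224 square
    transforms.RandomHorizontalFlip(),   # augmentation applied to training data only
    transforms.ToTensor(),
])

# Validation and test images: deterministic resize and center crop, no augmentation
eval_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])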

Our processing stage encompasses two steps:

  1. First, we instantiate the PyTorchProcessor class required to execute our custom data processing script:
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.pytorch.processing import PyTorchProcessor

region = boto3.session.Session().region_name

role = get_execution_role()
pytorch_processor = PyTorchProcessor(
    framework_version="1.8", 
    role=role, 
    instance_type="ml.m5.xlarge", 
    instance_count=1
)
  2. Next, we provide the instructions for executing the data processing tasks contained in our script:

The dataset (coco-annotations.zip) is automatically copied into the container under the destination directory (/opt/ml/processing/input). This is where the Python script (preprocessing.py) accesses it. By specifying source_dir, we indicate to Processing where to locate the script and its dependencies. For example, in source_dir, you can find an additional file (script_utils.py) used by our primary script, along with a file ensuring all dependencies are met (requirements.txt). We also pass any command-line arguments necessary for execution.
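
Putting this together, the second step might look roughly like the sketch below; the source_dir name, the command-line arguments, and the S3 locations are placeholders rather than the exact values used in this example:

from sagemaker.processing import ProcessingInput, ProcessingOutput

pytorch_processor.run(
    code="preprocessing.py",             # main data processing script
    source_dir="scripts",                # placeholder: folder containing preprocessing.py, script_utils.py, requirements.txt
    arguments=["--train-split", "0.8"],  # placeholder command-line arguments
    inputs=[
        ProcessingInput(
            source="s3://my-bucket/coco-annotations.zip",   # placeholder S3 location of the dataset
            destination="/opt/ml/processing/input",
        )
    ],
    outputs=[
        ProcessingOutput(
            source="/opt/ml/processing/output",
            destination="s3://my-bucket/processed",          # placeholder S3 location for the processed data
        )
    ],
)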
