We are excited to share enhancements to Amazon SageMaker Experiments, a capability of Amazon SageMaker that enables users to efficiently organize, monitor, compare, and evaluate machine learning experiments and model versions from any integrated development environment (IDE) using the SageMaker Python SDK or boto3, including local Jupyter notebooks.
Machine learning (ML) is inherently iterative. Data scientists and ML engineers often iterate over numerous parameters, known as hyperparameters, to identify the model configuration that best addresses a specific business challenge. As they experiment with various models and hyperparameters, it becomes increasingly challenging for ML teams to manage their model runs without a system in place to track the different experiments. An experiment tracking system simplifies the comparison of iterations, promotes collaboration among team members, and thus enhances productivity while saving valuable time. It does this by organizing and managing ML experiments so that conclusions can be drawn easily, such as identifying the training run with the highest accuracy.
To address these needs, SageMaker offers SageMaker Experiments, a fully integrated feature that allows for logging model metrics, parameters, files, and artifacts, as well as generating visualizations from different metrics. It also captures various metadata, enabling model reproducibility. Data scientists can quickly evaluate model performance and hyperparameters through visual charts and tables and can share their findings with stakeholders by downloading these charts.
The latest updates integrate SageMaker Experiments directly into the SageMaker Python SDK, simplifying the workflow for data scientists and removing the need for an additional library to manage multiple model executions. We are introducing two new core concepts:
- Experiment: A collection of runs that are grouped together. An experiment can include runs of various types initiated from anywhere using the SageMaker Python SDK.
- Run: Each execution step of the model training process, encompassing all inputs, parameters, configurations, and results for a single iteration of training. Custom parameters and metrics can be logged using the log_parameter, log_parameters, and log_metric functions, while custom input and output files can be logged with the log_file function.
The functionalities tied to the Run class are accessible from any IDE that has the SageMaker Python SDK installed. For SageMaker Training, Processing, and Transform jobs, the SageMaker Experiments run is automatically linked to the job if the job is invoked within a run context, and the run object can be retrieved inside the job using load_run(). Additionally, data scientists can automatically log confusion matrices, precision-recall graphs, and ROC curves for classification tasks using the run.log_confusion_matrix, run.log_precision_recall, and run.log_roc_curve functions.
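As a quick illustration of these chart-logging functions, here is a minimal sketch that is not part of the MNIST walkthrough below; the experiment and run names are arbitrary, and the labels and scores are made-up placeholders for a binary classification problem.

from sagemaker.session import Session
from sagemaker.experiments.run import Run

# Placeholder labels and scores for a binary classification problem
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
y_score = [0.2, 0.9, 0.4, 0.1, 0.8]  # predicted probabilities for the positive class

with Run(experiment_name="charts-example", run_name="charts-run", sagemaker_session=Session()) as run:
    run.log_confusion_matrix(y_true, y_pred, title="Confusion matrix")
    run.log_precision_recall(y_true, y_score, title="Precision-recall")
    run.log_roc_curve(y_true, y_score, title="ROC curve")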
In this blog post, we will showcase how to utilize the new SageMaker Experiments features in a Jupyter notebook via the SageMaker SDK. We will demonstrate these capabilities with a PyTorch example to train a model for classifying MNIST handwritten digits. The experiment will follow this structure:
- Creating Experiment Runs and Logging Parameters: We will first create a new experiment, initiate a new run for it, and log relevant parameters.
- Logging Model Performance Metrics: We will document model performance metrics and create graphical representations of these metrics.
- Comparing Model Runs: We will analyze different model runs based on hyperparameters and discuss how to leverage SageMaker Experiments to identify the best-performing model.
- Running Experiments from SageMaker Jobs: We will show how to automatically share your experiment context with SageMaker processing, training, or batch transform jobs, allowing you to easily recover your run context inside the job using the load_run function (a short sketch follows this list).
- Integrating SageMaker Clarify Reports: We will illustrate how to combine SageMaker Clarify bias and explainability reports into a unified view alongside your trained model report.
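As a preview of the load_run pattern, a script running inside a SageMaker job that was launched within a run context can recover that run without passing the experiment or run name explicitly. The following is a minimal, hypothetical sketch of such a script; the logged parameter is a placeholder.

# Inside a SageMaker training or processing script launched within a run context
from sagemaker.session import Session
from sagemaker.experiments.run import load_run

with load_run(sagemaker_session=Session()) as run:
    # The run associated with the launching job is retrieved automatically
    run.log_parameter("device", "cpu")  # placeholder parameter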
Prerequisites
For this blog post, we will utilize Amazon SageMaker Studio to demonstrate how to log metrics using the updated SageMaker Experiments functionalities. To follow along, you will need:
- A SageMaker Studio Domain
- A SageMaker Studio user profile with full access
- A SageMaker Studio notebook with at least an ml.t3.medium instance type
If you do not yet have a SageMaker Domain or user profile, you can set one up using this quick setup guide.
Logging Parameters
In this exercise, we will use torchvision, a PyTorch package that offers popular datasets, model architectures, and common image transformations for computer vision. SageMaker Studio provides a range of Docker images for various data science use cases, available on Amazon ECR. For PyTorch, you can choose images optimized for either CPU or GPU training. For this example, we will select the PyTorch 1.12 Python 3.8 CPU Optimized image and the Python 3 kernel. The following examples focus on the SageMaker Experiments functionalities and are not complete code listings.
We will download the data using the torchvision package and track the number of samples in the training and testing datasets as parameters with SageMaker Experiments. Let’s assume that train_set and test_set have already been downloaded using torchvision.
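For reference, that download step could look roughly like the following, using the standard torchvision MNIST API; the local "data" directory is a placeholder and the normalization constants are the commonly used MNIST defaults.

from torchvision import datasets, transforms

# Download the MNIST training and test sets into a local "data" directory
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.1307,), (0.3081,))])
train_set = datasets.MNIST("data", train=True, transform=transform, download=True)
test_set = datasets.MNIST("data", train=False, transform=transform, download=True)

With the datasets in place, we can create the experiment and log the dataset sizes: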
from sagemaker.session import Session
from sagemaker.experiments.run import Run
import os

# Create a new experiment and start a run
experiment_name = "local-experiment-example"
run_name = "experiment-run"

with Run(experiment_name=experiment_name, sagemaker_session=Session(), run_name=run_name) as run:
    run.log_parameters({
        "num_train_samples": len(train_set.data),
        "num_test_samples": len(test_set.data)
    })

    for f in os.listdir(train_set.raw_folder):
        print("Logging", train_set.raw_folder + "/" + f)
        run.log_file(train_set.raw_folder + "/" + f, name=f, is_output=False)
In this code snippet, we use run.log_parameters to record the number of training and testing samples, and run.log_file to upload the raw datasets to Amazon S3, logging them as inputs for our experiment.
Training a Model and Logging Metrics
Having downloaded our MNIST dataset, we will now train a CNN model to recognize the digits. During training, we will load our existing experiment run, log new parameters, and track the model’s performance by logging metrics.
from sagemaker.experiments.run import load_run

with load_run(experiment_name=experiment_name, run_name=run_name, sagemaker_session=Session()) as run:
    train_model(
        run=run,
        train_set=train_set,
        test_set=test_set,
        epochs=10,
        hidden_channels=5,
        optimizer="adam"
    )
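Here, train_model is the user-defined training loop for the CNN; its implementation is omitted in this excerpt. Purely as a hypothetical sketch of the metric-logging pattern such a loop would follow, the snippet below uses simulated placeholder values for loss and accuracy; only the run.log_parameters and run.log_metric calls reflect the Experiments API.

def train_model(run, train_set, test_set, epochs, hidden_channels, optimizer):
    # Record the hyperparameters of this run
    run.log_parameters({
        "epochs": epochs,
        "hidden_channels": hidden_channels,
        "optimizer": optimizer,
    })
    for epoch in range(epochs):
        # In the real loop these values come from training and evaluating the CNN
        train_loss = 1.0 / (epoch + 1)       # placeholder value
        test_accuracy = 0.9 + 0.01 * epoch   # placeholder value
        # step ties each value to the epoch so the metric can be charted over time
        run.log_metric(name="train:loss", value=train_loss, step=epoch)
        run.log_metric(name="test:accuracy", value=test_accuracy, step=epoch)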