Adopt a Data-Centric Strategy to Reduce Data Requirements for Training Amazon SageMaker Models

Published on 09 MAR 2023

In recent years, as machine learning (ML) models have evolved, data scientists, ML engineers, and researchers have increasingly concentrated on enhancing data quality. This shift has given rise to a data-centric approach to ML, which emphasizes improving model performance by focusing on data requirements. Implementing these strategies enables practitioners to decrease the volume of data necessary for training an ML model.

One key aspect of this approach is the development of advanced data subset selection techniques designed to expedite training by minimizing the input data size. These techniques involve automatically selecting a specific number of points that closely represent the distribution of a larger dataset, optimizing the training process and reducing the time required to train an ML model.

In this article, we explore how to apply data-centric AI principles using Amazon SageMaker Ground Truth, implement data subset selection techniques with the CORDS repository on Amazon SageMaker to lessen the data needed for initial model training, and conduct experiments utilizing this method with Amazon SageMaker Experiments.

Understanding the Data-Centric Approach to Machine Learning

Before delving into more sophisticated data-centric techniques like data subset selection, there are multiple ways to enhance your datasets by applying fundamental principles to your data labeling process. For this, Ground Truth offers various mechanisms to bolster label consistency and data quality.

Label consistency is crucial for improving model performance. Without it, models struggle to create a decision boundary that distinguishes between different classes. One effective method to ensure consistency is through annotation consolidation in Ground Truth, allowing a given example to be evaluated by multiple labelers. The aggregated label then serves as the ground truth for that example. Any divergence in labels is assessed using the confidence score generated by Ground Truth. If discrepancies arise, it’s essential to examine potential ambiguities in the labeling instructions provided to labelers. This strategy mitigates individual labelers’ biases, which is vital for achieving consistent labels.
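To make the consolidation idea concrete, here is a minimal sketch of majority-vote label consolidation with an agreement-based confidence score. Ground Truth's actual annotation consolidation uses more sophisticated probabilistic methods; the function names and the 0.7 threshold below are illustrative assumptions, not part of the Ground Truth API.

```python
from collections import Counter

def consolidate_annotations(annotations):
    """Consolidate labels from multiple annotators via majority vote.

    annotations: dict mapping example_id -> list of labels from different workers.
    Returns dict mapping example_id -> (consolidated_label, confidence),
    where confidence is the fraction of annotators that agree.
    """
    consolidated = {}
    for example_id, labels in annotations.items():
        label, votes = Counter(labels).most_common(1)[0]
        consolidated[example_id] = (label, votes / len(labels))
    return consolidated

def flag_for_review(consolidated, threshold=0.7):
    """Return example IDs whose annotator agreement falls below the threshold.

    Low agreement often signals ambiguous labeling instructions worth revisiting.
    """
    return [eid for eid, (_, conf) in consolidated.items() if conf < threshold]
```

Examples flagged this way are good candidates for clarifying the labeling instructions before the next labeling iteration.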

Another way to enhance model performance involves formulating methods to analyze labeling errors as they occur, thus identifying the most critical subset of data that requires improvement. This can be accomplished through a combination of manual efforts, such as reviewing labeled examples alongside Amazon CloudWatch logs and metrics generated by Ground Truth labeling jobs. It’s also important to consider errors made during inference to guide the next labeling iteration. Additionally, Amazon SageMaker Clarify enables data scientists and ML engineers to execute algorithms like KernelSHAP, facilitating the interpretation of model predictions—these insights can be traced back to the initial labeling process to enhance its quality.
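To illustrate what SHAP-style attribution computes, the sketch below calculates exact Shapley values for a single prediction with a handful of features. SageMaker Clarify's KernelSHAP approximates these values by sampling coalitions so it scales to many features; this toy version, with hypothetical function names, enumerates every coalition and replaces "absent" features with baseline values.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one prediction (feasible only for few features).

    predict: function taking a feature vector (list) and returning a float.
    x: the instance to explain; baseline: reference values for absent features.
    """
    n = len(x)
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                # Weight of this coalition in the Shapley average
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_i = [x[j] if j in subset or j == i else baseline[j] for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j] for j in range(n)]
                phi += weight * (predict(with_i) - predict(without_i))
        values.append(phi)
    return values
```

For a linear model the attribution of each feature reduces to its weight times its deviation from the baseline, which makes the output easy to sanity-check.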

Moreover, consider eliminating noisy or excessively repetitive examples. This practice helps reduce training time by discarding instances that do not significantly contribute to improving model performance. However, manually identifying a beneficial subset of a dataset can be challenging and time-consuming. Employing the data subset selection techniques discussed in this article can help automate this process according to established frameworks.
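A simple starting point for pruning repetitive examples is exact deduplication by content hash, sketched below with hypothetical function names. Near-duplicates (for instance, re-encoded or slightly cropped images) require perceptual hashing or embedding similarity instead, which this sketch omits.

```python
import hashlib

def deduplicate(examples):
    """Remove exact duplicates from a list of (example_id, payload_bytes) pairs.

    Keeps the first occurrence of each unique payload; returns the kept
    examples and the IDs of the duplicates that were dropped.
    """
    seen = set()
    kept, dropped = [], []
    for example_id, payload in examples:
        digest = hashlib.sha256(payload).hexdigest()
        if digest in seen:
            dropped.append(example_id)
        else:
            seen.add(digest)
            kept.append((example_id, payload))
    return kept, dropped
```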

Use Case

As previously mentioned, data-centric AI prioritizes improving model inputs over altering the model architecture itself. After applying these principles during data labeling or feature engineering, you can further refine model input by employing data subset selection during training.

For this discussion, we utilize Generalization based Data Subset Selection for Efficient and Robust Learning (GLISTER), one of the data subset selection techniques available in the CORDS repository. We apply it to the training algorithm of a ResNet-18 model to minimize the training time required for classifying CIFAR-10 images. Below are some sample images along with their corresponding labels from the CIFAR-10 dataset.

ResNet-18 is commonly employed for classification tasks and is an 18-layer deep convolutional neural network. The CIFAR-10 dataset, consisting of 60,000 labeled 32×32 color images across 10 classes, is frequently used to benchmark various ML techniques.

In the subsequent sections, we demonstrate how GLISTER can help answer the question: what percentage of a given dataset can we use during training while still achieving good model performance? Implementing GLISTER in your training algorithm introduces fraction as a hyperparameter, representing the percentage of the dataset you wish to use. As with any hyperparameter, tuning is necessary to find the value that works best for your model and data.
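Tuning the fraction amounts to a small sweep: one training run per candidate value, comparing validation accuracy against wall-clock time. The helpers below are an illustrative sketch of setting up such a sweep; the function names and the example hyperparameter keys are assumptions, not part of GLISTER or SageMaker.

```python
def subset_sizes(dataset_size, fractions):
    """Map candidate subset fractions to absolute training-set sizes."""
    return {f: int(dataset_size * f) for f in fractions}

def experiment_configs(base_hyperparameters, fractions):
    """Generate one hyperparameter dict per candidate fraction for a sweep."""
    return [dict(base_hyperparameters, fraction=f) for f in fractions]
```

For CIFAR-10's 50,000 training images, fractions of 0.1, 0.3, and 0.5 correspond to subsets of 5,000, 15,000, and 25,000 examples.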

We conduct several experiments using SageMaker Experiments to assess the impact of our approach. Outcomes will vary based on the initial dataset, making it essential to evaluate the approach against your data at different subset sizes.

Although we focus on images, GLISTER can also be applied to training algorithms that work with structured or tabular data.

Data Subset Selection

The goal of data subset selection is to expedite the training process while minimizing the impact on accuracy and enhancing model robustness. Specifically, GLISTER-ONLINE selects a subset as the model learns, aiming to maximize the log-likelihood of that training data subset on the validation set specified. This optimization reduces the noise and class imbalance typically found in real-world datasets, allowing the subset selection strategy to adapt dynamically as the model learns.
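To convey the shape of the optimization, here is a toy greedy selector in the spirit of GLISTER: at each step it adds the candidate whose inclusion most increases a validation objective. This is a deliberate simplification with hypothetical names; the real GLISTER-ONLINE interleaves selection with training and uses gradient-based approximations rather than re-evaluating the objective exactly for every candidate.

```python
def greedy_subset(candidates, k, val_objective):
    """Greedily build a subset of size k that maximizes val_objective(subset).

    val_objective stands in for the validation log-likelihood of a model
    trained on the subset; here it is any callable scoring a candidate subset.
    """
    subset = []
    remaining = list(candidates)
    for _ in range(k):
        best = max(remaining, key=lambda c: val_objective(subset + [c]))
        subset.append(best)
        remaining.remove(best)
    return subset
```

With a toy objective that rewards subsets summing close to a target, the greedy choice is easy to trace by hand.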

The initial GLISTER study outlines a speedup/accuracy trade-off at various data subset sizes as illustrated below using a LeNet model:

Subset Size    Speedup    Accuracy Change
10%            6x         -3.00%
30%            2.5x       -1.20%
50%            1.5x       -0.20%

To train the model, we initiate a SageMaker training job using a custom training script and have already uploaded our image dataset to Amazon Simple Storage Service (Amazon S3). Like any SageMaker training job, we need to define an Estimator object. The PyTorch estimator from the sagemaker.pytorch package allows us to run our training script within a managed PyTorch container. The inputs variable passed to the estimator’s .fit function contains a dictionary of the training and validation dataset’s S3 location.
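The setup described above might look like the following sketch. The role ARN, instance type, framework versions, S3 URIs, and hyperparameter names are placeholders for illustration, not values from the original training job.

```python
from sagemaker.pytorch import PyTorch

# Managed PyTorch container running our custom training script
estimator = PyTorch(
    entry_point="train.py",
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder role ARN
    framework_version="1.12",
    py_version="py38",
    instance_count=1,
    instance_type="ml.g4dn.xlarge",
    hyperparameters={"fraction": 0.5, "epochs": 50},  # GLISTER subset fraction
)

# Channel names map to S3 locations of the uploaded datasets (placeholder URIs)
inputs = {
    "train": "s3://my-bucket/cifar10/train",
    "validation": "s3://my-bucket/cifar10/validation",
}
estimator.fit(inputs)
```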

The train.py script is executed upon launching a training job. In this script, we import the ResNet-18 model from the CORDS library and specify the number of classes in our dataset as follows:

from cords.utils.models import ResNet18

# CIFAR-10 has 10 classes
num_classes = 10
model = ResNet18(num_classes)

Next, we utilize the gen_dataset function from CORDS to generate training, validation, and test datasets.
