Navigating the entire lifecycle of a deep learning project can be daunting, particularly when it involves many different tools and services. For instance, you might employ separate platforms for data preprocessing, developing training and inference code, conducting full-scale model training and tuning, deploying models, and automating workflows to integrate these components for production. The friction of switching between tools can slow project progress and inflate costs. This article demonstrates how to manage the entire lifecycle of deep learning projects with Amazon SageMaker. While TensorFlow 2 serves as the framework for the example code, the principles discussed are generally applicable to other frameworks as well.
Additionally, this post includes a sample notebook that you can run in under an hour to demonstrate all of the features covered here.
Overview of the Amazon SageMaker Workflow
Every data science project using TensorFlow 2 or another framework begins with a dataset: acquiring, exploring, and preprocessing it. In an Amazon SageMaker workflow, data exploration predominantly happens within notebooks. Because these notebooks typically stay up throughout most of the workday, they are best run on smaller, less powerful, and therefore cost-effective instance types.
Consequently, unless the dataset is relatively small, a notebook is not the right place for full-scale data processing, model training, and inference. Since these tasks generally require significant parallel compute, it is far more efficient and economical to use Amazon SageMaker's functionality to spin up dedicated clusters of appropriately sized, more powerful instances that can complete the work quickly. Charges accrue by the second, and when a job completes, Amazon SageMaker automatically shuts the instances down. As a result, in a typical Amazon SageMaker workflow, the only persistent charges come from the inexpensive notebooks used for data exploration and prototyping, while the more powerful and costly GPU and accelerated compute instances are billed only for the seconds they run.
Once prototyping is finalized, you can advance beyond notebooks with workflow automation. An automated pipeline is essential for orchestrating the entire workflow up to model deployment in a consistent and reliable manner. Amazon SageMaker offers a native solution for this purpose. The following sections of this article introduce various features of Amazon SageMaker that can facilitate the implementation of these project lifecycle stages.
Data Transformation with Amazon SageMaker Processing
Amazon SageMaker Processing enables the preprocessing of large datasets in a managed cluster distinct from notebooks. It offers built-in support for Scikit-learn and accommodates any other containerized technology. For instance, you can initiate ephemeral Apache Spark clusters for feature transformations within Amazon SageMaker Processing.
To utilize Amazon SageMaker Processing with Scikit-learn, you simply need to provide a Python data preprocessing script containing standard Scikit-learn code. The script has minimal requirements: the input and output data must be placed in designated locations. Amazon SageMaker Processing automatically retrieves the input data from Amazon Simple Storage Service (Amazon S3) and uploads the transformed data back to Amazon S3 once the job concludes.
Prior to starting an Amazon SageMaker Processing job, instantiate a SKLearnProcessor object as illustrated in the following code snippet. In this object, specify the instance type to use for the job and the number of instances.
from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor
sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=get_execution_role(),
                                     instance_type='ml.m5.xlarge',
                                     instance_count=2)
To ensure that the data files are distributed evenly across the cluster instances for processing, specify the ShardedByS3Key distribution type in the ProcessingInput object. With n instances, each instance then receives 1/n of the files from the specified S3 bucket. The ability to easily create a large cluster of instances for stateless data transformations is just one of the many advantages Amazon SageMaker Processing provides.
from sagemaker.processing import ProcessingInput, ProcessingOutput
from time import gmtime, strftime
processing_job_name = "tf-2-workflow-{}".format(strftime("%d-%H-%M-%S", gmtime()))
output_destination = 's3://{}/{}/data'.format(bucket, s3_prefix)
sklearn_processor.run(code='preprocessing.py',
                      job_name=processing_job_name,
                      inputs=[ProcessingInput(
                          source=raw_s3,
                          destination='/opt/ml/processing/input',
                          s3_data_distribution_type='ShardedByS3Key')],
                      outputs=[ProcessingOutput(output_name='train',
                                                destination='{}/train'.format(output_destination),
                                                source='/opt/ml/processing/train'),
                               ProcessingOutput(output_name='test',
                                                destination='{}/test'.format(output_destination),
                                                source='/opt/ml/processing/test')])
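Once the job completes, the S3 locations of the transformed data can be looked up from the job description and passed on to the training step. The following is a minimal sketch, assuming the jobs list kept by the SageMaker Python SDK Processor and the standard DescribeProcessingJob response shape:

# Look up where the processed train and test data landed in S3
preprocessing_job_description = sklearn_processor.jobs[-1].describe()

for output in preprocessing_job_description['ProcessingOutputConfig']['Outputs']:
    if output['OutputName'] == 'train':
        preprocessed_training_data = output['S3Output']['S3Uri']
    if output['OutputName'] == 'test':
        preprocessed_test_data = output['S3Output']['S3Uri']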
Prototyping Training and Inference Code with Local Mode
When the dataset is primed for training, the next phase is to prototype the training code. For TensorFlow 2, the most straightforward workflow involves supplying a training script for ingestion by the Amazon SageMaker prebuilt TensorFlow 2 container. This feature, known as script mode, operates seamlessly with the Amazon SageMaker local mode training feature.
Local mode provides a convenient way to verify that the code works on a notebook before transitioning to full-scale, hosted training in a separate, appropriately sized cluster managed by Amazon SageMaker. In local mode, you typically train for a short time, perhaps just a few epochs, possibly on a sample of the complete dataset, to confirm the code works correctly without wasting time on a full training run. Also, specify the instance type as local_gpu or local, depending on whether the notebook is running on a GPU or CPU instance.
import sagemaker
from sagemaker.tensorflow import TensorFlow

git_config = {'repo': 'https://github.com/aws-samples/amazon-sagemaker-script-mode',
              'branch': 'master'}

model_dir = '/opt/ml/model'
train_instance_type = 'local'
hyperparameters = {'epochs': 5, 'batch_size': 128, 'learning_rate': 0.01}

local_estimator = TensorFlow(git_config=git_config,
                             source_dir='tf-2-workflow/train_model',
                             entry_point='train.py',
                             model_dir=model_dir,
                             train_instance_type=train_instance_type,
                             train_instance_count=1,
                             hyperparameters=hyperparameters,
                             role=sagemaker.get_execution_role(),
                             base_job_name='tf-2-workflow',
                             framework_version='2.1',
                             py_version='py3',
                             script_mode=True)
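With the estimator defined, a quick local training run is a single fit() call. The following is a minimal sketch; the file:// channel paths are placeholders for wherever a small local copy of the processed data lives on the notebook instance.

# Launch a short local-mode training run against a small local copy of the data
inputs = {'train': 'file://./data/train', 'test': 'file://./data/test'}
local_estimator.fit(inputs)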
Although training in local mode is immensely helpful for verifying that the training code works before advancing to full-scale training, it is just as convenient to be able to prototype inference code locally. Local mode supports this as well: you can deploy the locally trained model to a local mode endpoint, which runs the serving container on the notebook instance itself.
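A minimal sketch of that flow follows; x_test is assumed to already hold preprocessed feature rows, and the 'predictions' key reflects the TensorFlow Serving response format returned by the SageMaker TensorFlow predictor.

# Deploy the locally trained model to a local mode endpoint and smoke-test predictions
local_predictor = local_estimator.deploy(initial_instance_count=1, instance_type='local')

results = local_predictor.predict(x_test[:10])['predictions']
print(results)

# Shut down the local serving container when finished
local_predictor.delete_endpoint()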
