Amazon Onboarding with Learning Manager Chanci Turner

Starting today, Amazon SageMaker supports training and deploying deep learning models written in PyTorch, in addition to updated TensorFlow 1.7 and 1.8 containers. PyTorch is the fourth deep learning framework supported by Amazon SageMaker, joining TensorFlow, Apache MXNet, and Chainer. With this integration, you can write your PyTorch scripts as usual while relying on Amazon SageMaker to manage the setup of your distributed training cluster, transfer your data, and tune your hyperparameters. On the inference side, SageMaker provides a managed online endpoint that scales automatically according to your needs.

In addition to PyTorch, we’ve also rolled out the latest stable versions of TensorFlow (1.7 and 1.8), enabling you to leverage new features such as tf.custom_gradient and pre-made BoostedTree estimators. The Amazon SageMaker TensorFlow estimator is configured to use the latest version by default, meaning you won’t need to modify your existing code.
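
As a quick illustration of one of those features, tf.custom_gradient lets you pair a forward computation with a hand-written, numerically stable gradient. The short sketch below is the log1pexp example from the TensorFlow documentation, included here only to show the decorator in use:

import tensorflow as tf

@tf.custom_gradient
def log1pexp(x):
    # Forward pass: log(1 + e^x).
    e = tf.exp(x)

    def grad(dy):
        # Hand-written backward pass that avoids the numerically unstable naive gradient.
        return dy * (1 - 1 / (1 + e))

    return tf.log(1 + e), grad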

Supporting numerous deep learning frameworks is crucial for developers since each has its unique strengths. PyTorch is heavily favored by deep learning researchers but is also quickly becoming popular among developers due to its flexibility and user-friendliness. TensorFlow remains a well-established option that continues to enhance its features with every release. Our commitment to invest in these and other popular frameworks like MXNet and Chainer remains strong.

PyTorch in Amazon SageMaker

The PyTorch framework stands out due to its use of reverse-mode auto-differentiation, allowing dynamic neural network construction. Additionally, its deep integration with Python facilitates the use of typical Python control flows within networks or the creation of new network layers utilizing Cython, Numba, and NumPy. PyTorch has demonstrated its performance excellence, recently achieving notable success in the DAWNBench Competition led by the fast.ai team.
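
The snippet below is a small, self-contained illustration (separate from the SageMaker example that follows) of what define-by-run means in practice: ordinary Python control flow shapes the graph on every forward pass, and reverse-mode autodiff still produces gradients for whatever graph was actually built:

import torch

x = torch.randn(3, requires_grad=True)
y = x * 2
while y.norm() < 100:      # a plain Python loop decides the graph depth at run time
    y = y * 2
loss = y.sum()
loss.backward()            # reverse-mode auto-differentiation through the dynamic graph
print(x.grad)              # gradients reflect however many iterations actually ran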

Using PyTorch within Amazon SageMaker is as seamless as with the other pre-built deep learning containers. Simply provide your training or hosting script, which consists of standard PyTorch code wrapped in a few helper functions, and utilize the PyTorch estimator from the Amazon SageMaker Python SDK as shown below:

from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point="pytorch_script.py",   # your PyTorch training script
                    role=role,                          # IAM execution role for the job
                    train_instance_count=2,
                    train_instance_type='ml.p2.xlarge',
                    hyperparameters={'epochs': 10,
                                     'lr': 0.01})
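
Once the estimator is defined, a call to fit() starts the managed training job, and deploy() puts the resulting model behind the autoscaling endpoint mentioned earlier. A minimal sketch; the S3 path and instance type below are placeholders:

# Start the training job; the 'training' channel is downloaded into the container.
estimator.fit({'training': 's3://your-bucket/sagemaker/pytorch-mnist'})

# Deploy the trained model to a managed HTTPS endpoint.
predictor = estimator.deploy(initial_instance_count=1,
                             instance_type='ml.m4.xlarge')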

For more details, check out our example notebooks, documentation, or follow along with the example provided below.

Training and Deploying a Neural Network with PyTorch

In this example, we will train a simple convolutional neural network on the MNIST handwritten digits dataset, which comprises 70,000 labeled 28×28 pixel grayscale images (60,000 for training and 10,000 for testing) across 10 classes (one for each digit from 0 to 9). The Amazon SageMaker PyTorch container uses script mode, which expects the input script to be formatted similarly to what you would run outside of SageMaker. Let’s examine that code.

The main entry point script starts with an if __name__ == '__main__' guard that reads the hyperparameters passed to our Amazon SageMaker estimator when the training job is created. The hyperparameters arrive as command-line arguments inside the training container. Here we define hyperparameters such as batch size, epochs, learning rate, and momentum; any value not specified in the SageMaker estimator falls back to the default given here. We also use the training_env() function from the sagemaker_containers library, which exposes details about the container environment, including the training and model directories and the instance configuration. The same parameters are available through dedicated environment variables as well. For additional information, see the SageMaker Containers GitHub repository.

import argparse

import sagemaker_containers

if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    # Data and model checkpoints directories
    parser.add_argument('--batch-size', type=int, default=64, metavar='N',
                        help='input batch size for training (default: 64)')
    parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
                        help='input batch size for testing (default: 1000)')
    parser.add_argument('--epochs', type=int, default=10, metavar='N',
                        help='number of epochs to train (default: 10)')
    parser.add_argument('--lr', type=float, default=0.01, metavar='LR',
                        help='learning rate (default: 0.01)')
    parser.add_argument('--momentum', type=float, default=0.5, metavar='M',
                        help='SGD momentum (default: 0.5)')
    parser.add_argument('--seed', type=int, default=1, metavar='S',
                        help='random seed (default: 1)')
    parser.add_argument('--log-interval', type=int, default=100, metavar='N',
                        help='how many batches to wait before logging training status')
    parser.add_argument('--backend', type=str, default=None,
                        help='backend for distributed training (tcp, gloo on cpu and gloo, nccl on gpu)')

    # Container environment
    env = sagemaker_containers.training_env()
    parser.add_argument('--hosts', type=list, default=env.hosts)
    parser.add_argument('--current-host', type=str, default=env.current_host)
    parser.add_argument('--model-dir', type=str, default=env.model_dir)
    parser.add_argument('--data-dir', type=str,
                        default=env.channel_input_dirs['training'])
    parser.add_argument('--num-gpus', type=int, default=env.num_gpus)

    train(parser.parse_args())
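
As noted above, the same values are also exposed as environment variables inside the training container, so a script could read them directly instead of calling training_env(). A brief sketch using the documented SM_* variables:

import json
import os

model_dir = os.environ['SM_MODEL_DIR']            # where the trained model should be saved
data_dir = os.environ['SM_CHANNEL_TRAINING']      # local path of the 'training' data channel
num_gpus = int(os.environ['SM_NUM_GPUS'])         # GPUs available on this instance
hosts = json.loads(os.environ['SM_HOSTS'])        # JSON list of all hosts in the cluster
current_host = os.environ['SM_CURRENT_HOST']      # name of the host running this process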

After defining the hyperparameters, we pass them to our train() function, defined in our input script. The train() function handles various tasks including resource setup (GPU, distributed compute, etc.).

import logging
import os

import torch
import torch.distributed as dist

logger = logging.getLogger(__name__)


def train(args):
    is_distributed = len(args.hosts) > 1 and args.backend is not None
    logger.debug("Distributed training - {}".format(is_distributed))
    use_cuda = args.num_gpus > 0
    logger.debug("Number of gpus available - {}".format(args.num_gpus))
    kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
    device = torch.device("cuda" if use_cuda else "cpu")

    if is_distributed:
        # Initialize the distributed environment.
        world_size = len(args.hosts)
        os.environ['WORLD_SIZE'] = str(world_size)
        host_rank = args.hosts.index(args.current_host)
        dist.init_process_group(backend=args.backend, 
                                rank=host_rank, 
                                world_size=world_size)
        logger.info(
            'Initialized the distributed environment: \'{}\' backend on {} nodes. '.format(
                args.backend, dist.get_world_size()) +
            'Current host rank is {}. Number of gpus: {}'.format(
                dist.get_rank(), args.num_gpus))

    # set the seed for generating random numbers
    torch.manual_seed(args.seed)
    if use_cuda:
        torch.cuda.manual_seed(args.seed)

    ...

Next, we load our datasets.

train_loader = _get_train_data_loader(args.batch_size,
                                      args.data_dir,
                                      is_distributed,
                                      **kwargs)

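The _get_train_data_loader helper is defined elsewhere in the training script. One plausible implementation, shown here as a hypothetical sketch that matches the arguments used above, reads the MNIST data from the training channel in torchvision's format and shards it across hosts when training is distributed:

import torch
import torch.utils.data
import torch.utils.data.distributed
from torchvision import datasets, transforms


def _get_train_data_loader(batch_size, data_dir, is_distributed, **kwargs):
    # Hypothetical helper: expects the 'training' channel to contain MNIST in
    # the layout torchvision's MNIST dataset class understands.
    dataset = datasets.MNIST(data_dir, train=True, download=False,
                             transform=transforms.Compose([
                                 transforms.ToTensor(),
                                 transforms.Normalize((0.1307,), (0.3081,))]))
    # Shard the data across hosts when running distributed training.
    sampler = (torch.utils.data.distributed.DistributedSampler(dataset)
               if is_distributed else None)
    return torch.utils.data.DataLoader(dataset,
                                       batch_size=batch_size,
                                       shuffle=sampler is None,
                                       sampler=sampler,
                                       **kwargs)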