Deep learning (DL) is a rapidly advancing field, with practitioners consistently striving to innovate and enhance DL models. One of the tools developers utilize to push the limits of DL is the implementation of custom operators, which extend the capabilities of existing machine learning (ML) frameworks like PyTorch. An operator essentially defines the mathematical function for a layer within a deep learning model, while a custom operator allows developers to create their own mathematical functions for these layers.
AWS Trainium and AWS Inferentia2, which are purpose-built for DL training and inference, extend their functionality and performance by supporting custom operators (or CustomOps). The AWS Neuron SDK, which supports these accelerators, uses the standard PyTorch interface for CustomOps, so developers can integrate their existing code seamlessly when using Trainium-based Amazon EC2 Trn1 instances or Inferentia2-based Amazon EC2 Inf2 instances. This article details the advantages of CustomOps, explains how they are implemented efficiently on Trainium, and provides examples to help you get started with CustomOps on Trn1 instances.
To follow along, familiarity with core AWS services such as Amazon Elastic Compute Cloud (Amazon EC2) is assumed, and basic knowledge of deep learning, PyTorch, and C++ will be beneficial.
Custom Operators in PyTorch and Their Advantages
CustomOps support in PyTorch was introduced in version 1.10 through the PyTorch C++ Frontend, which offers an intuitive way to register CustomOps written in C++. Here are some of the key advantages that CustomOps provide:
- Performance Optimization – CustomOps can be tailored for specific use cases, resulting in quicker model runs and enhanced performance.
- Enhanced Model Expressiveness – With CustomOps, intricate computations can be represented that are not easily achievable using built-in PyTorch operators.
- Increased Modularity – CustomOps can serve as building blocks for constructing more complex models, allowing developers to create reusable C++ libraries. This modularity streamlines the development process and encourages rapid experimentation.
- Greater Flexibility – CustomOps permit operations beyond the built-in operators, offering a versatile method for defining complex operations not available through standard options.
Trainium Support for Custom Operators
Trainium (and AWS Inferentia2) supports CustomOps in software through the Neuron SDK and accelerates them in hardware with the GPSIMD engine (General Purpose Single Instruction Multiple Data engine). Let’s explore how these components enable efficient implementation of CustomOps while providing greater flexibility and performance in DL model development.
Neuron SDK
The Neuron SDK aids developers in training models on Trainium and deploying them on AWS Inferentia accelerators. It integrates seamlessly with frameworks such as PyTorch and TensorFlow, allowing you to continue using your existing workflows and application code to train models on Trn1 instances.
The Neuron SDK employs the standard PyTorch interface for CustomOps. Developers can utilize the conventional programming interface in PyTorch to write CustomOps in C++ and expand Neuron’s official operator support. Neuron compiles these CustomOps for efficient execution on the GPSIMD engine, which will be discussed further below. This approach simplifies the process of implementing new experimental CustomOps and accelerating them on specialized hardware, without requiring deep knowledge of the underlying hardware.
General Purpose Single Instruction Multiple Data Engine
At the heart of Trainium’s optimizations lies the NeuronCore architecture, a fully independent, heterogeneous computing unit featuring four main engines: tensor, vector, scalar, and the GPSIMD engine. The scalar and vector engines are optimized for parallel processing of floating-point operations, while the tensor engine is designed for power-efficient mixed-precision computation through a systolic array.
The GPSIMD engine is engineered to run and accelerate CustomOps. It comprises eight fully programmable, 512-bit-wide general-purpose processors that can execute straight-line C code and have direct inline access to the other NeuronCore-v2 engines, as well as to the embedded SRAM and HBM memories. Together, these capabilities enable rapid execution of CustomOps on Trainium.
For instance, operators like TopK, LayerNorm, or ZeroCompression read data from memory and use it for only a small number of ALU calculations. Regular CPU systems are memory-bandwidth bound in this case: performance is limited by the time it takes to move the data into the CPU. In contrast, Trainium’s GPSIMD engines are tightly coupled with the on-chip caches through a high-bandwidth streaming interface supporting 2 TB/sec of memory bandwidth, which allows CustomOps of this kind to execute at impressive speeds on Trainium.
Neuron SDK Custom Operators in Practice
For this discussion, we assume that a DLAMI (refer to instructions for either Ubuntu or Amazon Linux) is being used to launch an EC2 Trn1 instance (either trn1.2xlarge or trn1.32xlarge). Note that all necessary software, drivers, and tools are pre-installed on the DLAMIs; only the Python environment needs to be activated before working through the tutorial. We refer to the CustomOps functionality available in Neuron as “Neuron CustomOps.”
Similar to the way PyTorch integrates with C++ code, Neuron CustomOps require a C++ implementation of an operator through a NeuronCore-ported subset of the Torch C++ API. The C++ implementation of the operator is called the kernel function, and the port of the C++ API contains the essentials for writing CustomOps and integrating them into models: the tensor and scalar classes in c10 (a namespace used for low-level C++ code across various PyTorch libraries) and a subset of ATen operators (ATen, short for “A Tensor Library,” is the C++ library that provides the core tensor operations in PyTorch).
The torch.h header must be included when defining the kernel to gain access to the NeuronCore-ported subset of the PyTorch C++ API:
#include <torch/torch.h>
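As an illustration, here is a minimal sketch of what a kernel function can look like, using a ReLU forward pass in the spirit of the examples in the Neuron documentation. The operator name, the assumption of a flat float32 input tensor, and the source file it would live in (for example, the CustomOP.cpp referenced below) are illustrative choices rather than requirements:

#include <stdint.h>
#include <torch/torch.h>

// Kernel function: computes ReLU element by element.
// Assumes a 1-D float32 input tensor for simplicity.
torch::Tensor relu_forward(const torch::Tensor& t_in) {
    size_t num_elem = t_in.numel();
    torch::Tensor t_out = torch::zeros(t_in.sizes(), torch::kFloat);

    // Accessors provide element-wise reads and writes on the tensors.
    auto t_in_acc = t_in.accessor<float, 1>();
    auto t_out_acc = t_out.accessor<float, 1>();
    for (size_t i = 0; i < num_elem; i++) {
        t_out_acc[i] = t_in_acc[i] > 0.0f ? t_in_acc[i] : 0.0f;
    }
    return t_out;
}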
Neuron CustomOps also require a shape function. The shape function mirrors the kernel function’s signature but does not perform computations; it solely defines the output tensor’s shape, not the actual values.
Neuron CustomOps are organized into libraries, and macros are employed to register them with the NEURON_LIBRARY scope from within the shape function. This function runs on the host during compilation and requires the register.h header from the torchneuron library:
#include "torchneuron/register.h"
Finally, the custom library is built by invoking the load API. If the build_directory parameter is provided, the library file will be stored in the specified directory:
import os

import torch_neuronx
from torch_neuronx.xla_impl import custom_op

custom_op.load(
    name='relu',                    # name of the library (e.g., 'relu')
    compute_srcs=['CustomOP.cpp'],  # C++ source files containing the kernel functions
    shape_srcs=['shape.cpp'],       # C++ source files containing the shape functions
    build_directory=os.getcwd()     # where to store the built library file
)
To utilize the CustomOp within a PyTorch model, simply load the library by calling the load_library API and invoke the Neuron CustomOp as you would normally in PyTorch, via the torch.ops namespace. The format typically follows torch.ops.<library_name>.<operator_name>.