Enhancing AWS Trainium with Custom Operators

Deep learning (DL) is a rapidly advancing domain, with researchers and developers consistently pushing the envelope on DL models and finding methods to enhance their performance. One effective approach is through the use of custom operators, which allow developers to extend the capabilities of established machine learning (ML) frameworks like PyTorch. An operator typically represents a mathematical function applied within a deep learning model, and custom operators empower developers to create tailored mathematical functions for specific layers in their models.

AWS Trainium and AWS Inferentia2 are purpose-built for DL training and inference, respectively, and extend their functionality and performance by supporting custom operators, often referred to as CustomOps. The AWS Neuron SDK provides this support through the standard PyTorch interface for CustomOps, so developers can reuse their existing code when working with Trainium-based Amazon EC2 Trn1 instances or Inferentia2-based Amazon EC2 Inf2 instances. This article explores the advantages of CustomOps, shows how they run efficiently on Trainium, and provides examples to get you started with CustomOps on Trainium-powered Trn1 instances.

A basic understanding of core AWS services, such as Amazon Elastic Compute Cloud (Amazon EC2), is assumed, along with familiarity with deep learning, PyTorch, and C++.

Advantages of Custom Operators in PyTorch

CustomOps support in PyTorch arrived with the PyTorch C++ frontend (introduced in version 1.10), which offers a straightforward mechanism for registering CustomOps written in C++ (a minimal registration sketch follows the list below). The following highlights some key benefits of utilizing CustomOps:

  • Performance Optimization: CustomOps can be tailored for specific applications, resulting in faster model execution and enhanced performance.
  • Enhanced Model Expressiveness: CustomOps enable the expression of intricate computations that cannot easily be represented with built-in PyTorch operators.
  • Increased Modularity: Developers can construct more complex models using CustomOps as foundational components, creating reusable C++ libraries that streamline the development process and promote rapid experimentation.
  • Greater Flexibility: CustomOps allow for operations that extend beyond the built-in operators, offering a versatile means to define complex functions not available through the standard set.
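
As referenced above, here is a minimal sketch of how an operator can be registered through the PyTorch C++ frontend in plain PyTorch, outside of Neuron. The library name my_lib, the operator my_relu, and the use of torch::clamp_min are illustrative choices for this sketch, not part of the Neuron workflow described later:

#include <torch/script.h>

// A toy custom operator: clamp negative values to zero (equivalent to ReLU).
torch::Tensor my_relu(const torch::Tensor& input) {
    return torch::clamp_min(input, 0.0);
}

// Register the operator; from Python it becomes callable as
// torch.ops.my_lib.my_relu(tensor).
TORCH_LIBRARY(my_lib, m) {
    m.def("my_relu", my_relu);
}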

Trainium’s Support for Custom Operators

Trainium (and AWS Inferentia2) provides software support for CustomOps through the Neuron SDK, which accelerates these operations using the General Purpose Single Instruction Multiple Data (GPSIMD) engine. This section illustrates how these elements enable efficient CustomOps implementation, enhancing flexibility and performance in DL model development.

Neuron SDK

The Neuron SDK is designed to assist developers in training models on Trainium and deploying them on AWS Inferentia accelerators. It integrates seamlessly with frameworks like PyTorch and TensorFlow, allowing developers to maintain their existing workflows and application code on Trn1 instances.

Utilizing the standard PyTorch interface for CustomOps, developers can write CustomOps in C++ and extend the official operator support of Neuron. The Neuron SDK compiles these CustomOps for efficient execution on the GPSIMD engine, detailed further below. This facilitates the implementation of experimental CustomOps and accelerates them on specialized hardware without requiring extensive knowledge of the underlying infrastructure.

General Purpose Single Instruction Multiple Data Engine

At the heart of Trainium’s optimizations lies the NeuronCore architecture, an independent heterogeneous compute unit featuring four main engines: tensor, vector, scalar, and GPSIMD. The scalar and vector engines are optimized for floating-point operations, while the tensor engine supports mixed-precision computation through a power-optimized systolic array.

The GPSIMD engine serves as a general-purpose SIMD engine designed specifically for executing and accelerating CustomOps. It comprises eight fully programmable 512-bit wide general-purpose processors capable of running straightforward C code with direct access to other NeuronCore-v2 engines as well as embedded SRAM and HBM memories. This architecture enables efficient execution of CustomOps on Trainium.

For instance, operators like TopK, LayerNorm, or ZeroCompression perform relatively little arithmetic logic unit (ALU) work per element read from memory. Traditional CPU systems are memory-bound for such operators: performance is limited by the time it takes to move data to the CPU. In contrast, the GPSIMD engines in Trainium are tightly coupled to the on-chip caches through a high-bandwidth streaming interface that can sustain 2 TB/sec of memory bandwidth, which allows such CustomOps to execute rapidly on Trainium.

Implementing Neuron SDK Custom Operators

In this discussion, we assume the use of an AWS Deep Learning AMI (DLAMI) to launch an EC2 Trn1 instance (either trn1.2xlarge or trn1.32xlarge). All necessary software, drivers, and tools are pre-installed on the DLAMIs, and only the activation of the Python environment is required to begin the tutorial. We refer to the CustomOps functionality in Neuron as “Neuron CustomOps.”

As with standard C++ integration in PyTorch, Neuron CustomOps require a C++ implementation of the operator, written against a NeuronCore-ported subset of the Torch C++ API. This C++ implementation, termed the kernel function, includes everything needed for CustomOps development and model integration, in particular the tensor and scalar classes in c10 (the namespace for low-level C++ code in PyTorch) and a subset of ATen operators (the C++ tensor library that provides PyTorch’s core tensor operations).

The kernel definition requires including the torch.h header for access to the NeuronCore-ported subset of the PyTorch C++ API:

#include <torch/torch.h>
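
For illustration, a kernel function for a simple element-wise ReLU forward pass might look like the following sketch. The function name relu_forward, the 1-D float layout, and the use of tensor accessors are assumptions made here for clarity; they mirror the standard PyTorch C++ API rather than a definitive Neuron implementation:

#include <torch/torch.h>

// Sketch of a CustomOp kernel: element-wise ReLU on a 1-D float tensor.
torch::Tensor relu_forward(const torch::Tensor& t_in) {
    int64_t num_elem = t_in.numel();
    torch::Tensor t_out = torch::zeros(t_in.sizes(), torch::kFloat);

    // Accessors provide element-wise read/write access to the tensors.
    auto t_in_acc = t_in.accessor<float, 1>();
    auto t_out_acc = t_out.accessor<float, 1>();
    for (int64_t i = 0; i < num_elem; i++) {
        t_out_acc[i] = t_in_acc[i] > 0.0f ? t_in_acc[i] : 0.0f;
    }
    return t_out;
}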

Additionally, Neuron CustomOps must define a shape function, which shares the same signature as the kernel function but does not perform any computations; it merely specifies the output tensor’s shape.
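
Continuing the hypothetical relu_forward example, the corresponding shape function might be sketched as follows; it allocates an output tensor of the right shape and data type but performs no arithmetic:

#include <torch/torch.h>

// Shape function: same signature as the kernel, but no computation.
// It only describes the shape and dtype of the output tensor.
torch::Tensor relu_forward_shape(const torch::Tensor& t_in) {
    return torch::zeros(t_in.sizes(), torch::kFloat);
}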

Neuron CustomOps are organized into libraries, and macros are used to register them in a NEURON_LIBRARY scope in the shape function’s source file. Registration runs on the host at compilation time and requires including the register.h header from the torchneuron library:

#include "torchneuron/register.h"

The custom library is built by calling the load API; if the build_directory parameter is provided, the library file is written to that directory:

import os
import torch_neuronx
from torch_neuronx.xla_impl import custom_op

custom_op.load(
    name=name,  # this is the name for the library (e.g., 'relu')
    compute_srcs=['CustomOP.cpp'],
    shape_srcs=['shape.cpp'],
    build_directory=os.getcwd()
)

To use a CustomOp in a PyTorch model, load the library by calling the load_library API and invoke the Neuron CustomOp the same way CustomOps are called in PyTorch, through the torch.ops namespace. The typical format is torch.ops.<library_name>.<operator_name>.
