Amazon SageMaker Autopilot is designed to automatically train and fine-tune optimal machine learning (ML) models for both classification and regression tasks while ensuring users retain full control and visibility over the process. This feature enables data analysts, developers, and data scientists to train, tune, and deploy models with minimal coding, while also providing a comprehensive notebook that documents every step Autopilot undertook to produce the model. In certain scenarios, you may wish to enhance the pipelines constructed by Autopilot with your own custom components.
In this article, we’ll demonstrate how to create and utilize models with Autopilot in just a few clicks, and then we’ll discuss how to modify the code generated by SageMaker Autopilot to incorporate your own feature selectors and custom transformers for domain-specific enhancements. Additionally, we’ll utilize Autopilot’s dry run capability, which generates code for data preprocessing, algorithms, and their parameter settings without executing a full experiment. This can be achieved by selecting the option to run a pilot, resulting in a notebook with candidate definitions.
Modifying Autopilot Models
Customization of Autopilot models is usually unnecessary, as Autopilot produces high-quality models ready for deployment without alterations. It automatically conducts exploratory analysis on your data to identify which features might yield the best outcomes, thereby lowering the barrier to entry for ML across various users, from data analysts to developers looking to infuse AI/ML capabilities into their projects.
Nevertheless, advanced users can leverage Autopilot’s transparent AutoML process to significantly lessen the routine burdens associated with ML projects. For instance, you might want Autopilot to implement specific feature transformations or imputation techniques tailored to your organizational context. Although you can preprocess your data before inputting it into SageMaker Autopilot, this requires separate preprocessing pipelines outside of Autopilot. Alternatively, you can use Autopilot’s data processing pipeline to instruct it to apply your custom transformations and imputations. This approach allows you to concentrate on data gathering while Autopilot manages the heavy lifting of applying your desired transformations and ultimately identifying and deploying the best model.
Preparing Your Data and Autopilot Job
Let’s begin by setting up an Autopilot experiment utilizing the Forest Cover Type dataset.
- Download the dataset and upload it to Amazon Simple Storage Service (Amazon S3).
- Ensure that your Amazon SageMaker Studio user is created in the same Region as the S3 bucket.
- Launch SageMaker Studio.
- Create a job by providing the following details:
- Experiment name
- Training dataset location
- S3 bucket for storing Autopilot output data
- Type of ML problem
Your Autopilot job is now set to run. Instead of executing a complete experiment, we opt to have Autopilot generate a notebook with candidate definitions.
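The console steps above map onto the CreateAutoMLJob API. Here is a hedged sketch of the equivalent request (the bucket, role ARN, and job name are placeholders, and the boto3 call itself is commented out so the snippet stays side-effect-free):

```python
# Sketch of a CreateAutoMLJob request that only generates the
# candidate-definitions notebook instead of running a full experiment.
# All S3 URIs, the role ARN, and the job name are placeholders.
request = {
    "AutoMLJobName": "forest-cover-autopilot",  # hypothetical name
    "InputDataConfig": [{
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-bucket/forest-cover/train/",  # placeholder
        }},
        "TargetAttributeName": "Cover_Type",
    }],
    "OutputDataConfig": {"S3OutputPath": "s3://my-bucket/forest-cover/output/"},
    "ProblemType": "MulticlassClassification",
    "RoleArn": "arn:aws:iam::111122223333:role/MySageMakerRole",  # placeholder
    # The "dry run" switch: produce the notebook, skip the full experiment
    "GenerateCandidateDefinitionsOnly": True,
}

# import boto3  # submitting the job requires AWS credentials
# boto3.client("sagemaker").create_auto_ml_job(**request)
print(sorted(request))
```

Setting `GenerateCandidateDefinitionsOnly` to `True` is what corresponds to choosing the pilot option in Studio.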
Examining the Autopilot-Generated Pipelines
SageMaker Autopilot automates critical tasks within an ML pipeline. It evaluates hundreds of models composed of various features, algorithms, and hyperparameters to find the most suitable fit for your data. It also presents a leaderboard featuring 250 models, enabling you to assess the performance of each candidate and select the best one for deployment. We will delve deeper into this in the concluding section of this article.
Once your experiment is finalized, you can review your candidate pipelines. A candidate is the combination of data preprocessing steps and algorithm used to train each of the 250 models. The candidate generation notebook includes the Python code that Autopilot employed to create these candidates.
- Select “Open candidate generation notebook.”
- Access your notebook.
- Choose “Import” to bring the notebook into your workspace.
- When prompted, select Python 3 (Data Science) as the kernel.
- Within the notebook, run all cells in the SageMaker Setup section.
This copies the data preparation code generated by Autopilot into your workspace. In your root SageMaker Studio directory, you should now find a folder named after your Autopilot experiment, formatted as <Your Experiment Name>-artifacts. This directory contains two subdirectories, generated_module and sagemaker_automl; the generated_module directory holds the data processing artifacts produced by Autopilot.
At this stage, the Autopilot job has analyzed the dataset and generated ML candidate pipelines that include a set of feature transformers and an ML algorithm. Navigate to the candidate_data_processors directory within the generated_module folder, which houses 12 files:
- dpp0.py–dpp9.py – Data processing candidates generated by Autopilot
- trainer.py – Script that executes the data processing candidates
- sagemaker_serve.py – Script for running the preprocessing pipeline during inference
Upon reviewing the dpp*.py files, you will note that Autopilot has generated code that constructs scikit-learn pipelines, which can be easily extended with your own transformations. You can either modify the existing dpp*.py files directly or extend the pipelines after instantiation in the trainer.py file, where you define a transformer for use within existing dpp*.py files. The latter method is advisable as it enhances maintainability and allows for simultaneous extension of all proposed processing pipelines rather than requiring individual modifications.
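As a minimal sketch of the second approach, a step can be appended to an already-instantiated scikit-learn pipeline; the pipeline below is a stand-in for one returned by a dpp*.py candidate, not Autopilot's actual generated code:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for a pipeline produced by one of the dpp*.py candidates
pipeline = Pipeline(steps=[("imputer", SimpleImputer(strategy="median"))])

# Appending to pipeline.steps in one central place (such as trainer.py)
# extends every candidate pipeline the same way
pipeline.steps.append(("scaler", StandardScaler()))

Xt = pipeline.fit_transform(np.array([[1.0], [np.nan], [3.0]]))
print([name for name, _ in pipeline.steps], Xt.shape)
```

Because the extension happens after instantiation, the dpp*.py files themselves stay untouched and remain easy to regenerate.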
Utilizing Specific Transformers
You might want to invoke a particular transformer from scikit-learn or use one included in the open-source package sagemaker-scikit-learn-extension. The latter offers a variety of scikit-learn-compatible estimators and transformers that can be beneficial. For example, it features the Weight of Evidence (WoE) encoder, which is often utilized for encoding categorical features in binary classification contexts.
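To make the WoE idea concrete, here is the underlying formula computed in plain numpy; this is an illustrative sketch, not the sagemaker-scikit-learn-extension API: for each category c, WoE(c) = ln(P(c | y=1) / P(c | y=0)).

```python
import numpy as np

def woe_table(categories, y):
    """Weight of Evidence per category for a binary target (no smoothing)."""
    pos_total = (y == 1).sum()
    neg_total = (y == 0).sum()
    table = {}
    for c in np.unique(categories):
        mask = categories == c
        pos = (y[mask] == 1).sum() / pos_total  # P(c | y=1)
        neg = (y[mask] == 0).sum() / neg_total  # P(c | y=0)
        table[c] = np.log(pos / neg)
    return table

cats = np.array(["a", "a", "a", "b", "b", "b"])
y = np.array([1, 1, 0, 1, 0, 0])
table = woe_table(cats, y)
print(table)
```

Production encoders add smoothing so categories seen with only one class do not produce infinite values; the extension package handles such details for you.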
To incorporate additional transformers, begin by extending the import statements in the trainer.py file. For our example, we will add the following code:
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sagemaker_sklearn_extension.preprocessing import RobustStandardScaler
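These imports can then be combined into a feature-selection step inside a pipeline. The sketch below is illustrative rather than Autopilot's generated code, and it substitutes scikit-learn's StandardScaler for RobustStandardScaler so the snippet runs without the extension package installed:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler  # stand-in for RobustStandardScaler

# Keep only the features a random forest finds informative, then scale them
selector = SelectFromModel(RandomForestClassifier(n_estimators=10, random_state=0))
pipeline = Pipeline(steps=[("select", selector), ("scale", StandardScaler())])

rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = (X[:, 0] > 0.5).astype(int)  # only the first feature carries signal
Xt = pipeline.fit_transform(X, y)
print(Xt.shape)
```

SelectFromModel drops columns whose importance falls below the (mean) threshold, so the transformed data typically has fewer columns than the input.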
If you encounter errors while modifying trainer.py and executing the notebook cell containing automl_interactive_runner.fit_data_transformers(…), you can refer to Amazon CloudWatch for debugging information under the log group /aws/sagemaker/TrainingJobs.
Implementing Custom Transformers
Returning to the forest cover type case, we have features representing the vertical and horizontal distance to hydrology. We wish to enhance this with an additional feature transformation that computes the straight-line distance to hydrology. We can accomplish this by adding a new file into the candidate_data_processors directory where we define our custom transform. Here’s an example of the code:
# additional_features.py
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class HydrologyDistance(BaseEstimator, TransformerMixin):
    def __init__(self, feature_index):
        self._feature_index = feature_index

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X = X.copy().astype(np.float32)
        # Split out the horizontal and vertical distance-to-hydrology columns
        # and combine them into a single straight-line (Euclidean) distance
        a, b = np.split(X[:, self._feature_index], 2, axis=1)
        return np.hypot(a, b).reshape(-1, 1)
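A transform like HydrologyDistance can then be attached to the generated pipelines alongside the original features. Here is a self-contained sketch using FunctionTransformer stand-ins (the column positions and names are illustrative assumptions):

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

def straight_line_distance(X):
    # Assumes columns 0 and 1 hold the horizontal and vertical
    # distance-to-hydrology features
    return np.hypot(X[:, 0], X[:, 1]).reshape(-1, 1)

# FeatureUnion concatenates the original columns with the new feature
union = FeatureUnion([
    ("original", FunctionTransformer()),  # identity pass-through
    ("hydrology", FunctionTransformer(straight_line_distance)),
])

X = np.array([[3.0, 4.0, 7.0]])
Xt = union.fit_transform(X)
print(Xt)  # three original columns plus the straight-line distance
```

The same FeatureUnion pattern works with the HydrologyDistance class itself once it is importable from the candidate_data_processors directory.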