Leverage the Built-in Amazon SageMaker Random Cut Forest Algorithm for Anomaly Detection

Today, we are excited to announce the addition of the Random Cut Forest (RCF) algorithm as a built-in feature in Amazon SageMaker. RCF is an unsupervised learning algorithm designed for identifying anomalous data points or outliers within a dataset. This blog post will delve into the challenges of anomaly detection, detail the workings of the Amazon SageMaker RCF algorithm, and showcase its application on a real-world dataset.

Understanding Anomaly Detection

Imagine collecting data on traffic patterns over time across various city blocks. Can you determine whether a sudden increase in traffic indicates an accident or simply the regular rush hour? Does it matter if the spike occurred at one block or multiple locations? Alternatively, consider monitoring network traffic between servers in a cluster. Are you able to distinguish whether an increase in activity signifies a distributed denial of service (DDoS) attack or if it is merely benign traffic?

An anomaly is a data point that diverges from a dataset that otherwise exhibits a clear pattern or structure. Examples of anomalies include unexpected spikes in time series data, breaks in periodicity, and data points that cannot be classified. The presence of these anomalies can drastically increase the complexity of a machine learning task, since the “normal” data can often be described with a simple model.
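
To make this concrete, the following small, self-contained snippet (purely illustrative, not part of the SageMaker workflow) injects a single spike into an otherwise periodic signal; the spike is exactly the kind of point an anomaly detector should flag:

import numpy as np

# A periodic "normal" signal: one sine cycle every 100 points
t = np.arange(1000)
signal = np.sin(2 * np.pi * t / 100)

# Inject a single anomalous spike at t=500
signal[500] += 5.0

# The spike stands far outside the otherwise simple structure
print(signal[495:506].round(2))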

The Amazon SageMaker Random Cut Forest Algorithm

The Amazon SageMaker Random Cut Forest (RCF) algorithm operates as an unsupervised method for identifying anomalous data points within a dataset. Specifically, the RCF algorithm assigns an anomaly score to each data point. A low anomaly score indicates a “normal” data point, while a high score suggests an anomaly. The definitions of “low” and “high” are application-dependent, but it is commonly accepted that scores exceeding three standard deviations from the mean are marked as anomalous.
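
Applying such a cutoff is straightforward once the scores are in hand. The following is a minimal sketch, assuming the model’s anomaly scores have been collected into a NumPy array; the sample values here are fabricated for illustration:

import numpy as np

def flag_anomalies(scores, n_std=3.0):
    # Flag points whose score exceeds the mean by n_std standard deviations
    cutoff = scores.mean() + n_std * scores.std()
    return scores > cutoff

# Fabricated example: ordinary scores plus one clear outlier at the end
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(1.0, 0.1, size=99), [8.0]])
print(np.where(flag_anomalies(scores))[0])  # -> [99]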

The RCF algorithm begins by sampling from the training data. Due to potential size constraints, a technique known as reservoir sampling is employed to efficiently gather samples from a data stream. These samples are then distributed to each tree within the random cut forest. Each subsample is arranged into a binary tree by randomly partitioning bounding boxes until every leaf node represents a single data point. The anomaly score for a given data point is inversely related to its average depth across the forest. For further details, refer to the SageMaker RCF documentation page. The underlying algorithm is based on research noted in the references at the conclusion of this article.
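
Reservoir sampling itself is straightforward to sketch. The following is a generic, textbook version of the technique (Algorithm R), shown here only to build intuition; it is not the SageMaker implementation:

import random

def reservoir_sample(stream, k):
    # Maintain a uniform random sample of k items from a stream of unknown length
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Replace an existing sample with probability k/(i+1), which keeps
            # every item seen so far equally likely to remain in the reservoir
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(10000), k=200)
print(len(sample))  # -> 200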

Practical Example: Analyzing New York City Taxi Ridership Data

We will demonstrate the Amazon SageMaker RCF algorithm on a dataset covering six months of taxi ridership in New York City, publicly available as the Numenta Anomaly Benchmark (NAB) New York City Taxi dataset. The following code examples illustrate how to train a SageMaker RCF model and use it to detect anomalies in the ridership data.

Acquiring, Reviewing, and Storing Data in Amazon S3

To begin, we obtain and visualize the NAB dataset. This data spans approximately six months of taxi ridership in New York City, with each data point representing a 30-minute interval of ridership volume.

import pandas
import urllib.request

data_filename = 'nyc_taxi.csv'
data_source = 'https://raw.githubusercontent.com/numenta/NAB/master/data/realKnownCause/nyc_taxi.csv'

# Download the raw CSV and load it into a DataFrame
urllib.request.urlretrieve(data_source, data_filename)
taxi_data = pandas.read_csv(data_filename, delimiter=',')

# Plot ridership volume over time
taxi_data.plot(title='Taxi Ridership in NYC')

As anticipated, taxi ridership follows a roughly periodic pattern: traffic is higher during the day, peaking at typical commuting hours, and lower late at night. There is also a weekly trend, with more riders on weekdays than on weekends. A closer look at the plot reveals several anomalous data points; after all, humans are adept at visually spotting such irregularities. Notably, these anomalies coincide with real events: the New York City Marathon at t=5954, New Year’s Eve at t=8833, and a significant snowstorm at t=10090.

To prepare the data for the Amazon SageMaker algorithms, we need to convert the CSV data into RecordIO Protobuf format and upload it to our Amazon S3 bucket.

def convert_and_upload_training_data(
    ndarray, bucket, prefix, filename='data.pbr'):
    import boto3
    import os
    from sagemaker.amazon.common import numpy_to_record_serializer

    # Convert Numpy array to Protobuf RecordIO format
    serializer = numpy_to_record_serializer()
    buffer = serializer(ndarray)

    # Upload to S3
    s3_object = os.path.join(prefix, 'train', filename)
    boto3.Session().resource('s3').Bucket(bucket).Object(s3_object).upload_fileobj(buffer)
    s3_path = 's3://{}/{}'.format(bucket, s3_object)
    return s3_path

bucket = '<my-s3-bucket>' # <-- replace with your own bucket name
prefix = 'sagemaker/randomcutforest'
s3_train_data = convert_and_upload_training_data(
    taxi_data['value'].values.reshape(-1, 1),  # use the ridership counts only, not the timestamps
    bucket,
    prefix)

Training the Model

Before we can train our SageMaker Random Cut Forest model using this dataset, we need to define various training job parameters, including the Amazon Elastic Container Registry (ECR) Docker container for the Amazon SageMaker Random Cut Forest algorithm, the location of our training data, and the instance type designated for the algorithm’s execution. Additionally, we need to set algorithm-specific hyperparameters. The two key hyperparameters in the Amazon SageMaker RCF algorithm are num_trees and num_samples_per_tree.

The num_trees hyperparameter sets the number of trees in the RCF model. Each tree learns a separate model from a subsample of the input training data and assigns each data point an anomaly score that is approximately inversely proportional to the point’s depth in the tree; the overall anomaly score produced by the RCF model is the average of the scores from all constituent trees. The num_samples_per_tree hyperparameter specifies how many randomly sampled training data points are sent to each tree. A good choice for num_samples_per_tree is one whose reciprocal approximates the expected ratio of anomalous to normal data points in the dataset. For additional information, see Amazon SageMaker RCF – How it Works.
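
As a quick sanity check on that rule of thumb, suppose we expect roughly 0.5% of the data points to be anomalous (an assumption made purely for illustration); the reciprocal then recovers the value used in the training code below:

# If ~0.5% of points are anomalous, 1/num_samples_per_tree should be ~0.005
expected_anomaly_fraction = 0.005
num_samples_per_tree = int(round(1 / expected_anomaly_fraction))
print(num_samples_per_tree)  # -> 200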

The following code snippet trains the SageMaker RCF model on the taxi data, utilizing 50 trees and 200 data points for each tree.

import boto3
import sagemaker

containers = {
    'us-west-2': '174872318107.dkr.ecr.us-west-2.amazonaws.com/randomcutforest:latest',
    'us-east-1': '382416733822.dkr.ecr.us-east-1.amazonaws.com/randomcutforest:latest',
    'us-east-2': '404615174143.dkr.ecr.us-east-2.amazonaws.com/randomcutforest:latest',
    'eu-west-1': '438346466558.dkr.ecr.eu-west-1.amazonaws.com/randomcutforest:latest'}
region_name = boto3.Session().region_name
container = containers[region_name]

session = sagemaker.Session()

rcf = sagemaker.estimator.Estimator(
    container,
    sagemaker.get_execution_role(),
    output_path='s3://{}/{}/output'.format(bucket, prefix),
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',  # illustrative; any supported training instance works
    sagemaker_session=session)

# The two key RCF hyperparameters; feature_dim matches our one-dimensional data
rcf.set_hyperparameters(
    num_trees=50,
    num_samples_per_tree=200,
    feature_dim=1)

s3_train_input = sagemaker.session.s3_input(
    s3_train_data,
    distribution='ShardedByS3Key',
    content_type='application/x-recordio-protobuf')

rcf.fit({'train': s3_train_input})
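
Once training completes, the model can be deployed to a real-time endpoint and used to score the ridership data. The following is a minimal sketch using the v1-era SageMaker Python SDK predictor helpers; the instance type is an illustrative choice, and the final line tears the endpoint down to avoid ongoing charges:

from sagemaker.predictor import csv_serializer, json_deserializer

# Deploy the trained model to a real-time inference endpoint
rcf_inference = rcf.deploy(
    initial_instance_count=1,
    instance_type='ml.m4.xlarge')

# Send CSV rows in, get JSON anomaly scores back
rcf_inference.content_type = 'text/csv'
rcf_inference.serializer = csv_serializer
rcf_inference.deserializer = json_deserializer

results = rcf_inference.predict(taxi_data['value'].values.reshape(-1, 1))
scores = [datum['score'] for datum in results['scores']]

# Clean up the endpoint when finished
session.delete_endpoint(rcf_inference.endpoint)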


