Have you thought about implementing anomaly detection in your organization? Anomaly detection is a valuable technique for identifying rare items, events, or observations that diverge significantly from the majority of a dataset. Its applications are extensive: detecting unusual purchases or cyber intrusions in banking, identifying malignant tumors in MRI scans, recognizing fraudulent insurance claims, spotting irregular machine behavior in manufacturing, and flagging unusual patterns in network traffic that may indicate an intrusion.
While many commercial products exist for anomaly detection, you can easily build your own system using Amazon SageMaker, AWS Glue, and AWS Lambda. Amazon SageMaker is a fully managed platform for quickly building, training, and deploying machine learning models at any scale. AWS Glue is a fully managed ETL service that simplifies data preparation for analytics, and AWS Lambda is a serverless compute platform well suited to real-time processing. By combining these services, your model can be updated automatically with new data, which keeps it accurate for real-time anomaly detection.
In this post, I’ll outline how to utilize AWS Glue for data preparation and train an anomaly detection model using Amazon SageMaker. For demonstration purposes, I will store a sample of data from the NAB NYC Taxi dataset in Amazon DynamoDB, which will be streamed in real time using an AWS Lambda function.
The proposed solution offers several advantages:
- You can make the most of existing resources for anomaly detection. For instance, if you're already using Amazon DynamoDB Streams for disaster recovery or other purposes, you can repurpose that data for anomaly detection, and the otherwise underutilized standby (replica) table can serve as a source of training data.
- The model can be retrained automatically with new data on a regular basis, requiring no manual intervention (see the scheduling sketch after this list).
- Utilizing the Random Cut Forest algorithm available in Amazon SageMaker is straightforward. Amazon SageMaker provides flexible distributed training options tailored to your specific workflows in a secure and scalable environment.
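To make the automatic retraining concrete, one way to run the AWS Glue job on a schedule is a scheduled Glue trigger created with boto3. The sketch below is illustrative only; the trigger name, Glue job name, and cron expression are assumptions and should be replaced with your own.
import boto3

glue = boto3.client('glue', region_name='us-east-1')

## Hypothetical trigger and job names; replace with your own Glue job and preferred schedule.
glue.create_trigger(
    Name='retrain-anomaly-model-nightly',
    Type='SCHEDULED',
    Schedule='cron(0 3 * * ? *)',  # every day at 03:00 UTC
    Actions=[{'JobName': 'taxi-ridership-anomaly-training'}],
    StartOnCreation=True
)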
Solution Architecture
The diagram below illustrates the overall solution architecture.
The data flow through this architecture follows these steps:
- A source DynamoDB table captures item-level changes in a DynamoDB stream, and the changes are replicated to a target table in another Region (in this example, using DynamoDB Global Tables).
- An AWS Glue job regularly retrieves data from the target DynamoDB table and executes a training job with Amazon SageMaker to create or update model artifacts stored in Amazon S3.
- The same AWS Glue job deploys the updated model to the Amazon SageMaker endpoint for real-time anomaly detection utilizing Random Cut Forest.
- An AWS Lambda function polls data from the DynamoDB stream and invokes the Amazon SageMaker endpoint to obtain predictions.
- The Lambda function alerts user applications upon detecting anomalies (a minimal sketch of steps 4 and 5 follows this list).
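To make steps 4 and 5 concrete before diving into the details, the following is a minimal sketch of a Lambda handler triggered by the DynamoDB stream. The endpoint name, the ridecount attribute type, and the score threshold are assumptions for illustration; the "Real-Time Anomaly Detection" section walks through the actual implementation.
import json
import boto3

## Hypothetical values; replace with the endpoint your AWS Glue job deploys and a threshold tuned for your data.
ENDPOINT_NAME = 'taxi-ridership-rcf'
SCORE_THRESHOLD = 3.0

runtime = boto3.client('sagemaker-runtime')

def lambda_handler(event, context):
    ## Score each newly inserted item against the Random Cut Forest endpoint.
    for record in event.get('Records', []):
        if record['eventName'] != 'INSERT':
            continue
        new_image = record['dynamodb']['NewImage']
        ## Assumes ridecount is stored as a number; use 'S' instead of 'N' if it is a string attribute.
        ride_count = new_image['ridecount']['N']
        response = runtime.invoke_endpoint(
            EndpointName=ENDPOINT_NAME,
            ContentType='text/csv',
            Body=ride_count
        )
        score = json.loads(response['Body'].read())['scores'][0]['score']
        if score > SCORE_THRESHOLD:
            ## Alert the user application here, for example by publishing to an Amazon SNS topic.
            print(f"Anomaly detected: ridecount={ride_count}, score={score}")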
This blog is divided into two sections. The first section, “Creating the Auto-Updating Model,” explains how to automate the model-building steps with AWS Glue; all sample scripts in that section run within a single AWS Glue job. The second section, “Real-Time Anomaly Detection,” demonstrates how the AWS Lambda function uses the deployed endpoint to detect anomalies in real time.
Creating the Auto-Updating Model
This section discusses how AWS Glue reads a DynamoDB table and automatically trains and deploys an Amazon SageMaker model. I will assume that the DynamoDB stream is enabled and that items are currently being written to the stream. If you need help setting this up, refer to the following resources: Capturing Table Activity with DynamoDB Streams, DynamoDB Streams and AWS Lambda Triggers, and Global Tables.
In this example, a DynamoDB table named “taxi_ridership” located in the us-west-2 Region is replicated to a table with the same name in the us-east-1 Region using DynamoDB Global Tables.
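If you would rather script this prerequisite than use the console, the sketch below shows one way to create the replicated taxi_ridership tables with boto3 using the original (2017.11.29) version of DynamoDB Global Tables, which requires identical, empty tables with streams enabled in each Region. The key schema and capacity settings are assumptions for illustration.
import boto3

## Hypothetical key schema and capacity; adjust to your own table design.
table_definition = dict(
    TableName='taxi_ridership',
    AttributeDefinitions=[{'AttributeName': 'transaction_id', 'AttributeType': 'S'}],
    KeySchema=[{'AttributeName': 'transaction_id', 'KeyType': 'HASH'}],
    ProvisionedThroughput={'ReadCapacityUnits': 5, 'WriteCapacityUnits': 5},
    ## Global tables require a stream that captures new and old images.
    StreamSpecification={'StreamEnabled': True, 'StreamViewType': 'NEW_AND_OLD_IMAGES'}
)

## Create an identical, empty table in each Region and wait for it to become active.
for region in ('us-west-2', 'us-east-1'):
    client = boto3.client('dynamodb', region_name=region)
    client.create_table(**table_definition)
    client.get_waiter('table_exists').wait(TableName='taxi_ridership')

## Link the tables into a single global table (run once).
boto3.client('dynamodb', region_name='us-west-2').create_global_table(
    GlobalTableName='taxi_ridership',
    ReplicationGroup=[{'RegionName': 'us-west-2'}, {'RegionName': 'us-east-1'}]
)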
Creating an AWS Glue Job for Data Preparation
To prepare the data for model training, the AWS Glue job reads from the target DynamoDB table using the create_dynamic_frame_from_options() function with the dynamodb connection type. It is advisable to select only the columns necessary for model training and write them into Amazon S3 as CSV files. This can be accomplished with the ApplyMapping.apply() function in AWS Glue, mapping only the transaction_id and ridecount columns.
When calling the write_dynamic_frame.from_options function, include the option format_options = {"writeHeader": False, "quoteChar": "-1"}, since column names and double quotation marks are unnecessary for training. Additionally, create the AWS Glue job in the same Region as the target DynamoDB table (in this case, us-east-1). For detailed information on creating an AWS Glue job, see Adding Jobs in AWS Glue.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
my_region = '<region name>'
my_bucket = '<bucket name>'
my_project = '<project name>'
my_train_data = f"s3://{my_bucket}/{my_project}/taxi-ridership-rawdata/"
my_dynamodb_table = "taxi_ridership"
## Read raw(source) data from target DynamoDB
raw_data_dyf = glueContext.create_dynamic_frame_from_options("dynamodb", {"dynamodb.input.tableName" : my_dynamodb_table , "dynamodb.throughput.read.percent" : "0.7" }, transformation_ctx="raw_data_dyf")
## Write necessary columns into S3 as CSV format for creating Random Cut Forest(RCF) model
selected_data_dyf = ApplyMapping.apply(frame = raw_data_dyf, mappings = [("transaction_id", "string", "transaction_id", "string"), ("ridecount", "string", "ridecount", "string")], transformation_ctx = "selected_data_dyf")
datasink = glueContext.write_dynamic_frame.from_options(frame=selected_data_dyf, connection_type="s3", connection_options={"path": my_train_data}, format="csv", format_options={"writeHeader": False, "quoteChar": "-1"}, transformation_ctx="datasink")
## Commit the job so AWS Glue records the run (required when using job bookmarks).
job.commit()
This AWS Glue job generates CSV files at the specified path in Amazon S3 (for example, s3://<bucket name>/<project name>/taxi-ridership-rawdata/).
Running the Training Job and Updating the Model
Once the data is prepared, you can initiate a training job on Amazon SageMaker. To submit the training job, import the boto3 package, which is bundled with your AWS Glue ETL script, allowing you to use the low-level AWS SDK for Python. For more information on creating a training job, see Create a Training Job.
The create_training_job function generates model artifacts in the specified S3 path, which are needed to create the model in the subsequent step.
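A minimal sketch of submitting the training job with boto3 from within the AWS Glue script (reusing the my_region, my_bucket, my_project, and my_train_data variables defined earlier) might look like the following. The role ARN, instance type, hyperparameter values, and the Random Cut Forest training image URI (shown for us-east-1) are assumptions; look up the image URI for your Region and choose hyperparameters that match your data.
import time
import boto3

sagemaker = boto3.client('sagemaker', region_name=my_region)
training_job_name = f"{my_project}-rcf-{time.strftime('%Y%m%d%H%M%S')}"

sagemaker.create_training_job(
    TrainingJobName=training_job_name,
    AlgorithmSpecification={
        ## Assumed Random Cut Forest image for us-east-1; use the image URI for your Region.
        'TrainingImage': '382416733822.dkr.ecr.us-east-1.amazonaws.com/randomcutforest:latest',
        'TrainingInputMode': 'File'
    },
    RoleArn='<SageMaker execution role ARN>',
    HyperParameters={
        ## feature_dim must match the number of feature columns in the training CSV.
        'feature_dim': '1',
        'num_trees': '50',
        'num_samples_per_tree': '256'
    },
    InputDataConfig=[{
        'ChannelName': 'train',
        'ContentType': 'text/csv;label_size=0',
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': my_train_data,
                'S3DataDistributionType': 'ShardedByS3Key'
            }
        }
    }],
    OutputDataConfig={'S3OutputPath': f"s3://{my_bucket}/{my_project}/model/"},
    ResourceConfig={'InstanceType': 'ml.m4.xlarge', 'InstanceCount': 1, 'VolumeSizeInGB': 10},
    StoppingCondition={'MaxRuntimeInSeconds': 3600}
)

## Block until training finishes so the job can create or update the model afterwards.
sagemaker.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=training_job_name)
Once the waiter returns, the same boto3 client can create or update the model and endpoint (create_model, create_endpoint_config, and create_endpoint or update_endpoint) so that the Lambda function always invokes the most recently trained model.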