Near-Real-Time Fraud Detection with Amazon Redshift Streaming Ingestion and Redshift ML
Data warehouses and the analytics they enable have grown in importance as organizations increasingly depend on these systems for essential operational decisions and long-term strategic initiatives. Historically, data warehouses have been updated through batch processing, whether monthly, weekly, or daily, and businesses have extracted insights from them on that schedule.
However, organizations are now recognizing that near-real-time data ingestion combined with advanced analytics can unlock new possibilities. For instance, a financial institution can detect potential fraudulent credit card transactions by employing an anomaly detection system in a near-real-time setup instead of relying on batch processing.
In this article, we explore how Amazon Redshift can provide both streaming ingestion and machine learning (ML) predictions on a single platform. Amazon Redshift is a robust, scalable, and secure cloud data warehouse that makes it simple and cost-effective to analyze data using standard SQL.
Moreover, Amazon Redshift ML enables data analysts and database developers to create, train, and deploy ML models using familiar SQL commands within Amazon Redshift data warehouses. We’re thrilled to introduce Amazon Redshift Streaming Ingestion for Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK), which lets you ingest data directly from a Kinesis data stream or Kafka topic without first staging it in Amazon Simple Storage Service (Amazon S3). Streaming ingestion provides low-latency ingestion, typically within seconds, while processing hundreds of megabytes of data per second into your data warehouse.
This article will demonstrate how Amazon Redshift empowers you to develop near-real-time ML predictions by utilizing Amazon Redshift streaming ingestion and Redshift ML features with standard SQL.
Solution Overview
By following the steps detailed in this article, you’ll establish a producer streaming application on an Amazon Elastic Compute Cloud (Amazon EC2) instance that simulates credit card transactions and pushes data to Kinesis Data Streams in real time. You’ll set up an Amazon Redshift Streaming Ingestion materialized view on Amazon Redshift, which will receive the streaming data. Additionally, you will train and build a Redshift ML model to generate real-time inferences from the streaming data.
The architecture and process flow are visually represented in the following diagram. The step-by-step process includes:
- The EC2 instance simulates a credit card transaction application that inserts credit card transactions into the Kinesis data stream.
- The data stream retains the incoming credit card transaction data.
- An Amazon Redshift Streaming Ingestion materialized view is established on top of the data stream, which automatically ingests the streaming data into Amazon Redshift.
- You will build, train, and deploy an ML model using Redshift ML, trained on historical transaction data (a minimal SQL sketch follows this list).
- The streaming data is transformed, and ML predictions are generated.
- Alerts can be sent to customers, or application updates can be made to manage risk.
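To make the Redshift ML step concrete, here is a minimal sketch of a CREATE MODEL statement, not the exact model used in this walkthrough. The table name, column names, and S3 bucket are assumptions you would replace with your own:

CREATE MODEL cust_payment_fraud_detection
FROM (
    SELECT card_type,                    -- hypothetical feature columns
           merchant_category,
           amount,
           is_fraud                      -- label column
    FROM customer_transactions_history   -- hypothetical historical table
)
TARGET is_fraud                          -- column the model learns to predict
FUNCTION fn_customer_cc_fraud_detect     -- SQL function generated for inference
IAM_ROLE default                         -- role must allow SageMaker and S3 access
SETTINGS (S3_BUCKET 'your-redshift-ml-bucket'); -- bucket for training artifacts

Behind the scenes, Redshift ML uses Amazon SageMaker Autopilot to train the model and then registers fn_customer_cc_fraud_detect as a SQL function you can call directly in queries.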
This walkthrough utilizes simulated credit card transaction data that is entirely fictitious. The customer dataset is also imaginary, generated with random data functions.
Prerequisites
- Create an Amazon Redshift cluster.
- Configure the cluster to employ Redshift ML.
- Create an AWS Identity and Access Management (IAM) user.
- Update the IAM role linked to the Redshift cluster to permit read access to the Kinesis data stream. For more details on the required policy, refer to Getting Started with Streaming Ingestion; a sample policy is sketched after this list.
- Create an m5.4xlarge EC2 instance. While we tested the producer application with an m5.4xlarge instance, other instance types may be used. When setting it up, use the amzn2-ami-kernel-5.10-hvm-2.0.20220426.0-x86_64-gp2 AMI.
- Confirm that Python 3 is installed on the EC2 instance by running the following command (the simulator script runs only on Python 3):
python3 --version
- Install the necessary packages to execute the simulator program:
sudo yum install python3-pip
pip3 install numpy
pip3 install pandas
pip3 install matplotlib
pip3 install seaborn
pip3 install boto3
- Configure the AWS CLI on the EC2 instance with the credentials of the IAM user created earlier by running the following command and entering the access keys when prompted:
aws configure
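For reference, the streaming ingestion policy attached to the Redshift cluster’s IAM role (see the prerequisite above) might look like the following sketch; the Region, account ID, and stream ARN are placeholders, and Getting Started with Streaming Ingestion remains the authoritative source:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadStream",
            "Effect": "Allow",
            "Action": [
                "kinesis:DescribeStreamSummary",
                "kinesis:GetShardIterator",
                "kinesis:GetRecords",
                "kinesis:DescribeStream"
            ],
            "Resource": "arn:aws:kinesis:us-west-2:xxxxxxxxxxxx:stream/cust-payment-txn-stream"
        },
        {
            "Sid": "ListStream",
            "Effect": "Allow",
            "Action": [
                "kinesis:ListStreams",
                "kinesis:ListShards"
            ],
            "Resource": "*"
        }
    ]
}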
Setting Up Kinesis Data Streams
Amazon Kinesis Data Streams is a highly scalable and reliable real-time data streaming service. It can continuously capture gigabytes of data per second from countless sources, such as website clickstreams, database event streams, financial transactions, social media feeds, IT logs, and location-tracking events. The data collected is available in milliseconds, allowing for real-time analytics applications like dashboards, anomaly detection, and dynamic pricing. Kinesis Data Streams is favored because it’s a serverless solution that adjusts based on usage.
Creating a Kinesis Data Stream
To begin, create a Kinesis data stream to receive the streaming data:
- In the Amazon Kinesis console, select Data Streams from the navigation pane.
- Click on Create Data Stream.
- Enter “cust-payment-txn-stream” as the Data Stream Name.
- Select “On-demand” for Capacity Mode.
- For the remaining options, accept the defaults and follow the prompts to complete the setup.
- Make a note of the ARN for the created data stream for use when defining your IAM policy.
Setting Up Permissions
For your streaming application to write to Kinesis Data Streams, it requires access. Use the following policy statement to grant the simulator process access to the data stream. Substitute the ARN of the data stream you saved earlier:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt123",
            "Effect": "Allow",
            "Action": [
                "kinesis:DescribeStream",
                "kinesis:PutRecord",
                "kinesis:PutRecords",
                "kinesis:GetShardIterator",
                "kinesis:GetRecords",
                "kinesis:ListShards",
                "kinesis:DescribeStreamSummary"
            ],
            "Resource": [
                "arn:aws:kinesis:us-west-2:xxxxxxxxxxxx:stream/cust-payment-txn-stream"
            ]
        }
    ]
}
Configuring the Stream Producer
Before consuming streaming data in Amazon Redshift, a streaming data source must write data to the Kinesis data stream. This post utilizes a custom-built data generator alongside the AWS SDK for Python (Boto3) to publish data to the data stream. For setup instructions, refer to the Producer Simulator. This simulator process publishes the streaming data to the previously created data stream (cust-payment-txn-stream).
Configuring the Stream Consumer
Next, we will discuss configuring the stream consumer, which is the Amazon Redshift streaming ingestion view. Amazon Redshift Streaming Ingestion provides low-latency, high-speed ingestion of streaming data from Kinesis Data Streams into an Amazon Redshift materialized view.
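As a minimal sketch of that setup, assuming the data stream created earlier and an IAM role with the read permissions described previously, the external schema and materialized view might look like the following (the schema and view names are placeholders):

CREATE EXTERNAL SCHEMA custpaytxn
FROM KINESIS
IAM_ROLE default; -- or the ARN of a role that can read the stream

CREATE MATERIALIZED VIEW cust_payment_tx_stream AUTO REFRESH YES AS
SELECT approximate_arrival_timestamp,
       partition_key,
       shard_id,
       sequence_number,
       JSON_PARSE(FROM_VARBYTE(kinesis_data, 'utf-8')) AS payment_data -- payload parsed into SUPER
FROM custpaytxn."cust-payment-txn-stream";

Once the view is refreshing, incoming rows can be scored with the inference function from the earlier Redshift ML sketch. The payload attributes below are assumptions about the simulator’s JSON shape:

SELECT sequence_number,
       payment_data.amount::DECIMAL(10,2) AS amount,
       fn_customer_cc_fraud_detect(
           payment_data.card_type::VARCHAR,
           payment_data.merchant_category::VARCHAR,
           payment_data.amount::DECIMAL(10,2)
       ) AS is_fraud_predicted
FROM cust_payment_tx_stream;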
In conclusion, this approach enables effective fraud detection in near-real time on a single platform, enhancing operational efficiency and strengthening customer trust.