Data Preparation for Machine Learning with Amazon Timestream

The concept of precognition, or the ability to foresee future events, has long intrigued humanity. While we may not have reached that level yet, time series forecasting offers a compelling alternative. The human brain is adept at anticipating future outcomes by reflecting on past experiences; however, it often struggles with the vast volumes of data produced by modern enterprises. Imagine leveraging a machine to record sequences of past events from countless sources, analyze that information, and generate predictions for your business.

Consider a software as a service (SaaS) company serving thousands of clients across diverse sectors such as e-commerce, energy, and aviation. This SaaS provider runs Amazon Elastic Compute Cloud (Amazon EC2) instances distributed across multiple Regions and Availability Zones to ensure low-latency, highly available services. Both the provider and its clients demand in-depth visibility into their ever-evolving operational activities. By aggregating operational metrics over various timeframes, businesses can uncover trends and patterns that inform adjustments to their infrastructure, forecast the capacity needed for optimal scaling and cost efficiency, and thereby transform collected data into actionable insights.

To efficiently store, transform, and analyze this operational data, the SaaS provider requires a database that can:

  • Handle trillions of data points daily in a cost-effective manner
  • Facilitate data preparation at a petabyte scale for data science and machine learning (ML)

While it’s technically feasible to store time series data in traditional databases, scalability and usability are paramount. For a SaaS provider with a large customer base and numerous devices generating time-based metrics, database performance and growth can become significant challenges. Amazon Timestream is specifically designed as a time series database that enables the collection and storage of millions of operational metrics per second, allowing for real-time data analysis that enhances application performance and availability.
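
To make the ingestion side concrete, here is a minimal sketch (the database, table, and dimension values are placeholders, not part of the original setup) that writes a single cpu_user data point with the AWS SDK for Python (Boto3):

import time
import boto3

# Placeholder Region; the database and table must already exist in Timestream
write_client = boto3.client("timestream-write", region_name="eu-west-1")

record = {
    "Dimensions": [
        {"Name": "region", "Value": "eu-central-1"},
        {"Name": "instance_name", "Value": "i-example-apollo-0001"},  # illustrative value
    ],
    "MeasureName": "cpu_user",
    "MeasureValue": "37.5",
    "MeasureValueType": "DOUBLE",
    "Time": str(int(time.time() * 1000)),  # current time in epoch milliseconds
    "TimeUnit": "MILLISECONDS",
}

write_client.write_records(
    DatabaseName="<db_name>", TableName="<table_name>", Records=[record]
)

In practice, records are batched per write_records call, which is how the ingestor tool used later in this post reaches high throughput.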

In this article, we delve into utilizing Timestream within the framework of the Cross Industry Standard Process for Data Mining (CRISP-DM), focusing on the Data Understanding and Data Preparation stages—critical prerequisites for developing accurate ML models. CRISP-DM outlines key tasks in these phases, including collecting, describing, exploring, and verifying data in the Data Understanding phase, as well as selecting, cleaning, constructing, integrating, and formatting data in the Data Preparation phase.

The powerful, purpose-built query engine of Timestream, equipped with specialized time series functions, streamlines the exploration and verification of temporal data. The Timestream SQL query engine supports various data preparation operations, eliminating the need for batch data processing.
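
For example, a single statement can select, clean, and aggregate raw data entirely inside the query engine. The following sketch (database, table, and measure names are placeholders matching the sample generator used later) bins cpu_user into one-minute averages over the last hour, with no external ETL step:

-- Average cpu_user per one-minute bin over the last hour
SELECT bin(time, 1m) AS binned_time,
       instance_name,
       avg(measure_value::double) AS avg_cpu_user
  FROM "<db_name>"."<table_name>"
 WHERE measure_name = 'cpu_user'
   AND time > ago(1h)
 GROUP BY bin(time, 1m), instance_name
 ORDER BY binned_time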

Benefits of Timestream Query Engine

Here are several benefits of employing the Timestream query engine compared to traditional batch processing methods:

  • Performance: Performing data preparation through queries typically demands less computational effort on the ML side, as the compute resources are closer to storage. This is in stark contrast to multi-step extract, transform, and load (ETL) processes. The serverless architecture of Timestream allows for fully decoupled data ingestion, storage, and query processing systems that can scale independently.
  • Scalability and Elasticity: Scaling data preparation to handle millions of time series over extended periods in a conventional setup necessitates self-managed horizontal and vertical scaling of infrastructure, often requiring extensive in-memory database storage to meet performance demands. Timestream alleviates this burden by being serverless and scaling on demand to accommodate the needs of data scientists.
  • Real-time Application and Query Sharing: Query-based data preparation uses the most recent data, whereas the results of lengthy ETL jobs may already be outdated. Maintaining preprocessing scripts across different technology stacks or programming languages can also be cumbersome. A Timestream query is reusable across all supported APIs and SDKs, including Java, Go, Python, Node.js, and .NET, and queries can be shared across phases and tools for visual data understanding in platforms like Amazon QuickSight and Grafana, as the sketch after this list shows.
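
To illustrate that reuse, the same query text can be submitted unchanged through any SDK. The following minimal sketch (database and table names are placeholders; the binning query mirrors the earlier example) runs it with the AWS SDK for Python (Boto3) and prints the scalar values of each row:

import boto3

# Placeholder names; substitute your own database and table
QUERY = """SELECT bin(time, 1m) AS binned_time,
       avg(measure_value::double) AS avg_cpu_user
  FROM "<db_name>"."<table_name>"
 WHERE measure_name = 'cpu_user' AND time > ago(1h)
 GROUP BY bin(time, 1m) ORDER BY binned_time"""

client = boto3.client("timestream-query", region_name="eu-west-1")

# Paginate through the result set and print each row's scalar values
paginator = client.get_paginator("query")
for page in paginator.paginate(QueryString=QUERY):
    for row in page["Rows"]:
        print([datum.get("ScalarValue") for datum in row["Data"]])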

DevOps Data Generator

For this discussion, we enhanced the Timestream continuous ingestor tool from GitHub by incorporating a signal generator. This generator mixes a random cpu_user signal with periodic sinusoidal or sawtooth signals, allowing us to define the signal-to-noise ratio; for details, see the tool's README file on GitHub. The tool generates 20 different metrics from the DevOps domain, but we will focus on cpu_user (CPU utilization ranging from 0–100) and network_bytes_in/out (network traffic ranging from 0 to 5 KB), whose time series share identical dimensions (Region, cell, silo, Availability Zone, microservice name, and instance name).
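
The mixing step can be pictured with a short Python sketch (illustrative only; the function and parameter names here are ours, not the tool's): it blends a normalized sawtooth with uniform noise at a chosen signal-to-noise ratio and scales the result into the 0–100 cpu_user range.

import numpy as np

def sawtooth(t, period):
    # Normalized sawtooth in [0, 1): rises linearly, resets each period
    return (t % period) / period

def mix_cpu_user(t, period=60.0, snr=4.0, rng=None):
    # Weight signal vs. noise by the signal-to-noise ratio,
    # then scale into the 0-100 cpu_user range
    rng = rng or np.random.default_rng()
    signal = sawtooth(t, period)
    noise = rng.random(t.shape)
    return 100.0 * (snr * signal + noise) / (snr + 1.0)

t = np.arange(0, 300, 1.0)               # five minutes of one-second samples
cpu_user = mix_cpu_user(t, period=60.0, snr=4.0)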

To create and visualize the simulated data, execute the following command for several minutes to generate and ingest sample data into Timestream:

python3 timestream_sample_continuous_data_ingestor_application.py \
    --database-name <db_name> \
    --table-name <table_name> \
    --region <timestream region e.g. 'us-east-1'> \
    --concurrency 1 \
    --include-region 'eu-central-1' \
    --include-ms 'apollo' 1 \
    --saw-signal-cpu 100 \
    --saw-frq-cpu 'm'

The following figure was generated using Amazon SageMaker Studio, as outlined in the documentation on Timestream and SageMaker integrations. Below is the Python code used to query Timestream and display the results:

import timestreamquery as timestream
import matplotlib.pyplot as plt

# Timestream configuration (createQueryClient takes the AWS Region here)
ENDPOINT = "eu-west-1"
PROFILE = "default"
DB_NAME = "<db_name>"
TABLE_NAME = "<table_name>"
client = timestream.createQueryClient(ENDPOINT, profile=PROFILE)

# Timestream raw series query: CTE fragment selecting the first 90 points
# of one measure for a single instance; {0}=database, {1}=table,
# {2}=measure name (also used as the output column alias)
rawseries = """
rawseries AS (
  SELECT time,
         measure_value::double {2}
    FROM {0}.{1}
   WHERE measure_name = '{2}'
     AND microservice_name = 'apollo'
     AND instance_name = 'i-AUa00Zt2-apollo-0000.amazonaws.com'
   ORDER BY time LIMIT 90)"""

query = """WITH {} SELECT * FROM rawseries""" 
            .format(rawseries).format(DB_NAME, TABLE_NAME, "cpu_user")
numCPUA = timestream.executeQueryAndReturnAsDataframe(client, query, True)

query = """WITH {} SELECT * FROM rawseries""" 
            .format(rawseries).format(DB_NAME, TABLE_NAME, "network_bytes_out")
numNETB = timestream.executeQueryAndReturnAsDataframe(client, query, True)

# Visualizing and Plotting
plt.rcParams['figure.figsize'] = [15, 10]
fig, ax = plt.subplots(2)

ax[0].title.set_text('CPU User (+) as Recorded')
ax[0].plot(numCPUA['time'], numCPUA['cpu_user'], color='darkorange',
           marker='+', markersize=12, mew=3, linewidth=0.5, alpha=0.8)
ax[0].grid(which='both', axis='both', linestyle='--')

ax[1].title.set_text('Bytes Out (o) and Bytes In (x) as Recorded')
ax[1].plot(numNETB['time'], numNETB['network_bytes_out'], color='black',
           marker='o', markersize=10, mew=4, linewidth=0.5, alpha=0.8)
ax[1].plot(numNETA['time'], numNETA['network_bytes_in'], color='steelblue',
           marker='x', markersize=10, mew=4, linewidth=0.5, alpha=0.8)
ax[1].grid(which='both', axis='both', linestyle='--')
plt.show()


