Deriving real-time insights over petabytes of time series data with Amazon Timestream
Date: 25 NOV 2020
Category: Amazon Timestream
Time series data is one of the fastest-growing data categories across industries such as application monitoring, DevOps, clickstream analytics, network traffic monitoring, industrial IoT, consumer IoT, and manufacturing. Organizations want to track billions of time series data points across hundreds of millions of devices, industrial machines, gaming sessions, streaming video sessions, and more within a unified system that can reliably ingest terabytes of data daily, respond quickly to queries over recent data, and efficiently analyze petabytes of recent and historical data combined. Many single-node or instance-based systems cannot accommodate this scale, and often become unresponsive or unavailable under load.
To meet this demand, we designed Amazon Timestream from the ground up as a scalable and highly available time series database. Timestream operates in a serverless environment, eliminating the need for upfront resource provisioning. Its ingestion, storage, and query subsystems automatically scale independently based on workload. This independent scaling is crucial for time series applications, enabling high-throughput data ingestion alongside concurrent queries that generate real-time insights. Timestream securely stores all data, efficiently transitions older data (according to user-defined configurations) into cost-effective storage, and adapts resources depending on the amount of data accessed by a query, allowing a single query to effectively analyze terabytes of information. These scaling features enable you to manage and analyze time series data at any scale using Timestream. As your application grows and its data and request volumes increase, Timestream automatically adjusts resources. You only pay for what you use, avoiding the need for over-provisioning during peak times or redesigning your application as workloads expand. For additional information on Timestream’s key benefits and use cases, refer to the Timestream documentation.
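As a concrete illustration of the user-defined storage tiering mentioned above, the following sketch configures retention with the AWS SDK for Python (Boto3). The database name, table name, and retention periods are illustrative assumptions, not values from this article.

```python
import boto3

write_client = boto3.client("timestream-write", region_name="us-east-1")

# Hypothetical database and table names for this example.
write_client.create_database(DatabaseName="devops")

# Recent data is kept in the memory store for fast queries over recent
# time ranges; older data automatically transitions to the cost-effective
# magnetic store, per these user-defined retention periods.
write_client.create_table(
    DatabaseName="devops",
    TableName="host_metrics",
    RetentionProperties={
        "MemoryStoreRetentionPeriodInHours": 12,    # hot tier (assumed value)
        "MagneticStoreRetentionPeriodInDays": 365,  # cold tier (assumed value)
    },
)
```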
In this article, we explore the performance and scaling characteristics of Timestream through an example application based on a DevOps use case. This workload is inspired by discussions with numerous customers across various scenarios like gaming, clickstream analysis, monitoring streaming applications, and industrial telemetry.
Workload Overview
This section describes the ingestion and query workload of an application monitored with Timestream.
Application Model
For this discussion, we utilize a sample application that simulates a DevOps scenario, monitoring metrics from a vast fleet of servers. Users want to trigger alerts for unusual resource usage, create dashboards showcasing aggregate fleet behavior and utilization, and conduct in-depth analyses on both recent and historical data to uncover correlations. The accompanying diagram illustrates the setup where a group of monitored instances sends metrics to Timestream. Simultaneously, another set of users executes queries for alerts, dashboards, or ad-hoc analysis, allowing ingestion and query processes to run in parallel.
The monitored application is structured as a highly scalable service deployed across multiple global regions. Each region is divided into scaling units called cells, which provide a level of infrastructure isolation. Each cell contains silos, representing a software isolation layer. Within each silo, five microservices form an isolated instance of the service. Each microservice comprises several servers with varying instance types and operating system versions, distributed across three availability zones. These characteristics that identify the servers generating metrics are represented as dimensions within Timestream. This architecture features a hierarchy of dimensions (including region, cell, silo, and microservice_name) along with other dimensions that intersect the hierarchy (such as instance_type and availability_zone).
The application emits multiple metrics (like cpu_user and memory_free) and events (including task_completed and gc_reclaimed). Each metric or event is linked with eight dimensions (such as region or cell) that uniquely identify the originating server. For further information about the data model, schema, and data generation, see the open-sourced data generator. The data generator also demonstrates how to use multiple writers to ingest data in parallel, leveraging Timestream’s ingestion scaling capabilities to process millions of data points per second.
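To make the data model concrete, here is a minimal sketch of writing two measures tagged with identifying dimensions using Boto3. All dimension values and the database and table names are hypothetical, and only a subset of the eight dimensions is shown in full.

```python
import time

import boto3

write_client = boto3.client("timestream-write", region_name="us-east-1")

# Dimensions identifying the originating server; values are hypothetical.
dimensions = [
    {"Name": "region", "Value": "us-east-1"},
    {"Name": "cell", "Value": "us-east-1-cell-1"},
    {"Name": "silo", "Value": "us-east-1-cell-1-silo-1"},
    {"Name": "microservice_name", "Value": "apollo"},
    {"Name": "availability_zone", "Value": "us-east-1a"},
    {"Name": "instance_type", "Value": "r5.4xlarge"},
    {"Name": "os_version", "Value": "AL2"},
    {"Name": "instance_name", "Value": "i-0123456789abcdef0"},
]

now_ms = str(int(time.time() * 1000))  # default TimeUnit is milliseconds
records = [
    {
        "Dimensions": dimensions,
        "MeasureName": "cpu_user",
        "MeasureValue": "35.2",
        "MeasureValueType": "DOUBLE",
        "Time": now_ms,
    },
    {
        "Dimensions": dimensions,
        "MeasureName": "memory_free",
        "MeasureValue": "12345678",
        "MeasureValueType": "DOUBLE",
        "Time": now_ms,
    },
]

write_client.write_records(
    DatabaseName="devops", TableName="host_metrics", Records=records
)
```

To reach high throughput, the data generator's approach applies here as well: run many such writers in parallel, with each writer batching up to 100 records per WriteRecords request and issuing requests concurrently.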
Ingestion Workload
We adjust the following scale factors to reflect the use cases observed among various customers:
- Number of time series – We vary the number of monitored hosts (from 100,000 to 4 million), which directly influences the number of time series tracked (ranging from 2.6 million to 104 million).
- Ingestion volume and data scale – We modify the frequency of data emission (from once every minute to once every five minutes).
The table below summarizes the data ingestion characteristics and the associated storage volumes. Depending on the number of monitored hosts and the metrics interval, the application ingests between 156 million and 3.1 billion data points per hour, which corresponds to daily data volumes of approximately 1.1 to 21.7 TB. Over a year, this amounts to roughly 0.37 to 7.7 PB of ingested data (the sketch after the table reproduces these calculations).
| Data Scale | Data Interval (seconds) | Number of Hosts Monitored (million) | Number of Time Series (million) | Average Data Points/Second | Average Data Points/Hour (million) | Average Ingestion Volume (MB/s) | Data Size/Hour (GB) | Data Size/Day (TB) | Data Size/Year (PB) |
|---|---|---|---|---|---|---|---|---|---|
| Small | 60 | 0.1 | 2.6 | 43,333 | 156 | 13 | 45 | 1 | 0.37 |
| Medium | 300 | 2 | 52 | 173,333 | 624 | 51 | 181 | 4.3 | 1.5 |
| Large | 120 | 4 | 104 | 866,667 | 3,120 | 257 | 904 | 21.7 | 7.7 |
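The data point rates in the table follow from simple arithmetic. The sketch below reproduces them, assuming 26 time series per monitored host, as implied by the Small scale (100,000 hosts map to 2.6 million series).

```python
# Back-of-the-envelope derivation of the ingestion figures in the table above.
SERIES_PER_HOST = 26  # implied by the Small scale: 0.1M hosts -> 2.6M series


def ingestion_profile(hosts_millions, interval_seconds):
    """Return (series in millions, points/second, points/hour in millions)."""
    series_millions = hosts_millions * SERIES_PER_HOST
    points_per_second = series_millions * 1e6 / interval_seconds
    points_per_hour_millions = points_per_second * 3600 / 1e6
    return series_millions, points_per_second, points_per_hour_millions


for name, hosts, interval in [("Small", 0.1, 60), ("Medium", 2, 300), ("Large", 4, 120)]:
    series, pps, pph = ingestion_profile(hosts, interval)
    print(f"{name}: {series:.1f}M series, {pps:,.0f} points/s, {pph:,.0f}M points/hour")
    # Small:  2.6M series,  43,333 points/s,   156M points/hour
    # Medium: 52.0M series, 173,333 points/s,  624M points/hour
    # Large: 104.0M series, 866,667 points/s, 3,120M points/hour
```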
The ingestion and data volumes presented pertain to an individual table. Internally, we have tested Timestream at ingestion scales reaching several GB/s per table, with thousands of databases and tables per AWS account.
Query Workload
The query workload is designed around observability use cases observed in customer environments. Queries fall into three main categories:
- Alerting – Computes aggregate usage of one or more resources across many hosts to detect unusual resource consumption (for example, the distribution of CPU utilization binned by 1 minute across all hosts within a specified microservice over the past hour; a sketch of such a query appears after this list).
- Dashboard Population – Computes aggregated utilization and patterns across numerous hosts to deliver comprehensive visibility into overall service performance (for instance, identifying hosts with resource usage exceeding the average observed in the fleet over the past hour).
- Analysis and Reporting – Analyzes large datasets for fleet-wide insights over extended timeframes (for example, extracting CPU and memory utilization for the top k hosts within a microservice that experience the longest GC pause intervals, or determining the hours in a day with peak CPU utilization within a region over the past three days).
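As an illustration of the alerting category, the following sketch issues a binned CPU-utilization query through the Timestream query client. The database, table, and dimension values are hypothetical, and the percentile aggregation is one plausible way to express "distribution" here.

```python
import boto3

query_client = boto3.client("timestream-query", region_name="us-east-1")

# Distribution of CPU utilization binned at 1 minute across all hosts of a
# given microservice over the past hour. Names and values are assumptions.
query = """
SELECT bin(time, 1m) AS binned_time,
       avg(measure_value::double) AS avg_cpu,
       approx_percentile(measure_value::double, 0.9) AS p90_cpu
FROM "devops"."host_metrics"
WHERE measure_name = 'cpu_user'
  AND microservice_name = 'apollo'
  AND time > ago(1h)
GROUP BY bin(time, 1m)
ORDER BY binned_time
"""

# Paginate because query results can span multiple response pages.
paginator = query_client.get_paginator("query")
for page in paginator.paginate(QueryString=query):
    for row in page["Rows"]:
        print(row["Data"])
```

The dashboarding and analysis queries differ mainly in the SQL they run (wider time ranges, joins across measures, top-k rankings) rather than in how they are issued.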
We use two representative queries from each category, which lets us examine query performance across a range of time windows and data volumes.
This overview should help you better understand how Amazon Timestream scales to deliver real-time insights over time series data for your applications.