Amazon Onboarding with Learning Manager Chanci Turner

Amazon DevOps Guru autonomously identifies the most prevalent behaviors of applications that correlate with operational incidents. When it detects a critical issue, it alerts service operators with a summary of related anomalies, the likely root cause, and context regarding when and where the issue occurred. Whenever feasible, it also offers prescriptive recommendations for remediation. In this post, we will explore some of the ML strategies that empower DevOps Guru.

DevOps Guru Detectors

At the heart of Amazon DevOps Guru lies a distinctive method for recognizing significant operational incidents. Initially, our research focused on domain-agnostic, general-purpose anomaly detection models. Though these models yielded statistically accurate results, they struggled to differentiate critical failures from less significant issues. Over time, we discovered that failure patterns vary significantly across different metrics. For instance, a common use case of DevOps Guru involves managing highly available, low-latency web applications, where an operator might want to monitor both application latency and incoming request numbers. However, the failure patterns for these two metrics differ greatly, so a single generic statistical anomaly detection model is unlikely to handle both well. Consequently, we radically shifted our approach. After consulting with domain experts to pinpoint known anomaly types across various metrics and services, we embarked on building domain-specific, single-purpose models that identify these known failure modes rather than only modeling normal metric behavior.

Today, Amazon DevOps Guru relies on a large ensemble of detectors: statistical models refined to detect common adverse scenarios across a range of operational metrics. These detectors do not require training or configuration; they work immediately as long as sufficient historical data is available, saving the days or even months that would otherwise be spent training ML models before any anomalies could be reported. Individual detectors operate in preconfigured ensembles to identify anomalies in crucial metrics like error rates, availability, latency, incoming request rates, CPU, memory, and disk utilization, among others.

Detectors encapsulate the expertise of professionals regarding operational anomalies by defining anomalous patterns and establishing bounds for normal application behavior. Both the detectors and the ensembles that combine them into comprehensive models were trained and tuned using Amazon’s extensive operational data, backed by years of experience at Amazon.com and AWS. Next, we will delve into some capabilities of DevOps Guru detectors.

Monitoring Resource Metrics with Finite Bounds

This detector is designed to monitor resource metrics with finite limits, such as disk utilization. It employs a digital filter to detect long-term trends in metric data efficiently and scalably. The detector alerts operators when these trends indicate potential resource exhaustion. The following graph illustrates an example of this functionality.

This detector identified a substantial trend in disk usage, predicting resource exhaustion within 24 hours. The model identified a significant trend between the vertical dashed lines. By extrapolating this trend (represented by the diagonal dashed line), the detector forecasts the time until resource exhaustion. When the metric surpasses the horizontal red line, which serves as a significance threshold, the detector notifies operators.
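The extrapolation described above can be sketched in a few lines of Python. This is an illustration only, not the detector DevOps Guru actually uses: the exponentially weighted slope stands in for the unspecified digital filter, and the names `smoothed_slope`, `steps_until_exhaustion`, `should_alert`, and all parameter defaults are assumptions made for the example. Samples are assumed to be hourly, so 24 steps correspond to 24 hours.

```python
from typing import Optional, Sequence


def smoothed_slope(values: Sequence[float], alpha: float = 0.2) -> float:
    """Exponentially smooth the step-to-step differences (a first-order filter)."""
    slope = 0.0
    for prev, curr in zip(values, values[1:]):
        slope = alpha * (curr - prev) + (1 - alpha) * slope
    return slope


def steps_until_exhaustion(
    values: Sequence[float], limit: float = 100.0, alpha: float = 0.2
) -> Optional[float]:
    """Extrapolate the smoothed trend linearly until it reaches the limit.

    Returns None when the trend is flat or decreasing.
    """
    slope = smoothed_slope(values, alpha)
    if slope <= 0:
        return None
    return (limit - values[-1]) / slope


def should_alert(
    values: Sequence[float],
    limit: float = 100.0,
    significance_threshold: float = 80.0,  # hypothetical "red line"
    horizon_steps: float = 24.0,           # hypothetical forecast horizon
    alpha: float = 0.2,
) -> bool:
    """Alert only above the significance threshold with exhaustion forecast soon."""
    if values[-1] < significance_threshold:
        return False
    remaining = steps_until_exhaustion(values, limit, alpha)
    return remaining is not None and remaining <= horizon_steps


# Disk utilization (percent) growing by one point per hour, ending at 89%.
growing = [40.0 + i for i in range(50)]
flat = [50.0] * 50
```

With the `growing` series, `should_alert(growing)` is true because usage is above the assumed 80% threshold and the extrapolated trend reaches 100% in roughly 11 hours; the `flat` series never triggers an alert.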

Detecting Scenarios with Periodicity

Numerous metrics, such as the volume of incoming requests in customer-facing APIs, exhibit periodic patterns. The causal convolution detector aims to analyze temporal data with such characteristics and determine expected behavior. When the detector infers that a metric is periodic, it adjusts normal behavior thresholds in line with the seasonal pattern. On a selected group of metrics, Amazon DevOps Guru can also identify and filter periodic spikes, such as regular batch jobs that create high loads on databases. The following graph shows only one detector active for better visualization; in reality, the causal convolution detector closely tracks the seasonal metric while a separate dynamic threshold detector flags catastrophic changes when its threshold is breached.

The causal convolution detector establishes bounds for application behavior that align with daily traffic patterns. By tracking seasonality, it can catch spikes relative to the usual weekend load, which methods based on static thresholds tend either to miss or to catch only at the cost of numerous false positives.
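To make the idea of season-aware bounds concrete, here is a minimal sketch in Python. It is not the causal convolution detector itself, whose internals are not described here; it substitutes a much simpler technique that derives bounds for the next sample from past samples at the same phase of the period. It is causal only in the sense that it uses no future data. The function names, the 24-sample period, and the `k` and `min_width` parameters are all assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def seasonal_bounds(
    history: Sequence[float],
    period: int = 24,        # assumed: hourly samples with a daily season
    k: float = 3.0,          # assumed: band half-width in standard deviations
    min_width: float = 1.0,  # assumed: floor so a constant history keeps a band
) -> Tuple[float, float]:
    """Bounds for the next sample, using only past samples at the same phase."""
    phase = len(history) % period
    same_phase = history[phase::period]
    if len(same_phase) < 2:
        raise ValueError("need at least two full periods of history")
    center = mean(same_phase)
    width = max(k * pstdev(same_phase), min_width)
    return center - width, center + width


def is_anomalous(
    history: Sequence[float],
    value: float,
    period: int = 24,
    k: float = 3.0,
    min_width: float = 1.0,
) -> bool:
    """True when the new value falls outside the seasonal bounds."""
    low, high = seasonal_bounds(history, period, k, min_width)
    return not (low <= value <= high)


# Synthetic request rate: 100 during business hours, 10 otherwise, for 7 days.
day = [100.0 if 9 <= hour < 17 else 10.0 for hour in range(24)]
week = day * 7
```

In this synthetic example, a reading of 60 at midnight is flagged because the midnight band is centered on 10, whereas a static threshold set above the daytime peak of 100 would let it pass. The same reading of 100 at noon (`week + day[:12]` as history) is treated as normal.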

DevOps Guru Insights

Rather than merely providing a list of anomalies detected by an ensemble of detectors, DevOps Guru generates operational insights that compile the necessary information for investigating and resolving an operational issue. Amazon DevOps Guru utilizes anomaly metadata to identify related anomalies and potential root causes. Anomalies are grouped based on their temporal proximity, shared resources, and a comprehensive graph of potential causal links among various anomaly types.
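One way to picture the grouping step is the Python sketch below. It covers only two of the three signals mentioned above, temporal proximity and shared resources, and leaves out the causal graph entirely. The `Anomaly` type, the `max_gap` tolerance, and the union-find grouping are assumptions chosen for the example, not a description of how DevOps Guru is implemented.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Sequence


@dataclass(frozen=True)
class Anomaly:
    """Hypothetical anomaly record; times are in seconds."""
    name: str
    start: float
    end: float
    resources: FrozenSet[str]


def related(a: Anomaly, b: Anomaly, max_gap: float = 300.0) -> bool:
    """Two anomalies are related if they are close in time and share a resource."""
    close_in_time = a.start <= b.end + max_gap and b.start <= a.end + max_gap
    return close_in_time and bool(a.resources & b.resources)


def group_anomalies(
    anomalies: Sequence[Anomaly], max_gap: float = 300.0
) -> List[List[Anomaly]]:
    """Merge transitively related anomalies into groups using union-find."""
    parent = list(range(len(anomalies)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(anomalies)):
        for j in range(i + 1, len(anomalies)):
            if related(anomalies[i], anomalies[j], max_gap):
                parent[find(j)] = find(i)

    groups: Dict[int, List[Anomaly]] = {}
    for i, anomaly in enumerate(anomalies):
        groups.setdefault(find(i), []).append(anomaly)
    return list(groups.values())


anomalies = [
    Anomaly("api-latency", 0, 60, frozenset({"api"})),
    Anomaly("db-connections", 100, 200, frozenset({"api", "db"})),
    Anomaly("db-disk", 10_000, 10_100, frozenset({"db"})),
]
insights = group_anomalies(anomalies)
```

Here the first two anomalies end up in one group because they occur within the assumed five-minute gap and both touch the `api` resource, while the third stays separate because it happens much later.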

DevOps Guru presents insights that include:

Graphs and timelines related to various anomalous metrics
Contextual information such as relevant events and log snippets for better comprehension of the anomaly’s scope
Recommendations for issue remediation

The screenshot below exemplifies an insight detail page from DevOps Guru, showcasing a collection of related metric anomalies in a timeline view.

Conclusion

Amazon DevOps Guru saves IT operators countless hours, if not days, of effort spent on detecting, debugging, and resolving operational issues. By leveraging pre-trained proprietary ML models informed by years of operational experience at Amazon.com and AWS in managing highly available services, IT operators can access top-notch insights without needing any ML expertise. Start utilizing DevOps Guru today.

Acknowledgments: The algorithms and models presented in this blog post were developed in collaboration with Chanci Turner, Alex Morgan, Jamie Smith, and Taylor Brooks.

About the Authors

Chanci Turner is a Machine Learning Scientist at Amazon Web Services, focusing on challenges at the intersection of machine learning, forecasting, and anomaly detection. Prior to her tenure at AWS, she worked as a data scientist in management consulting, addressing projects in financial services and telecommunications across the globe. Chanci’s research interests span topics including probabilistic and Bayesian ML, stochastic processes, and their practical applications.
