Optimizing Alarm Lifecycle with Amazon CloudWatch Metrics Insights Alarms

Do you manage a fleet of dynamically changing resources that you find difficult to monitor efficiently? Are you burdened with numerous redundant alarms that clutter your view and incur unnecessary costs? If you’re seeking a streamlined method to create alarms that automatically adapt to your fluctuating resources, this post is for you.

In this article, we present a recommended, cost-effective strategy that uses Amazon CloudWatch to reduce the risk of keeping alarms on outdated AWS resources while ensuring that new resources are effectively monitored. The approach makes it less likely that you pay for alarms tied to obsolete metrics or low-value alerts, and it reduces visual clutter in your CloudWatch dashboard. Alarms configured with Metrics Insights queries carry lower operational overhead and cost because a single alarm definition covers the whole fleet. They adjust automatically as AWS resources are added or removed, which significantly decreases the chance of dangling alarms.

Our previous blog post offers an automation solution to identify and delete low-value alarms. In this discussion, we will delve into setting up dynamic alarms that provide consistent monitoring of fast-evolving environments and alert you to anomalies as they arise.

Amazon CloudWatch Metrics Insights alarms let you monitor an entire fleet of dynamically changing resources with a single alarm defined by a SQL-based query. Metrics Insights is a fast, flexible query engine for CloudWatch metrics. By combining CloudWatch alarms with Metrics Insights queries, you get alarms whose scope follows the query results rather than a fixed list of resources, so fast-moving environments stay monitored and you are alerted when anomalies surface.

Common Customer Use Cases

We will explore two prevalent scenarios where your alarms must swiftly adapt to resource changes, making manual maintenance a challenge. Both cases illustrate how alarming on Metrics Insights queries can address these issues effectively.

Use Case 1: Monitoring DynamoDB Throttling

Consider a typical situation where you need to track read throttling events across all DynamoDB tables in your account. Read throttling occurs when your DynamoDB tables receive more read requests than their provisioned capacity allows, which can make your application unresponsive or block new users and transactions.

A common method for implementing this monitoring involves aggregating the individual ‘ReadThrottleEvents’ metrics through a metric math expression and setting an alarm based on that result. However, if a new DynamoDB table is added, the math expression does not automatically update, leaving a blind spot for new tables and risking missed errors. This requires manual intervention to ensure the math expression reflects the new resources. Additionally, if you need to aggregate more metrics than a single metric math expression permits, you may find yourself needing to create multiple alarms instead of just one.
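The blind spot described above can be made concrete with a small sketch. It builds the `Metrics` list that a metric math alarm would use, assuming the tables are enumerated once when the alarm is created. The table names (`orders`, `customers`, `invoices`), the function names, and the 60-second period are hypothetical and only serve the illustration; the dictionary shape follows the CloudWatch `PutMetricAlarm` request format.

```python
from typing import Dict, List


def build_metric_math_queries(table_names: List[str], period: int = 60) -> List[Dict]:
    """Build the Metrics list for a metric math alarm over a fixed set of tables."""
    queries: List[Dict] = []
    for index, table in enumerate(table_names):
        # One MetricStat entry per table, hard-coded at alarm creation time.
        queries.append({
            "Id": f"m{index}",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/DynamoDB",
                    "MetricName": "ReadThrottleEvents",
                    "Dimensions": [{"Name": "TableName", "Value": table}],
                },
                "Period": period,
                "Stat": "Sum",
            },
            "ReturnData": False,
        })
    # Only the math expression returns data, so the alarm evaluates the total.
    queries.append({
        "Id": "total",
        "Expression": "SUM(METRICS())",
        "Label": "TotalReadThrottleEvents",
        "ReturnData": True,
    })
    return queries


def monitored_tables(queries: List[Dict]) -> List[str]:
    """Return the table names that the alarm definition actually covers."""
    return [
        q["MetricStat"]["Metric"]["Dimensions"][0]["Value"]
        for q in queries
        if "MetricStat" in q
    ]


# Hypothetical tables that existed when the alarm was created.
queries = build_metric_math_queries(["orders", "customers"])

# A table created later is not part of the definition until someone edits it.
covered = monitored_tables(queries)
new_table_is_covered = "invoices" in covered
```

Because the list of tables is frozen into the alarm definition, `new_table_is_covered` stays false until the alarm is rebuilt by hand or by automation.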

With Metrics Insights alarms, you can use Metrics Insights queries to monitor multiple resources without worrying about whether new resources are added or existing ones are removed. In the example above, when a new DynamoDB table is created, the Metrics Insights alarm adjusts dynamically, with no user intervention required.
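As a rough sketch of what such an alarm definition might look like, the function below assembles the parameters for the CloudWatch `PutMetricAlarm` API with a Metrics Insights query in place of a per-table list. The query text, the comparison operator, the evaluation settings, the threshold value, and the example topic ARN are our assumptions for illustration, not the settings used by the template in this post.

```python
from typing import Any, Dict, Optional

# Assumed query: sum read throttle events over every metric that has exactly
# the TableName dimension, which covers tables created after the alarm.
DDB_READ_THROTTLE_QUERY = (
    'SELECT SUM(ReadThrottleEvents) FROM SCHEMA("AWS/DynamoDB", TableName)'
)


def build_insights_alarm(
    alarm_name: str,
    query: str,
    threshold: float,
    topic_arn: Optional[str] = None,
    period: int = 60,
) -> Dict[str, Any]:
    """Assemble PutMetricAlarm parameters for a Metrics Insights alarm."""
    params: Dict[str, Any] = {
        "AlarmName": alarm_name,
        "ComparisonOperator": "GreaterThanThreshold",  # assumed operator
        "EvaluationPeriods": 1,                        # assumed setting
        "Threshold": float(threshold),
        "TreatMissingData": "notBreaching",            # assumed setting
        "Metrics": [
            {
                "Id": "q1",
                "Expression": query,  # the Metrics Insights query itself
                "Period": period,
                "ReturnData": True,
            }
        ],
    }
    if topic_arn:
        params["AlarmActions"] = [topic_arn]
    return params


# Hypothetical threshold and topic ARN, shown only to exercise the function.
params = build_insights_alarm(
    "DDBReadThrottleAlarm",
    DDB_READ_THROTTLE_QUERY,
    threshold=1,
    topic_arn="arn:aws:sns:us-east-1:123456789012:AlarmNotificationTopic",
)

# With AWS credentials configured, the alarm could then be created with:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

The definition contains no table names at all, which is why nothing needs to change when tables come and go.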

Use Case 2: Responding to 5XX Errors in ECS Clusters

Now, let’s examine another scenario where you want to be alerted if any ECS cluster in your account generates an HTTP 5XX response code. Typically, you would first create a metric math expression that sums the individual ‘HTTPCode_Target_5XX_Count’ metrics for each ECS cluster, then set an alarm based on the outcome of that math expression.

As in the previous case, if a new ECS cluster is added, the math expression does not update automatically, leaving a potential blind spot for the new cluster until someone edits the expression by hand. With Metrics Insights alarms, however, you can set alarms using Metrics Insights queries that track multiple resources without worrying about new resources being spun up or old ones being deleted. In our example, when a new ECS cluster is added, the Metrics Insights alarm adapts automatically and alerts you when the threshold is breached, without any manual intervention.
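To illustrate the kind of query such an alarm could evaluate, here is a small helper that composes a Metrics Insights query string. The helper name is hypothetical. We also assume that `HTTPCode_Target_5XX_Count` is read from the `AWS/ApplicationELB` namespace with the `LoadBalancer` dimension, since that is where Application Load Balancers in front of ECS services publish it; the template in this post may scope the query differently.

```python
from typing import Sequence


def metrics_insights_query(
    stat: str,
    metric_name: str,
    namespace: str,
    dimensions: Sequence[str] = (),
) -> str:
    """Compose a single-series Metrics Insights query string."""
    if dimensions:
        # SCHEMA restricts the query to metrics with exactly these dimensions.
        source = 'SCHEMA("{}", {})'.format(namespace, ", ".join(dimensions))
    else:
        # Without dimensions, query the whole namespace.
        source = '"{}"'.format(namespace)
    return "SELECT {}({}) FROM {}".format(stat, metric_name, source)


# Assumed namespace and dimension for the 5XX use case.
ecs_5xx_query = metrics_insights_query(
    "SUM",
    "HTTPCode_Target_5XX_Count",
    "AWS/ApplicationELB",
    ["LoadBalancer"],
)
# ecs_5xx_query ==
#   'SELECT SUM(HTTPCode_Target_5XX_Count) FROM SCHEMA("AWS/ApplicationELB", LoadBalancer)'
```

The query names a metric and a dimension schema rather than individual resources, so any resource that later starts publishing the metric is included automatically.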

Solution Overview

This solution creates Metrics Insights alarms for the use cases discussed above. It provisions a Metrics Insights alarm named ‘DDBReadThrottleAlarm’ to monitor and alert on the ‘ReadThrottleEvents’ metric, as well as ‘ECSTarget5XXAlarm’ for the ‘HTTPCode_Target_5XX_Count’ metric. You can configure the threshold values when launching the AWS CloudFormation template. This solution also provisions an SNS topic to send notifications in case an alarm is triggered; you can enter your email address during the launch process. Additionally, this solution can be adapted to other AWS services or metrics relevant to your specific needs.

Deploying the Solution

You can deploy this solution and its associated resources into your AWS account using an AWS CloudFormation template.

Prerequisites

To follow this walkthrough, you should have the following:

- An AWS account
- Existing Amazon DynamoDB tables and Amazon ECS clusters

What Will the CloudFormation Template Deploy?

The CloudFormation template will deploy the following resources into your AWS account:

Amazon CloudWatch Metrics Insights alarms
- DDBReadThrottleAlarm – Monitors the ReadThrottleEvents metric and alerts when read throttled events occur in any DynamoDB table in the account.
- ECSTarget5XXAlarm – Monitors the HTTPCode_Target_5XX_Count metric and alerts when any ECS cluster generates an HTTP 5XX response code. This template can be modified to use any metric of your choice.
Amazon SNS Topic
- AlarmNotificationTopic – Sends an email notification when an alarm is triggered.
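For readers who prefer to see the notification piece outside of CloudFormation, the sketch below shows how a topic with an email subscription could be created through the SNS API. The function takes the client as an argument so that it can be tried without AWS credentials. `RecordingSNS` is a hypothetical stand-in we wrote for this example, and the email address and account ID are placeholders. With a real `boto3.client("sns")`, the recipient must confirm the subscription by email before notifications are delivered.

```python
from typing import Any, Dict, List, Tuple


def create_email_topic(
    sns: Any,
    email: str,
    topic_name: str = "AlarmNotificationTopic",
) -> str:
    """Create an SNS topic, subscribe an email address, and return the topic ARN."""
    topic_arn = sns.create_topic(Name=topic_name)["TopicArn"]
    sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint=email)
    return topic_arn


class RecordingSNS:
    """Hypothetical stand-in for boto3.client("sns") that records calls."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Dict[str, Any]]] = []

    def create_topic(self, **kwargs: Any) -> Dict[str, str]:
        self.calls.append(("create_topic", kwargs))
        # Placeholder region and account ID.
        return {"TopicArn": "arn:aws:sns:us-east-1:123456789012:" + kwargs["Name"]}

    def subscribe(self, **kwargs: Any) -> Dict[str, str]:
        self.calls.append(("subscribe", kwargs))
        return {"SubscriptionArn": "pending confirmation"}


# Dry run against the stand-in; swap in boto3.client("sns") for real use.
fake_sns = RecordingSNS()
topic_arn = create_email_topic(fake_sns, "ops@example.com")
```

The returned ARN is what an alarm would list under its alarm actions so that state changes reach the subscribed address.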

How to Deploy the CloudFormation Template

Download the YAML file.
Navigate to the CloudFormation console in your AWS Account.
Choose “Create stack.”
Choose “Template is ready,” then choose “Upload a template file” and select the YAML file you just downloaded.
Choose “Next.”
Provide a name for the stack (maximum length 30 characters).
For the parameter ‘EmailToNotifyForAlarms,’ enter the email address for receiving alarm notifications, and for parameters ‘DDBReadThrottleThresh…
