Amazon Onboarding with Learning Manager Chanci Turner

As organizations continue to develop applications that are sensitive to latency for their mission-critical workloads, effective monitoring is essential to ensure timely data processing. To maintain optimal application performance, users require robust monitoring and alerting mechanisms throughout their infrastructure, enabling swift responses to any disruptions that may arise.

Storage solutions play a vital role in ensuring application reliability. Monitoring I/O performance helps users effectively process crucial data, optimize resource allocation, and identify underlying infrastructure issues, thus ensuring that applications remain resilient amidst changing demands. Amazon Elastic Block Store (Amazon EBS) provides a high-performance block storage solution within the AWS ecosystem, ideal for I/O-intensive applications, such as databases or big data analytics. Gaining insights into EBS volume performance is crucial for quickly identifying and troubleshooting bottlenecks affecting application performance.

In this article, we will explore how users can efficiently monitor Amazon EBS performance utilizing the newly introduced average latency metrics and performance exceeded check metrics in Amazon CloudWatch. These metrics are now available by default at one-minute granularity at no extra cost for all EBS volumes attached to EC2 Nitro instances. With these new CloudWatch metrics, you can monitor I/O latency in EBS volumes and determine if latency issues stem from under-provisioned EBS volumes. For real-time performance insights, refer to the post, “Uncover new performance insights using Amazon EBS detailed performance statistics.” The detailed performance statistics can be accessed directly from the Amazon EBS NVMe device linked to the Amazon Elastic Compute Cloud (Amazon EC2) instance with sub-minute granularity.

Overview of Amazon EBS

Amazon EBS provides both SSD-based and HDD-based volumes. Within the SSD category, there are General Purpose (gp2, gp3) volumes and Provisioned IOPS (io1, io2 Block Express) volumes. General Purpose gp3 volumes are engineered to deliver single-digit millisecond average latency, making them suitable for most workloads. On the other hand, Provisioned IOPS io2 Block Express (io2 BX) volumes are designed to provide sub-millisecond average latency, making them optimal for mission-critical applications.

In this demonstration, we will use an io2 BX volume to illustrate how to monitor average I/O latency to ensure expected volume performance. We will also examine how to determine whether the observed latency is due to the application attempting to exceed provisioned IOPS or throughput. Additionally, we will highlight how these metrics can be utilized for observability through CloudWatch dashboards.

Working with Metrics

For our demonstration, we use an r5b.2xlarge EC2 instance with an attached io2 BX EBS volume provisioned for 32,000 IOPS. The volume’s performance limits are therefore 32,000 IOPS and 4,000 MiB/s of throughput. We created a CloudWatch dashboard with the following metrics, tracked at one-minute intervals, to monitor the volume’s performance. You can create a similar dashboard in the CloudWatch console and add an individual widget for each metric.

Volume IOPS (Ops/s): Displays the number of read and write I/O operations on the volume.
Volume Throughput (MiB/s): Indicates the amount of data transferred during read and write operations on the volume.
VolumeIOPSExceededCheck: Shows if an application attempts to exceed the volume’s provisioned IOPS performance.
VolumeThroughputExceededCheck: Shows if an application attempts to exceed the volume’s provisioned throughput performance.
VolumeAverageReadLatency (milliseconds): Reflects the average time required to complete read operations.
VolumeAverageWriteLatency (milliseconds): Displays the average time needed to finish write operations.
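As an alternative to building the dashboard by hand in the console, the check and latency widgets listed above could be defined in code. The following is a minimal sketch, not the setup used for this demonstration: it assumes boto3 is installed, the volume ID and Region are placeholders, and the two latency metric names are copied from the list above, so the identifiers CloudWatch actually registers in the AWS/EBS namespace may be spelled differently and should be verified first. The IOPS and throughput widgets are left out for brevity.

```python
import json

NAMESPACE = "AWS/EBS"  # CloudWatch namespace for EBS volume metrics

# (widget title, metric name, statistic). The latency names are taken from
# the article's list and are an assumption; confirm them in your account.
WIDGETS = [
    ("IOPS exceeded check", "VolumeIOPSExceededCheck", "Maximum"),
    ("Throughput exceeded check", "VolumeThroughputExceededCheck", "Maximum"),
    ("Average read latency", "VolumeAverageReadLatency", "Average"),
    ("Average write latency", "VolumeAverageWriteLatency", "Average"),
]


def build_dashboard_body(volume_id, region):
    """Return a dashboard body (JSON string) with one widget per metric."""
    widgets = []
    for index, (title, metric, stat) in enumerate(WIDGETS):
        widgets.append({
            "type": "metric",
            "x": (index % 2) * 12,   # two widgets per row
            "y": (index // 2) * 6,
            "width": 12,
            "height": 6,
            "properties": {
                "title": title,
                "region": region,
                "period": 60,        # one-minute granularity
                "stat": stat,
                "metrics": [[NAMESPACE, metric, "VolumeId", volume_id]],
            },
        })
    return json.dumps({"widgets": widgets})


def publish_dashboard(name, volume_id, region):
    """Create or update the dashboard; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    return cloudwatch.put_dashboard(
        DashboardName=name,
        DashboardBody=build_dashboard_body(volume_id, region),
    )
```

Calling publish_dashboard("ebs-volume-health", "vol-0123456789abcdef0", "us-east-1") would publish it; all three arguments are hypothetical values.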

Two scenarios will illustrate how these metrics provide insights into volume performance. The first scenario involves monitoring the metrics to ascertain whether volume performance aligns with defined criteria. The second scenario utilizes the metrics to detect when the workload surpasses the volume’s provisioned performance limits, thereby identifying causes of high volume latency.

Scenario 1 (Normal Volume Performance)

In this scenario, we send I/O to the volume using a 16 KiB I/O size. You can observe the performance of our volume during this test.

From the graph below, our io2 BX volume exhibits an average read latency of 0.40 ms or below and an average write latency of 0.25 ms or below, indicating the volume’s latency meets expectations. The VolumeIOPSExceededCheck and VolumeThroughputExceededCheck metrics in the upper right corner both display 0, as the volume operates within its provisioned performance limits of 32,000 IOPS and 4,000 MiB/s, as shown in the graphs on the upper left.

Scenario 2 (Degraded Volume Performance)

Now, we will use the same io2 BX volume and, with a 128 KiB I/O size, drive exceptionally high IOPS and throughput to the volume. The performance metrics appear as follows:

Examining the bottom graph reveals an increase in volume latency. The average read latency peaked at 1.14 ms, while the average write latency reached 0.98 ms. This rise occurs because the workload attempts to exceed the volume’s provisioned performance limits of 32,000 IOPS and 4,000 MiB/s of throughput across reads and writes. The VolumeIOPSExceededCheck and VolumeThroughputExceededCheck metrics in the upper right graphs corroborate this: both read 1 during the period of heavy load, indicating that the workload is attempting to exceed the volume’s provisioned IOPS and throughput.
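A quick calculation suggests why both checks can read 1 at the same time in this scenario. Throughput is the product of IOPS and I/O size, which is general arithmetic rather than something the test output states. The sketch below assumes 1 MiB = 1,024 KiB and uses a hypothetical helper name.

```python
KIB_PER_MIB = 1024


def throughput_mib_per_s(iops, io_size_kib):
    """Throughput implied by an IOPS rate at a fixed I/O size."""
    return iops * io_size_kib / KIB_PER_MIB


# Scenario 1: 16 KiB I/O at the 32,000 IOPS limit stays far below 4,000 MiB/s.
print(throughput_mib_per_s(32_000, 16))   # 500.0

# Scenario 2: 128 KiB I/O reaches 4,000 MiB/s exactly at 32,000 IOPS.
print(throughput_mib_per_s(32_000, 128))  # 4000.0
```

With 128 KiB I/O, a workload that reaches the IOPS limit reaches the throughput limit at the same point, which is consistent with both metrics reading 1 together.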

As demonstrated in the scenarios, tracking average read and write latency metrics is essential to determine if your volume is functioning as expected. The IOPS exceeded check and throughput exceeded check metrics help diagnose whether high volume latency is due to reaching the provisioned IOPS or throughput limits. These metrics enable you to ascertain whether your EBS volume is affecting your application’s performance.
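The diagnostic reasoning above can be sketched as a small decision function applied to one minute of datapoints. This is an illustration only: the function name, the returned labels, and the 0.5 ms latency target are hypothetical, and a real target would come from your application’s requirements.

```python
def diagnose(avg_latency_ms, iops_exceeded, throughput_exceeded, latency_target_ms):
    """Classify one minute of volume metrics.

    iops_exceeded and throughput_exceeded are the 0/1 values of the
    exceeded check metrics for the same minute.
    """
    if avg_latency_ms <= latency_target_ms:
        return "healthy"
    if iops_exceeded or throughput_exceeded:
        # High latency while the workload pushes past provisioned limits.
        return "provisioned limit reached"
    # High latency without a limit being hit points elsewhere.
    return "investigate other causes"


TARGET_MS = 0.5  # hypothetical latency target for this application

print(diagnose(0.40, 0, 0, TARGET_MS))  # healthy (Scenario 1 read latency)
print(diagnose(1.14, 1, 1, TARGET_MS))  # provisioned limit reached (Scenario 2)
print(diagnose(1.14, 0, 0, TARGET_MS))  # investigate other causes
```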

Moreover, these metrics can assist in establishing suitable recovery mechanisms, such as enhancing your volume’s performance to ensure sufficient provisioning for your application’s requirements. Using CloudWatch, you can set alarms to notify you when a metric crosses a specific threshold. For instance, you can configure an alarm to trigger if your application attempts to drive more IOPS than provisioned for three out of the last five minutes, indicating the necessity to increase the IOPS on your volume. You can tailor different threshold values based on your application’s performance demands.
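The "three out of the last five minutes" alarm could be expressed roughly as follows. This is a sketch under stated assumptions rather than a recommended configuration: it assumes boto3, the alarm name and volume ID are hypothetical, and treating missing data as not breaching is a choice made here, not something the article prescribes. Because the check metric reads 0 or 1, the sketch alarms when the per-minute maximum is at least 1.

```python
def iops_exceeded_alarm(volume_id, sns_topic_arn=None):
    """Build put_metric_alarm arguments for a 3-of-5-minutes IOPS alarm."""
    params = {
        "AlarmName": f"ebs-iops-exceeded-{volume_id}",  # hypothetical name
        "Namespace": "AWS/EBS",
        "MetricName": "VolumeIOPSExceededCheck",
        "Dimensions": [{"Name": "VolumeId", "Value": volume_id}],
        "Statistic": "Maximum",
        "Period": 60,              # one-minute datapoints
        "EvaluationPeriods": 5,    # look at the last five minutes
        "DatapointsToAlarm": 3,    # alarm when three of them breach
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # assumption, adjust as needed
    }
    if sns_topic_arn:
        params["AlarmActions"] = [sns_topic_arn]
    return params


def create_alarm(volume_id, region, sns_topic_arn=None):
    """Create the alarm; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    cloudwatch.put_metric_alarm(**iops_exceeded_alarm(volume_id, sns_topic_arn))
```

The same structure would apply to VolumeThroughputExceededCheck by changing the metric name.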

Elevated latency in your volume can occur for a variety of reasons, including hitting performance limits, volume initialization, or underlying infrastructure failures. Alarms can be used to automate various actions to safeguard your application’s availability against increased volume latency. For example, an alarm can automatically invoke an AWS Lambda function to switch to a secondary volume, or to enhance your volume’s provisioned performance.
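A Lambda function that raises a volume’s provisioned IOPS might look like the sketch below. It is illustrative only and makes several assumptions: boto3 is available in the Lambda runtime, the volume ID is supplied through a hypothetical TARGET_VOLUME_ID environment variable instead of being parsed from the alarm event, and the 25 percent step and 64,000 IOPS ceiling are arbitrary example values. EBS also restricts how often a volume can be modified, so a production handler would need to handle a rejected request.

```python
import os


def next_iops(current_iops, step_percent=25, ceiling=64_000):
    """Raise IOPS by a percentage step without passing the ceiling."""
    raised = current_iops + current_iops * step_percent // 100
    return min(raised, ceiling)


def lambda_handler(event, context):
    """Increase provisioned IOPS on the configured volume."""
    import boto3  # imported lazily so next_iops can be tested without boto3

    volume_id = os.environ["TARGET_VOLUME_ID"]  # hypothetical variable name
    ec2 = boto3.client("ec2")

    volume = ec2.describe_volumes(VolumeIds=[volume_id])["Volumes"][0]
    current = volume["Iops"]
    target = next_iops(current)

    if target > current:
        ec2.modify_volume(VolumeId=volume_id, Iops=target)
    return {"volume_id": volume_id, "previous_iops": current, "target_iops": target}
```

For the 32,000 IOPS volume used in this demonstration, next_iops would request 40,000 IOPS.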

