Monitoring Private VPC Endpoint Health in Hybrid DNS Environments with CloudWatch Synthetics

on 15 NOV 2023

in Amazon CloudWatch, Amazon Route 53, Expert (400), Management & Governance, Management Tools, Monitoring and Observability

Coal miners once carried canaries underground to detect hazardous gases before those gases harmed anyone. Amazon CloudWatch Synthetics canaries serve a similar purpose: they alert you to problems in customer experience and security before those problems affect your users. Canaries are configurable Node.js or Python scripts that run on a schedule and monitor your REST APIs, URLs, and website content by mimicking the actions of typical users. By continuously checking endpoint availability and latency, whether through predefined canary templates or custom scripts, you can verify that the customer experience meets expectations.

To illustrate the practical applications of CloudWatch Synthetics canaries, we focus on a real customer scenario and the strategies used to implement it. The customer operates an internal title search system that lets analysts examine ownership and claims on real estate assets before transactions occur. Their architecture relies on a suite of microservices exposed through Amazon API Gateway. The REST APIs are private: they are reachable only from within the customer's Amazon Virtual Private Cloud (VPC) through VPC interface endpoints, in a hybrid DNS setup. The customer therefore needed a way to manage cross-Region disaster recovery (DR) traffic based on the health of those private API Gateway endpoints.

Solution Overview

The health of the private Amazon API Gateway endpoints is the critical signal, and 4XX/5XX status codes are the indicators of potential issues. The following sections describe how to set up and configure CloudWatch Synthetics canaries to monitor VPC endpoint health in a hybrid DNS environment that spans on-premises infrastructure and AWS.

Customer Use Case

In moving from a traditional monolithic architecture to microservices, the customer adopted a fully serverless design using Amazon API Gateway with an AWS Lambda backend. While this architecture is highly available and scalable, it does not automatically cover every element of a robust disaster recovery strategy. During the development of their serverless infrastructure, we identified four key metrics that required monitoring to ensure optimal API performance and resilience.

A 4XX status code usually means that a request made to a customer-owned resource was malformed, typically because of an error on the client side. To monitor client-side errors, such as missing or incorrect authentication headers, we implemented CloudWatch Synthetics canary scripts that define an acceptable limit and raise an alert when errors exceed that threshold.

5XX response codes, by contrast, signify server-side errors such as endpoint timeouts or bugs. As with 4XX errors, we can tolerate a small number of 5XX responses, but a sustained increase above the defined threshold is cause for concern. CloudWatch Synthetics canary scripts let us set thresholds for server-side errors in the same way as for client-side errors.
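The threshold idea for both error classes can be sketched in a few lines of Python. This is an illustration only, not the customer's canary script: it assumes the canary has already collected a list of HTTP status codes from its runs, and the function names and default limits (5% for 4XX, 1% for 5XX) are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def classify_status(status_code: int) -> str:
    """Bucket an HTTP status code into '4xx', '5xx', or 'ok'."""
    if 400 <= status_code <= 499:
        return "4xx"
    if 500 <= status_code <= 599:
        return "5xx"
    return "ok"


def error_rates(status_codes: Iterable[int]) -> Dict[str, float]:
    """Return the fraction of responses that were 4XX and 5XX."""
    counts = Counter(classify_status(code) for code in status_codes)
    total = sum(counts.values())
    if total == 0:
        # No responses observed: report zero rates rather than divide by zero.
        return {"4xx": 0.0, "5xx": 0.0}
    return {"4xx": counts["4xx"] / total, "5xx": counts["5xx"] / total}


def breached_thresholds(
    rates: Dict[str, float],
    max_4xx_rate: float = 0.05,  # assumed limit, not from the source
    max_5xx_rate: float = 0.01,  # assumed limit, not from the source
) -> List[str]:
    """List the error classes whose rate exceeds its configured limit."""
    limits = {"4xx": max_4xx_rate, "5xx": max_5xx_rate}
    return [name for name, limit in limits.items() if rates[name] > limit]
```

For example, `breached_thresholds(error_rates([200, 200, 404, 500]))` reports both classes, because each accounts for a quarter of the responses.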

The third metric for monitoring API Gateway health was the request count, encompassing both successful and error responses. This metric is valuable for tracking API Gateway costs (which are billed per million requests monthly) and can help identify bugs in application code leading to erroneous requests or retries. Additionally, if the request count is near zero, it could indicate permission problems or malfunctions in the calling application code.

Finally, API Gateway request latency, which measures the time taken from when the API receives a request to when it responds, helped us ensure compliance with business-defined SLA requirements. Increased latency can indicate issues with application code or the underlying infrastructure. Using CloudWatch Synthetics canaries, we can measure both the time it takes for the API to return and the round-trip time of the request. A close match between these two values typically suggests a source code issue, while a significant difference points to infrastructure problems.
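The comparison between API latency and round-trip time can be expressed as a small heuristic. The sketch below is one possible reading of that rule, not the customer's implementation: the function name, the SLA parameter, and the 0.8 cutoff for what counts as a "close match" are all assumptions chosen for illustration.

```python
def diagnose_latency(
    api_latency_ms: float,
    round_trip_ms: float,
    sla_ms: float,
    close_ratio: float = 0.8,  # assumed cutoff for a "close match"
) -> str:
    """Classify a slow request using the heuristic described above.

    Returns 'within_sla' when the round trip meets the SLA,
    'application_code' when API latency accounts for most of the round trip,
    and 'infrastructure' when most of the time is spent outside the API.
    """
    if round_trip_ms <= sla_ms:
        return "within_sla"
    if api_latency_ms >= close_ratio * round_trip_ms:
        # The API itself is slow, so the two values nearly match.
        return "application_code"
    # Large gap between the two values: time is lost in the network path.
    return "infrastructure"
```

With a 500 ms SLA, `diagnose_latency(900, 1000, 500)` points at application code, while `diagnose_latency(100, 1000, 500)` points at infrastructure.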

When any of these metrics fell outside its defined limits, we adjusted routing to redirect traffic to a secondary API Gateway endpoint in another Region and notified our administrators of the application issue. This closed-loop automation minimized the impact on end users, while detailed error reporting offered opportunities for application code improvements to reduce the risk of similar issues in the future.
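The decision step of that loop can be sketched as follows. The source does not describe how routing was actually changed or how administrators were notified, so this example covers only the decision itself; the class names, field names, and default limits are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class HealthSnapshot:
    """One evaluation window of the four metrics (names are illustrative)."""
    rate_4xx: float
    rate_5xx: float
    request_count: int
    latency_ms: float


@dataclass(frozen=True)
class Limits:
    """Assumed limits; real values would come from business requirements."""
    max_rate_4xx: float = 0.05
    max_rate_5xx: float = 0.01
    min_request_count: int = 1
    max_latency_ms: float = 1000.0


def violations(snapshot: HealthSnapshot, limits: Limits) -> List[str]:
    """Return the names of the metrics that fall outside their limits."""
    found = []
    if snapshot.rate_4xx > limits.max_rate_4xx:
        found.append("4xx_rate")
    if snapshot.rate_5xx > limits.max_rate_5xx:
        found.append("5xx_rate")
    if snapshot.request_count < limits.min_request_count:
        found.append("request_count")
    if snapshot.latency_ms > limits.max_latency_ms:
        found.append("latency")
    return found


def choose_endpoint(
    snapshot: HealthSnapshot,
    limits: Limits = Limits(),
    primary: str = "primary",
    secondary: str = "secondary",
) -> Tuple[str, List[str]]:
    """Pick the endpoint to route to, plus the reasons to report to admins."""
    reasons = violations(snapshot, limits)
    return (secondary if reasons else primary), reasons
```

Returning the list of violated metrics alongside the routing choice keeps the notification and the failover decision in agreement, since both derive from the same evaluation.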

Solution Implementation

Our implementation comprises three parts:

- Part A: Monitoring VPC Interface Endpoint Health with CloudWatch Synthetics Canaries.
- Part B: Enabling Hybrid DNS between On-Premises and AWS.
- Part C: Testing Canary Run Metrics within a Hybrid DNS Environment.

Part A: Monitoring VPC Interface Endpoint Health with CloudWatch Synthetics Canaries

1. Create the private API Gateway endpoint.
2. If a VPC is not already set up, create one. Note the VPC ID, private subnet IDs, and security group IDs; you will need them when configuring the Synthetics canary.
3. If the VPC has internet access, create a NAT Gateway and add it to the VPC, then skip to Step 4.
   - If the VPC lacks internet access, follow these steps:
     - Create an S3 VPC Endpoint to store Synthetics canary run data, and create a CloudWatch VPC Endpoint with the service name com.amazonaws.region.monitoring (where region is your AWS Region code) to collect canary run metrics.
     - Enable VPC DNS resolution and hostnames.
4. Launch your CloudWatch Synthetics Canary CloudFormation Stack by selecting ‘Launch Stack’ below.
5. Visit the canaries list page and select the newly created Synthetics canary to monitor run metrics (running state, screenshots, HTTP archive (HAR) files, and log files).
6. (Optional) Refer to the CloudWatch User Guide on troubleshooting a canary on a VPC if errors arise during the creation of the Synthetics canary.
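For a VPC without internet access, the two endpoints from the steps above could be described as request parameters like the ones below. This is a sketch under stated assumptions: it only builds the parameter dictionaries and makes no AWS calls, it assumes the S3 endpoint is a Gateway endpoint attached to route tables (the steps above do not specify the type), and all resource IDs in the usage note are placeholders.

```python
from typing import Any, Dict, List, Sequence


def build_endpoint_requests(
    region: str,
    vpc_id: str,
    subnet_ids: Sequence[str],
    security_group_ids: Sequence[str],
    route_table_ids: Sequence[str],
) -> List[Dict[str, Any]]:
    """Build parameters for the S3 and CloudWatch VPC endpoints."""
    s3_endpoint = {
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        # Assumption: a Gateway endpoint, which is routed via route tables.
        "VpcEndpointType": "Gateway",
        "RouteTableIds": list(route_table_ids),
    }
    monitoring_endpoint = {
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.monitoring",
        "VpcEndpointType": "Interface",
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
        # Private DNS relies on VPC DNS resolution and hostnames being enabled.
        "PrivateDnsEnabled": True,
    }
    return [s3_endpoint, monitoring_endpoint]
```

Each dictionary is shaped to be passed as keyword arguments to the boto3 EC2 client's `create_vpc_endpoint` call, for example with placeholder values such as `build_endpoint_requests("us-east-1", "vpc-123", ["subnet-a"], ["sg-1"], ["rtb-1"])`. Verify the parameters against your own network design before creating any resources.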

Part B: Enable Hybrid DNS Between On-Premises and AWS

If an on-premises DNS service is unavailable, create AWS Managed Microsoft AD to represent the on-premises DNS server. If using an on-premises DNS server, note your DNS server addresses and skip to Step 3.

Enter the directory information:

Edition: Standard Edition.
Directory DNS Name:
Directory NetBIOS Name (optional): corp
Directory Description (optional):
Admin password:
Confirm password:

Select Next.
