Amazon Onboarding with Learning Manager Chanci Turner

A comprehensive strategy enables proactive preparation for potential failures. In this second part of the series, we look at the infrastructure layer and show how four practices contribute to more resilient infrastructure: leveraging Amazon Web Services (AWS) managed services, implementing redundancy, ensuring high availability, and choosing infrastructure failover patterns based on the recovery time objective (RTO) and recovery point objective (RPO).

Pattern 1: Identifying High-Impact and Likelihood Infrastructure Failures

To enhance cloud infrastructure resilience, it is crucial to comprehend the potential impact and likelihood of various infrastructure failures. As illustrated, many failures with a high likelihood stem from operator mistakes or subpar deployments. Automated testing, streamlined deployments, and robust design patterns can significantly reduce these failures. Datacenter issues, such as complete rack failures, can be mitigated by using auto-scaling and multi-availability zone (multi-AZ) deployments, alongside resilient AWS cloud-native services.

Infrastructure resilience combines high availability (HA) and disaster recovery (DR). HA focuses on improving system availability by incorporating redundancy among application components and eliminating single points of failure. Decisions made at the application layer, such as creating stateless applications, simplify HA implementation at the infrastructure layer, enabling scaling through Auto Scaling groups and distributing redundant applications across multiple AZs.
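
The effect of redundancy on availability can be illustrated with a small calculation. The following is a minimal sketch, not an AWS formula: it assumes component failures are independent (real failures are often correlated), and the 99% figure and function names are hypothetical values chosen for illustration.

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components, assuming independent
    failures and that the system is up while at least one copy is up."""
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The system is down only when every copy is down at the same time.
    return 1.0 - (1.0 - component_availability) ** copies


def serial_availability(*availabilities: float) -> float:
    """Availability of a chain of dependencies that must all be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


if __name__ == "__main__":
    single = 0.99  # hypothetical availability of one instance
    for n in (1, 2, 3):
        print(n, f"{parallel_availability(single, n):.6f}")
```

Under these assumptions, each additional redundant copy multiplies the probability of total outage by the failure probability of one copy, while every hard dependency placed in series lowers overall availability. This is why eliminating single points of failure matters more than hardening any one component.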

Pattern 2: Understanding and Managing Infrastructure Failures

Establishing a resilient infrastructure requires discerning which failures we can control and which we cannot. These insights allow us to automate failure detection, contain failures, and adopt proactive patterns such as static stability, in which capacity is over-provisioned in advance so that the infrastructure does not need to scale up during a failure.
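
As a rough illustration of static stability, the sketch below estimates how many instances to run in each AZ so that the surviving zones can carry peak load without scaling. The function names and the example numbers are hypothetical, and the model assumes load spreads evenly across zones.

```python
import math


def instances_per_az(peak_instances: int, az_count: int,
                     az_failures_tolerated: int = 1) -> int:
    """Instances to run in every AZ so that the remaining AZs can serve
    peak load after `az_failures_tolerated` zones are lost."""
    surviving = az_count - az_failures_tolerated
    if peak_instances < 1 or surviving < 1:
        raise ValueError("need at least one instance and one surviving AZ")
    return math.ceil(peak_instances / surviving)


def total_provisioned(peak_instances: int, az_count: int,
                      az_failures_tolerated: int = 1) -> int:
    """Total fleet size across all AZs under the same assumption."""
    per_az = instances_per_az(peak_instances, az_count, az_failures_tolerated)
    return per_az * az_count


if __name__ == "__main__":
    # Hypothetical workload: 12 instances are needed at peak.
    for zones in (2, 3):
        print(zones, instances_per_az(12, zones), total_provisioned(12, zones))
```

In this model, spreading the same workload over three zones instead of two reduces the extra capacity needed to tolerate the loss of one zone, which is one reason multi-AZ designs often use three zones.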

Key infrastructure decisions that enhance system resiliency include:

AWS services feature control and data planes designed to minimize blast radius. Data planes typically possess higher availability goals than control planes and tend to be less complex. Utilizing control plane operations for recovery or mitigation responses can inadvertently lower your architecture’s overall resiliency. For instance, Amazon Route 53 is engineered with a data plane designed to achieve a 100% availability SLA. An effective failover mechanism should rely on the data plane rather than the control plane, as detailed in Creating Disaster Recovery Mechanisms Using Amazon Route 53.
Understanding networking design and routing within a virtual private cloud (VPC) is essential for testing application traffic flow. This knowledge aids in creating better applications and identifying how a failure in one component can impact overall ingress and egress traffic. To enhance network resiliency, implementing a solid subnet strategy and managing IP addresses is vital to prevent failover issues and asymmetric routing in hybrid architectures. Utilizing IP address management tools for established subnet strategies and routing decisions is advisable.
When designing VPCs that span multiple AZs, staying aware of service limits and deploying independent route tables and components in each zone improves availability. For example, managed NAT gateways, which are highly available within their AZ, are preferable to self-managed NAT instances, as highlighted in AWS documentation.
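
The subnet strategy point above can be sketched with Python's standard `ipaddress` module. This is an illustrative planning helper only, not an AWS API: the CIDR ranges, zone labels, and function names are hypothetical. It carves one subnet per AZ out of a VPC range and checks the VPC range against existing networks, since overlapping ranges are a common source of routing problems in hybrid architectures.

```python
import ipaddress
from typing import Dict, List


def plan_az_subnets(vpc_cidr: str, az_names: List[str],
                    new_prefix: int) -> Dict[str, ipaddress.IPv4Network]:
    """Assign one non-overlapping subnet of size /new_prefix to each AZ."""
    vpc = ipaddress.IPv4Network(vpc_cidr)
    subnets = vpc.subnets(new_prefix=new_prefix)  # generator of subnets
    plan = {}
    for az in az_names:
        try:
            plan[az] = next(subnets)
        except StopIteration:
            raise ValueError("VPC range is too small for the requested subnets")
    return plan


def find_overlaps(candidate: str, existing: List[str]) -> List[str]:
    """Return the existing CIDR ranges that overlap the candidate range."""
    net = ipaddress.IPv4Network(candidate)
    return [c for c in existing if net.overlaps(ipaddress.IPv4Network(c))]


if __name__ == "__main__":
    # Hypothetical VPC range and zone labels.
    for az, subnet in plan_az_subnets("10.0.0.0/16",
                                      ["az-a", "az-b", "az-c"], 20).items():
        print(az, subnet)
    # Hypothetical on-premises ranges to check before connecting networks.
    print(find_overlaps("10.0.0.0/16", ["10.0.128.0/17", "192.168.0.0/16"]))
```

A dedicated IP address management tool does far more than this, but the same two checks (consistent per-zone allocation and no overlap with connected networks) are the core of the planning step.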

Pattern 3: Exploring Various Approaches to Enhance HA at the Infrastructure Layer

As previously discussed, infrastructure resilience is the sum of HA and DR. Strategies to bolster system availability include:

Implementing Redundancy: This means duplicating application components to elevate the distributed system’s overall availability. By adhering to application layer best practices, auto-healing mechanisms can be established at the infrastructure level.
Utilizing Auto Scaling: In the event of AZ failures, infrastructure auto scaling helps maintain the desired number of redundant components so that baseline application throughput is preserved. It also balances availability against cost by adjusting capacity based on metrics.
Establishing Resilient Network Connectivity Patterns: Building highly resilient distributed systems requires robust network access to AWS infrastructure. For hybrid applications, the capacity needed to communicate with cloud-native counterparts is a key input when designing network access via AWS Direct Connect or VPNs. Testing failover and fallback scenarios validates that network paths behave as intended and that routes fail over quickly enough to meet the RTO. As the number of connection points between the data center and AWS VPCs grows, a hub-and-spoke configuration built on Direct Connect and transit gateways simplifies network topology, testing, and failover. For more information, refer to AWS Direct Connect Resiliency Recommendations.
Considering Security Measures: Security appliances should be set up in HA configurations to ensure that if one AZ becomes unavailable, security inspections can continue through redundant appliances in other AZs. Planning for DNS resolution is also essential. DNS is a fundamental infrastructure element; hybrid DNS resolution should be carefully designed with Route 53 HA inbound and outbound resolver endpoints instead of relying on self-managed proxies. Implementing a strategy to share DNS resolver rules across AWS accounts and VPCs using Resource Access Manager is also beneficial. Network failover tests are integral to Disaster Recovery and Business Continuity Plans.
Leveraging Managed Services: The principle of redundancy impacting availability applies equally to AWS infrastructure components. Services such as AWS Lambda, Amazon Simple Queue Service, Elastic Load Balancing (ELB), and Amazon Simple Storage Service utilize multiple AZs to ensure resiliency. ELB also employs health checks to guarantee that requests are redirected to operational components as needed.
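
The idea of routing only to components that pass health checks can be sketched with a toy model. This is not how ELB is implemented; it is a simplified illustration in which the class names, thresholds, and target names are all hypothetical. A target is taken out of rotation after a number of consecutive failed checks and returned after a number of consecutive successes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Target:
    name: str
    zone: str
    healthy: bool = True
    fail_streak: int = 0
    ok_streak: int = 0


class HealthAwarePool:
    """Toy round-robin pool that skips targets marked unhealthy."""

    def __init__(self, targets: List[Target],
                 unhealthy_threshold: int = 2, healthy_threshold: int = 2):
        self._targets: Dict[str, Target] = {t.name: t for t in targets}
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self._cursor = 0

    def record_check(self, name: str, passed: bool) -> None:
        """Update a target's state from one health check result."""
        t = self._targets[name]
        if passed:
            t.ok_streak += 1
            t.fail_streak = 0
            if not t.healthy and t.ok_streak >= self.healthy_threshold:
                t.healthy = True
        else:
            t.fail_streak += 1
            t.ok_streak = 0
            if t.healthy and t.fail_streak >= self.unhealthy_threshold:
                t.healthy = False

    def next_target(self) -> Target:
        """Pick the next healthy target in round-robin order."""
        healthy = [t for t in self._targets.values() if t.healthy]
        if not healthy:
            raise RuntimeError("no healthy targets available")
        chosen = healthy[self._cursor % len(healthy)]
        self._cursor += 1
        return chosen


if __name__ == "__main__":
    pool = HealthAwarePool([Target("app-1", "az-a"), Target("app-2", "az-b")])
    pool.record_check("app-1", passed=False)
    pool.record_check("app-1", passed=False)  # second failure removes app-1
    print(pool.next_target().name)
```

Using consecutive-result thresholds rather than a single check avoids removing a target because of one transient failure, at the cost of a slightly slower reaction to real outages.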

By adopting these strategies, organizations can enhance their infrastructure resilience, ensuring they are well-prepared for potential failures.
