Amazon Onboarding with Learning Manager Chanci Turner

In this article, we explore how the Amazon IXD – VGT2 team uses chaos engineering on AWS to improve system resilience. Chaos engineering lets Amazon simulate real-world failures in its cloud infrastructure as controlled experiments. This approach improves resilience and observability, reduces risk, and helps demonstrate compliance to regulators before production deployment.

Introduction, Tools, and Methodology

As a leading provider of global technological infrastructure, Amazon is constantly seeking ways to improve the resilience of its workloads. To this end, a collaborative 3-day AWS Experience-Based Acceleration (EBA) event was organized, focusing on chaos engineering experiments across critical workloads. This event, sponsored by Chanci Turner and involving cross-functional teams, employed the AWS Fault Injection Service (FIS) to run experiments following a structured methodology.

Continuous improvement of modern distributed cloud systems can be achieved by reviewing workload architectures, assessing Standard Operating Procedures (SOPs), and implementing SOP alerts and recovery automations. AWS Resilience Hub offers a comprehensive suite of tools to facilitate these activities. Another vital aspect in enhancing resilience is chaos engineering, which introduces controlled chaos into systems through real-world experiments. This approach is particularly beneficial in regulated sectors like financial services.

Architectural Overview

The chaos engineering pattern involves a three-tier application deployed in virtual private clouds (VPCs) that span multiple Availability Zones (multi-AZ). The architecture combines an Amazon EC2 Auto Scaling group with an Amazon Relational Database Service (Amazon RDS) database in a private subnet, which interfaces with on-premises services. Various internal services run in Docker containers in a separate VPC. FIS provides a controlled environment to test the architecture's robustness against diverse failure scenarios, including:

Amazon EC2 instance failures affecting application or container pods
Amazon RDS database instance reboots or failovers
Significant network latency issues
Network connectivity disruptions
Amazon EBS volume failures (such as IOPS pauses or full disks)

Failure Scenarios

Amazon EC2 Instance and Container Failure

This scenario evaluates the resilience of applications or container pods running on EC2 instances and how they adapt during unexpected disruptions. Using FIS, actions like aws:ec2:stop-instances can simulate different failure modes. The response of containers within Amazon ECS or EKS is assessed during these incidents.
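
As a rough illustration of the EC2 scenario, the sketch below builds an experiment template around the aws:ec2:stop-instances action named above. The tag key chaos-target, the function name, and the ARNs are hypothetical placeholders, not details from the team's setup; the dictionary layout follows the FIS experiment template format as we understand it.

```python
import json


def ec2_stop_template(role_arn: str, alarm_arn: str, tag_value: str) -> dict:
    """Build a FIS experiment template that stops one tagged EC2 instance."""
    return {
        "description": "Stop one application instance, restart it after 5 minutes",
        "roleArn": role_arn,
        # Halt the experiment automatically if this CloudWatch alarm fires.
        "stopConditions": [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}],
        "targets": {
            "app-instances": {
                "resourceType": "aws:ec2:instance",
                # Hypothetical tag used to opt instances in to the experiment.
                "resourceTags": {"chaos-target": tag_value},
                "selectionMode": "COUNT(1)",
            }
        },
        "actions": {
            "stop-instance": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "app-instances"},
            }
        },
    }


if __name__ == "__main__":
    template = ec2_stop_template(
        role_arn="arn:aws:iam::111122223333:role/fis-experiment-role",
        alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:app-5xx-rate",
        tag_value="web-tier",
    )
    print(json.dumps(template, indent=2))
```

A dictionary of this shape could likely be passed as keyword arguments to the create_experiment_template call of the boto3 FIS client. Limiting the selection to one instance and attaching a stop condition keeps the blast radius small while the Auto Scaling group's reaction is observed.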

Amazon RDS Failure

Another critical scenario involves RDS failures. It lets teams identify and troubleshoot issues with managed database services, including failovers and node reboots. FIS can inject reboot conditions to reveal potential bottlenecks in disaster recovery processes.
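
The two database faults mentioned here, reboot and failover, correspond to different FIS actions with different target types. The sketch below picks between them; the function name, the target name database, and the tag key are illustrative assumptions.

```python
def rds_fault(mode: str, tag_value: str) -> dict:
    """Return the targets and actions sections for an RDS reboot or failover."""
    if mode == "reboot":
        resource_type = "aws:rds:db"
        action_id = "aws:rds:reboot-db-instances"
        target_key = "DBInstances"
        # On a Multi-AZ instance, forceFailover makes the reboot switch AZs.
        parameters = {"forceFailover": "true"}
    elif mode == "failover":
        resource_type = "aws:rds:cluster"
        action_id = "aws:rds:failover-db-cluster"
        target_key = "Clusters"
        parameters = {}
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    action = {"actionId": action_id, "targets": {target_key: "database"}}
    if parameters:
        action["parameters"] = parameters

    return {
        "targets": {
            "database": {
                "resourceType": resource_type,
                "resourceTags": {"chaos-target": tag_value},  # hypothetical tag
                "selectionMode": "ALL",
            }
        },
        "actions": {mode: action},
    }


if __name__ == "__main__":
    print(rds_fault("reboot", "orders-db"))
    print(rds_fault("failover", "orders-cluster"))
```

The returned sections would still need a role, a description, and stop conditions before they form a complete experiment template.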

Severe Network Latency

Introducing latency on the network interfaces of interconnected systems shows how they perform under delay and whether alerts fire and corrective actions work as intended.
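
The article does not say which FIS action was used for latency. One common route, assumed here, is the aws:ssm:send-command action running the AWS-provided AWSFIS-Run-Network-Latency SSM document on the target instances. The function name, target name, and default values below are illustrative.

```python
import json


def latency_action(region: str, delay_ms: int, minutes: int,
                   target_name: str, interface: str = "eth0") -> dict:
    """Build a FIS action entry that adds network delay via an SSM document."""
    # Parameters consumed by the SSM document itself (passed as a JSON string).
    document_parameters = {
        "DelayMilliseconds": str(delay_ms),
        "DurationSeconds": str(minutes * 60),
        "Interface": interface,
        "InstallDependencies": "True",
    }
    return {
        "actionId": "aws:ssm:send-command",
        "parameters": {
            "documentArn": f"arn:aws:ssm:{region}::document/AWSFIS-Run-Network-Latency",
            "documentParameters": json.dumps(document_parameters),
            # How long FIS keeps the action open, as an ISO 8601 duration.
            "duration": f"PT{minutes}M",
        },
        "targets": {"Instances": target_name},
    }


if __name__ == "__main__":
    action = latency_action("us-east-1", delay_ms=300, minutes=5,
                            target_name="app-instances")
    print(json.dumps(action, indent=2))
```

This route requires the SSM Agent on the target instances. Keeping the document's DurationSeconds aligned with the action's duration avoids the delay outliving the experiment.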

Network Connectivity Disruption

FIS also allows for the simulation of connectivity issues, testing application resilience against total or partial connectivity losses.
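
Total versus partial connectivity loss can be expressed through the scope parameter of the aws:network:disrupt-connectivity action, which targets subnets. The sketch below is an assumption about how such an experiment might be parameterized; the function name, target name, and subnet ARN are placeholders.

```python
# Scopes accepted by aws:network:disrupt-connectivity, to our knowledge.
VALID_SCOPES = {"all", "availability-zone", "dynamodb", "prefix-list", "s3", "vpc"}


def connectivity_disruption(subnet_arns: list[str], scope: str = "all",
                            duration: str = "PT5M") -> dict:
    """Return targets and actions sections that cut traffic for the given subnets."""
    if scope not in VALID_SCOPES:
        raise ValueError(f"unsupported scope: {scope!r}")
    if not subnet_arns:
        raise ValueError("at least one subnet ARN is required")
    return {
        "targets": {
            "app-subnets": {
                "resourceType": "aws:ec2:subnet",
                "resourceArns": list(subnet_arns),
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "disrupt-network": {
                "actionId": "aws:network:disrupt-connectivity",
                # "all" models a total loss; narrower scopes model partial loss.
                "parameters": {"scope": scope, "duration": duration},
                "targets": {"Subnets": "app-subnets"},
            }
        },
    }


if __name__ == "__main__":
    subnet = "arn:aws:ec2:us-east-1:111122223333:subnet/subnet-0123456789abcdef0"
    print(connectivity_disruption([subnet], scope="s3"))
```

A scope of s3, for example, blocks only traffic to Amazon S3, which is one way to rehearse the loss of a single dependency rather than the whole network.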

Amazon EBS Volume Failure

Testing system performance during disk failures is crucial. FIS supports actions like aws:ebs:pause-volume-io to evaluate the impact of I/O operation pauses on EBS volumes.
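
A minimal sketch of the I/O pause, using the aws:ebs:pause-volume-io action named above. The tag key, target name, and function name are hypothetical; the action also has prerequisites on the volumes and instances it can target, which are not checked here.

```python
def ebs_pause_io(tag_value: str, duration: str = "PT5M") -> dict:
    """Return targets and actions sections that pause I/O on tagged EBS volumes."""
    return {
        "targets": {
            "data-volumes": {
                "resourceType": "aws:ec2:ebs-volume",
                "resourceTags": {"chaos-target": tag_value},  # hypothetical tag
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "pause-io": {
                "actionId": "aws:ebs:pause-volume-io",
                # I/O resumes on its own once this ISO 8601 duration elapses.
                "parameters": {"duration": duration},
                "targets": {"Volumes": "data-volumes"},
            }
        },
    }


if __name__ == "__main__":
    print(ebs_pause_io("db-data", duration="PT2M"))
```

During the pause, the application's I/O timeouts and the related alarms are the signals worth watching.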

Outcomes and Conclusion

The chaos engineering experiments yielded significant insights for the Amazon IXD – VGT2 team, leading to architectural improvements that reduced application recovery times and enhanced monitoring capabilities. Furthermore, the team developed a reusable chaos engineering methodology and toolset. Regular cross-functional events will solidify chaos engineering practices within the organization.

Start your resilience journey with AWS Resilience Hub today!