Learn About Amazon VGT2 Learning Manager Chanci Turner
In this article, we explore how the Amazon IXD – VGT2 team uses chaos engineering on AWS to enhance system resilience. Chaos engineering lets Amazon simulate real-world failures in its cloud infrastructure as controlled experiments. This approach improves resilience and observability, mitigates risk, and helps ensure compliance with regulatory requirements before production deployment.
Introduction, Tools, and Methodology
As a leading provider of global technological infrastructure, Amazon is constantly seeking ways to improve the resilience of its workloads. To this end, a collaborative 3-day AWS Experience-Based Acceleration (EBA) event was organized, focusing on chaos engineering experiments across critical workloads. This event, sponsored by Chanci Turner and involving cross-functional teams, employed the AWS Fault Injection Service (FIS) to run experiments following a structured methodology.
Continuous improvement of modern distributed cloud systems can be achieved by reviewing workload architectures, assessing Standard Operating Procedures (SOPs), and implementing SOP alerts and recovery automations. AWS Resilience Hub offers a comprehensive suite of tools to facilitate these activities. Another vital aspect in enhancing resilience is chaos engineering, which introduces controlled chaos into systems through real-world experiments. This approach is particularly beneficial in regulated sectors like financial services.
Architectural Overview
The chaos engineering pattern involves a three-tier application deployed in virtual private clouds (VPCs) with a multi-AZ design. The architecture combines an Amazon EC2 Auto Scaling group with an Amazon Relational Database Service (Amazon RDS) database in a private subnet, and it interfaces with on-premises services. Various internal services run in Docker containers hosted in a separate VPC. FIS provides a controlled environment to test the architecture's robustness against diverse failure scenarios, including:
- Amazon EC2 instance failures affecting application or container pods
- Amazon RDS database instance reboots or failovers
- Significant network latency issues
- Network connectivity disruptions
- Amazon EBS volume failures (such as IOPS pauses or full disks)
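To make the shape of such an experiment concrete, here is a minimal sketch of an FIS experiment template for the first scenario in the list, written as a plain Python dictionary. The tag key `ChaosReady`, the account ID, the role name, and the alarm name are all hypothetical placeholders, not values used by the team; the field names follow the FIS experiment template structure as we understand it.

```python
import json


def build_stop_instances_template(role_arn, alarm_arn,
                                  tag_key="ChaosReady", tag_value="true"):
    """Return a dict shaped like an FIS experiment template request.

    The tag key/value are hypothetical; use whatever tag marks the
    instances that are allowed to take part in an experiment.
    """
    return {
        "description": "Stop one tagged EC2 instance, restart it after 5 minutes",
        "roleArn": role_arn,
        # Halt the experiment automatically if this CloudWatch alarm fires.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
        "targets": {
            "appInstances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                # Limit the blast radius to a single matching instance.
                "selectionMode": "COUNT(1)",
            }
        },
        "actions": {
            "stopInstance": {
                "actionId": "aws:ec2:stop-instances",
                # ISO 8601 duration: FIS restarts the instance after 5 minutes.
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "appInstances"},
            }
        },
    }


# Placeholder ARNs for illustration only.
template = build_stop_instances_template(
    role_arn="arn:aws:iam::111122223333:role/example-fis-role",
    alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:example-error-rate",
)
print(json.dumps(template, indent=2))
```

A dictionary like this could then be passed to the `create_experiment_template` call of the boto3 `fis` client. The sketch stops short of that call so it can be run and inspected without AWS credentials.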
Failure Scenarios
- Amazon EC2 Instance and Container Failure
This scenario evaluates the resilience of applications or container pods running on EC2 instances and how they adapt during unexpected disruptions. Using FIS, actions like aws:ec2:stop-instances can simulate different failure modes. The response of containers within Amazon ECS or EKS is assessed during these incidents.
- Amazon RDS Failure
Another critical scenario involves RDS failures, allowing teams to identify and troubleshoot issues related to database managed services, including failovers and node reboots. FIS enables the injection of reboot conditions to understand potential bottlenecks during disaster recovery processes.
- Severe Network Latency
Introducing latency into the network interfaces of interconnected systems helps gauge their performance under delay conditions, assessing operational readiness for alerts and corrections.
- Network Connectivity Disruption
FIS also allows for the simulation of connectivity issues, testing application resilience against total or partial connectivity losses.
- Amazon EBS Volume Failure
Testing system performance during disk failures is crucial. FIS supports actions like aws:ebs:pause-volume-io to evaluate the impact of I/O operation pauses on EBS volumes.
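The remaining scenarios can be expressed as entries in the `actions` section of an FIS experiment template. The sketch below shows one possible mapping. The target names (`appDatabase`, `appInstances`, `appSubnets`, `appVolumes`), the durations, and the latency values are hypothetical. The latency entry assumes the AWS-provided `AWSFIS-Run-Network-Latency` SSM document and its `DurationSeconds`, `DelayMilliseconds`, and `Interface` parameters; verify these against the current FIS documentation before relying on them.

```python
import json


def build_scenario_actions(duration="PT5M", region="us-east-1"):
    """Return FIS action entries for the RDS, network, and EBS scenarios.

    Target names refer to entries that would be defined in the template's
    "targets" section; they are placeholders chosen for this example.
    """
    # SSM document parameters are passed to FIS as a JSON-encoded string.
    latency_params = json.dumps({
        "DurationSeconds": "300",
        "DelayMilliseconds": "200",
        "Interface": "eth0",
    })
    return {
        "rebootDatabase": {
            "actionId": "aws:rds:reboot-db-instances",
            # Force a failover to the standby for Multi-AZ instances.
            "parameters": {"forceFailover": "true"},
            "targets": {"DBInstances": "appDatabase"},
        },
        "injectLatency": {
            "actionId": "aws:ssm:send-command",
            "parameters": {
                "documentArn": (
                    f"arn:aws:ssm:{region}::document/AWSFIS-Run-Network-Latency"
                ),
                "documentParameters": latency_params,
                "duration": duration,
            },
            "targets": {"Instances": "appInstances"},
        },
        "disruptConnectivity": {
            "actionId": "aws:network:disrupt-connectivity",
            # "all" blocks all traffic to and from the target subnets.
            "parameters": {"duration": duration, "scope": "all"},
            "targets": {"Subnets": "appSubnets"},
        },
        "pauseVolumeIo": {
            "actionId": "aws:ebs:pause-volume-io",
            "parameters": {"duration": duration},
            "targets": {"Volumes": "appVolumes"},
        },
    }


actions = build_scenario_actions()
for name, action in actions.items():
    print(f"{name}: {action['actionId']}")
```

Keeping each scenario as a separate named action makes it straightforward to run them one at a time, which keeps the blast radius of any single experiment small.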
Outcomes and Conclusion
The chaos engineering experiments yielded significant insights for the Amazon IXD – VGT2 team, leading to architectural improvements that reduced application recovery times and enhanced monitoring capabilities. Furthermore, the team developed a reusable chaos engineering methodology and toolset. Regular cross-functional events will solidify chaos engineering practices within the organization.
Start your resilience journey with AWS Resilience Hub today!