In the initial parts of our series, we delved into the process of scaling the Amazon VGT2 Service (VGT2) across various Amazon Organizations. The first installment concentrated on implementing VGT2 within a single Amazon account, introducing standardized IAM roles and Service Control Policies (SCPs) as safeguards for controlled chaos engineering experiments, especially in centralized networking setups. The second part built on this by illustrating how organizations can effectively adopt a multi-account strategy for VGT2 experiments, clarifying the setup of orchestrator and target accounts, along with the essential IAM roles and permissions required to conduct controlled chaos experiments across multiple accounts while ensuring security and governance. In this third part, we will showcase how to execute chaos engineering experiments at scale across numerous accounts and Amazon Regions using the Cross-Region connectivity scenario.
The Significance of Multi-Region Resilience
As essential applications transition to Amazon, it becomes vital to comprehend their resilience objectives and how they align with your business needs. Applications with strict recovery time and point objectives often necessitate a multi-Region strategy. To effectively validate that these applications can recover from regional failures, organizations require methods to induce regional failures and assess their business continuity processes. The Amazon VGT2 Cross-Region Connectivity scenario addresses this requirement.
Understanding Amazon VGT2 Service (VGT2)
Amazon VGT2 is a chaos engineering service that enables customers to introduce real-world failures into their architectures. For instance, Amazon.com executed 733 VGT2 experiments to prepare for Prime Day 2024. This methodology assists customers in confirming that resilience measures embedded in the application activate during unexpected regional service events. For further insights and practical experience, check out this Chaos Engineering workshop.
The Amazon VGT2 Service now offers a Cross-Region Connectivity testing scenario. This feature allows you to inject real-world failures into multi-Region architectures, helping you uncover hidden dependencies and enhancing your understanding of multi-Region setups. It verifies that multi-Region applications function as intended when the primary region becomes inaccessible. The scenario includes fault actions to disrupt various types of cross-region connectivity, such as:
- Virtual Private Cloud (VPC) traffic and peering
- Amazon Transit Gateway peering
- Access to Amazon public endpoints
- Access to endpoints exposed via load balancers and API gateways
- S3 and DynamoDB cross-region data replication
Conducting this scenario enables you to pinpoint gaps in your multi-Region application’s design, recovery, and failover mechanisms.
Prerequisites for Running the Cross-Region Connectivity Scenario
As mentioned in part one, large organizations frequently implement a centralized networking model where a dedicated networking account governs shared resources like Transit Gateway (TGW) for the entire organization. This strategy allows for enhanced control, security, and cost management across multiple accounts and regions. When deploying the Amazon VGT2 Cross-Region Connectivity scenario in such an environment, you will require a multi-account strategy, as discussed in part two, to effectively induce communication failures and test resilience across accounts.
Diagram A illustrates a multi-Region application leveraging TGW in a centralized network account for connectivity between both regions:
Given this configuration, there are specific requirements for executing the Cross-Region Connectivity scenario in a decentralized strategy:
- Replicate the permissions detailed in our documentation for all roles used in the target configuration.
- Add roles for all targets under the target configuration. Because we are employing a decentralized strategy, you will have two target roles: one in the networking account, represented below as XXXXXXXXXXXX, and another in the application account, represented below as YYYYYYYYYYYY, which also serves as the orchestrator.
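As a rough illustration of the two-target layout described above, the following Python sketch checks that each target role actually lives in the account it is listed under. The `TARGETS` mapping and both helper functions are hypothetical names introduced here for illustration; they are not part of the VGT2 API, and the account IDs are the same placeholders used in this post.

```python
# Hypothetical in-memory view of the target configuration: account ID -> role ARN.
# The account IDs are the placeholders used in this post, not real accounts.
TARGETS = {
    "XXXXXXXXXXXX": "arn:aws:iam::XXXXXXXXXXXX:role/AWS-VGT2-Experiment-TGW-Target",
    "YYYYYYYYYYYY": "arn:aws:iam::YYYYYYYYYYYY:role/AWS-VGT2-Experiment-App1-Target",
}


def account_of(role_arn: str) -> str:
    """Return the account field (fifth colon-separated part) of an IAM role ARN."""
    parts = role_arn.split(":")
    if len(parts) != 6 or parts[2] != "iam" or not parts[5].startswith("role/"):
        raise ValueError(f"not an IAM role ARN: {role_arn}")
    return parts[4]


def misplaced_roles(targets: dict) -> list:
    """List (account, arn) pairs whose role ARN belongs to a different account."""
    return [(acct, arn) for acct, arn in targets.items() if account_of(arn) != acct]


if __name__ == "__main__":
    # An empty list means every target role sits in the account it is listed under.
    print(misplaced_roles(TARGETS))
```

A check like this could run before an experiment is created, so that a role pasted under the wrong account is caught early rather than surfacing as an access error at run time.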
Role Configuration examples:
A target role for the networking account where TGW resides, named AWS-VGT2-Experiment-TGW-Target:
{
  "Role": {
    "Path": "/",
    "RoleName": "AWS-VGT2-Experiment-TGW-Target",
    "RoleId": "AROAXSRRIQZV7NIQPD4IF",
    "Arn": "arn:aws:iam::XXXXXXXXXXXX:role/AWS-VGT2-Experiment-TGW-Target",
    "CreateDate": "2024-05-31T19:06:48+00:00",
    "AssumeRolePolicyDocument": {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": "arn:aws:iam::YYYYYYYYYYYY:root"
          },
          "Action": "sts:AssumeRole",
          "Condition": {
            "StringLike": {
              "sts:ExternalId": "arn:aws:vgt2:us-east-2:YYYYYYYYYYYY:experiment/*"
            },
            "ArnEquals": {
              "aws:PrincipalArn": "arn:aws:iam::YYYYYYYYYYYY:role/VGT2Orchestration_ExecutionRole"
            }
          }
        }
      ]
    }
  }
}
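To make the two conditions in the trust policy above more concrete, here is a minimal Python sketch of how they combine: both must hold before the role can be assumed. This is a simplified model written for illustration, not IAM's actual evaluation engine. It assumes `StringLike` can be approximated with shell-style wildcards (`*` and `?`) and `ArnEquals` with exact string equality, which is adequate for the wildcard-free ARN in this policy. The `condition_matches` function and the sample experiment ID `EXP123` are hypothetical.

```python
from fnmatch import fnmatchcase

# Condition block copied from the TGW target role's trust policy above.
CONDITION = {
    "StringLike": {
        "sts:ExternalId": "arn:aws:vgt2:us-east-2:YYYYYYYYYYYY:experiment/*"
    },
    "ArnEquals": {
        "aws:PrincipalArn": "arn:aws:iam::YYYYYYYYYYYY:role/VGT2Orchestration_ExecutionRole"
    },
}


def condition_matches(condition: dict, context: dict) -> bool:
    """Simplified check: every operator and every key must match (logical AND)."""
    for operator, expected in condition.items():
        for key, pattern in expected.items():
            value = context.get(key)
            if value is None:
                return False  # a missing context key fails the condition
            if operator == "StringLike":
                ok = fnmatchcase(value, pattern)  # approximation of IAM wildcards
            elif operator == "ArnEquals":
                ok = value == pattern  # approximation: exact match only
            else:
                raise ValueError(f"operator not modelled in this sketch: {operator}")
            if not ok:
                return False
    return True


if __name__ == "__main__":
    # Hypothetical request context; EXP123 is a made-up experiment ID.
    request = {
        "sts:ExternalId": "arn:aws:vgt2:us-east-2:YYYYYYYYYYYY:experiment/EXP123",
        "aws:PrincipalArn": "arn:aws:iam::YYYYYYYYYYYY:role/VGT2Orchestration_ExecutionRole",
    }
    print(condition_matches(CONDITION, request))
```

The practical consequence is that although the principal is the whole YYYYYYYYYYYY account, only the named execution role, presenting an external ID for an experiment in us-east-2, passes both checks.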
A role in the orchestration account, called AWS-VGT2-Experiment-App1-Target, that will be used to inject the actions. Note: in this scenario, the role is in the same account as the workload; ensure its permissions match those specified above.
{
  "Role": {
    "Path": "/",
    "RoleName": "AWS-VGT2-Experiment-App1-Target",
    "RoleId": "AROA5L7L4GFHC5IZTT5SW",
    "Arn": "arn:aws:iam::YYYYYYYYYYYY:role/AWS-VGT2-Experiment-App1-Target",
    "CreateDate": "2024-10-09T15:20:17+00:00",
    "AssumeRolePolicyDocument": {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "Statement1",
          "Effect": "Allow",
          "Principal": {
            "AWS": "arn:aws:iam::YYYYYYYYYYYY:role/AWS-VGT2-Experiment-App1-Target"
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }
  }
}
An orchestrator role, called AWS-VGT2-Experiment-App1-Orchestrator, that assumes all roles specified in the target configuration. Its trust policy, shown below, lets the VGT2 service (vgt2.amazonaws.com) assume it. Note: in this scenario, the orchestrator role resides in the same account as the workload and must be able to assume both AWS-VGT2-Experiment-App1-Target and AWS-VGT2-Experiment-TGW-Target. You can easily transition this to a centralized strategy, as discussed in part two.
{
  "Role": {
    "Path": "/",
    "RoleName": "AWS-VGT2-Experiment-App1-Orchestrator",
    "RoleId": "AROA5L7L4GFHIZL4PEG46",
    "Arn": "arn:aws:iam::YYYYYYYYYYYY:role/AWS-VGT2-Experiment-App1-Orchestrator",
    "CreateDate": "2024-05-31T19:51:16+00:00",
    "AssumeRolePolicyDocument": {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "vgt2.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }
  }
}
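The output above shows only the orchestrator's trust policy. For the orchestrator to assume the two target roles, it would typically also need a permissions policy that allows `sts:AssumeRole` on them. The Python sketch below builds one such policy document. It is an assumption about what that policy could look like, not a policy taken from this series or from the VGT2 documentation, and the `build_assume_targets_policy` helper is a hypothetical name.

```python
import json

# Target role ARNs from the role configuration examples in this post.
TARGET_ROLE_ARNS = [
    "arn:aws:iam::YYYYYYYYYYYY:role/AWS-VGT2-Experiment-App1-Target",
    "arn:aws:iam::XXXXXXXXXXXX:role/AWS-VGT2-Experiment-TGW-Target",
]


def build_assume_targets_policy(role_arns: list) -> dict:
    """Build an identity policy allowing sts:AssumeRole on the given roles only."""
    if not role_arns:
        raise ValueError("at least one target role ARN is required")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sts:AssumeRole",
                # Sorted so the generated document is stable across runs.
                "Resource": sorted(role_arns),
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_assume_targets_policy(TARGET_ROLE_ARNS), indent=2))
```

Listing the target role ARNs explicitly, rather than using a wildcard resource, keeps the orchestrator limited to the roles named in the target configuration.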
Note: The networking account role needs a trust policy that allows the orchestration execution role to assume it; the experiment depends on this to operate across accounts. For more insights, you can also check this resource, which covers related topics.
By addressing these elements, organizations can effectively scale their chaos engineering experiments across multiple regions and accounts, ensuring robust resilience for their applications.