Scaling the AWS Fault Injection Service Across Your Organization and Regions | Amazon VGT2 Las Vegas

In the initial segments of our series, we examined the scaling of the AWS Fault Injection Service (FIS) across AWS Organizations. The first part delved into deploying FIS within a single AWS account, introducing the frameworks of standardized IAM roles and Service Control Policies (SCPs) as safeguards for orchestrated chaos engineering experiments, particularly in centralized networking structures. The second part built upon this by illustrating how organizations can adopt a multi-account strategy for FIS experiments, detailing the arrangement of orchestrator and target accounts, alongside the requisite IAM roles and permissions to conduct controlled chaos experiments across various accounts while upholding security and governance. In this third installment, we will showcase how to execute chaos engineering experiments at scale across multiple accounts and AWS Regions through the Cross-Region connectivity scenario.

The Significance of Multi-Region Resilience

With essential applications transitioning to AWS, it’s vital to understand their resilience goals and how they align with your business priorities. Applications with strict recovery time objectives (RTO) and recovery point objectives (RPO) often require a multi-Region approach. To validate that these applications can withstand regional disruptions, organizations need tools that inject regional failures and confirm that their business continuity strategies work. The AWS FIS Cross-Region Connectivity scenario fulfills this requirement.

Understanding AWS Fault Injection Service (FIS)

AWS FIS is a chaos engineering service that empowers customers to introduce real-world failures into their architectures. For instance, Amazon.com conducted 733 AWS FIS experiments in preparation for Prime Day 2024. This methodology enables customers to validate that the resilience strategies integrated into the application are activated during unexpected regional service incidents. For further insights and practical experience, check out our Chaos Engineering workshop.

AWS Fault Injection Service (FIS) now features a Cross-Region Connectivity testing scenario. It lets you inject real-world failures into multi-Region architectures, helping you identify hidden dependencies and better understand your multi-Region configuration. It helps you verify that multi-Region applications keep functioning when the primary Region becomes unavailable. The scenario incorporates fault actions that disrupt various forms of cross-Region connectivity, including:

  • Virtual Private Cloud (VPC) traffic and peering
  • AWS Transit Gateway peering
  • Access to AWS public endpoints
  • Connectivity to endpoints exposed via load balancers and API gateways
  • S3 and DynamoDB cross-Region data replication

By executing this scenario, you can pinpoint deficiencies in your multi-Region application design, recovery, and failover mechanisms.

Prerequisites for Running the Cross-Region Connectivity Scenario

As mentioned in part one, large enterprises frequently adopt a centralized networking model, where a dedicated networking account manages shared resources such as Transit Gateway (TGW) for the entire organization. This structure allows for enhanced control, security, and cost management across multiple accounts and Regions. To run the AWS FIS Cross-Region Connectivity scenario in such an environment, you’ll need the multi-account strategy outlined in part two, so that you can inject communication failures and evaluate resilience across accounts.

Diagram A illustrates a multi-Region application that uses TGW in a centralized network account for connectivity between the two Regions.

Given this setup, specific requirements exist for executing the Cross-Region Connectivity scenario in a decentralized strategy:

  1. Replicate the permissions specified in our documentation for all roles used in the target configuration.
  2. Add roles for all targets under the target configuration. In a decentralized strategy, you will have two target roles: one in the networking account (shown below as XXXXXXXXXXXX) and one in the application account (shown below as YYYYYYYYYYYY), which also acts as the orchestrator account.

Role Configuration Examples:

A target role for the networking account where TGW resides. Here we are using the AWS-FIS-Experiment-TGW-Target role.

{
    "Role": {
        "Path": "/",
        "RoleName": "AWS-FIS-Experiment-TGW-Target",
        "RoleId": "AROAXSRRIQZV7NIQPD4IF",
        "Arn": "arn:aws:iam::XXXXXXXXXXXX:role/AWS-FIS-Experiment-TGW-Target",
        "CreateDate": "2024-05-31T19:06:48+00:00",
        "AssumeRolePolicyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": {
                        "AWS": "arn:aws:iam::YYYYYYYYYYYY:root"
                    },
                    "Action": "sts:AssumeRole",
                    "Condition": {
                        "StringLike": {
                            "sts:ExternalId": "arn:aws:fis:us-east-2:YYYYYYYYYYYY:experiment/*"
                        },
                        "ArnEquals": {
                            "aws:PrincipalArn": "arn:aws:iam::YYYYYYYYYYYY:role/FISOrchestration_ExecutionRole"
                        }
                    }
                }
            ]
        }
    }
}
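If you manage several target accounts, generating this trust policy from a few inputs can reduce copy-and-paste mistakes. The following is a minimal sketch that rebuilds the policy document shown above; the function name build_target_trust_policy and its parameters are our own illustration and are not part of any AWS SDK.

```python
import json


def build_target_trust_policy(orchestrator_account_id, orchestrator_role_name, region):
    """Return a trust policy dict mirroring the JSON document shown above.

    Hypothetical helper for illustration only; it performs no AWS calls.
    """
    account_root = f"arn:aws:iam::{orchestrator_account_id}:root"
    experiment_pattern = f"arn:aws:fis:{region}:{orchestrator_account_id}:experiment/*"
    execution_role_arn = (
        f"arn:aws:iam::{orchestrator_account_id}:role/{orchestrator_role_name}"
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": account_root},
                "Action": "sts:AssumeRole",
                "Condition": {
                    # Only sessions started for FIS experiments in this Region.
                    "StringLike": {"sts:ExternalId": experiment_pattern},
                    # Only the named execution role may assume the target role.
                    "ArnEquals": {"aws:PrincipalArn": execution_role_arn},
                },
            }
        ],
    }


policy = build_target_trust_policy(
    "YYYYYYYYYYYY", "FISOrchestration_ExecutionRole", "us-east-2"
)
print(json.dumps(policy, indent=4))
```

The printed document can be pasted into the role's trust relationship or passed to whatever tooling you use to create IAM roles.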

A target role in the orchestrator account, AWS-FIS-Experiment-App1-Target, that is used to inject the fault actions. Note: In this scenario the role is in the same account as the workload; ensure its permissions match those specified in the documentation referenced above.

{
    "Role": {
        "Path": "/",
        "RoleName": "AWS-FIS-Experiment-App1-Target",
        "RoleId": "AROA5L7L4GFHC5IZTT5SW",
        "Arn": "arn:aws:iam::YYYYYYYYYYYY:role/AWS-FIS-Experiment-App1-Target",
        "CreateDate": "2024-10-09T15:20:17+00:00",
        "AssumeRolePolicyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Sid": "Statement1",
                    "Effect": "Allow",
                    "Principal": {
                        "AWS": "arn:aws:iam::YYYYYYYYYYYY:role/AWS-FIS-Experiment-App1-Target"
                    },
                    "Action": "sts:AssumeRole"
                }
            ]
        }
    }
}

An orchestrator role, AWS-FIS-Experiment-App1-Orchestrator, which the FIS service assumes (see the trust policy below) and which in turn assumes all roles specified in the target configuration. Note: In this setup, the orchestrator role resides in the same account as the workload and assumes both AWS-FIS-Experiment-App1-Target and AWS-FIS-Experiment-TGW-Target. You can easily adapt this to a centralized strategy as discussed in part two.

{
    "Role": {
        "Path": "/",
        "RoleName": "AWS-FIS-Experiment-App1-Orchestrator",
        "RoleId": "AROA5L7L4GFHIZL4PEG46",
        "Arn": "arn:aws:iam::YYYYYYYYYYYY:role/AWS-FIS-Experiment-App1-Orchestrator",
        "CreateDate": "2024-05-31T19:51:16+00:00",
        "AssumeRolePolicyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": {
                        "Service": "fis.amazonaws.com"
                    },
                    "Action": "sts:AssumeRole"
                }
            ]
        }
    }
}
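With the three roles in place, each target role is registered against the experiment template as a target account configuration. The sketch below only builds the request parameters; it assumes a template that was created with multi-account targeting, and the template ID placeholder, the descriptions, and the helper name build_target_account_configs are hypothetical. The keys follow the parameter names of the FIS CreateTargetAccountConfiguration API as we understand them, so verify them against the current API reference before use.

```python
def build_target_account_configs(template_id, targets):
    """Build one request dict per target account.

    targets is a list of (account_id, role_name, description) tuples.
    Hypothetical helper for illustration only; it performs no AWS calls.
    """
    return [
        {
            "experimentTemplateId": template_id,
            "accountId": account_id,
            "roleArn": f"arn:aws:iam::{account_id}:role/{role_name}",
            "description": description,
        }
        for account_id, role_name, description in targets
    ]


configs = build_target_account_configs(
    "EXT-TEMPLATE-ID-PLACEHOLDER",  # hypothetical placeholder, not a real ID
    [
        ("XXXXXXXXXXXX", "AWS-FIS-Experiment-TGW-Target", "Networking account (TGW)"),
        ("YYYYYYYYYYYY", "AWS-FIS-Experiment-App1-Target", "Application account"),
    ],
)

for config in configs:
    print(config["accountId"], "->", config["roleArn"])

# With boto3 installed and credentials configured, each dict could be sent as:
#   boto3.client("fis").create_target_account_configuration(**config)
```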

It’s important to note that you’ll need to add a trust policy allowing the Orchestration Execution role to assume the roles in the networking account. This is crucial for successful operation.
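Cross-account role assumption needs two halves: the trust policy on the target role, and an identity policy on the assuming role that allows sts:AssumeRole on that target. The post shows only the first half, so the following is a minimal sketch of what the second half could look like for the orchestrator role. The helper name build_assume_role_policy is hypothetical, and the actual permissions in your environment should follow the FIS documentation.

```python
import json


def build_assume_role_policy(target_role_arns):
    """Return an identity policy allowing sts:AssumeRole on the given roles.

    Hypothetical helper for illustration only; it performs no AWS calls.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sts:AssumeRole",
                # Sorted so the generated document is stable across runs.
                "Resource": sorted(target_role_arns),
            }
        ],
    }


orchestrator_policy = build_assume_role_policy(
    [
        "arn:aws:iam::YYYYYYYYYYYY:role/AWS-FIS-Experiment-App1-Target",
        "arn:aws:iam::XXXXXXXXXXXX:role/AWS-FIS-Experiment-TGW-Target",
    ]
)
print(json.dumps(orchestrator_policy, indent=4))
```

Listing the target role ARNs explicitly, rather than using a wildcard, keeps the orchestrator limited to the roles named in the target configuration.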

