As container-based applications grow increasingly common in contemporary cloud infrastructures, Amazon Web Services (AWS) Fargate offers a serverless compute engine for containers, removing the need for server management. To enhance the resilience of these containerized applications, organizations require the ability to simulate network disruptions and partitioning scenarios, especially in multi-Availability Zone (AZ) setups. This capability is crucial for assessing application behavior during network impairments and testing failover strategies.
Chaos Engineering is a recognized best practice for evaluating application resilience. It involves intentionally introducing controlled disruptions to verify how applications respond to real-world challenges, such as increased latency from dependencies or service events. AWS Fault Injection Service (AWS FIS) facilitates the creation of controlled failure conditions, enabling organizations to observe how applications and disaster recovery measures react to disruptions. This forward-thinking approach helps organizations identify vulnerabilities before they affect users, enhance recovery processes, and develop more robust, fault-tolerant systems.
AWS users across various sectors have indicated a demand for network chaos experiments. Companies in financial services and media have specifically requested capabilities to test application resilience under different network conditions, particularly the ability to inject latency between Fargate applications and their dependencies. Enterprises in utilities and transportation also need to simulate specific network faults, such as packet loss and port blackholing, to validate their disaster recovery procedures. These controlled experiments empower teams to gain confidence that their systems can endure unexpected disruptions and maintain service continuity even in challenging network environments.
Network Actions for Fargate
In December 2024, AWS FIS expanded its functionalities for Amazon Elastic Container Service (Amazon ECS) by introducing network fault injection experiments for Fargate tasks. This enhancement complements existing resource-level actions—such as CPU stress, I/O stress, and process termination—with three new network-centric actions: network latency, network blackhole, and network packet loss.
With a total of six action types available for both Amazon Elastic Compute Cloud (Amazon EC2) and Fargate launch types, AWS FIS facilitates more comprehensive chaos engineering experiments at the container level. AWS FIS simulates various network conditions and failures without necessitating code changes or additional infrastructure, thus empowering you to conduct thorough resilience testing of Amazon ECS on Fargate applications against a broader spectrum of potential network issues.
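To make this concrete, the new actions plug into an AWS FIS experiment template like any other action type. The following is a minimal, illustrative sketch of the actions section for a latency injection against Fargate tasks. The action ID comes from the FIS action reference; the action name, target name, duration, and delay values are placeholders to adapt, and the Fargate network actions accept further parameters (for example, for scoping traffic sources), so check the current FIS documentation for the complete list.

"actions": {
  "inject-network-latency": {
    "actionId": "aws:ecs:task-network-latency",
    "description": "Add 200 ms of latency to task network traffic for 5 minutes",
    "parameters": {
      "duration": "PT5M",
      "delayMilliseconds": "200"
    },
    "targets": {
      "Tasks": "fargate-app-tasks"
    }
  }
}

Once the targets, IAM role, and stop conditions are filled in, a template like this can be created with aws fis create-experiment-template --cli-input-json file://template.json.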
This article illustrates how these actions work and provides practical examples for implementing network resilience experiments against Amazon ECS on Fargate applications and their dependencies. A sample application lets you follow along with the experiments in your own environment and shows how you can use AWS FIS to proactively identify and mitigate potential weaknesses in Fargate applications, ultimately improving their resilience. The code for this demo is available in the sample-network-faults-on-ECS-Fargate-with-FIS GitHub repository.
Sample Application: Showcasing Network Resilience Testing
To illustrate these new network fault injection capabilities, the demo uses an Amazon ECS on Fargate application that relies on Amazon Relational Database Service (Amazon RDS). The application follows a three-tier architecture: Amazon API Gateway serves as the entry point for all client requests and routes them through an internal Network Load Balancer (NLB) to a cluster of containerized applications running on Amazon ECS Fargate. The application tier in turn interacts with an Amazon RDS for MySQL database, as depicted in the accompanying diagram.
Prerequisites
To deploy this sample application in your AWS account, ensure you have the following prerequisites:
- Node.js 18 or later
- AWS Command Line Interface (AWS CLI) configured with appropriate credentials
- Docker installed
- AWS CDK CLI installed (npm install -g aws-cdk)
Deploy the Application with AWS CDK
- Clone the repository:
git clone <repository-url>
cd <repository-name>/cdk
- Bootstrap your AWS environment for the CDK (if you have not already done so):
cdk bootstrap
- Install dependencies:
npm ci
- Deploy the stack:
cdk deploy --all
This process provisions all necessary resources, including the ECS cluster, Fargate tasks, RDS database, API Gateway, and monitoring components.
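Before moving on, you can optionally confirm that the Fargate tasks are up. The following sketch uses the AWS CLI; the cluster name is a placeholder that you can copy from the stack outputs or the Amazon ECS console:

aws ecs list-tasks --cluster <your-cluster-name> --desired-status RUNNING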
Experiment Planning and Execution
To conduct the chaos experiment, follow the steps outlined in the Principles of Chaos Engineering:
- Define ‘steady state’ as a measurable output of the system that indicates normal behavior.
- Hypothesize that this steady state persists when failures are injected.
- Execute the experiment using AWS FIS and inject failures into the application.
- Review application behavior and validate that the application exhibits resilience during failures.
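As a reference for the execution step, experiments are started from an experiment template, either in the AWS FIS console or with the AWS CLI. A minimal sketch, assuming you have already created a template (for example, one containing the latency action sketched earlier) and noted its ID:

aws fis start-experiment --experiment-template-id <template-id>
aws fis get-experiment --id <experiment-id>

The second command reports the state of a running experiment, which is useful for confirming that fault injection has actually begun before you start validating application behavior.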
Walkthrough
In the following sections, we walk through each step, starting with defining the application’s steady state behavior and the metrics used to measure it.
Establishing the Steady State of Application Performance
To effectively monitor the experiments, it is crucial to understand the steady state behavior of the application under normal conditions. The sample application previously installed includes a robust Amazon CloudWatch dashboard that provides visibility into key metrics across the application stack. This dashboard features charts tracking end-to-end request latency at both the API Gateway and Amazon RDS levels. The metrics indicating response times between Amazon ECS Fargate tasks and the RDS database are particularly significant for the experiment, as network faults are injected.
Before introducing network faults, it is vital to comprehend your application’s normal behavior. The sample environment includes a load generation script that creates synthetic traffic—100 requests per second for a duration of ten minutes—simulating real-world usage patterns. Execute this script to establish performance baselines before conducting chaos experiments.
- Install load generation dependencies:
cd ../load-gen
npm install -g artillery@latest
npm ci
- Set the AWS Region:
export AWS_REGION=$(aws configure get region)
- Retrieve the API endpoint from the application stack outputs and set it as an environment variable (one way to look it up with the AWS CLI is sketched after this list):
export API_URL='<endpoint from the NetworkFaultsAppStack output>'
- Execute the load test:
artillery run load-gen-config.yaml
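One way to look up the API endpoint referenced above is to list the stack outputs with the AWS CLI. This sketch assumes the default stack name used by the repository and simply prints all outputs so you can copy the endpoint value:

aws cloudformation describe-stacks --stack-name NetworkFaultsAppStack --query "Stacks[0].Outputs" --output table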
You should have the observability mechanisms in place to monitor the application’s behavior from both user and internal infrastructure perspectives. Under normal load conditions, the dashboard displays consistent patterns: API Gateway latency ranges between 29 and 33 ms, while database query latency (the time taken by Amazon RDS) fluctuates between 8 and 11 ms. During this time window, end-to-end p99 latency observed by users remains steady, indicating the application’s performance stability.
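If you prefer the AWS CLI to the dashboard, the same latency figures can be pulled from CloudWatch directly. The following sketch queries p99 latency on the API Gateway side, assuming a REST API reporting to the AWS/ApiGateway namespace with the ApiName dimension; the API name and time window are placeholders:

aws cloudwatch get-metric-statistics \
  --namespace AWS/ApiGateway \
  --metric-name Latency \
  --dimensions Name=ApiName,Value=<your-api-name> \
  --start-time <start-time> \
  --end-time <end-time> \
  --period 60 \
  --extended-statistics p99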