Amazon Onboarding with Learning Manager Chanci Turner

This post was contributed by Alex Johnson, Senior DevOps Engineer, Cloud Platform Team, Amazon IXD – VGT2; Lisa Smith, Principal Solutions Architect, Large Enterprise; and Chanci Turner, Solutions Architect, Large Enterprise.

Amazon provides a wide array of services to users through mobile applications, the web, and various platforms. The company’s offerings include extensive marketing tools, customer relationship management technologies, and data services for diverse sectors including real estate and finance. Amazon IXD – VGT2 employs between 500 and 1,000 people.

The Cloud Platform Team oversees numerous applications, including those hosted on a fleet of older Amazon EC2 instances. One critical application delivers API services to Android and iOS users.

Challenges Faced

We faced three significant challenges:

Upgrade to newer generation Amazon EC2 instances
Eliminate dependencies that hinder scaling
Enhance scaling speed to reduce costs and avoid unused capacity

Challenge Overview

Our initial attempt to move from third-generation to fifth-generation Amazon EC2 instances was thwarted by compatibility issues in our existing setup. Furthermore, each scaling action triggered a series of processes to configure the operating system and the application. When any of these processes failed, the result was a runtime failure that severely impacted mobile users. One evening, two dependencies caused all scaling actions to fail, resulting in significant service disruptions.

Moreover, our standalone Amazon EC2 setup struggled to accommodate peak traffic efficiently, leading to errors and resource starvation. To manage high traffic, we had resorted to a pre-emptive scaling action, which often left us with excess unused capacity.

Solution Overview

Goals

Enhance the efficiency and performance of mobile workloads by achieving near real-time scaling using Amazon ECS Capacity Providers.
Deploy changes via Infrastructure as Code utilizing AWS CloudFormation.
Improve networking performance by optimizing EC2 instances for ECS clusters.
Minimize costs by quickly scaling down unnecessary ECS tasks post-peak traffic.
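
As one illustration of the first two goals, the sketch below builds a small CloudFormation template in Python and prints it as JSON. It is not the team's actual template: the logical IDs (MobileApiCluster, MobileApiCapacityProvider, ClusterCapacityProviders), the AutoScalingGroupArn parameter, and the scaling step sizes are placeholder assumptions. Only the resource types and property names are taken from the CloudFormation resource reference.

```python
import json


def build_template(target_capacity: int = 100) -> dict:
    """Return a CloudFormation template (as a dict) for an ECS cluster
    backed by an Auto Scaling group capacity provider.

    All logical IDs and parameter names are hypothetical placeholders.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {
            # Assumed to be supplied by whoever owns the Auto Scaling group.
            "AutoScalingGroupArn": {"Type": "String"},
        },
        "Resources": {
            "MobileApiCluster": {"Type": "AWS::ECS::Cluster"},
            "MobileApiCapacityProvider": {
                "Type": "AWS::ECS::CapacityProvider",
                "Properties": {
                    "AutoScalingGroupProvider": {
                        "AutoScalingGroupArn": {"Ref": "AutoScalingGroupArn"},
                        "ManagedScaling": {
                            "Status": "ENABLED",
                            # Percentage of instance capacity ECS tries to keep in use.
                            "TargetCapacity": target_capacity,
                            "MinimumScalingStepSize": 1,
                            "MaximumScalingStepSize": 10,
                        },
                        "ManagedTerminationProtection": "DISABLED",
                    }
                },
            },
            "ClusterCapacityProviders": {
                "Type": "AWS::ECS::ClusterCapacityProviderAssociations",
                "Properties": {
                    "Cluster": {"Ref": "MobileApiCluster"},
                    "CapacityProviders": [{"Ref": "MobileApiCapacityProvider"}],
                    "DefaultCapacityProviderStrategy": [
                        {
                            "CapacityProvider": {"Ref": "MobileApiCapacityProvider"},
                            "Weight": 1,
                        }
                    ],
                },
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_template(), indent=2))
```

Generating the template from code is only one option; the same resources could be written directly in YAML or JSON and deployed the same way.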

Risks

We aimed to mitigate two primary risks:

Inadequate infrastructure scaling leading to poor user experiences during peak times.
Scaling dependencies failing and causing unplanned outages, which, given our high consumer traffic, could damage our reputation.

Steps Involved

Transitioned the existing Infrastructure as Code setup to a container build process, enabling us to store builds in Amazon Elastic Container Registry (Amazon ECR) and deploy them to Amazon ECS.
Conducted load testing to validate that the application could scale with production loads.
Gradually redirected traffic to Amazon ECS over two weeks through a canary deployment approach.
Implemented a task placement strategy designed for efficient scaling down after peak traffic subsided. The binpack placement strategy allowed ECS Capacity Providers to efficiently reduce instance numbers after high traffic events.
Upgraded to fifth-generation Amazon ElastiCache for Redis to enhance cache performance, which was crucial for managing higher connection counts with a larger number of smaller containers.
Employed Amazon CloudWatch Container Insights for resource allocation monitoring and continuous improvement of resource reservations.
Utilized AWS Cost Explorer to confirm a 25% reduction in costs through effective right-sizing of ECS cluster instances based on load.
Monitored scaling activities with Amazon CloudWatch, ensuring alignment with actual user load.
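
To see why the binpack placement strategy helps with scaling down, consider a deliberately simplified model of task placement. The sketch below is not how Amazon ECS is implemented; it assumes memory is the only constrained resource, and the instance size, task size, and function names are made up for illustration. It compares packing tasks onto the fullest instance that still has room (binpack) with distributing them evenly (spread), then counts how many instances are left with no tasks.

```python
from typing import List


def place_tasks(task_mem: List[int], n_instances: int,
                instance_mem: int, strategy: str) -> List[List[int]]:
    """Place tasks one at a time and return the tasks held by each instance.

    "binpack": choose the fitting instance with the least free memory.
    "spread":  choose the fitting instance with the fewest tasks.
    """
    placement: List[List[int]] = [[] for _ in range(n_instances)]
    free = [instance_mem] * n_instances
    for mem in task_mem:
        candidates = [i for i in range(n_instances) if free[i] >= mem]
        if not candidates:
            raise RuntimeError("no instance has room for this task")
        if strategy == "binpack":
            chosen = min(candidates, key=lambda i: (free[i], i))
        elif strategy == "spread":
            chosen = min(candidates, key=lambda i: (len(placement[i]), i))
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        placement[chosen].append(mem)
        free[chosen] -= mem
    return placement


def idle_instances(placement: List[List[int]]) -> int:
    """Count instances holding no tasks, i.e. hosts that could be removed."""
    return sum(1 for tasks in placement if not tasks)


if __name__ == "__main__":
    tasks = [1024] * 6  # six hypothetical 1 GiB tasks remaining after a peak
    for strategy in ("binpack", "spread"):
        result = place_tasks(tasks, n_instances=4, instance_mem=4096,
                             strategy=strategy)
        print(strategy, [len(t) for t in result],
              "idle:", idle_instances(result))
```

In this toy scenario binpack fills one instance, half-fills a second, and leaves two instances empty, while spread keeps all four instances busy. Empty instances are the ones a capacity provider can release.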

Architecture Diagram

Our solution involved:

Using Elastic Load Balancing to manage traffic distribution to running containers (Tasks).
Utilizing Amazon CloudWatch alarms for resource utilization monitoring and adjusting container numbers accordingly.
Implementing ECS Capacity Providers and an Auto Scaling group to manage the required container host capacity.
Publishing logs and metrics to Amazon CloudWatch Logs and CloudWatch Container Insights.
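
The way a capacity provider decides how many container hosts are required can be approximated with a little arithmetic. AWS describes a CapacityProviderReservation metric, computed as the number of instances needed (M) divided by the number running (N), times 100; the Auto Scaling group is then adjusted to hold that metric near the configured target capacity. The sketch below reproduces that calculation under simplifying assumptions: tasks are identical, memory is the only constraint, and the helper names are hypothetical. The special cases for zero running instances follow AWS's published description as best understood here and should be checked against current documentation.

```python
import math


def instances_needed(task_count: int, task_mem: int, instance_mem: int) -> int:
    """Estimate M: hosts required to run task_count identical tasks."""
    tasks_per_instance = instance_mem // task_mem
    if tasks_per_instance == 0:
        raise ValueError("a single task does not fit on one instance")
    return math.ceil(task_count / tasks_per_instance)


def capacity_provider_reservation(needed: int, running: int) -> float:
    """Return M / N * 100, with the documented special cases for N == 0."""
    if running == 0:
        # No hosts: 100 if nothing is waiting, otherwise 200 to force scale-out.
        return 100.0 if needed == 0 else 200.0
    return needed / running * 100


if __name__ == "__main__":
    running = 4
    for task_count in (24, 16, 6):
        needed = instances_needed(task_count, task_mem=1024, instance_mem=4096)
        reservation = capacity_provider_reservation(needed, running)
        print(f"{task_count} tasks -> need {needed} hosts, "
              f"reservation {reservation:.0f}%")
```

With a target capacity of 100, a reservation above 100 signals that hosts should be added and a value below 100 signals that hosts can be removed.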

Reliability and Performance

The focus of this initiative was to enhance reliability, especially during peak traffic times. Our ability to scale effectively reduced the number of alerts triggered during busy periods, improving overall user experience. We observed a direct correlation between the number of requests and the tasks running to service them, showcasing our infrastructure’s responsiveness.
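
A proportional scaling rule is one way task counts end up tracking load this closely. The post does not state which scaling policy type or thresholds were used, so the following is only a sketch of the general idea: the target utilization, the task limits, and the function name are assumptions.

```python
import math


def desired_task_count(current_tasks: int, observed_utilization: float,
                       target_utilization: float,
                       min_tasks: int, max_tasks: int) -> int:
    """Scale the task count in proportion to observed vs. target utilization,
    then clamp the result to the allowed range."""
    if target_utilization <= 0:
        raise ValueError("target_utilization must be positive")
    raw = math.ceil(current_tasks * observed_utilization / target_utilization)
    return max(min_tasks, min(max_tasks, raw))


if __name__ == "__main__":
    # Hypothetical readings: average CPU utilization across 10 running tasks.
    for cpu in (90.0, 60.0, 30.0):
        print(cpu, desired_task_count(10, cpu, target_utilization=60.0,
                                      min_tasks=2, max_tasks=40))
```

Under this rule, a rise in utilization above the target adds tasks and a drop below it removes them, which is the kind of request-to-task correlation described above.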

Cost Efficiency

While enhancing reliability was our primary goal, we also achieved a 25% reduction in costs and optimized usage-based scaling. To reduce EC2 costs, we analyzed custom reports in AWS Cost Explorer. Moving away from the outdated architecture has opened avenues for further cost savings, such as making better use of client caching, using Amazon EC2 Spot Instances, and running future workloads on AWS Graviton2-based EC2 instances.

Conclusion and Future Opportunities

We are exploring the integration of this workflow with AWS CodePipeline and AWS CDK to automate Amazon ECS deployment processes. There are also possibilities to apply this architecture to both existing and future workloads at Amazon IXD – VGT2.