Amazon Onboarding with Learning Manager Chanci Turner: How We Streamlined Operations and Cut Costs by Transitioning to AWS Transit Gateway


This article features insights from Chanci Turner, Learning Manager at Amazon, who specializes in optimizing onboarding processes, along with Emily Johnson, AWS Networking Solutions Architect.

Amazon, a leading global e-commerce platform, aims to enhance the customer experience through innovative software solutions. Our clientele spans over 160 countries, necessitating a robust network architecture to support our expansive operations. Last year, I shared a post detailing our hybrid transition from traditional data centers to the AWS Cloud and how we implemented an in-house network solution to connect numerous VPCs across various regions and our on-premises facilities. Dubbed “Hydra,” this project handled a large volume of dynamic VPN tunnels.

With the decommissioning of our data centers, Hydra continued to serve as the routing mechanism for packet transfers across VPCs until we discovered AWS Transit Gateway as a promising alternative. We had to wait for AWS Transit Gateway to enable peering across AWS Regions before we could initiate our transition planning. In this article, I will outline how Amazon migrated from Hydra to an AWS Transit Gateway-centric architecture.

Hydra Architecture

Prior to migration, our setup relied on EC2 instance routers utilizing a custom-built Ubuntu AMI. This image was equipped with open-source network daemons to construct a DMVPN network leveraging IKE, IPsec, OpenNHRP, and BIRD for BGP.

The following illustration provides a simplified overview of the system.

Figure 1: Original Hydra network architecture

Each spoke VPC housed EC2-based routers deployed in an Auto Scaling group for each Availability Zone. Managing route tables within the VPCs necessitated the development of “hydra-vpc-route-table-injector,” a Python daemon utilizing the boto3 AWS SDK.

At runtime, a Hydra router (HR) in each Availability Zone automatically updated the VPC route tables, directing private traffic (RFC1918 prefixes) to its own ENI. If a particular Hydra router failed, a distributed lock mechanism let another available router update the VPC route tables, providing automatic failover to a healthy Hydra router ENI.
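The injector's core behavior can be sketched with boto3. This is a simplified illustration; the function names and the split between planning and applying routes are hypothetical, not Amazon's actual daemon:

```python
RFC1918 = ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")

def plan_routes(route_table_ids, eni_id, prefixes=RFC1918):
    """Pure helper: one (route table, prefix, ENI) triple per injected route."""
    return [(rtb, cidr, eni_id) for rtb in route_table_ids for cidr in prefixes]

def inject_routes(plan):
    """Point each private prefix at the local Hydra router's ENI.

    replace_route fails if the route does not exist yet, so we fall
    back to create_route on first injection.
    """
    import boto3  # only needed when actually applying the plan
    from botocore.exceptions import ClientError
    ec2 = boto3.client("ec2")
    for rtb, cidr, eni in plan:
        try:
            ec2.replace_route(RouteTableId=rtb, DestinationCidrBlock=cidr,
                              NetworkInterfaceId=eni)
        except ClientError:
            ec2.create_route(RouteTableId=rtb, DestinationCidrBlock=cidr,
                             NetworkInterfaceId=eni)
```

On failover, the surviving router would run the same logic with its own ENI, replacing the failed router's routes.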

The diagram illustrates IPsec tunnels between Hydra routers and the DMVPN Hubs, as well as on-demand dynamic tunnels across Hydra routers in all Spoke VPCs. This configuration allowed for direct cross-Spoke VPC traffic without latency penalties, as opposed to routing through Hubs. To enhance performance and utilize the multiple tunnels available, we implemented ECMP and BGP Multipath for per-flow load balancing.
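Per-flow load balancing with ECMP can be pictured as hashing a flow's 5-tuple to choose one of the equal-cost tunnels, so every packet of a given flow takes the same path and avoids reordering. A minimal illustration of the idea (not Hydra's actual BIRD/kernel implementation):

```python
import hashlib

def pick_tunnel(src_ip, dst_ip, src_port, dst_port, proto, tunnels):
    """Hash the flow's 5-tuple so all of its packets use the same tunnel."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return tunnels[digest % len(tunnels)]
```

Different flows spread across the tunnels, while a single flow always maps to one of them.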

While this architecture provided flexible connectivity between VPCs and our on-premises networks for nearly three years, it had its challenges. Managing the multitude of EC2 Hydra Routers became increasingly complex, requiring vigilance over software vulnerabilities and frequent rotation of our fleet of immutable routers. We also had to establish different monitoring profiles based on the EC2 instance type used as a router, given that each had unique network capacities and limitations.

Migration

The launch of AWS Transit Gateway at AWS re:Invent in 2018 marked a significant milestone in our network evolution. This service enables the creation of regional hub-and-spoke topologies and, more recently, supports peering across AWS Regions. It also accommodates IPv4/IPv6 dual-stack and multicast.

Our envisioned architecture included a transit gateway in each AWS Region, linking all local region VPCs. These transit gateways would peer across AWS Regions, replicating the full-mesh topology achieved with Hydra.

We identified several critical elements to address:

  1. Transitioning from Hydra to the transit gateway without downtime
  2. Automatic injection of VPC route tables with aggregated routes from the transit gateway
  3. Automatic route propagation across peered AWS Regions

After drafting an initial plan and conducting extensive tests in our staging environment, we devised a functional strategy to migrate from Hydra to AWS Transit Gateway, transitioning one VPC at a time while ensuring zero downtime.

Step 1 – Establish Transit Gateway Connectivity and Update Hydra Routing Policy

In this phase, we created one transit gateway for each of the eight AWS Regions our platform covered. Each VPC was connected to its regional transit gateway, and all VPC routes were propagated into the default transit gateway route table.
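Step 1 can be sketched with boto3 as follows. The helper names, VPC/subnet IDs, and the per-region loop are illustrative assumptions; the real provisioning pipeline is not shown in this article:

```python
def attachment_requests(tgw_id, vpcs):
    """Pure helper: one VPC-attachment request per VPC.

    `vpcs` maps VPC IDs to the subnet IDs used for the attachment ENIs.
    """
    return [{"TransitGatewayId": tgw_id, "VpcId": vpc_id, "SubnetIds": subnets}
            for vpc_id, subnets in sorted(vpcs.items())]

def build_regional_tgw(region, vpcs):
    """Create the regional transit gateway and attach every local VPC.

    Default route table propagation is enabled so each attachment's
    VPC CIDRs are propagated into the default TGW route table.
    """
    import boto3
    ec2 = boto3.client("ec2", region_name=region)
    tgw_id = ec2.create_transit_gateway(
        Description=f"tgw-{region}",
        Options={"DefaultRouteTableAssociation": "enable",
                 "DefaultRouteTablePropagation": "enable"},
    )["TransitGateway"]["TransitGatewayId"]
    for req in attachment_requests(tgw_id, vpcs):
        ec2.create_transit_gateway_vpc_attachment(**req)
    return tgw_id
```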

We modified our hydra-vpc-route-table-injector code to inject more specific prefixes (full routes) into the VPC route tables instead of the previously aggregated RFC1918 routes. This preparation was crucial for the upcoming cutover: by longest-prefix match, traffic would continue to prefer the Hydra routers (HR) while the transit gateway architecture was being put in place.
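The mechanism at work here is ordinary longest-prefix-match routing: a specific route pointing at a Hydra router's ENI wins over a broader aggregated route pointing at the transit gateway. The effect can be demonstrated with Python's ipaddress module:

```python
import ipaddress

def best_route(dst, routes):
    """Longest-prefix match: the most specific matching route wins,
    exactly how a VPC route table chooses between a specific Hydra
    route and an aggregated RFC1918 route."""
    addr = ipaddress.ip_address(dst)
    matches = [(net, target) for net, target in routes
               if addr in ipaddress.ip_network(net)]
    return max(matches, key=lambda m: ipaddress.ip_network(m[0]).prefixlen)[1]
```

With both `10.0.0.0/8 → tgw` and `10.42.0.0/16 → eni-hydra` installed, traffic to `10.42.x.x` keeps flowing through Hydra until the /16 is withdrawn.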

The diagram below illustrates the network after completing Step 1. For simplicity, we focus on two AWS Regions.

Figure 2: Network topology after completing Step 1

At this stage, we had established the transit gateway network framework for all our VPCs in eight AWS Regions, but traffic was still routed through the Hydra network for VPC communication.

Step 2 – VPC Route Tables Injection

As with the Hydra configuration, using transit gateways required static routes in each VPC route table for the private RFC1918 prefixes, this time targeting the regional transit gateway's ID. We implemented vpc-route-tables-injector as a Lambda function to automate this process.
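A minimal sketch of such a Lambda function is shown below. The event shape (`vpc_id`, `tgw_id`) and helper names are assumptions for illustration, not the article's actual code:

```python
RFC1918 = ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")

def routes_to_create(route_table_ids, tgw_id, prefixes=RFC1918):
    """Pure helper: one static route per (route table, private prefix)."""
    return [{"RouteTableId": rtb, "DestinationCidrBlock": cidr,
             "TransitGatewayId": tgw_id}
            for rtb in route_table_ids for cidr in prefixes]

def handler(event, context):
    """Lambda entry point: point every route table in the VPC at the
    regional transit gateway for the aggregated RFC1918 prefixes."""
    import boto3
    ec2 = boto3.client("ec2")
    rtbs = [rt["RouteTableId"] for rt in ec2.describe_route_tables(
        Filters=[{"Name": "vpc-id", "Values": [event["vpc_id"]]}]
    )["RouteTables"]]
    for route in routes_to_create(rtbs, event["tgw_id"]):
        ec2.create_route(**route)
```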

Figure 3: Network topology after completing Step 2

Step 3 – Transit Gateway Peering Routes Propagation

Currently, there is no native support for route propagation across peered transit gateways. Therefore, we developed an additional Lambda function to facilitate this feature. This function maps all transit gateway route table entries and creates static routes in the default route table for each peered transit gateway. We internally refer to this function as L-BGP – AWS Lambda-Based Gateway Protocol.
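The core of such a function can be sketched as a one-way sync: list the prefixes known in the peer region, and create a static route via the peering attachment for each prefix missing locally. Event fields and helper names here are hypothetical:

```python
def lbgp_sync(local_cidrs, peer_cidrs, local_rtb, peering_attachment_id):
    """Pure core of the 'L-BGP' Lambda: for every prefix active in the
    peer region but absent locally, emit a static-route request that
    points at the inter-Region peering attachment."""
    have = set(local_cidrs)
    return [{"TransitGatewayRouteTableId": local_rtb,
             "DestinationCidrBlock": cidr,
             "TransitGatewayAttachmentId": peering_attachment_id}
            for cidr in peer_cidrs if cidr not in have]

def handler(event, context):
    import boto3

    def active_cidrs(rtb_id, region):
        routes = boto3.client("ec2", region_name=region).search_transit_gateway_routes(
            TransitGatewayRouteTableId=rtb_id,
            Filters=[{"Name": "state", "Values": ["active"]}],
        )["Routes"]
        return [r["DestinationCidrBlock"] for r in routes]

    ec2 = boto3.client("ec2")
    requests = lbgp_sync(
        active_cidrs(event["local_rtb"], event["local_region"]),
        active_cidrs(event["peer_rtb"], event["peer_region"]),
        event["local_rtb"], event["peering_attachment_id"])
    for req in requests:
        ec2.create_transit_gateway_route(**req)
```

Running the function periodically in every region keeps the default route tables of all peered transit gateways converged, mimicking what BGP propagation would do natively.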

Figure 4: Network topology after completing Step 3

Pre-flight Check!

Upon completing the aforementioned steps, all traffic between spokes still utilized the Hydra network. This was due to the more specific routes injected by the Hydra routers daemon taking precedence over the aggregated RFC1918 routes introduced by the transit gateway Lambda function. We needed to eliminate the Hydra routes before transitioning traffic to the transit gateway.

Step 4 – Transition Traffic from Hydra Routers to Transit Gateways

At this point, we developed the capability to shift traffic from Hydra to the transit gateway, one VPC at a time, without disrupting higher-level nodes and services.

To migrate a VPC, we simply disabled the hydra-vpc-route-table-injector daemon and reconfigured BIRD BGP on all the Hydra routers in the target VPC to withdraw their own VPC prefix announcements.
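One way to model the per-VPC cutover on the route-table side is to delete the specific ENI routes the injector had installed, leaving the aggregated transit gateway routes as the longest match. This is a hypothetical sketch of that step, not the exact automation (the BIRD BGP withdrawal on the routers is not shown):

```python
def hydra_routes(route_table):
    """Pure helper: select the injected Hydra routes, i.e. those that
    point at an ENI rather than at the transit gateway or local gateway."""
    return [r["DestinationCidrBlock"] for r in route_table.get("Routes", [])
            if "NetworkInterfaceId" in r]

def cut_over_vpc(vpc_id):
    """Remove the specific ENI routes so longest-prefix match falls back
    to the aggregated RFC1918 routes via the transit gateway."""
    import boto3
    ec2 = boto3.client("ec2")
    tables = ec2.describe_route_tables(
        Filters=[{"Name": "vpc-id", "Values": [vpc_id]}])["RouteTables"]
    for rtb in tables:
        for cidr in hydra_routes(rtb):
            ec2.delete_route(RouteTableId=rtb["RouteTableId"],
                             DestinationCidrBlock=cidr)
```

Because the aggregated routes are already in place, the switch from ENI to transit gateway happens route by route with no window in which a destination is unreachable.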

These steps were automated, triggering a chain of events that ultimately directed both egress and ingress traffic through the transit gateways.


