Disaster Recovery Architecture on AWS: Multi-site Active/Active Approach

In the previous installments of this series, we explored various disaster recovery (DR) strategies. This post will focus on the implementation of a multi-site active/active strategy, which enables your workload to remain operational across two or more distinct locations. This approach ensures high availability in the face of disasters, whether caused by natural events, technical failures, or human error.

Understanding Multi-site Active/Active Strategy

As illustrated in the DR strategies diagram, the multi-site active/active configuration provides the lowest recovery time objective (RTO) and recovery point objective (RPO) of the strategies covered in this series. However, organizations must weigh these benefits against the cost and complexity of running full production stacks in more than one site.

Implementing Multi-site Active/Active

The architecture depicted uses AWS Regions as the active sites, forming a multi-Region active/active setup. Only two Regions are shown, but more can be added. Each Region hosts a highly available workload stack spread across multiple Availability Zones (AZs). Data is replicated between the Regional data stores as it is written, and it is also backed up. Replication alone would copy a deletion or corruption to every Region, so the backups are what protect against that kind of data loss.

Traffic Routing

Each Regional stack serves production traffic, and the routing method determines which Region receives a given request. Amazon Route 53 provides DNS for this purpose and offers several routing policies. Geolocation and latency-based routing are both suitable for active/active deployments. Geolocation routing directs requests based on where they originate, while latency routing sends each request to the Region that offers the lowest network latency for the requester.

Your data governance strategy influences your choice of routing policy. Geolocation routing allows for predictable request distribution, which can be essential for compliance and data residency requirements. In contrast, latency routing is ideal for optimizing performance.
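To make the two policies concrete, here is a minimal sketch that builds Route 53 record sets for each one as plain Python dictionaries. The hostname, IP addresses, Region choices, and helper names are assumptions for illustration only. The field names follow the shape of a Route 53 ChangeBatch, and nothing is sent to AWS.

```python
# Sketch only: builds Route 53 ChangeBatch payloads without calling AWS.
# Hostname, IPs, and helper names are hypothetical.

def latency_record(name, region, ip, ttl=60):
    """Latency policy: Route 53 answers with the record whose Region
    has the lowest latency for the requester."""
    return {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"latency-{region}",  # must be unique per record
        "Region": region,
        "TTL": ttl,
        "ResourceRecords": [{"Value": ip}],
    }

def geolocation_record(name, location, label, ip, ttl=60):
    """Geolocation policy: Route 53 answers based on where the query originates."""
    return {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"geo-{label}",
        "GeoLocation": location,
        "TTL": ttl,
        "ResourceRecords": [{"Value": ip}],
    }

def change_batch(records):
    """Wrap record sets in UPSERT changes."""
    return {"Changes": [{"Action": "UPSERT", "ResourceRecordSet": r} for r in records]}

HOST = "app.example.com."

latency_batch = change_batch([
    latency_record(HOST, "us-east-1", "192.0.2.10"),
    latency_record(HOST, "eu-west-1", "198.51.100.10"),
])

geo_batch = change_batch([
    geolocation_record(HOST, {"ContinentCode": "NA"}, "north-america", "192.0.2.10"),
    geolocation_record(HOST, {"ContinentCode": "EU"}, "europe", "198.51.100.10"),
    # Default record for queries that match no other location.
    geolocation_record(HOST, {"CountryCode": "*"}, "default", "192.0.2.10"),
])
```

With boto3, a batch like this could be passed as the ChangeBatch argument of the Route 53 client's change_resource_record_sets call, together with your hosted zone ID. The geolocation variant makes the mapping from origin to Region explicit, which is what makes it easier to reason about for data residency.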

Read/Write Patterns

Read-local/Write-local Pattern: In this pattern, the Region that receives a request handles both its reads and its writes. With Amazon DynamoDB, for example, global tables replicate each write to the other Regions, so a write accepted in one Region soon becomes readable in the others. The trade-off is write contention: if the same item is updated in two Regions at about the same time, DynamoDB global tables reconcile the conflict with a last-writer-wins rule, and the earlier update is overwritten.
Read-local/Write-global Pattern: In this pattern, one designated Region accepts all writes, while every Region serves reads locally. Because writes go through a single Region, updates from different Regions cannot conflict. DynamoDB global tables can support this if the application sends writes only to the designated Region. Amazon Aurora global database is another option: with write forwarding, secondary clusters forward write statements to the primary cluster in the primary Region.
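Read-local/Write-partitioned Pattern: Suited to write-heavy workloads, this pattern assigns each record a home Region, for example based on its partition key, and sends all writes for that record there. This keeps write latency low for the record's usual users and avoids cross-Region contention. DynamoDB global tables can accommodate this pattern because they accept writes in every Region, but the application is responsible for routing each write to the record's home Region.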
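The write contention risk in the write-local pattern is easier to see with a small simulation. The sketch below is a simplified model of last-writer-wins reconciliation, not DynamoDB's internal implementation. The Write class, the timestamps, and the tie-break on Region name are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Write:
    region: str        # Region that accepted the write
    value: str         # new value for the item
    timestamp_ms: int  # time the write was accepted (hypothetical clock)

def last_writer_wins(a: Write, b: Write) -> Write:
    """Return the write that survives when two Regions update the same item.

    The later timestamp wins. The Region name is used as a tie-breaker
    (an assumption of this sketch) so that every replica picks the same winner.
    """
    return max(a, b, key=lambda w: (w.timestamp_ms, w.region))

# Two Regions update the same order 5 ms apart.
us_write = Write("us-east-1", "status=shipped", 1_000)
eu_write = Write("eu-west-1", "status=cancelled", 1_005)

winner = last_writer_wins(us_write, eu_write)
print(f"surviving value: {winner.value} (from {winner.region})")
```

Both writes were acknowledged to their callers, yet only one survives. If that outcome is unacceptable for a given item, the write-global or write-partitioned pattern avoids it by ensuring only one Region writes that item.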

Failover Mechanism

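In a multi-Region active/active strategy, if one Region becomes unavailable, traffic can be rerouted to the healthy Regions. With Route 53, this means changing which records are returned, either by updating them or by letting health checks take the failed Region's records out of rotation. Set a low TTL on these records so that resolvers pick up the change quickly. Alternatively, AWS Global Accelerator gives clients static IP addresses and shifts traffic to healthy endpoints without a DNS change, so failover does not wait on cached DNS answers, and traffic travels over the AWS network for better performance.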
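The effect of the TTL can be estimated with simple arithmetic. The sketch below is a rough approximation that assumes failover is driven by a health check: detection takes about the check interval multiplied by the number of consecutive failures required, and clients may then keep using a cached answer for up to one TTL. The function name and the example values are assumptions, and real behavior also depends on resolvers that may not honor TTLs.

```python
def estimated_failover_seconds(check_interval_s: int,
                               failure_threshold: int,
                               record_ttl_s: int) -> int:
    """Rough upper estimate of how long clients may keep hitting a failed Region."""
    detection = check_interval_s * failure_threshold  # time to mark endpoint unhealthy
    return detection + record_ttl_s                   # plus worst-case cached DNS answer

# 30 s checks, 3 failures required, 60 s TTL
standard = estimated_failover_seconds(30, 3, 60)
# 10 s checks, same threshold and TTL
fast = estimated_failover_seconds(10, 3, 60)
# Same checks as the first case, but a 300 s TTL
high_ttl = estimated_failover_seconds(30, 3, 300)

print(standard, fast, high_ttl)  # prints: 150 90 390
```

In this model, raising the TTL from 60 to 300 seconds adds four minutes to the worst case, which is why a low TTL matters when DNS is the failover mechanism.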

When adopting a write-global pattern, if the primary write Region fails, another must be promoted. In contrast, the write-partitioned pattern requires repartitioning to assign records to alternate Regions. The write-local pattern allows any Region to process writes, thus enabling rapid recovery with minimal adjustments.
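For the write-partitioned case, one way to limit how much repartitioning a failure causes is rendezvous (highest-random-weight) hashing. This is a technique chosen for this sketch, not something the architecture prescribes. Each record key is scored against every healthy Region, and the highest score wins, so only records whose home Region failed are reassigned. The Region list, key format, and function names are assumptions.

```python
import hashlib

ALL_REGIONS = ["us-east-1", "eu-west-1", "ap-southeast-2"]  # hypothetical set

def _score(record_key: str, region: str) -> int:
    """Deterministic pseudo-random weight for a (key, Region) pair."""
    digest = hashlib.sha256(f"{region}:{record_key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def home_region(record_key: str, healthy_regions: list[str]) -> str:
    """Pick the healthy Region with the highest weight for this key."""
    if not healthy_regions:
        raise ValueError("no healthy Region available")
    return max(healthy_regions, key=lambda r: _score(record_key, r))

def repartition(keys, all_regions, failed_region):
    """Return (before, after) home-Region maps when one Region fails."""
    healthy = [r for r in all_regions if r != failed_region]
    before = {k: home_region(k, all_regions) for k in keys}
    after = {k: home_region(k, healthy) for k in keys}
    return before, after

keys = [f"user-{i}" for i in range(1000)]
before, after = repartition(keys, ALL_REGIONS, "eu-west-1")
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} records changed home Region")
```

Records homed in a surviving Region keep their assignment, so writes for them continue uninterrupted. A geography-based assignment, such as one keyed on the user's country, would need its own fallback mapping instead.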

Conclusion

The multi-site active/active strategy is optimal for workloads demanding rapid recovery times and minimal data loss. Deploying this strategy across multiple Regions offers significant separation and independence, which can be crucial for critical applications.