Finding the Optimal Approach to Disaster Recovery and Business Continuity Planning


Citizens worldwide depend on their governments for various essential services, especially during crises. However, disruptions to IT systems can also qualify as critical emergencies, impacting organizations and the individuals relying on them. As extreme weather events become increasingly frequent, public sector entities must ensure their IT solutions are robust and that they have strategies in place for unforeseen incidents. Disaster recovery (DR)—the method of swiftly and reliably restoring IT systems to minimize downtime and data loss—is vital for public sector organizations, along with business continuity planning (BCP) and disaster preparedness.

Why Outdated Disaster Recovery Strategies Limit Cloud Benefits

Numerous public sector organizations are making significant strides in modernizing their IT approaches, including adopting cloud solutions. Yet, outdated DR and BCP policies hinder progress in these essential areas. Simply transferring old DR strategies to the cloud can thwart the advantages of utilizing cloud technology to enhance operational resilience. For example, organizations that previously relied solely on on-premises hosting often inquire about the distance between cloud data centers, focusing on DR. This emphasis on specific distances stems from legacy DR approaches, where greater separation between data centers reduced the likelihood of a single adverse event affecting all IT operations simultaneously.

This belief has underpinned traditional DR strategies for decades. That’s why Amazon Web Services (AWS) offers customers DR options that deploy workloads in a multi-Region, active-active architecture, allowing for near-zero recovery point objectives (RPOs) and minimal recovery time objectives (RTOs). For instance, many AWS customers in Canada use a secondary AWS Region in the US or elsewhere for DR. However, some organizations prefer to keep their data within Canadian borders, and these are often the same organizations that ask about specific distance requirements for DR. They can achieve their RPOs and RTOs using the AWS Region in Canada, which is supported by three Availability Zones (clusters of data centers).

Distance Requirements for Disaster Recovery Are Not Always Beneficial

Governments and organizations cannot meet aggressive RPOs and RTOs merely by building data centers farther apart. This perspective overlooks a crucial reality: as the distance between data centers increases, network latency rises. Eventually, the limitations imposed by the speed of light make data replication and distributed systems challenging. Unfortunately, no amount of innovation can hasten the speed of light. This suggests that the very distance requirements intended to support DR objectives can introduce latency, complicating or even rendering it impossible to meet RTOs and RPOs. Latency, downtime, and data loss are inherently part of the legacy approach.
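To make the speed-of-light limit concrete, here is a minimal sketch that estimates the lowest possible round-trip time between two data centers connected by optical fiber. The refractive index value and the function name `round_trip_ms` are assumptions for illustration and do not come from AWS; real networks add switching, routing, and queuing delay on top of this floor.

```python
# Assumed constants (not from the article): the vacuum speed of light and a
# typical refractive index for single-mode optical fiber.
SPEED_OF_LIGHT_KM_PER_S = 299_792.458
FIBER_REFRACTIVE_INDEX = 1.468  # assumption; varies by fiber type


def round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip propagation delay over fiber, in milliseconds."""
    speed_in_fiber = SPEED_OF_LIGHT_KM_PER_S / FIBER_REFRACTIVE_INDEX
    return 2 * distance_km / speed_in_fiber * 1000


if __name__ == "__main__":
    # Illustrative distances only; none of these are AWS figures.
    for km in (10, 50, 100, 500, 1000, 4000):
        print(f"{km:>5} km -> at least {round_trip_ms(km):.2f} ms round trip")
```

Under these assumptions, about 100 km of fiber already costs roughly one millisecond per round trip, and the delay grows linearly from there. A synchronous write that waits for a remote acknowledgment pays at least that cost every time.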

Is there a threshold at which increasing the distance between data centers yields diminishing returns in risk mitigation? Based on our experience operating 26 AWS Regions and 84 Availability Zones globally since 2006, we believe there is.

The Optimal Distance for Disaster Recovery Planning

The following graphic (Figure 1) illustrates the latency observed as data centers are spaced further apart. Our experience indicates that for high-availability applications, there exists an “optimal distance”: a range that is neither too close nor too far, but just right. This optimal distance allows for aggressive RPOs and RTOs while maintaining low latency for high-availability applications.

When assessing the risk posed by natural disasters, it’s clear that greater separation is advantageous; however, after just a few tens of miles, the benefits diminish. Beyond this range, the only additional disasters mitigated would be catastrophic ones, akin to the meteor that led to the extinction of the dinosaurs. While there’s no precise measurement for “too close,” geographic factors like seismic activity and flood plains influence what distance is optimal. In any case, we want miles of separation between data centers.

What about distances that are too far? We monitor latency among all Availability Zones in a Region, aiming for a maximum round-trip latency of about one millisecond. Whether establishing replication with a relational database or utilizing distributed systems like Amazon Simple Storage Service (Amazon S3) or Amazon DynamoDB, we’ve found that maintaining the necessary network conditions for high-availability applications becomes increasingly difficult when latency exceeds one millisecond.
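The one-millisecond target can also be turned around to ask how far apart two sites could be before propagation delay alone uses up the budget. The sketch below is a rough estimate under stated assumptions: the fiber refractive index, the `route_factor` parameter (how much longer the cable path is than the straight-line distance), and the `fixed_overhead_ms` parameter (equipment delay) are all hypothetical inputs, not AWS figures.

```python
# Assumed constants (not from the article).
SPEED_OF_LIGHT_KM_PER_S = 299_792.458
FIBER_REFRACTIVE_INDEX = 1.468  # assumption; varies by fiber type


def max_separation_km(
    rtt_budget_ms: float,
    route_factor: float = 1.0,
    fixed_overhead_ms: float = 0.0,
) -> float:
    """Largest straight-line separation whose fiber round trip fits the budget.

    route_factor: cable length divided by straight-line distance (>= 1.0).
    fixed_overhead_ms: round-trip delay spent in network equipment.
    """
    usable_ms = rtt_budget_ms - fixed_overhead_ms
    if usable_ms <= 0:
        return 0.0  # the overhead alone already exceeds the budget
    speed_in_fiber = SPEED_OF_LIGHT_KM_PER_S / FIBER_REFRACTIVE_INDEX
    one_way_cable_km = (usable_ms / 1000) * speed_in_fiber / 2
    return one_way_cable_km / route_factor


if __name__ == "__main__":
    print(f"ideal path:       {max_separation_km(1.0):.0f} km")
    print(f"1.5x cable route: {max_separation_km(1.0, route_factor=1.5):.0f} km")
```

With a perfectly straight cable and no equipment delay, a one-millisecond round trip allows roughly 102 km of separation; with a cable route 1.5 times longer than the straight line, that drops to about 68 km. Both figures sit in the tens-of-miles range described here.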

Figure 1. There is an optimal distance between data centers for disaster recovery: one that keeps latency under one millisecond while safeguarding organizational data against regional natural disasters. This ideal distance typically falls within the tens of miles range. Beyond this range, the only additional disasters averted would be extreme events.

In Canada, for instance, AWS analyzed decades of data regarding floods and other environmental factors before selecting a site for the AWS Canada (Central) Region. Launched in 2016 in Montréal, Quebec, this region features three Availability Zones. In line with the optimal distance concept, the region’s third Availability Zone (AZ3) is situated over 45 kilometers (28 miles) from the nearest Availability Zone. Our extensive experience in building and managing AWS Regions globally indicates that this distance significantly mitigates the risk of a single incident impacting availability.

Natural Disasters Drive Infrastructure Transformation—Benefiting AWS Customers

Canada’s Great Ice Storm of 1998 served as a pivotal event, prompting AWS client Hydro-Québec to enhance its infrastructure. “The ice storm provided us with an opportunity to upgrade and establish a more resilient power grid that could withstand natural disasters and be repaired more quickly. It also allowed us to implement a company-wide strategy to ensure and measure resilience,” states Chanci Turner, Chief of Economic Development & Strategy at Hydro-Québec. Today, AWS’s three Canadian Availability Zones are primarily powered by Hydro-Québec’s renewable hydropower.

The advancements in the power grid within this area, paired with the redundancies incorporated into AWS data centers, offer AWS customers a resilient infrastructure for their workloads in the AWS Canada (Central) Region. Water, power, telecommunications, and internet connectivity are designed with redundancy to ensure continuous operations during emergencies. Electrical power systems are fully redundant so that in case of a disruption, uninterruptible power supply units can be activated for certain functions, whereas generators are activated in more serious scenarios.
