Disaster Recovery Implementation for Amazon Redshift

Published on 27 JUN 2024

Categories: Amazon Redshift, Data Analytics, Best Practices, Intermediate (200), Technical How-to

Amazon Redshift is a fully managed, petabyte-scale cloud data warehousing solution that allows users to start with a few hundred gigabytes and scale up to petabytes of data. This service empowers businesses to leverage their data for insightful decision-making and customer engagement.

A well-structured disaster recovery plan is crucial to minimizing disruptions and ensuring swift recovery in the event of a catastrophe that impacts system functionality. Such plans also help organizations meet compliance standards for regulatory purposes by establishing a clear recovery roadmap.

This article outlines proactive measures that can be adopted to mitigate risks associated with unforeseen disruptions, thereby enhancing preparedness for recovering Amazon Redshift during a disaster. With features like automated snapshots and cross-Region replication, users can significantly bolster their disaster resilience with Amazon Redshift.

Disaster Recovery Planning

Disaster recovery planning involves two essential components:

Recovery Point Objective (RPO) – The maximum acceptable amount of time since the last data recovery point. It determines how much data loss is acceptable between the last recovery point and the service interruption.
Recovery Time Objective (RTO) – The maximum acceptable delay between the service interruption and the restoration of service. It determines how long the service can be unavailable.
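To make the two objectives concrete, the following minimal Python sketch compares a single incident against an RPO and an RTO. The function name, the timestamps, and the one-hour and two-hour targets are illustrative assumptions, not values from this article.

```python
from datetime import datetime, timedelta


def meets_objectives(last_recovery_point, outage_start, service_restored, rpo, rto):
    """Compare one incident against the RPO and RTO.

    Returns (data_loss_window, downtime, rpo_met, rto_met).
    """
    # Data written after the last recovery point is lost when the outage begins.
    data_loss_window = outage_start - last_recovery_point
    # Downtime runs from the interruption until service is restored.
    downtime = service_restored - outage_start
    return data_loss_window, downtime, data_loss_window <= rpo, downtime <= rto


if __name__ == "__main__":
    # Hypothetical incident: last recovery point at 08:00, outage at 09:30,
    # service back at 11:00, measured against a 1-hour RPO and a 2-hour RTO.
    loss, down, rpo_met, rto_met = meets_objectives(
        last_recovery_point=datetime(2024, 6, 27, 8, 0),
        outage_start=datetime(2024, 6, 27, 9, 30),
        service_restored=datetime(2024, 6, 27, 11, 0),
        rpo=timedelta(hours=1),
        rto=timedelta(hours=2),
    )
    print(f"data loss window={loss}, downtime={down}, RPO met={rpo_met}, RTO met={rto_met}")
```

In this hypothetical incident the 90-minute gap since the last recovery point misses the one-hour RPO, while the 90-minute outage stays within the two-hour RTO.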

To create an effective disaster recovery plan, follow these steps:

Establish recovery objectives for downtime and data loss (RTO and RPO) for both data and metadata, ensuring engagement from business stakeholders to determine appropriate goals.
Identify recovery strategies to achieve the established recovery objectives.
Create a fallback plan to return production to its original state.
Test the disaster recovery plan by simulating a failover event in a non-production environment.
Develop a communication strategy to inform stakeholders about downtime and its business impact.
Formulate a plan for progress updates, recovery, and service availability.
Document the entire disaster recovery process for future reference.

Disaster Recovery Strategies

Amazon Redshift, as a cloud-based data warehouse, comes equipped with several built-in recovery capabilities to handle unexpected outages and reduce downtime.

RA3 node types and Redshift Serverless store data in Redshift Managed Storage (RMS), which is backed by Amazon Simple Storage Service (Amazon S3), known for its high availability and durability.

In the subsequent sections, we will explore various failure scenarios and their corresponding recovery strategies.

Utilizing Backups

Backing up data is a vital aspect of data management. It safeguards against human errors, hardware malfunctions, virus attacks, power outages, and natural disasters.

Amazon Redshift supports two types of snapshots, automated and manual, both of which can be used for data recovery. Snapshots are point-in-time backups of the Redshift data warehouse. They are stored internally in RMS and transferred over an encrypted Secure Sockets Layer (SSL) connection.

Redshift provisioned clusters take automated snapshots, which are retained for 1 day by default; the retention period can be extended up to 35 days. A snapshot is taken after every 5 GB of data change per node or every 8 hours, whichever comes first, with a minimum interval of 15 minutes between snapshots. Because the threshold applies per node, the total data change across the cluster must exceed 5 GB times the number of nodes to trigger a snapshot. Custom snapshot schedules can also be defined, with frequencies between 1 and 24 hours. The retention period for automated snapshots can be changed using the AWS Management Console or the ModifyCluster API. Automated snapshots can be turned off by setting the retention period to 0, but this is not recommended. For further information, refer to the documentation on automated snapshots.
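As an illustration of changing the retention period through the ModifyCluster API, here is a minimal sketch that assumes the boto3 SDK for Python (this article does not prescribe an SDK). The helper names and the cluster identifier are hypothetical. The request builder is kept separate from the API call so that the validation can be checked without AWS credentials.

```python
def build_retention_request(cluster_id, retention_days):
    """Build ModifyCluster parameters for the automated snapshot retention period."""
    # This sketch deliberately rejects 0, which would turn automated snapshots off,
    # and anything above the 35-day maximum.
    if not 1 <= retention_days <= 35:
        raise ValueError("retention_days must be between 1 and 35")
    return {
        "ClusterIdentifier": cluster_id,
        "AutomatedSnapshotRetentionPeriod": retention_days,
    }


def apply_retention(cluster_id, retention_days, region_name=None):
    """Send the request to Amazon Redshift (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the builder above works without the SDK

    client = boto3.client("redshift", region_name=region_name)
    return client.modify_cluster(**build_retention_request(cluster_id, retention_days))


if __name__ == "__main__":
    # "analytics-prod" is a hypothetical cluster identifier.
    print(build_retention_request("analytics-prod", 14))
```

Calling apply_retention("analytics-prod", 14) would then submit the change; the new retention period applies to automated snapshots only, not to manual ones.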

Redshift Serverless automatically generates recovery points approximately every 30 minutes, retaining them for a default duration of 24 hours before automatic deletion. Users can convert recovery points into snapshots if longer retention is necessary.
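The conversion of a recovery point into a snapshot can be sketched as follows, assuming the boto3 redshift-serverless client. The helper names, namespace name, and snapshot name are hypothetical, and the sketch reads only the first page of recovery points.

```python
def latest_recovery_point(recovery_points):
    """Return the newest recovery point from a list of dicts, or None if empty."""
    if not recovery_points:
        return None
    return max(recovery_points, key=lambda point: point["recoveryPointCreateTime"])


def convert_latest_recovery_point(namespace_name, snapshot_name, retention_days,
                                  region_name=None):
    """Convert the newest recovery point of a namespace into a snapshot.

    Needs boto3 and AWS credentials. Pagination is ignored in this sketch.
    """
    import boto3  # imported here so the helper above works without the SDK

    client = boto3.client("redshift-serverless", region_name=region_name)
    response = client.list_recovery_points(namespaceName=namespace_name)
    newest = latest_recovery_point(response["recoveryPoints"])
    if newest is None:
        raise LookupError(f"no recovery points found for namespace {namespace_name}")
    return client.convert_recovery_point_to_snapshot(
        recoveryPointId=newest["recoveryPointId"],
        snapshotName=snapshot_name,
        retentionPeriod=retention_days,
    )
```

A call such as convert_latest_recovery_point("analytics-ns", "pre-migration", 30) would keep the most recent recovery point as a snapshot with a 30-day retention period, under the assumptions above.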

Both Redshift provisioned clusters and Redshift Serverless support manual snapshots, which can be taken on demand and retained indefinitely. Manual snapshots can help satisfy compliance requirements, but they incur storage charges, so delete them when they are no longer needed. For additional details, see the documentation on manual snapshots.
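Because manual snapshots are kept until someone deletes them, a periodic cleanup can help control storage costs. The sketch below assumes boto3 and a provisioned cluster; the helper names and the age threshold are hypothetical. It selects manual snapshots older than a given age and deletes them.

```python
from datetime import datetime, timedelta, timezone


def stale_manual_snapshots(snapshots, max_age, now=None):
    """Return identifiers of manual snapshots older than max_age.

    `snapshots` is a list of dicts shaped like the "Snapshots" entries
    returned by DescribeClusterSnapshots.
    """
    now = now or datetime.now(timezone.utc)
    return [
        snapshot["SnapshotIdentifier"]
        for snapshot in snapshots
        if snapshot.get("SnapshotType") == "manual"
        and now - snapshot["SnapshotCreateTime"] > max_age
    ]


def delete_stale_manual_snapshots(cluster_id, max_age, region_name=None):
    """Delete stale manual snapshots of one cluster (needs boto3 and credentials).

    Only the first page of results is read in this sketch.
    """
    import boto3  # imported here so the filter above works without the SDK

    client = boto3.client("redshift", region_name=region_name)
    response = client.describe_cluster_snapshots(
        ClusterIdentifier=cluster_id, SnapshotType="manual"
    )
    stale = stale_manual_snapshots(response["Snapshots"], max_age)
    for snapshot_id in stale:
        client.delete_cluster_snapshot(SnapshotIdentifier=snapshot_id)
    return stale
```

Review the list returned by stale_manual_snapshots before deleting anything that compliance rules may still require.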

Amazon Redshift also integrates with AWS Backup, which centralizes and automates data protection across AWS services, both in the cloud and on premises. With AWS Backup for Amazon Redshift, users can configure data protection policies and monitor activity for multiple Redshift provisioned clusters in one place. This consolidates and automates backup tasks that previously had to be managed separately, removing the need for manual processes. To learn more about configuring AWS Backup for Amazon Redshift, see the dedicated blog post.

Node Failure

A Redshift data warehouse consists of a collection of nodes. Amazon Redshift automatically detects and replaces any node that fails within your data warehouse cluster. The replacement node is made available immediately, and your most frequently accessed data is loaded from Amazon S3 first so that you can resume querying as quickly as possible.

A single-node cluster (not recommended for production use) holds only one copy of the data. If that node fails, AWS must restore the cluster from the most recent snapshot on Amazon S3, so the age of that snapshot becomes your RPO.

For production environments, we strongly advise using at least two nodes.
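One way to audit this recommendation is to list clusters that run on a single node. The sketch below assumes boto3; the helper names are hypothetical. It relies on the NumberOfNodes field returned by DescribeClusters.

```python
def single_node_clusters(clusters):
    """Return identifiers of clusters that have fewer than two nodes.

    `clusters` is a list of dicts shaped like the "Clusters" entries
    returned by DescribeClusters.
    """
    return [
        cluster["ClusterIdentifier"]
        for cluster in clusters
        if cluster["NumberOfNodes"] < 2
    ]


def find_single_node_clusters(region_name=None):
    """Scan all provisioned clusters in a Region (needs boto3 and credentials)."""
    import boto3  # imported here so the filter above works without the SDK

    client = boto3.client("redshift", region_name=region_name)
    paginator = client.get_paginator("describe_clusters")
    found = []
    for page in paginator.paginate():
        found.extend(single_node_clusters(page["Clusters"]))
    return found
```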

Cluster Failure

Every Redshift cluster has a leader node and one or more compute nodes. If the cluster fails, it must be restored from a snapshot. Each snapshot is a point-in-time backup of the cluster, including all databases in it, along with cluster information such as node count, node type, and administrator username. When you restore from a snapshot, Amazon Redshift creates a new cluster from the snapshot data, and you can start querying the new cluster before all data has finished loading. The new cluster is created in the same AWS Region, in a randomly selected Availability Zone unless you specify one.
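A restore from a snapshot can be sketched as follows, assuming boto3. The helper names, identifiers, and Availability Zone are hypothetical. Omitting the Availability Zone leaves the choice to Amazon Redshift; passing one pins the new cluster to that zone.

```python
def build_restore_request(new_cluster_id, snapshot_id, availability_zone=None):
    """Build RestoreFromClusterSnapshot parameters for a new cluster."""
    params = {
        "ClusterIdentifier": new_cluster_id,
        "SnapshotIdentifier": snapshot_id,
    }
    # Only set the zone when the caller asks for a specific one.
    if availability_zone:
        params["AvailabilityZone"] = availability_zone
    return params


def restore_cluster(new_cluster_id, snapshot_id, availability_zone=None,
                    region_name=None):
    """Start the restore (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the builder above works without the SDK

    client = boto3.client("redshift", region_name=region_name)
    return client.restore_from_cluster_snapshot(
        **build_restore_request(new_cluster_id, snapshot_id, availability_zone)
    )


if __name__ == "__main__":
    # Hypothetical identifiers and zone, shown for illustration only.
    print(build_restore_request("analytics-restored", "analytics-2024-06-27", "us-east-1b"))
```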

Availability Zone Failure

A Region is a physical location with data centers, while an Availability Zone consists of one or more discrete data centers within a Region, each with redundant power, networking, and connectivity. Availability Zones let you run production applications and databases with higher availability, fault tolerance, and scalability than a single data center could offer. All Availability Zones in a Region are interconnected with high-bandwidth, low-latency networking over fully redundant, dedicated metro fiber.
