Automating Data Archival for Amazon Redshift Time Series Tables


Published on 04 OCT 2023

Category: Amazon Redshift, Analytics, Customer Solutions

Amazon Redshift is a powerful cloud data warehouse that enables users to analyze vast amounts of data efficiently using standard SQL. With its capacity to handle petabyte-scale workloads, numerous organizations depend on Redshift to analyze exabytes of information and run complex analytical queries. This widespread adoption reflects Redshift’s ability to deliver fast, scalable analytics without requiring users to manage the underlying data warehouse infrastructure.

Establishing a data retention policy is an essential component of an organization’s comprehensive data management strategy. In today’s big data landscape, data volume is perpetually on the rise, leading to increased storage costs. Therefore, it is vital to optimize data management in warehouses to ensure consistent performance, reliability, and cost-effectiveness. Organizations must determine how long they need to retain specific data and whether data that is no longer relevant should be archived or deleted. The frequency of archival largely depends on business and regulatory requirements.

Data archiving involves transferring inactive data from a data warehouse to a separate storage solution for long-term retention. This archived data often includes older records that remain significant to the organization, as well as data necessary for compliance with regulatory obligations. Conversely, data purging refers to the process of removing obsolete data that is no longer needed, which can be guided by the data retention policy set forth by the data owner or organizational needs.

This article outlines the steps to automate the archival and purging of Amazon Redshift time series tables. These tables store data for specified periods (days, months, quarters, or years) and require regular purging to maintain relevant information for analysis by end-users.

Solution Overview

The solution architecture is depicted in the following diagram.

This solution involves two database tables:

  1. arch_table_metadata: This table stores metadata for all tables designated for archival and purging. You need to insert rows for the tables you wish to manage. The columns include:
    • id: A database-generated unique identifier for each record.
    • schema_name: Name of the database schema for the table.
    • table_name: Name of the target table for archival and purging.
    • column_name: The date column used to identify records eligible for archival and purging.
    • s3_uri: The Amazon S3 location for data archival.
    • retention_days: The duration for which data will be retained, defaulting to 90 days.
  2. arch_job_log: This table tracks the run history of stored procedures. It logs details such as:
    • job_run_id: A unique numeric identifier for each stored procedure execution.
    • arch_table_metadata_id: The id of the corresponding record in the arch_table_metadata table.
    • no_of_rows_bfr_delete: The number of rows present before the purging operation.
    • no_of_rows_deleted: The count of rows removed during the purge.
    • job_start_time: The UTC time when the stored procedure commenced.
    • job_end_time: The UTC time when the procedure concluded.
    • job_status: The status of the job execution, which can be IN-PROGRESS, COMPLETED, or FAILED.
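
The DDL for these two tables is referenced in the prerequisites below but is not reproduced in this excerpt. The following is a minimal sketch consistent with the column descriptions above; the column types, sizes, IDENTITY key, and NOT NULL constraints are assumptions, and the original post’s definitions may differ.

    -- Sketch of the two control tables; exact types and constraints are assumptions.
    CREATE TABLE arch_table_metadata (
        id             BIGINT IDENTITY(1,1),          -- database-generated unique identifier
        schema_name    VARCHAR(128)  NOT NULL,        -- schema of the target table
        table_name     VARCHAR(128)  NOT NULL,        -- table to archive and purge
        column_name    VARCHAR(128)  NOT NULL,        -- date column that determines eligibility
        s3_uri         VARCHAR(1024) NOT NULL,        -- S3 location for archived data
        retention_days INTEGER       NOT NULL DEFAULT 90  -- retention window in days
    );

    CREATE TABLE arch_job_log (
        job_run_id             BIGINT      NOT NULL,  -- unique identifier per procedure run
        arch_table_metadata_id BIGINT      NOT NULL,  -- id of the row in arch_table_metadata
        no_of_rows_bfr_delete  BIGINT,                -- row count before the purge
        no_of_rows_deleted     BIGINT,                -- rows removed by the purge
        job_start_time         TIMESTAMP,             -- UTC start time
        job_end_time           TIMESTAMP,             -- UTC end time
        job_status             VARCHAR(20)            -- IN-PROGRESS, COMPLETED, or FAILED
    );

A table is enrolled for archival by inserting a row into arch_table_metadata. For example (the schema, table, column, and bucket names below are placeholders):

    INSERT INTO arch_table_metadata
        (schema_name, table_name, column_name, s3_uri, retention_days)
    VALUES
        ('public', 'sales_daily', 'sale_date', 's3://my-archive-bucket/sales_daily/', 90);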

Prerequisites

Before implementing this solution, ensure you complete the following steps:

  1. Set up an Amazon Redshift provisioned cluster or a serverless workgroup.
  2. Using Amazon Redshift Query Editor v2 or any compatible SQL editor, create the arch_table_metadata and arch_job_log tables using the DDL shown in the previous section.
  3. Create the stored procedure sp_archive_data; an illustrative sketch follows this list. The procedure accepts an AWS Identity and Access Management (IAM) role ARN as an input parameter unless you choose to use the default IAM role. For more details, see the Amazon Redshift documentation on IAM roles.
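
The stored procedure itself is not reproduced in this excerpt, so the following is a minimal illustrative sketch of the pattern the post describes: for each table registered in arch_table_metadata, it unloads expired rows to Amazon S3, deletes them, and records the run in arch_job_log. Only the procedure name and IAM role parameter come from the text above; the Parquet output format, the job_run_id scheme, and the dynamic SQL details are assumptions, and the original implementation may differ (for example, in error handling that marks failed runs as FAILED).

    -- Illustrative sketch (not the original post's code): archive-then-purge
    -- for every table registered in arch_table_metadata.
    CREATE OR REPLACE PROCEDURE sp_archive_data(p_iam_role VARCHAR)
    AS $$
    DECLARE
        rec            RECORD;
        v_job_run_id   BIGINT;
        v_rows_before  BIGINT;
        v_rows_deleted BIGINT;
        v_cutoff       DATE;
        v_select_sql   VARCHAR(4000);
    BEGIN
        FOR rec IN SELECT id, schema_name, table_name, column_name, s3_uri, retention_days
                   FROM arch_table_metadata
        LOOP
            -- Records older than the retention window are eligible.
            v_cutoff := CURRENT_DATE - rec.retention_days;

            -- Open a log entry for this run (assumed job_run_id scheme).
            SELECT INTO v_job_run_id NVL(MAX(job_run_id), 0) + 1 FROM arch_job_log;
            INSERT INTO arch_job_log (job_run_id, arch_table_metadata_id, job_start_time, job_status)
            VALUES (v_job_run_id, rec.id, GETDATE(), 'IN-PROGRESS');

            EXECUTE 'SELECT COUNT(*) FROM ' || rec.schema_name || '.' || rec.table_name
                INTO v_rows_before;

            -- Archive expired rows to S3 before deleting them. A real
            -- implementation might suffix the S3 prefix with the run date
            -- instead of relying on ALLOWOVERWRITE.
            v_select_sql := 'SELECT * FROM ' || rec.schema_name || '.' || rec.table_name
                         || ' WHERE ' || rec.column_name || ' < '
                         || quote_literal(CAST(v_cutoff AS VARCHAR));
            EXECUTE 'UNLOAD (' || quote_literal(v_select_sql) || ')'
                 || ' TO ' || quote_literal(rec.s3_uri)
                 || ' IAM_ROLE ' || quote_literal(p_iam_role)
                 || ' FORMAT AS PARQUET ALLOWOVERWRITE';

            -- Purge the rows that were just archived.
            EXECUTE 'DELETE FROM ' || rec.schema_name || '.' || rec.table_name
                 || ' WHERE ' || rec.column_name || ' < '
                 || quote_literal(CAST(v_cutoff AS VARCHAR));
            GET DIAGNOSTICS v_rows_deleted := ROW_COUNT;

            -- Close out the log entry.
            UPDATE arch_job_log
               SET no_of_rows_bfr_delete = v_rows_before,
                   no_of_rows_deleted    = v_rows_deleted,
                   job_end_time          = GETDATE(),
                   job_status            = 'COMPLETED'
             WHERE job_run_id = v_job_run_id;
        END LOOP;
    END;
    $$ LANGUAGE plpgsql;

To run a pass, call the procedure with the ARN of an IAM role that can write to the target S3 locations (the ARN below is a placeholder). Passing the string 'default' also works on clusters with a default IAM role attached, since UNLOAD accepts IAM_ROLE 'default'.

    -- The role ARN below is a placeholder; substitute your own.
    CALL sp_archive_data('arn:aws:iam::123456789012:role/MyRedshiftRole');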

With this setup, you can automate the archival and purging of your Amazon Redshift time series tables, keeping your data warehouse lean while remaining compliant with your retention policies.

