Amazon Redshift is a fully managed, scalable cloud data warehouse that provides a straightforward and cost-efficient way to analyze vast amounts of data using standard SQL. Customers use Amazon Redshift to analyze exabytes of data and run complex analytical queries, and it lets you run and scale analytics in seconds without having to manage your data warehouse infrastructure.
Establishing a data retention policy is a critical component of effective data management within an organization. As data volumes continue to grow in today’s big data landscape, managing storage costs becomes increasingly essential. It’s important to continually optimize data in data warehouses to ensure consistent performance, reliability, and cost efficiency. Organizations must determine the duration for which specific data needs to be retained and whether outdated data should be archived or permanently deleted. The frequency of data archival should align with business requirements or legal obligations.
Data archiving involves transferring data that is no longer actively used in a data warehouse to a separate storage solution for long-term preservation. Archived data typically includes older information that remains significant to the organization and may be required for future reference or compliance purposes. On the other hand, data purging is the process of clearing space in the database by removing obsolete data that is no longer relevant to the business. This purging process can be governed by the data retention policy established by the data owner or organizational needs.
This article provides a detailed guide on automating the archival and purging of time series tables in Amazon Redshift. Time series tables are designed to retain data for specific durations (days, months, quarters, or years) and require regular purging to maintain a manageable dataset for end-users.
Solution Overview
The architecture of our solution is illustrated in the following diagram.
Our solution uses two database tables. The arch_table_metadata table holds metadata for all tables slated for archiving and purging. To use this table, insert a row for each table you want the process to manage. The arch_table_metadata table includes the following columns:
- id: A database-generated identifier that uniquely assigns a value to each record.
- schema_name: The name of the database schema associated with the table.
- table_name: The name of the table designated for archiving and purging.
- column_name: The date column utilized to identify records for archival and purging.
- s3_uri: The Amazon S3 location where archived data will be stored.
- retention_days: The number of days the data is retained; the default is 90 days.
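The DDL for this table isn’t reproduced here; the following is a minimal sketch inferred from the column descriptions above, so the exact data types and lengths are assumptions:
CREATE TABLE IF NOT EXISTS arch_table_metadata
(
    id             INT IDENTITY(1,1),        -- database-generated unique identifier
    schema_name    VARCHAR(128) NOT NULL,    -- schema of the table to archive and purge
    table_name     VARCHAR(128) NOT NULL,    -- table to archive and purge
    column_name    VARCHAR(128) NOT NULL,    -- date column used to identify expired rows
    s3_uri         VARCHAR(512) NOT NULL,    -- Amazon S3 location for the archived data
    retention_days INT DEFAULT 90            -- retention period in days (default 90)
);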
The arch_job_log table tracks the run history of the stored procedure. Records are added to this table as the stored procedure executes. It contains the following columns:
- job_run_id: A unique numeric identifier for each stored procedure run.
- arch_table_metadata_id: The ID from the arch_table_metadata table.
- no_of_rows_bfr_delete: The count of rows in the table prior to purging.
- no_of_rows_deleted: The number of rows removed during the purge operation.
- job_start_time: The UTC time at which the stored procedure began executing.
- job_end_time: The UTC time at which the stored procedure finished executing.
- job_status: The current status of the stored procedure run: IN-PROGRESS, COMPLETED, or FAILED.
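As with the metadata table, a minimal DDL sketch for arch_job_log based on the columns above might look like this (data types are assumptions):
CREATE TABLE IF NOT EXISTS arch_job_log
(
    job_run_id             INT NOT NULL,   -- unique identifier for each stored procedure run
    arch_table_metadata_id INT NOT NULL,   -- id of the corresponding arch_table_metadata row
    no_of_rows_bfr_delete  BIGINT,         -- row count in the table before purging
    no_of_rows_deleted     BIGINT,         -- rows removed by the purge
    job_start_time         TIMESTAMP,      -- UTC time the run started
    job_end_time           TIMESTAMP,      -- UTC time the run finished
    job_status             VARCHAR(20)     -- IN-PROGRESS, COMPLETED, or FAILED
);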
Prerequisites
To implement this solution, ensure you have met the following prerequisites:
- Set up an Amazon Redshift provisioned cluster or Amazon Redshift Serverless workgroup.
- In the Amazon Redshift query editor v2 or a compatible SQL tool, create the arch_table_metadata and arch_job_log tables (for example, using DDL along the lines of the sketches above).
- Create the stored procedure archive_data_sp using the code snippet below. The procedure accepts an AWS IAM role ARN as an input parameter if you’re not using the default IAM role; if the default IAM role is attached to your cluster or workgroup, pass the string default instead. For more information about the default IAM role, refer to the Amazon Redshift documentation.
CREATE OR REPLACE PROCEDURE archive_data_sp(p_iam_role IN varchar(256))
AS $$
DECLARE
    v_command varchar(500);
    v_sql varchar(500);
    v_count_sql text;
    v_table_id int;
    v_schema_name text;
    v_table_name text;
    v_column_name text;
    v_s3_bucket_url text;
    v_s3_folder_name_prefix text;
    v_retention_days int = 0;
    v_no_of_rows_before_delete int = 0;
    v_no_of_deleted_rows int = 0;
    v_job_start_time timestamp;
    v_job_status int = 1;
    v_job_id int = 0;
    table_meta_data_cur CURSOR FOR
        SELECT id, schema_name, table_name, column_name, s3_uri, retention_days
        FROM arch_table_metadata;
BEGIN
    -- Generate the next job_run_id from the log table.
    SELECT NVL(MAX(job_run_id),0) + 1 INTO v_job_id FROM arch_job_log;
    RAISE NOTICE '%', v_job_id;

    OPEN table_meta_data_cur;
    FETCH table_meta_data_cur INTO v_table_id, v_schema_name, v_table_name, v_column_name, v_s3_bucket_url, v_retention_days;
    WHILE v_table_id IS NOT NULL LOOP
        -- Count the rows in the table before purging.
        v_count_sql = 'SELECT COUNT(*) AS v_no_of_rows_before_delete FROM ' || v_schema_name || '.' || v_table_name;
        RAISE NOTICE '%', v_count_sql;
        EXECUTE v_count_sql INTO v_no_of_rows_before_delete;
        RAISE NOTICE 'v_no_of_rows_before_delete %', v_no_of_rows_before_delete;

        v_job_start_time = GETDATE();
        v_s3_folder_name_prefix = v_schema_name || '.' || v_table_name || '/';

        -- Select the rows that have aged past the retention period.
        v_sql = 'SELECT * FROM ' || v_schema_name || '.' || v_table_name || ' WHERE ' || v_column_name || ' <= DATEADD(DAY,-' || v_retention_days || ',CURRENT_DATE)';

        -- Archive the expired rows to Amazon S3 as partitioned Parquet files.
        IF p_iam_role = 'default' THEN
            v_command = 'UNLOAD (''' || v_sql || ''') to ''' || v_s3_bucket_url || v_s3_folder_name_prefix || ''' IAM_ROLE default PARQUET PARTITION BY (' || v_column_name || ') INCLUDE ALLOWOVERWRITE';
        ELSE
            v_command = 'UNLOAD (''' || v_sql || ''') to ''' || v_s3_bucket_url || v_s3_folder_name_prefix || ''' IAM_ROLE ''' || p_iam_role || ''' PARQUET PARTITION BY (' || v_column_name || ') INCLUDE ALLOWOVERWRITE';
        END IF;
        RAISE NOTICE '%', v_command;
        EXECUTE v_command;

        -- Additional code for purging the archived rows and logging to arch_job_log would go here.

        -- Fetch the next table to process; v_table_id becomes NULL when the cursor is exhausted.
        FETCH table_meta_data_cur INTO v_table_id, v_schema_name, v_table_name, v_column_name, v_s3_bucket_url, v_retention_days;
    END LOOP;
    CLOSE table_meta_data_cur;
END;
$$ LANGUAGE plpgsql;
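To tie the pieces together, here is a hypothetical usage example. The schema, table, date column, S3 bucket, and IAM role ARN below are placeholders, not values from the original solution:
-- Register a time series table for archive and purge with a 90-day retention period.
INSERT INTO arch_table_metadata (schema_name, table_name, column_name, s3_uri, retention_days)
VALUES ('sales', 'orders_daily', 'order_date', 's3://example-archive-bucket/redshift-archive/', 90);

-- Run the archive process using the default IAM role attached to the cluster or workgroup.
CALL archive_data_sp('default');

-- Alternatively, pass an explicit IAM role ARN.
CALL archive_data_sp('arn:aws:iam::123456789012:role/RedshiftArchiveRole');

-- Review the run history logged by the procedure (the logging logic is part of the elided section above).
SELECT * FROM arch_job_log ORDER BY job_start_time DESC;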