Amazon Redshift serves as a robust data warehousing solution that logs system activities through STL log tables. These tables typically maintain log histories for only two to five days, depending on usage and available disk capacity. For audit purposes, Amazon Redshift automatically logs certain STL tables to Amazon S3, focusing primarily on database security and executed queries. This functionality was detailed in another blog post, which you can find here.
If you aim to retain system data from STL tables that aren’t covered by the audit logs, the conventional approach involves creating replica tables for each system table. Regular data loading from the system tables into these replicas allows for historical diagnostic queries. This process enables insights into query execution times, plans, and disk-spill patterns, ultimately assisting in optimal cluster-sizing decisions. However, refreshing these replica tables with live data requires scheduling tools like Cron or AWS Data Pipeline. Additionally, these tables are cluster-specific and become inaccessible once the cluster is terminated, which poses a challenge for transient Amazon Redshift clusters that are only active for certain ad hoc queries.
In this blog post, I will outline a solution that exports system tables from multiple Amazon Redshift clusters to an Amazon S3 bucket. This serverless solution can be scheduled as frequently as every five minutes. I provide an AWS CloudFormation deployment template to automate the setup in your environment. The data stored in the Amazon S3 bucket is organized by cluster name and query execution date, facilitating efficient cross-cluster diagnostic queries. For further insights on this topic, visit this resource, which is highly regarded.
Additionally, I will present another CloudFormation template to automate the creation of tables in the AWS Glue Data Catalog for the system tables’ data stored in Amazon S3. Once the system tables are exported to Amazon S3, you can perform cross-cluster diagnostic queries to gain insights into query executions across all Amazon Redshift clusters. This can be achieved through tools like Amazon QuickSight, Amazon Athena, Amazon EMR, or Amazon Redshift Spectrum. For a visual explanation, check out this excellent resource.
Solution Overview
The solution described utilizes AWS Glue to export system tables’ log data from Amazon Redshift clusters into Amazon S3. AWS Lambda triggers the AWS Glue ETL jobs at scheduled intervals. AWS Systems Manager securely stores the configuration data, including the details of the Amazon Redshift clusters. The last fetched timestamp values for each cluster-table combination are recorded in an Amazon DynamoDB table.
The operational flow of the solution is as follows:
- The Lambda function,
invoke_rs_stl_export_etl
, is activated at regular intervals via Amazon CloudWatch. It retrieves the details of the enabled Amazon Redshift clusters from AWS Systems Manager’s parameter store. - Based on the retrieved cluster details, the Lambda function triggers the corresponding AWS Glue ETL job. If no job exists for a specific cluster, the function creates one.
- The ETL job retrieves cluster credentials from the parameter store and obtains the last exported timestamp from the DynamoDB table.
- The ETL job then unloads the system tables’ data from the Amazon Redshift cluster to the designated Amazon S3 bucket.
- Finally, the ETL job updates the DynamoDB table with the new last exported timestamp for each system table.
Understanding Configuration Data
AWS Systems Manager parameter store is utilized to securely store Amazon Redshift cluster credentials and other configuration information needed for the AWS Glue ETL jobs. A default AWS Key Management Service (AWS KMS) key is employed to encrypt sensitive components like passwords.
The following table highlights the global parameters and cluster-specific parameters essential for this solution. The global parameters are defined once and apply across the solution, while cluster-specific parameters are repeated for each Amazon Redshift cluster you enable. The CloudFormation template will create these parameters during the deployment process.
Parameter Name | Type | Description |
---|---|---|
Global parameters—defined once and applied to all jobs | ||
redshift_query_logs.global.s3_prefix |
String | The S3 path for exported query logs, partitioned by cluster name and date. |
redshift_query_logs.global.tempdir |
String | S3 path for temporary data staging used by AWS Glue ETL jobs. |
redshift_query_logs.global.role |
String | The role name assumed by the AWS Glue ETL jobs. |
redshift_query_logs.global.enabled_cluster_list |
StringList | A list of cluster names for which data export is enabled, allowing exclusions. |
Cluster-specific parameters—for each cluster in the enabled_cluster_list | ||
redshift_query_logs.<>.connection |
String | The AWS Glue Data Catalog connection name for the cluster. |
redshift_query_logs.<>.user |
String | The username for AWS Glue to connect to the Amazon Redshift cluster. |
redshift_query_logs.<>.password |
Secure String | The encrypted password for AWS Glue to connect, managed by AWS KMS. |
This comprehensive approach enables effective retention and analysis of system tables’ data across multiple Amazon Redshift clusters, making it easier to derive valuable insights from historical query executions.
Leave a Reply