As a data warehouse administrator or data engineer, you often need to perform maintenance tasks and run custom monitoring on a regular schedule. You can consolidate these tasks in stored procedures, or gather the necessary insights by querying views. Typical tasks include loading nightly staging tables, running views, terminating idle connections, and dropping unused tables.
In this article, we explore how to automate these routine tasks for an Amazon Redshift cluster operating within a secure private network. The solution incorporates the following AWS services:
- AWS Lambda: Executes specific queries and calls views or stored procedures within your Amazon Redshift cluster.
- Amazon EventBridge: Schedules the execution of SQL statements by triggering a Lambda function. The EventBridge rule provides the Amazon Redshift cluster details as input parameters, enabling you to supply multiple queries or cluster specifics.
- AWS Identity and Access Management (IAM): Grants secure access to the Amazon Redshift cluster using temporary credentials, which eliminates the need to store access credentials.
- Amazon API Gateway: Facilitates a secure connection to the Amazon Redshift API service from a private subnet without internet access.
Solution Architecture
The following architecture diagram provides an overview of the solution.
The workflow of this architecture includes:
- You create an EventBridge rule with a schedule on the default event bus that invokes a target: a Lambda function that connects to an Amazon Redshift cluster and runs a SQL statement. The target is set up to provide input parameters, including the Amazon Redshift cluster identifier, database name, database user, and the SQL statement to run.
- The rule triggers at the scheduled time and sends the input to the RedshiftExecuteSQLFunction function, which runs the specified query.
- This function operates within a private subnet of your Amazon Virtual Private Cloud (VPC) without internet access. To communicate with the Amazon Redshift API service and generate temporary user credentials, it sends the Amazon Redshift cluster information through the private subnet to an API Gateway VPC endpoint. The API is backed by a second function, RedshiftApiLambda, which calls the Amazon Redshift API service to generate temporary credentials and returns them securely to RedshiftExecuteSQLFunction.
- The RedshiftExecuteSQLFunction function uses the Amazon Redshift cluster endpoint, port, and the temporary credentials it received to connect to the Amazon Redshift cluster in the private subnet of your VPC and run the submitted SQL statement.
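The following is a minimal Python sketch of what a function like RedshiftExecuteSQLFunction could look like. It is illustrative only: the event field names are assumptions rather than the template's actual parameter names, it assumes the redshift_connector package is bundled with the function, and for brevity it calls the Amazon Redshift API directly with boto3, whereas the deployed solution routes that call through the API Gateway endpoint backed by RedshiftApiLambda.

```python
import boto3
import redshift_connector  # assumed to be packaged with the function

redshift_api = boto3.client("redshift")

def lambda_handler(event, context):
    # Illustrative event fields; the template's actual parameter names may differ.
    cluster_id = event["redshift_cluster_id"]
    database = event["redshift_database"]
    db_user = event["redshift_user"]
    sql = event["sql_statement"]

    # Look up the cluster endpoint and request short-lived credentials.
    cluster = redshift_api.describe_clusters(ClusterIdentifier=cluster_id)["Clusters"][0]
    creds = redshift_api.get_cluster_credentials(
        ClusterIdentifier=cluster_id,
        DbUser=db_user,
        DbName=database,
        DurationSeconds=900,
        AutoCreate=False,
    )

    # Connect with the temporary credentials and run the submitted SQL statement.
    conn = redshift_connector.connect(
        host=cluster["Endpoint"]["Address"],
        port=cluster["Endpoint"]["Port"],
        database=database,
        user=creds["DbUser"],
        password=creds["DbPassword"],
    )
    try:
        cursor = conn.cursor()
        cursor.execute(sql)
        conn.commit()
    finally:
        conn.close()
    return {"status": "SUCCESS"}
```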
This architecture is scalable, accommodating multiple rules for various DBA tasks across different Amazon Redshift clusters.
Prerequisites
To get started, you need an AWS account. We provide an AWS CloudFormation template to illustrate the solution; you can download it and use it to deploy the required AWS resources. The template has been tested in the us-east-1 Region.
Once logged into your AWS account, follow these steps:
- Deploy the resources by using the template to launch the stack on the AWS Management Console.
- Select Next.
- On the Specify stack details page, input the following parameters:
- For Lambda VPC Configuration, choose the VPC and subnets within it. The template allows selecting multiple subnets, but only the first two will be utilized. Ensure that the chosen VPC subnets have access to the target Amazon Redshift cluster.
- Decide to create or use an existing VPC endpoint for the API Gateway. If using an existing endpoint, ensure it is a DNS-enabled interface endpoint.
- Leave the other values at their defaults and choose Next.
- On the Configure stack options page, maintain the defaults and select Next.
- On the Review page, acknowledge that AWS CloudFormation might create IAM resources and choose Create stack.
The CloudFormation template takes about 5 minutes to deploy the resources. When the stack status shows CREATE_COMPLETE, choose the Outputs tab and note the values for RedshiftExecuteSQLFunction and RedshiftExecuteSQLFunctionSecurityGroup. You need these values later to create the EventBridge rules and to allow access to the Amazon Redshift cluster.
Amazon Redshift Stored Procedures and Security Definer
A stored procedure is a user-defined object that runs a series of SQL queries and logical operations. Stored procedures often encapsulate logic for data transformation, validation, and business-specific tasks. By consolidating multiple SQL steps into a single procedure, you reduce round trips between your applications and the database.
Amazon Redshift supports stored procedures in the PL/pgSQL dialect, which provides variable declaration, control logic, and error handling. The SECURITY attribute of a stored procedure controls the privileges it runs with. By default, only superusers and the procedure's owner have permission to call it. By creating a procedure with SECURITY DEFINER, you can let users perform actions they would not otherwise have permission for, such as dropping tables created by other users, without granting them access to the underlying tables.
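As an illustration, the following hypothetical procedure is created with SECURITY DEFINER so that a user who is granted EXECUTE on it can drop a staging table without holding privileges on the table itself. The schema, procedure, and group names here are examples, not objects created by the template.

```sql
-- Runs with the privileges of the procedure owner, not the caller.
CREATE OR REPLACE PROCEDURE dba.sp_drop_staging_table(p_table_name VARCHAR)
AS $$
BEGIN
    EXECUTE 'DROP TABLE IF EXISTS staging.' || p_table_name;
END;
$$ LANGUAGE plpgsql
SECURITY DEFINER;

-- The caller only needs EXECUTE on the procedure, not DROP on the table.
GRANT EXECUTE ON PROCEDURE dba.sp_drop_staging_table(VARCHAR) TO GROUP etl_users;
```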
For further details on stored procedures, refer to Creating stored procedures in Amazon Redshift and Security and privileges for stored procedures.
In this post, we develop two DBA tasks in the form of a stored procedure and views within the Amazon Redshift cluster:
- Dropping unused tables
- Cleaning up idle connections
We will then schedule these tasks using EventBridge and Lambda. To enhance tracking of DBA tasks—such as which tables are dropped and how many idle connections are terminated—we create a helper table and a stored procedure to log the execution details. SQL statements can be executed against the cluster using a query editor or SQL client tools. Subsequently, you can invoke this stored procedure in other DBA task stored procedures to log task details, as demonstrated in the following code:
```sql
CALL dba.sp_log_dba_task(CURRENT_USER_ID, user, 'Idle connections', 'Kill idle connections', 'Succeed');
```
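The helper table and logging procedure themselves are not shown here; the following is a minimal sketch of what they could look like, inferred only from the signature of the CALL above. The table name, column names, and data types are assumptions and may differ from the objects the post actually creates.

```sql
CREATE SCHEMA IF NOT EXISTS dba;

-- Hypothetical log table; column names and types are assumptions.
CREATE TABLE IF NOT EXISTS dba.dba_task_log (
    user_id     INTEGER,
    user_name   VARCHAR(100),
    task_name   VARCHAR(100),
    task_detail VARCHAR(500),
    task_status VARCHAR(20),
    logged_at   TIMESTAMP DEFAULT SYSDATE
);

CREATE OR REPLACE PROCEDURE dba.sp_log_dba_task(
    p_user_id     INTEGER,
    p_user_name   VARCHAR,
    p_task_name   VARCHAR,
    p_task_detail VARCHAR,
    p_task_status VARCHAR)
AS $$
BEGIN
    INSERT INTO dba.dba_task_log (user_id, user_name, task_name, task_detail, task_status)
    VALUES (p_user_id, p_user_name, p_task_name, p_task_detail, p_task_status);
END;
$$ LANGUAGE plpgsql;
```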
Dropping Unused Tables
Users may create tables for temporary use but neglect to delete them afterward. Over time, these leftover tables can accumulate in the data warehouse, wasting storage space. In this use case, DBAs need to perform regular cleanups to manage this issue.
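As a rough illustration of this task, the following sketch drops tables in a given schema whose names match a temporary-table prefix and records the action through the logging procedure shown earlier. It is a simplified example: identifying truly unused tables typically involves system tables such as SVV_TABLE_INFO and scan history, and the procedure the post builds may take a different approach. The procedure name and parameters are assumptions.

```sql
CREATE OR REPLACE PROCEDURE dba.sp_drop_unused_tables(p_schema VARCHAR, p_prefix VARCHAR)
AS $$
DECLARE
    rec RECORD;
BEGIN
    -- Find candidate tables by naming convention (a simplification of "unused").
    FOR rec IN
        SELECT schemaname, tablename
        FROM pg_tables
        WHERE schemaname = p_schema
          AND tablename LIKE p_prefix || '%'
    LOOP
        EXECUTE 'DROP TABLE ' || rec.schemaname || '.' || rec.tablename;
    END LOOP;

    -- Record the task using the helper procedure.
    CALL dba.sp_log_dba_task(CURRENT_USER_ID, user,
        'Unused tables', 'Dropped tables in ' || p_schema || ' with prefix ' || p_prefix, 'Succeed');
END;
$$ LANGUAGE plpgsql
SECURITY DEFINER;
```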