Building a Serverless ETL Framework for Amazon Redshift Using RSQL, AWS Batch, and AWS Step Functions


Amazon Redshift RSQL is a command-line client for Amazon Redshift clusters and databases. With it, you can connect to a Redshift cluster, describe database objects, query data, and view results in various output formats. Its enhanced control flow commands can replace traditional extract, transform, and load (ETL) and automation scripts.

This article shows how to build a fully serverless, cost-efficient ETL orchestration framework for Amazon Redshift using Amazon Redshift RSQL together with AWS Batch and AWS Step Functions.

Solution Overview

When transitioning from legacy data warehouses to Amazon Redshift, organizations often face proprietary scripts that encapsulate SQL statements and intricate business logic, including if-then-else control flow, error reporting, and handling. These legacy features can be transformed into Amazon Redshift RSQL, which effectively replaces existing ETL and automation scripts. For further insights into Amazon Redshift RSQL’s capabilities, examples, and practical applications, refer to this blog post.

The AWS Schema Conversion Tool (AWS SCT) can facilitate the conversion of proprietary scripts into Amazon Redshift RSQL. Specifically, AWS SCT can automatically translate Teradata BTEQ scripts into Amazon Redshift RSQL. To learn more about using AWS SCT, check out the guide on converting Teradata BTEQ scripts.

The primary objective of the framework discussed in this article is to execute complex ETL jobs via Amazon Redshift RSQL scripts in the AWS Cloud without the burden of managing infrastructure. Beyond functional requirements, this solution ensures comprehensive auditing and traceability of all executed ETL processes.

The architecture diagram below illustrates the final setup.

The deployment of this framework is entirely automated through the AWS Cloud Development Kit (AWS CDK) and consists of the following components:

  • EcrRepositoryStack: Establishes a private Amazon Elastic Container Registry (ECR) repository that stores the Docker image featuring Amazon Redshift RSQL.
  • RsqlDockerImageStack: Constructs the Docker image asset and uploads it to the ECR repository.
  • VpcStack: Sets up a VPC with isolated subnets, creates an Amazon Simple Storage Service (S3) VPC endpoint gateway, and establishes Amazon ECR, Amazon Redshift, and Amazon CloudWatch VPC endpoint interfaces.
  • RedshiftStack: Creates an Amazon Redshift cluster, activates encryption, enforces in-transit encryption, enables auditing, and deploys the cluster within isolated subnets.
  • BatchStack: Configures a compute environment (utilizing AWS Fargate), job queue, and job definition (employing our Docker image with RSQL).
  • S3Stack: Establishes data, scripts, and logging buckets; activates encryption at rest; ensures secure transfer; enables object versioning; and restricts public access.
  • SnsStack: Creates an Amazon Simple Notification Service (SNS) topic and an email subscription (where the email address is provided as a parameter).
  • StepFunctionsStack: Develops a state machine to orchestrate serverless RSQL ETL jobs.
  • SampleDataDeploymentStack: Deploys sample RSQL ETL scripts and sample TPC benchmark datasets.
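The state machine created by StepFunctionsStack can rely on the optimized Step Functions integrations for AWS Batch and Amazon SNS: it submits each RSQL job synchronously and publishes success or failure notifications to the topic. The following Amazon States Language sketch illustrates that shape; the state names, ARNs, and script location are illustrative assumptions, not the definition deployed by the AWS CDK code.

```json
{
  "StartAt": "RunRsqlEtlJob",
  "States": {
    "RunRsqlEtlJob": {
      "Type": "Task",
      "Resource": "arn:aws:states:::batch:submitJob.sync",
      "Parameters": {
        "JobName": "rsql-etl-job",
        "JobQueue": "arn:aws:batch:eu-west-1:123456789012:job-queue/rsql-queue",
        "JobDefinition": "arn:aws:batch:eu-west-1:123456789012:job-definition/rsql-job-def",
        "ContainerOverrides": {
          "Environment": [
            { "Name": "BATCH_SCRIPT_LOCATION", "Value": "s3://scripts-bucket/etl/load_orders.sql" }
          ]
        }
      },
      "Catch": [ { "ErrorEquals": ["States.ALL"], "Next": "NotifyFailure" } ],
      "Next": "NotifySuccess"
    },
    "NotifySuccess": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:eu-west-1:123456789012:etl-notifications",
        "Message": "RSQL ETL job succeeded"
      },
      "End": true
    },
    "NotifyFailure": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:eu-west-1:123456789012:etl-notifications",
        "Message": "RSQL ETL job failed"
      },
      "End": true
    }
  }
}
```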

Prerequisites

To begin, ensure you have the following prerequisites:

  • An AWS account
  • Amazon Linux 2 with AWS CDK and Docker Engine installed

Deploying AWS CDK Stacks

To implement the serverless RSQL ETL framework, execute the following commands. Be sure to replace 123456789012 with your AWS account number, eu-west-1 with your desired AWS Region for deployment, and your.email@example.com with the email address where you would like to receive ETL success and failure notifications.

git clone https://github.com/aws-samples/amazon-redshift-serverless-rsql-etl-framework
cd amazon-redshift-serverless-rsql-etl-framework
npm install
./cdk.sh 123456789012 eu-west-1 bootstrap
./cdk.sh 123456789012 eu-west-1 deploy --all --parameters SnsStack:EmailAddressSubscription=your.email@example.com

The deployment process should only take a few minutes. While AWS CDK is busy creating the stacks, feel free to continue reading this article.

Creating the RSQL Container Image

AWS CDK builds an RSQL Docker image, which serves as the foundational component of our solution. All ETL processes are executed within this image. The Docker image is created locally using Docker Engine and subsequently uploaded to the Amazon ECR repository.

This Docker image is based on an Amazon Linux 2 image and comes pre-installed with essential tools: the AWS Command Line Interface (AWS CLI), unixODBC, the Amazon Redshift ODBC driver, and Amazon Redshift RSQL. Additionally, it includes a .odbc.ini file that specifies the ETL profile used for connecting to the Amazon Redshift cluster.

FROM amazonlinux:2

ENV AMAZON_REDSHIFT_ODBC_VERSION=1.4.65.1000
ENV AMAZON_REDSHIFT_RSQL_VERSION=1.0.8

RUN yum install -y openssl gettext unixODBC awscli && \
    yum clean all

RUN rpm -i \
    https://s3.amazonaws.com/redshift-downloads/drivers/odbc/${AMAZON_REDSHIFT_ODBC_VERSION}/AmazonRedshiftODBC-64-bit-${AMAZON_REDSHIFT_ODBC_VERSION}-1.x86_64.rpm \
    https://s3.amazonaws.com/redshift-downloads/amazon-redshift-rsql/${AMAZON_REDSHIFT_RSQL_VERSION}/AmazonRedshiftRsql-${AMAZON_REDSHIFT_RSQL_VERSION}.x86_64.rpm

COPY .odbc.ini .odbc.ini
COPY fetch_and_run.sh /usr/local/bin/fetch_and_run.sh

ENV ODBCINI=.odbc.ini
ENV ODBCSYSINI=/opt/amazon/redshiftodbc/Setup
ENV AMAZONREDSHIFTODBCINI=/opt/amazon/redshiftodbc/lib/64/amazon.redshiftodbc.ini

ENTRYPOINT ["/usr/local/bin/fetch_and_run.sh"]

The following code snippet illustrates the .odbc.ini file, which defines an ETL profile that utilizes an AWS Identity and Access Management (IAM) role to obtain temporary cluster credentials for connecting to Amazon Redshift. Notably, AWS CDK automatically creates this role, eliminating the need to hard-code credentials into the Docker image. The parameters for Database, DbUser, and ClusterID are defined in AWS CDK, and the Region parameter is dynamically replaced at runtime based on the deployment location.

[ODBC]
Trace=no

[etl]
Driver=/opt/amazon/redshiftodbc/lib/64/libamazonredshiftodbc64.so
Database=demo
DbUser=etl
ClusterID=redshiftblogdemo
Region=eu-west-1
IAM=1

For further details regarding connecting to Amazon Redshift clusters using RSQL, see the documentation on connecting to a cluster with Amazon Redshift RSQL.
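As noted above, the Region value in the profile is replaced dynamically at runtime. The image installs gettext, so envsubst is one way to do this; the sketch below uses plain sed for portability. The placeholder token and file paths are illustrative assumptions, not the exact mechanism used by the framework.

```shell
# Hypothetical runtime substitution of the Region in the ODBC profile.
# __AWS_REGION__ is an assumed placeholder token; /tmp paths are for illustration.
AWS_REGION="eu-west-1"
printf '[etl]\nDriver=/opt/amazon/redshiftodbc/lib/64/libamazonredshiftodbc64.so\nRegion=__AWS_REGION__\nIAM=1\n' > /tmp/odbc.ini.template
sed "s/__AWS_REGION__/${AWS_REGION}/" /tmp/odbc.ini.template > /tmp/.odbc.ini
grep '^Region=' /tmp/.odbc.ini
```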

Our Docker image follows a well-established “fetch and run” integration pattern, which allows it to retrieve the ETL script from an external repository before executing it. AWS CDK passes the necessary ETL script information to the Docker container as a runtime parameter within the AWS Batch job. This job parameter is made available to the container through the environment variable named BATCH_SCRIPT_LOCATION. Additionally, the job expects the environment variables DATA_BUCKET_NAME, which contains the S3 data bucket’s name, and COPY_IAM_ROLE_ARN, which specifies the IAM role for the COPY command to load data into Amazon Redshift. These environment variables are automatically set by AWS CDK. The entry point for the Docker container is the fetch_and_run.sh script.

#!/bin/bash

# This script expects the following env variables to be set:
# BATCH_SCRIPT_LOCATION - full S3 path to RSQL script to run
# DATA_BUCKET_NAME - S3 bucket name with the data
# COPY_IAM_ROLE_ARN - IAM role ARN that will be used to copy the data from S3 to Redshift

PATH="/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:/usr/local/sbin"

if [ -z "${BATCH_SCRIPT_LOCATION}" ] || [ -z "${DATA_BUCKET_NAME}" ] || [ -z "${COPY_IAM_ROLE_ARN}" ]; then
    echo "BATCH_SCRIPT_LOCATION, DATA_BUCKET_NAME and COPY_IAM_ROLE_ARN environment variables must be set. Exiting."
    exit 1
fi

# Fetch the RSQL script from Amazon S3 and run it with the etl ODBC profile.
TMP_FILE=$(mktemp)
aws s3 cp "${BATCH_SCRIPT_LOCATION}" "${TMP_FILE}"
rsql -D etl -f "${TMP_FILE}"
exit $?
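The fail-fast guard at the top of the entrypoint can be exercised in isolation. The snippet below re-creates just that guard in a hypothetical local copy (the /tmp path and messages are illustrative) and shows that the script refuses to run when BATCH_SCRIPT_LOCATION is missing.

```shell
# Re-create only the env-var guard from fetch_and_run.sh for a local check.
cat > /tmp/guard_demo.sh <<'EOF'
#!/bin/bash
if [ -z "${BATCH_SCRIPT_LOCATION}" ]; then
  echo "BATCH_SCRIPT_LOCATION not set. Exiting." >&2
  exit 1
fi
echo "would fetch ${BATCH_SCRIPT_LOCATION} and run rsql"
EOF
chmod +x /tmp/guard_demo.sh

unset BATCH_SCRIPT_LOCATION
if /tmp/guard_demo.sh; then
  echo "unexpected success"
else
  echo "guard rejected the missing variable"
fi
```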

In summary, the serverless ETL framework for Amazon Redshift using RSQL, AWS Batch, and AWS Step Functions not only simplifies the ETL process but also ensures that it remains cost-effective and easy to manage.

