on 18 DEC 2023
in Amazon RDS, Amazon Redshift, AWS Big Data
Amazon Redshift’s data-sharing capability offers a secure and efficient means to share live data for read operations across Amazon Redshift clusters. As a fast, fully managed cloud data warehouse, Amazon Redshift makes it simple and cost-effective to analyze large volumes of data using standard SQL and existing business intelligence (BI) tools. It enables complex analytic queries on terabytes to petabytes of structured data, leveraging advanced query optimization, columnar storage on high-performance platforms, and massively parallel query execution.
In this article, we explore how to utilize Amazon Redshift’s data-sharing feature to achieve workload isolation across various analytics scenarios while meeting critical business service level agreements (SLAs). For further details about this feature, check out the announcement of Amazon Redshift data sharing.
Utilizing Amazon Redshift Data Sharing
Amazon Redshift data sharing allows a producer cluster to expose data objects to one or more Amazon Redshift consumer clusters for read-only access, eliminating the need for data duplication. This lets isolated workloads share and collaborate on data more frequently, fostering innovation and offering valuable analytic services to internal and external stakeholders. You can share data at multiple levels, including databases, schemas, tables, views, and SQL user-defined functions, allowing fine-grained access controls that can be tailored for the different users and businesses that need access to Amazon Redshift data.
The data-sharing process between Amazon Redshift clusters consists of two main steps. First, the administrator of the producer cluster wishing to share data creates an Amazon Redshift data share, a newly introduced named object that serves as a sharing unit. The producer cluster then adds necessary database objects such as schemas, tables, and views to this data share and specifies a list of consumer clusters with which to share it. Subsequently, authorized users on the consumer clusters create a local database reference from the accessible data share and assign permissions on the database objects to relevant users and groups. Users can then list the shared objects through standard metadata queries and begin querying immediately.
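The two-step flow described above can be sketched in SQL. This is a minimal, hedged illustration: the object names (`mydatashare`, `myschema`, `mydb`, `bi_users`) and the namespace GUID placeholders are illustrative, not values from this post.

```sql
-- Step 1 (producer cluster): create the data share, add objects, and
-- authorize a consumer cluster by its namespace GUID.
CREATE DATASHARE mydatashare;
ALTER DATASHARE mydatashare ADD SCHEMA myschema;
ALTER DATASHARE mydatashare ADD ALL TABLES IN SCHEMA myschema;
GRANT USAGE ON DATASHARE mydatashare TO NAMESPACE '<consumer-namespace-guid>';

-- Step 2 (consumer cluster): create a local database reference from the
-- share and grant access to the relevant users or groups.
CREATE DATABASE mydb FROM DATASHARE mydatashare
    OF NAMESPACE '<producer-namespace-guid>';
GRANT USAGE ON DATABASE mydb TO GROUP bi_users;
```

Consumers can then discover the shared objects through standard metadata queries and query them immediately.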
Solution Overview
To illustrate this, let’s consider a scenario where the producer cluster is a central ETL cluster housing enterprise sales data, specifically a 3 TB Cloud DW benchmark dataset based on the TPC-DS benchmark. This cluster caters to multiple BI and data science clusters designed for distinct business units within the organization. One such unit is the sales BI team, which generates BI reports using customer sales data from the central ETL cluster, combined with the product reviews dataset they manage in their own BI cluster.
This arrangement enables the sales BI team to maintain data lifecycle management independently between the enterprise sales dataset in the ETL producer and the product reviews data in the BI consumer cluster, simplifying data stewardship. It also enhances operational agility, allows for independent cluster sizing to ensure workload isolation, and establishes a straightforward cost charge-back model.
As depicted in the accompanying diagram, the central ETL cluster, named `etl_cluster`, hosts the sales data in a schema called `sales`. A superuser in `etl_cluster` creates a data share named `salesdatashare`, adds the `bi_semantic` schema and all objects within it to the data share, and grants usage permissions to the BI consumer cluster named `bi_cluster`. It is essential to note that a data share is merely a metadata container, representing the data shared from producer to consumer; no actual data is moved.

The superuser in the BI consumer cluster then creates a local database reference named `sales_semantic` from the data share. BI users load the product reviews dataset into the local schema named `product_reviews` and join it with the `bi_semantic` data for reporting. You can find the script for the product reviews dataset used in this post to load that dataset into `bi_cluster`. Instructions for loading the Cloud DW benchmark dataset into `etl_cluster` can be found via this GitHub link. Loading these datasets into the respective Amazon Redshift clusters is a prerequisite for the following instructions.
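As a sketch of that prerequisite load step, a Redshift `COPY` into `bi_cluster` might look like the following. The table name, S3 path, IAM role, and file format here are hypothetical placeholders, not values taken from this post's scripts:

```sql
-- Illustrative only: load a Parquet extract of the product reviews
-- dataset into the local product_reviews schema on bi_cluster.
COPY product_reviews.amazon_reviews
FROM 's3://<your-bucket>/product-reviews/'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-load-role>'
FORMAT AS PARQUET;
```

The actual script referenced above may use a different table layout or file format.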
Dataset Summary
| Table Name | Rows |
|---|---|
| STORE_SALES | 8,639,936,081 |
| CUSTOMER_ADDRESS | 15,000,000 |
| CUSTOMER | 30,000,000 |
| CUSTOMER_DEMOGRAPHICS | 1,920,800 |
| ITEM | 360,000 |
| DATE_DIM | 73,049 |
Creating a BI Semantic Layer
A BI semantic layer provides a simplified representation of enterprise data to improve the efficiency and performance of BI reporting. In our example, the BI semantic layer transforms sales data to generate a denormalized customer dataset and another dataset for store sales by product over a given year. The following queries are executed on `etl_cluster` to create this BI semantic layer.
First, create a new schema for the BI semantic tables:
CREATE SCHEMA bi_semantic;
Next, create a denormalized customer view with select columns required for the sales BI team:
CREATE VIEW bi_semantic.customer_denorm AS
SELECT
c_customer_sk,
c_customer_id,
c_birth_year,
c_birth_country,
c_last_review_date_sk,
ca_city,
ca_state,
ca_zip,
ca_country,
ca_gmt_offset,
cd_gender,
cd_marital_status,
cd_education_status
FROM sales.customer c
JOIN sales.customer_address ca ON c.c_current_addr_sk=ca.ca_address_sk
JOIN sales.customer_demographics cd ON c.c_current_cdemo_sk=cd.cd_demo_sk;
Next, create a second view for product sales with relevant columns for the BI team:
CREATE VIEW bi_semantic.product_sales AS
SELECT
i_item_id,
i_product_name,
i_current_price,
i_wholesale_cost,
i_brand_id,
i_brand,
i_category_id,
i_category,
i_manufact,
d_date,
d_moy,
d_year,
d_quarter_name,
ss_customer_sk,
ss_store_sk,
ss_sales_price,
ss_list_price,
ss_net_profit,
ss_quantity,
ss_coupon_amt
FROM sales.store_sales ss
JOIN sales.item i ON ss.ss_item_sk=i.i_item_sk
JOIN sales.date_dim d ON ss.ss_sold_date_sk=d.d_date_sk;
Sharing Data Across Amazon Redshift Clusters
Now, let’s share the `bi_semantic` schema from `etl_cluster` with `bi_cluster`.
To create a data share in `etl_cluster`, execute the following command while connected to that cluster. Superusers and database owners can create data share objects. By default, `PUBLICACCESSIBLE` is set to false; if the producer cluster is publicly accessible, you can add `PUBLICACCESSIBLE = TRUE` to the command:
CREATE DATASHARE SalesDatashare;
Next, add the BI semantic views to this data share. To do so, add the schema before adding its objects. You can use `ALTER DATASHARE` to share an entire schema, or to share specific tables, views, and functions from multiple schemas.
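Assuming the share and schema names used earlier in this post, the commands might look like the following sketch; the consumer namespace GUID is a placeholder you would replace with the actual value for `bi_cluster`:

```sql
-- Add the schema, then all of its objects, to the data share.
ALTER DATASHARE SalesDatashare ADD SCHEMA bi_semantic;
ALTER DATASHARE SalesDatashare ADD ALL TABLES IN SCHEMA bi_semantic;

-- Authorize the BI consumer cluster by its namespace GUID.
GRANT USAGE ON DATASHARE SalesDatashare
    TO NAMESPACE '<bi_cluster-namespace-guid>';
```

A superuser in `bi_cluster` can then create the local `sales_semantic` database reference from this share, as described earlier.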
With this guide, we hope to provide a clear path for securely sharing Amazon Redshift data across clusters, facilitating workload isolation and enhancing your analytical capabilities.