Enhancing Enterprise-Grade Data Vaults with Amazon Redshift – Part 2


Amazon Redshift has emerged as a leading cloud data warehouse: a fully managed service that integrates with an organization’s Amazon Simple Storage Service (Amazon S3) data lake, real-time streams, machine learning (ML) workflows, transactional workloads, and more, while delivering up to 7.9x better price-performance than other cloud data warehouses.

Like all AWS services, Amazon Redshift is designed around customer needs, recognizing that there is no one-size-fits-all answer when it comes to data models. It therefore supports multiple data modeling techniques, including Star Schemas, Snowflake Schemas, and Data Vault. This article examines critical considerations for designing an enterprise-grade Data Vault and how Amazon Redshift and the AWS Cloud address them. The first installment of this two-part series covered best practices for building scalable Data Vaults with Amazon Redshift.

Organizations that want to preserve data lineage in their data warehouse, build a source-system-agnostic data model, or meet GDPR requirements will find valuable guidance here. This discussion highlights the key considerations, best practices, and Amazon Redshift features that go into building enterprise-grade Data Vaults. Creating an initial version of a Data Vault may be straightforward, but scaling it to meet enterprise demands for security, resilience, and performance requires a solid grasp of proven best practices and the right use of tools and features.

Data Vault Overview

For an overview of the fundamental concepts of Data Vault, please refer to the first post in this series.

In this section, we will examine the most critical areas to consider for large-scale Data Vault implementations: data protection, performance and elasticity, analytical capabilities, cost and resource management, availability, and scalability. Though these factors are also vital for any data warehouse model, they present unique challenges when implementing Data Vaults at scale.

Data Protection

At AWS, security is our top priority, and we recognize the same commitment in our customers. Data security encompasses numerous layers, including encryption for data at rest and in transit, along with precise access controls. This section will explore the common data security requirements for raw and business Data Vaults and the Amazon Redshift features that meet these needs.

Data Encryption

By default, Amazon Redshift encrypts data in transit. With just a click, you can also configure encryption of data at rest for the entire lifecycle of your data warehouse, using either AWS Key Management Service (AWS KMS) or a hardware security module (HSM) to manage encryption keys. With AWS KMS, you can choose between an AWS managed key and a customer managed key. For more details, refer to Amazon Redshift database encryption.

Furthermore, you can adjust cluster encryption settings even after cluster creation. Notably, Amazon Redshift Serverless is encrypted by default.

Fine-Grained Access Controls

For effective fine-grained access control, Data Vaults often require both static and dynamic measures. Static access controls let you restrict access to databases, tables, rows, and columns for specific users, groups, or roles. Dynamic access controls mask parts of a data item, such as a column, at query time based on the user’s role or privileges.

Amazon Redshift has long supported static access controls through GRANT and REVOKE commands at the database, schema, table, and column level. It also offers row-level security, which restricts which rows of otherwise visible columns a user can see, and role-based access control (RBAC), which simplifies the management of security privileges.

Here’s an example demonstrating how to implement static access control in Amazon Redshift:

-- Create the credit_cards table
CREATE TABLE credit_cards (
    customer_id INT,
    is_fraud BOOLEAN,
    credit_card TEXT
);

-- Populate the table with sample values
INSERT INTO credit_cards 
VALUES
(100,'n', '453299ABCDEF4842'),
(100,'y', '471600ABCDEF5888'),
(102,'n', '524311ABCDEF2649'),
(102,'y', '601172ABCDEF4675'),
(102,'n', '601137ABCDEF9710'),
(103,'n', '373611ABCDEF6352');

-- Create user
CREATE USER user1 WITH PASSWORD '1234Test!';

-- Check access permissions for user1 on credit_cards table
SET SESSION AUTHORIZATION user1; 
SELECT * FROM credit_cards; -- This will return a permission error

-- Grant SELECT access on the credit_cards table to user1
RESET SESSION AUTHORIZATION;
GRANT SELECT ON credit_cards TO user1;

-- Verify access permissions on the table credit_cards for user1
SET SESSION AUTHORIZATION user1;
SELECT * FROM credit_cards; -- Query will return rows
RESET SESSION AUTHORIZATION; 
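
Building on the same table, row-level security and RBAC can further restrict which rows user1 sees. The following is an illustrative sketch rather than part of the original example: the fraud_analyst role and the fraud_rows_only policy are assumed names.

-- Create a role and grant it to user1 (role-based access control)
CREATE ROLE fraud_analyst;
GRANT ROLE fraud_analyst TO user1;

-- Define a row-level security policy that exposes only rows flagged as fraud
CREATE RLS POLICY fraud_rows_only
WITH (is_fraud BOOLEAN)
USING (is_fraud = true);

-- Attach the policy to the credit_cards table for the fraud_analyst role
ATTACH RLS POLICY fraud_rows_only ON credit_cards TO ROLE fraud_analyst;

-- Turn on row-level security for the table
ALTER TABLE credit_cards ROW LEVEL SECURITY ON;

-- user1 now sees only the rows where is_fraud is true
SET SESSION AUTHORIZATION user1;
SELECT * FROM credit_cards;
RESET SESSION AUTHORIZATION;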

Data Obfuscation

Static access controls effectively establish boundaries around which user communities can access which datasets. Sometimes, however, only part of a field needs to be hidden rather than the entire column. Amazon Redshift supports partial, complete, or custom data masking through dynamic data masking, which controls how sensitive data is presented to users at query time without changing the data stored in the database.

In the example below, we demonstrate how to achieve full redaction of credit card numbers at runtime using a masking policy on the credit_cards table:

-- Create a masking policy that fully redacts the credit card number
CREATE MASKING POLICY mask_credit_card_full
WITH (credit_card VARCHAR(256))
USING ('000000XXXX0000'::TEXT);

-- Attach mask_credit_card_full to the credit_cards table as the default policy
ATTACH MASKING POLICY mask_credit_card_full 
ON credit_cards(credit_card) TO PUBLIC; 

-- Users will see credit card information being masked in the following query
SELECT * FROM credit_cards; 
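
Partial masking follows the same pattern. The sketch below reveals only the last four digits of the card number to members of a specific role; the fraud_analyst role (from the earlier sketch) and the PRIORITY value are assumptions for illustration, and the higher-priority policy overrides the default PUBLIC policy for members of that role.

-- Create a partial masking policy that reveals only the last four characters
CREATE MASKING POLICY mask_credit_card_partial
WITH (credit_card VARCHAR(256))
USING ('XXXXXXXXXXXX' || SUBSTRING(credit_card, 13, 4));

-- Attach it to the fraud_analyst role with a higher priority than the PUBLIC default
ATTACH MASKING POLICY mask_credit_card_partial
ON credit_cards(credit_card) TO ROLE fraud_analyst
PRIORITY 10;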

Centralized Security Policies

Combining static and dynamic access controls lets you manage security efficiently across diverse user communities, datasets, and access scenarios. When datasets are shared across multiple Redshift warehouses, however, it becomes vital to define security policies centrally so that they are applied consistently wherever the data is consumed.
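
Amazon Redshift data sharing supports this pattern: a producer warehouse decides which Data Vault objects are exposed to consumer warehouses, so what is shared is governed in one place. A minimal sketch follows; the datashare name and the consumer namespace placeholder are assumptions for illustration.

-- On the producer warehouse: create a datashare and add objects to it
CREATE DATASHARE vault_share;
ALTER DATASHARE vault_share ADD SCHEMA public;
ALTER DATASHARE vault_share ADD TABLE public.credit_cards;

-- Grant the datashare to a consumer warehouse's namespace (placeholder ID)
GRANT USAGE ON DATASHARE vault_share TO NAMESPACE '<consumer-namespace-id>';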


