Best Practices for Building Enterprise-Grade Data Vaults on Amazon Redshift
Amazon Redshift stands out as a leading cloud data warehouse, delivering a fully managed service that integrates seamlessly with an organization’s Amazon Simple Storage Service (Amazon S3) data lake, real-time streams, machine learning (ML) workflows, transactional processes, and more—all while offering a price-performance ratio that is up to 7.9 times better than other cloud data warehouses.
Like all AWS services, Amazon Redshift is designed with the customer in mind, recognizing that there’s no universal solution when it comes to data models. This is why Amazon Redshift accommodates various data models, including Star Schemas, Snowflake Schemas, and Data Vault. This article outlines best practices for designing enterprise-grade Data Vaults of different scales using Amazon Redshift; the follow-up post in this two-part series will address the critical requirements for creating an enterprise-grade Data Vault and how Amazon Redshift meets these needs.
For those looking to easily maintain data lineage within the data warehouse, establish a source-system agnostic data model, or comply more effectively with GDPR regulations, implementing a Data Vault model will be beneficial. This post delves into considerations, best practices, and relevant features of Amazon Redshift that aid in constructing enterprise-grade Data Vaults. While developing a basic version of any system can often be straightforward, achieving enterprise-grade scale, security, resiliency, and performance necessitates a solid understanding of and adherence to proven best practices, as well as employing the appropriate tools and features in the right contexts.
Data Vault Overview
To begin, let’s briefly summarize the foundational premise and concepts of the Data Vault. Data models serve as a framework for organizing data within a data warehouse. Amazon Redshift supports several data models, with Star schemas and Data Vault being among the most popular.
Data Vault is more than a modeling methodology; it is an opinionated framework that provides guidelines and conventions for developers to follow, rather than leaving all decisions to their discretion. This approach is similar to what large enterprise frameworks like Spring or Micronaut offer when developing applications at scale. This structure is especially advantageous for extensive data warehouse projects, as it organizes your extract, load, and transform (ELT) pipeline and effectively addresses specific challenges within the data and pipeline contexts, allowing for a high degree of automation.
Data Vault 2.0 facilitates:
- Agile data warehouse development
- Parallel data ingestion
- A scalable approach to manage multiple data sources, even for the same entity
- Enhanced automation
- Historization (see the point-in-time sketch after this list)
- Comprehensive lineage support
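To make the historization point concrete, here is a minimal sketch of a point-in-time query against a hypothetical customer satellite; the table and column names (raw_vault.sat_customer, customer_hash_key, load_date) are illustrative assumptions, not taken from a specific implementation:

```sql
-- Satellites are append-only: each change lands as a new row stamped with load_date,
-- so any historical state can be reconstructed by filtering on that timestamp.
SELECT customer_hash_key, customer_name, customer_address, load_date
FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (
               PARTITION BY s.customer_hash_key
               ORDER BY s.load_date DESC) AS rn
    FROM raw_vault.sat_customer s
    WHERE s.load_date <= '2023-06-30'   -- the as-of date of interest
) AS latest
WHERE rn = 1;                           -- latest known record per customer at that date
```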
However, Data Vault 2.0 does come with drawbacks, and there are scenarios where it may not be ideal, such as:
- When you only have a few data sources with no related or overlapping data (for instance, a bank with a single core system)
- When you have straightforward reporting with infrequent changes
- When resources and knowledge of Data Vault are limited
Typically, Data Vault organizes an organization’s data into a pipeline consisting of four layers: staging, raw, business, and presentation. The staging layer handles data intake and minor transformations and enhancements prior to the data reaching its more permanent location, the Raw Data Vault (RDV).
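As a rough sketch of what staging intake can look like on Amazon Redshift, the following creates a staging table and bulk-loads CSV files from Amazon S3 with the COPY command; the bucket, IAM role, table, and column names are all hypothetical:

```sql
-- Hypothetical staging table; load_date records when each row arrived.
CREATE TABLE staging.customer_raw (
    customer_id   VARCHAR(64),
    customer_name VARCHAR(256),
    source_file   VARCHAR(256),
    load_date     TIMESTAMP DEFAULT GETDATE()
);

-- Load CSV files from S3; load_date is omitted from the column list,
-- so COPY applies its default.
COPY staging.customer_raw (customer_id, customer_name, source_file)
FROM 's3://example-bucket/crm/customers/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS CSV
IGNOREHEADER 1;
```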
The RDV stores a historized copy of all data from multiple source systems. It is termed “raw” because no filters or business transformations have taken place, except for the storage of data in source system-independent targets. The RDV categorizes data into three primary table types (a DDL sketch follows the list):
- Hubs – Representing core business entities, such as customers, each record in a hub table is paired with metadata detailing the record’s creation time, originating source system, and unique business key.
- Links – Defining relationships between two or more hubs—for example, how the customer hub relates to the order hub.
- Satellites – Capturing historized reference data about either hubs or links, such as product_info and customer_info.
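A minimal DDL sketch of the three table types follows. The MD5-style CHAR(32) hash keys, the distribution and sort keys, and all table and column names are illustrative choices, not something the methodology prescribes:

```sql
-- Hub: one row per unique business key, with load metadata.
CREATE TABLE raw_vault.hub_customer (
    customer_hash_key CHAR(32)    NOT NULL,  -- e.g., MD5 of the business key
    customer_id       VARCHAR(64) NOT NULL,  -- the business key itself
    load_date         TIMESTAMP   NOT NULL,
    record_source     VARCHAR(64) NOT NULL
)
DISTKEY (customer_hash_key)
SORTKEY (load_date);

-- Link: relates two (or more) hubs, here customer and order.
CREATE TABLE raw_vault.link_customer_order (
    customer_order_hash_key CHAR(32)    NOT NULL,  -- hash over both business keys
    customer_hash_key       CHAR(32)    NOT NULL,
    order_hash_key          CHAR(32)    NOT NULL,
    load_date               TIMESTAMP   NOT NULL,
    record_source           VARCHAR(64) NOT NULL
);

-- Satellite: historized descriptive attributes for a hub; append-only.
CREATE TABLE raw_vault.sat_customer (
    customer_hash_key CHAR(32)     NOT NULL,
    load_date         TIMESTAMP    NOT NULL,  -- one row per detected change
    record_source     VARCHAR(64)  NOT NULL,
    hash_diff         CHAR(32),               -- change-detection hash of the attributes
    customer_name     VARCHAR(256),
    customer_address  VARCHAR(512)
);
```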
The RDV feeds into the Business Data Vault (BDV), which reorganizes, denormalizes, and aggregates data for optimized consumption by the presentation mart. The presentation marts, also known as the data mart layer, further restructure the data for efficient use by downstream clients, such as business dashboards. These marts may, for example, be modeled as a star schema.
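For example, a presentation-mart dimension could be exposed as a Redshift materialized view over a hypothetical BDV table, so dashboards query a precomputed, star-schema-style structure; the table, view, and column names here are assumptions:

```sql
-- Star-schema-style dimension in the presentation mart, backed by the BDV.
CREATE MATERIALIZED VIEW presentation.dim_customer AS
SELECT b.customer_hash_key AS customer_key,
       b.customer_id,
       b.customer_name,
       b.customer_segment
FROM   business_vault.bdv_customer b;

-- Refresh after each BDV load so downstream clients see current data.
REFRESH MATERIALIZED VIEW presentation.dim_customer;
```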
For a more thorough treatment of Data Vault, including the use cases where it is most compelling, refer to the broader Data Vault 2.0 methodology literature.
How Does Data Vault Fit into a Modern Data Architecture?
Presently, the lake house paradigm is emerging as a leading design pattern in data warehousing, even within a data mesh architecture: data lakes are evolving toward the capabilities of data warehouses, and data warehouses toward those of data lakes. Data Vault is a good fit for matching the flexibility of a data lake on the warehouse side, ensuring that the data warehouse does not become a bottleneck and offering similar agility, flexibility, scalability, and adaptability when ingesting and onboarding new data.
Platform Flexibility
In this section, we will discuss recommended Amazon Redshift configurations for Data Vaults of varying scales. As previously mentioned, the layers within a Data Vault platform are well established. We typically observe a flow from the staging layer to the RDV, then the BDV, and finally the presentation mart.
Amazon Redshift is highly adaptable in supporting both modest and large-scale Data Vaults, offering features such as:
- Redshift Provisioned clusters, enabling customers to build data warehouse clusters with different node types and quantities to meet their cost and performance needs.
- Amazon Redshift Serverless, simplifying the execution of analytics workloads of any size without the burden of managing data warehouse infrastructure.
- Amazon Redshift data sharing, which allows live data to be shared across Redshift data warehouses (see the sketch after this list).
- Vertical scaling through cluster resizing.
- Horizontal scaling via concurrency scaling.
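As referenced in the data sharing item above, here is a minimal sketch of sharing a Raw Data Vault schema from a producer warehouse to a consumer warehouse; the share name, schema name, and namespace GUID placeholders are illustrative:

```sql
-- On the producer warehouse: publish the raw_vault schema as a datashare.
CREATE DATASHARE rdv_share;
ALTER DATASHARE rdv_share ADD SCHEMA raw_vault;
ALTER DATASHARE rdv_share ADD ALL TABLES IN SCHEMA raw_vault;
GRANT USAGE ON DATASHARE rdv_share TO NAMESPACE '<consumer-namespace-guid>';

-- On the consumer warehouse: mount the share as a local database for live queries.
CREATE DATABASE rdv_shared FROM DATASHARE rdv_share
    OF NAMESPACE '<producer-namespace-guid>';
```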
Modest vs. Large-Scale Data Vaults
Amazon Redshift allows flexibility in structuring these layers. For modest Data Vaults, a single Redshift warehouse with one database and multiple schemas suffices. For larger Data Vaults with more intricate transformations, multiple warehouses, each owning a schema of mastered data that represents one or more layers, are preferable. This approach leverages the flexibility of the Amazon Redshift architecture for large-scale Data Vault deployments, in particular using Redshift RA3 nodes and Redshift Serverless to separate compute from the data storage layer.
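For the modest-scale case, one database with a schema per layer might be laid out as follows (the schema names are illustrative); at larger scale, these same layers would instead live in separate warehouses wired together with data sharing, as sketched earlier:

```sql
-- One schema per Data Vault layer inside a single Redshift database.
CREATE SCHEMA staging;         -- landing zone and light enrichment
CREATE SCHEMA raw_vault;       -- hubs, links, and satellites (RDV)
CREATE SCHEMA business_vault;  -- business rules, denormalization, aggregation (BDV)
CREATE SCHEMA presentation;    -- marts consumed by dashboards and other clients
```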