Establishing a Cost-Effective, Petabyte-Scale Lake House Using Amazon S3 Lifecycle Rules and Amazon Redshift Spectrum: Part 1

As data volumes continue to expand, organizations are increasingly challenged to manage long-term data retention—often necessitated by industry regulations—which can significantly impact storage costs. This challenge persists even for modern cloud-based data warehousing solutions like Amazon Redshift. The introduction of Amazon Redshift RA3 node types has facilitated the separation of compute from storage, enabling businesses to manage costs more effectively. By leveraging integration points such as Amazon Redshift Spectrum, various Amazon S3 storage classes, and additional S3 features, organizations can adhere to retention policies while keeping expenses manageable.

An enterprise client in Italy sought guidance from the AWS team on best practices for developing a data journey solution tailored to their sales data. The aim of this initial installment in our series is to provide comprehensive, step-by-step instructions and recommended practices for constructing an end-to-end data lifecycle management system that integrates an Amazon S3-based data lake house with Amazon Redshift. In the subsequent part, we will delve into further best practices for operating the solution, including implementing a sustainable monthly aging process, utilizing Amazon Redshift local tables to address common issues, and analyzing data access patterns through Amazon S3 access logs.

Amazon Redshift and Redshift Spectrum

At re:Invent 2019, AWS announced the new Amazon Redshift RA3 node types, which improve the cost efficiency of cloud data warehousing by separating compute from storage. Even so, we have encountered cases where regulatory requirements mandate retaining large volumes of historical data, often for 10–12 years or more. This cold data must also remain accessible to external services and applications, such as Amazon SageMaker for AI and machine learning training jobs, and at times it needs to be queried alongside the hot data in Amazon Redshift. For these scenarios, Redshift Spectrum is particularly beneficial, as it can be used in conjunction with Amazon S3 storage classes to improve total cost of ownership (TCO).

Redshift Spectrum allows users to query data stored in S3 buckets using existing application code and logic for data warehouse tables, enabling joins and unions between Amazon Redshift local tables and Amazon S3 data. It operates on a fleet of compute nodes managed by AWS, enhancing system scalability. To utilize it, users must define at least an external schema and an external table (unless these are already established in the AWS Glue Data Catalog). The Data Definition Language (DDL) statements employed to create an external table include a location attribute pointing to S3 buckets and prefixes that house the dataset, which can be formatted in common file types such as ORC, Parquet, AVRO, CSV, JSON, or plain text. We recommend using compressed and columnar formats like Apache Parquet, as they reduce storage usage and improve performance.

For our data catalog, we can utilize AWS Glue or an external Hive metastore; in this instance, we will use AWS Glue.
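
To make this concrete, the following is a minimal sketch of registering an external schema backed by the AWS Glue Data Catalog and an external table over Parquet files, executed through the Amazon Redshift Data API with boto3. The cluster identifier, database, user, Glue database name, schema and table names, role ARN, S3 prefix, and the abbreviated column list are all illustrative assumptions, not the exact DDL used in this post.

import boto3

# Execute DDL on the cluster through the Amazon Redshift Data API.
rsd = boto3.client("redshift-data")

create_schema_sql = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS taxispectrum
FROM DATA CATALOG
DATABASE 'taxischema'
IAM_ROLE 'arn:aws:iam::123456789012:role/BlogSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
"""

create_table_sql = """
-- Abbreviated column list; Parquet files under an assumed S3 prefix.
CREATE EXTERNAL TABLE taxispectrum.taxi_archive (
    vendorid              INT,
    lpep_pickup_datetime  TIMESTAMP,
    lpep_dropoff_datetime TIMESTAMP,
    fare_amount           DECIMAL(8,2)
)
STORED AS PARQUET
LOCATION 's3://rs-lakehouse-blog-post/extract_midterm/';
"""

for sql in (create_schema_sql, create_table_sql):
    rsd.execute_statement(
        ClusterIdentifier="redshift-lakehouse",  # assumed cluster name
        Database="dev",
        DbUser="awsuser",
        Sql=sql,
    )

Once the external schema and table exist, the archived data can be joined with Amazon Redshift local tables in ordinary SQL.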

S3 Lifecycle Rules

Amazon S3 offers a variety of storage classes, including S3 Standard, S3 Standard-IA (referred to as S3-IA below), S3 One Zone-IA, S3 Intelligent-Tiering, S3 Glacier, and S3 Glacier Deep Archive. For our scenario, data must remain queryable with high durability for five years, so we focus on S3 Standard and S3-IA for that period and reserve S3 Glacier for long-term storage (5–12 years). Retrieving data from S3 Glacier takes minutes even with expedited retrieval, which is not compatible with interactive query requirements. We can still use Glacier for very cold data by first restoring the Glacier archive to a temporary, readable copy in Amazon S3 before querying the data through an external table.
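
As a sketch of that restore step, the following boto3 call restores an archived object so it becomes temporarily readable again; the restored copy can then be queried or copied under the prefix backing the external table. The bucket name, object key, retention period, and retrieval tier are illustrative assumptions.

import boto3

s3 = boto3.client("s3")

# Restore an S3 Glacier archive to a temporary, readable copy of the object.
s3.restore_object(
    Bucket="rs-lakehouse-blog-post",
    Key="extract_longterm/green_tripdata_2019-01.parquet",  # hypothetical key
    RestoreRequest={
        "Days": 7,                                      # keep the temporary copy for 7 days
        "GlacierJobParameters": {"Tier": "Expedited"},  # minutes-level retrieval
    },
)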

S3 Glacier Select enables querying directly on data within S3 Glacier but only supports uncompressed CSV files. Given the goal of this post is to propose a cost-effective solution, we have chosen not to include it. If there are constraints requiring storage in CSV format rather than more efficient compressed formats like Parquet, Glacier Select might still be a viable option.

Excluding retrieval costs, S3-IA typically costs about 45% less than S3 Standard, and S3 Glacier is approximately 68% cheaper than S3-IA. For the latest pricing details, refer to Amazon S3 pricing.

S3 Intelligent-Tiering is not used because it transitions objects based on last access time, which resets every time we query the data. Instead, we employ S3 Lifecycle rules based on object creation time or on prefix and tag matching, which behave consistently regardless of data access patterns.
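
The following is a minimal sketch of such a prefix-based Lifecycle configuration, applied with boto3. Objects under the prefix move to S3-IA and later to S3 Glacier a fixed number of days after they are created in S3 (that is, after they are unloaded to the bucket, not after the trip date); the bucket name, prefix, and day thresholds are illustrative assumptions rather than the exact rules used in this series.

import boto3

s3 = boto3.client("s3")

# One prefix-based rule: transition to S3-IA, then to S3 Glacier,
# counted from the object's creation time in S3.
s3.put_bucket_lifecycle_configuration(
    Bucket="rs-lakehouse-blog-post",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "age-out-midterm-data",
                "Filter": {"Prefix": "extract_midterm/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 180, "StorageClass": "STANDARD_IA"},
                    {"Days": 270, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)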

Simulated Use Case and Retention Policy

Our use case requires the implementation of a data retention strategy for trip records, as outlined in the table below.

Corporate Rule                           | Dataset Start | Dataset End   | Data Storage Engine
Last 6 months in Amazon Redshift         | December 2019 | May 2020      | Amazon Redshift local tables
Months 6–11 in Amazon S3                 | June 2019     | November 2019 | S3 Standard
Months 12–14 in S3-IA                    | March 2019    | May 2019      | S3-IA
After month 15                           | January 2019  | February 2019 | S3 Glacier

For this post, we will create a new table in a new Amazon Redshift cluster and load a public dataset. We will utilize the New York City Taxi and Limousine Commission (TLC) Trip Record Data as it provides the necessary historical depth.

We will work with the Green Taxi Trip Records, based on monthly CSV files that contain 20 columns with fields like vendor ID, pickup time, drop-off time, fare, and other pertinent information.
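
As a sketch of the load step, the following creates a local table for the trip records and loads one monthly CSV file with COPY, again via the Amazon Redshift Data API. The cluster name, database, user, abbreviated column list, staging S3 path, and role ARN are illustrative assumptions; adapt them to your environment.

import boto3

rsd = boto3.client("redshift-data")

create_sql = """
CREATE TABLE IF NOT EXISTS greentaxi (
    vendorid              INT,
    lpep_pickup_datetime  TIMESTAMP,
    lpep_dropoff_datetime TIMESTAMP,
    fare_amount           DECIMAL(8,2)
    -- remaining columns of the 20-column schema omitted for brevity
);
"""

copy_sql = """
COPY greentaxi
FROM 's3://rs-lakehouse-blog-post/green_tripdata_2020-05.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/BlogSpectrumRole'
CSV IGNOREHEADER 1;
"""

for sql in (create_sql, copy_sql):
    rsd.execute_statement(
        ClusterIdentifier="redshift-lakehouse",  # assumed cluster name
        Database="dev",
        DbUser="awsuser",
        Sql=sql,
    )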

Preparing the Dataset and Setting Up the Amazon Redshift Environment

The first step involves creating an AWS Identity and Access Management (IAM) role for Redshift Spectrum. This role is essential for granting Amazon Redshift access to Amazon S3 for querying and loading data, as well as for enabling access to the AWS Glue Data Catalog when creating, modifying, or deleting new external tables.

Create a role named BlogSpectrumRole. Edit the following two JSON policy documents, adjusting the bucket names and prefixes to match your environment, and attach them to the role you created (a scripted sketch of these steps follows the policy documents):

S3-Lakehouse-Policy.JSON

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::rs-lakehouse-blog-post",
                "arn:aws:s3:::rs-lakehouse-blog-post/extract_longterm/*",
                "arn:aws:s3:::rs-lakehouse-blog-post/extract_midterm/*",
                "arn:aws:s3:::rs-lakehouse-blog-post/extract_shortterm/*",
                "arn:aws:s3:::rs-lakehouse-blog-post/accesslogs/*"
            ]
        }
    ]
}

Glue-Lakehouse-Policy.JSON

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "glue:CreateDatabase",
                "glue:DeleteDatabase",
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:UpdateDatabase",
                "glue:CreateTable",
                "glue:DeleteTable",
                "glue:GetTable",
                "glue:GetTables"
            ],
            "Resource": "*"
        }
    ]
}
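
A minimal boto3 sketch of creating the role and attaching the two policy documents above follows. The trust policy allowing Amazon Redshift to assume the role is an assumption not shown elsewhere in this post, and the policy file names simply mirror the headings above.

import json
import boto3

iam = boto3.client("iam")

# Trust policy letting Amazon Redshift assume the role (assumed, not shown above).
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "redshift.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

iam.create_role(
    RoleName="BlogSpectrumRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the two policy documents above as inline policies on the role.
for policy_name, file_name in [
    ("S3-Lakehouse-Policy", "S3-Lakehouse-Policy.JSON"),
    ("Glue-Lakehouse-Policy", "Glue-Lakehouse-Policy.JSON"),
]:
    with open(file_name) as f:
        iam.put_role_policy(
            RoleName="BlogSpectrumRole",
            PolicyName=policy_name,
            PolicyDocument=f.read(),
        )

After the role exists, associate it with the Amazon Redshift cluster so that Redshift Spectrum queries and COPY commands can assume it.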
