Constructing a Serverless Transactional Data Lake with Apache Iceberg, Amazon EMR Serverless, and Amazon Athena

In the wake of the big data explosion over the past decade, organizations have adapted by developing applications capable of processing and analyzing vast amounts of information. Data lakes have emerged as crucial repositories for storing both structured and unstructured data at scale and in diverse formats. However, as the complexity of data processing solutions increases, organizations are compelled to enhance their data lakes with additional features. A key requirement is the ability to execute various workloads, such as business intelligence (BI), machine learning (ML), data science, and Change Data Capture (CDC) of transactional data, all while avoiding the need to maintain multiple data copies. Furthermore, managing and maintaining files within the data lake can be both tedious and complex.

Table formats like Apache Iceberg address these challenges. They facilitate transactions atop data lakes, streamlining data storage, management, ingestion, and processing. These transactional data lakes blend the functionalities of both data lakes and data warehouses, allowing organizations to simplify their data strategies by executing multiple workloads and applications using the same data in a unified location. However, leveraging these formats necessitates building, maintaining, and scaling infrastructure and integration connectors, which can be time-consuming and costly.

In this article, we will demonstrate how to create a serverless transactional data lake using Apache Iceberg on Amazon Simple Storage Service (Amazon S3), utilizing Amazon EMR Serverless and Amazon Athena. We will provide a practical example focused on data ingestion and querying within an e-commerce sales data lake.

Overview of Apache Iceberg

Iceberg is an open-source table format that extends the capabilities of SQL tables to big data files. It supports ACID transactions, enabling concurrent data ingestion, updates, and queries, all through familiar SQL syntax. Iceberg incorporates internal metadata management to track data effectively and unlock a suite of robust features at scale. These features include time travel capabilities, schema evolution control, efficient data compaction, and hidden partitioning for expedited queries.

Iceberg manages files on behalf of the user, facilitating use cases such as:

  • Concurrent data ingestion and querying, including streaming and CDC
  • BI and reporting utilizing expressive, simple SQL
  • Empowering ML feature stores and training datasets
  • Compliance workloads, such as GDPR’s right to be forgotten
  • Handling late-arriving data, such as supplementary information that arrives after the fact
  • Monitoring data changes and enabling rollback
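Several of the capabilities above map directly onto SQL when Iceberg's Spark extensions are enabled. The following sketch is illustrative only: the catalog, table, snapshot ID, and column names are hypothetical, and it assumes an Iceberg table is already registered in the catalog.

```sql
-- GDPR-style row-level delete: Iceberg rewrites only the affected data files.
DELETE FROM glue_catalog.sales.web_sales WHERE customer_id = 12345;

-- Time travel: query the table as it existed before the delete.
SELECT COUNT(*)
FROM glue_catalog.sales.web_sales TIMESTAMP AS OF '2023-06-01 00:00:00';

-- Roll back to a previous snapshot (Spark stored-procedure syntax;
-- the snapshot ID here is a placeholder).
CALL glue_catalog.system.rollback_to_snapshot('sales.web_sales', 8744736658442914487);
```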

Constructing Your Transactional Data Lake on AWS

You can develop a modern data architecture featuring a scalable data lake that integrates seamlessly with an Amazon Redshift-powered cloud warehouse. Many organizations seek an architecture that melds the advantages of both data lakes and data warehouses within the same storage environment. The architecture diagram illustrates a comprehensive approach utilizing AWS to establish a fully functional transactional data lake. AWS offers a flexible and extensive feature set for data ingestion, AI and ML application development, and analytics workloads, all while minimizing operational complexity.

Data can be organized into three distinct zones: the raw zone for unprocessed data captured directly from sources, a transformed zone hosting cleaned data for various teams and use cases, and a business zone storing application-specific data aggregated from the transformed zone. Iceberg provides a table format in the transformed zone on Amazon S3, offering ACID transactions and facilitating efficient file management and time travel capabilities.
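As a concrete example of a transformed-zone table, the following DDL sketch registers an Iceberg table on Amazon S3 with hidden partitioning on the sale timestamp. The catalog, database, bucket, and column names are assumptions for illustration.

```sql
-- Hidden partitioning: writers and readers filter on sold_at directly;
-- Iceberg derives the daily partition without a separate partition column.
CREATE TABLE glue_catalog.transformed.web_sales (
  order_id     BIGINT,
  customer_id  BIGINT,
  item_id      BIGINT,
  net_profit   DECIMAL(7,2),
  sold_at      TIMESTAMP)
USING iceberg
PARTITIONED BY (days(sold_at))
LOCATION 's3://my-datalake-bucket/transformed/web_sales/';
```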

A critical element of a successful data strategy is robust data governance. On AWS, organizations can implement a comprehensive governance framework with fine-grained access control through AWS Lake Formation.

Overview of Serverless Architecture

In this section, we will outline how to ingest and query data within your transactional data lake in just a few steps. EMR Serverless allows data analysts and engineers to run Spark-based analytics without configuring or managing clusters or servers. You can execute Spark applications without the need to plan capacity or provision infrastructure, paying only for your actual usage. EMR Serverless natively supports Iceberg for table creation and allows querying, merging, and inserting data using Spark. The architecture diagram demonstrates how Spark transformation jobs can load data from the raw zone or source, apply cleaning and transformation logic, and ingest data into Iceberg tables in the transformed zone. Spark code can start running within seconds on an EMR Serverless application, which we will illustrate later in this post.
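A CDC-style ingestion step inside such a Spark job can be expressed as a single atomic MERGE. This is a hedged sketch: the target table, the `raw_updates` staging view, and its `op` change-type column are hypothetical names, not part of the CloudFormation stack described later.

```sql
-- Upsert change records from a raw-zone staging view into the Iceberg table.
-- Iceberg commits the merge as one snapshot, so concurrent readers never
-- see a half-applied batch.
MERGE INTO glue_catalog.transformed.web_sales AS t
USING raw_updates AS s
  ON t.order_id = s.order_id
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```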

The Iceberg table is synchronized with the AWS Glue Data Catalog, a centralized repository for managing schema and metadata. With Iceberg, the ingestion, updating, and querying processes can leverage atomicity, snapshot isolation, and concurrency management, ensuring a consistent view of data.
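Wiring Spark to the Data Catalog is a matter of job configuration. The fragment below shows typical Iceberg-on-Glue settings passed to an EMR Serverless Spark job; the catalog name `glue_catalog` and the warehouse bucket are illustrative choices.

```
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
--conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
--conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
--conf spark.sql.catalog.glue_catalog.warehouse=s3://my-datalake-bucket/transformed/
```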

Athena is a serverless interactive analytics service built on open-source frameworks that supports open table and file formats. It offers a streamlined and flexible method for analyzing petabytes of data where it resides. To facilitate BI and reporting, Athena allows for native querying on Iceberg tables and integrates with various BI tools.
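Because the table lives in the Data Catalog, Athena can query it with no additional setup, including time travel over Iceberg snapshots. The table and column names below are illustrative, matching the hypothetical table sketched earlier.

```sql
-- Athena query over the same Iceberg table, pinned to a point in time.
SELECT item_id, SUM(net_profit) AS profit
FROM transformed.web_sales
FOR TIMESTAMP AS OF TIMESTAMP '2023-06-01 00:00:00 UTC'
GROUP BY item_id
ORDER BY profit DESC
LIMIT 10;
```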

Sales Data Model

Star schemas and their variants are commonly employed for data modeling in data warehouses, consisting of fact tables and dimension tables. The fact table captures core transactional data from the business logic and includes foreign keys linking to dimension tables, which provide additional contextual information.

In this article, we will examine sales data derived from the TPC-DS benchmark, focusing on a subset of the schema featuring the web_sales fact table. This table contains numerical data regarding sales costs, shipping costs, taxes, and net profits, along with foreign keys to dimension tables such as date_dim, time_dim, customer, and item. These dimension tables enrich the fact table by providing detailed records, enabling insights into when a sale occurred, which customer made it, and which item was involved.
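A typical query over this model joins the fact table to its dimensions through the foreign keys. The following TPC-DS-style star join uses the benchmark's actual column names (`ws_sold_date_sk`, `ws_item_sk`, and so on) and is a representative sketch rather than one of the official benchmark queries.

```sql
-- Yearly net profit by item category: enrich the web_sales fact table
-- with the date_dim and item dimension tables.
SELECT d.d_year,
       i.i_category,
       SUM(ws.ws_net_profit) AS total_profit
FROM web_sales ws
JOIN date_dim d ON ws.ws_sold_date_sk = d.d_date_sk
JOIN item i     ON ws.ws_item_sk = i.i_item_sk
GROUP BY d.d_year, i.i_category
ORDER BY d.d_year, total_profit DESC;
```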

Dimension-based models have been widely utilized to construct data warehouses. In the subsequent sections, we will detail how to implement such a model on top of Iceberg, enhancing your data lake with data warehousing capabilities while supporting various workloads in the same environment. We will present a comprehensive example of building a serverless architecture, including data ingestion with EMR Serverless and Athena, along with TPC-DS queries.

Prerequisites

Before embarking on this walkthrough, ensure you have the following:

  • An AWS account
  • Basic familiarity with data management and SQL

Deploying Solution Resources with AWS CloudFormation

We provide an AWS CloudFormation template to deploy the data lake stack, which includes the following resources:

  • Two S3 buckets: one for scripts and query results, and another for data lake storage
  • An Athena workgroup
  • An EMR Serverless application
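For orientation, a stack covering these resources might look like the fragment below. This is a hedged sketch only; the resource names, workgroup name, and EMR release label are assumptions, and the actual template deployed with this post defines additional properties and parameters.

```yaml
Resources:
  DataLakeBucket:
    Type: AWS::S3::Bucket
  AthenaWorkGroup:
    Type: AWS::Athena::WorkGroup
    Properties:
      Name: datalake-workgroup
      WorkGroupConfiguration:
        ResultConfiguration:
          OutputLocation: !Sub s3://${DataLakeBucket}/athena-results/
  EmrServerlessApp:
    Type: AWS::EMRServerless::Application
    Properties:
      Name: iceberg-datalake
      ReleaseLabel: emr-6.9.0
      Type: Spark
```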

