Stream Real-Time Data into Apache Iceberg Tables on Amazon S3 Using Amazon Data Firehose

As organizations increasingly generate vast amounts of data from diverse sources, they require efficient systems to manage that data effectively for favorable business outcomes, such as enhancing customer experiences or minimizing operational costs. This trend is evident across various sectors—ranging from online media and gaming companies offering personalized recommendations to manufacturing facilities monitoring machinery for maintenance needs, and theme parks providing real-time wait times for popular attractions.

To support these applications, engineering teams are embracing two key trends. First, they are opting for real-time streaming over traditional batch data processing pipelines, enabling applications to gain insights and take action almost instantaneously instead of relying on daily or hourly batch extract, transform, and load (ETL) processes. Second, as conventional data warehousing methods struggle to cope with the high volume, velocity, and variety of data, teams are increasingly building data lakes and utilizing open data formats like Parquet and Apache Iceberg for data storage. Iceberg brings the reliability and simplicity of SQL tables into Amazon Simple Storage Service (Amazon S3) data lakes. By leveraging Iceberg for data storage, developers can utilize various analytics and machine learning frameworks, including Apache Spark, Apache Flink, Presto, Hive, or Impala, as well as AWS services like Amazon SageMaker, Amazon Athena, AWS Glue, Amazon EMR, Amazon Managed Service for Apache Flink, or Amazon Redshift.

Iceberg’s popularity stems from several factors: it is widely supported by various open-source frameworks and vendors, it allows concurrent read and write operations with different frameworks, and it enables features like time travel and rollback for querying historical data snapshots or reverting to previous versions. Additionally, Iceberg supports schema evolution, allowing the addition of new columns to tables without necessitating data rewrites or modifications to existing applications. For more information, refer to Apache Iceberg.
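To make these capabilities concrete, the following snippets show what time travel and schema evolution look like in Athena's SQL dialect for Iceberg tables. This is a hedged sketch: the table and column names are illustrative placeholders, not objects created by this walkthrough.

-- Query the table as it existed at a point in time (Athena time travel)
SELECT * FROM my_db.my_events
FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00 UTC';

-- Add a column without rewriting data or breaking existing readers
ALTER TABLE my_db.my_events ADD COLUMNS (session_id string);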

In this article, we will explain how to stream real-time data into Iceberg tables hosted on Amazon S3 using Amazon Data Firehose. This service simplifies the streaming process by enabling users to configure a delivery stream, select a data source, and designate Iceberg tables as the destination. Once configured, the Firehose stream is ready to deliver data. Firehose seamlessly integrates with over 20 AWS services, allowing real-time data delivery from sources such as Amazon Kinesis Data Streams, Amazon Managed Streaming for Apache Kafka, Amazon CloudWatch Logs, AWS IoT, AWS WAF, and AWS Network Firewall logs, or even your custom applications via the Firehose API. It's a cost-effective solution: because Firehose is serverless, you only pay for the data transmitted to and stored in your Iceberg tables. There are no costs associated with provisioning or during idle times, such as nights or weekends.
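For example, a custom application can push records to a Firehose stream with just a few SDK calls. The following is a minimal sketch using the AWS SDK for Python (boto3); the stream name firehose-iceberg-events is a placeholder for whatever you name your own stream, and the record fields match the example table schema used later in this post.

import json
import boto3

firehose = boto3.client("firehose")

# A sample event matching the schema of the destination Iceberg table
record = {
    "type": {"device": "mobile", "event": "view", "action": "click"},
    "customer_id": "customer-42",
    "event_timestamp": "2024-01-01T00:00:00Z",
}

# Firehose buffers the record and delivers it to the configured Iceberg table
firehose.put_record(
    DeliveryStreamName="firehose-iceberg-events",  # placeholder stream name
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)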

Firehose also simplifies the setup and execution of advanced scenarios. For instance, if you wish to route data to distinct Iceberg tables for better query performance or data isolation, you can configure a stream to automatically direct records into different tables based on the incoming data’s content. Firehose automatically scales, eliminating the need to plan for data allocation among tables, and it includes built-in mechanisms for handling delivery failures and ensuring exactly-once delivery. Additionally, Firehose supports record updates and deletions based on incoming data streams, which helps you comply with regulations like GDPR, including right-to-be-forgotten requests. Because Firehose is fully compatible with Iceberg, data can be written with it while other applications simultaneously read from and write to the same tables. Firehose also integrates with the AWS Glue Data Catalog, providing features like managed compaction for Iceberg tables.

In the upcoming sections, you’ll learn how to configure Firehose for real-time streaming into Iceberg tables to tackle four specific scenarios:

  1. Stream data into a single Iceberg table and insert all incoming records.
  2. Stream data into a single Iceberg table while performing record inserts, updates, and deletes.
  3. Route records to different tables based on the content of incoming data using a JSON Query expression.
  4. Route records to different tables based on incoming data content with the help of a Lambda function (a sketch of such a function follows this list).
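
For scenario 3, the routing expressions are configured directly on the stream, while scenario 4 uses a transformation Lambda function that attaches routing metadata to each record. The following Lambda sketch is hedged: it assumes Firehose's standard record-transformation contract (recordId, result, data) and an otfMetadata block naming the destination database, table, and operation; the table names and routing logic are placeholders.

import base64
import json

def lambda_handler(event, context):
    """Firehose record transformation: route each record to an Iceberg table by content."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))

        # Placeholder routing logic: choose the destination table from the record content
        table = "firehose_events_3" if payload["type"]["device"] == "mobile" else "firehose_events_4"

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(json.dumps(payload).encode("utf-8")).decode("utf-8"),
            # Metadata read by Firehose to select the destination table and operation;
            # "operation" can also be "update" or "delete" for scenario 2
            "metadata": {
                "otfMetadata": {
                    "destinationDatabaseName": "firehose_iceberg_db",
                    "destinationTableName": table,
                    "operation": "insert",
                }
            },
        })
    return {"records": output}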

You’ll also discover how to query the data sent to Iceberg tables using standard SQL queries in Amazon Athena. All AWS services featured in these examples are serverless, thus eliminating the need for infrastructure management.
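For instance, once records start arriving, a query like the following returns the most recent events from the example table created later in this walkthrough (the column references assume the schema shown in the DDL below):

SELECT customer_id, type.event, event_timestamp
FROM firehose_iceberg_db.firehose_events_1
ORDER BY event_timestamp DESC
LIMIT 10;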

Solution Overview

The architecture is illustrated in the following diagram.

For our examples, we utilize the Kinesis Data Generator, which is a sample application for generating and publishing data streams to Firehose. You can also configure Firehose to accept other data sources for real-time streams. We set up Firehose to deliver the stream into Iceberg tables within the Data Catalog.
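A Kinesis Data Generator template that produces records matching the example table schema might look like the following. This is a hedged sketch using KDG's Faker-style placeholders, not the exact template provisioned by the CloudFormation stack:

{
  "type": {
    "device": "{{random.arrayElement(["mobile", "desktop", "tablet"])}}",
    "event": "{{random.arrayElement(["view", "purchase", "signup"])}}",
    "action": "{{random.arrayElement(["click", "scroll"])}}"
  },
  "customer_id": "{{random.number(1000)}}",
  "event_timestamp": "{{date.now("YYYY-MM-DDTHH:mm:ss")}}"
}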

Walkthrough

This article includes an AWS CloudFormation template for a rapid setup. You can review and tailor it to your specifications. The template performs several operations (a minimal excerpt follows the list):

  • Creates a Data Catalog database for the destination Iceberg tables.
  • Creates four tables in the AWS Glue database configured for the Apache Iceberg format.
  • Specifies the S3 bucket locations for the destination tables.
  • Optionally creates a Lambda function.
  • Sets up an AWS Identity and Access Management (IAM) role for Firehose.
  • Creates resources for Kinesis Data Generator.
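
As an illustration, the Data Catalog database from the first bullet can be declared in CloudFormation as follows. This is a hedged excerpt, not the full template from this post; the database name matches the DDL example below, but the resource's logical name is arbitrary.

Resources:
  FirehoseIcebergDatabase:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: firehose_iceberg_db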

Prerequisites

Before commencing this walkthrough, you should have the following prerequisites:

  • An AWS account. If you don’t have one, you can create one.

Deploy the Solution

To begin, you need to deploy the required resources into your AWS environment using the CloudFormation template.

  1. Sign in to the AWS Management Console and open the CloudFormation console.
  2. Choose Launch Stack.
  3. Choose Next.
  4. Keep the stack name as Firehose-Iceberg-Stack, and enter your desired username and password for accessing Kinesis Data Generator.
  5. Scroll down the page, acknowledge that AWS CloudFormation may create IAM resources, and choose Next.
  6. Review the deployment and choose Submit.

The stack may take 5–10 minutes to complete, after which you can observe the deployed stack on the CloudFormation console. The following figure illustrates the details of the deployed Firehose-Iceberg stack.
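
If you prefer to script the deployment rather than use the console, the same stack can be created with the AWS SDK. The following boto3 sketch is hedged: the template URL and the parameter names (Username, Password) are assumptions based on the walkthrough above, not values confirmed by the template.

import boto3

cfn = boto3.client("cloudformation")

cfn.create_stack(
    StackName="Firehose-Iceberg-Stack",
    # Placeholder URL; point this at wherever the template is hosted
    TemplateURL="https://example-bucket.s3.amazonaws.com/firehose-iceberg.yaml",
    Capabilities=["CAPABILITY_IAM"],  # the stack creates IAM roles
    Parameters=[
        {"ParameterKey": "Username", "ParameterValue": "kdg-user"},
        {"ParameterKey": "Password", "ParameterValue": "choose-a-strong-password"},
    ],
)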

Before setting up Firehose to deliver streams, you need to create destination tables in the Data Catalog. For the examples provided here, the CloudFormation template automatically creates the tables. For your own applications, you can create tables with CloudFormation or with DDL commands in Athena or AWS Glue. Below is the DDL command for creating one of the tables used in our example; note that, in Athena, an Iceberg table also requires an S3 location and the table_type table property:

CREATE TABLE firehose_iceberg_db.firehose_events_1 (
  type struct<device: string, event: string, action: string>,
  customer_id string,
  event_timestamp timestamp
)
LOCATION 's3://amzn-s3-demo-bucket/firehose-iceberg/firehose_events_1/' -- replace with your own S3 path
TBLPROPERTIES ('table_type' = 'ICEBERG');

