Integrating Streaming and Analytical Data with Amazon Data Firehose and Amazon SageMaker Lakehouse

In today’s fast-paced environment, organizations are increasingly tasked with extracting real-time insights from their data while still being able to conduct thorough analytics. This dual challenge raises a critical question: how can companies effectively connect streaming data with analytical workloads without building complex, hard-to-maintain data pipelines? This post shows how Amazon Data Firehose simplifies this process by delivering streaming data directly to Apache Iceberg tables in Amazon SageMaker Lakehouse, creating a streamlined pipeline that reduces both complexity and maintenance effort.

Streaming data plays a pivotal role in enabling AI and machine learning (ML) models to adapt and learn in real time, which is vital for applications that need instant insights or rapid responses to shifting conditions. This opens new avenues for business agility and innovation. Notable use cases include forecasting equipment failures using sensor data, monitoring supply chain activity in real time, and allowing AI applications to dynamically react to changes. Real-time streaming data empowers customers to make swift decisions, fundamentally transforming how businesses operate in competitive markets.

Amazon Data Firehose seamlessly acquires, transforms, and delivers data streams to lakehouses, data lakes, data warehouses, and analytics services, automatically scaling and delivering data within seconds. The lakehouse architecture has emerged as an effective solution for analytical workloads, merging the advantages of both data lakes and data warehouses. Apache Iceberg, an open table format, facilitates this transformation by providing transactional guarantees, schema evolution, and efficient metadata management that were once exclusive to traditional data warehouses. SageMaker Lakehouse unifies your data across Amazon Simple Storage Service (Amazon S3) data lakes, Amazon Redshift data warehouses, and other sources, granting you the flexibility to access data in place using Iceberg-compatible tools. By leveraging SageMaker Lakehouse, organizations can exploit Iceberg’s capabilities while enjoying the scalability and flexibility of cloud solutions. This integration eliminates the traditional barriers between data storage and ML processes, enabling data professionals to work directly with Iceberg tables in their preferred tools and notebooks.

In this post, we will demonstrate how to create Iceberg tables in Amazon SageMaker Unified Studio and stream data to these tables via Firehose. This integration allows data engineers, analysts, and scientists to collaborate effectively and build comprehensive analytics and ML workflows within SageMaker Unified Studio, breaking down traditional silos and hastening the progression from data ingestion to production ML models.

Solution Overview

The following diagram illustrates how Firehose delivers real-time data to SageMaker Lakehouse.

This post includes a CloudFormation template that sets up supporting resources so Firehose can deliver streaming data to Iceberg tables. You can review and modify it to meet your requirements. The template performs the following tasks (a sketch of the role setup follows the list):

  • Creates an AWS Identity and Access Management (IAM) role with permissions required for Firehose to write to an S3 bucket.
  • Establishes resources for the Amazon Kinesis Data Generator to send sample streaming data to Firehose.
  • Grants AWS Lake Formation permissions to the Firehose IAM role for Iceberg tables generated in SageMaker Unified Studio.
  • Sets up an S3 bucket to back up records that fail to deliver.
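
To make the role setup concrete, the following Python sketch creates a comparable Firehose role with boto3. This is a minimal illustration, not the template itself: the role name, policy name, and <PROJECT_BUCKET> placeholder are hypothetical, and the actual template also grants AWS Glue and Lake Formation permissions.

import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets the Firehose service assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "firehose.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="firehose-lakehouse-role",  # hypothetical name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Minimal S3 permissions for delivering data and backing up failed records.
s3_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "s3:PutObject", "s3:GetObject", "s3:ListBucket",
            "s3:GetBucketLocation", "s3:AbortMultipartUpload",
        ],
        "Resource": [
            "arn:aws:s3:::<PROJECT_BUCKET>",
            "arn:aws:s3:::<PROJECT_BUCKET>/*",
        ],
    }],
}

iam.put_role_policy(
    RoleName="firehose-lakehouse-role",
    PolicyName="firehose-s3-access",  # hypothetical name
    PolicyDocument=json.dumps(s3_policy),
)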

Prerequisites

To follow this guide, you should have the following prerequisites:

  • An AWS account – If you don’t have one, you can sign up for an AWS account.
  • A SageMaker Unified Studio domain – For setup instructions, refer to Create an Amazon SageMaker Unified Studio domain – quick setup.
  • A demo project – Create a demo project in your SageMaker Unified Studio domain. For guidance, see Create a project. In this example, we select All capabilities in the project profile section and use streaming_datalake as the AWS Glue database name.

Once you’ve established the prerequisites, confirm that you can log into SageMaker Unified Studio and that the project has been created successfully. Each project in SageMaker Unified Studio comes with a designated project location and IAM role, as highlighted in the accompanying screenshot.

Creating an Iceberg Table

For this solution, we use Amazon Athena as the query editor engine. Follow these steps to create your Iceberg table:

  1. In SageMaker Unified Studio, select Query Editor from the Build menu.
  2. Choose Athena as the engine for the query editor and select the AWS Glue database created for the project.
  3. Use the following SQL statement to create the Iceberg table (make sure to replace placeholders with your project’s AWS Glue database and Amazon S3 location):
CREATE TABLE firehose_events (
  type struct<device: string, event: string, action: string>,
  customer_id string,
  event_timestamp timestamp,
  region string
)
LOCATION '<PROJECT_S3_LOCATION>/iceberg/events'
TBLPROPERTIES (
  'table_type' = 'iceberg',
  'write_compression' = 'zstd'
);
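
You can also drive Athena programmatically instead of through the query editor. The following is a minimal sketch using boto3; the streaming_datalake database matches the project above, while the athena-results/ output prefix is an assumption you should replace with your own location.

import time
import boto3

athena = boto3.client("athena")

# Submit a verification query (the same call can run the CREATE TABLE DDL).
response = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM firehose_events",
    QueryExecutionContext={"Database": "streaming_datalake"},
    ResultConfiguration={"OutputLocation": "<PROJECT_S3_LOCATION>/athena-results/"},
)

# Poll until Athena finishes running the query.
execution_id = response["QueryExecutionId"]
while True:
    status = athena.get_query_execution(QueryExecutionId=execution_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

print(f"Query {execution_id} finished with state {state}")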

Deploying Supporting Resources

Next, you need to deploy the required resources in your AWS environment using a CloudFormation template. Follow these steps:

  1. Click on Launch Stack.
  2. Click Next.
  3. Keep the stack name as firehose-lakehouse.
  4. Enter the username and password you wish to use for accessing the Amazon Kinesis Data Generator application.
  5. For DatabaseName, input the AWS Glue database name.
  6. For ProjectBucketName, provide the name of the project bucket (this can be found on the SageMaker Unified Studio project details page).
  7. For TableName, enter the name of the table created in SageMaker Unified Studio.
  8. Click Next.
  9. Acknowledge that AWS CloudFormation may create IAM resources and click Next.
  10. Review your settings and submit to complete stack creation.
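
If you prefer the API over the console, stack creation looks roughly like the following boto3 sketch. The TemplateURL is a placeholder for wherever the template is hosted, and the parameter keys are assumptions that mirror the console steps above; check the template for the exact names.

import boto3

cloudformation = boto3.client("cloudformation")

cloudformation.create_stack(
    StackName="firehose-lakehouse",
    TemplateURL="https://<TEMPLATE_BUCKET>.s3.amazonaws.com/firehose-lakehouse.yaml",  # placeholder
    Parameters=[
        {"ParameterKey": "Username", "ParameterValue": "<KDG_USERNAME>"},  # assumed key
        {"ParameterKey": "Password", "ParameterValue": "<KDG_PASSWORD>"},  # assumed key
        {"ParameterKey": "DatabaseName", "ParameterValue": "streaming_datalake"},
        {"ParameterKey": "ProjectBucketName", "ParameterValue": "<PROJECT_BUCKET>"},
        {"ParameterKey": "TableName", "ParameterValue": "firehose_events"},
    ],
    # Acknowledges that the stack creates IAM resources (step 9 above).
    Capabilities=["CAPABILITY_NAMED_IAM"],
)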

Creating a Firehose Stream

To create a Firehose stream that delivers data to Amazon S3, perform the following steps:

  1. In the Firehose console, select Create Firehose stream.
  2. For Source, select Direct PUT.
  3. For Destination, choose Apache Iceberg Tables.
  4. For the Firehose stream name, enter firehose-iceberg-events.
  5. Collect the database name and table name from the SageMaker Unified Studio project for the next step.
  6. In the Destination settings section, enable Inline parsing for routing information and enter the database and table names from the previous step.

Remember to enclose the database and table names in double quotes if you want to deliver data to a single database and table. Amazon Data Firehose can also route records to different tables based on record content; for more information, see the routing documentation for Amazon Data Firehose.
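
Once the stream is active, you can send a few test records without waiting for the Kinesis Data Generator. The following boto3 sketch shapes each record to match the firehose_events table created earlier; the field values are sample data.

import json
import boto3

firehose = boto3.client("firehose")

# A sample event shaped like the firehose_events Iceberg table.
record = {
    "type": {"device": "mobile", "event": "checkout", "action": "click"},
    "customer_id": "C-1001",
    "event_timestamp": "2025-01-15 12:30:45",
    "region": "us-east-1",
}

# Firehose expects raw bytes; newline-delimited JSON keeps records separable.
firehose.put_record(
    DeliveryStreamName="firehose-iceberg-events",
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)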

By following this guide, organizations can effectively integrate real-time streaming data with analytical capabilities, fostering collaboration among different teams and improving decision-making processes in real time.
