Data lakes were originally conceptualized to efficiently store vast amounts of raw, unstructured, or semi-structured data at a minimal cost, primarily catering to big data and analytical needs. As organizations have expanded their application of data lakes, these repositories have evolved into pivotal components for various data-driven processes, transcending mere reporting and analytics. Today, they are vital in synchronizing with customer applications, enabling management of concurrent data operations while ensuring data integrity and consistency. This transformation includes not only the storage of batch data but also the ingestion and processing of near real-time data streams, empowering businesses to integrate historical insights with live data for more agile decision-making. However, this modern data lake architecture introduces challenges related to transactional support and the management of the numerous small files produced by real-time data streams. Traditionally, organizations have tackled these challenges through complex extract, transform, and load (ETL) processes, often resulting in data duplication and increased complexity in their data pipelines. Moreover, to address the surge of small files, companies developed custom solutions for compacting and merging these files, leading to bespoke systems that were difficult to scale and maintain. As data lakes increasingly handle sensitive business data and transactional workloads, ensuring high data quality, governance, and compliance has become essential for maintaining trust and regulatory alignment.
To alleviate these issues, organizations have embraced open table formats (OTFs) like Apache Iceberg, which offer built-in transactional capabilities and compaction mechanisms. OTFs such as Iceberg mitigate significant limitations of traditional data lakes by providing ACID transactions, which uphold data consistency during concurrent operations, and compaction to address the small file problem efficiently. Iceberg’s built-in compaction and snapshot management simplify maintenance, making it easier to manage object and metadata versioning at scale. However, despite these advantages, OTFs still require regular maintenance to keep tables optimized.
In this article, we delve into new features of the AWS Glue Data Catalog that enhance automatic compaction of Iceberg tables for streaming data, simplifying the process of maintaining high performance in your transactional data lakes. Activating automatic compaction on Iceberg tables reduces metadata overhead and boosts query performance. Many customers continuously ingest streaming data into Iceberg tables, generating a large number of delete files to track changes. With this new feature, when you enable the Data Catalog optimizer, it continuously monitors table partitions and triggers the compaction process for both data files and delta (delete) files, regularly committing partial progress. The Data Catalog now also accommodates heavily nested complex data and supports schema evolution, such as column reordering or renaming.
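You can enable the compaction optimizer from the AWS Glue console or programmatically. The following is a minimal sketch using the boto3 Glue client’s create_table_optimizer call; the account ID, database, table, and IAM role names are placeholders, not values from this post.

```python
import boto3

# Placeholder names: substitute your own account (catalog) ID, database, table, and IAM role.
glue = boto3.client("glue", region_name="us-east-1")

response = glue.create_table_optimizer(
    CatalogId="123456789012",          # AWS account ID acting as the catalog ID
    DatabaseName="iceberg_db",         # placeholder Glue database
    TableName="iot_events",            # placeholder Iceberg table
    Type="compaction",                 # enable the data compaction optimizer
    TableOptimizerConfiguration={
        "roleArn": "arn:aws:iam::123456789012:role/GlueOptimizerRole",  # role with access to the table's S3 data and Glue metadata
        "enabled": True,
    },
)
print(response["ResponseMetadata"]["HTTPStatusCode"])
```

Once enabled, the optimizer runs in the background; no further orchestration is required on your side.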
Automatic Compaction with AWS Glue
The automatic compaction feature in the Data Catalog keeps your Iceberg tables in optimal condition. The data compaction optimizer continuously monitors table partitions and initiates the compaction process when thresholds for the number of files and file sizes are reached. For instance, based on the Iceberg table’s configured target file size, the compaction process starts when any partition in the table has more than the default threshold of 100 files, each smaller than 75% of the target file size.
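The target file size that compaction works toward is governed by the Iceberg table property write.target-file-size-bytes (Iceberg’s default is 512 MB). The following is a minimal sketch of adjusting it from Spark; it assumes a Spark session configured with the Iceberg runtime and a Glue-backed catalog named glue_catalog, and the database and table names are placeholders.

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg runtime and a Glue catalog ("glue_catalog") are configured on the session.
spark = SparkSession.builder.appName("iceberg-table-props").getOrCreate()

# Lower the target data file size to 128 MB (Iceberg's default is 512 MB); compaction
# and the optimizer's thresholds are evaluated relative to this value.
spark.sql("""
    ALTER TABLE glue_catalog.iceberg_db.iot_events
    SET TBLPROPERTIES ('write.target-file-size-bytes' = '134217728')
""")
```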
Iceberg offers two table modes: Merge-on-Read (MoR) and Copy-on-Write (CoW), each providing distinct approaches for managing data updates and playing a critical role in how data lakes handle changes while maintaining performance.
- Data Compaction on Iceberg CoW – In the CoW mode, updates or deletions are immediately applied to the table files, resulting in a complete rewrite of the dataset with any changes made. Although this ensures immediate consistency and simplifies read operations (as readers only access the latest snapshot of the data), it can become costly and slow for write-heavy workloads due to frequent rewrites. Announced during AWS re:Invent 2023, this feature optimizes data storage for Iceberg tables utilizing the CoW mechanism. Compaction in CoW guarantees that updates create new files, which are subsequently compacted to enhance query performance.
- Data Compaction on Iceberg MoR – In contrast, MoR permits updates to be written independently from the existing dataset, merging those changes only during data reads. This method proves advantageous for write-heavy scenarios as it avoids constant full table rewrites. However, it can complicate read operations since the system must merge base and delta files as needed to provide a complete data view. MoR compaction, which is now generally available, allows for efficient handling of streaming data, ensuring that while data is continuously ingested, it is also compacted to optimize read performance without hindering ingestion speed.
Whether utilizing CoW, MoR, or a combination of both, one persistent challenge is the maintenance of the growing number of small files generated by each transaction. AWS Glue’s automatic compaction addresses this issue, keeping your Iceberg tables efficient and performant across both table modes.
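For reference, the table mode is controlled by Iceberg write properties rather than a separate table type. The following sketch configures a table for MoR behavior on deletes, updates, and merges; the catalog, database, and table names are placeholders, and CoW is obtained by setting the same properties to 'copy-on-write'.

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg runtime and a Glue catalog ("glue_catalog") are configured on the session.
spark = SparkSession.builder.appName("iceberg-write-modes").getOrCreate()

# Write new changes as delete/delta files and merge them at read time (Merge-on-Read).
spark.sql("""
    ALTER TABLE glue_catalog.iceberg_db.iot_events
    SET TBLPROPERTIES (
        'write.delete.mode' = 'merge-on-read',
        'write.update.mode' = 'merge-on-read',
        'write.merge.mode'  = 'merge-on-read'
    )
""")
```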
This post presents a comprehensive comparison of query performance between auto-compacted and non-compacted Iceberg tables. By examining critical metrics such as query latency and storage efficiency, we highlight how the automatic compaction feature enhances data lakes for improved performance and cost savings. This analysis will assist you in making informed decisions to optimize your data lake environments.
Solution Overview
This blog discusses the performance advantages of the newly launched AWS Glue feature that supports automatic compaction of Iceberg tables with MoR capabilities. We executed two versions of the same architecture: one with auto-compaction enabled and another without it. By contrasting both scenarios, we illustrate the efficiency, query performance, and cost benefits of auto-compacted versus non-compacted tables within a simulated Internet of Things (IoT) data pipeline.
The solution architecture is depicted in the following diagram:
The architecture comprises the following components:
- Amazon Elastic Compute Cloud (Amazon EC2) simulates continuous IoT data streams, relaying them to Amazon MSK for processing.
- Amazon Managed Streaming for Apache Kafka (Amazon MSK) ingests and streams data from the IoT simulator for real-time processing.
- Amazon EMR Serverless processes the streaming data from Amazon MSK without the need for cluster management, writing results to the Amazon S3 data lake.
- Amazon Simple Storage Service (Amazon S3) stores data utilizing Iceberg’s MoR format for efficient querying and analysis.
- The Data Catalog manages metadata for datasets in Amazon S3, facilitating organized data discovery and querying via Amazon Athena.
- Amazon Athena queries data from the S3 data lake with two table options:
- Non-compacted table – Queries raw data directly from the Iceberg table without compaction.
- Auto-compacted table – Queries data from the Iceberg table with the automatic compaction optimizer enabled.
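To make the ingestion step concrete, the following is a minimal sketch of the kind of Spark Structured Streaming job the EMR Serverless application could run, reading IoT events from the MSK topic and appending them to the Iceberg table in Amazon S3. The broker addresses, topic, schema, checkpoint location, and table names are illustrative placeholders, not the exact configuration used in this post, and the job assumes the Kafka connector and Iceberg runtime are supplied through the job configuration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# Assumes the Kafka connector, Iceberg runtime, and a Glue catalog ("glue_catalog")
# are configured via the EMR Serverless job parameters.
spark = SparkSession.builder.appName("iot-iceberg-stream").getOrCreate()

# Placeholder event schema for the simulated IoT payload.
schema = StructType([
    StructField("device_id", StringType()),
    StructField("temperature", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Read the raw Kafka records from MSK and parse the JSON payload.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "b-1.example.kafka.us-east-1.amazonaws.com:9092")  # placeholder MSK brokers
    .option("subscribe", "iot-events")                                                    # placeholder topic
    .option("startingOffsets", "latest")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Append the parsed events to the Iceberg table; with MoR enabled, updates and deletes
# land as delta files that the Data Catalog optimizer later compacts.
query = (
    events.writeStream.format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/iot_events/")  # placeholder bucket
    .toTable("glue_catalog.iceberg_db.iot_events")
)
query.awaitTermination()
```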