A data lake serves as a unified repository that allows you to store all of your structured and unstructured data at any scale. You can retain data in its original form without having to structure it first, making it easier to run different types of analytics for better business insights. Over time, data lakes hosted on Amazon Simple Storage Service (Amazon S3) have become the primary repository for enterprise data, serving a wide range of users who query it for diverse analytics and machine learning applications. A data lake on Amazon S3 gives you access to a variety of datasets, helps you build business intelligence dashboards, and accelerates data consumption when you implement a modern data architecture or data mesh framework on Amazon Web Services (AWS).
Analytics requirements surrounding data lakes are continuously changing. Often, there is a need to ingest data from multiple sources into a data lake consistently and query that data simultaneously through various analytics tools with transactional features. However, conventional data lakes built on Amazon S3 are immutable and lack the necessary transactional capabilities to adapt to evolving use cases. As a result, customers seek methods to not only transfer new or incremental data to data lakes as transactions but also to convert existing data formatted in Apache Parquet to a transactional format. Open table formats like Apache Iceberg provide a viable solution to this challenge. Apache Iceberg supports transactions on data lakes and streamlines data storage, management, ingestion, and processing.
In this article, we will guide you on how to convert existing data in your Amazon S3 data lake from Apache Parquet format to Apache Iceberg format, thereby enabling transactions on the data using Jupyter Notebook-based interactive sessions with AWS Glue 4.0.
Migrating Existing Parquet Data to Iceberg Format
There are two primary strategies for migrating existing data in a data lake from Apache Parquet to Apache Iceberg format, allowing the data lake to adopt a transactional table format.
- In-place Data Upgrade
This strategy upgrades existing datasets to Apache Iceberg format without reprocessing or restating the current data. The data files in the data lake remain unaltered during migration, and all Iceberg metadata (manifest files, manifest lists, and table metadata files) is created independently of the data. This approach can be considerably more cost-effective than rewriting all data files. The original data files must be in Apache Parquet, Apache ORC, or Apache Avro format. You can perform an in-place migration in one of two ways:
- Using add_files: This procedure adds existing data files to an existing Iceberg table with a new snapshot that includes those files. Unlike migrate or snapshot, add_files can import files from specific partitions without creating a new Iceberg table. It does not analyze the schema of the files to ensure they align with the schema of the Iceberg table. Once it completes, the Iceberg table treats these files as part of its own set of files.
- Using migrate: This command replaces a table with an Iceberg table populated with the source data files. The resulting table retains the schema, partitioning, properties, and location from the original table. Supported formats include Avro, Parquet, and ORC. By default, the original table is preserved, named table_BACKUP_. However, to keep the original table intact during the process, utilize snapshot to create a new temporary table with the same source data files and schema.
In this post, we demonstrate how to use the Iceberg add_files procedure for an in-place data upgrade; a minimal sketch of the procedure follows this list. Please note that the migrate procedure is not supported in the AWS Glue Data Catalog.
- CTAS Migration of Data
The create table as select (CTAS) migration technique generates all the Iceberg metadata while restating (rewriting) all the data files. This method creates a shadow of the source dataset in batches. Once the shadow is caught up, you can swap the shadow in for the original dataset.
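As a preview of the approach demonstrated in this post, the following is a minimal sketch of what an add_files call can look like from an AWS Glue 4.0 notebook session. The Iceberg catalog name (glue_catalog), the warehouse path, and the target table name (ghcn_db.target_iceberg) are assumptions for illustration; adjust them to your environment, and note that the target Iceberg table must already exist with a compatible schema and partitioning.

# Minimal sketch (assumed names noted above). In an AWS Glue 4.0 interactive session,
# these Spark settings are normally supplied when the session is created (for example,
# via session magics or the --datalake-formats iceberg option); the builder form below
# is a generic Spark illustration.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    # Hypothetical warehouse location; point this at your own bucket
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://demo-blog-post-XXXXXXXX/iceberg/")
    .getOrCreate()
)

# Register the data files of the cataloged Parquet table into the existing Iceberg table
# without rewriting them; a new snapshot referencing those files is committed.
spark.sql("""
    CALL glue_catalog.system.add_files(
        table => 'ghcn_db.target_iceberg',
        source_table => 'ghcn_db.source_parquet'
    )
""")

The source table here is the table that the AWS Glue crawler creates later in this walkthrough; add_files only writes Iceberg metadata, so the underlying Parquet objects in Amazon S3 stay where they are.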
Prerequisites
To follow along with the tutorial, you will need:
- An AWS account with adequate permissions to provision necessary resources.
- AWS Region set to us-east-1.
- An AWS Identity and Access Management (IAM) role for your notebook, as outlined in the “Set up IAM permissions for AWS Glue Studio” section.
- The NOAA Global Historical Climatology Network Daily (GHCN-D) dataset, available through the Registry of Open Data on AWS in Apache Parquet format, stored in the S3 bucket s3://noaa-ghcn-pds/parquet/by_year/.
- AWS Command Line Interface (AWS CLI) configured to interact with AWS services.
To check the data size, run the following command in AWS CLI or AWS CloudShell:
aws s3 ls --summarize --human-readable --recursive s3://noaa-ghcn-pds/parquet/by_year/YEAR=2023
As of this writing, there are 107 objects with a total size of 70 MB for the year 2023 in the specified Amazon S3 path.
Setting Up Resources via AWS CloudFormation
To create the S3 bucket and the AWS IAM role and policy for this solution, follow these steps:
- Sign in to your AWS account and select “Launch Stack” to deploy the CloudFormation template.
- Enter a name for the Stack.
- Keep the parameters at their default values. If you change any default values, make sure to adjust them throughout the following steps.
- Click “Next” to create your stack.
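If you prefer to script the deployment rather than use the Launch Stack button, a rough boto3 equivalent is sketched below. The stack name and the local template file name (template.yaml) are hypothetical placeholders; use the template behind the Launch Stack link.

# Hedged sketch: deploy the CloudFormation stack with boto3 instead of the console.
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

with open("template.yaml") as f:  # hypothetical local copy of the template
    template_body = f.read()

cfn.create_stack(
    StackName="demo-blog-post-stack",       # hypothetical stack name
    TemplateBody=template_body,
    Capabilities=["CAPABILITY_NAMED_IAM"],  # the template creates a named IAM role and policy
)

# Block until the stack (S3 bucket, IAM role and policy, Glue database, and crawler) is created
cfn.get_waiter("stack_create_complete").wait(StackName="demo-blog-post-stack")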
The AWS CloudFormation template will establish the following resources:
- An S3 bucket named demo-blog-post-XXXXXXXX (where XXXXXXXX corresponds to your AWS account ID).
- Two folders named parquet and iceberg within that bucket.
- An IAM role and policy named demoblogpostrole and demoblogpostscoped, respectively.
- An AWS Glue database named ghcn_db.
- An AWS Glue crawler named demopostcrawlerparquet.
Once the CloudFormation template is successfully deployed, copy the data into the newly created S3 bucket using this command in AWS CLI or AWS CloudShell. Replace XXXXXXXX with the appropriate bucket name. Note: This example copies data only for the year 2023; however, you can work with the entire dataset by following similar instructions.
aws s3 sync s3://noaa-ghcn-pds/parquet/by_year/YEAR=2023/ s3://demo-blog-post-XXXXXXXX/parquet/year=2023
Next, navigate to the AWS Management Console and open the AWS Glue console. In the navigation pane, choose Crawlers and run the crawler named demopostcrawlerparquet. After it completes successfully, the metadata for the Apache Parquet data is cataloged under the ghcn_db AWS Glue database with the table name source_parquet. This table is partitioned by the year and element columns, mirroring the structure of the S3 prefixes.
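If you would rather start the crawler from code than from the console, a hedged boto3 equivalent looks like this (the crawler name comes from the CloudFormation template):

# Hedged sketch: start the crawler created by the CloudFormation template and
# poll until it returns to the READY state.
import time
import boto3

glue = boto3.client("glue", region_name="us-east-1")
glue.start_crawler(Name="demopostcrawlerparquet")

while glue.get_crawler(Name="demopostcrawlerparquet")["Crawler"]["State"] != "READY":
    time.sleep(30)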
To verify the data, you can use the Amazon Athena console. If you're using Amazon Athena for the first time in your AWS account, set a query result location in Amazon S3 before running queries.
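Alternatively, you can run a quick sanity check from the same Glue notebook session. The following sketch only assumes the database, table, and partition columns created by the crawler above:

# Hedged sketch: confirm the crawled Parquet table is queryable and partitioned as expected.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    SELECT element, COUNT(*) AS record_count
    FROM ghcn_db.source_parquet
    WHERE year = '2023'
    GROUP BY element
    ORDER BY record_count DESC
""").show(20, truncate=False)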