Query Your Apache Hive Metastore with AWS Lake Formation Permissions

Apache Hive is a SQL-based data warehouse system for processing highly distributed datasets on the Apache Hadoop platform. It has two main components: the Hive SQL query engine and the Hive metastore (HMS). The Hive metastore is a repository for the metadata of SQL tables, including database names, table names, schema details, serialization and deserialization information, data locations, and partition details for each table. Tools such as Apache Hive, Apache Spark, Presto, and Trino use the Hive metastore to retrieve the metadata they need to run queries. The metastore can be hosted on an Apache Hadoop cluster or backed by an external relational database. The metastore holds only the table metadata; the data itself can reside in Amazon Simple Storage Service (Amazon S3), the Hadoop Distributed File System (HDFS), or any other data store compatible with Hive.

Many organizations have relied on Apache Hive for big data processing since its early days in the Hadoop ecosystem. The Hive metastore also integrates with many other open-source big data tools, including Apache HBase, Apache Spark, Presto, and Apache Impala. As a result, organizations have accumulated large amounts of structured dataset metadata in the Hive metastore, making it an essential component of their data lakes. However, many AWS analytics services do not integrate natively with the Hive metastore, so organizations have had to migrate their metadata to the AWS Glue Data Catalog to use these services.

AWS Lake Formation has introduced support for managing user access to Apache Hive metastores via a federated AWS Glue connection. Previously, Lake Formation was restricted to managing user permissions solely on AWS Glue Data Catalog resources. With the new Hive metastore connection from AWS Glue, you can link to a database in an external Hive metastore, map it to a federated database within the Data Catalog, apply Lake Formation permissions to the Hive database and its tables, share them with other AWS accounts, and query them using services like Amazon Athena, Amazon Redshift Spectrum, Amazon EMR, and AWS Glue ETL. For more details on the integration, refer to this blog post.

There are various use cases for integrating the Hive metastore with the Data Catalog, such as:

  • Utilizing an external Apache Hive metastore for legacy big data workloads from on-premises Hadoop clusters with data stored in Amazon S3.
  • Managing transient Amazon EMR workloads with data in Amazon S3 and the Hive metastore hosted on Amazon Relational Database Service (Amazon RDS) clusters.

This post demonstrates how to apply Lake Formation permissions to a Hive metastore database and its tables and how to query them with Athena. It also illustrates a cross-account sharing scenario in which a Lake Formation steward in producer account A shares a federated Hive database and its tables with consumer account B using LF-Tags.

Solution Overview

In producer account A, an Apache Hive metastore is hosted within an EMR cluster, with the underlying data stored in Amazon S3. We will deploy the AWS Glue Hive metastore connector from the AWS Serverless Application Repository in account A and establish a Hive metastore connection in the Data Catalog. After creating the HMS connection, we will create a federated database in account A’s Data Catalog and map it to a database in the Hive metastore using the connection. The tables from the Hive database will then be accessible to the Lake Formation administrator in account A, just like any other tables in the Data Catalog. The admin will proceed to implement Lake Formation tag-based access control (LF-TBAC) for the federated Hive database and share it with account B.
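To make the mapping step concrete, here is a minimal sketch of what creating the federated database could look like through the AWS Glue `CreateDatabase` API, whose `DatabaseInput` accepts a `FederatedDatabase` block. The names `fed_salesdb`, `salesdb`, and `hms_connection` are hypothetical, and treating the identifier as the plain Hive database name is an assumption to verify against the Glue documentation. The client is passed in, for example `boto3.client("glue")`.

```python
def federated_database_input(local_name, hive_database, connection_name):
    """Build a Glue DatabaseInput that maps a Data Catalog database to Hive."""
    return {
        "Name": local_name,
        "FederatedDatabase": {
            # Assumption: the identifier is the database name in the Hive metastore.
            "Identifier": hive_database,
            # Name of the AWS Glue connection created by the HMS connector.
            "ConnectionName": connection_name,
        },
    }


def create_federated_database(glue_client, catalog_id, database_input):
    """Create the federated database in the given account's Data Catalog."""
    return glue_client.create_database(
        CatalogId=catalog_id, DatabaseInput=database_input
    )
```

Calling `create_federated_database(boto3.client("glue"), "<account A ID>", federated_database_input("fed_salesdb", "salesdb", "hms_connection"))` would issue the request; the walkthrough itself performs this step on the Lake Formation console.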

Users in account B will access the Hive database and tables from account A just as they would any other shared Data Catalog resource using Lake Formation permissions.

The following diagram illustrates this architecture:

The solution involves steps across both accounts. In account A, you will need to:

  1. Create an S3 bucket to host the sample data.
  2. Launch an EMR 6.10 cluster with Hive. Download the sample data into the S3 bucket. Set up a database and external tables that point to this data within its Hive metastore.
  3. Deploy the application GlueDataCatalogFederation-HiveMetastore from the AWS Serverless Application Repository and configure it for the Amazon EMR Hive metastore. This action will establish an AWS Glue connection to the Hive metastore that will be visible on the Lake Formation console.
  4. Using the Hive metastore connection, create a federated database in the AWS Glue Data Catalog.
  5. Create LF-Tags and associate them with the federated database.
  6. Grant permissions on the LF-Tags to account B, along with database and table permissions using LF-Tag expressions.
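Steps 5 and 6 above can be sketched as an ordered list of Lake Formation API requests. This is an illustration under stated assumptions, not the post's exact configuration: the tag key and values, the database name, and the specific permissions granted (`DESCRIBE` on the tag, `DESCRIBE` on databases, `SELECT` and `DESCRIBE` on tables, each with the grant option) are hypothetical choices. The method names are those of the boto3 Lake Formation client.

```python
def lf_tag_sharing_steps(producer_id, consumer_id, database, tag_key, tag_values):
    """Return (method_name, kwargs) pairs for a Lake Formation client, in order."""
    tag = {"CatalogId": producer_id, "TagKey": tag_key, "TagValues": list(tag_values)}
    consumer = {"DataLakePrincipalIdentifier": consumer_id}

    def tag_policy(resource_type):
        # An LF-Tag expression matching resources that carry this tag.
        return {
            "LFTagPolicy": {
                "CatalogId": producer_id,
                "ResourceType": resource_type,
                "Expression": [{"TagKey": tag_key, "TagValues": list(tag_values)}],
            }
        }

    return [
        # Step 5: create the LF-Tag and attach it to the federated database.
        ("create_lf_tag", dict(tag)),
        ("add_lf_tags_to_resource", {
            "Resource": {"Database": {"CatalogId": producer_id, "Name": database}},
            "LFTags": [dict(tag)],
        }),
        # Step 6: let account B see the tag itself ...
        ("grant_permissions", {
            "Principal": consumer,
            "Resource": {"LFTag": dict(tag)},
            "Permissions": ["DESCRIBE"],
        }),
        # ... and grant database and table permissions through the tag expression.
        ("grant_permissions", {
            "Principal": consumer,
            "Resource": tag_policy("DATABASE"),
            "Permissions": ["DESCRIBE"],
            "PermissionsWithGrantableOption": ["DESCRIBE"],
        }),
        ("grant_permissions", {
            "Principal": consumer,
            "Resource": tag_policy("TABLE"),
            "Permissions": ["SELECT", "DESCRIBE"],
            "PermissionsWithGrantableOption": ["SELECT", "DESCRIBE"],
        }),
    ]


def apply_steps(lf_client, steps):
    """Run each request against a client such as boto3.client("lakeformation")."""
    for method_name, kwargs in steps:
        getattr(lf_client, method_name)(**kwargs)
```

Keeping the requests as plain data makes them easy to review before anything is sent. The grant option is included so that the consumer's data lake admin can pass permissions on to an analyst.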

In account B, perform the following steps:

  1. As a data lake admin, review and accept the AWS Resource Access Manager (AWS RAM) invitations for the shares from account A.
  2. The data lake admin will then see the shared database and tables. The admin can create a resource link to the database and grant fine-grained permissions to a data analyst in this account.
  3. Both the data lake admin and the data analyst can query the Hive tables available to them using Athena.
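The account B steps above map onto three API areas: AWS RAM for accepting the share, AWS Glue for the resource link, and Athena for the query. The sketch below shows one way they might be scripted. The link name, database names, and S3 output location are hypothetical, and the invitation helper reads only the first page of results. Clients are passed in, for example `boto3.client("ram")`.

```python
def pending_invitation_arns(ram_client):
    """List ARNs of resource share invitations that are still pending."""
    response = ram_client.get_resource_share_invitations()
    return [
        inv["resourceShareInvitationArn"]
        for inv in response.get("resourceShareInvitations", [])
        if inv.get("status") == "PENDING"
    ]


def accept_invitations(ram_client):
    """Accept every pending invitation and return the accepted ARNs."""
    arns = pending_invitation_arns(ram_client)
    for arn in arns:
        ram_client.accept_resource_share_invitation(resourceShareInvitationArn=arn)
    return arns


def resource_link_input(link_name, producer_id, shared_database):
    """Glue DatabaseInput for a resource link to the database shared by account A."""
    return {
        "Name": link_name,
        "TargetDatabase": {"CatalogId": producer_id, "DatabaseName": shared_database},
    }


def athena_query_request(sql, database, output_location):
    """Keyword arguments for the Athena client's start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }
```

A resource link is created with `glue_client.create_database(DatabaseInput=resource_link_input(...))`, and a query is started with `athena_client.start_query_execution(**athena_query_request(...))`, using the resource link name as the database.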

Account A has one persona:

  • hmsblog-producersteward – Manages the data lake in producer account A.

Account B has two personas:

  • hmsblog-consumersteward – Oversees the data lake in consumer account B.
  • hmsblog-analyst – A data analyst requiring access to specific Hive tables.

Prerequisites

To follow the tutorial outlined in this post, you will need:

  • Two AWS accounts. Use test accounts rather than production accounts.
  • An admin AWS Identity and Access Management (IAM) user in both accounts for launching the AWS CloudFormation stacks.
  • Lake Formation mode enabled in both accounts, with the cross-account version setting set to version 3. For guidance, refer to the Lake Formation documentation on cross-account version settings.
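The cross-account version prerequisite can also be checked and set programmatically. The following sketch assumes the version is stored under the `CROSS_ACCOUNT_VERSION` key of the `Parameters` map in the data lake settings. Because `put_data_lake_settings` replaces the stored settings, the helper reads the current settings first and changes only that key. The client is passed in, for example `boto3.client("lakeformation")`.

```python
def with_cross_account_version(settings, version="3"):
    """Return a copy of DataLakeSettings with CROSS_ACCOUNT_VERSION set."""
    updated = dict(settings)
    parameters = dict(settings.get("Parameters", {}))
    parameters["CROSS_ACCOUNT_VERSION"] = version
    updated["Parameters"] = parameters
    return updated


def ensure_cross_account_version(lf_client, version="3"):
    """Update the setting only if needed; return True when a change was made."""
    current = lf_client.get_data_lake_settings()["DataLakeSettings"]
    if current.get("Parameters", {}).get("CROSS_ACCOUNT_VERSION") == version:
        return False
    lf_client.put_data_lake_settings(
        DataLakeSettings=with_cross_account_version(current, version)
    )
    return True
```

Run it once in each account. The read-modify-write approach keeps existing administrators and default permissions intact.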

Lake Formation and AWS CloudFormation Setup in Account A

To simplify the setup, we will designate an IAM admin as the data lake admin. Follow these steps:

  1. Sign in to the AWS Management Console and select the us-west-2 Region.
  2. On the Lake Formation console, choose Permissions in the navigation pane and select Administrative roles and tasks.
  3. Choose Manage Administrators in the Data lake administrators section.
  4. Under IAM users and roles, select the IAM admin user you are logged in as and click Save.
  5. Click Launch Stack to deploy the CloudFormation template:
    • Click Next.
    • Provide a name for the stack and click Next.
    • On the next page, click Next.
    • Review the details on the final page and select “I acknowledge that AWS CloudFormation might create IAM resources.”
    • Click Create.
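For readers who prefer scripting the launch in step 5, here is a rough equivalent using the CloudFormation API. The stack name and template URL are hypothetical placeholders, since the post's template is reached through the Launch Stack button. The `Capabilities` entry is the API counterpart of the console's IAM acknowledgment check box. The client is passed in, for example `boto3.client("cloudformation", region_name="us-west-2")`.

```python
def create_stack_request(stack_name, template_url):
    """Keyword arguments for the CloudFormation client's create_stack call."""
    return {
        "StackName": stack_name,
        "TemplateURL": template_url,
        # Acknowledges that the template might create IAM resources. A template
        # with custom-named IAM resources needs CAPABILITY_NAMED_IAM instead.
        "Capabilities": ["CAPABILITY_IAM"],
    }


def launch_and_wait(cfn_client, request):
    """Start stack creation and block until the stack reports completion."""
    stack_id = cfn_client.create_stack(**request)["StackId"]
    waiter = cfn_client.get_waiter("stack_create_complete")
    waiter.wait(StackName=request["StackName"])
    return stack_id
```

The waiter polls the stack status on your behalf and raises an error if creation fails or rolls back.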

Stack creation takes approximately 10 minutes. The stack sets up producer account A by:

  • Creating an S3 data lake bucket.
  • Registering the data lake bucket with Lake Formation while enabling catalog federation.
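  • Launching an Amazon EMR 6.10 cluster with Hive and running two Amazon EMR steps on it (per the overview above, loading the sample data into the bucket and creating the Hive database and external tables).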
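The stack performs the bucket registration for you. For reference, a manual registration with catalog federation enabled might look like the sketch below, which relies on the `WithFederation` flag of the Lake Formation `RegisterResource` API. The bucket name and role ARN are hypothetical, and the role must be one that Lake Formation can assume to access the bucket.

```python
def register_bucket_request(bucket_name, role_arn):
    """Keyword arguments for the Lake Formation client's register_resource call."""
    return {
        "ResourceArn": f"arn:aws:s3:::{bucket_name}",
        "RoleArn": role_arn,
        # Allows Lake Formation to serve federated (Hive metastore) tables
        # whose data lives in this location.
        "WithFederation": True,
    }


def register_bucket(lf_client, bucket_name, role_arn):
    """Register the bucket using a client such as boto3.client("lakeformation")."""
    return lf_client.register_resource(**register_bucket_request(bucket_name, role_arn))
```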

Overall, this process provides a streamlined approach to managing Hive metastore permissions within AWS Lake Formation, allowing organizations to better integrate their big data solutions.

