AWS Lake Formation empowers you to centrally manage, secure, and share data for analytics and machine learning. With Lake Formation, you can handle access control for your data lake in Amazon S3 along with its metadata in AWS Glue Data Catalog, all from a single interface that resembles traditional database features. You can implement fine-grained data access controls to ensure that users have the right access to specific data, even at the cell level of tables. Additionally, Lake Formation simplifies data sharing both internally within your organization and externally. It seamlessly integrates with AWS analytics services like Amazon Athena, Amazon Redshift Spectrum, Amazon EMR, and AWS Glue ETL for Apache Spark, enabling efficient and secure data querying from Lake Formation-managed tables to derive business insights swiftly.
Before Lake Formation introduced database-style permissions for data lakes, you had to manage access to your data and its metadata separately, through AWS Identity and Access Management (IAM) policies and S3 bucket policies. IAM and Amazon S3 access controls are more complex and less granular than Lake Formation permissions, and migrating from them to Lake Formation takes time. Until now, a database or table in the data lake could be governed either by IAM and S3 policies or by Lake Formation permissions, never both. This made things difficult for operations teams: when several use cases depended on the same data lake, it was hard to move all of them from one permission model to the other without disruption.
To ease the migration of data lake permissions from an IAM and S3 model to Lake Formation, we are excited to launch a hybrid access mode for AWS Glue Data Catalog. As detailed in the What’s New section and the accompanying documentation, this feature lets you secure and access cataloged data using Lake Formation permissions alongside IAM and S3 permissions. With hybrid access mode, data administrators can adopt Lake Formation permissions incrementally, one data lake use case at a time. For instance, if you have an existing extract, transform, and load (ETL) data pipeline that operates under IAM and S3 policies, you can let your data analysts access the same data through Amazon Athena using Lake Formation permissions, with fine-grained controls as necessary, without changing access for the ETL pipeline.
The hybrid access mode lets both permission models coexist for the same database and tables, giving you more flexibility in managing user access. Although a Data Catalog resource can be reached through either model, a given IAM user or role accesses it through only one of the two permission sets. Once Lake Formation permissions are enabled for an IAM principal, authorization for that principal is governed entirely by Lake Formation, and existing IAM and S3 policies are ignored. AWS CloudTrail provides complete details of access to Data Catalog resources, in both Lake Formation logs and S3 access logs.
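The one-model-per-principal rule can be pictured with a small conceptual sketch. This is a toy model for illustration only, not an AWS API: the class HybridResource, its methods, and the Data-Analyst role ARN are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field

LAKE_FORMATION = "lake-formation"
IAM_AND_S3 = "iam-and-s3"


@dataclass
class HybridResource:
    """Toy model of a Data Catalog resource in hybrid access mode (hypothetical)."""

    name: str
    # Principals for which Lake Formation permissions have been enabled.
    lf_enabled_principals: set = field(default_factory=set)

    def enable_lake_formation(self, principal_arn: str) -> None:
        """Record that Lake Formation permissions are enabled for this principal."""
        self.lf_enabled_principals.add(principal_arn)

    def permission_model_for(self, principal_arn: str) -> str:
        """Return the single permission model that governs this principal."""
        if principal_arn in self.lf_enabled_principals:
            return LAKE_FORMATION
        return IAM_AND_S3


if __name__ == "__main__":
    # Account ID and role names below are placeholders, not real values.
    engineer = "arn:aws:iam::111122223333:role/Data-Engineer"
    analyst = "arn:aws:iam::111122223333:role/Data-Analyst"

    db = HybridResource("hybridsalesdb")
    db.enable_lake_formation(analyst)

    print(engineer, "->", db.permission_model_for(engineer))
    print(analyst, "->", db.permission_model_for(analyst))
```

In this sketch, each principal resolves to exactly one model per resource, which mirrors the behavior described above: principals that have not been enabled for Lake Formation keep using IAM and S3 permissions unchanged.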
In this blog post, we guide you through the steps to onboard Lake Formation permissions under hybrid access mode for selected users while the database remains accessible to others via IAM and S3 permissions. We will also explore how to set up hybrid access mode within a single AWS account and between multiple accounts.
Scenario 1 – Hybrid Access Mode Within a Single AWS Account
In this scenario, we will outline the steps to start incorporating users with Lake Formation permissions for a Data Catalog database that is currently accessed via IAM and S3 policy permissions. For illustration purposes, we will utilize two personas: a Data Engineer, who possesses broad permissions through an IAM policy and an S3 bucket policy to execute an AWS Glue ETL job, and a Data Analyst, who will receive fine-grained Lake Formation permissions to perform queries on the database using Amazon Athena.
Scenario 1 is depicted in the diagram below, where the Data Engineer role accesses the database hybridsalesdb utilizing IAM and S3 permissions, while the Data Analyst role accesses the database using Lake Formation permissions.
Prerequisites
To establish Lake Formation along with IAM and S3 permissions for a Data Catalog database employing hybrid access mode, the following prerequisites should be satisfied:
- An AWS account that isn’t utilized for production applications.
- Lake Formation must already be configured within the account, along with a Lake Formation administrator role or a similar role to follow the instructions in this post. For example, we will employ a data lake administrator role called LF-Admin. For more information on setting up permissions for a data lake administrator role, refer to the guide on creating a data lake administrator.
- A sample database in the Data Catalog containing several tables. For this illustration, our sample database is called hybridsalesdb and includes a set of eight tables, as shown in the accompanying screenshot. You can utilize any of your datasets for this process.
Personas and Their IAM Policy Setup
We have two personas that represent IAM roles within the account: Data Engineer and Data Analyst. Below are their respective IAM policies and access descriptions.
The IAM policy for the Data Engineer role permits access to the database and table metadata in the Data Catalog.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:Get*"
            ],
            "Resource": [
                "arn:aws:glue:<Region>:<account-id>:catalog",
                "arn:aws:glue:<Region>:<account-id>:database/hybridsalesdb",
                "arn:aws:glue:<Region>:<account-id>:table/hybridsalesdb/*"
            ]
        }
    ]
}
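If you prefer to generate this policy rather than edit the placeholders by hand, a minimal sketch could look like the following. The function name build_glue_read_policy and the sample Region and account ID are assumptions made for illustration; the policy content itself mirrors the document above.

```python
import json


def build_glue_read_policy(region: str, account_id: str, database: str) -> dict:
    """Build the read-only Data Catalog policy shown above for one database."""
    prefix = f"arn:aws:glue:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["glue:Get*"],
                "Resource": [
                    f"{prefix}:catalog",
                    f"{prefix}:database/{database}",
                    # All tables in the database.
                    f"{prefix}:table/{database}/*",
                ],
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder Region and account ID; substitute your own values.
    policy = build_glue_read_policy("us-east-1", "111122223333", "hybridsalesdb")
    print(json.dumps(policy, indent=4))
```

The printed JSON can be pasted into the IAM console as an inline policy for the Data Engineer role.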
The IAM policy for the Data Engineer role also grants data access to the underlying Amazon S3 location associated with the database and tables.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowDataLakeBucket",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:Put*",
                "s3:Get*",
                "s3:Delete*"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket-name>",
                "arn:aws:s3:::<bucket-name>/<prefix>/*"
            ]
        }
    ]
}
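The policy lists both a bucket ARN and an object-level ARN because S3 evaluates bucket-level actions such as s3:ListBucket against the bucket, and object-level actions such as s3:GetObject against individual objects. The sketch below is a simplified model of how wildcards in a Resource ARN match, assuming only the * and ? wildcards and ignoring policy variables. The helper arn_matches, the bucket name example-bucket, and the prefix sales are hypothetical.

```python
import re


def arn_matches(pattern: str, arn: str) -> bool:
    """Simplified IAM-style match: '*' is any run of characters, '?' is one character."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.fullmatch("".join(parts), arn, flags=re.DOTALL) is not None


if __name__ == "__main__":
    bucket = "arn:aws:s3:::example-bucket"
    obj = "arn:aws:s3:::example-bucket/sales/2023/orders.parquet"

    # The bucket ARN covers bucket-level actions only; it does not match objects.
    print(arn_matches("arn:aws:s3:::example-bucket", bucket))
    print(arn_matches("arn:aws:s3:::example-bucket", obj))

    # A trailing wildcard is needed to cover the objects under a prefix.
    print(arn_matches("arn:aws:s3:::example-bucket/sales/*", obj))
    print(arn_matches("arn:aws:s3:::example-bucket/sales/", obj))
```

In this model, a resource that ends at the prefix without a wildcard matches only an object whose key is exactly that prefix, so object reads and writes under the prefix need the wildcard form.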
The Data Engineer also has access to the AWS Glue console via the AWS managed policy arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess and has the iam:PassRole permission to execute an AWS Glue ETL script, depicted below.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PassRolePermissions",
            "Effect": "Allow",
            "Action": [
                "iam:PassRole"
            ],
            "Resource": [
                "arn:aws:iam::<account-id>:role/Data-Engineer"
            ]
        }
    ]
}
Furthermore, the following trust policy is attached to the Data Engineer role, allowing AWS Glue to assume the role and run the ETL script on its behalf.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "glue.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
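Before running the ETL job, it can help to confirm that a role's trust policy really names AWS Glue as a trusted service. The following is a minimal local check under stated assumptions: it reads only Allow statements, ignores Deny statements and Condition blocks, and the helper name service_can_assume is hypothetical.

```python
# The trust policy from above, as a Python dictionary.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}


def _as_list(value):
    """IAM allows a single string or a list for Action and Principal values."""
    return [value] if isinstance(value, str) else list(value)


def service_can_assume(trust_policy: dict, service: str) -> bool:
    """Return True if an Allow statement lets the service call sts:AssumeRole."""
    for statement in trust_policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        if "sts:AssumeRole" not in _as_list(statement.get("Action", [])):
            continue
        principal = statement.get("Principal", {})
        if not isinstance(principal, dict):
            continue  # for example "*", which this sketch does not interpret
        if service in _as_list(principal.get("Service", [])):
            return True
    return False


if __name__ == "__main__":
    print(service_can_assume(TRUST_POLICY, "glue.amazonaws.com"))
    print(service_can_assume(TRUST_POLICY, "athena.amazonaws.com"))
```

A check like this only inspects the document structure; the authoritative evaluation is always performed by IAM itself.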
For additional permissions necessary to run an AWS Glue ETL script, see the AWS Glue Studio setup documentation. The Data Analyst role, by contrast, is set up with more restricted permissions.