Overview
Updated June 2024 for accuracy, this post guides you through setting up a secure data lake using AWS Lake Formation. A data lake serves as a centralized, curated, and secure repository for storing both structured and unstructured data at any scale. You can keep your data in its raw form, which allows you to perform various types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning.
Challenges of Data Lakes
One significant challenge in managing data lakes is the storage of unrefined data without adequate oversight. To make your data lake functional, it is essential to have established methods for cataloging and securing that data. AWS Lake Formation offers tools for governance, semantic consistency, and access control, enhancing the usability of your data for analytics and machine learning and providing greater value to your organization.
Lake Formation enables you to manage access to your data lake and monitor who accesses it. The AWS Glue Data Catalog integrates data access policies, ensuring compliance regardless of the data's source.
Step-by-Step Walkthrough
This guide will demonstrate how to build and utilize a data lake:
- Designate a data lake administrator.
- Register an Amazon S3 path.
- Create a database.
- Grant permissions.
- Use AWS Glue to crawl the data and generate metadata and tables.
- Provide access to the table data.
- Query the data using Amazon Athena.
- Add a new user with restricted access and verify results.
Prerequisites
To follow along, you’ll need:
- An AWS account
- An IAM user with the AWSLakeFormationDataAdmin policy (see IAM Access Policies for details)
- An S3 bucket named datalake-yourname-region in the US East (N. Virginia) Region
- A folder named zipcode within your S3 bucket
Additionally, download the sample dataset, which consists of City of New York statistics available on the DATA.GOV site, and upload the file to the zipcode folder in your S3 bucket. Once your S3 bucket is set up with the dataset, proceed to configure your data lake using Lake Formation.
Step 1: Designate a Data Lake Administrator
Start by assigning yourself as the data lake administrator to gain access to all Lake Formation resources.
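The console step above can also be sketched with the AWS SDK for Python. The account ID and user name below are placeholders, and the actual API call requires AWS credentials, so it is defined but not invoked here:

```python
def data_lake_admin_settings(admin_arn):
    """Build the DataLakeSettings payload that names one principal as admin."""
    return {"DataLakeAdmins": [{"DataLakePrincipalIdentifier": admin_arn}]}

def designate_admin(admin_arn):
    """Apply the settings; requires AWS credentials, so it is not run here."""
    import boto3  # deferred import so the sketch reads without the SDK installed
    lakeformation = boto3.client("lakeformation")
    lakeformation.put_data_lake_settings(
        DataLakeSettings=data_lake_admin_settings(admin_arn)
    )

# Example principal ARN (placeholder account ID):
admin_arn = "arn:aws:iam::111122223333:user/datalakeadmin"
```

Note that put_data_lake_settings replaces the existing admin list rather than appending to it, so include any existing administrators in the payload.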
Step 2: Register an Amazon S3 Path
Next, register an Amazon S3 path to house your data within the data lake.
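Registering the path programmatically is a sketch along these lines; the bucket name is the placeholder from the prerequisites, and the call itself needs credentials, so it is left un-invoked:

```python
def s3_resource_arn(bucket, prefix=""):
    """Translate an S3 bucket (and optional prefix) into the ARN Lake Formation expects."""
    arn = f"arn:aws:s3:::{bucket}"
    return f"{arn}/{prefix}" if prefix else arn

def register_s3_path(bucket, prefix=""):
    """Register the path with Lake Formation; requires AWS credentials."""
    import boto3
    lakeformation = boto3.client("lakeformation")
    lakeformation.register_resource(
        ResourceArn=s3_resource_arn(bucket, prefix),
        # Use the AWSServiceRoleForLakeFormationDataAccess service-linked role
        # to read the location on your behalf:
        UseServiceLinkedRole=True,
    )
```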
Step 3: Create a Database
Create a database in the AWS Glue Data Catalog for the zipcode table definitions:
- For Database, use zipcode-db.
- For Location, enter your S3 bucket path followed by /zipcode.
- For New tables in this database, do not select “Grant All to Everyone.”
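The same database can be created through the AWS Glue API; a minimal sketch, assuming the placeholder bucket name from the prerequisites:

```python
def zipcode_database_input(bucket):
    """DatabaseInput for glue.create_database, mirroring the console fields above."""
    return {
        "Name": "zipcode-db",
        "LocationUri": f"s3://{bucket}/zipcode",
    }

def create_database(bucket):
    """Create the database in the Data Catalog; requires AWS credentials."""
    import boto3
    glue = boto3.client("glue")
    glue.create_database(DatabaseInput=zipcode_database_input(bucket))
```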
Step 4: Grant Permissions
Grant permissions for AWS Glue to access the zipcode-db database. Choose your IAM user and the AWSGlueServiceRoleDefault IAM role.
Grant your user and AWSServiceRoleForLakeFormationDataAccess permission to use the specified data location:
- For IAM role, select your user and AWSServiceRoleForLakeFormationDataAccess.
- For Storage locations, use s3://datalake-yourname-region.
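The data location grant above maps to a single lakeformation.grant_permissions call; the principal ARN below is a placeholder, and the call is left un-invoked since it requires credentials:

```python
def data_location_grant(principal_arn, bucket):
    """Keyword arguments for lakeformation.grant_permissions on a storage location."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"DataLocation": {"ResourceArn": f"arn:aws:s3:::{bucket}"}},
        "Permissions": ["DATA_LOCATION_ACCESS"],
    }

def grant_data_location(principal_arn, bucket):
    """Apply the grant; requires AWS credentials."""
    import boto3
    lakeformation = boto3.client("lakeformation")
    lakeformation.grant_permissions(**data_location_grant(principal_arn, bucket))
```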
Step 5: Crawl the Data Using AWS Glue
In this step, a crawler connects to the data store and determines the schema for your data, creating metadata tables in the AWS Glue Data Catalog. Configure the crawler with the following settings:
- Crawler name: zipcodecrawler.
- Data stores: Select this field.
- Choose a data store: Select S3.
- Specified path: Select this field.
- Include path: s3://datalake-yourname-region/zipcode.
- Choose No for adding another data store.
- IAM role: Select AWSGlueServiceRoleDefault.
- Run on demand: Select this field.
- Database: Select zipcode-db.
Run the crawler and wait for it to complete before proceeding.
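The crawler settings above can be expressed as one glue.create_crawler call followed by glue.start_crawler; role and bucket names are the placeholders used throughout this guide, and the API-calling function is not invoked here:

```python
def zipcode_crawler_config(role, bucket):
    """Arguments for glue.create_crawler, mirroring the console settings above."""
    return {
        "Name": "zipcodecrawler",
        "Role": role,  # e.g. AWSGlueServiceRoleDefault
        "DatabaseName": "zipcode-db",
        "Targets": {"S3Targets": [{"Path": f"s3://{bucket}/zipcode"}]},
    }

def create_and_run_crawler(role, bucket):
    """Create the crawler and run it on demand; requires AWS credentials."""
    import boto3
    glue = boto3.client("glue")
    glue.create_crawler(**zipcode_crawler_config(role, bucket))
    glue.start_crawler(Name="zipcodecrawler")
    # Poll glue.get_crawler until the state returns to READY before moving on.
```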
Step 6: Grant Access to Table Data
Set up permissions in the AWS Glue Data Catalog for others to manage the data. Use the Lake Formation console to manage access to tables within the database:
- In the navigation pane, select Tables.
- Choose Grant and fill in the required fields:
- For IAM role, select your user and AWSGlueServiceRoleDefault.
- For Table permissions, select “Select all.”
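Programmatically, this is another grant_permissions call, this time against a Table resource. Assuming "Select all" in the console corresponds to granting the full set of table permissions (an interpretation, not confirmed by the source), a sketch looks like:

```python
def table_grant(principal_arn):
    """grant_permissions arguments for full access to the zipcode table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": "zipcode-db", "Name": "zipcode"}},
        # Assumed expansion of the console's "Select all" checkbox:
        "Permissions": ["SELECT", "INSERT", "DELETE", "ALTER", "DROP"],
    }

def grant_table_access(principal_arn):
    """Apply the grant; requires AWS credentials, so it is not run here."""
    import boto3
    boto3.client("lakeformation").grant_permissions(**table_grant(principal_arn))
```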
Step 7: Query Data with Athena
Next, use Athena to query the data in your data lake. In the Athena console, open the Query Editor and choose the zipcode-db database.
Choose Tables and select the zipcode table. Use the table options to preview the table. You also need to specify a destination for the query results, which is a one-time setup: navigate to Settings, choose Manage, and select an S3 location for query results.
Run the query:
SELECT * FROM "zipcode-db"."zipcode" LIMIT 10;
Alternatively, you can select the Database from the left side of the Athena console and omit the database name from the query. The screenshot below illustrates how the datalakeadmin user can access all data.
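The same query can be submitted through the Athena API; the results location is whatever S3 path you configured above, and the submitting function is left un-invoked since it needs credentials:

```python
# The query from the console step above:
QUERY = 'SELECT * FROM "zipcode-db"."zipcode" LIMIT 10;'

def run_query(output_location):
    """Submit the query to Athena; results land at the given S3 location."""
    import boto3
    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=QUERY,
        # Setting the execution context is what lets you omit the database
        # prefix from the query, as the console does:
        QueryExecutionContext={"Database": "zipcode-db"},
        ResultConfiguration={"OutputLocation": output_location},
    )
    # Poll athena.get_query_execution with this ID to detect completion.
    return response["QueryExecutionId"]
```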
Step 8: Add a User with Restricted Access
As the data lake administrator, you can create a user whose access is limited to specific columns. In the IAM console, create an IAM user named user1:
- Go to IAM and add a new user.
- Attach the AdministratorAccess and AWSLakeFormationDataAdmin policies.
- Complete the user creation process and note the sign-in URL, or email the sign-in details for reference.
In the Lake Formation console, grant permissions to user1 with these configurations:
- Database: Select zipcode-db.
- IAM user: Choose user1.
- Table: Select zipcode.
- Revoke IAMAllowedPrincipals.
- Include columns: Choose “Jurisdiction name” and “Count participants.”
- Set Table permissions and Grantable permissions as needed.
Be sure to clear both checkboxes under Default permissions for newly created databases and tables and select Version 4 for Cross-account version settings, then click Save.
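The column-level grant above corresponds to a TableWithColumns resource in grant_permissions. The column names below are assumptions based on how the Glue crawler typically normalizes CSV headers to lowercase with underscores; check the crawled table schema for the actual names:

```python
def column_grant(principal_arn, columns):
    """SELECT on specific columns only; all other columns stay hidden."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": "zipcode-db",
                "Name": "zipcode",
                "ColumnNames": columns,
            }
        },
        "Permissions": ["SELECT"],
    }

# Placeholder account ID; column names assume crawler-normalized headers:
grant = column_grant(
    "arn:aws:iam::111122223333:user/user1",
    ["jurisdiction_name", "count_participants"],
)
```

Applying it is the same boto3.client("lakeformation").grant_permissions(**grant) call shown in the earlier steps.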
To verify the restricted permissions, log in as user1 and repeat step 7. As shown in the screenshot, user1 can only see the columns they were granted access to by the datalakeadmin user.
Conclusion
This guide has demonstrated how to securely build a data lake using Lake Formation, which provides the tools for governance, semantic consistency, and access control that effective data management requires.