Automated Data Governance with AWS Glue Data Quality, Sensitive Data Detection, and AWS Lake Formation

Data governance is vital for maintaining the integrity, availability, usability, and security of organizational data. With the influx of diverse data types flowing into data lakes, establishing and upholding effective governance policies can be quite challenging. The two primary focuses of data governance are data confidentiality and data quality. Data confidentiality involves safeguarding sensitive information to prevent unauthorized access, particularly regarding personally identifiable information (PII). Data quality, by contrast, centers on ensuring that the data remains accurate, reliable, and consistent throughout the organization. Poor data quality can result in misguided decisions, inefficient processes, and diminished business performance.

Organizations must guarantee that data confidentiality is preserved across the data pipeline while ensuring that high-quality data is readily available for stakeholders. Currently, much of this process relies on manual efforts, where data owners and stewards are tasked with statically defining and implementing policies for each dataset. This approach can be cumbersome and hinder data adoption across the enterprise.

In this discussion, we will explore how to leverage AWS Glue along with AWS Glue Data Quality, sensitive data detection features, and AWS Lake Formation’s tag-based access control to automate data governance effectively.

Solution Overview

Let’s consider a fictional company, TechCorp. TechCorp operates multiple ingestion pipelines that feed various tables within its data lake. The organization aims to maintain governance over the data lake by consistently applying data quality rules and access policies.

Different personas access data from the lake, including business executives, data scientists, data analysts, and data engineers. Each user group requires different levels of governance. For instance, business executives need highly accurate data; data scientists should have restricted access to PII and need data that meets specific quality thresholds for model training; and data engineers can access all data except PII.

Presently, these requirements are hard-coded and managed manually, which TechCorp wishes to scale through automation. They are particularly interested in the following features:

  • When new data and tables are introduced to the data lake, governance policies (including data quality checks and access controls) should be applied automatically. Data should only be accessible to end-users once certified for consumption. For example, TechCorp wants to ensure basic data quality checks are enforced on all new tables and grant access based on data quality scores.
  • As source data changes, existing data profiles in the lake may shift. It is crucial to uphold governance as defined. For example, if sensitive data is detected in a previously public column, the column should be marked as sensitive, and access should be restricted accordingly (a tagging sketch follows this list).

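The second requirement can be automated by re-tagging resources as soon as a sensitive data detection run reports PII. The following is a minimal sketch using the boto3 Lake Formation API; the database, table, and column names are illustrative, the detection findings are assumed to come from an earlier AWS Glue sensitive data detection step, and the LF-Tag keys (tbl_class, col_class) are the ones introduced later in this post.

```python
import boto3

# Minimal sketch: once a sensitive data detection run reports PII in a column,
# re-tag the column and its parent table so the LF-Tag-based grants take effect.
# Database, table, and column names are illustrative.
lakeformation = boto3.client("lakeformation")

database_name = "techcorp_autogov_temp"
table_name = "customers"
pii_columns = ["email"]  # assumed output of a sensitive data detection run

# Mark the affected columns as sensitive
lakeformation.add_lf_tags_to_resource(
    Resource={
        "TableWithColumns": {
            "DatabaseName": database_name,
            "Name": table_name,
            "ColumnNames": pii_columns,
        }
    },
    LFTags=[{"TagKey": "col_class", "TagValues": ["sensitive"]}],
)

# Per the governance policy, any PII column also marks the whole table as sensitive
lakeformation.add_lf_tags_to_resource(
    Resource={"Table": {"DatabaseName": database_name, "Name": table_name}},
    LFTags=[{"TagKey": "tbl_class", "TagValues": ["sensitive"]}],
)
```
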
For this post, we will define the following governance policies:

  • No PII data should exist in tables or columns tagged as public.
  • If a column contains any PII data, it should be classified as sensitive, which will subsequently mark the entire table as sensitive.
  • The following data quality rules must be applied to all tables (a sample ruleset sketch follows this list):
    • All tables must include a minimum set of columns: data_key, data_load_date, and data_location.
    • The data_key must be unique and complete, while the data_location must align with entries in a designated reference table.
    • The data_load_date column must be complete.

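The rules above map directly onto an AWS Glue Data Quality ruleset written in the Data Quality Definition Language (DQDL). Below is a minimal sketch that registers such a ruleset against the example table with boto3; the ruleset name and the "reference" alias are assumptions, and the reference dataset holding the valid data_location values must be supplied as an additional data source when the ruleset is actually evaluated.

```python
import boto3

glue = boto3.client("glue")

# Sketch of the governance rules expressed in DQDL. The "reference" alias is an
# assumption for the reference dataset that lists the valid data_location values.
ruleset = """
Rules = [
    ColumnExists "data_key",
    ColumnExists "data_load_date",
    ColumnExists "data_location",
    IsUnique "data_key",
    IsComplete "data_key",
    IsComplete "data_load_date",
    ReferentialIntegrity "data_location" "reference.data_location" = 1.0
]
"""

# Register the ruleset against the example table (names are illustrative).
glue.create_data_quality_ruleset(
    Name="techcorp_baseline_rules",
    Description="Baseline governance rules applied to every new table",
    Ruleset=ruleset,
    TargetTable={
        "DatabaseName": "techcorp_autogov_temp",
        "TableName": "customers",
    },
)
```
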
User access to tables will be regulated according to the following categories:

| User | Can Access Sensitive Tables | Can Access Sensitive Columns | Min Data Quality Threshold Needed to Consume Data |
| --- | --- | --- | --- |
| Category 1 | Yes | Yes | 100% |
| Category 2 | Yes | No | 50% |
| Category 3 | No | No | 0% |

This post illustrates the use of AWS Glue Data Quality and sensitive data detection features, along with Lake Formation’s tag-based access control, to manage governance at scale.

The governance requirements outlined in the previous table will be translated into Lake Formation LF-Tags:

| IAM User | LF-Tag: tbl_class | LF-Tag: col_class | LF-Tag: dq_tag |
| --- | --- | --- | --- |
| Category 1 | sensitive, public | sensitive, public | DQ100 |
| Category 2 | sensitive, public | public | DQ100, DQ90, DQ80_90, DQ50_80 |
| Category 3 | public | public | DQ100, DQ90, DQ80_90, DQ50_80, DQ_LT_50 |

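A hedged sketch of creating these tags and one of the grants with boto3 follows; the principal ARN is a placeholder, and only the Category 1 grant is shown, since Category 2 and Category 3 follow the same pattern with their respective tag values.

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Register the LF-Tags and their allowed values (one-time setup per account and Region).
lf_tags = {
    "tbl_class": ["sensitive", "public"],
    "col_class": ["sensitive", "public"],
    "dq_tag": ["DQ100", "DQ90", "DQ80_90", "DQ50_80", "DQ_LT_50"],
}
for tag_key, tag_values in lf_tags.items():
    lakeformation.create_lf_tag(TagKey=tag_key, TagValues=tag_values)

# Grant Category 1 access through a tag expression matching the table above:
# sensitive and public tables and columns, but only data tagged DQ100.
# The IAM principal ARN is a placeholder.
lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:user/category1-user"
    },
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [
                {"TagKey": "tbl_class", "TagValues": ["sensitive", "public"]},
                {"TagKey": "col_class", "TagValues": ["sensitive", "public"]},
                {"TagKey": "dq_tag", "TagValues": ["DQ100"]},
            ],
        }
    },
    Permissions=["SELECT"],
)
```
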
In this article, we utilize AWS Step Functions to orchestrate the governance tasks; however, any orchestration tool of your choice can be employed. To simulate data ingestion, we will manually place files in an Amazon Simple Storage Service (Amazon S3) bucket. For simplicity, we will trigger the Step Functions state machine manually. In real applications, jobs can be integrated as part of a data ingestion pipeline, triggered by events like AWS Glue crawlers or Amazon S3 notifications, or scheduled based on your needs.
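
For the manual trigger, a minimal boto3 sketch could look like the following; the state machine ARN and the input payload are placeholders for whatever your orchestration expects.

```python
import json

import boto3

stepfunctions = boto3.client("stepfunctions")

# Kick off one governance run for a newly ingested table. The state machine ARN
# and the input fields are placeholders.
stepfunctions.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:111122223333:stateMachine:autogov-governance",
    input=json.dumps({"database": "techcorp_autogov_temp", "table": "customers"}),
)
```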

We will be working with an AWS Glue database called techcorp_autogov_temp and a target table named customers where we apply the governance rules. AWS CloudFormation will be used to provision the necessary resources, allowing you to manage AWS and third-party resources as code.

Prerequisites

Before proceeding, complete the following steps:

  1. Identify an AWS Region for resource creation and ensure consistency throughout the setup and verification process.
  2. Ensure you have a Lake Formation administrator role to run the CloudFormation template and grant permissions.

Sign in to the Lake Formation console and add yourself as a Lake Formation data lake administrator if you haven’t already been assigned admin rights. If this is your first time setting up Lake Formation in your chosen Region, follow the prompts in the pop-up window upon accessing the console.

Alternatively, add data lake administrators by selecting Administrative roles and tasks in the left pane of the Lake Formation console, then click Add administrators. Choose Data lake administrator, identify the users and roles, and confirm.

Deploy the CloudFormation Stack

Run the provided CloudFormation stack to create the solution resources. You will need to specify a unique bucket name and set passwords for three users reflecting distinct personas (Category 1, Category 2, and Category 3) used throughout this post. The stack provisions an S3 bucket for storing dummy data, AWS Glue scripts, results of sensitive data detection, and Amazon Athena queries.
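
If you prefer to launch the stack programmatically instead of from the CloudFormation console, a hedged boto3 sketch follows; the stack name, template URL, and parameter keys are placeholders, so check the provided template for the actual parameter names.

```python
import boto3

cloudformation = boto3.client("cloudformation")

# Launch the stack with the required inputs. The template URL and parameter keys
# below are placeholders; the provided template defines the real names.
cloudformation.create_stack(
    StackName="autogov-demo",
    TemplateURL="https://example-bucket.s3.amazonaws.com/autogov-template.yaml",
    Parameters=[
        {"ParameterKey": "DataLakeBucketName", "ParameterValue": "my-unique-autogov-bucket"},
        {"ParameterKey": "Category1UserPassword", "ParameterValue": "REPLACE_ME"},
        {"ParameterKey": "Category2UserPassword", "ParameterValue": "REPLACE_ME"},
        {"ParameterKey": "Category3UserPassword", "ParameterValue": "REPLACE_ME"},
    ],
    Capabilities=["CAPABILITY_NAMED_IAM"],
)
```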
