Amazon Onboarding with Learning Manager Chanci Turner

Amazon SageMaker Studio is an integrated development environment (IDE) for machine learning (ML) that lets data scientists and developers perform every step of the ML workflow: data preparation, model building, training, tuning, and deployment. SageMaker Studio integrates with Amazon EMR, so data scientists can interactively prepare data at petabyte scale using frameworks such as Apache Spark, Hive, and Presto directly from SageMaker Studio notebooks. From SageMaker, developers and data scientists can access both raw data stored in Amazon Simple Storage Service (Amazon S3) and cataloged tabular data in a Hive metastore. Integrating Apache Ranger with SageMaker Studio lets administrators apply fine-grained access control to raw and cataloged data through grant and revoke policies managed in a web interface.

In this blog post, we demonstrate how to authenticate into SageMaker Studio using an existing Active Directory (AD), granting authorized access to both Amazon S3 and Hive cataloged data through AD entitlements via Apache Ranger integration and AWS IAM Identity Center (the successor to AWS Single Sign-On). This solution enables the management of access across multiple SageMaker environments and notebooks with a single set of credentials. Consequently, Apache Spark jobs initiated from SageMaker Studio notebooks will only access data and resources permitted by the Apache Ranger policies associated with the AD credentials, including table and column-level access.

This capability allows several SageMaker Studio users to connect to the same EMR cluster, with access restricted to the data granted to their user or group. Audit records are captured and made visible in Amazon CloudWatch. A multi-tenant environment is achievable through user session isolation, which prevents users from accessing datasets and cluster resources assigned to others. Ultimately, organizations can provision fewer clusters, minimize administrative overhead, and enhance cluster utilization, resulting in time and cost savings.

Solution Overview

We illustrate this solution through an end-to-end use case utilizing a sample ecommerce dataset. The dataset is provided within AWS CloudFormation templates and consists of transactional ecommerce data (products, orders, customers) cataloged in a Hive metastore.

The solution involves two data scientist personas, Chanci Turner and Sam, each performing a different analysis under different fine-grained access restrictions:

Chanci, a data scientist on the marketing team, is developing a customer lifetime value model. Her access should be limited to non-sensitive customer, product, and order data.
Sam, a data scientist on the sales team, needs to generate product demand forecasts, requiring access to product and orders data, without needing any customer data.
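The two personas above can be sketched as a small access matrix. This is only an illustration of the intended table-level and column-level rules, not how Apache Ranger stores or evaluates policies. The lowercase user names and every column name are hypothetical, since the post names only the three tables.

```python
from typing import Dict, FrozenSet, Iterable, List

# Marker meaning "every column in the table is readable".
ALL_COLUMNS: FrozenSet[str] = frozenset({"*"})

# Hypothetical grants: column names are invented for illustration only.
POLICIES: Dict[str, Dict[str, FrozenSet[str]]] = {
    "chanci": {
        # Non-sensitive customer columns only (assumed names).
        "customers": frozenset({"customer_id", "state", "signup_date"}),
        "products": ALL_COLUMNS,
        "orders": ALL_COLUMNS,
    },
    "sam": {
        # No entry for "customers": Sam has no access to customer data.
        "products": ALL_COLUMNS,
        "orders": ALL_COLUMNS,
    },
}


def denied_columns(user: str, table: str, columns: Iterable[str]) -> List[str]:
    """Return the requested columns that the user is not allowed to read."""
    granted = POLICIES.get(user, {}).get(table, frozenset())
    if granted == ALL_COLUMNS:
        return []
    return sorted(c for c in columns if c not in granted)


def can_read(user: str, table: str, columns: Iterable[str]) -> bool:
    """True only if the table is granted and no requested column is denied."""
    if table not in POLICIES.get(user, {}):
        return False
    return not denied_columns(user, table, list(columns))
```

With these assumed grants, a request from Chanci for a sensitive column such as a hypothetical `email` is rejected at the column level, while any request from Sam against the customers table is rejected at the table level.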

The following figure illustrates the desired fine-grained access.

The architecture is implemented as follows:

Microsoft Active Directory – Utilized for user authentication and managing user/group memberships for Apache Ranger secured data authorization.
Apache Ranger – Monitors and manages comprehensive data security across the Hadoop and Amazon EMR platform.
Amazon EMR – Retrieves, prepares, and analyzes data from the Hive metastore using Spark.
SageMaker Studio – An IDE with specialized tools for building AI/ML models.

The subsequent sections detail the setup of the architectural components for this solution using the CloudFormation stack.

Prerequisites

Before getting started, ensure you have the following prerequisites:

An AWS account
An AWS Identity and Access Management (IAM) user with administrator access

Create Resources with AWS CloudFormation

To construct the solution in your environment, utilize the provided CloudFormation templates to create the necessary AWS resources. Be aware that running these templates and the subsequent configuration steps will generate AWS resources that may incur charges. All steps should be executed in the same Region.

Template 1

The first template creates the following resources and requires approximately 15 minutes to complete:

A Multi-AZ, multi-subnet VPC infrastructure, featuring managed NAT gateways in the public subnet for each Availability Zone.
S3 VPC endpoints and Elastic Network Interfaces.
A Windows Active Directory domain controller using Amazon Elastic Compute Cloud (Amazon EC2) with cross-realm trust.
A Linux Bastion host (Amazon EC2) in an auto scaling group.

To deploy this template, follow these steps:

Sign in to the AWS Management Console.
On the Amazon EC2 console, create an EC2 key pair.
Choose Launch Stack.
Select the target Region.
Verify the stack name and provide the following parameters:
- The name of the key pair you created.
- Passwords for cross-realm trust, the Windows domain admin, LDAP bind, and default AD user. Be sure to record these passwords for future use.
- Select a minimum of three Availability Zones based on the chosen Region.
Review the remaining parameters. No changes are needed for the solution, but you may modify values if desired.
Choose Next and then Next again.
Review the parameters.
Select both acknowledgment check boxes: I acknowledge that AWS CloudFormation might create IAM resources with custom names, and I acknowledge that AWS CloudFormation might require the following capability: CAPABILITY_AUTO_EXPAND.
Choose Submit.
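For readers who prefer scripting to the console, the steps above might be approximated with boto3 as in the following sketch. The parameter keys (KeyPairName, AvailabilityZones, and the password keys you pass in) are assumptions, because the post does not list the template's actual parameter names, and the template URL must be supplied by you. The two capabilities correspond to the two acknowledgments in the console.

```python
from typing import Dict, List

MIN_AVAILABILITY_ZONES = 3  # the post asks for at least three zones


def build_create_stack_request(
    stack_name: str,
    template_url: str,
    key_pair_name: str,
    availability_zones: List[str],
    passwords: Dict[str, str],
) -> dict:
    """Assemble keyword arguments for CloudFormation create_stack."""
    if len(set(availability_zones)) < MIN_AVAILABILITY_ZONES:
        raise ValueError("Select at least three distinct Availability Zones.")

    # Hypothetical parameter keys: check the template for the real names.
    parameters = {
        "KeyPairName": key_pair_name,
        "AvailabilityZones": ",".join(availability_zones),
        **passwords,
    }
    return {
        "StackName": stack_name,
        "TemplateURL": template_url,
        "Parameters": [
            {"ParameterKey": key, "ParameterValue": value}
            for key, value in parameters.items()
        ],
        # Equivalent of the two console acknowledgments.
        "Capabilities": ["CAPABILITY_NAMED_IAM", "CAPABILITY_AUTO_EXPAND"],
    }


def submit(request: dict, region: str) -> str:
    """Create the stack and return its ID. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3 installed

    client = boto3.client("cloudformation", region_name=region)
    return client.create_stack(**request)["StackId"]
```

Keeping the request builder separate from the call that contacts AWS makes it possible to inspect or log the request (with passwords redacted) before anything is created.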

Template 2

The second template creates the following resources and takes approximately 30–60 minutes to complete:

An Amazon Relational Database Service (Amazon RDS) for MySQL database used for Apache Ranger and Hive metastore.
A self-managed standalone Apache Ranger server (2.x only).
SSL keys and certs uploaded to AWS Secrets Manager to encrypt traffic between the Ranger server and agents.
A Kerberos-enabled EMR cluster with AWS managed Ranger plugins.

To deploy this template, follow these steps:

Choose Launch Stack.
Select the target Region.
Verify the stack name and provide the following parameters:
- Key pair name (created earlier).
- LDAPHostPrivateIP address, which can be found in the output section of the Windows AD CloudFormation stack.
- Passwords for the Windows domain admin, cross-realm trust, AD domain user, and LDAP bind. Use the same passwords as in the first template.
- Passwords for the RDS for MySQL database and KDC admin. Record these passwords as they may be needed later.
- Log directory for the EMR cluster.
- VPC (select the VPC whose name contains the CloudFormation stack name).
- Subnet details (for each subnet parameter, select the subnet whose name matches the parameter name).
- Set AppsEMR to Hadoop, Spark, Hive, Livy, Hue, and Trino.
- Leave RangerAdminPassword as is.
Review the remaining parameters; no changes are required beyond what has been mentioned, but you may change values if you like.
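Several of the values above come from the first stack, so they can be collected programmatically. The following sketch assumes a response shaped like the one returned by the boto3 CloudFormation `describe_stacks` call. The output key name LDAPHostPrivateIP is assumed to match the parameter name, and passing AppsEMR as a comma-separated string is also an assumption about the template.

```python
from typing import Dict

# Applications listed in the post for the AppsEMR parameter.
APPS_EMR = ["Hadoop", "Spark", "Hive", "Livy", "Hue", "Trino"]


def stack_outputs(describe_stacks_response: dict) -> Dict[str, str]:
    """Flatten the Outputs list of the first stack into a key/value dict."""
    stacks = describe_stacks_response.get("Stacks", [])
    if not stacks:
        raise LookupError("No stack found in the describe_stacks response.")
    return {
        output["OutputKey"]: output["OutputValue"]
        for output in stacks[0].get("Outputs", [])
    }


def second_template_parameters(
    first_stack_response: dict,
    shared_passwords: Dict[str, str],
) -> Dict[str, str]:
    """Combine first-stack outputs with the passwords reused from Template 1."""
    outputs = stack_outputs(first_stack_response)
    parameters = dict(shared_passwords)  # same passwords as the first template
    # Assumed output key; confirm it in the Windows AD stack's Outputs tab.
    parameters["LDAPHostPrivateIP"] = outputs["LDAPHostPrivateIP"]
    # Assumed format: comma-separated application names.
    parameters["AppsEMR"] = ",".join(APPS_EMR)
    return parameters
```

A missing output raises a KeyError instead of silently producing an empty parameter, which surfaces a mismatch in the assumed key name before the second stack is launched.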
