Streamlined Authentication with Native LDAP Integration on Amazon EMR

Many organizations store their corporate identities in identity providers (IdPs) such as Active Directory (AD) or OpenLDAP. Previously, customers using Amazon EMR could connect their clusters to Active Directory by establishing a one-way realm trust between their AD domain and the EMR cluster Kerberos realm. For more information, refer to the tutorial on configuring a cross-realm trust with an Active Directory domain. This configuration has been crucial for enabling corporate users and groups to access EMR clusters and for defining access control policies on data, including through Amazon EMR's native Apache Ranger integration.

While this option remains available, Amazon EMR now supports native LDAP authentication, a new security feature that simplifies integration with OpenLDAP and Active Directory. This feature allows for:

  • Automatic configuration of security for supported applications (HiveServer2, Trino, Presto, and Livy) to use the Kerberos protocol internally while employing LDAP for external authentication. This simplifies integration for external tools, which no longer require Kerberos setup but can instead use LDAP credentials.
  • Fine-grained access control (FGAC) for SSH access to EMR clusters.
  • Fine-grained authorization on Hive Metastore databases and tables when used with the native Amazon EMR Apache Ranger integration.

In this post, we explore how Amazon EMR LDAP authentication works: the authentication flow, how to retrieve the necessary LDAP configuration values, and how to verify that an EMR cluster is properly integrated with LDAP.

Using the insights from this blog, teams managing EMR clusters can collaborate more effectively with their LDAP IdP administrators to gather the necessary information and conduct pre-configuration tests. Additionally, EMR cluster end-users will understand how easily they can connect to LDAP-enabled EMR clusters compared to the previous Kerberos-based method.

How Amazon EMR LDAP Integration Functions

When discussing authentication within EMR frameworks, we can identify two levels:

  • External authentication: This is utilized by users and external components to interact with the installed frameworks.
  • Internal authentication: This is used within the frameworks to authenticate communications between internal components.

With the new feature, internal framework authentication continues to be managed via Kerberos, but this is transparent to end users and external services, which authenticate with a username and password. The supported EMR frameworks employ an LDAP-based authentication method that validates the provided credentials against the LDAP endpoint and, upon success, grants access to the framework.

The authentication workflow comprises the following steps:

  1. A user connects to one of the supported endpoints (e.g., HiveServer2, Trino/Presto Coordinator, or Hue WebUI) and enters their corporate credentials.
  2. The corresponding framework uses a custom authenticator that communicates with the EMR Secret Agent service running on the cluster instances.
  3. The EMR Secret Agent service verifies the credentials against the LDAP endpoint.
  4. If successful:
    • A Kerberos principal is created for the user on the cluster’s MIT Key Distribution Center (MIT KDC).
    • The Kerberos principal keytab is generated in the user’s home directory on the primary node.

Once authentication is complete, users can begin using the framework. The SSSD service on all cluster instances retrieves users and groups from the LDAP endpoint, making them available as system users.
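To make the sequence of steps easier to follow, here is a toy, in-memory model of the workflow. It is only an illustration: the class name, the realm, and the keytab path are hypothetical, and nothing in it reflects the actual implementation of the EMR Secret Agent service, LDAP, or the MIT KDC.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ToyCluster:
    """In-memory stand-in for the components named above; not EMR code."""

    # Stand-in for the LDAP endpoint: username -> password.
    ldap_passwords: Dict[str, str]
    # Stand-in for the cluster's MIT KDC: the set of known principals.
    kdc_principals: Set[str] = field(default_factory=set)
    # Stand-in for keytab files: username -> keytab path.
    keytabs: Dict[str, str] = field(default_factory=dict)
    # Hypothetical realm name, used only to build principal strings.
    realm: str = "EXAMPLE.REALM"

    def authenticate(self, user: str, password: str) -> bool:
        # Step 3: check the credentials against the (simulated) LDAP endpoint.
        if self.ldap_passwords.get(user) != password:
            return False
        # Step 4: on success, register a principal and record a keytab.
        self.kdc_principals.add(f"{user}@{self.realm}")
        # Hypothetical path; the real keytab location may differ.
        self.keytabs[user] = f"/home/{user}/{user}.keytab"
        return True
```

A failed login leaves the simulated KDC untouched, which mirrors the point that the principal and keytab are only created after LDAP accepts the credentials.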

The SSH connection authentication process differs slightly and follows these steps:

  1. A user connects via SSH to the primary EMR instance, entering their corporate credentials.
  2. The SSHD service utilizes the SSSD service to confirm the provided credentials.
  3. The SSSD service checks these credentials against the LDAP endpoint. Upon success, the user accesses their home directory and can use various CLIs (beeline, trino-cli, presto-cli, curl) to interact with Hive, Trino/Presto, or Livy.
  4. For Spark CLIs (spark-submit, pyspark, spark-shell), the user must run the ldap-kinit script and enter their credentials.
  5. The EMR Secret Agent service again validates these credentials against the LDAP endpoint. On success:
    • A Kerberos principal is created for the user on the cluster’s MIT KDC.
    • The Kerberos principal keytab is stored in the user’s home directory.
    • A Kerberos ticket is obtained and kept in the user’s Kerberos ticket cache on the primary node.

After executing the ldap-kinit script, users can start using Spark CLIs. The following sections will guide you on how to find the required LDAP settings and demonstrate how to launch a cluster with EMR LDAP authentication and test it.

Finding the Necessary LDAP Parameters

To set up LDAP authentication for Amazon EMR, the first step is to obtain the LDAP properties needed for your cluster configuration. You will require:

  • The LDAP server DNS name
  • A PEM-formatted certificate for Secure LDAP (LDAPS) communication with the LDAP endpoint
  • The LDAP user search base, which specifies a path on the LDAP tree for user searches (only users from this branch will be retrieved)
  • The LDAP groups search base, indicating a path on the LDAP tree for group searches (only groups from this branch will be retrieved)
  • LDAP server bind user credentials, which include a username and password for a service account (typically referred to as a bind user) to execute LDAP queries and retrieve user information like usernames and group memberships.
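Before handing these values to a cluster configuration, it can help to collect them in one place and run a few basic sanity checks. The sketch below is one possible way to do that in Python. The class and field names are hypothetical and are not Amazon EMR configuration keys. The certificate check only confirms PEM framing and a decodable base64 payload; it does not validate the certificate itself.

```python
import ssl
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LdapSettings:
    """Hypothetical container for the values listed above."""

    server_dns_name: str
    certificate_pem: str
    user_search_base: str
    group_search_base: str
    bind_user: str
    bind_password: str

    def problems(self) -> List[str]:
        """Return readable issues; an empty list means the basic checks passed."""
        issues = []
        # Every value is required, so flag anything left blank.
        for name, value in vars(self).items():
            if not value.strip():
                issues.append(f"{name} is empty")
        if self.certificate_pem.strip():
            try:
                # Checks the BEGIN/END CERTIFICATE framing and base64 payload only.
                ssl.PEM_cert_to_DER_cert(self.certificate_pem.strip())
            except ValueError as exc:
                issues.append(f"certificate_pem is not PEM-formatted: {exc}")
        # A search base is a distinguished name, so it should contain "=".
        for name in ("user_search_base", "group_search_base"):
            base = getattr(self, name)
            if base.strip() and "=" not in base:
                issues.append(f"{name} does not look like a distinguished name")
        return issues
```

Calling `problems()` on a populated `LdapSettings` instance gives a list that can be shared with the LDAP administrator before any cluster is launched.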

For Active Directory setups, an AD admin can directly gather this information using the Active Directory Users and Computers tool. By selecting a user in this tool, you can view related attributes (e.g., distinguishedName).

For example, the distinguishedName for the user “john” might look like this: CN=john,OU=users,OU=italy,OU=emr,DC=awsemr,DC=com. This indicates that john belongs to the following search bases, listed from the most specific to the broadest:

  • OU=users,OU=italy,OU=emr,DC=awsemr,DC=com
  • OU=italy,OU=emr,DC=awsemr,DC=com
  • OU=emr,DC=awsemr,DC=com
  • DC=awsemr,DC=com
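The list above can also be derived mechanically from a distinguishedName. The following minimal sketch assumes a DN whose components are separated by commas, with a backslash escaping any comma that is part of a value. The function name is illustrative.

```python
import re
from typing import List

# Split on commas that are not escaped with a backslash (e.g. "CN=Doe\, John").
_UNESCAPED_COMMA = re.compile(r"(?<!\\),")


def candidate_search_bases(distinguished_name: str) -> List[str]:
    """List the containers above a DN, from most specific to broadest."""
    rdns = [part.strip() for part in _UNESCAPED_COMMA.split(distinguished_name)]
    bases = []
    # Start at index 1 to skip the entry's own leading component (e.g. CN=john).
    for start in range(1, len(rdns)):
        bases.append(",".join(rdns[start:]))
        # The first DC= component marks the domain root, so stop there.
        if rdns[start].upper().startswith("DC="):
            break
    return bases


for base in candidate_search_bases("CN=john,OU=users,OU=italy,OU=emr,DC=awsemr,DC=com"):
    print(base)
```

For the example user, this prints the same four search bases shown in the list, in the same order.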

Depending on the number of entries in a company's LDAP directory, a broad search base can lead to performance issues, so it is important to choose the search base carefully.
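One way to get a feel for this trade-off is to count how many entries fall under each candidate search base. The sketch below does this over a small, made-up list of distinguished names. The sample entries and the function name are illustrative only, and the comparison assumes DNs written without spaces after commas.

```python
from typing import Iterable


def count_entries_under(base: str, entry_dns: Iterable[str]) -> int:
    """Count entries whose DN sits below the given search base (case-insensitive)."""
    suffix = "," + base.lower()
    return sum(1 for dn in entry_dns if dn.lower().endswith(suffix))


# Made-up directory content, for illustration only.
sample_directory = [
    "CN=john,OU=users,OU=italy,OU=emr,DC=awsemr,DC=com",
    "CN=maria,OU=users,OU=italy,OU=emr,DC=awsemr,DC=com",
    "CN=liam,OU=users,OU=france,OU=emr,DC=awsemr,DC=com",
    "CN=printer01,OU=devices,DC=awsemr,DC=com",
]

for base in (
    "OU=users,OU=italy,OU=emr,DC=awsemr,DC=com",
    "OU=emr,DC=awsemr,DC=com",
    "DC=awsemr,DC=com",
):
    print(base, count_entries_under(base, sample_directory))
```

In this sample, the narrowest base matches two entries, while the domain root matches all four, including a device that is not a user. In a real directory, the same counts could be gathered by the LDAP administrator to compare candidate search bases.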
