Establishing a Robust Key Distribution Center for Amazon EMR

High availability (HA) refers to the ability of a system or service to remain operational without interruptions over a specified timeframe. Implementing HA features in a system helps eliminate single points of failure, which can lead to service outages and potential business losses. The fundamental concept behind fault tolerance and high availability is relatively simple: redundancy is achieved by using multiple machines for a specific service. This ensures that if one machine fails, others can manage the load. However, achieving this across distributed technologies can be quite challenging.
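The redundancy argument above can be made concrete with a little arithmetic. The sketch below assumes that machines fail independently and that the service is up whenever at least one machine is up; both are simplifying assumptions, not claims about any particular deployment. The function name `combined_availability` is illustrative.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of a service backed by `replicas` identical machines.

    Assumes independent failures and that one healthy machine is enough
    to serve requests (both are simplifications).
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a probability between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The service is down only when every machine is down at once.
    return 1.0 - (1.0 - single) ** replicas


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, f"{combined_availability(0.99, n):.6f}")
```

Under these assumptions, two machines that are each available 99% of the time give roughly 99.99% combined availability. Real failures are often correlated (shared network, shared storage), so the figure is an upper bound rather than a guarantee.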

In the realm of Hadoop technologies, the concept of availability varies across different layers based on the frameworks in use. To develop a fault-tolerant system, one must consider several layers:

  • Data layer
  • Processing layer
  • Authentication layer

The first two layers are generally addressed through the built-in capabilities of the Hadoop framework (such as High Availability for HDFS or ResourceManager) or through specific framework features like HBase table replication for reliable reads. The authentication layer usually relies on the Kerberos protocol. While various implementations of Kerberos exist, Amazon EMR uses MIT Kerberos, the free implementation developed at the Massachusetts Institute of Technology.

Examining the standard setup for a Key Distribution Center (KDC), we notice a typical primary/secondary configuration where a primary KDC is configured along with one or more replicas to deliver some level of high availability. Nonetheless, this arrangement lacks an automatic failover mechanism to designate a new primary KDC in the event of a system failure, necessitating either manual intervention or a complicated automated process.
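Any automated failover process first has to detect that the primary KDC is unreachable. The following is a minimal sketch of that detection step only, assuming the KDCs accept TCP connections on port 88; the host names and the helper names `tcp_probe` and `first_reachable` are hypothetical. It does not promote a replica to primary, which is the harder part that the standard setup leaves to the operator.

```python
import socket
from typing import Callable, Iterable, Optional, Tuple

Endpoint = Tuple[str, int]


def tcp_probe(endpoint: Endpoint, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection(endpoint, timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def first_reachable(
    endpoints: Iterable[Endpoint],
    probe: Callable[[Endpoint], bool] = tcp_probe,
) -> Optional[Endpoint]:
    """Return the first endpoint that answers the probe, or None."""
    for endpoint in endpoints:
        if probe(endpoint):
            return endpoint
    return None


if __name__ == "__main__":
    # Hypothetical host names, listed in order of preference.
    kdcs = [("kdc1.example.internal", 88), ("kdc2.example.internal", 88)]
    print(first_reachable(kdcs))
```

The probe is injectable so the selection logic can be exercised without a network. A successful TCP connection only shows that something is listening; it does not prove the KDC can issue tickets.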

Utilizing AWS native services, we can enhance the capabilities of the MIT KDC, thereby increasing the resilience of our system against failures.

Highly Available MIT KDC

Amazon EMR offers various architectural options to enable Kerberos authentication, each designed to meet specific needs. Kerberos authentication can be activated by defining an Amazon EMR security configuration, a set of details stored within Amazon EMR that can be reused across multiple clusters.

When setting up an Amazon EMR security configuration, you can choose between a cluster-dedicated KDC or an external KDC, making it essential to understand the advantages and limitations of each approach.

Choosing a cluster-dedicated KDC means that Amazon EMR will install and configure an MIT KDC on the primary node of the cluster being launched. Conversely, using an external KDC means the cluster depends on a KDC outside its environment, which could be either a cluster-dedicated KDC running on another EMR cluster or one hosted on an Amazon EC2 instance or a container that you manage.
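To make the two options concrete, here is a sketch that builds the JSON body of a security configuration for each case. The key names reflect the EMR security configuration format as I understand it and should be checked against the current EMR documentation; the host name, the 24-hour ticket lifetime, and the function names are illustrative assumptions. Ports 88 and 749 are the conventional Kerberos KDC and kadmin ports.

```python
import json
from typing import Any, Dict


def cluster_dedicated_kdc(ticket_lifetime_hours: int = 24) -> Dict[str, Any]:
    """Security configuration body for a KDC installed on the cluster's primary node."""
    return {
        "AuthenticationConfiguration": {
            "KerberosConfiguration": {
                "Provider": "ClusterDedicatedKdc",
                "ClusterDedicatedKdcConfiguration": {
                    "TicketLifetimeInHours": ticket_lifetime_hours
                },
            }
        }
    }


def external_kdc(host: str, ticket_lifetime_hours: int = 24) -> Dict[str, Any]:
    """Security configuration body for a KDC that lives outside the cluster."""
    return {
        "AuthenticationConfiguration": {
            "KerberosConfiguration": {
                "Provider": "ExternalKdc",
                "ExternalKdcConfiguration": {
                    "KdcServerType": "Single",
                    "AdminServer": f"{host}:749",  # kadmin port
                    "KdcServer": f"{host}:88",     # KDC port
                    "TicketLifetimeInHours": ticket_lifetime_hours,
                },
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical host name for an externally managed KDC.
    print(json.dumps(external_kdc("kdc.example.internal"), indent=2))
```

The resulting JSON string is what you would supply when creating the security configuration, for example through the EMR console or the `create-security-configuration` API call.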

The cluster-dedicated KDC is a straightforward option that offloads the KDC service installation and configuration to the cluster itself. This choice doesn’t require extensive knowledge of the Kerberos system and might be suitable for testing environments. Moreover, a dedicated KDC within a cluster allows for the segregation of the Kerberos realm, thus providing an authentication system specifically for a certain team or department.

However, because the KDC resides on the EMR primary node, if you delete the cluster, the KDC will also be removed. In scenarios where multiple EMR clusters share the KDC (defined as an external KDC in their security configuration), this can compromise the authentication layer for those clusters, causing all Kerberos-enabled frameworks to fail. While this might be tolerable in testing, it is not advisable for production setups.

Given that the KDC’s lifecycle is not always tied to a specific EMR cluster, it is common to use an external KDC located on an EC2 instance or Docker container. This approach offers several advantages:

  • You can maintain end-user credentials in the Kerberos KDC instead of relying on Active Directory (though cross-realm trust can be established).
  • It facilitates communication across multiple EMR clusters, allowing all clusters to join the same Kerberos realm, thus creating a unified authentication system.
  • It eliminates reliance on the EMR primary node: terminating a cluster does not disrupt authentication for the other systems that use the KDC.
  • An external KDC is required for EMR clusters that run multiple primary nodes.

Nevertheless, a single-instance installation of an MIT KDC does not fulfill the HA requirements that production settings demand. The following section outlines how to implement a highly available MIT KDC using AWS services to bolster the resilience of the authentication system.

Architecture Overview

The architecture detailed in the following diagrams illustrates a highly available setup for our MIT Kerberos KDC across multiple Availability Zones, leveraging AWS services. We propose two architecture versions: one utilizing an Amazon Elastic File System (Amazon EFS) and the other based on an Amazon FSx for NetApp ONTAP file system.

Both services can be mounted on EC2 instances and utilized as local paths. While Amazon EFS is a more economical option, Amazon FSx for NetApp ONTAP delivers superior performance due to its sub-millisecond operational latency.

We conducted various tests to benchmark the solutions involving different file systems. The following graph displays the results with Amazon EMR 5.36, measuring the time in seconds for the cluster to reach full operational status when selecting Hadoop and Spark as frameworks.

From the results, it is evident that the Amazon EFS file system is adequate for handling smaller clusters (fewer than 100 nodes). However, as the cluster size increases, the latency introduced by lock operations on the NFS protocol can delay cluster launches. For instance, with clusters of 200 nodes, the delays can prevent some instances from joining the cluster in time, causing those instances to be terminated and replaced, which slows overall cluster provisioning. This is why we opted not to present any metrics for Amazon EFS with 200 cluster nodes in the preceding graph.

In contrast, Amazon FSx for NetApp ONTAP manages the increasing number of principals created during cluster provisioning more effectively, with reduced performance degradation compared to Amazon EFS. However, even with Amazon FSx for NetApp ONTAP, larger clusters may still face similar issues as described earlier for Amazon EFS. Therefore, thorough testing and evaluation are necessary for extensive cluster configurations.

By implementing a highly available infrastructure for your KDC, you can ensure robust authentication for your EMR environment.

