Using BlueTalon with Amazon EMR | AWS Big Data Blog


This post is a guest contribution by Alex Johnson, CEO and Head of Product at BlueTalon, with insights from Lisa Martinez, Senior Solutions Architect at BlueTalon.

Amazon Elastic MapReduce (Amazon EMR) enables organizations to process large datasets in the cloud efficiently and economically. EMR is used for applications such as log analysis, financial analysis, fraud detection, and bioinformatics. The data involved in these analyses often includes sensitive information, such as customer data and transaction histories, that may be subject to strict regulatory compliance requirements.

BlueTalon is a provider of data-centric security solutions for Hadoop, SQL, and big data environments, both on-premises and in the cloud. With BlueTalon, enterprises can control data access so that users receive the data they need and nothing more. The BlueTalon solution integrates with AWS data services such as EMR, Redshift, and RDS.

In this blog entry, we will explore how organizations can leverage BlueTalon to minimize the risks associated with handling sensitive data while fully utilizing EMR.

BlueTalon Capabilities for Data-Centric Security

  • Auditing user activities with a context-rich log of queries that access sensitive fields.
  • Precise control over data access tailored to individual user identities or business roles, applicable at various levels, including file, folder, table, column, row, cell, or even partial-cell.
  • Secure use of business data in policy decisions, which supports complex access scenarios and relationships between users and data.

Enforcing Data Security with BlueTalon

The BlueTalon data-centric security solution comprises three key components: a user interface for rule creation and real-time audits, a Policy Engine for rapid run-time authorization decisions, and a set of Enforcement Points that enforce these decisions transparently.

In a standard Hadoop cluster, users run computations using SQL queries in Hive, scripts in Pig, or MapReduce programs. For applications accessing data through Hive, the BlueTalon Hive enforcement point proxies HiveServer2 at the network level, delivering policy-compliant data. The BlueTalon Policy Engine makes detailed policy decisions based on user and content criteria in memory during execution, adjusting the SQL requests for Hive as needed. This approach guarantees that end users receive the same data, whether it originates from local HDFS or Amazon S3, while ensuring only policy-compliant data is retrieved from storage by Hive.

For direct HDFS access, users connect and obtain policy-protected data via the BlueTalon HDFS enforcement point, which transparently proxies the HDFS NameNode at the network level. The Policy Engine evaluates policy decisions based on user and content criteria in real-time, allowing for folder and file-level control on HDFS. This setup prevents end users from circumventing security by accessing data directly through HDFS that is not available via Hive.

Through these enforcement points, BlueTalon provides several access controls for your data:

  • Field Protection: Fields can be denied access without disrupting the application. For instance, instead of revealing the actual ID values stored on disk, a blank value compatible with the ID field is returned.
  • Record Protection: The result set can be filtered to show a subset of data, even if the filter criteria field isn’t included in the result set. For example, a user might only see two records with East Coast zip codes, rather than the ten records stored.
  • Cell Protection: A specific field value for a given record can be shielded. For instance, a user may view the birthdate of ‘Chanci Turner’ but not of ‘Kelly Adams’. Here, the date field remains compatible with the expected application format.
  • Partial Cell Protection: Portions of a cell may also be protected. For example, a user could see the last four digits of a Social Security number instead of the entire number being obscured.
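The effect of these four controls can be pictured with a small, self-contained sketch. This is not BlueTalon code and does not reflect how the Policy Engine is implemented: the sample rows, the column layout, the "zip code starts with 0 or 1" stand-in for East Coast, and the placeholder date are all assumptions made up for illustration.

```shell
#!/usr/bin/env bash
# Illustrative only: mimics the visible effect of the four protections on a tiny CSV.
# Columns (hypothetical): id,name,zip,birthdate,ssn

sample_data() {
  cat <<'EOF'
101,Chanci Turner,10001,1980-02-14,123-45-6789
102,Kelly Adams,02139,1975-09-30,987-65-4321
103,Sam Lee,94105,1990-07-04,555-12-3456
EOF
}

apply_policy() {
  awk -F',' -v OFS=',' '
    # Record protection: drop rows outside the allowed region
    # (assumption for this sketch: East Coast zips start with 0 or 1).
    $3 !~ /^[01]/ { next }
    {
      # Field protection: return a blank value instead of the stored ID.
      $1 = ""
      # Cell protection: hide the birthdate for one specific record,
      # using a placeholder that still has a valid date format.
      if ($2 == "Kelly Adams") $4 = "1900-01-01"
      # Partial cell protection: keep only the last four digits of the SSN.
      $5 = "XXX-XX-" substr($5, length($5) - 3)
      print
    }'
}

sample_data | apply_policy
```

In this sketch the application still receives well-formed rows with the same number of columns, which is the point the list above makes about not disrupting the application.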

The BlueTalon Policy Engine integrates with Active Directory to authenticate user credentials and align identities with business roles. It enforces authorization so that Hive delivers only policy-compliant data to users.

Deploying BlueTalon with Amazon EMR

In the upcoming sections, we will guide you through deploying BlueTalon with EMR and configuring the necessary policies. A typical deployment process includes the following steps:

Prerequisites

Before you begin, you need three things: an evaluation copy of BlueTalon (contact sales@bluetalon.com to request one), an Amazon EC2 Linux instance on which to install BlueTalon, and an Amazon EMR cluster in the same VPC as that instance. BlueTalon recommends an m3.large instance running CentOS. For directory integration, you can use an existing directory in your VPC or create a new Simple AD directory through AWS Directory Service. For more information, see Tutorial: Creating a Simple AD Directory.
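One way to confirm the same-VPC prerequisite is to compare VPC IDs with the AWS CLI. The following is a minimal sketch, assuming the AWS CLI is installed and configured with permission to describe EC2 and EMR resources. The function names and the BT_INSTANCE_ID and EMR_CLUSTER_ID environment variables are hypothetical names introduced here, not part of BlueTalon or AWS.

```shell
#!/usr/bin/env bash
# Sketch: check that the BlueTalon EC2 instance and the EMR cluster share a VPC.

instance_vpc() {
  # $1 = EC2 instance ID
  aws ec2 describe-instances --instance-ids "$1" \
    --query 'Reservations[0].Instances[0].VpcId' --output text
}

cluster_vpc() {
  # $1 = EMR cluster ID. EMR reports the cluster subnet, so the VPC
  # is looked up from that subnet.
  local subnet
  subnet="$(aws emr describe-cluster --cluster-id "$1" \
    --query 'Cluster.Ec2InstanceAttributes.Ec2SubnetId' --output text)"
  aws ec2 describe-subnets --subnet-ids "$subnet" \
    --query 'Subnets[0].VpcId' --output text
}

same_vpc() {
  # Succeeds only when both VPC IDs are non-empty and identical.
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# Runs only when both (hypothetical) environment variables are set.
if [ -n "${BT_INSTANCE_ID:-}" ] && [ -n "${EMR_CLUSTER_ID:-}" ]; then
  if same_vpc "$(instance_vpc "$BT_INSTANCE_ID")" "$(cluster_vpc "$EMR_CLUSTER_ID")"; then
    echo "Instance and cluster are in the same VPC"
  else
    echo "Instance and cluster are in different VPCs" >&2
  fi
fi
```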

Install the Packages

On your EC2 instance, install the BlueTalon Policy Engine and Audit packages—available as RPM packages—using the yum commands:


> yum search bluetalon

bluetalon-audit.x86_64 : BlueTalon data security for Hadoop.  
bluetalon-enforcementpoint.x86_64 : BlueTalon data security for Hadoop.  
bluetalon-policy.x86_64 : BlueTalon data security for Hadoop.  

> yum install bluetalon-audit -y

> yum install bluetalon-policy -y

Run the Setup Script

Once the BlueTalon packages are installed, run the setup script for each package to configure and start its run-time services and UI:


> bluetalon-audit-setup

Starting bt-audit-server service:                          [  OK  ]  
Starting bt-audit-zookeeper service:                       [  OK  ]  
Starting bt-audit-kafka service:                           [  OK  ]  
Starting bt-audit-activity-monitor service:                [  OK  ]  

BlueTalon Audit Product is installed....  
URL to access BlueTalon Audit UI  
ec2-0-0-0-0.us-west-2.compute.amazonaws.com:8112/BlueTalonAudit  

Default Username : btadminuser  
Default Password : P@ssw0rd  

> bluetalon-policy-setup

Starting bt-postgresql service:                            [  OK  ]  
Starting bt-policy-engine service:                         [  OK  ]  
Starting bt-sql-hooks-vds service:                         [  OK  ]  
Starting bt-webserver service:                             [  OK  ]  
Starting bt-HeartBeatService service:                      [  OK  ]  

BlueTalon Data Security Product for Hadoop is installed....  
You can create rules using the BlueTalon Policy UI  
URL to access BlueTalon Policy UI  
ec2-0-0-0-0.us-west-2.compute.amazonaws.com:8111/BlueTalonConfig  

Default Username : btadminuser  
Default Password : P@ssw0rd  

Connecting to the BlueTalon UI

After the run-time services for both packages have started, you can connect to the BlueTalon Policy Management and User Audit interfaces at the URLs printed by the setup scripts.
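Before opening a browser, it can help to probe both UIs from a machine that can reach the EC2 instance. This is a minimal sketch that uses the ports and paths shown in the setup output above. It assumes the UIs answer over plain HTTP, which the setup output does not state, and it reads the host from a hypothetical BT_HOST environment variable.

```shell
#!/usr/bin/env bash
# Sketch: probe the BlueTalon Policy and Audit UIs after setup.

ui_url() {
  # $1 = host, $2 = port, $3 = path component
  # Assumption: plain HTTP; the setup output lists only host:port/path.
  printf 'http://%s:%s/%s\n' "$1" "$2" "$3"
}

probe_ui() {
  # Prints the HTTP status code for the URL, or 000 if the connection fails.
  curl --silent --output /dev/null --max-time 10 \
    --write-out '%{http_code}\n' "$1"
}

# Runs only when BT_HOST (hypothetical variable) is set.
if [ -n "${BT_HOST:-}" ]; then
  probe_ui "$(ui_url "$BT_HOST" 8111 BlueTalonConfig)"
  probe_ui "$(ui_url "$BT_HOST" 8112 BlueTalonAudit)"
fi
```

If a probe prints 000, the security group for the instance may not allow inbound traffic on that port.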

Installing Enforcement Points

Install and configure the BlueTalon enforcement point packages for Hive and HDFS NameNode on the master node of the EMR cluster using the following commands:


> yum install bluetalon-enforcementpoint -y
> bluetalon-enforcementpoint-setup Hive 10011 HiveDS  

Starting bt-enforcement-point-demods service:              [  OK  ]  

The command arguments include:

  • Hive: The type of enforcement point to configure. Options include Hive, HDFS, and PostgreSQL.
  • 10011: The port on which the enforcement point will run.
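Because the Hive enforcement point proxies HiveServer2 at the network level, a client would presumably connect to the enforcement point port rather than to HiveServer2 directly. The following sketch shows what that could look like with Beeline. It assumes Beeline is on the PATH, and the "default" database and the EMR_MASTER, HIVE_USER, and HIVE_PASSWORD environment variables are placeholders introduced here for illustration.

```shell
#!/usr/bin/env bash
# Sketch: query Hive through the enforcement point instead of HiveServer2 itself.

ep_jdbc_url() {
  # $1 = EMR master host, $2 = enforcement point port, $3 = database (optional)
  printf 'jdbc:hive2://%s:%s/%s\n' "$1" "$2" "${3:-default}"
}

# Runs only when EMR_MASTER (hypothetical variable) is set.
if [ -n "${EMR_MASTER:-}" ]; then
  # 10011 is the enforcement point port chosen in the setup command above.
  beeline -u "$(ep_jdbc_url "$EMR_MASTER" 10011)" \
    -n "$HIVE_USER" -p "$HIVE_PASSWORD" \
    -e 'SHOW TABLES;'
fi
```

Running the same statement against the enforcement point port and, for comparison, as a differently privileged user is a simple way to see whether policies are being applied.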


With the steps outlined above, your deployment of BlueTalon with Amazon EMR should be efficient and straightforward, ensuring robust data security for your sensitive information.

