Introduction
Hadoop offers a robust ecosystem of tools designed for extracting insights from data across various formats and sizes. While it initially focused on large-batch processing with tools like MapReduce, Pig, and Hive, Hadoop has evolved to include several tools for running interactive queries. This post explains how to use Amazon Elastic MapReduce (Amazon EMR) to analyze a dataset hosted on Amazon Simple Storage Service (Amazon S3), followed by visualizing the data with Tableau using Impala.
Amazon Elastic MapReduce
Amazon EMR is a web service that simplifies the process of efficiently and cost-effectively processing large data volumes. It employs Apache Hadoop, an open-source framework, to distribute and manage data across a resizable cluster of Amazon Elastic Compute Cloud (Amazon EC2) instances.
Impala
Impala is an open-source tool within the Hadoop ecosystem, available on EMR, that enables interactive, ad hoc querying using SQL syntax. Unlike Hive, which uses a MapReduce engine, Impala utilizes a massively parallel processing (MPP) engine akin to those found in traditional relational database management systems (RDBMS), resulting in quicker query response times.
Both Impala and Hive offer SQL-like capabilities and can share the same Metastore for metadata regarding tables and partitions; however, they serve different purposes. Impala generally provides faster query responses than Hive, making it more suitable for interactive data analysis tools like Tableau. That said, Impala consumes a significant amount of memory, which restricts the volume of data any query can handle. Conversely, Hive is more versatile in processing larger datasets with the same hardware, making it preferable for ETL workloads.
Tableau
Tableau Software is a business intelligence solution that facilitates continuous visual analysis, combining data analysis and reporting in an intuitive manner. Tableau delivers rapid analytics and visualization and integrates seamlessly with AWS services and various other sources. The latest version of Tableau Desktop allows users to connect to Hive or Impala on Amazon EMR via the ODBC driver for Amazon EMR. For further guidance on enabling Amazon EMR as a data source, feel free to reach out to Tableau.
In this blog post, we will illustrate how to enable Amazon EMR as a data source in Tableau and connect to Impala for creating an interactive visualization.
Using Amazon EMR to Analyze Google Books n-grams
The Google Books n-gram dataset is freely available through the AWS Public Data Sets on Amazon S3. N-grams are fixed-size tuples of items, where the “n” indicates the number of elements in the tuple—thus, a 5-gram contains five words or characters.
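To make the concept concrete (this is separate from the dataset itself), the following shell snippet builds 2-grams, i.e., pairs of adjacent words, from a sample sentence:

```shell
# Emit 2-grams (pairs of adjacent words) from the words given as arguments.
bigrams() {
  prev=""
  for word in "$@"; do
    if [ -n "$prev" ]; then
      echo "$prev $word"
    fi
    prev="$word"
  done
}

bigrams to be or not to be
# prints:
# to be
# be or
# or not
# not to
# to be
```

The dataset's 1-grams work the same way with n = 1: each record is a single token together with its yearly counts.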
Apache Hadoop traditionally operates with HDFS but also supports Amazon S3 as a file system. Currently, Impala mandates that data reside on HDFS, while Hive can directly query data on Amazon S3.
The Google Books n-gram dataset is formatted for Hadoop, with sizes reaching up to 2.2 TB. The files are stored in the SequenceFile format with block-level LZO compression. The SequenceFile key is the dataset’s row number stored as LongWritable, and the value is the raw data in TextWritable format.
Impala cannot create tables in, or insert data into, the SequenceFile format; on the query side it is limited to formats such as LZO-compressed Text. Hive handles SequenceFile efficiently, making it the ideal choice for transforming our data into a format compatible with Impala on HDFS.
Starting an Amazon EMR Cluster
We will begin by launching an Amazon EMR cluster equipped with Hive and Impala.
Launch the EMR cluster using the AWS CLI. If you're new to the CLI, AWS provides clear instructions for installation and configuration.
The following command sets up the EMR cluster and returns a unique identifier for your cluster:
aws emr create-cluster --name ImpalaCluster \
  --ami-version 3.1.0 \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium \
                    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.medium \
  --ec2-attributes KeyName=keyPairName,AvailabilityZone=availabilityZone \
  --applications Name=Hive Name=Impala \
  --no-auto-terminate
Note: Make sure to replace keyPairName and availabilityZone with appropriate values prior to executing the command. In subsequent steps, you will also need to substitute j-XXXXXXXXXXXX with the unique identifier returned by the command above.
The cluster should be ready within 5-10 minutes, indicated by its status changing to “Waiting.” To monitor the cluster’s status during initialization, run:
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXX --query 'Cluster.Status.State' --output text
Once your cluster is in the “WAITING” state, you can connect to the master node by using the command below:
aws emr ssh --cluster-id j-XXXXXXXXXXXX --key-pair-file keyFilePath
Note: Replace keyFilePath with the path to your private key file.
Creating the External Table from Data in Amazon S3
External data sources in Amazon EMR are referenced by establishing an EXTERNAL TABLE, which serves as a pointer to the data; creating it does not move or copy anything.
After logging into the master node, initiate the Hive shell:
$ hive
Define the source with a CREATE TABLE statement. For this example, we will utilize the English 1-grams dataset:
CREATE EXTERNAL TABLE eng_1M_1gram(token STRING, year INT, frequency INT, pages INT, books INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS SEQUENCEFILE LOCATION 's3://datasets.elasticmapreduce/ngrams/books/20090715/eng-1M/1gram';
Creating a Replica Table in HDFS
We will create a replica table to store the results in HDFS, which will also be required for Impala. In this replica table, we will opt for Parquet instead of Sequence File format, as Parquet is a column-oriented binary file format optimized for large-scale queries.
To create the replica table in Hive:
CREATE TABLE eng_1M_1gram_parquet(token STRING, year INT, frequency INT, pages INT, books INT) ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe' STORED AS inputformat 'parquet.hive.DeprecatedParquetInputFormat' outputformat 'parquet.hive.DeprecatedParquetOutputFormat';
Adjust the mapred.min.split.size setting because the data is stored in Amazon S3 as a single file:
set mapred.min.split.size=134217728;
This setting instructs Hive to segment the file into pieces of at least 128 MB for processing. This ensures that multiple mappers can be utilized, optimizing the distributed capabilities of MapReduce.
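The value 134217728 is simply 128 MB expressed in bytes, which you can confirm in the shell:

```shell
# 128 MB in bytes: 128 * 1024 * 1024 = 134217728
split_size=$((128 * 1024 * 1024))
echo "set mapred.min.split.size=$split_size;"
# prints: set mapred.min.split.size=134217728;
```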
Insert data into this table using a select query. The query reads from the raw data table and populates the new table:
INSERT OVERWRITE TABLE eng_1M_1gram_parquet SELECT lower(token), year, frequency, pages, books FROM eng_1M_1gram WHERE year >= 1890 AND token REGEXP "^[A-Za-z+'-]+$";
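The REGEXP predicate keeps only tokens composed of letters, plus signs, apostrophes, and hyphens. You can experiment with the same pattern locally using grep (the sample tokens here are made up):

```shell
# Tokens matching the pattern survive; digits and other punctuation are filtered out.
printf '%s\n' "hello" "don't" "well-known" "1984" "foo.bar" \
  | grep -E "^[A-Za-z+'-]+$"
# prints:
# hello
# don't
# well-known
```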
This query exemplifies a common use of Hive for transforming your data, making it more accessible for downstream querying with tools like Tableau.
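With the Parquet table populated, the final step is to query it interactively from Impala. The statements below are a sketch: INVALIDATE METADATA tells Impala to pick up tables created through the shared Hive Metastore, and the sample aggregation (the token 'example' is made up for illustration) shows the kind of interactive query you would then visualize in Tableau.

```
$ impala-shell
invalidate metadata;
SELECT year, SUM(frequency) AS occurrences
FROM eng_1M_1gram_parquet
WHERE token = 'example'
GROUP BY year
ORDER BY year;
```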