With the recent launch of Amazon EMR release 4.7.0, you can now create clusters running Apache Phoenix 4.7.0, which is designed for low-latency SQL and Online Transaction Processing (OLTP) workloads. Phoenix runs on top of Apache HBase (HBase 1.2.1 ships with Amazon EMR release 4.7.0), using HBase scan operations and coprocessors for fast performance. It can also map Phoenix tables and views onto existing HBase tables, giving you SQL access to data already stored in HBase.
Let’s walk through how to connect to Phoenix via JDBC, create a view over an existing HBase table, and create a secondary index for increased read performance.
Setting Up an Amazon EMR Cluster and HBase Table
Begin by launching a new Amazon EMR cluster with release 4.7.0 from the Amazon EMR console or the AWS CLI, making sure to select Phoenix as an application. Here’s an example AWS CLI command:
aws emr create-cluster --name PhoenixDemo --release-label emr-4.7.0 --instance-type m3.xlarge --instance-count 3 --applications Name=Phoenix --ec2-attributes KeyName=MyKeyName --use-default-roles
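If you’d rather script the launch, the same cluster can be requested from Python with boto3. This is a sketch: the key pair name is the placeholder from the CLI example above, and the two role names are the standard EMR defaults that `--use-default-roles` selects; substitute your own as needed.

```python
# Sketch: requesting the same 3-node Phoenix cluster with boto3.
# "MyKeyName" is a placeholder; the role names are the EMR defaults
# that the CLI's --use-default-roles flag would select.

def phoenix_cluster_request(key_name="MyKeyName"):
    """Build the run_job_flow arguments for the PhoenixDemo cluster."""
    return {
        "Name": "PhoenixDemo",
        "ReleaseLabel": "emr-4.7.0",
        "Applications": [{"Name": "Phoenix"}],
        "Instances": {
            "MasterInstanceType": "m3.xlarge",
            "SlaveInstanceType": "m3.xlarge",
            "InstanceCount": 3,
            "Ec2KeyName": key_name,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

def launch():
    import boto3  # imported lazily; requires configured AWS credentials
    emr = boto3.client("emr")
    return emr.run_job_flow(**phoenix_cluster_request())
```

Selecting Phoenix in `Applications` pulls in HBase and Hadoop the same way the console and CLI do.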
By choosing the Phoenix application, you automatically include HBase and Hadoop components (YARN, HDFS, and MapReduce), providing all necessary elements for a fully operational cluster.
Next, you will create an HBase table for use with Phoenix. You can copy an HBase snapshot from Amazon S3 and restore it on your cluster. Refer to the HBase post in the AWS Big Data Blog and follow the instructions in the “HBase shell query walkthrough” section to restore a table named customer (3,448,682 rows). Finally, execute a get request example from that blog to verify that your table has been restored correctly.
Connecting to Phoenix via JDBC and Creating a Table
Once your HBase table is prepared, it’s time to map a table in Phoenix to your HBase data. A JDBC connection is required to access Phoenix, and two drivers are included on your cluster under /usr/lib/phoenix/bin. The first is the Phoenix client, which connects directly to HBase processes to execute queries; if your client is off-cluster, this requires several ports to be open in your Amazon EC2 security group (for ZooKeeper, the HBase Master, and RegionServers on your cluster).
The second option is the Phoenix thin client, which connects to the Phoenix Query Server running on port 8765 on the master node of your EMR cluster. This lets you use a local client without modifying your Amazon EC2 security groups: create an SSH tunnel to the master node and use port forwarding for port 8765. Because the Phoenix Query Server is a relatively new component, not all SQL clients support the thin client yet.
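From Python, one option is the community `phoenixdb` package, which speaks the same Apache Avatica protocol as the thin client. The sketch below assumes an SSH tunnel is already forwarding local port 8765 to the master node; the helper functions are illustrative, not part of any Phoenix API.

```python
# Sketch: reaching Phoenix from Python through the Query Server, assuming an
# SSH tunnel forwards local port 8765 to the EMR master node. Uses the
# community `phoenixdb` package (pip install phoenixdb).

def query_server_url(host="localhost", port=8765):
    """Endpoint the thin client (and phoenixdb) connects to."""
    return "http://{}:{}/".format(host, port)

def row_count(table='"customer"'):
    """Run a COUNT(*) over the given Phoenix table via the Query Server."""
    import phoenixdb  # imported lazily; the URL helper works without it
    conn = phoenixdb.connect(query_server_url(), autocommit=True)
    try:
        cursor = conn.cursor()
        cursor.execute("SELECT COUNT(*) FROM {}".format(table))
        return cursor.fetchone()[0]
    finally:
        conn.close()
```

With the tunnel up, `row_count()` on the restored table should report the 3,448,682 rows mentioned earlier.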
In this example, you’ll use the SQLLine client included with Phoenix on the master node to connect to the Phoenix Query Server. Return to the terminal on the master node of your cluster (if you closed your SSH connection after creating your HBase table, open a new one). Connect to Phoenix with the following command:
/usr/lib/phoenix/bin/sqlline-thin.py http://localhost:8765
Once the SQLLine client is connected, create a SQL view over the customer table in HBase. A view is preferable to a table because dropping a view does not delete the underlying data in HBase (for Phoenix tables, whether dropping the table deletes the underlying HBase data is configurable). When mapping a pre-existing HBase table, use a "column_family"."column_name" format for each column you want in your Phoenix view (note that lowercase column and table names must be enclosed in double quotation marks). Also mark the HBase row key with PRIMARY KEY, and give the view the same name as the existing HBase table. Now, create a view over the customer table:
CREATE VIEW "customer" (
pk VARCHAR PRIMARY KEY,
"address"."state" VARCHAR,
"address"."street" VARCHAR,
"address"."city" VARCHAR,
"address"."zip" VARCHAR,
"cc"."number" VARCHAR,
"cc"."expire" VARCHAR,
"cc"."type" VARCHAR,
"contact"."phone" VARCHAR);
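The quoting rules are easy to trip over: Phoenix folds unquoted identifiers to upper case, so lowercase HBase names must be double-quoted. As an illustration (this helper is hypothetical, not part of Phoenix), here is how a DDL statement like the one above could be generated:

```python
# Illustrative helper (not part of Phoenix): generate a CREATE VIEW statement
# for an existing HBase table. Phoenix folds unquoted identifiers to upper
# case, so lowercase HBase names must be double-quoted to match.

def quote(ident):
    """Quote an identifier unless it is already all upper case."""
    return ident if ident.isupper() else '"{}"'.format(ident)

def create_view_ddl(table, pk, columns):
    """columns: list of (column_family, column_name) pairs, all VARCHAR here."""
    lines = ["{} VARCHAR PRIMARY KEY".format(pk)]
    lines += ["{}.{} VARCHAR".format(quote(fam), quote(col))
              for fam, col in columns]
    return "CREATE VIEW {} (\n  {});".format(quote(table), ",\n  ".join(lines))

ddl = create_view_ddl("customer", "pk",
                      [("address", "state"), ("cc", "type")])
```

Note that the unquoted `pk` column is stored by Phoenix as `PK`, which matters when you reference it later.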
Use SQLLine’s !tables command to list the available Phoenix tables and confirm that your newly created view is present. (Make sure your terminal window is wide enough before starting the SQLLine client, or the output columns may not display properly.)
Enhancing Query Performance with Secondary Indexes
To demonstrate the performance improvement, first run a SQL query that counts the number of customers with each credit card type in California:
SELECT "customer"."type" AS credit_card_type, count(*) AS num_customers FROM "customer" WHERE "customer"."state" = 'CA' GROUP BY "customer"."type";
However, because the WHERE clause does not filter on the HBase table’s primary key, Phoenix must scan every HBase row to make sure each row with state ‘CA’ is counted. If we expect our read patterns to filter by state, we can create a secondary index on that column, allowing Phoenix to read only the matching rows. For a full overview of the secondary indexing feature set, see the Apache Phoenix documentation. Now, create a covered secondary index on state that includes the HBase primary key (the customer ID), city, expiration date, and credit card type:
CREATE INDEX my_index ON "customer" ("customer"."state") INCLUDE("PK", "customer"."city", "customer"."expire", "customer"."type");
Phoenix runs a Hadoop MapReduce job to build this index and load it into HBase as a separate table (this takes around two minutes). After that, rerun the earlier SQL query and compare the performance; it should be at least 10 times faster.
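To see why the index helps, here is a toy model in Python (not Phoenix internals): it contrasts how many rows a full scan must examine with how many an index lookup touches, while producing the same aggregate.

```python
# Toy model (not Phoenix internals): why a covered index on "state" helps.
# A full scan examines every row; an index groups rows by the indexed
# column, so a filter on state touches only the matching rows.
from collections import Counter, defaultdict

rows = [
    {"pk": "1", "state": "CA", "type": "visa"},
    {"pk": "2", "state": "CA", "type": "amex"},
    {"pk": "3", "state": "WA", "type": "visa"},
    {"pk": "4", "state": "CA", "type": "visa"},
]

def full_scan(rows, state):
    """Examine every row, as Phoenix must without an index."""
    counts = Counter(r["type"] for r in rows if r["state"] == state)
    return counts, len(rows)  # rows examined = all of them

def build_index(rows):
    """Covered index: rows re-keyed by state, carrying the covered columns."""
    idx = defaultdict(list)
    for r in rows:
        idx[r["state"]].append(r)
    return idx

def index_scan(idx, state):
    """Examine only the matching slice of the index."""
    hits = idx[state]
    return Counter(r["type"] for r in hits), len(hits)
```

Because the index also carries the covered columns (city, expire, type), the query above never has to go back to the main table at all.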
Conclusion
In this post, you learned how to connect to Phoenix via JDBC, create Phoenix views over data in HBase, create secondary indexes for improved performance, and run queries. Phoenix provides an efficient SQL interface over existing HBase tables, or it can be used directly to manage tables with HBase as the underlying data store. To learn more about Phoenix, see the Amazon EMR documentation or the Apache Phoenix documentation. If you have questions about using Phoenix on Amazon EMR, or interesting use cases that leverage Phoenix, please leave a comment below.