Amazon VGT2 Las Vegas: Running R on Amazon Athena

Amazon VGT2 Las Vegas: Running R on Amazon AthenaMore Info

Data scientists frequently face challenges managing the infrastructure of big data platforms while utilizing R for SQL queries. Amazon Athena simplifies this process by providing an interactive query service that directly interacts with data stored in S3, allowing users to analyze data using standard SQL without the burden of infrastructure management. By integrating R with Amazon Athena, data scientists gain a robust platform for developing interactive analytical solutions.

In this article, we will guide you through connecting R/RStudio on an Amazon EC2 instance with Athena.

Prerequisites

Before diving in, ensure you complete the following preparations:

  1. Request your AWS account administrator to grant your account the necessary permissions to access Athena through Amazon’s Identity and Access Management (IAM) console. This can be accomplished by attaching the relevant Athena policies to your data scientist user group in IAM.
  2. Specify a staging directory in the form of an Amazon S3 bucket. Athena will utilize this bucket to query datasets and store results; we will refer to it as s3://athenauser-athena-r in the subsequent instructions.

Note: I will create all AWS resources in the US-East region. Please consult the Region Table to check Athena’s availability in other regions.

Setting Up R and RStudio on EC2

Follow the guidelines in the blog post “Running R on AWS” to establish R on an EC2 instance (t2.medium or larger) using Amazon Linux. Before proceeding, note the following:

In the aforementioned article under “Advanced Details,” during step 3 use the bash script below to install the latest version of RStudio. Adjust the password for RStudio as necessary.

#!/bin/bash
#install R
yum install -y R
#install RStudio-Server
wget https://download2.rstudio.org/rstudio-server-rhel-1.0.153-x86_64.rpm
yum install -y --nogpgcheck rstudio-server-rhel-1.0.153-x86_64.rpm
#add user(s)
useradd rstudio
echo rstudio:rstudio | chpasswd

Install Java 8

SSH into the EC2 instance. Remove any older Java versions and install Java 8, which is required for working with Athena. Execute the following commands:

#install Java 8, select ‘y’ from options presented to proceed with installation
sudo yum install java-1.8.0-openjdk-devel
#remove version 7 of Java, select ‘y’ from options to proceed with removal
sudo yum remove java-1.7.0-openjdk
#configure java, choose 1 as your selection option for java 8 configuration
sudo /usr/sbin/alternatives --config java
#run command below to add Java support to R
sudo R CMD javareconf

#the following libraries are required for the interactive application we build later
sudo yum install -y libpng-devel
sudo yum install -y libjpeg-turbo-devel

Configure .Renviron

You need to set the R environment variable .Renviron with the required Athena credentials. Obtain the necessary credentials from your AWS Administrator: AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.

At the Linux command prompt, enter the command below to open the vi editor:

sudo vim /home/rstudio/.Renviron

Enter your Athena credentials in the format below into the editor:

ATHENA_USER=<AWS_ACCESS_KEY_ID>
ATHENA_PASSWORD=<AWS_SECRET_ACCESS_KEY>

Save the file and exit the editor.

Log in to RStudio

Next, access RStudio on your EC2 instance. Retrieve the public IP address of your instance from the EC2 dashboard and enter it into your browser, appending :8787 (the port number for RStudio).

Ensure your IP address is whitelisted for inbound access to port 8787 in the security group settings associated with your EC2 instance.

Log in to RStudio using the username and password you previously set.

Install Required R Packages

Proceed to install and load the essential R packages:

#--the following R packages are required for connecting R with Athena
install.packages("rJava")
install.packages("RJDBC")
library(rJava)
library(RJDBC)

#--these R packages are needed for the interactive application we will build later
install.packages(c("plyr","dplyr","png","RgoogleMaps","ggmap"))
library(plyr)
library(dplyr)
library(png)
library(RgoogleMaps)
library(ggmap)

Connecting to Athena

The following steps in R will download the Athena driver and establish the necessary connection. Use the JDBC URL associated with your region.

#verify Athena credentials by checking results from the command below
Sys.getenv()
#set up URL to download Athena JDBC driver
URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.1.jar'
fil <- basename(URL)
#download the file into current working directory
if (!file.exists(fil)) download.file(URL, fil)
#check that the file has been downloaded
fil
list.files()
#initialize driver connection to JDBC
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")
#connect to Athena using the driver, S3 working directory and credentials for Athena 
#replace ‘athenauser’ with your S3 bucket prefix
con <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
s3_staging_dir="s3://athenauser-athena-r",
user=Sys.getenv("ATHENA_USER"),
password=Sys.getenv("ATHENA_PASSWORD"))
#in case of an error or warning, ensure rJava and RJDBC packages are loaded 
#and that Java 8 is configured for R as described earlier

You are now prepared to begin querying Athena from RStudio.

Sample Queries to Test

# get a list of all tables currently in Athena 
dbListTables(con)
# execute a sample query
dfelb=dbGetQuery(con, "SELECT * FROM sampledb.elb_logs limit 10")
head(dfelb, 2)

Interactive Use Case

Next, you will practice interactively querying Athena from R for analytics and visualization. For this, you’ll use GDELT, a publicly accessible dataset hosted on S3. You can create a table in Athena from R using the GDELT dataset; this can also be done via the AWS management console, as outlined in the blog post “Amazon Athena – Interactive SQL Queries for Data in Amazon S3.”

#---sql create table statement in Athena
dbSendQuery(con, 
"
CREATE EXTERNAL TABLE IF NOT EXISTS sampledb.gdeltmaster (
GLOBALEVENTID BIGINT,
SQLDATE INT,
MonthYear INT,
Year INT,
FractionDate DOUBLE,
Actor1Code STRING,
Actor1Name STRING,
Actor1CountryCode STRING,
Actor1KnownGroupCode STRING,
Actor1EthnicCode STRING,
Actor1Religion1Code STRING,
Actor1Religion2Code STRING,
Actor1Type1Code STRING,
Actor1Type2Code STRING,
Actor1Type3Code STRING,
Actor2Code STRING,
Actor2Name STRING,
Actor2CountryCode STRING,
Actor2KnownGroupCode STRING,
Actor2EthnicCode STRING,
Actor2Religion1Code STRING,
Actor2Religion2Code STRING,
Actor2Type1Code STRING,
Actor2Type2Code STRING,
Actor2Type3Code STRING,
IsRootEvent INT,
EventCode STRING,
EventBaseCode STRING,
EventRootCode STRING,
QuadClass INT,
GoldsteinScale DOUBLE,
NumMentions INT,
NumSources INT,
NumArticles INT,
AvgTone DOUBLE,
Actor1Geo_Type INT,
Actor1Geo_FullName STRING,
Actor1Geo_CountryCode STRING,
Actor1Geo_ADM1Code STRING,
Actor1Geo_Lat FLOAT,
Actor1Geo_Long FLOAT,
Actor1Geo_FeatureID INT,
Actor2Geo_Type INT,
Actor2Geo_FullName STRING,
Actor2Geo_CountryCode STRING,
Actor2Geo_ADM1Code STRING,
Actor2Geo_Lat FLOAT,
Actor2Geo_Long FLOAT,
Actor2Geo_FeatureID INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY 't'
LOCATION 's3://gdelt-open-data/events/'
")

Now you are ready to explore the vast data from GDELT, and you can find more insights in another blog post highlighting this topic here. If you want to delve deeper, this source is an authority on the subject. Additionally, for further assistance, check out this excellent resource.


Comments

Leave a Reply

Your email address will not be published. Required fields are marked *