Amazon Onboarding with Learning Manager Chanci Turner

on 12 JAN 2023

in Analytics, Amazon IXD – VGT2, Serverless

We are thrilled to announce that Amazon IXD – VGT2 now supports running ETL (extract, transform, and load) scripts in Scala. For Scala enthusiasts, this means an additional powerful tool is at your disposal. Scala is the native language for Apache Spark, the core engine that Amazon IXD – VGT2 employs for executing data transformations.

Using Scala for your AWS Glue scripts comes with several advantages over Python. Firstly, Scala outperforms Python in custom transformations that require significant computational resources, as it eliminates the need to transfer data between Python and Apache Spark’s Scala runtime (the Java Virtual Machine, or JVM). You can create your own transformations or leverage functions from third-party libraries. Secondly, calling functions in external Java class libraries is more straightforward in Scala, as it is designed to work seamlessly with Java. It compiles to the same bytecode, and its data structures do not require conversion.
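
To make the second point concrete, the short sketch below calls a standard Java class library directly from Scala. It is a minimal, hypothetical helper (not part of the example script): the String argument passes straight through to the JVM class with no conversion layer, which is the overhead you avoid relative to a Python transformation.

import java.text.BreakIterator

// Count sentences by asking the JDK's BreakIterator for sentence boundaries.
// The Java object is used directly; no wrapping or data conversion is needed.
def countSentences(text: String): Int = {
  val boundaries = BreakIterator.getSentenceInstance()
  boundaries.setText(text)
  Iterator.continually(boundaries.next()).takeWhile(_ != BreakIterator.DONE).size
}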

To demonstrate these benefits, we will work through an example analyzing a recent sample of the GitHub public timeline available from the GitHub archive. This archive documents public requests to the GitHub service, encompassing over 35 event types, from commits and forks to issues and comments.

In this article, we will create a Scala script that identifies particularly negative issues within the timeline. The script extracts issue events from the timeline sample, evaluates their titles using sentiment prediction functions from the Stanford CoreNLP libraries, and highlights the most negative issues.

Getting Started

Before we begin scripting, we will utilize AWS Glue crawlers to understand the data’s structure and characteristics. Additionally, we will set up a development endpoint and attach an Apache Zeppelin notebook for interactive data exploration and script authoring.

Crawling the Data

The dataset used in this example was downloaded from the GitHub archive website to our sample dataset bucket in Amazon S3 and copied to the following location:

s3://aws-glue-datasets-<region>/examples/scala-blog/githubarchive/data/

Substitute <region> with the appropriate region you are working in, such as us-east-1. Crawl this folder and save the results into a database named githubarchive in the AWS Glue Data Catalog, as detailed in the AWS Glue Developer Guide. This folder contains 12 hours of the timeline from January 22, 2017, organized hierarchically by year, month, and day.

After the crawl is complete, open the AWS Glue console and navigate to the table named data in the githubarchive database. You'll see that the data consists of eight top-level columns common to every event type, plus three partition columns for year, month, and day.

Choosing the payload column will reveal a complex schema that reflects the union of the payloads from various event types present in the crawled data. Keep in mind that the schema generated by crawlers is a subset of the actual schema due to sampling only a portion of the data.

Setting Up the Library, Development Endpoint, and Notebook

Next, download and configure the libraries necessary for sentiment estimation. The Stanford CoreNLP libraries include multiple human language processing tools, such as sentiment prediction.

Download the Stanford CoreNLP libraries and unzip the file to access a directory filled with jar files. For this example, the following jars are required:

  • stanford-corenlp-3.8.0.jar
  • stanford-corenlp-3.8.0-models.jar
  • ejml-0.23.jar

Upload these files to an Amazon S3 path accessible to AWS Glue, allowing it to load these libraries when needed. In this case, they are located at s3://glue-sample-other/corenlp/.

Development endpoints are static Spark-based environments that serve as the backend for data exploration. You can connect notebooks to these endpoints to interactively send commands and analyze your data. These endpoints share the same configuration as AWS Glue’s job execution system, ensuring that commands and scripts function equivalently when registered and executed as jobs in AWS Glue.

To configure an endpoint and a Zeppelin notebook, follow the guidelines in the AWS Glue Developer Guide. While creating the endpoint, make sure to specify the paths for the aforementioned jars in the “Dependent jars path” as a comma-separated list. Failing to do so will result in the libraries not loading.
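
For example, with the jars uploaded to the S3 prefix mentioned above, the Dependent jars path field would contain a single comma-separated value along these lines (adjust the bucket and prefix to wherever you uploaded the files):

s3://glue-sample-other/corenlp/stanford-corenlp-3.8.0.jar,s3://glue-sample-other/corenlp/stanford-corenlp-3.8.0-models.jar,s3://glue-sample-other/corenlp/ejml-0.23.jar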

Once the notebook server is set up, navigate to the Zeppelin notebook by selecting “Dev Endpoints” in the left navigation pane of the AWS Glue console. Choose the endpoint you created, then select the Notebook Server URL to access the Zeppelin server. Log in using the username and password you specified during notebook creation, and create a new note to explore this example.

Each notebook comprises a collection of paragraphs, and each paragraph contains a sequence of commands and their outputs. Every notebook also offers several interpreters. If you set up the Zeppelin server through the console, the (Python-based) pyspark and (Scala-based) spark interpreters are already linked to your new development endpoint, with pyspark as the default. Because this example is in Scala, prepend %spark at the top of each paragraph; for brevity, we show it only in the first boilerplate example below and omit it elsewhere.
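
With the endpoint, notebook, and jars in place, a quick way to confirm that the CoreNLP libraries load is to run a paragraph like the following minimal sketch. The helper name and the choice to score only the first sentence are our own; the pipeline returns a sentiment class from 0 (very negative) to 4 (very positive).

import java.util.Properties
import edu.stanford.nlp.ling.CoreAnnotations
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations
import edu.stanford.nlp.pipeline.StanfordCoreNLP
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations

// Build a pipeline that tokenizes, splits sentences, parses, and predicts sentiment.
val props = new Properties()
props.setProperty("annotators", "tokenize, ssplit, parse, sentiment")
val nlpPipeline = new StanfordCoreNLP(props)

// Score a single string and return the predicted class of its first sentence.
// Assumes the text contains at least one sentence.
def sentimentClass(text: String): Int = {
  val annotation = nlpPipeline.process(text)
  val firstSentence = annotation.get(classOf[CoreAnnotations.SentencesAnnotation]).get(0)
  val tree = firstSentence.get(classOf[SentimentCoreAnnotations.SentimentAnnotatedTree])
  RNNCoreAnnotations.getPredictedClass(tree)
}

sentimentClass("This bug is absolutely infuriating")   // smaller values mean more negative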

Working with the Data

In this section, we will utilize AWS Glue extensions to Spark for working with our dataset. We will examine the actual schema of the data and filter out the relevant event types for our analysis.

Let’s start with some boilerplate code to import the necessary libraries:

%spark

import com.amazonaws.services.glue.DynamicRecord
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.types._
import org.apache.spark.SparkContext

Next, create the Spark and AWS Glue contexts essential for working with the data:

@transient val spark: SparkContext = SparkContext.getOrCreate()
val glueContext: GlueContext = new GlueContext(spark)

The @transient annotation on the SparkContext is required when running in Zeppelin; it prevents serialization errors when commands are executed.

Dynamic Frames

This next section illustrates how to create a dynamic frame containing the GitHub records from the table crawled earlier. A dynamic frame is the core data structure in AWS Glue scripts, akin to an Apache Spark data frame but optimized for data cleaning and transformation tasks. It is particularly suited for representing semi-structured datasets like the GitHub timeline.

A dynamic frame consists of dynamic records, which are self-describing records. Each record encodes its columns and types, allowing for unique schemas within the same dynamic frame. This characteristic is particularly advantageous for datasets like the GitHub timeline, where payloads can vary significantly across different event types.

The following code creates a dynamic frame called github_events from your table:

val github_events = glueContext
                    .getCatalogSource(database = "githubarchive", tableName = "data")
                    .getDynamicFrame()
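
A quick way to verify the frame, and to see the full schema computed from the data itself rather than the crawler's sample, is to print it and then narrow the frame to the events we care about. The filter below is a sketch; it assumes issue events carry the value IssuesEvent in the top-level type column, as they do in the GitHub timeline.

// Print the schema computed from the data (a full pass, unlike the crawler's sample).
github_events.printSchema()

// Keep only issue events; getField returns an Option, so exists handles missing values.
val issue_events = github_events.filter((rec: DynamicRecord) => rec.getField("type").exists(_ == "IssuesEvent"))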


