Ten New Visual Transforms in AWS Glue Studio

AWS Glue Studio provides an intuitive graphical interface that simplifies the creation, execution, and monitoring of extract, transform, and load (ETL) jobs. This tool enables users to visually build data transformation workflows through nodes that represent various data handling stages, which are subsequently converted into executable code.

Recently, AWS Glue Studio has introduced ten additional visual transforms, empowering users to design more sophisticated jobs without the need for coding expertise. In this article, we will explore typical use cases that align with common ETL requirements.

The new transforms being highlighted include: Concatenate, Split String, Array To Columns, Add Current Timestamp, Pivot Rows To Columns, Unpivot Columns To Rows, Lookup, Explode Array Or Map Into Columns, Derived Column, and Autobalance Processing.

Overview of the Solution

In our use case, we are working with JSON files that contain stock option transactions. Our goal is to perform several transformations before storing the data to facilitate analysis and to create a separate summary dataset.

Each row in this dataset corresponds to a trade of option contracts, which are financial instruments that grant the right—but not the obligation—to buy or sell stock shares at a predetermined price (known as the strike price) prior to a specified expiration date.

Input Data Structure

The dataset is structured as follows:

  • order_id – A unique identifier
  • symbol – A code, typically comprising a few letters, to identify the corporation issuing the underlying stock shares
  • instrument – The name identifying the specific option being traded
  • currency – The ISO currency code representing the price
  • price – The price paid for each option contract (on most exchanges, one contract represents 100 shares of the underlying stock)
  • exchange – The code of the trading venue where the option was executed
  • sold – A list detailing the number of contracts allocated to fulfill the sell order, applicable for sell trades
  • bought – A list detailing the number of contracts allocated to fulfill the buy order, applicable for buy trades

Here’s a sample of the synthetic data for this post:

{"order_id": 1679931512485, "symbol": "AMZN", "instrument": "AMZN MAR 24 23 102 PUT", "currency": "usd", "price": 17.18, "exchange": "EDGX", "bought": [18, 38]}
{"order_id": 1679931512486, "symbol": "BMW.DE", "instrument": "BMW.DE MAR 24 23 96 PUT", "currency": "eur", "price": 2.98, "exchange": "XETR", "bought": [28]}
...
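
Each line is a standalone JSON object (JSON Lines format), and bought and sold never appear in the same record. If you'd like to inspect the raw data with code rather than in Glue Studio, a minimal PySpark sketch like the following would load it; the bucket name is a placeholder:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Spark infers the schema, including the optional bought/sold array columns
df = spark.read.json("s3://your-bucket/transformsblog/inputdata/")
df.printSchema()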

ETL Requirements

The data has some peculiarities, often found in legacy systems, that complicate its use as-is. The following ETL requirements have been identified:

  1. The instrument name contains valuable information intended for human interpretation; we aim to normalize it into separate columns for easier analysis (see the PySpark sketch after this list).
  2. The bought and sold attributes are mutually exclusive; we can merge them into a single column indicating contract numbers and another column specifying whether the contracts were bought or sold.
  3. We wish to retain the information about individual contract allocations but present them as single rows instead of arrays. This avoids losing insights into how orders were filled, which reflects market liquidity. Thus, we’ll denormalize the table to ensure each row contains a single contract number.
  4. A summary table of volumes for each option type (call and put) for each stock will be produced, offering insights into market sentiment.
  5. For complete trade summaries, we will provide total amounts and standardize currency to US dollars, utilizing an approximate conversion reference.
  6. We will also add a timestamp for when these transformations occur, which could prove useful for referencing currency conversion dates.
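
Behind the scenes, Glue Studio turns the visual nodes into Spark code. Purely as an illustration of what these requirements amount to (a rough sketch, not the code Glue Studio actually generates), requirements 1, 2, 3, 5, and 6 could be expressed in PySpark as follows, continuing from the DataFrame loaded in the earlier sketch; the conversion rates are placeholder values:

import pyspark.sql.functions as F

# Requirement 1: split the human-readable instrument name into columns
parts = F.split(F.col("instrument"), " ")
df = (df.withColumn("expiry_date", F.concat_ws(" ", parts[1], parts[2], parts[3]))
        .withColumn("strike_price", parts[4].cast("int"))
        .withColumn("option_type", parts[5]))

# Requirement 2: merge the mutually exclusive bought/sold arrays into one
# array column, plus a column recording the side of the trade
df = (df.withColumn("contracts", F.coalesce(F.col("bought"), F.col("sold")))
        .withColumn("side", F.when(F.col("bought").isNotNull(), "bought").otherwise("sold"))
        .drop("bought", "sold"))

# Requirement 3: denormalize so each row holds a single contract allocation
df = df.withColumn("contracts", F.explode("contracts"))

# Requirements 5 and 6: approximate USD conversion and a processing timestamp
rates = spark.createDataFrame([("usd", 1.0), ("eur", 1.1)], ["currency", "usd_rate"])  # placeholder rates, not real FX data
df = (df.join(rates, "currency", "left")
        .withColumn("amount_usd", F.round(F.col("price") * F.col("contracts") * F.col("usd_rate"), 2))
        .withColumn("processed_at", F.current_timestamp()))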

Based on these requirements, the job will yield two outputs, also sketched in PySpark below:

  • A CSV file summarizing the number of contracts for each symbol and type.
  • A catalog table to maintain a historical record of the orders following the specified transformations.
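
Continuing the same sketch (again, an approximation rather than the generated code), requirement 4 and the two outputs might look like the following; the S3 paths and the database and table names are placeholders:

# Requirement 4: pivot option types into columns, summing volumes per symbol
summary = (df.groupBy("symbol")
             .pivot("option_type", ["CALL", "PUT"])
             .sum("contracts"))

# Output 1: the volume summary as a single CSV file
(summary.coalesce(1)
        .write.mode("overwrite")
        .option("header", True)
        .csv("s3://your-bucket/transformsblog/summary/"))

# Output 2: a catalog table keeping a historical record of the transformed orders
(df.write.mode("append")
   .format("parquet")
   .option("path", "s3://your-bucket/transformsblog/output/")
   .saveAsTable("transformsblog_db.option_orders"))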

Prerequisites

To follow this use case, you will need your own S3 bucket. For instructions on creating a new bucket, refer to AWS documentation on creating a bucket.

Generating Synthetic Data

To replicate this post or to experiment with similar data on your own, you can generate this dataset synthetically. Use the following Python script in a Python environment with Boto3 installed and access to Amazon S3 (the steps below run it as an AWS Glue Python shell job).

To create the data, follow these steps:

  1. In AWS Glue Studio, create a new job using the Python shell script editor option.
  2. Name the job and, on the Job details tab, choose a suitable IAM role and a name for the Python script.
  3. Still under Job details, expand Advanced properties and go to Job parameters.
  4. Add a parameter named --bucket and set its value to the name of the bucket where you plan to store the sample data.
  5. Paste the following script into the AWS Glue shell editor:
import argparse
import boto3
from datetime import datetime
import io
import json
import random
import sys

# Configuration
parser = argparse.ArgumentParser()
parser.add_argument('--bucket')
args, ignore = parser.parse_known_args()
if not args.bucket:
    raise Exception("This script requires an argument --bucket with the value specifying the S3 bucket where to store the files generated")

data_bucket = args.bucket
data_path = "transformsblog/inputdata"
samples_per_file = 1000

# Create a single file with synthetic data samples
s3 = boto3.client('s3')
buff = io.BytesIO()

sample_stocks = [("AMZN", 95, "usd"), ("NKE", 120, "usd"), ("JPM", 130, "usd"), ("KO", 130, "usd"),
                 ("BMW.DE", 95, "eur"), ("SIE.DE", 140, "eur"), ("SAP.DE", 115, "eur")]
option_type = ["PUT", "CALL"]
operations = ["sold", "bought"]
dates = ["MAR 24 23", "APR 28 23", "MAY 26 23", "JUN 30 23"]
for i in range(samples_per_file):
    stock = random.choice(sample_stocks)
    symbol = stock[0]
    ref_price = stock[1]
    currency = stock[2]
    strike_price = round(ref_price * random.uniform(0.9, 1.2))  # strike price near the stock's reference price
    # Build a record matching the structure of the sample data shown earlier;
    # the value ranges below are arbitrary, chosen to resemble those samples
    sample = {
        "order_id": int(datetime.now().timestamp() * 1000) + i,
        "symbol": symbol,
        "instrument": f"{symbol} {random.choice(dates)} {strike_price} {random.choice(option_type)}",
        "currency": currency,
        "price": round(random.uniform(0.5, 20.5), 2),
        "exchange": "EDGX" if currency == "usd" else "XETR"
    }
    # Assign the fill allocations to either "sold" or "bought", never both
    sample[random.choice(operations)] = [random.randrange(1, 100) for _ in range(random.randrange(1, 5))]
    buff.write(json.dumps(sample).encode())
    buff.write(b"\n")

# Upload the generated JSON Lines file to the bucket and prefix configured above
s3.put_object(Body=buff.getvalue(), Bucket=data_bucket,
              Key=f"{data_path}/{int(datetime.now().timestamp())}.json")
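
Run the job. When it finishes, a file containing 1,000 synthetic orders in JSON Lines format should appear under the transformsblog/inputdata prefix of the bucket you provided.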
