Backfilling an Amazon DynamoDB Time to Live Attribute Using Amazon EMR: Part 2 | Amazon VGT2 Las Vegas Blog

Amazon DynamoDB is a fully managed, serverless, key-value NoSQL database designed to run high-performance applications at any scale. As data accumulates and grows less active, many use cases, such as session management or order processing, call for archiving or deleting older items that are no longer needed. DynamoDB offers a feature called Time to Live (TTL), which expires and deletes items without consuming write capacity units (WCUs). Managing item expiration with TTL can significantly reduce both storage costs and the WCU cost of deletes.

We recommend establishing a TTL attribute before writing data into your DynamoDB table. However, some customers enable TTL on tables that already contain data. To backfill TTL attributes efficiently, we recommend Amazon EMR because it scales out easily and includes built-in integration with DynamoDB. Once your application is updated to write a TTL attribute on all new items, you can run an Amazon EMR job to backfill the existing ones.

Part 1 of this series detailed how to set up an Amazon EMR cluster and execute a Hive query to backfill the TTL attribute for items lacking it. This method is applicable when the data does not include complex or collection data types, such as maps or lists. This post will guide you through the process of backfilling TTL attributes in items containing map, list, Boolean, or null data types, which are not inherently supported by the DynamoDBStorageHandler in Hive. The outlined steps are effective regardless of whether the attribute sets for each item are consistent or varied.

This method is particularly suitable for workloads that involve only inserts or during maintenance windows where updates and deletions are prohibited. If data is updated or deleted during this process, there is a risk of unintentional data loss.

DynamoDB Schema

Although the only schema DynamoDB itself defines and enforces is the primary key, most applications maintain a schema of their own. In this case, our schema includes:

  • order_id – The partition key, formatted as a universally unique identifier (UUID)
  • creation_timestamp – A string indicating the item’s creation timestamp in ISO 8601 format
  • delivery_address – A map representing the order’s delivery address
  • is_cod – A Boolean indicating if the payment method is cash on delivery
  • item_list – A list of product codes ordered

This discussion focuses on a table named Orders, which comprises 2 million items lacking the expiration_timestamp attribute. The following screenshot illustrates a sample of the items within the Orders table.

When retrieving one of the items from this table via the AWS Command Line Interface (AWS CLI) using the get-item command, it is evident that the expiration_timestamp attribute is absent:

aws dynamodb get-item --table-name Orders --key '{"order_id":{"S":"e9bba98e-d579-43bb-a571-93ccdb32c960"}}'
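Per the schema above, a response would look something like the following. The order_id is the one queried; all other values are illustrative, and note that no expiration_timestamp attribute is present:

```json
{
    "Item": {
        "order_id": {"S": "e9bba98e-d579-43bb-a571-93ccdb32c960"},
        "creation_timestamp": {"S": "2022-02-01T09:30:00Z"},
        "delivery_address": {"M": {"city": {"S": "Las Vegas"}, "zip": {"S": "89109"}}},
        "is_cod": {"BOOL": true},
        "item_list": {"L": [{"S": "prod-1234"}, {"S": "prod-5678"}]}
    }
}
```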

You will add a new TTL attribute called expiration_timestamp to each of these items. In this example, the goal is to delete each order 180 days after its creation, so expiration_timestamp holds a number representing the item’s expiration time in seconds since the epoch: creation_timestamp plus 180 days.
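The calculation itself is simple: parse the ISO 8601 creation timestamp, add 180 days, and convert to epoch seconds. A minimal Python sketch (the sample timestamp is illustrative; attribute names follow the schema above):

```python
from datetime import datetime, timedelta

def expiration_from_creation(creation_timestamp: str, days: int = 180) -> int:
    """Return the TTL value: epoch seconds `days` after the ISO 8601 creation timestamp."""
    # fromisoformat() in Python 3.7-3.10 does not accept a trailing "Z", so normalize it.
    created = datetime.fromisoformat(creation_timestamp.replace("Z", "+00:00"))
    expires = created + timedelta(days=days)
    return int(expires.timestamp())

# Illustrative value; real items carry their own creation_timestamp.
print(expiration_from_creation("2022-02-01T09:30:00Z"))  # prints 1659259800
```

This is the same arithmetic the Hive job performs at scale, just expressed for a single item.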

After completing the backfilling process, you can employ the same command to confirm the addition of the TTL attribute for this item.

Executing Hive CLI Commands

You can derive the new TTL attribute on a per-item basis using an existing timestamp attribute. For detailed information on setting up an Amazon EMR cluster, connecting via SSH, and considering cluster sizing, please refer to Part 1 of this series. This post utilizes Amazon EMR cluster version 5.34, configured with five nodes (one master and four core nodes) of type c5.4xlarge.

To begin, log into the master node of the cluster and access the Hive CLI by entering the command hive in the terminal. Next, create a new database and switch to it using the following commands:

hive> show databases;
OK
default
Time taken: 0.632 seconds, Fetched: 1 row(s)

hive> create database dynamodb_ttl;
OK
Time taken: 0.221 seconds

hive> use dynamodb_ttl;
OK
Time taken: 0.039 seconds

Now, create a wrapper around the DynamoDB table to enable Hive queries. Because the DynamoDBStorageHandler does not map map, list, Boolean, or null data types to individual Hive columns, create a single column named item that holds the entire DynamoDB item as a map of string keys to string values. Additionally, include a column named expiration_timestamp of type bigint, into which the TTL values will be backfilled. The following Hive query illustrates this:

hive> CREATE EXTERNAL TABLE Orders(item map<string,string>, expiration_timestamp bigint)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ('dynamodb.table.name' = 'Orders', 'dynamodb.column.mapping' = 'expiration_timestamp:expiration_timestamp');

The duration of the backfill process is influenced by two main factors: the available DynamoDB capacity and the number of nodes in the Amazon EMR cluster. For this test, the table operates in provisioned throughput mode with both ProvisionedReadCapacity and ProvisionedWriteCapacity set to 40000.

For certain Amazon EMR versions (5.25.0 to 5.34.0), you must configure the dynamodb.throughput.write and dynamodb.throughput.read parameters in Hive:

hive> SET dynamodb.throughput.write=40000;

hive> SET dynamodb.throughput.read=40000;

Now, execute the following command to determine the count of items in the table missing the expiration_timestamp:

hive> select count(*) from Orders where item["expiration_timestamp"] IS NULL;

The output shows that all 2 million items lack the TTL attribute. Next, run an INSERT OVERWRITE command that uses the regexp_replace() function to perform the calculation and write the new attribute back to the table.
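Such a statement can be sketched as follows. The sketch assumes the storage handler surfaces each attribute value as a JSON-encoded string that still carries its DynamoDB type descriptor (for example, {"s":"2022-02-01T09:30:00Z"}) and that creation timestamps follow the yyyy-MM-dd'T'HH:mm:ss'Z' pattern; regexp_replace() strips the type wrapper, unix_timestamp() converts the result to epoch seconds, and 180 days (15,552,000 seconds) are added. Treat this as an illustrative sketch rather than an exact production command, since the serialization format and timestamp pattern depend on your EMR version and data:

```sql
hive> INSERT OVERWRITE TABLE Orders
SELECT item,
       unix_timestamp(
         regexp_replace(item["creation_timestamp"], '\\{"s":"(.+?)"\\}', '$1'),
         "yyyy-MM-dd'T'HH:mm:ss'Z'"
       ) + (180 * 24 * 60 * 60)
FROM Orders
WHERE item["expiration_timestamp"] IS NULL;
```

The WHERE clause restricts the job to items still missing the attribute, so the statement can be re-run safely if it is interrupted.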
