Optimizing Provisioned Throughput in Amazon DynamoDB

UPDATE (May 5, 2018)

Since the original publication of this post, enhancements have been made to the capacity management features of Amazon DynamoDB, so some of the information below may be out of date. For current best practices, refer to the DynamoDB documentation on effectively designing and using partition keys.

David Thompson from the Amazon DynamoDB team is here with a guest post on getting the most out of DynamoDB’s provisioned throughput feature. If you’re interested in expanding your DynamoDB knowledge, don’t miss our 8-hour DynamoDB bootcamp session at re:Invent. Previous experience with either relational or non-relational databases will be beneficial.

DynamoDB provides scalable throughput and storage by distributing your data across multiple servers. When you need more throughput for a table, you can increase it via the AWS Management Console or an API call, and as your data set expands, DynamoDB automatically adds partitions to accommodate your growing storage needs.

Unlike traditional databases where you typically “scale-up” by purchasing larger systems, DynamoDB offers a “scale-out” architecture. This means you don’t have to implement additional logic in your application to direct queries to the appropriate server; DynamoDB handles all the complexities required for a secure, reliable, and highly available data store.

While DynamoDB lets you set your throughput levels, it’s essential to design your application with the DynamoDB architecture in mind to fully utilize its capabilities. The Amazon DynamoDB Developer Guide outlines several best practices for maximizing your provisioned throughput.

Efficient Storage of Time Series Data

One such recommendation covers the efficient storage of time series data. When managing time series information, you often access the most recent, or “hot,” data more frequently than older, “cold” data. It is advisable to distribute your time series data across multiple tables—one for each time period (e.g., month, day). This design approach is elaborated upon in this article, which also highlights the advantages of structuring your application in this manner.

Understanding Non-Uniform Workloads

To grasp the significance of separating hot and cold data, consider the guidance on uniform workloads in the developer guide. When you store data, Amazon DynamoDB divides a table’s items among multiple partitions, distributing the data primarily according to the hash key element. A table’s provisioned throughput is divided evenly among its partitions, and partitions do not share throughput with one another. Thus, to fully utilize your provisioned throughput, it’s crucial to keep the workload even across hash key values; this distribution balances requests across partitions.

For instance, if a table has a limited number of heavily accessed hash key elements, or even just one heavily used hash key, traffic may become concentrated on a few partitions—potentially just one. A heavily unbalanced workload, where requests are disproportionately directed at one or a few partitions, will not achieve the overall provisioned throughput level. To maximize throughput in Amazon DynamoDB, it is essential to design tables with hash key elements that have a wide range of distinct values, with requests made as uniformly and randomly as possible.
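As a concrete illustration, here is a minimal boto3 (Python) sketch; the UserEvents table, its UserId/EventId key schema, and the helper function are all hypothetical, chosen only to show the difference between a single hot hash key and a high-cardinality one:

```python
import uuid
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("UserEvents")  # hypothetical table

# Anti-pattern: every write uses the same hash key value, so all
# traffic lands on the single partition that owns that key:
#   table.put_item(Item={"UserId": "global", "EventId": ..., ...})

def record_event(user_id, payload):
    """A high-cardinality hash key (one value per user) spreads requests
    across partitions, letting the table use its full provisioned throughput."""
    table.put_item(
        Item={
            "UserId": user_id,             # hash key with many distinct values
            "EventId": str(uuid.uuid4()),  # range key
            "Payload": payload,
        }
    )
```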

Another scenario involving non-uniform workloads occurs when individual requests consume significant throughput. High-throughput requests often arise from Scan or Query operations, or even single-item operations when dealing with large items. Even if these heavy requests are evenly spread across a table, each request can create a temporary hot spot, leading to throttling of subsequent requests.

Consider the Forum, Thread, and Reply tables from the Getting Started section of the developer guide, which illustrate a forums web application utilizing DynamoDB. The Reply table manages messages exchanged between users within a conversation, arranged by time.

If you query the Reply table for all messages in a highly popular thread, that single query could consume a large amount of throughput from a single partition. In extreme cases, this expensive query could deplete the partition’s throughput enough to throttle subsequent requests, even if other partitions have available throughput.

Tip: Utilize Pagination

To mitigate this workload concentration, use the pagination features of the Query operation and limit the number of items retrieved with each call (for example, via the Limit parameter). Since a forums web application typically displays a fixed number of replies per thread, pagination is particularly well suited to this scenario.
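Here is a minimal boto3 (Python) sketch of that approach against the Reply table; the key name (Id as the hash key) follows the Getting Started example, while the page size is an arbitrary choice:

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
replies = dynamodb.Table("Reply")

def get_reply_page(thread_id, page_size=25, start_key=None):
    """Fetch one page of replies rather than the whole thread at once."""
    kwargs = {
        "KeyConditionExpression": Key("Id").eq(thread_id),
        "Limit": page_size,          # caps throughput consumed per request
        "ScanIndexForward": False,   # newest replies first
    }
    if start_key is not None:
        kwargs["ExclusiveStartKey"] = start_key  # resume where the last page ended
    response = replies.query(**kwargs)
    # LastEvaluatedKey is present only when more items remain in the result set.
    return response["Items"], response.get("LastEvaluatedKey")
```

Each call now consumes only a bounded slice of a single partition’s throughput, instead of one expensive query draining it all at once.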

Impact of Non-Uniform Workloads on Throughput for Large Tables

As your table grows, DynamoDB automatically adds partitions to manage the additional storage, and each partition receives a correspondingly smaller share of your overall throughput. For non-uniform workloads, this means requests that were never throttled on a smaller table may begin to be throttled as the dataset expands, even though you are not consuming your full provisioned throughput. Hot spots that went unnoticed early on, when each partition held a larger slice of the table’s throughput, can surface as throttling once the partition count increases and per-partition throughput shrinks.
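To put numbers on this dilution effect, here is a tiny Python sketch; the partition counts and capacity figures are invented for illustration, since DynamoDB manages (and does not expose) its internal partition count:

```python
def per_partition_capacity(provisioned_units, partition_count):
    """Provisioned throughput is divided evenly among a table's partitions."""
    return provisioned_units / partition_count

# The same 1,000 provisioned read units per second stretch thinner
# as the table grows and DynamoDB adds partitions:
print(per_partition_capacity(1000, 4))   # 250.0 units per partition
print(per_partition_capacity(1000, 20))  # 50.0 units per partition
```

A hot key that needed 100 units per second would fit comfortably in the first case but would be throttled in the second, even though the table as a whole has ample unused capacity.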

On the other hand, if your workload is uniform across the table, your application should continue to operate seamlessly as the number of partitions increases.

Tip: Distinguish Between Hot and Cold Data

Certain applications may combine hot and cold data within the same table. Hot data is accessed frequently, like recent replies in the forums application, while cold data is seldom accessed, such as forum replies from months past. Applications dealing with time series data, which often use a timestamp as a range key, fall into this category. The developer guide outlines best practices for storing time series data in DynamoDB, recommending a separate table for each time period; a brief sketch of this pattern follows the list below. This approach offers several advantages:

  • Cost Efficiency: You can provision higher throughput for tables containing hot data, while allocating lower throughput for tables with cold data. This strategy maintains higher per-partition throughput on your hot tables, enabling them to better withstand non-uniform workloads.
  • Simplified Analytics: When you need to analyze your data for periodic reports, you can leverage the built-in integration with Amazon Elastic MapReduce (Amazon EMR) and execute complex queries that are not natively supported in DynamoDB. Because analytics typically run on a schedule, organizing tables by time period means the analytics job accesses only the new data.
  • Easier Archival: If older data becomes irrelevant to your application, you can archive it to more cost-effective storage solutions like Amazon S3 and delete the old table without the need to remove items individually, which would otherwise deplete a significant amount of provisioned throughput.
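Below is a minimal boto3 (Python) sketch of the per-period table pattern referenced above. The Reply-YYYY-MM naming scheme and the capacity numbers are hypothetical; the key schema follows the Reply table from the Getting Started guide:

```python
import boto3

client = boto3.client("dynamodb")

# Hypothetical capacity settings: generous for the current ("hot") period,
# minimal for older ("cold") periods.
HOT = {"ReadCapacityUnits": 400, "WriteCapacityUnits": 400}
COLD = {"ReadCapacityUnits": 10, "WriteCapacityUnits": 5}

def table_name(year, month):
    return f"Reply-{year}-{month:02d}"  # e.g., Reply-2013-11

def create_period_table(year, month):
    """Create the table for a new time period at hot throughput."""
    client.create_table(
        TableName=table_name(year, month),
        KeySchema=[
            {"AttributeName": "Id", "KeyType": "HASH"},
            {"AttributeName": "ReplyDateTime", "KeyType": "RANGE"},
        ],
        AttributeDefinitions=[
            {"AttributeName": "Id", "AttributeType": "S"},
            {"AttributeName": "ReplyDateTime", "AttributeType": "S"},
        ],
        ProvisionedThroughput=HOT,
    )

def cool_down(year, month):
    """Dial a past period's table down to cold throughput."""
    client.update_table(
        TableName=table_name(year, month),
        ProvisionedThroughput=COLD,
    )
```

When a period ends, you lower that table’s throughput rather than paying hot-table rates for rarely read data, and archiving becomes a table-level export and delete instead of item-by-item cleanup.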
