Amazon Onboarding with Learning Manager Chanci Turner

In this article, we explore how the design of Amazon DynamoDB tables can significantly influence scan performance, and we offer strategies for reducing scan latency. DynamoDB is a NoSQL database with a flexible schema: items within the same table can have different attributes.

Most DynamoDB schemas and usage patterns are designed around GetItem and Query operations, which return individual items with consistent single-digit-millisecond response times. However, certain scenarios require scanning entire tables or indexes.

Overview

In databases with flexible schemas, each item returned from a scan includes not only the actual data but also metadata such as attribute names and their data types. The inclusion of more attributes leads to increased client-side overhead, as each piece of data must be converted into the appropriate structure—whether that be a Python dictionary, a Node.js map, or a Java object.
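To make the per-attribute conversion cost concrete, here is a minimal sketch of what a client does with each item returned by a scan. It assumes the low-level DynamoDB representation in which every value is wrapped in a type tag such as {"S": "..."} or {"N": "..."}. The function names are hypothetical, and only a few type tags are handled; in practice boto3's TypeDeserializer performs this step.

```python
from decimal import Decimal

def unmarshal_value(typed):
    """Convert one type-tagged value, e.g. {"S": "abc"}, to a plain Python value."""
    (type_tag, raw), = typed.items()  # each value carries exactly one type tag
    if type_tag == "S":
        return raw
    if type_tag == "N":
        return Decimal(raw)  # numbers arrive as strings on the wire
    if type_tag == "BOOL":
        return raw
    if type_tag == "NULL":
        return None
    if type_tag == "L":
        return [unmarshal_value(v) for v in raw]
    if type_tag == "M":
        return {k: unmarshal_value(v) for k, v in raw.items()}
    raise ValueError(f"unsupported type tag in this sketch: {type_tag}")

def unmarshal_item(item):
    """Convert a whole item; the work grows with the number of attributes."""
    return {name: unmarshal_value(typed) for name, typed in item.items()}

# Illustrative item; the attribute names and values are made up.
wire_item = {"pk": {"S": "user#1"}, "field01": {"S": "abc"}, "count": {"N": "3"}}
plain_item = unmarshal_item(wire_item)
```

Every attribute costs one dictionary lookup, one type dispatch, and one new Python object, which is why items with many small attributes are expensive to convert even when they hold little data.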

This metadata consumes additional space, translating to fewer items fitting within DynamoDB’s 1-MB response limit, ultimately resulting in more network round trips during scans.
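The round trips can be counted directly. The sketch below assumes a client object that exposes scan(**kwargs) in the shape of the boto3 DynamoDB client, where each response may carry a LastEvaluatedKey that is passed back as ExclusiveStartKey. FakePagedClient is a hypothetical stand-in so the loop can run without AWS access; it is not part of any library.

```python
def scan_all(client, table_name):
    """Scan a table page by page; return (items, number_of_round_trips)."""
    items, round_trips = [], 0
    kwargs = {"TableName": table_name}
    while True:
        page = client.scan(**kwargs)
        round_trips += 1
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:  # no more pages
            return items, round_trips
        kwargs["ExclusiveStartKey"] = last_key

class FakePagedClient:
    """Hypothetical stand-in that serves a fixed list of pages."""
    def __init__(self, pages):
        self._pages = pages

    def scan(self, **kwargs):
        index = kwargs.get("ExclusiveStartKey", {"page": 0})["page"]
        response = {"Items": self._pages[index]}
        if index + 1 < len(self._pages):
            response["LastEvaluatedKey"] = {"page": index + 1}
        return response

fake = FakePagedClient([[{"pk": {"S": "a"}}], [{"pk": {"S": "b"}}], []])
items, round_trips = scan_all(fake, "demo-table")
```

With a real client, the fewer items that fit in each 1-MB page, the more times this loop runs for the same number of items.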

Methodology

We established a primary table with a straightforward structure, consisting of a partition key and a sort key, both formatted as strings. Additionally, we included a third string attribute named field1, containing a string of 144 random characters.

We also created tables with various combinations of 7-character attribute names (from field01 to field24) paired with both 3- and 6-character values, maintaining the same primary key configuration. Notably, NoSQL databases must store attribute names with each item; therefore, as items accumulate more attributes or longer names, they require more storage space.
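The effect of attribute names on item size can be approximated with a few lines of Python. This is a rough sketch for all-string items: it counts the UTF-8 bytes of each attribute name plus each value and ignores any other storage overhead. The key names pk and sk and their 10-character values are assumptions for illustration, since the article does not state them.

```python
def estimate_item_bytes(item):
    """Approximate size of an all-string item: name bytes plus value bytes."""
    return sum(len(name.encode("utf-8")) + len(value.encode("utf-8"))
               for name, value in item.items())

def items_per_page(item, page_limit_bytes=1024 * 1024):
    """How many copies of this item fit under a 1-MB response limit."""
    return page_limit_bytes // estimate_item_bytes(item)

# Assumed key attributes (names and lengths are illustrative only).
keys = {"pk": "p" * 10, "sk": "s" * 10}

# One 144-character data attribute versus 24 three-character attributes.
single = {**keys, "field1": "x" * 144}
wide = {**keys, **{f"field{i:02d}": "abc" for i in range(1, 25)}}

name_bytes_wide = sum(len(name) for name in wide)
print(estimate_item_bytes(single), estimate_item_bytes(wide), name_bytes_wide)
print(items_per_page(single), items_per_page(wide))
```

Under these assumptions the wide item carries only 72 characters of data in its 24 attributes, yet it is larger than the single-attribute item because its attribute names take more space than its values.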

Finally, another table was constructed with 24 attributes, all having 7-character names and values of 100 characters each. The following metrics were captured for each table design:

  • Time taken to insert 10,000 items.
  • Duration for scanning 10,000 items.
  • Count of items fitting within the 1-MB limit during scans.
  • Time required to retrieve and convert these items on the client side.

Empirical Results

The results were documented as follows:

| Table design | Total Size (MB) | Time to Write 10,000 Items (ms) | Time to Scan 10,000 Items (ms) | Single-Threaded Throughput (MB/s) | Time to Scan 1 MB (ms) | # Items in 1-MB Scan |
|---|---|---|---|---|---|---|
| 1 144-character data attribute | 2.1 | 5,057 | 569 | 3.7 | 238 | |
| 24 3-character data attributes | 3.0 | 9,359 | 2,392 | 1.2 | 797 | |
| 24 6-character data attributes | 3.7 | 9,928 | 2,391 | 1.5 | 682 | |
| 24 100-character data attributes | 26.3 | 27,553 | 2,819 | 9.4 | 110 | |

These figures were collected with a Python client written for the benchmark. Clients in other languages, such as Java and Node.js, showed comparable results.

All timings were taken on the client side, inside the Python process. Note that the latencies recorded in Amazon CloudWatch metrics do not account for network transfer or client-side data conversion.

Furthermore, the throughput figures (the MB/s column) reflect single-threaded scans; parallel scans could increase throughput significantly.
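A parallel scan splits the table into segments using the Segment and TotalSegments parameters of the Scan request, with one worker per segment. The following is a minimal sketch, not the benchmark's code: it assumes a client with a boto3-style scan(**kwargs) method, and FakeSegmentedClient is a hypothetical stand-in so the example runs without AWS access.

```python
from concurrent.futures import ThreadPoolExecutor

def scan_segment(client, table_name, segment, total_segments):
    """Scan one segment to completion, following pagination."""
    items = []
    kwargs = {"TableName": table_name, "Segment": segment,
              "TotalSegments": total_segments}
    while True:
        page = client.scan(**kwargs)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            return items
        kwargs["ExclusiveStartKey"] = last_key

def parallel_scan(client, table_name, total_segments=4):
    """Run one worker thread per segment and concatenate the results."""
    with ThreadPoolExecutor(max_workers=total_segments) as pool:
        futures = [pool.submit(scan_segment, client, table_name, s, total_segments)
                   for s in range(total_segments)]
        return [item for future in futures for item in future.result()]

class FakeSegmentedClient:
    """Hypothetical stand-in: item i belongs to segment i % TotalSegments."""
    def __init__(self, items):
        self._items = items

    def scan(self, **kwargs):
        segment, total = kwargs["Segment"], kwargs["TotalSegments"]
        return {"Items": self._items[segment::total]}

fake = FakeSegmentedClient([{"pk": {"S": str(i)}} for i in range(10)])
result = parallel_scan(fake, "demo-table", total_segments=4)
```

Threads overlap the network waits, but in Python the client-side conversion of items is CPU-bound work, so the gain from threads alone may be smaller than the segment count suggests.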

Analysis

The data indicates that inserting items with 24 data attributes instead of one takes nearly twice as long (9,359 ms versus 5,057 ms). Scanning these items takes roughly four times longer (2,392 ms versus 569 ms) because of the marshaling required on the server side and the unmarshaling on the client side.

The final two rows isolate the effect of the total size of the 10,000 items on scan duration, since both tests use 24 attributes with 7-character names. The row with 100-character values takes nearly three times longer to write than the row with 6-character values (27,553 ms versus 9,928 ms), yet scanning the larger items increases the duration only marginally (by 18%).
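The ratios discussed here can be recomputed from the results table. The short script below copies the table's size, write-time, and scan-time figures; the row labels and helper names are just shorthand for this sketch.

```python
# Figures copied from the results table (sizes in MB, times in ms).
rows = {
    "1 x 144-char": {"size_mb": 2.1, "write_ms": 5057, "scan_ms": 569},
    "24 x 3-char": {"size_mb": 3.0, "write_ms": 9359, "scan_ms": 2392},
    "24 x 6-char": {"size_mb": 3.7, "write_ms": 9928, "scan_ms": 2391},
    "24 x 100-char": {"size_mb": 26.3, "write_ms": 27553, "scan_ms": 2819},
}

def ratio(metric, numerator, denominator):
    """How many times larger one row's metric is than another's."""
    return rows[numerator][metric] / rows[denominator][metric]

def throughput_mb_s(name):
    """Single-threaded scan throughput: total size divided by scan time."""
    row = rows[name]
    return row["size_mb"] / (row["scan_ms"] / 1000)

print(f"write, 24 x 3-char vs 1 x 144-char: {ratio('write_ms', '24 x 3-char', '1 x 144-char'):.2f}x")
print(f"scan, 24 x 3-char vs 1 x 144-char: {ratio('scan_ms', '24 x 3-char', '1 x 144-char'):.2f}x")
print(f"write, 24 x 100-char vs 24 x 6-char: {ratio('write_ms', '24 x 100-char', '24 x 6-char'):.2f}x")
print(f"scan, 24 x 100-char vs 24 x 6-char: {ratio('scan_ms', '24 x 100-char', '24 x 6-char'):.2f}x")
print(f"throughput, 1 x 144-char: {throughput_mb_s('1 x 144-char'):.1f} MB/s")
```

The ratios come out at roughly 1.85x for the write and 4.2x for the scan when going from one data attribute to 24, and roughly 2.8x for the write but only 1.18x for the scan when going from 6-character to 100-character values.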

This suggests that the number of attributes and the associated marshaling/unmarshaling processes are primary contributors to extended scan times. However, it is often impractical to limit your DynamoDB table to just three attributes, as additional attributes may be required for various operations such as indexing and filtering.

Conclusion

The primary takeaway is to maintain only the essential attributes required for database operations. To minimize the overhead caused by attribute-name metadata, consider consolidating related data into a single attribute, possibly formatted as a JSON blob. Additionally, strive for shorter attribute names, as they also consume valuable space.
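One way to consolidate attributes is sketched below: the key attributes stay as they are, and everything else is packed into a single JSON string attribute. The key names pk and sk and the blob attribute name d are assumptions chosen for the example, not names from the benchmark.

```python
import json

def consolidate(item, key_names=("pk", "sk"), blob_name="d"):
    """Keep key attributes; pack every other attribute into one JSON string."""
    stored = {k: v for k, v in item.items() if k in key_names}
    payload = {k: v for k, v in item.items() if k not in key_names}
    stored[blob_name] = json.dumps(payload, separators=(",", ":"))
    return stored

def expand(stored, blob_name="d"):
    """Reverse consolidate(): unpack the JSON blob back into attributes."""
    item = {k: v for k, v in stored.items() if k != blob_name}
    item.update(json.loads(stored[blob_name]))
    return item

original = {"pk": "user#1", "sk": "order#7",
            "field01": "abc", "field02": "def", "field03": "ghi"}
stored = consolidate(original)    # three attributes instead of five
round_tripped = expand(stored)
```

The attribute names still occupy bytes inside the blob, so the saving comes mainly from the database and client handling one attribute instead of many. The trade-off is that attributes inside the blob cannot be used for indexing or filtering, so anything needed for those operations should stay as a top-level attribute.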

Chanci Turner is a Learning Manager at Amazon IXD – VGT2, located at 6401 E HOWDY WELLS AVE LAS VEGAS NV 89115. She leads initiatives that support employee growth and development within the organization.
