DynamoDB Streams can handle read requests at scale, but if your processing application falls behind, you risk losing stream records, which become unavailable after 24 hours. If you maintain multiregion read replicas of your DynamoDB table, that data loss can be a significant concern.
In this article, I will outline strategies for monitoring the Amazon Kinesis Client Library (KCL) application that you use to process DynamoDB Streams. This will enable you to quickly identify and resolve issues or failures, thereby preventing data loss. Dashboards, metrics, and application logs are integral to this process. This guide is particularly pertinent for Java applications operating on Amazon EC2 instances.
Before diving deeper, I recommend reviewing my earlier posts about KCL application design for DynamoDB Streams utilizing either a single worker or multiple workers.
Dashboards
Creating a dashboard from key CloudWatch metrics provides a comprehensive overview of your application’s status. Below is a sample dashboard for an application processing a 256-shard DynamoDB stream with two c4.large EC2 workers (failoverTimeMillis = 60000); a sketch of building a similar dashboard programmatically follows the list below.
The Metrics section below elaborates on some of these metrics, but here’s a quick overview of what the dashboard indicates:
- Throughput vs. Processing: The number of records processed by KCL aligns with the write operations on the base table, indicating the application is processing efficiently.
- ReturnedItemCount: The total number of leases being processed by KCL remains stable and corresponds to the total number of shards (256).
- Current Leases: The initial lease distribution among workers may be uneven; however, workers aim to balance the workload.
- Leases Table: The read and write consumption rates on the KCL leases table stay within provisioned limits.
- CPU Utilization and Thread Count: Workers with more assigned leases perform more tasks as expected.
- Memory Utilization and Heap Memory: KCL’s memory usage can be significant (due to in-memory record processing), so tuning your JVM max heap size may be necessary to prevent memory issues.
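You can also create such a dashboard programmatically with the CloudWatch PutDashboard API. The following is a minimal sketch using the AWS SDK for Java; the dashboard name, leases table name, and region are placeholder values, and the body contains only a single widget (consumed write capacity on the leases table) to keep it short:

// Minimal sketch: create a CloudWatch dashboard with one DynamoDB widget.
// Dashboard name, table name, and region below are placeholders.
import com.amazonaws.services.cloudwatch.AmazonCloudWatch;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder;
import com.amazonaws.services.cloudwatch.model.PutDashboardRequest;

public class KclDashboard {
    public static void main(String[] args) {
        AmazonCloudWatch cloudWatch = AmazonCloudWatchClientBuilder.defaultClient();

        // One widget graphing consumed write capacity on the KCL leases table.
        String body = "{\"widgets\":[{\"type\":\"metric\",\"x\":0,\"y\":0,"
                + "\"width\":12,\"height\":6,\"properties\":{"
                + "\"metrics\":[[\"AWS/DynamoDB\",\"ConsumedWriteCapacityUnits\","
                + "\"TableName\",\"my-kcl-leases-table\"]],"
                + "\"period\":60,\"stat\":\"Sum\",\"region\":\"us-east-1\","
                + "\"title\":\"Leases table writes\"}}]}";

        cloudWatch.putDashboard(new PutDashboardRequest()
                .withDashboardName("KCLStreamsMonitoring")
                .withDashboardBody(body));
    }
}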
Metrics
KCL offers numerous useful metrics for monitoring your application, but the following table highlights a few vital ones for tracking your application’s health:
- RecordsProcessed: This metric is available per shard by default. With many shards, consider switching off shard-level metrics in favor of an aggregated view (see the configuration sketch after this list). Monitor this metric to make sure it keeps pace with the write throughput of your base table.
- ReturnedItemCount: This reports the number of items returned by the periodic scans of the KCL DynamoDB leases table, which makes it a direct measure of the total lease count and of whether your application is falling behind. If cleanupLeasesUponShardCompletion is set to TRUE (the default), the leases in the table should equal the number of open shards, unless you are reading from TRIM_HORIZON, in which case leases for older, already-closed shards remain until those shards are fully processed.
- CurrentLeases: This metric reflects the number of leases (shards) assigned to a worker and should stabilize over time, with possible brief spikes during shard rollovers.
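Here is a minimal worker configuration sketch showing how the settings mentioned above fit together, assuming the KCL 1.x API as used with the DynamoDB Streams Kinesis Adapter; the application name, stream ARN, and worker ID are placeholders:

// Sketch of a KCL 1.x worker configuration relevant to the metrics above.
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration;
import com.amazonaws.services.kinesis.metrics.interfaces.MetricsLevel;

public class KclConfig {
    public static KinesisClientLibConfiguration build(String streamArn, String workerId) {
        return new KinesisClientLibConfiguration(
                        "my-streams-app", streamArn,
                        new DefaultAWSCredentialsProviderChain(), workerId)
                // Aggregate metrics instead of per-shard metrics; with 256
                // shards, shard-level metrics are noisy and costly.
                .withMetricsLevel(MetricsLevel.SUMMARY)
                // Matches the sample dashboard above.
                .withFailoverTimeMillis(60000)
                // Default TRUE: leases for completed shards are deleted, so
                // the lease count tracks the number of open shards.
                .withCleanupLeasesUponShardCompletion(true)
                .withInitialPositionInStream(InitialPositionInStream.TRIM_HORIZON);
    }
}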
Monitoring consumed capacity and throttling metrics is crucial. When your application processes efficiently, the consumed write capacity on the KCL leases table will remain steady. However, you may see spikes in consumed read capacity due to scans based on worker configurations (e.g., FailoverTimeMillis). An increase in read capacity, a drop in write capacity, or excessive throttling could signal that your application is lagging.
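One way to catch this early is a CloudWatch alarm on throttling against the leases table. The sketch below alarms on any throttled read; the table name and SNS topic ARN are placeholders:

// Minimal sketch: alarm on throttled reads against the KCL leases table.
import com.amazonaws.services.cloudwatch.AmazonCloudWatch;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder;
import com.amazonaws.services.cloudwatch.model.ComparisonOperator;
import com.amazonaws.services.cloudwatch.model.Dimension;
import com.amazonaws.services.cloudwatch.model.PutMetricAlarmRequest;
import com.amazonaws.services.cloudwatch.model.Statistic;

public class LeasesTableAlarm {
    public static void main(String[] args) {
        AmazonCloudWatch cloudWatch = AmazonCloudWatchClientBuilder.defaultClient();
        cloudWatch.putMetricAlarm(new PutMetricAlarmRequest()
                .withAlarmName("kcl-leases-read-throttling")
                .withNamespace("AWS/DynamoDB")
                .withMetricName("ReadThrottleEvents")
                .withDimensions(new Dimension()
                        .withName("TableName").withValue("my-kcl-leases-table"))
                .withStatistic(Statistic.Sum)
                .withPeriod(300)              // five-minute windows
                .withEvaluationPeriods(1)
                .withThreshold(0.0)           // alarm on any throttled read
                .withComparisonOperator(ComparisonOperator.GreaterThanThreshold)
                .withAlarmActions("arn:aws:sns:us-east-1:123456789012:ops-alerts"));
    }
}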
Regarding resource utilization, keep an eye on CPU and memory usage through EC2 metrics. Note that EC2 does not publish memory metrics by default, so enabling them (for example, with the CloudWatch agent) gives you a fuller picture of your application’s health. For Java applications, JMX metrics such as thread count and heap memory usage are also valuable; open-source packages exist for publishing these to CloudWatch, or you can publish them yourself, as sketched below.
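A small publisher can read the standard JMX beans and push the values as custom CloudWatch metrics. Here is a minimal sketch (the namespace and dimension names are illustrative); you would run it on a periodic schedule on each worker:

// Sketch: publish JVM heap usage and thread count as custom metrics.
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.ThreadMXBean;

import com.amazonaws.services.cloudwatch.AmazonCloudWatch;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder;
import com.amazonaws.services.cloudwatch.model.Dimension;
import com.amazonaws.services.cloudwatch.model.MetricDatum;
import com.amazonaws.services.cloudwatch.model.PutMetricDataRequest;
import com.amazonaws.services.cloudwatch.model.StandardUnit;

public class JvmMetricsPublisher {
    public static void publish(String workerId) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Dimension worker = new Dimension().withName("WorkerId").withValue(workerId);

        AmazonCloudWatch cloudWatch = AmazonCloudWatchClientBuilder.defaultClient();
        cloudWatch.putMetricData(new PutMetricDataRequest()
                .withNamespace("KCLApplication")   // custom namespace (placeholder)
                .withMetricData(
                        new MetricDatum().withMetricName("HeapMemoryUsed")
                                .withDimensions(worker)
                                .withUnit(StandardUnit.Bytes)
                                .withValue((double) memory.getHeapMemoryUsage().getUsed()),
                        new MetricDatum().withMetricName("ThreadCount")
                                .withDimensions(worker)
                                .withUnit(StandardUnit.Count)
                                .withValue((double) threads.getThreadCount())));
    }
}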
Application Logs
To facilitate issue tracking and resolution, it’s beneficial to separate KCL logs from your application logs. Below is a sample log4j configuration that directs KCL logs to a file, which rotates after reaching 10 MB:
log4j.logger.com.amazonaws.services.kinesis=INFO, KCLLOGS
# Keep KCL messages out of the root logger's appenders (your application logs).
log4j.additivity.com.amazonaws.services.kinesis=false
log4j.appender.KCLLOGS=org.apache.log4j.RollingFileAppender
log4j.appender.KCLLOGS.File=/home/ec2-user/kcl.log
log4j.appender.KCLLOGS.MaxFileSize=10MB
log4j.appender.KCLLOGS.MaxBackupIndex=100
log4j.appender.KCLLOGS.ImmediateFlush=true
log4j.appender.KCLLOGS.threshold=INFO
log4j.appender.KCLLOGS.layout=org.apache.log4j.EnhancedPatternLayout
log4j.appender.KCLLOGS.layout.ConversionPattern=%d{ISO8601} %-5p %40C - %m%n%throwable
You can centralize all your logs in CloudWatch using the CloudWatch Logs agent. It’s also worth creating metric filters on those log groups so you can alarm on exceptions or errors specific to your application; a sketch follows.
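For example, the following sketch creates a metric filter that counts ERROR lines in the KCL log group; the log group name and metric namespace are placeholders, and you would pair the resulting metric with a CloudWatch alarm:

// Minimal sketch: count ERROR lines in the KCL log group as a metric.
import com.amazonaws.services.logs.AWSLogs;
import com.amazonaws.services.logs.AWSLogsClientBuilder;
import com.amazonaws.services.logs.model.MetricTransformation;
import com.amazonaws.services.logs.model.PutMetricFilterRequest;

public class KclLogMetricFilter {
    public static void main(String[] args) {
        AWSLogs logs = AWSLogsClientBuilder.defaultClient();
        logs.putMetricFilter(new PutMetricFilterRequest()
                .withLogGroupName("/kcl/application")   // group the agent ships to
                .withFilterName("kcl-error-count")
                .withFilterPattern("ERROR")             // match lines containing ERROR
                .withMetricTransformations(new MetricTransformation()
                        .withMetricNamespace("KCLApplication")
                        .withMetricName("KCLErrorCount")
                        .withMetricValue("1")));        // emit 1 per matching line
    }
}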
Here are some crucial KCL log messages and their meanings:
- “Skipping over the following data records”: This means your record processor threw an exception while processing a batch, and KCL skipped those records, resulting in data loss. Implement retry logic in your processor (see the sketch after this list) to mitigate this risk.
- “Can’t update checkpoint – instance doesn’t hold the lease for this shard”: This means the worker lost its lease for the shard (for example, after a failover), so the checkpoint was not saved; the worker that takes over the lease resumes from the last successful checkpoint and reprocesses the intervening records.
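To make the retry advice concrete, here is a minimal record processor sketch, assuming the KCL 1.x “v2” processor interface; handleRecord() stands in for your own application logic. Each record is retried a bounded number of times, and the processor checkpoints only after the whole batch has been handled:

// Sketch: bounded per-record retries inside processRecords().
import java.util.List;

import org.apache.log4j.Logger;

import com.amazonaws.services.kinesis.clientlibrary.interfaces.v2.IRecordProcessor;
import com.amazonaws.services.kinesis.clientlibrary.types.InitializationInput;
import com.amazonaws.services.kinesis.clientlibrary.types.ProcessRecordsInput;
import com.amazonaws.services.kinesis.clientlibrary.types.ShutdownInput;
import com.amazonaws.services.kinesis.model.Record;

public class RetryingRecordProcessor implements IRecordProcessor {
    private static final Logger LOG = Logger.getLogger(RetryingRecordProcessor.class);
    private static final int MAX_ATTEMPTS = 3;
    private static final long BACKOFF_MILLIS = 1000L;

    @Override
    public void initialize(InitializationInput input) { }

    @Override
    public void processRecords(ProcessRecordsInput input) {
        List<Record> records = input.getRecords();
        for (Record record : records) {
            for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
                try {
                    handleRecord(record);   // your application logic
                    break;
                } catch (Exception e) {
                    if (attempt == MAX_ATTEMPTS) {
                        // Out of retries. Divert the record somewhere durable
                        // (e.g., a dead-letter store) instead of rethrowing,
                        // since an uncaught exception makes KCL skip the batch.
                        LOG.error("Giving up on record " + record.getSequenceNumber(), e);
                    } else {
                        sleep(BACKOFF_MILLIS * attempt);   // simple linear backoff
                    }
                }
            }
        }
        try {
            input.getCheckpointer().checkpoint();   // checkpoint only after the batch
        } catch (Exception e) {
            // Lease lost or worker shutting down; the worker that takes over
            // resumes from the last successful checkpoint.
            LOG.warn("Checkpoint failed", e);
        }
    }

    @Override
    public void shutdown(ShutdownInput input) {
        // On TERMINATE (shard end) you should checkpoint so KCL can mark the
        // shard complete; omitted here for brevity.
    }

    private void handleRecord(Record record) {
        // Process one stream record here.
    }

    private static void sleep(long millis) {
        try { Thread.sleep(millis); } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        }
    }
}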
By effectively monitoring your application, you can ensure robust handling of DynamoDB Streams and avoid potential data loss.