LaunchDarkly’s Evolution from 1 TB to 100 TB Daily with Amazon Kinesis Data Streams

LaunchDarkly’s feature management platform empowers organizations to accelerate software delivery while measuring the effects of their feature rollouts. The platform’s SDKs capture event data, which is crucial for assessing feature impact. As customer adoption surged, we faced the challenge of scaling our event data pipeline to accommodate new use cases that required zero data loss. In this blog, we’ll discuss the initial architecture’s challenges and how we used Amazon Kinesis Data Streams along with other AWS services to enhance our infrastructure. We’ll also delve into the key cost and performance considerations in our Kinesis Data Streams implementation.

Problem Statement

LaunchDarkly aims to transform software delivery by enabling companies to innovate swiftly and deploy confidently. Our platform lets customers release features when they are ready, giving them complete control over their code so they can deliver faster while minimizing risk.

Our original event ingestion system, built in 2017, relied on a fleet of web servers that logged events into multiple databases. This architecture supported product features that gave customers insight into feature performance over time, experimentation for optimization, and rapid implementation verification. Unfortunately, because all database writes were executed within a single process on these servers, any database availability issue caused event data to queue in memory until the process crashed from memory exhaustion. This cycle repeated until the underlying database issue was resolved, and any event data the SDKs sent during that window was permanently lost.

The existing system could tolerate some data loss because usage of those features was limited, but the new use cases had much stricter data-loss requirements. We therefore looked for an architecture with isolated fault tolerance for each consumer, so that the others could keep operating when one failed. The result was an event-driven pipeline that is highly durable, scalable, and capable of data replay, and our ingestion capacity grew from approximately 1 TB to over 100 TB daily.

Solution

Our updated architecture integrates Amazon Kinesis Data Streams, AWS Lambda, and Amazon Kinesis Data Firehose to support the new use cases.

Key Components of the Design:

  • Mobile clients using the LaunchDarkly SDK for feature flag evaluations
  • Application Load Balancer (ALB) for traffic distribution to Amazon EC2 instances
  • Amazon EC2 nodes running a Go application that streams data to Amazon Kinesis Data Streams (a producer sketch follows this list)
  • Amazon Kinesis Data Streams for durable data persistence
  • AWS Lambda for transforming and writing data to various databases
  • Amazon OpenSearch Service for tracking user data
  • Amazon ElastiCache for managing flag statuses
  • Amazon Kinesis Data Firehose for batching flag evaluation data and writing it to Amazon S3
  • Amazon S3 for storing flag evaluation data
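
To make the producer side of this design concrete, here is a minimal sketch, not LaunchDarkly’s production code, of how a Go service like the one on those EC2 nodes could put a batch of event payloads onto the stream using the AWS SDK for Go v2. The stream name, partition key, and sample payload are hypothetical.

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/kinesis"
	"github.com/aws/aws-sdk-go-v2/service/kinesis/types"
)

// putEvents writes a batch of serialized event payloads to a Kinesis data
// stream. The partition key determines which shard the records land on.
func putEvents(ctx context.Context, client *kinesis.Client, streamName, partitionKey string, payloads [][]byte) error {
	entries := make([]types.PutRecordsRequestEntry, 0, len(payloads))
	for _, p := range payloads {
		entries = append(entries, types.PutRecordsRequestEntry{
			Data:         p,
			PartitionKey: aws.String(partitionKey),
		})
	}
	out, err := client.PutRecords(ctx, &kinesis.PutRecordsInput{
		StreamName: aws.String(streamName),
		Records:    entries,
	})
	if err != nil {
		return err
	}
	if count := aws.ToInt32(out.FailedRecordCount); count > 0 {
		// A production producer would retry only the failed entries; a fuller
		// sketch of that appears in the deep-dive section later in this post.
		log.Printf("%d of %d records failed and should be retried", count, len(entries))
	}
	return nil
}

func main() {
	cfg, err := config.LoadDefaultConfig(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	client := kinesis.NewFromConfig(cfg)

	// Hypothetical stream name, partition key, and event payload.
	payload := []byte(`{"kind":"feature","key":"new-checkout","value":true}`)
	if err := putEvents(context.Background(), client, "sdk-events-stream", "env-123", [][]byte{payload}); err != nil {
		log.Fatal(err)
	}
}
```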

The data flow begins with the LaunchDarkly SDKs sending data to the events API, which sits behind an ALB that distributes traffic across a fleet of EC2 servers. These servers persist the data to Amazon Kinesis Data Streams, and Lambda functions read from the streams to transform and store the data in different formats across several databases (a rough sketch of such a consumer follows the list below). This design prioritizes three critical properties:

  • Durability: Ensuring no data loss during processing issues.
  • Isolation: Preventing failures in one consumer from impacting others.
  • Data Replay: Allowing data to be reprocessed so that anomalies can be debugged and fixed retroactively.
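
Here is that consumer sketch: a minimal Go Lambda handler attached to the stream as an event source, assuming the aws-lambda-go package. The payload shape and destination write are placeholders; in the real pipeline the records carry protocol-buffer batches and are written to stores such as OpenSearch Service and ElastiCache.

```go
package main

import (
	"context"
	"encoding/json"
	"log"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
)

// flagEvaluation is a placeholder shape for a decoded event payload.
type flagEvaluation struct {
	FlagKey string `json:"key"`
	Value   any    `json:"value"`
}

// handler processes one batch of records delivered by the Kinesis event
// source. Each consumer Lambda tracks its own position in the stream, so a
// failure here does not affect the other consumers.
func handler(ctx context.Context, kinesisEvent events.KinesisEvent) error {
	for _, record := range kinesisEvent.Records {
		var eval flagEvaluation
		if err := json.Unmarshal(record.Kinesis.Data, &eval); err != nil {
			log.Printf("skipping malformed record %s: %v", record.Kinesis.SequenceNumber, err)
			continue
		}
		// Transform and write to the destination store (for example,
		// OpenSearch Service or ElastiCache); omitted in this sketch.
		log.Printf("processed evaluation for flag %q", eval.FlagKey)
	}
	return nil
}

func main() {
	lambda.Start(handler)
}
```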

Amazon Kinesis Data Streams meets these requirements through durable data persistence and consumer isolation, where each consumer maintains its iterator position independently. Moreover, Kinesis enables data replay by allowing consumers to set their shard iterator to a previous point in time.
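
Replaying a shard from an earlier point in time, for example, only requires requesting a shard iterator of type AT_TIMESTAMP. The following sketch, again using the AWS SDK for Go v2 with a hypothetical stream and shard ID, re-reads everything written during the last six hours:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/kinesis"
	"github.com/aws/aws-sdk-go-v2/service/kinesis/types"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	client := kinesis.NewFromConfig(cfg)

	// Position the iterator six hours in the past (hypothetical stream/shard).
	iterOut, err := client.GetShardIterator(context.Background(), &kinesis.GetShardIteratorInput{
		StreamName:        aws.String("sdk-events-stream"),
		ShardId:           aws.String("shardId-000000000000"),
		ShardIteratorType: types.ShardIteratorTypeAtTimestamp,
		Timestamp:         aws.Time(time.Now().Add(-6 * time.Hour)),
	})
	if err != nil {
		log.Fatal(err)
	}

	iterator := iterOut.ShardIterator
	for iterator != nil {
		out, err := client.GetRecords(context.Background(), &kinesis.GetRecordsInput{ShardIterator: iterator})
		if err != nil {
			log.Fatal(err)
		}
		for _, r := range out.Records {
			log.Printf("replaying record %s", aws.ToString(r.SequenceNumber))
		}
		if len(out.Records) == 0 && aws.ToInt64(out.MillisBehindLatest) == 0 {
			break // caught up with the tip of the shard
		}
		iterator = out.NextShardIterator // nil once the shard is closed
	}
}
```

This is the primitive behind the retroactive fixes mentioned above.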

We explored other technologies like Amazon Simple Notification Service (SNS) paired with Amazon Simple Queue Service (SQS) for durability and isolation. However, data replay would necessitate custom solutions. Apache Kafka was also an option but was not pursued due to the team’s unfamiliarity with it. Ultimately, Amazon Kinesis Data Streams was preferred for its managed nature, alleviating operational concerns.

Deep Dive into Amazon Kinesis Data Streams Implementation

During our proof-of-concept phase, we realized that while Amazon Kinesis Data Streams is fully managed, certain considerations arise when scaling.

  • Costs: Kinesis Data Streams on-demand pricing scales with the volume of data entering the stream; for predictable traffic, provisioned capacity is more cost-effective. In provisioned mode, however, each put is billed in 25 KB payload units, so our small average record size (approximately 100 bytes) would have left most of each billed unit unused. We therefore batch events into records close to the 25 KB PUT payload size, serialized with protocol buffers, as shown in the sketch after this list.
  • Client Error Handling: Writes to Kinesis Data Streams can fail partially: a PutRecords call can succeed overall while individual records are rejected, for example when a shard is throttled. The client must detect and retry those records rather than assume the whole batch succeeded. We designed our application to handle these cases so that data submissions remain reliable and consistent; the same sketch below includes a simple retry loop.
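
The sketch below illustrates both points under the same assumptions as the earlier examples (hypothetical stream name and partition key, events already serialized with protocol buffers): it packs small events into records close to the 25 KB PUT payload unit and re-submits only the entries that the PutRecords response reports as failed.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/kinesis"
	"github.com/aws/aws-sdk-go-v2/service/kinesis/types"
)

// targetRecordSize aligns records with the 25 KB PUT payload unit used for
// provisioned-capacity billing.
const targetRecordSize = 25 * 1024

// packEvents greedily packs already-serialized events into records of at most
// ~25 KB. In practice each event would be length-prefixed (or carried in a
// repeated protobuf field) so the consumer can split the batch again.
func packEvents(events [][]byte) [][]byte {
	var records [][]byte
	var current []byte
	for _, e := range events {
		if len(current) > 0 && len(current)+len(e) > targetRecordSize {
			records = append(records, current)
			current = nil
		}
		current = append(current, e...)
	}
	if len(current) > 0 {
		records = append(records, current)
	}
	return records
}

// putWithRetry submits the records and retries only the entries that the
// PutRecords response marks as failed (for example, due to a throttled shard).
func putWithRetry(ctx context.Context, client *kinesis.Client, stream string, records [][]byte) error {
	entries := make([]types.PutRecordsRequestEntry, len(records))
	for i, r := range records {
		entries[i] = types.PutRecordsRequestEntry{
			Data:         r,
			PartitionKey: aws.String("env-123"), // placeholder partition key
		}
	}
	for attempt := 1; len(entries) > 0 && attempt <= 5; attempt++ {
		out, err := client.PutRecords(ctx, &kinesis.PutRecordsInput{
			StreamName: aws.String(stream),
			Records:    entries,
		})
		if err != nil {
			return err
		}
		var failed []types.PutRecordsRequestEntry
		for i, result := range out.Records {
			if result.ErrorCode != nil {
				failed = append(failed, entries[i])
			}
		}
		entries = failed
		if len(entries) > 0 {
			time.Sleep(time.Duration(attempt) * 100 * time.Millisecond) // simple backoff
		}
	}
	if len(entries) > 0 {
		return fmt.Errorf("%d records still failing after retries", len(entries))
	}
	return nil
}

func main() {
	cfg, err := config.LoadDefaultConfig(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	client := kinesis.NewFromConfig(cfg)

	// Placeholder serialized events; in our pipeline these are protocol buffers.
	events := [][]byte{[]byte("event-1"), []byte("event-2"), []byte("event-3")}
	if err := putWithRetry(context.Background(), client, "sdk-events-stream", packEvents(events)); err != nil {
		log.Fatal(err)
	}
}
```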
