How Amazon Developed a Serverless Architecture for Real-Time Analysis of VPN Usage Metrics

This post outlines a reference architecture and optimization techniques for building serverless data analytics solutions on AWS with Amazon Kinesis Data Analytics. It also describes how the engineering team at Amazon built an operational analytics platform that processes usage data for its VPN services, which handle petabytes of data daily.

Amazon is a global leader in e-commerce and cloud computing, providing a wide array of services to millions of customers. The company believes that the digital ecosystem is truly empowering only when users feel secure about their online safety. Amazon has been an AWS customer since 2014.

For any organization, the value of operational data diminishes over time. That lost value can translate into lost revenue and wasted resources. Real-time streaming analytics helps capture the value while the data is still fresh, offering insights that can lead to new business opportunities. AWS provides a suite of services for delivering both real-time insights and historical trends, including managed Hadoop infrastructure through Amazon EMR and serverless options such as Kinesis Data Analytics and AWS Glue.

Amazon EMR supports various programming options for implementing business logic, including Spark Streaming, Apache Flink, and SQL. To design an effective architecture, customers need to understand their organizational capabilities, project timelines, business requirements, and AWS service best practices. That understanding helps them arrive at an architecture that balances performance, cost, security, reliability, and operational excellence, in line with the pillars of the AWS Well-Architected Framework.

Amazon is taking a systematic approach to real-time analytics on AWS, leveraging serverless technology to meet critical business objectives like time to market and total cost of ownership. In addition to Amazon’s implementation, this post shares key lessons learned and best practices for swiftly developing real-time analytics workloads.

Business Problem

Amazon offers its VPN product as a freemium service, which requires enforcing usage limits in real time so that freemium users are restricted once they exceed their quota. The challenge lies in doing this reliably and affordably.

Operating its VPN infrastructure across nearly all AWS Regions, Amazon has greatly enhanced user experience and VPN edge server performance by migrating from smaller hosting vendors. This has led to reduced connection latency, faster connection times, fewer connection errors, and improved upload and download speeds, thereby increasing the stability and uptime of VPN edge servers.

Usage data is collected by VPN edge servers and uploaded to backend statistics servers every minute, where it is stored in backend databases. This usage data serves multiple purposes:

Displaying the amount of data consumed by a device over the past 30 days.
Enforcing usage limits on freemium accounts, preventing users from connecting through VPN when they have exhausted their free quota until the next free cycle.
Allowing the internal business intelligence (BI) team to analyze usage data based on time, marketing campaigns, and account types, which can help predict growth and user retention.

Design Challenges

Amazon faced several design challenges:

The solution needed to accommodate both real-time and batch analysis simultaneously.
The solution had to be cost-effective. With hundreds of thousands of concurrent users, persisting usage information as it arrives would lead to tens of thousands of reads and writes per second, resulting in high database costs.

Solution Overview

Amazon opted to separate storage into two components: usage data is stored in Amazon DynamoDB for real-time access and in Amazon S3 for analysis, addressing both real-time enforcement and BI requirements. Kinesis Data Analytics aggregates the data and loads it into Amazon S3 and DynamoDB. Using Amazon Kinesis Data Streams and AWS Lambda as consumers of the Kinesis Data Analytics output made the user-level and device-level aggregations more straightforward to implement.

To minimize costs, user usage data was aggregated hourly before being stored in DynamoDB. This spreads hundreds of thousands of writes over an hour and reduced DynamoDB expenses by a factor of 30. A coarser aggregation window is not suitable for every scenario, but it is sufficient here: user usage does not need to be precise to the minute, and calculating and enforcing the usage limit on an hourly basis is acceptable.
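The effect of hourly aggregation on write volume can be illustrated with a small tumbling-window sketch in plain Python. It is only an illustration under assumptions: the record layout `(account_id, epoch_seconds, bytes_used)` and the one-report-per-minute cadence are hypothetical, the real aggregation runs in an Apache Flink application, and the example is not meant to reproduce the factor-of-30 cost figure, which depends on the actual traffic pattern.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

SECONDS_PER_HOUR = 3600


def aggregate_hourly(
    records: Iterable[Tuple[str, int, int]]
) -> Dict[Tuple[str, int], int]:
    """Sum bytes per (account_id, hour_start) tumbling window.

    Each input record is (account_id, epoch_seconds, bytes_used);
    this layout is an assumption made for the example.
    """
    totals: Dict[Tuple[str, int], int] = defaultdict(int)
    for account_id, ts, bytes_used in records:
        # Align the timestamp to the start of its hour
        hour_start = ts - (ts % SECONDS_PER_HOUR)
        totals[(account_id, hour_start)] += bytes_used
    return dict(totals)


# Sixty per-minute reports for one account, all inside the same hour
minute_reports = [("acct-1", 7200 + 60 * i, 1000) for i in range(60)]
hourly = aggregate_hourly(minute_reports)
print(len(minute_reports), "input records ->", len(hourly), "row(s) to write")
```

In this toy case the sixty per-minute records collapse into a single row per account per hour, which is the row a downstream writer would persist instead of sixty individual updates.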

The following diagram illustrates the high-level architecture. The solution consists of three logical components:

End-users: Real-time queries from devices to display current usage information (daily data consumption).
Business analysts: Historical usage data queries through Amazon Athena to derive business insights.
Usage limit enforcement: Real-time ingestion and aggregation of usage data.

The workflow is as follows:

Usage data is collected by a VPN edge server and sent to the backend service via an Application Load Balancer.
Each usage data record from the VPN edge server contains information for multiple users. A stats splitter divides the message into individual usage stats for each user and forwards it to Kinesis Data Streams.
The usage data is consumed by both the legacy stats processor and the new Apache Flink application developed and deployed on Kinesis Data Analytics.
The Apache Flink application performs the following tasks:
- Aggregates device usage data hourly and sends the results to Amazon S3 and to Kinesis Data Streams, from which a Lambda function stores them in DynamoDB.
- Aggregates device usage data daily and sends it to Amazon S3.
- Aggregates account usage data hourly and forwards the results to the outgoing data stream, which triggers a Lambda function to check if the account usage exceeds the limit. If so, the function sends the account information to another Lambda function via Amazon SQS to revoke access for that account.
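
The split step and the limit check in the workflow above can be sketched in plain Python. This is a minimal illustration, not Amazon's implementation: the message layout, the field names (`server_id`, `account_id`, `device_id`, `bytes_used`), the quota value, and the in-memory `cycle_totals` dictionary are all assumptions made for the example. The records returned by `split_stats` use the `Data`/`PartitionKey` shape that the Kinesis `PutRecords` API expects, but no AWS call is made here.

```python
import json
from typing import Any, Dict, List

# Assumed free quota for illustration only; the post does not state the value
FREE_QUOTA_BYTES = 500 * 1024 * 1024


def split_stats(batch: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn one edge-server report into one stream record per user entry."""
    records = []
    for entry in batch["usage"]:
        payload = {
            "server_id": batch["server_id"],
            "timestamp": batch["timestamp"],
            "account_id": entry["account_id"],
            "device_id": entry["device_id"],
            "bytes_used": entry["bytes_used"],
        }
        records.append({
            # Same partition key for all records of an account
            "PartitionKey": entry["account_id"],
            "Data": json.dumps(payload).encode("utf-8"),
        })
    return records


def check_limits(
    hourly_records: List[Dict[str, Any]],
    cycle_totals: Dict[str, int],
    quota_bytes: int = FREE_QUOTA_BYTES,
) -> List[str]:
    """Add hourly aggregates to per-account cycle totals and return JSON
    message bodies for accounts whose total has just crossed the quota."""
    messages = []
    for rec in hourly_records:
        account_id = rec["account_id"]
        previous = cycle_totals.get(account_id, 0)
        current = previous + rec["bytes_used"]
        cycle_totals[account_id] = current
        # Emit only on the crossing, so an account is not queued repeatedly
        if previous <= quota_bytes < current:
            messages.append(
                json.dumps({"account_id": account_id, "bytes_used": current})
            )
    return messages
```

In a deployment along the lines described above, the output of `split_stats` would be sent to the data stream and each message from `check_limits` would be placed on the SQS queue read by the function that revokes access. The running totals would also live in a durable store rather than a dictionary.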

Design Journey

Amazon required a solution capable of both real-time streaming and batch analytics. Kinesis Data Analytics met these needs due to its key features:

Real-time streaming and batch analytics for data aggregation.
Fully managed service with a pay-as-you-go model.
Auto-scaling capabilities.

In conclusion, Amazon leveraged Kinesis Data Analytics to aggregate customer usage data per device hourly, sending results to Kinesis Data Streams (ultimately to DynamoDB) and the data lake (Amazon S3).
