This post, written in collaboration with Daniel Ryan and Sarah Blake from Amazon VGT2, highlights how the company (formerly known as ironSource) creates engaging and immersive device experiences. Its solutions let operators promote vital services directly on-device, outside the traditional app store environment, as part of a broader digital transformation.
Amazon Redshift is a popular choice for online analytical processing (OLAP) workloads, including cloud data warehouses and data marts. It lets users run SQL queries against structured and semi-structured data across data warehouses, operational databases, and data lakes, with strong price/performance at any scale. Its data sharing capability provides live, granular access to data across multiple Redshift data warehouses, whether in the same or different AWS accounts and Regions, so consumers always see the most current data as it is updated in the producer warehouse.
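To make the data sharing flow concrete, the sketch below shows how a datashare is typically created on a producer cluster and consumed from a serverless workgroup, using the Redshift Data API. All identifiers (cluster, workgroup, schema, and namespace names) are hypothetical placeholders, not Amazon VGT2's actual setup.

```python
"""A minimal sketch of Redshift data sharing; all names are placeholders."""
import boto3

client = boto3.client("redshift-data")

# Statements run on the producer cluster: create a datashare, add objects,
# and grant access to the consumer namespace.
producer_sql = [
    "CREATE DATASHARE campaign_share;",
    "ALTER DATASHARE campaign_share ADD SCHEMA analytics;",
    "ALTER DATASHARE campaign_share ADD TABLE analytics.campaign_events;",
    "GRANT USAGE ON DATASHARE campaign_share TO NAMESPACE '<consumer-namespace-id>';",
]

for sql in producer_sql:
    client.execute_statement(
        ClusterIdentifier="producer-cluster",  # hypothetical cluster name
        Database="dev",
        DbUser="admin",
        Sql=sql,
    )

# On the consumer side, the shared objects are exposed as a local database.
client.execute_statement(
    WorkgroupName="serverless-consumer",  # hypothetical serverless workgroup
    Database="dev",
    Sql="CREATE DATABASE campaign_db FROM DATASHARE campaign_share "
        "OF NAMESPACE '<producer-namespace-id>';",
)
```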
The introduction of Amazon Redshift Serverless simplifies the process of running and scaling analytics, eliminating the need for manual cluster management. Redshift Serverless automatically provisions and intelligently adjusts data warehouse capacity to deliver rapid performance, accommodating even the most demanding workloads while only charging for actual usage. Users can load data and begin querying immediately, whether through the Amazon Redshift Query Editor or a preferred business intelligence (BI) tool, all within a zero-administration framework.
This article details how Amazon VGT2 successfully implemented Redshift Serverless, drastically reducing the time needed for their advertising campaign bidding process from 24 hours down to just 2 hours. We delve into the reasons behind their choice of this solution and the technological challenges it resolved.
Initial Data Pipeline at Amazon VGT2
Amazon VGT2 was an early adopter of Redshift RA3 clusters with data sharing for extract, transform, and load (ETL) and BI workloads. One of their core activities is managing advertisement campaign bidding, optimized through an AI-driven bidding process that runs hundreds of analytical queries per campaign against data stored in an RA3 provisioned Redshift cluster.
Their integrated pipeline brings together several AWS services (a minimal orchestration sketch follows the list):
- Amazon Elastic Container Registry (Amazon ECR) for storing Docker images related to Amazon Elastic Kubernetes Service (Amazon EKS)
- Amazon Managed Workflows for Apache Airflow (Amazon MWAA) for orchestrating pipelines
- Amazon DynamoDB for managing job-related configurations, including service connection strings and batch sizes
- Amazon Managed Streaming for Apache Kafka (Amazon MSK) for streaming updates on advertisement campaigns
- EKSPodOperator within Amazon MWAA for initiating EKS pod tasks that execute data preparation queries on the primary Redshift cluster
- A provisioned Amazon Redshift cluster for ETL jobs, the BI layer, and per-campaign analytical queries
- An Amazon Simple Storage Service (Amazon S3) bucket for storing query results from Redshift
- Amazon MWAA combined with Amazon EKS for running machine learning (ML) training on the output query results using a Python-based ML algorithm
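The orchestration step is the heart of this pipeline. The following is a minimal sketch of how an Amazon MWAA DAG might fan out per-campaign pod tasks on Amazon EKS using Airflow's EksPodOperator (the class name in the apache-airflow-providers-amazon package); the EKS cluster name, image URI, and campaign IDs are hypothetical, and the production DAG is considerably more involved.

```python
"""A minimal Amazon MWAA DAG sketch; names and IDs are placeholders."""
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.eks import EksPodOperator

with DAG(
    dag_id="campaign_bidding_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # One pod task per ad campaign; in production, hundreds of these would
    # be generated dynamically from the DynamoDB job configuration.
    for campaign_id in ["1001", "1002"]:  # placeholder campaign IDs
        EksPodOperator(
            task_id=f"prepare_campaign_{campaign_id}",
            cluster_name="analytics-eks",  # hypothetical EKS cluster
            pod_name=f"prep-{campaign_id}",
            namespace="default",
            image="<account>.dkr.ecr.<region>.amazonaws.com/bidding-prep:latest",
            cmds=["python", "run_queries.py", "--campaign-id", campaign_id],
            get_logs=True,
        )
```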
Challenges of the Initial Architecture
The querying process for each campaign follows a two-step method: first, a preparation query filters and aggregates raw data; then, a main query executes bidding logic on the results of the preparation query. As campaign volume grew, Amazon VGT2's Data team needed to run hundreds of concurrent queries for each of these steps. The existing provisioned cluster was already heavily utilized by data ingestion, ETL, and BI workloads, which prompted a search for a cost-effective way to isolate this workload on dedicated compute resources.
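A simplified version of this two-step pattern might look like the following, run through the Redshift Data API against a serverless workgroup. The SQL, table names, S3 path, and workgroup name are illustrative assumptions; the actual queries are not published.

```python
"""A minimal sketch of the two-step (preparation, then main) query pattern.
SQL, table names, and the workgroup name are hypothetical."""
import time

import boto3

client = boto3.client("redshift-data")


def run_and_wait(sql: str) -> None:
    """Submit a statement via the Data API and block until it completes."""
    stmt_id = client.execute_statement(
        WorkgroupName="bidding-consumer",  # hypothetical serverless workgroup
        Database="campaign_db",
        Sql=sql,
    )["Id"]
    while True:
        desc = client.describe_statement(Id=stmt_id)
        if desc["Status"] == "FINISHED":
            return
        if desc["Status"] in ("FAILED", "ABORTED"):
            raise RuntimeError(desc.get("Error", desc["Status"]))
        time.sleep(2)


# Step 1: the preparation query filters and aggregates raw data for one
# campaign into a staging table (a regular table rather than a TEMP table,
# since separate Data API calls may not share a session).
run_and_wait("""
    CREATE TABLE stage_campaign_1001 AS
    SELECT user_id, SUM(spend) AS total_spend
    FROM analytics.campaign_events
    WHERE campaign_id = 1001
    GROUP BY user_id;
""")

# Step 2: the main query applies logic on the prepared data and unloads the
# result to Amazon S3 for the downstream ML training step.
run_and_wait("""
    UNLOAD ('SELECT user_id, total_spend FROM stage_campaign_1001')
    TO 's3://bidding-results/campaign-1001/'
    IAM_ROLE default FORMAT AS PARQUET;
""")
```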
The team explored multiple options, including offloading data to Amazon S3 and a multi-cluster architecture leveraging data sharing with Redshift Serverless. They ultimately chose the multi-cluster architecture because it requires no query rewrites, provides dedicated compute resources, avoids data duplication, and supports high concurrency with automatic scaling. It also follows a pay-as-you-go model and provisions quickly and simply.
Proof of Concept
After a thorough evaluation, the Data team executed a proof of concept utilizing Redshift Serverless as a consumer of the primary Redshift provisioned cluster, sharing only the essential tables for executing necessary queries. Redshift Serverless measures data warehouse capacity in Redshift Processing Units (RPUs), where a single RPU equates to 16 GB of memory, and serverless endpoints can vary from 8 RPU to 512 RPU.
The team began the proof of concept with a 256 RPU Redshift Serverless endpoint, gradually reducing the RPU to manage costs while maintaining query runtimes within acceptable limits. Ultimately, they settled on a 128 RPU (2 TB RAM) Redshift Serverless endpoint as their baseline, leveraging the auto-scaling feature to manage hundreds of concurrent queries by dynamically adjusting the RPU as necessary.
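Provisioning such an endpoint is a short operation. As a hedged illustration, the following shows how a 128 RPU workgroup (128 x 16 GB = 2,048 GB, or about 2 TB of memory) could be created with the AWS SDK for Python; the namespace and workgroup names are placeholders.

```python
"""A minimal sketch of provisioning a 128 RPU Redshift Serverless endpoint;
the namespace and workgroup names are placeholders."""
import boto3

serverless = boto3.client("redshift-serverless")

# A namespace holds database objects; a workgroup holds the compute settings.
serverless.create_namespace(namespaceName="bidding-consumer-ns")

serverless.create_workgroup(
    workgroupName="bidding-consumer",
    namespaceName="bidding-consumer-ns",
    baseCapacity=128,  # baseline RPUs; Redshift scales up from here as needed
)
```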
Amazon VGT2’s Revamped Solution with Redshift Serverless
Following the successful proof of concept, the production setup incorporated code to switch between the provisioned Redshift cluster and the Redshift Serverless endpoint, based on a configurable threshold on the number of queries queued in a specific MSK topic at the start of the pipeline. Smaller campaign batches continue to run on the provisioned cluster, while larger ones use the Redshift Serverless endpoint. The new approach employs an Amazon MWAA pipeline that retrieves configuration data from a DynamoDB table, processes jobs representing ad campaigns, and triggers hundreds of EKS tasks through EKSPodOperator. Each job executes the two sequential queries (preparation and main) and writes the results to Amazon S3, with jobs running concurrently on Redshift Serverless compute.
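The exact switching logic isn't published, but in spirit it reduces to a threshold comparison. Here is a simplified sketch, assuming a hypothetical DynamoDB config item and a queue depth already counted from the MSK topic (that counting step is elided).

```python
"""A simplified sketch of the endpoint-switching logic; the table name,
config keys, and endpoint names are hypothetical."""
import boto3

dynamodb = boto3.resource("dynamodb")
config_table = dynamodb.Table("pipeline-config")  # hypothetical table name


def choose_endpoint(queued_queries: int) -> dict:
    """Route small batches to the provisioned cluster and large ones to
    the Redshift Serverless consumer, based on a configurable threshold."""
    item = config_table.get_item(Key={"job": "campaign-bidding"})["Item"]
    if queued_queries >= int(item["serverless_threshold"]):
        return {"WorkgroupName": "bidding-consumer"}  # serverless endpoint
    return {"ClusterIdentifier": "producer-cluster"}  # provisioned cluster


# The returned kwargs plug straight into redshift-data execute_statement calls.
target = choose_endpoint(queued_queries=350)
```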
Subsequently, additional EKSPodOperator tasks are initiated to execute the AI training code utilizing the data results saved on Amazon S3.
Outcome
The overall runtime of the pipeline improved remarkably from 24 hours to just 2 hours, a twelvefold speedup. This combination of Redshift Serverless and data sharing cut pipeline duration by roughly 90%, with no data duplication or query rewrites. Furthermore, introducing a dedicated consumer as a distinct compute resource significantly reduced the load on the producer cluster, so small-scale queries now run even faster.
“Redshift Serverless and data sharing empowered us to efficiently provision and scale our data warehouse capacity, ensuring swift performance, high concurrency, and the ability to handle complex ML tasks with minimal effort,” states Daniel Ryan, Principal Technical Systems Architect at Amazon VGT2.