Amazon Onboarding with Learning Manager Chanci Turner

Introduction


CoStar is recognized as a leader in Commercial Real Estate data, but it also operates significant home, rental, and apartment websites, including apartments.com, famously promoted by Jeff Goldblum. Their traditional customers in Commercial Real Estate are well-informed users who rely on extensive data to make essential business decisions. Helping clients analyze and choose from 6 million properties covering 130 billion square feet of space has positioned CoStar as a frontrunner in data and analytics technology. As CoStar embarked on developing the next generation of their Apartments and Homes platforms, it became evident that their new customers had distinct profiles and demands compared to their long-standing Commercial Real Estate clients. CoStar aimed to provide the same decision-making insights to a far larger customer base and data set, prompting their migration from legacy data centers to AWS for the speed and elasticity necessary to deliver value to millions accessing hundreds of millions of properties.

Challenge

CoStar’s primary challenge has always been collecting data from numerous sources, enriching it with critical insights, and presenting it in a user-friendly manner. The CoStar Suite’s offerings in Commercial Real Estate, Apartments, and Homes each draw from different data sources that refresh at varying intervals and volumes. The infrastructure supporting these data ingestions and updates must be fast, precise, and scalable in order to remain affordable. Many of these systems transitioned from legacy data centers into CoStar’s AWS environment, necessitating parallel and interoperable systems to prevent significant duplication of engineering support. These requirements led to the need for Kubernetes operations both on-premises and in AWS, with adaptable scaling of their container clusters based on usage fluctuations. After several months of successful testing and deployment, CoStar sought to further optimize their engineering stack while maintaining as much parallel on-premises Kubernetes management as feasible.

In the Kubernetes cluster setup, the control plane and its components oversee cluster operations (like scheduling containers, ensuring application availability, and storing cluster data). Amazon Elastic Kubernetes Service (Amazon EKS) is a managed Kubernetes solution on AWS that handles the availability and scalability of the Kubernetes control plane. For worker nodes, customers can deploy Kubernetes pod workloads on a mix of provisioned Amazon Elastic Compute Cloud (Amazon EC2) and AWS Fargate. This article will delve into how CoStar utilized the Karpenter autoscaling solution to provision Amazon EC2 instances for worker nodes.
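As a concrete illustration of this split, the sketch below shows a hypothetical eksctl cluster definition that runs some workloads on provisioned Amazon EC2 worker nodes and others on AWS Fargate. The cluster name, region, sizes, and namespace are illustrative assumptions, not CoStar’s actual setup:

```yaml
# Hypothetical eksctl ClusterConfig: an EC2 managed node group plus a
# Fargate profile, so pods can land on either compute type.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: demo-cluster        # assumed name
  region: us-east-1         # assumed region
managedNodeGroups:
  - name: ec2-workers
    instanceType: m5.large
    minSize: 2
    maxSize: 6
fargateProfiles:
  - name: serverless-workloads
    selectors:
      - namespace: batch    # pods in this namespace run on Fargate
```

With a configuration like this, pods scheduled into the `batch` namespace run on Fargate, while everything else lands on the EC2 managed node group.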

The traditional approach for provisioning worker nodes involves using Amazon EKS-managed node groups, which automate the lifecycle management of the underlying Amazon EC2 instances with Amazon EC2 Auto Scaling Groups. To dynamically adjust Amazon EC2 instances, Amazon EKS-managed node group functionality can be combined with the Cluster Autoscaler solution. This autoscaling tool monitors pending pods awaiting compute capacity and identifies underutilized worker nodes. When pods are pending due to a lack of resources, the Cluster Autoscaler increases the desired instance count in the Amazon EC2 Auto Scaling group, provisioning new worker nodes and allowing those pods to run. The Cluster Autoscaler also terminates underutilized nodes based on specific criteria.
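As a rough sketch of how this traditional setup is wired together, the fragment below shows the kind of container arguments the Cluster Autoscaler Deployment is typically given on AWS, using Auto Scaling group tag auto-discovery. The cluster name and image tag are placeholders, not values from the article:

```yaml
# Fragment of a Cluster Autoscaler Deployment spec (sketch; the cluster
# name "demo-cluster" and the image tag are assumptions).
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.27.2
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      # Discover Auto Scaling groups by these tags rather than naming
      # each node group explicitly:
      - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/demo-cluster
      - --balance-similar-node-groups
```

When pods go pending, the autoscaler raises the desired capacity of a matching Auto Scaling group; when nodes sit underutilized past its thresholds, it drains and removes them.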

For CoStar’s workloads on Amazon EKS, the objective was to maximize availability and performance while optimizing resource efficiency. Although the Cluster Autoscaler offers a degree of dynamic compute provisioning and cost savings, it comes with several challenges and limitations. Specifically, the Amazon EC2 instance types within a node group must share similar CPU, memory, and GPU specifications to prevent undesired behavior. This limitation arises because the Cluster Autoscaler uses the first instance type specified in the node group policy to simulate pod scheduling. If higher-spec instance types are included, node resources can be wasted after scaling out, while lower-spec types may lead to pod scheduling failures. To accommodate CoStar’s diverse pod resource needs, multiple node groups with similar instance types had to be established. Moreover, the Cluster Autoscaler only removes underutilized nodes; it does not replace them with cheaper instances as workloads change. And for CoStar’s stateless workloads, targeting Spot capacity to achieve deeper discounts over On-Demand instances proved cumbersome with node groups.

Solution Overview

Why Karpenter

CoStar required a more effective approach to provisioning nodes for their varied workload demands without the burden of managing multiple node groups. This challenge was met with the open-source Karpenter node provisioning solution. Karpenter is a flexible, high-performance autoscaler that offers dynamic, groupless provisioning of worker node capacity in response to unscheduled pods. Thanks to Karpenter’s groupless architecture, CoStar is no longer restricted to using similarly specified instance types. Karpenter continuously assesses the cumulative resource needs of pending pods along with other scheduling factors (like node selectors, affinities, tolerations, and topology spread constraints) to provision optimal instance compute capacity as defined in the Provisioner Custom Resource Definition (CRD). This flexibility allows various teams at CoStar to use their own Provisioner configurations tailored to their specific application and scaling requirements. Additionally, Karpenter provisions nodes directly via the Amazon EC2 Fleet API, eliminating the need for node groups and Amazon EC2 Auto Scaling groups, which improves provisioning and termination speeds (from minutes to milliseconds) and bolsters CoStar’s performance service level agreements (SLAs). Furthermore, the CoStar team decided to run the Karpenter controller on AWS Fargate, completely removing the need for managed node groups.
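A minimal sketch of what such a Provisioner might look like, using the v1alpha5 Provisioner API the article references, is shown below. The name, limits, and values are illustrative assumptions, not CoStar’s configuration:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default             # assumed name; each team can run its own
spec:
  requirements:
    # Groupless: no fixed instance type list; constrain only what matters.
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64"]
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
  limits:
    resources:
      cpu: "1000"           # cap on total provisioned vCPU (assumed value)
  ttlSecondsAfterEmpty: 30  # reclaim empty nodes quickly
  providerRef:
    name: default           # points at an AWSNodeTemplate with subnet/SG selectors
```

Because there is no node group, Karpenter picks an instance type that fits the pending pods’ aggregate requests and scheduling constraints, then launches it directly through the EC2 Fleet API.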

The diagram below illustrates how Karpenter observes the total resource requests of unscheduled pods, decides to launch new nodes, and terminates them to minimize infrastructure costs:

To ensure cost-effectiveness for CoStar’s stateless workloads and lower environments, the team configured the Karpenter Provisioner to prioritize Spot capacity and only provision On-Demand capacity if no Spot capacity is available. Karpenter employs the price-capacity-optimized allocation strategy for Spot capacity, balancing cost with the likelihood of short-term interruptions. For stateful workloads in production clusters, the Karpenter Provisioner specifies a selection of compute and storage-optimized instance families running On-Demand, most of which are covered by Compute Savings Plans and Reserved Instances to secure discounts.
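The two capacity strategies described above map naturally onto per-cluster Provisioner requirements. The fragments below are hedged illustrations; the instance families are example values, not CoStar’s actual list:

```yaml
# Stateless workloads / lower environments: allow both capacity types.
# When both are listed, Karpenter prefers Spot and falls back to
# On-Demand when no Spot capacity is available.
- key: karpenter.sh/capacity-type
  operator: In
  values: ["spot", "on-demand"]
---
# Production stateful clusters: On-Demand only, restricted to compute-
# and storage-optimized families (example values).
- key: karpenter.sh/capacity-type
  operator: In
  values: ["on-demand"]
- key: karpenter.k8s.aws/instance-family
  operator: In
  values: ["c5", "c6i", "i3"]
```

The Compute Savings Plans and Reserved Instance discounts apply at the billing layer, so no additional Karpenter configuration is needed beyond steering provisioning toward the covered instance families.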


