How Zapier Executes Isolated Tasks on AWS Lambda and Scales Function Upgrades


Zapier, a prominent no-code automation platform, enables users to streamline workflows and transfer data across more than 8,000 applications, including Slack, Salesforce, Asana, and Dropbox. The automation processes, known as Zaps, are powered by a serverless architecture utilizing Amazon Web Services (AWS), with each Zap managed by an AWS Lambda function.

In this article, we will explore how Zapier has structured its serverless architecture, emphasizing three critical factors: the use of Lambda functions for isolated Zaps, the management of over a hundred thousand Lambda functions through Zapier’s control plane, and the enhancement of security measures through automated function upgrades and cleanup workflows.

Designing a Secure and Isolated Runtime Environment

Zaps built by Zapier’s users embody tenant-specific business logic, which requires cross-tenant compute isolation. In practice, the execution environments for one Zap must never be shared with those of another, even when different users run the same type of Zap.

To achieve such stringent isolation, Zapier’s engineering team adopted AWS Lambda, a serverless compute service that runs code in response to events and automatically manages the underlying compute resources. Its minimal operational overhead, built-in high availability, automatic scaling, robust isolation, and pay-per-use pricing made Lambda a good fit for this need. Zapier currently operates over a hundred thousand Lambda functions to run customer integration workflows.

Each function runs on Firecracker microVMs, which isolate it from every other function. In addition, each of a function’s execution environments (often called function instances) is isolated from the others. The following architecture diagram marks these isolation boundaries with red lines, showing that each execution environment is allocated its own resources, including disk space, memory, and CPU.

Zapier’s control plane is constructed using Amazon Elastic Kubernetes Service (Amazon EKS), with a dedicated database to maintain an up-to-date function inventory. When a user initiates a new Zap, the control plane generates a corresponding Lambda function and logs a reference in the inventory. Upon triggering a Zap, the control plane retrieves the relevant Lambda function information and invokes it to execute the integration workflow, as depicted in the accompanying diagram.
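The create-then-invoke flow described above can be sketched in a few lines of Python. This is a minimal illustration and not Zapier’s implementation: the class and method names, the `zap-<id>` naming scheme, and the in-memory dictionary standing in for the inventory database are all assumptions, and the two injected callables stand in for real calls to the Lambda API.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class FunctionRecord:
    """Inventory entry linking one Zap to its dedicated Lambda function."""
    zap_id: str
    function_name: str
    runtime: str


class ControlPlane:
    """Toy control plane: one function per Zap, tracked in an inventory."""

    def __init__(
        self,
        create_fn: Callable[[str, str], None],
        invoke_fn: Callable[[str, dict], dict],
    ) -> None:
        # create_fn and invoke_fn are placeholders for Lambda API calls.
        self._create_fn = create_fn
        self._invoke_fn = invoke_fn
        # A dict stands in for the dedicated inventory database.
        self._inventory: Dict[str, FunctionRecord] = {}

    def create_zap(self, zap_id: str, runtime: str) -> FunctionRecord:
        """Create the function for a new Zap and record it in the inventory."""
        if zap_id in self._inventory:
            raise ValueError(f"Zap {zap_id} already has a function")
        name = f"zap-{zap_id}"  # hypothetical naming scheme
        self._create_fn(name, runtime)
        record = FunctionRecord(zap_id, name, runtime)
        self._inventory[zap_id] = record
        return record

    def trigger_zap(self, zap_id: str, event: dict) -> dict:
        """Look up the Zap's function and invoke it with the trigger event."""
        record = self._inventory[zap_id]  # raises KeyError for unknown Zaps
        return self._invoke_fn(record.function_name, event)
```

Keeping the cloud calls behind injected callables means the lookup logic can be exercised with simple fakes, and the inventory remains the single source of truth for which function belongs to which Zap.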

Navigating the Runtime Deprecation Process

In traditional non-serverless architectures, cloud engineers are tasked with updating operating systems and software on their compute instances, applying security and maintenance patches. However, with serverless architectures like Lambda, AWS automatically handles security patches and minor runtime upgrades, allowing customers to focus on delivering business value rather than managing infrastructure.

When a major Lambda managed runtime version reaches its end-of-life, AWS begins a deprecation process communicated through the AWS Health Dashboard and direct emails to impacted customers. As deprecated runtimes lose access to security updates and support, organizations must upgrade to supported versions to mitigate potential security threats. For more information, refer to the AWS documentation on the shared responsibility model, runtime use post-deprecation, and runtime deprecation notifications.

As Zapier’s user base, architectural complexity, and total number of Zaps grew, ensuring that all functions ran on the latest major runtime versions became a daunting task. Key contributing factors included:

  • High Function Count: At its peak, Zapier’s platform managed Zaps utilizing hundreds of thousands of unique Lambda functions, with approximately 35% on runtimes slated for deprecation in the upcoming year.
  • Ephemeral Data Plane Environment: Zapier’s control plane dynamically creates and deletes Lambda functions, complicating ownership identification for affected functions.
  • Security Prioritization: Functions must never operate on outdated runtimes, so every upgrade had to be completed before the deprecation date, which required additional resources.
  • Customer Experience: The upgrade process needed to occur without impacting customer experience at any time.

Faced with stringent requirements, the Platform Engineering team at Zapier embraced the challenge of maintaining robust security within their architecture.

Implementing the Solution

The solution encompassed three key workstreams:

  1. Risk Reduction: Analyzing architecture to identify and remove unused functions.
  2. Upgrade Prioritization: Assessing and prioritizing critical functions for upgrades.
  3. Empowerment of Engineering Teams: Providing automated tools and knowledge to facilitate future upgrade processes.

Identifying and Cleaning Up Unused Functions

The initial step in streamlining the upgrade process involved pinpointing and deleting unused functions, thus decreasing the overall number of functions requiring upgrades and eliminating redundant work.

Zapier began by enhancing its function inventory with runtime details using AWS Trusted Advisor and Amazon CloudWatch dashboards. This approach allowed the team to create a comprehensive inventory of functions relying on soon-to-be deprecated runtimes. By leveraging Amazon CloudWatch, the platform team monitored metrics such as invocation counts to assess which functions were active, which had been dormant for extended periods, and which lacked clear ownership, making them candidates for removal.
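As a rough sketch of this inventory analysis, the Python below flags functions on soon-to-be-deprecated runtimes and splits them into active and dormant groups based on invocation counts. The record shape, the 90-day window, and the contents of `DEPRECATING_RUNTIMES` are illustrative assumptions; in practice the runtime list would come from AWS deprecation notices and the daily counts from CloudWatch metrics.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

# Illustrative values only; the real list comes from AWS deprecation notices.
DEPRECATING_RUNTIMES = {"python3.8", "nodejs16.x"}


@dataclass(frozen=True)
class FunctionUsage:
    name: str
    runtime: str
    daily_invocations: List[int]  # one count per day, oldest first


def needs_upgrade(fn: FunctionUsage) -> bool:
    """True if the function runs on a runtime slated for deprecation."""
    return fn.runtime in DEPRECATING_RUNTIMES


def is_dormant(fn: FunctionUsage, window_days: int = 90) -> bool:
    """True if the function had no invocations in the most recent window.

    A function with no recorded datapoints is treated as dormant.
    """
    return sum(fn.daily_invocations[-window_days:]) == 0


def summarize(functions: Iterable[FunctionUsage]) -> Dict[str, object]:
    """Split at-risk functions into upgrade work and cleanup candidates."""
    functions = list(functions)
    at_risk = [f for f in functions if needs_upgrade(f)]
    return {
        "upgrade": sorted(f.name for f in at_risk if not is_dormant(f)),
        "cleanup_candidates": sorted(f.name for f in at_risk if is_dormant(f)),
        "share_at_risk": len(at_risk) / len(functions) if functions else 0.0,
    }
```

The `share_at_risk` figure is the kind of number behind a statement such as "approximately 35% on runtimes slated for deprecation", and the two name lists feed the cleanup and upgrade workstreams respectively.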

Ownership validation within the organization utilized resource tags. Functions that were active yet lacked clear ownership were flagged for further review prior to deletion. Functions confirmed as unused or without active ownership were marked for removal, simplifying Zapier’s architecture and reducing the number of functions necessitating upgrades.
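The triage rules in this paragraph can be expressed as a small decision function. The sketch below is one possible reading of those rules rather than Zapier’s actual logic: the `owner` tag key and the `Action` names are hypothetical, and "unused" is simplified to mean zero recent invocations.

```python
from enum import Enum
from typing import Mapping

OWNER_TAG = "owner"  # hypothetical tag key used to record ownership


class Action(Enum):
    UPGRADE = "upgrade"  # active and owned: keep and upgrade
    REVIEW = "review"    # active but unowned: needs a human decision
    DELETE = "delete"    # unused: safe to mark for removal


def triage(tags: Mapping[str, str], recent_invocations: int) -> Action:
    """Decide what to do with a function from its tags and recent activity."""
    has_owner = bool(tags.get(OWNER_TAG, "").strip())
    if recent_invocations == 0:
        return Action.DELETE
    if not has_owner:
        # Active functions are never deleted automatically.
        return Action.REVIEW
    return Action.UPGRADE
```

Routing active but unowned functions to review, instead of deleting them outright, reflects the requirement that cleanup must not disrupt workflows that are still in use.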

Prioritizing Upgrades

With a reduced number of functions to upgrade, Zapier’s platform team organized the upgrade process based on usage patterns, criticality, and potential customer impact. The prioritization categories included:

  • Customer-Facing Functions: Functions directly involved in executing user Zaps received high priority, requiring upgrades first to prevent service disruptions.
  • Backend Infrastructure Functions: Internal functions supporting system operations were assessed based on their significance to the platform.
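One way to turn these categories into an ordered work queue is a multi-key sort, sketched below. The `criticality` score and `weekly_invocations` field are assumptions added for illustration; the source states only that usage patterns, criticality, and customer impact informed the ordering.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable, List


class Tier(IntEnum):
    CUSTOMER_FACING = 0  # upgraded first
    BACKEND = 1


@dataclass(frozen=True)
class UpgradeCandidate:
    name: str
    tier: Tier
    criticality: int         # hypothetical score, 1 (low) to 5 (high)
    weekly_invocations: int  # hypothetical usage signal


def upgrade_order(candidates: Iterable[UpgradeCandidate]) -> List[UpgradeCandidate]:
    """Order by tier, then higher criticality, then heavier usage."""
    return sorted(
        candidates,
        key=lambda c: (c.tier, -c.criticality, -c.weekly_invocations),
    )
```

Because the tier is the first sort key, every customer-facing function is scheduled ahead of any backend function, and the remaining keys only break ties within a tier.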
