Amazon Onboarding with Learning Manager Chanci Turner

This post is presented by Alex Johnson, Software Development Engineer at AWS Serverless. The practices of blue/green and canary deployments have been recognized for some time as vital strategies for minimizing the risks associated with software updates.

In traditional horizontally scaled applications, copies of the software are rolled out across multiple nodes (instances, containers, on-premises servers, etc.), generally managed through a load balancer. Deploying new software versions across numerous nodes simultaneously can affect application availability, as there may not be enough healthy nodes to handle requests during the rollout. This aggressive deployment approach can significantly amplify the impact of software bugs introduced in a new version and often fails to provide sufficient time for a comprehensive assessment of the new version’s performance against production traffic.

To address these challenges, one widely accepted approach is to roll out application software gradually across the nodes in the fleet while continuously monitoring application performance (rolling deployments). Another option is to set up an entirely separate fleet and shift (or flip) traffic to it after verification, ideally with some production traffic (blue/green). Some teams first deploy to a single host (a canary), where the new release can bake for a while before being promoted to the entire fleet. Techniques like these allow maintainers of complex systems to test safely in production while minimizing the impact on customers.

The Serverless Paradigm

Mapping these concepts to a serverless environment presents a unique challenge. In a serverless world, you cannot incrementally deploy software across a fleet of servers because there are no servers in the traditional sense! The term “deployment” in the context of Functions as a Service (FaaS) like AWS Lambda adopts a different meaning. Essentially, a deployment can be conceptualized as a call to CreateFunction, UpdateFunctionCode, or UpdateAlias, all of which impact the code version invoked by clients.

AWS Lambda abstracts the complexity of servers and Availability Zones, offering developers a streamlined process for software deployment. While servers technically exist, they are entirely hidden from the developer’s view.

Traffic Shifting with Lambda Aliases

Prior to the advent of traffic shifting for Lambda aliases, Lambda function deployments could only be executed in a single “flip” by updating the function code for version $LATEST or by modifying an alias to direct traffic towards a different function version. Once the update propagates—typically within a few seconds—100% of function invocations would execute the new version. Employing canary deployments under this model required an additional routing layer, which added development time, complexity, and invocation latency. Though rolling back a faulty Lambda function deployment is simple and immediate, deploying new versions for critical functions can still be a daunting experience.

With the introduction of alias traffic shifting, implementing canary deployments of Lambda functions has become straightforward. By adjusting the version weights on an alias, invocation traffic is proportionally directed to the new function versions based on the specified weight. Detailed CloudWatch metrics for the alias and version can be examined during deployment, or other health checks can be performed to ensure the new version is operational before proceeding. Note that “canary deployments” often refers to software releases targeting a subset of users. In the context of alias traffic shifting, the new version is distributed to a percentage of all users—sharding based on identity would require an additional routing layer.
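
To make the proportional routing concrete, here is a small simulation of the idea. It is only a model: how Lambda actually selects a version for each invocation is internal to the service, and the names pick_version and simulate are hypothetical helpers introduced for illustration.

```python
import random


def pick_version(primary, additional_weights, rng):
    """Pick a version for one invocation, modeling weighted alias routing."""
    roll = rng.random()  # uniform in [0.0, 1.0)
    cumulative = 0.0
    for version, weight in additional_weights.items():
        cumulative += weight
        if roll < cumulative:
            return version
    # Any weight not assigned to an additional version goes to the primary.
    return primary


def simulate(invocations, primary, additional_weights, seed=0):
    """Count how many simulated invocations land on each version."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(invocations):
        version = pick_version(primary, additional_weights, rng)
        counts[version] = counts.get(version, 0) + 1
    return counts


if __name__ == "__main__":
    # Version "2" weighted at 5%; version "1" is the primary.
    print(simulate(100_000, "1", {"2": 0.05}))
```

With a weight of 0.05, roughly one invocation in twenty should land on the new version, regardless of which user made the call. That is the distinction drawn above between percentage-based shifting and identity-based sharding.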

Implementation Examples

A basic canary deployment can be executed as follows:

# Update the $LATEST version of the function
aws lambda update-function-code --function-name myfunction …

# Publish the new version of the function
aws lambda publish-version --function-name myfunction

# Adjust the alias to point to the new version, weighted at 5% (original version receives 95% of traffic)
aws lambda update-alias --function-name myfunction --name myalias --routing-config '{"AdditionalVersionWeights" : {"2" : 0.05}}'

# Verify that the new version is functioning properly
…
# Set the primary version on the alias to the new version and reset additional versions (100% weighted)
aws lambda update-alias --function-name myfunction --name myalias --function-version 2 --routing-config '{}'
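
The same two alias updates can be sketched in Python. This assumes a boto3 Lambda client (created with boto3.client("lambda")) is passed in. The keyword arguments mirror the UpdateAlias parameters used by the CLI calls above, while start_canary and promote are hypothetical helper names, not part of any library.

```python
def start_canary(client, function_name, alias_name, new_version, weight):
    """Send `weight` (0.0 to 1.0) of the alias traffic to new_version.

    The alias's primary version keeps the remaining share of traffic.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0.0 and 1.0")
    return client.update_alias(
        FunctionName=function_name,
        Name=alias_name,
        RoutingConfig={"AdditionalVersionWeights": {new_version: weight}},
    )


def promote(client, function_name, alias_name, new_version):
    """Make new_version the primary version and clear the additional weights."""
    return client.update_alias(
        FunctionName=function_name,
        Name=alias_name,
        FunctionVersion=new_version,
        RoutingConfig={},  # mirrors --routing-config '{}' in the CLI call
    )
```

Passing the client in as an argument keeps the helpers easy to exercise with a stub before pointing them at a real account.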

This process is ripe for automation! Here are a couple of options:

Simple Deployment Automation

A straightforward Python script can be executed as a Lambda function, which deploys another function (how meta!) by incrementally increasing the weight of the new function version over a defined number of steps, while checking the health of the new version. If any health check fails, the alias will revert to the initial version. This health check can be implemented as a simple check against the presence of Errors metrics in CloudWatch for the alias and new version.
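
The core of such a script might look like the sketch below. It is not the code from the aws-lambda-deploy repository. The function names, the even weight schedule, and the callback-based design are assumptions made for illustration. The health check assumes CloudWatch datapoints for the Errors metric requested with the Sum statistic.

```python
import time


def linear_weights(steps):
    """Evenly spaced weights, e.g. steps=4 gives 0.25, 0.5, 0.75, 1.0."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [round(step / steps, 4) for step in range(1, steps + 1)]


def no_errors(datapoints):
    """Healthy if the Errors datapoints (Sum statistic) add up to zero."""
    return sum(point.get("Sum", 0) for point in datapoints) == 0


def rollout(steps, interval, set_weight, is_healthy, rollback, sleep=time.sleep):
    """Raise the new version's weight step by step; roll back on a failed check.

    set_weight, is_healthy, and rollback are caller-supplied callbacks
    (hypothetical here) that would wrap the alias update and metric calls.
    """
    for weight in linear_weights(steps):
        set_weight(weight)
        sleep(interval)  # let the new weight take traffic before checking
        if not is_healthy():
            rollback()
            return False
    return True
```

Because the waits happen inside a single invocation, the total of steps times interval has to fit within the function timeout, which is the limitation the Step Functions option below removes.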

To install from the aws-lambda-deploy repository on GitHub, follow these steps:

git clone https://github.com/awslabs/aws-lambda-deploy
cd aws-lambda-deploy
export BUCKET_NAME=[YOUR_S3_BUCKET_NAME_FOR_BUILD_ARTIFACTS]
./install.sh

To run the deployment:

# Incrementally roll out version 2 over 10 steps, waiting 120 seconds between each step
# (with AWS CLI v2, also pass --cli-binary-format raw-in-base64-out so the JSON payload is sent as-is)
aws lambda invoke --function-name SimpleDeployFunction --log-type Tail --payload \
  '{"function-name": "MyFunction",
  "alias-name": "MyAlias",
  "new-version": "2",
  "steps": 10,
  "interval": 120,
  "type": "linear"
  }' output

Step Functions Workflow

This state machine accomplishes the same task as the simple deployment function, but runs as an asynchronous workflow in AWS Step Functions. A significant advantage of Step Functions is that the maximum deployment duration is extended from the Lambda function timeout (5 minutes at the time of writing) to a full year!

The state machine gradually increases the new version's weight over the specified number of steps, waiting for the interval duration and performing a health check between updates. If a health check fails, the alias is reverted to the original version, and the workflow fails. For example, to initiate the workflow:

export STATE_MACHINE_ARN=$(aws cloudformation describe-stack-resources --stack-name aws-lambda-deploy-stack --logical-resource-id DeployStateMachine --output text | cut -d$'\t' -f3)

aws stepfunctions start-execution --state-machine-arn $STATE_MACHINE_ARN --input '{
  "function-name": "MyFunction",
  "alias-name": "MyAlias",
  "new-version": "2",
  "steps": 10,
  "interval": 120,
  "type": "linear"}'
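
To illustrate how the workflow described above differs from a single long-running function, here is a toy model of it in plain Python. The state names (UpdateWeight, Wait, CheckHealth, Finalize, Rollback) and the context fields are hypothetical. The actual state machine definition in the repository may be organized differently.

```python
def update_weight(ctx):
    """Advance one step and compute the new version's weight."""
    ctx["step"] += 1
    ctx["weight"] = round(ctx["step"] / ctx["steps"], 4)
    return "Wait"


def wait(ctx):
    """Stand-in for a Wait state; the real service pauses for the interval here."""
    return "CheckHealth"


def check_health(ctx):
    """Decide whether to continue, finish, or roll back."""
    if not ctx["healthy"](ctx["weight"]):
        return "Rollback"
    return "Finalize" if ctx["step"] == ctx["steps"] else "UpdateWeight"


def rollback(ctx):
    """Revert to the original version and mark the run as failed."""
    ctx["weight"] = 0.0
    ctx["result"] = "FAILED"
    return None


def finalize(ctx):
    """Mark the run as successful once the last step passes its check."""
    ctx["result"] = "SUCCEEDED"
    return None


STATES = {
    "UpdateWeight": update_weight,
    "Wait": wait,
    "CheckHealth": check_health,
    "Rollback": rollback,
    "Finalize": finalize,
}


def run(steps, healthy):
    """Drive the states until one returns None, recording the path taken."""
    ctx = {"steps": steps, "step": 0, "weight": 0.0,
           "healthy": healthy, "result": None, "history": []}
    state = "UpdateWeight"
    while state is not None:
        ctx["history"].append(state)
        state = STATES[state](ctx)
    return ctx
```

Each transition is short and carries its progress in the context, so no single piece of code has to stay running for the whole rollout. In this model that role is played by the loop in run; in the real workflow the service itself holds the state between steps.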

