Update: Issue Impacting HashiCorp Terraform Resource Deletions Following VPC Enhancements to AWS Lambda

On September 3, 2019, we announced improvements to the performance, scale, and efficiency of AWS Lambda functions that connect to Amazon VPC networks. The original blog post describes these improvements in detail. They change how the elastic network interfaces (ENIs) that connect your functions to your VPCs are configured. However, we have identified an issue where VPC resources, including subnets, security groups, and VPCs, can fail to be destroyed when you use HashiCorp Terraform. More details about this issue can be found here. This post explains how to tell whether you are affected and what you can do to resolve the issue.

How can I determine if I’m impacted by this issue?

This issue affects you only if you use HashiCorp Terraform to destroy environments. Terraform AWS Provider versions v2.30.0 and earlier are affected. With these versions, you might see errors when destroying environments that contain AWS Lambda functions, VPC subnets, security groups, and Amazon VPCs. Typical errors include:

Error deleting subnet: timeout while waiting for state to become ‘destroyed’ (last state: ‘pending’, timeout: 20m0s)
Error deleting security group: DependencyViolation: resource sg- has a dependent object status code: 400, request id:

Depending on the AWS Regions where the VPC enhancements have been implemented, you may experience these errors in some Regions but not in others.

What steps should I take to resolve this issue if I’m affected?

There are two ways to address this issue. The recommended approach is to upgrade your Terraform AWS Provider to v2.31.0 or later, which includes a fix for this issue along with changes that make environment destruction more reliable. For information on how to upgrade the Provider, visit the Terraform AWS Provider Version 2 Upgrade Guide. You can find details and source code for the latest AWS Provider releases on this page.
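As a rough sketch of the upgrade, a version constraint can require a provider release that contains the fix. Only the v2.31.0 threshold comes from this post. The syntax shown assumes a Terraform 0.12-era configuration, so adjust it to the Terraform version you actually run.

```hcl
# Sketch only: require a Terraform AWS Provider release that contains the fix.
# The string form of required_providers used here is the Terraform 0.12-era
# syntax; newer Terraform versions prefer an object with "source" and "version".
terraform {
  required_providers {
    aws = ">= 2.31.0"
  }
}
```

After changing the constraint, running terraform init -upgrade lets Terraform select the newest provider release that satisfies it.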

If upgrading the Provider isn’t an option for you, you can mitigate the issue by modifying your Terraform configuration. You will need to implement the following changes:

Add an explicit dependency, using the depends_on argument, to the aws_security_group and aws_subnet resources used by your Lambda functions. The dependency should point at the IAM policy resource associated with the IAM role configured on the Lambda function. In the example below, that is the aws_iam_role_policy_attachment resource that attaches the policy to the role.
Override the delete timeout for all aws_security_group and aws_subnet resources, setting it to 40 minutes.

The following configuration file shows an example with these changes applied:

provider "aws" {
  region = "eu-central-1"
}

resource "aws_iam_role" "lambda_exec_role" {
  name = "lambda_exec_role"
  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "lambda.amazonaws.com"
      },
      "Effect": "Allow",
      "Sid": ""
    }
  ]
}
EOF
}

data "aws_iam_policy" "LambdaVPCAccess" {
  arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole"
}

resource "aws_iam_role_policy_attachment" "sto-lambda-vpc-role-policy-attach" {
  role       = "${aws_iam_role.lambda_exec_role.name}"
  policy_arn = "${data.aws_iam_policy.LambdaVPCAccess.arn}"
}

resource "aws_security_group" "allow_tls" {
  name        = "allow_tls"
  description = "Allow TLS inbound traffic"
  vpc_id      = "vpc-"

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"] 
  }

  egress {
    from_port   = 0
    to_port     = 0
    # "-1" means all protocols; with "tcp" and ports 0-0, only TCP port 0 would be allowed.
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  timeouts {
    delete = "40m"
  }
  depends_on = ["aws_iam_role_policy_attachment.sto-lambda-vpc-role-policy-attach"]  
}

resource "aws_subnet" "main" {
  vpc_id     = "vpc-"
  cidr_block = "172.31.68.0/24"

  timeouts {
    delete = "40m"
  }
  depends_on = ["aws_iam_role_policy_attachment.sto-lambda-vpc-role-policy-attach"]
}

resource "aws_lambda_function" "demo_lambda" {
  function_name    = "demo_lambda"
  handler          = "index.handler"
  runtime          = "nodejs10.x"
  filename         = "function.zip"
  source_code_hash = "${filebase64sha256("function.zip")}"
  role             = "${aws_iam_role.lambda_exec_role.arn}"

  vpc_config {
    subnet_ids         = ["${aws_subnet.main.id}"]
    security_group_ids = ["${aws_security_group.allow_tls.id}"]
  }
}

The key parts are the timeouts block and the depends_on argument, which appear in both the “allow_tls” security group and the “main” subnet resources:

timeouts {
  delete = "40m"
}
depends_on = ["aws_iam_role_policy_attachment.sto-lambda-vpc-role-policy-attach"]

Make sure to implement these changes in your Terraform configuration files before attempting to destroy your environments for the first time.

Can I delete resources left behind after a failed destruction attempt?

If you try to destroy an environment without upgrading the provider or making the configuration changes described above, the destroy may fail and leave ENIs behind in your account. You can delete these ENIs manually a few minutes after the Lambda functions that used them have been deleted (typically within 40 minutes). Once the ENIs are deleted, you can rerun terraform destroy.
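To help locate leftover ENIs before deleting them manually, one option is a small, separate Terraform configuration that lists the network interfaces still present in an affected subnet. This is a minimal sketch and not part of the fix itself. The variable name subnet_id and the output name leftover_eni_ids are hypothetical, and it assumes that your provider version includes the aws_network_interfaces data source and that the AWS Region is supplied through your environment or a provider block.

```hcl
# Hypothetical input: the ID of a subnet that failed to delete.
variable "subnet_id" {
  type = string
}

# Look up every network interface still present in that subnet.
data "aws_network_interfaces" "leftover" {
  filter {
    name   = "subnet-id"
    values = [var.subnet_id]
  }
}

# Print the ENI IDs so they can be reviewed and removed manually.
output "leftover_eni_ids" {
  value = data.aws_network_interfaces.leftover.ids
}
```

Depending on the provider version, an empty result may be reported as an error rather than as an empty list, so treat a failed lookup as a hint that no ENIs remain rather than as proof.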
