Automating Remediation Actions for EC2 Notifications and More with EC2 Systems Manager Automation and AWS Health


In this article, we will explore how to utilize EC2 Systems Manager Automation to implement remediation actions in response to notifications regarding your AWS resources. Specifically, we will focus on automating remediation processes when an Amazon EBS-backed EC2 instance is slated for retirement.

When AWS detects an irreparable failure in the hardware hosting your instance, the instance is scheduled for retirement. If your instance’s root device is an Amazon EBS volume, you can stop and start the instance yourself before the retirement date; starting it again moves it to healthy hardware.

EC2 Systems Manager (SSM) Automation is a service hosted by AWS that streamlines routine instance and system maintenance and deployment tasks without incurring additional costs. AWS Health offers ongoing insights into the status of your AWS resources, services, and accounts, providing guidance on performance or availability issues that could impact your applications running on AWS. Both services are integrated with Amazon CloudWatch Events, which allows AWS Health events to trigger SSM Automation documents.
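To make the integration concrete, here is a minimal sketch of the kind of event pattern a CloudWatch Events rule could use to pick out retirement notices from AWS Health. The event type code, the trimmed-down sample event, and the helper names are assumptions for illustration; confirm the code against the events you actually receive. The matches function is a simplified local stand-in for the service's matching logic, not the real implementation.

```python
import json

# Assumption: event type code for scheduled retirement of EBS-backed instances.
RETIREMENT_EVENT_CODE = "AWS_EC2_PERSISTENT_INSTANCE_RETIREMENT_SCHEDULED"


def retirement_event_pattern():
    """Build an event pattern that selects EC2 retirement notices from AWS Health."""
    return {
        "source": ["aws.health"],
        "detail-type": ["AWS Health Event"],
        "detail": {
            "service": ["EC2"],
            "eventTypeCategory": ["scheduledChange"],
            "eventTypeCode": [RETIREMENT_EVENT_CODE],
        },
    }


def matches(pattern, event):
    """Simplified matcher: each pattern key must exist in the event, and the
    event's value must be one of the values listed in the pattern."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


# Hypothetical, trimmed-down event used only to exercise the pattern.
sample_event = {
    "source": "aws.health",
    "detail-type": "AWS Health Event",
    "resources": ["i-0123456789abcdef0"],
    "detail": {
        "service": "EC2",
        "eventTypeCategory": "scheduledChange",
        "eventTypeCode": RETIREMENT_EVENT_CODE,
    },
}

print(json.dumps(retirement_event_pattern(), indent=2))
print(matches(retirement_event_pattern(), sample_event))
```

The pattern is deliberately narrow, so unrelated AWS Health events would not start the automation.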

SSM Automation includes an Approval action that temporarily halts automation execution until designated principals (like IAM users) either approve or reject the action. For further details on automated actions in SSM, refer to the Systems Manager Automation Actions documentation.
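Approvers can respond from the console, but the decision can also be sent programmatically. The sketch below assumes the boto3 library is installed and that the caller is one of the designated approvers; the helper names and the execution ID are hypothetical. It builds the arguments for the SSM send_automation_signal call separately so they can be inspected without contacting AWS.

```python
def approval_signal_kwargs(execution_id, approve=True, comment=None):
    """Build the arguments for an approve or reject signal."""
    kwargs = {
        "AutomationExecutionId": execution_id,
        "SignalType": "Approve" if approve else "Reject",
    }
    if comment:
        # The optional comment is passed as a list of strings.
        kwargs["Payload"] = {"Comment": [comment]}
    return kwargs


def send_approval(execution_id, approve=True, comment=None):
    """Send the decision to a waiting automation execution (calls AWS)."""
    import boto3  # third-party; imported here so the helper above runs without it

    ssm = boto3.client("ssm")
    return ssm.send_automation_signal(
        **approval_signal_kwargs(execution_id, approve, comment)
    )


# Hypothetical execution ID, shown only to illustrate the argument shape.
print(approval_signal_kwargs("example-execution-id", comment="Approved for retirement fix"))
```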

Setting Up Automated Stop and Start of EC2 Instances

This guide walks through the four steps needed to stop and start EC2 instances with SSM Automation in response to EC2 retirement notifications from AWS Health. The solution can also be deployed in the us-east-1 region via AWS CloudFormation; adjust the region as needed. Review the manual steps below before launching the CloudFormation stack so that you understand what it sets up.

  1. Create the Required AWS IAM Role
  2. Establish an Amazon SNS Topic (if not already set up)
  3. Configure the Amazon CloudWatch Events Rule with the Automation Document
  4. Conduct a Test and Approve the Automation

Step 1: Create the Required IAM Role

Begin by establishing the necessary IAM permissions for CloudWatch Events by creating an IAM policy and associating it with an IAM role for CloudWatch. For this example, we will name the IAM role AutomationCWRole. Below is an example IAM policy for this purpose:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:StartInstances",
                "ec2:StopInstances",
                "ec2:DescribeInstanceStatus"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "ssm:*"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "sns:Publish"
            ],
            "Resource": [
                "arn:aws:sns:*:*:Automation*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:PassRole"
            ],
            "Resource": "arn:aws:iam::<account-id>:role/AutomationCWRole"
        }
    ]
}

In the iam:PassRole statement, replace the role ARN with one that contains your own account ID and role name. The role must also trust both events.amazonaws.com and ssm.amazonaws.com, as shown here:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "ssm.amazonaws.com",
          "events.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
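If you prefer to script this step, the following is a minimal sketch, assuming the boto3 library is installed and your credentials are allowed to manage IAM. The helper names and the inline policy name are hypothetical; permissions_policy stands for the permissions policy shown earlier, loaded as a Python dict.

```python
import json


def role_arn(account_id, role_name="AutomationCWRole"):
    """Build the role ARN to place in the iam:PassRole statement."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("account_id must be a 12-digit string")
    return f"arn:aws:iam::{account_id}:role/{role_name}"


def trust_policy():
    """Trust policy letting CloudWatch Events and Systems Manager assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": ["ssm.amazonaws.com", "events.amazonaws.com"]
                },
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_role(role_name, permissions_policy):
    """Create the role and attach the permissions as an inline policy (calls AWS)."""
    import boto3  # third-party; imported here so the helpers above run without it

    iam = boto3.client("iam")
    iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(trust_policy()),
    )
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName=f"{role_name}Policy",  # hypothetical policy name
        PolicyDocument=json.dumps(permissions_policy),
    )


# Placeholder account ID for illustration only.
print(role_arn("123456789012"))
```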

For further information on CloudWatch and IAM, see the Authentication and Access Control for Amazon CloudWatch documentation, and for Systems Manager and IAM, refer to Configuring Access Using Systems Manager Managed Policies.

Step 2: Establish an Amazon SNS Topic

If you intend to use Automation Approval actions, you need an SNS topic for the approval notifications; create one or reuse an existing topic, and make sure the approvers are subscribed to it. For setup instructions, see the Amazon SNS documentation on creating a topic. This example uses the topic name AutomationStopStart. The topic name must begin with the prefix "Automation" so that it matches the sns:Publish resource in the IAM policy from Step 1.
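As a rough sketch of this step in code, assuming the boto3 library is installed and that approvers are notified by email (both assumptions, as are the helper names), the topic could be created and subscribed to like this. The name check runs locally, before any AWS call is made.

```python
TOPIC_PREFIX = "Automation"


def check_topic_name(name):
    """Reject topic names that the sns:Publish statement would not cover."""
    if not name.startswith(TOPIC_PREFIX):
        raise ValueError(f"topic name must start with {TOPIC_PREFIX!r}: {name!r}")
    return name


def create_approval_topic(name, approver_emails):
    """Create the topic and subscribe each approver by email (calls AWS)."""
    import boto3  # third-party; imported here so the check above runs without it

    sns = boto3.client("sns")
    topic_arn = sns.create_topic(Name=check_topic_name(name))["TopicArn"]
    for email in approver_emails:
        # Each approver must confirm the subscription from their inbox.
        sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint=email)
    return topic_arn


print(check_topic_name("AutomationStopStart"))
```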

Step 3: Configure the Amazon CloudWatch Events Rule

Next, create an SSM Automation document titled StopStartEC2InstancewithApproval by crafting a JSON file named “StopStartEC2InstancewithApproval.json”:

{
   "description":"Stop and Start EC2 instance(s) with Approval",
   "schemaVersion":"0.3",
   "assumeRole":"{{ AutomationAssumeRole }}",
   "parameters": {
      "AutomationAssumeRole": {
         "type":"String",
         "description":"The ARN of the role that allows Automation to perform the actions on your behalf.",
         "default":"arn:aws:iam::{{global:ACCOUNT_ID}}:role/AutomationServiceRole"
      },
      "InstanceIds": {
         "type":"String",
         "description":"EC2 Instance(s) to Stop and Start"
      },
      "Approvers": {
         "type":"StringList",
         "description":"IAM user names or user ARNs of the approvers for the automation action"
      },
      "SNSTopicArn": {
         "type":"String",
         "description":"The ARN of the SNS topic that you are using to get notifications about EC2 retirement. The SNS topic name must start with Automation."
      }
   },
   "mainSteps": [
      {
         "name":"approve",
         "action":"aws:approve",
         "timeoutSeconds":999999,
         "onFailure":"Abort",
         "inputs": {
            "NotificationArn":"{{ SNSTopicArn }}",
            "Message": "Your approval is required to stop and start an EC2 instance that is scheduled for retirement, using the EC2 Systems Manager Automation document.",
            "MinRequiredApprovals":1,
            "Approvers":[
               "{{Approvers}}"
            ]
         }
      },
      {
         "name":"stopInstance",
         "action":"aws:changeInstanceState",
         "maxAttempts":2,
         "timeoutSeconds":120,
         "onFailure":"Continue",
         "inputs": {
            "InstanceIds":[
               "{{ InstanceIds }}"
            ],
            "DesiredState":"stopped"
         }
      },
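      {
         "name":"forceStopInstance",
         "action":"aws:changeInstanceState",
         "maxAttempts":1,
         "timeoutSeconds":60,
         "onFailure":"Continue",
         "inputs": {
            "InstanceIds":[
               "{{ InstanceIds }}"
            ],
            "CheckStateOnly":false,
            "DesiredState":"stopped",
            "Force":true
         }
      },
      {
         "name":"startInstance",
         "action":"aws:changeInstanceState",
         "maxAttempts":3,
         "timeoutSeconds":120,
         "onFailure":"Continue",
         "inputs": {
            "InstanceIds":[
               "{{ InstanceIds }}"
            ],
            "DesiredState":"running"
         }
      }
   ]
}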
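Once the JSON file exists, it has to be registered as an Automation document before a rule or a person can run it. The following is a minimal sketch, assuming the boto3 library is installed; the helper names and the placeholder instance ID, approver, and topic ARN are hypothetical. Automation parameters are passed as a mapping from parameter name to a list of strings, which is why even single values are wrapped in lists.

```python
def execution_parameters(instance_id, approvers, sns_topic_arn):
    """Build the Parameters mapping for the automation document."""
    return {
        "InstanceIds": [instance_id],
        "Approvers": list(approvers),
        "SNSTopicArn": [sns_topic_arn],
    }


def register_and_run(json_path, parameters,
                     name="StopStartEC2InstancewithApproval"):
    """Register the document, then start one execution (calls AWS)."""
    import boto3  # third-party; imported here so the helper above runs without it

    ssm = boto3.client("ssm")
    with open(json_path) as f:
        content = f.read()
    ssm.create_document(Content=content, Name=name, DocumentType="Automation")
    response = ssm.start_automation_execution(
        DocumentName=name, Parameters=parameters
    )
    return response["AutomationExecutionId"]


# Placeholder values for illustration only.
params = execution_parameters(
    "i-0123456789abcdef0",
    ["arn:aws:iam::123456789012:user/example-approver"],
    "arn:aws:sns:us-east-1:123456789012:AutomationStopStart",
)
print(params)
```

Starting an execution by hand like this is also a convenient way to test the approval flow before wiring the document to a CloudWatch Events rule.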


Conclusion

By implementing these automated remediation actions, you can ensure that your AWS resources are managed efficiently, reducing the potential for downtime due to unaddressed hardware issues.

