Automating High Availability Tests for SAP HANA on AWS


By: Alex Thompson, Maria Rodriguez, and Liam Patel
Published on: 01 JUN 2023
Category: Advanced (300), Amazon EC2, Best Practices, DevOps, Expert (400), Open Source, SAP on AWS, Technical How-to, Thought Leadership

Introduction

Software development and operations have been evolving, with DevOps becoming the prevailing approach. However, the installation and management of SAP systems often remain predominantly manual. To support a transition towards automation, our first blog post explored provisioning infrastructure for SAP applications using Terraform and AWS tools like AWS Launch Wizard. The second post focused on automating SAP software installation through Systems Manager. The third post went further and automated the end-to-end installation of a complete SAP landscape configured for High Availability (HA).

This post shifts the focus to the operational aspects of an SAP landscape. High availability testing is essential for assessing the resilience of applications and ensuring compliance with Recovery Time Objectives. Organizations routinely test their high availability setups to adhere to auditing requirements.

The solution outlined here stems from the collaborative efforts of the AWS SAP Professional Services team, who have worked with various clients to automate the deployment and testing of High Availability clusters for SAP HANA. We are pleased to offer this as an open-source solution (links provided below) for customers to utilize and modify as needed.

In this blog post, we introduce the concept of chaos engineering to the SAP domain, enhancing your confidence and predictability regarding the behavior of your SAP landscape, along with its ability to self-repair after outages or critical failures.

Implementing chaos engineering can bolster the resilience of your SAP environment by pinpointing potential issues before they escalate. However, our experience indicates that manually executing the necessary test scenarios can consume up to two months and typically requires highly skilled professionals. This post will present sample code you can leverage to automate these test scenarios, significantly reducing the time investment for your team while ensuring compliance with auditing processes.

The advantages of this approach compared to traditional manual methods for testing HA configurations include:

  • Speed: Reducing the testing duration from approximately two months to mere hours.
  • Reliability: Covering 12 HA testing scenarios (detailed below) in a consistent manner.
  • Minimized Human Error: By converting the HA test into a repeatable process, the potential for mistakes due to human error is significantly lowered.
  • Audit Asset: The final HTML report generated by this solution includes all standard information required for audit purposes.
  • Improvement Asset: When issues arise, the HTML report highlights specific failures during testing and provides a comprehensive overview of the system’s status before and after the incident, which can then be addressed by an SAP BASIS professional for necessary corrections to the HA configuration.

This solution offers multiple execution methods, and all of them generate an HTML report like the one shown in the section “The Report.”

To begin using this solution, we have made the code available in a public GitHub repository. For instructions on running it, refer to the section “How to Run the Code.” A quick lab environment can be established using this guide. This sample code is intended as a starting point that reduces the effort needed to automate HA testing at your organization. It was built and tested for SAP HANA 1909 running on Red Hat Enterprise Linux.

Prerequisites

Before you start this guide and run the code, ensure you have the following prerequisites:

  1. Install Ansible on your controller instance, which can be:
    • Your personal instance/laptop/workstation
    • Your CI/CD tool for automating this solution
    • Ansible Tower
  2. Establish SSH access and connectivity between the controller instance and the SAP landscape instances on which you will conduct HA tests.
  3. Set up an SAP landscape consisting of:
    • Two HANA instances with HA pre-configured. For more details, refer to AWS documentation on configuring Red Hat Enterprise Linux clusters for SAP on AWS.
    • One ASCS instance
    • One PAS instance
  4. Configure an AWS IAM role with the permissions listed below and set it up in the AWS CLI (Command Line Interface) on the controller instance, so that Ansible can interact with your SAP landscape during testing. For instructions on configuring an additional profile, see the AWS CLI documentation:
    • ec2:StartInstances for all instances
    • ec2:RebootInstances for all instances
    • ec2:StopInstances for all instances
  5. Create an AMI and capture snapshots of each EBS volume from the involved instances for backup.
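
As an illustration of prerequisite 4, the following minimal sketch builds an IAM policy document containing the three EC2 actions listed above. The function name, the account ID, and the instance IDs are placeholders we introduce for the example. Scoping the policy to specific instance ARNs is our assumption; adapt the `Resource` element to your own landscape.

```python
import json

# The three EC2 actions listed in prerequisite 4.
HA_TEST_ACTIONS = [
    "ec2:StartInstances",
    "ec2:RebootInstances",
    "ec2:StopInstances",
]


def build_ha_test_policy(instance_arns):
    """Return an IAM policy document (as a dict) allowing the HA-test actions.

    `instance_arns` is a list of EC2 instance ARNs. This helper is
    hypothetical and is not part of the published repository.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(HA_TEST_ACTIONS),
                "Resource": list(instance_arns),
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder ARNs: the account ID and instance IDs are made up.
    arns = [
        "arn:aws:ec2:us-east-1:111122223333:instance/i-0123456789abcdef0",
        "arn:aws:ec2:us-east-1:111122223333:instance/i-0fedcba9876543210",
    ]
    print(json.dumps(build_ha_test_policy(arns), indent=2))
```

The printed JSON can be attached to the role used by the AWS CLI profile on the controller instance.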

Covered High Availability Test Scenarios

Each test scenario is designed to work as a standalone test, so you can run only the tests you need. To make this possible, specific tasks run before and after each test scenario to verify that the environment is correctly configured and that the test completed successfully. These common checks answer the following questions:

  • Were all required input parameters provided?
  • Is there connectivity between the nodes and the controller?
  • Which node is designated as Primary and which as Secondary?
  • Is a minimum high availability configuration established?
  • Is the replication mode on HANA consistent before and after failover?
  • What is the ASCS enqueue number before testing, and does it match after failover?
  • Is PAS connected to the correct Database instance before and after failover?
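
The before-and-after checks above can be thought of as comparing two snapshots of the landscape. The sketch below is one possible way to express that comparison. It is not the repository's implementation: the `LandscapeState` fields, the `compare_states` function, and the assumption that the PAS connection can be resolved to a node name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LandscapeState:
    """Snapshot of the facts checked before and after a test (hypothetical fields)."""
    primary: str           # node currently acting as HANA primary
    secondary: str         # node currently acting as HANA secondary
    replication_mode: str  # HANA replication mode, e.g. "sync"
    enqueue_number: str    # ASCS enqueue number
    pas_db_host: str       # database node the PAS is connected to (assumed resolvable)


def compare_states(before, after, expect_failover=True):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if expect_failover:
        # After a failover, the former secondary should be the new primary.
        if after.primary != before.secondary:
            problems.append(
                f"expected {before.secondary} to become primary, found {after.primary}"
            )
    elif after.primary != before.primary:
        problems.append(
            f"primary changed unexpectedly from {before.primary} to {after.primary}"
        )
    if after.replication_mode != before.replication_mode:
        problems.append(
            f"replication mode changed: {before.replication_mode} -> {after.replication_mode}"
        )
    if after.enqueue_number != before.enqueue_number:
        problems.append(
            f"enqueue number changed: {before.enqueue_number} -> {after.enqueue_number}"
        )
    if after.pas_db_host != after.primary:
        problems.append(
            f"PAS is connected to {after.pas_db_host}, but primary is {after.primary}"
        )
    return problems
```

Returning a list of problems rather than a single boolean keeps every failed check visible, which fits the goal of a report that explains what went wrong.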

The test scenarios available in the GitHub repository include:

  • “HDB Stop” on Primary database.
  • “HDB Stop” on new Primary database (post-failover).
  • “HDB Stop” on Secondary database.
  • “PCS node standby” on Primary database.
  • “PCS node standby” on new Primary database (post-failover).
  • “kill -9 pid” on Primary database.
  • Crash instance with “echo ‘b’ > /proc/sysrq-trigger” on Primary database.
  • Crash instance with “echo ‘b’ > /proc/sysrq-trigger” on new Primary database (post-failover).
  • “HDB kill -9” on Primary database.
  • “HDB kill -9” on new Primary database (post-failover).
  • Reboot Primary database.
  • Reboot new Primary database (post-failover).
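
Because each scenario is standalone, the list above lends itself to a simple registry from which a subset can be selected. The following sketch shows that idea only. The `Scenario` class, the key names, and the `select_scenarios` helper are hypothetical and do not reflect how the repository organizes its playbooks; the action strings are copied from the list above.

```python
from dataclasses import dataclass

SYSRQ_CRASH = "echo 'b' > /proc/sysrq-trigger"


@dataclass(frozen=True)
class Scenario:
    key: str     # hypothetical short identifier used for selection
    action: str  # action as named in the scenario list
    target: str  # "primary", "new primary" (post-failover), or "secondary"


SCENARIOS = [
    Scenario("hdb_stop_primary", "HDB Stop", "primary"),
    Scenario("hdb_stop_new_primary", "HDB Stop", "new primary"),
    Scenario("hdb_stop_secondary", "HDB Stop", "secondary"),
    Scenario("pcs_standby_primary", "PCS node standby", "primary"),
    Scenario("pcs_standby_new_primary", "PCS node standby", "new primary"),
    Scenario("kill_pid_primary", "kill -9 pid", "primary"),
    Scenario("crash_primary", SYSRQ_CRASH, "primary"),
    Scenario("crash_new_primary", SYSRQ_CRASH, "new primary"),
    Scenario("hdb_kill_primary", "HDB kill -9", "primary"),
    Scenario("hdb_kill_new_primary", "HDB kill -9", "new primary"),
    Scenario("reboot_primary", "Reboot", "primary"),
    Scenario("reboot_new_primary", "Reboot", "new primary"),
]


def select_scenarios(keys):
    """Return the requested scenarios in registry order; reject unknown keys."""
    known = {s.key for s in SCENARIOS}
    unknown = sorted(set(keys) - known)
    if unknown:
        raise KeyError(f"unknown scenario keys: {unknown}")
    wanted = set(keys)
    return [s for s in SCENARIOS if s.key in wanted]


if __name__ == "__main__":
    for scenario in select_scenarios(["reboot_primary", "hdb_stop_primary"]):
        print(f"{scenario.action} on {scenario.target} database")
```

Keeping the registry order fixed means a "new primary" scenario always runs after the corresponding primary scenario when both are selected.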

The Report

The generated HTML report (see image 1) shows two executed scenarios, plus a preliminary check of the system configuration before the tests and the post-actions required to handle a system crash. The HDB Stop scenario (cases #3, #4, and #5) succeeded, while the System Crash scenario (case #6) failed, as revealed during the post-actions step (#7).
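
To give a feel for how per-step results can end up in an HTML report, here is a minimal sketch that renders a numbered pass/fail table. The layout, the `render_report` function, and the step names in the usage example are our own illustration; the report produced by the solution is richer and is not generated by this code.

```python
from html import escape


def render_report(results):
    """Render (step_name, passed) pairs as a simple numbered HTML table."""
    rows = "\n".join(
        f"<tr><td>{number}</td><td>{escape(name)}</td>"
        f"<td>{'PASSED' if passed else 'FAILED'}</td></tr>"
        for number, (name, passed) in enumerate(results, start=1)
    )
    return (
        "<html><body><table>\n"
        "<tr><th>#</th><th>Step</th><th>Result</th></tr>\n"
        f"{rows}\n"
        "</table></body></html>"
    )


if __name__ == "__main__":
    # Illustrative step names only; they are not taken from a real run.
    example = [
        ("Preliminary configuration check", True),
        ("HDB Stop on Primary database", True),
        ("System crash on Primary database", False),
    ]
    print(render_report(example))
```

Step names are passed through `html.escape` so that characters such as `<` or `&` in command output cannot break the report markup.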

How to Run the Code

Single-Button Alternative: If you are using the solution discussed in the third blog post, you can reload the project (1. vagrant destroy, 2. git pull, 3. vagrant up) and re-enter all parameters in Credentials. See the linked blog post for more context.
