Chapter 6. Testing the setup
Test your new HA cluster thoroughly before you enable it for production workloads.
Extend the basic example test cases with your own specific requirements.
The following test cases show the testing on a 2-node cluster with the ASCS and ERS resource groups of an S/4HANA setup.
6.1. Moving the ASCS instance using cluster commands
Test how the cluster moves an application server instance and its related resources from one node to another on demand. You can use this procedure to distribute instances to specific nodes.
Prerequisites
- You have ensured that all cluster nodes are up and the resource groups for the ASCS and ERS are running on different nodes.
- You have no failures in the cluster status.
Procedure
Move the ASCS resource to any other HA cluster node. You can use either the SAPInstance resource or the resource group in the command:

[root]# pcs resource move rsc_SAPInstance_<SID>_ASCS<instance> [<node>]
Location constraint to move resource 'rsc_SAPInstance_S4H_ASCS20' has been created
Waiting for the cluster to apply configuration changes...
Location constraint created to move resource 'rsc_SAPInstance_S4H_ASCS20' has been removed
Waiting for the cluster to apply configuration changes...
resource 'rsc_SAPInstance_S4H_ASCS20' is running on node 'node2'

- Replace <SID> with the ASCS SID, for example, S4H.
- Replace <instance> with the ASCS instance number, for example, 20.
- Optional: Define a target node to which the instance is moved. When you do not define a node, the cluster chooses a healthy target node that meets the configuration.
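When you repeat this test for different SIDs or instance numbers, it can help to build the resource name from the two values. The following sketch is an illustration only; the helper name and the final echo are assumptions, and on a real cluster node you would execute the printed pcs command as root instead of echoing it:

```shell
# Sketch: compose the SAPInstance resource name from the SID and
# instance number, then print the pcs move command that would be run.
build_move_cmd() {
    sid="$1"      # SAP system ID, for example S4H
    instno="$2"   # ASCS instance number, for example 20
    node="$3"     # target node; may be empty to let the cluster choose
    echo "pcs resource move rsc_SAPInstance_${sid}_ASCS${instno} ${node}"
}

build_move_cmd S4H 20 node2
```

Note that when you omit the target node, the printed command ends with a trailing space, which the shell ignores when the command is executed.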
Verification
Check that the resource group starts fully on the other node, for example, after moving the ASCS group from node1 to node2:
[root]# pcs resource
  * Resource Group: grp_S4H_ASCS20:
    * rsc_vip_S4H_ASCS20    (ocf:heartbeat:IPAddr2):    Started node2
    * rsc_SAPStartSrv_S4H_ASCS20    (ocf:heartbeat:SAPStartSrv):    Started node2
    * rsc_SAPInstance_S4H_ASCS20    (ocf:heartbeat:SAPInstance):    Started node2
  * Resource Group: grp_S4H_ERS29:
    * rsc_vip_S4H_ERS29    (ocf:heartbeat:IPAddr2):    Started node1
    * rsc_SAPStartSrv_S4H_ERS29    (ocf:heartbeat:SAPStartSrv):    Started node1
    * rsc_SAPInstance_S4H_ERS29    (ocf:heartbeat:SAPInstance):    Started node1

Optional: When the cluster consists of 2 nodes, verify that after the ASCS resource group fully starts, the ERS resource group automatically stops on this node and moves to the node where the ASCS resource group was running before. The colocation constraint triggers this behavior. Check for the related chain of actions by pacemaker-controld in the system log file:

[root]# less /var/log/messages
…
… notice: Result of start operation for rsc_SAPInstance_S4H_ASCS20 on node2: ok
…
… notice: Requesting local execution of stop operation for rsc_SAPInstance_S4H_ERS29 on node2
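A check that every member of a resource group runs on the expected node can be scripted by parsing the `pcs resource` output. The following sketch assumes the output format shown above and runs against a saved sample file instead of a live cluster; on a cluster node you would pipe `pcs resource` directly:

```shell
# Sketch: verify that all resources of grp_S4H_ASCS20 report
# "Started node2" in a saved copy of the `pcs resource` output.
cat > /tmp/pcs_resource_sample.txt <<'EOF'
  * Resource Group: grp_S4H_ASCS20:
    * rsc_vip_S4H_ASCS20 (ocf:heartbeat:IPAddr2): Started node2
    * rsc_SAPStartSrv_S4H_ASCS20 (ocf:heartbeat:SAPStartSrv): Started node2
    * rsc_SAPInstance_S4H_ASCS20 (ocf:heartbeat:SAPInstance): Started node2
EOF

# Select the group's resource lines, then count the lines that are
# NOT started on node2. Zero means the group is fully on node2.
not_on_node2=$(grep 'S4H_ASCS20 ' /tmp/pcs_resource_sample.txt \
    | grep -cv 'Started node2')
if [ "$not_on_node2" -eq 0 ]; then
    echo "grp_S4H_ASCS20 fully started on node2"
fi
```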
6.2. Manually moving the ASCS instance using sapcontrol and the SAP HA interface
This test verifies that the sapcontrol command can move an instance to the other HA cluster node when the SAP HA interface is enabled for the instance.
Prerequisites
- You have enabled the SAP HA interface for the ASCS instance.
- You have ensured that all cluster nodes are up and the resource groups for the ASCS and ERS are running on different nodes.
- You have no failures in the cluster status.
Procedure
Run the HAFailoverToNode function of sapcontrol as user <sid>adm to move the ASCS instance to the other node:

<sid>adm $ sapcontrol -nr <instance> -function HAFailoverToNode ""

Check that the instance stops on the current node and starts on the other node. The ERS instance automatically stops and starts as well after the ASCS instance is fully up. For example, this is the cluster resource status after you have moved the ASCS instance from node1 to node2:

[root]# pcs resource
  * Resource Group: grp_S4H_ASCS20:
    * rsc_vip_S4H_ASCS20    (ocf:heartbeat:IPAddr2):    Started node2
    * rsc_SAPStartSrv_S4H_ASCS20    (ocf:heartbeat:SAPStartSrv):    Started node2
    * rsc_SAPInstance_S4H_ASCS20    (ocf:heartbeat:SAPInstance):    Started node2
  * Resource Group: grp_S4H_ERS29:
    * rsc_vip_S4H_ERS29    (ocf:heartbeat:IPAddr2):    Started node1
    * rsc_SAPStartSrv_S4H_ERS29    (ocf:heartbeat:SAPStartSrv):    Started node1
    * rsc_SAPInstance_S4H_ERS29    (ocf:heartbeat:SAPInstance):    Started node1

Check the new location constraint that the manual move created through the HA integration:

[root]# pcs constraint
Location Constraints:
  Started resource 'grp_S4H_ASCS20'
    Rules:
      Rule: boolean-op=and score=-INFINITY
        Expression: #uname eq string node1
        Expression: date lt YYYY-MM-DDT13:40:45Z

The constraint bans the ASCS resource group from the original node for 5 minutes, which enforces the move to the other node. The date string in the rule defines the time at which the cluster deletes the constraint automatically.

Optional: Remove the temporary constraint to enable the ASCS resource group on the previous node immediately and end this test:

[root]# pcs resource clear grp_S4H_ASCS20
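The expiry timestamp in the rule uses the UTC format shown in the constraint output. As an illustration of how the 5-minute lifetime maps to such a timestamp, the following sketch computes "now plus 5 minutes" in the same format. It assumes GNU date is available (the -d option is a GNU extension):

```shell
# Sketch: compute the UTC expiry time 5 minutes from now, in the same
# YYYY-MM-DDTHH:MM:SSZ format that appears in the location constraint.
expiry=$(date -u -d '+5 minutes' '+%Y-%m-%dT%H:%M:%SZ')
echo "$expiry"
```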
6.3. Testing failures of the ASCS instance
Test that the pacemaker cluster executes the recovery action when the enqueue server of the ASCS instance or the whole ASCS instance fails.
Prerequisites
- You have ensured that all cluster nodes are up and the resource groups for the ASCS and ERS are running on different nodes.
- You have no failures in the cluster status.
Procedure
1. Identify the process ID (PID) of the enqueue server on the node where the ASCS instance is running. Run sapcontrol with the GetProcessList function as user <sid>adm and find the PID in the last column:

   <sid>adm $ sapcontrol -nr <instance> -function GetProcessList
   name, description, dispstatus, textstatus, starttime, elapsedtime, pid
   msg_server, MessageServer, GREEN, Running, YYYY MM DD 14:10:29, 0:01:00, 142607
   enq_server, Enqueue Server 2, GREEN, Running, YYYY MM DD 14:10:29, 0:01:00, 142608

   In the example, the enqueue server PID is 142608.

2. Send a SIGKILL signal to the identified process to kill it instantly:

   <sid>adm $ kill -9 <pid>

   - Replace <pid> with the PID of the enqueue server, for example, 142608.

3. Check that the ASCS instance recovers. With the default migration-threshold of all resources set to 3, the cluster restarts the resource on the same node twice before it recovers the resource on a different node.

4. Repeat steps 1 and 2 until you have killed the process 3 times. The cluster recovers the resource on another node after 3 consecutive failures of the same resource.
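Finding the enqueue server PID can also be scripted. The following sketch parses sample GetProcessList output saved to a file; on a real system you would pipe the sapcontrol output instead, and the sample values mirror the example above:

```shell
# Sketch: extract the enq_server PID from saved GetProcessList output.
cat > /tmp/processlist_sample.txt <<'EOF'
name, description, dispstatus, textstatus, starttime, elapsedtime, pid
msg_server, MessageServer, GREEN, Running, YYYY MM DD 14:10:29, 0:01:00, 142607
enq_server, Enqueue Server 2, GREEN, Running, YYYY MM DD 14:10:29, 0:01:00, 142608
EOF

# The PID is the last comma-separated field of the enq_server line.
pid=$(awk -F', ' '/^enq_server/ {print $NF}' /tmp/processlist_sample.txt)
echo "enqueue server PID: $pid"
# On the cluster node, the failure would then be triggered with:
#   kill -9 "$pid"
```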
Verification
Check that the ASCS instance is running on the other node:
[root]# pcs resource
  * Resource Group: grp_S4H_ASCS20:
    * rsc_vip_S4H_ASCS20    (ocf:heartbeat:IPAddr2):    Started node2
    * rsc_SAPStartSrv_S4H_ASCS20    (ocf:heartbeat:SAPStartSrv):    Started node2
    * rsc_SAPInstance_S4H_ASCS20    (ocf:heartbeat:SAPInstance):    Started node2
  * Resource Group: grp_S4H_ERS29:
    * rsc_vip_S4H_ERS29    (ocf:heartbeat:IPAddr2):    Started node1
    * rsc_SAPStartSrv_S4H_ERS29    (ocf:heartbeat:SAPStartSrv):    Started node1
    * rsc_SAPInstance_S4H_ERS29    (ocf:heartbeat:SAPInstance):    Started node1

In our example, the ASCS instance failed on node1 and has been moved to node2. As configured, the ERS instance moved to node1 after the ASCS instance was up on node2.

Check the fail count and notice that the resource failed 3 times, which triggered the recovery on the other node:

[root]# pcs resource failcount
Failcounts for resource 'rsc_SAPInstance_S4H_ASCS20'
  node1: 3
Next steps
- Clear any failure notifications from the cluster that may be there from previous testing. For more information, see Cleaning up the failure history.
6.4. Testing failures of the ERS instance
Test that the pacemaker cluster executes the recovery action when the enqueue replicator of the ERS instance or the whole ERS instance fails.
Prerequisites
- You have ensured that all cluster nodes are up and the resource groups for the ASCS and ERS are running on different nodes.
- You have no failures in the cluster status.
Procedure
1. Identify the process ID (PID) of the enqueue replicator on the node where the ERS instance is running. Run sapcontrol with the GetProcessList function as user <sid>adm and find the PID in the last column:

   <sid>adm $ sapcontrol -nr <instance> -function GetProcessList
   name, description, dispstatus, textstatus, starttime, elapsedtime, pid
   enq_replicator, Enqueue Replicator 2, GREEN, Running, YYYY MM DD 15:42:03, 0:00:04, 19124

   In the example, the enqueue replicator PID is 19124.

2. Send a SIGKILL signal to the identified process to kill it instantly:

   <sid>adm $ kill -9 <pid>

   - Replace <pid> with the PID of the enqueue replicator, for example, 19124.

3. Check that the ERS instance recovers. With the default migration-threshold of all resources set to 3, the cluster restarts the resource on the same node twice before it recovers the resource on a different node.

4. Repeat steps 1 and 2 until you have killed the process 3 times. The cluster recovers the resource on another node after 3 consecutive failures of the same resource.
Verification
Check that the ERS instance is running on the other node:
[root]# pcs resource
  * Resource Group: grp_S4H_ASCS20:
    * rsc_vip_S4H_ASCS20    (ocf:heartbeat:IPAddr2):    Started node2
    * rsc_SAPStartSrv_S4H_ASCS20    (ocf:heartbeat:SAPStartSrv):    Started node2
    * rsc_SAPInstance_S4H_ASCS20    (ocf:heartbeat:SAPInstance):    Started node2
  * Resource Group: grp_S4H_ERS29:
    * rsc_vip_S4H_ERS29    (ocf:heartbeat:IPAddr2):    Started node1
    * rsc_SAPStartSrv_S4H_ERS29    (ocf:heartbeat:SAPStartSrv):    Started node1
    * rsc_SAPInstance_S4H_ERS29    (ocf:heartbeat:SAPInstance):    Started node1

In our example, the ERS instance failed on node2 and has been moved to node1. When the ERS instance can no longer run on a different node than the ASCS instance because of the failures, it restarts on the same node as the ASCS instance.

When you clear the failure, the cluster automatically moves the ERS instance back to the other node.

Check the fail count and notice that the resource failed 3 times, which triggered the recovery on the other node:

[root]# pcs resource failcount
Failcounts for resource 'rsc_SAPInstance_S4H_ERS29'
  node2: 3
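The fail count that drives this behavior can also be checked in a script. The following sketch parses saved `pcs resource failcount` output (the format is assumed to match the example above) and compares the count against the default migration-threshold of 3:

```shell
# Sketch: decide whether the migration-threshold has been reached,
# based on saved `pcs resource failcount` output.
cat > /tmp/failcount_sample.txt <<'EOF'
Failcounts for resource 'rsc_SAPInstance_S4H_ERS29'
  node2: 3
EOF

threshold=3
# The count is the value after "node2: " on the per-node line.
count=$(awk -F': ' '/node2/ {print $2}' /tmp/failcount_sample.txt)
if [ "$count" -ge "$threshold" ]; then
    echo "migration-threshold reached: cluster moves the resource"
fi
```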
Next steps
- Clear any failure notifications from the cluster that may be there from previous testing. For more information, see Cleaning up the failure history.
6.5. Crashing the node with the ASCS instance
Simulate the crash of the cluster node on which the ASCS instance runs to test the behavior of your cluster resources.
Prerequisites
- You have ensured that all cluster nodes are up and the resource groups for the ASCS and ERS are running on different nodes.
- You have no failures in the cluster status.
Procedure
Trigger a crash on the node that runs the ASCS instance, for example, node1. The crash immediately causes a kernel panic, effectively simulating a system crash, and the node becomes unresponsive. The cluster's fencing mechanism (STONITH) detects the failure and initiates recovery actions: typically, it fences the node and restarts the failed resources on a surviving cluster node.

The following command immediately crashes the node on which you run it, with no further warning:
[root]# echo c > /proc/sysrq-trigger
Verification
Check that the cluster on the other node fences the crashed node:
[root]# pcs stonith history
reboot of node1 successful: delegate=node2, client=pacemaker-controld.1468, origin=node2, completed=...
1 event found

Check that the cluster starts the ASCS resources on the remaining node. In a 2-node cluster, this leads to ASCS and ERS running on the same node:

[root]# pcs resource
  * Resource Group: grp_S4H_ASCS20:
    * rsc_vip_S4H_ASCS20    (ocf:heartbeat:IPAddr2):    Started node2
    * rsc_SAPStartSrv_S4H_ASCS20    (ocf:heartbeat:SAPStartSrv):    Started node2
    * rsc_SAPInstance_S4H_ASCS20    (ocf:heartbeat:SAPInstance):    Started node2
  * Resource Group: grp_S4H_ERS29:
    * rsc_vip_S4H_ERS29    (ocf:heartbeat:IPAddr2):    Started node2
    * rsc_SAPStartSrv_S4H_ERS29    (ocf:heartbeat:SAPStartSrv):    Started node2
    * rsc_SAPInstance_S4H_ERS29    (ocf:heartbeat:SAPInstance):    Started node2

Check that the cluster automatically moves the ERS instance to the previously failed node after the fenced node is running again:

[root]# pcs resource
  * Resource Group: grp_S4H_ASCS20:
    * rsc_vip_S4H_ASCS20    (ocf:heartbeat:IPAddr2):    Started node2
    * rsc_SAPStartSrv_S4H_ASCS20    (ocf:heartbeat:SAPStartSrv):    Started node2
    * rsc_SAPInstance_S4H_ASCS20    (ocf:heartbeat:SAPInstance):    Started node2
  * Resource Group: grp_S4H_ERS29:
    * rsc_vip_S4H_ERS29    (ocf:heartbeat:IPAddr2):    Started node1
    * rsc_SAPStartSrv_S4H_ERS29    (ocf:heartbeat:SAPStartSrv):    Started node1
    * rsc_SAPInstance_S4H_ERS29    (ocf:heartbeat:SAPInstance):    Started node1

The higher ASCS resource stickiness and the constraints that you configured earlier ensure that the ASCS instance stays in place, which avoids another unnecessary disruption of the service.
Next steps
- Clear any failure notifications from the cluster that may be there from previous testing. For more information, see Cleaning up the failure history.
6.6. Crashing the node with the ERS instance
Simulate the crash of the cluster node on which the ERS instance runs to test the behavior of your cluster resources.
Prerequisites
- You have ensured that all cluster nodes are up and the resource groups for the ASCS and ERS are running on different nodes.
- You have no failures in the cluster status.
Procedure
Trigger a crash on the node that runs the ERS instance, for example, node2. The crash immediately causes a kernel panic, effectively simulating a system crash, and the node becomes unresponsive. The cluster's fencing mechanism (STONITH) detects the failure and initiates recovery actions: typically, it fences the node and restarts the failed resources on a surviving cluster node.

The following command immediately crashes the node on which you run it, with no further warning:
[root]# echo c > /proc/sysrq-trigger
Verification
Check that the cluster on the other node fences the crashed node:
[root]# pcs stonith history
reboot of node2 successful: delegate=node1, client=pacemaker-controld.1426, origin=node1, completed=...
1 event found

Check that the cluster starts the ERS resources on the remaining node. In a 2-node cluster, this leads to ASCS and ERS running on the same node:

[root]# pcs resource
  * Resource Group: grp_S4H_ASCS20:
    * rsc_vip_S4H_ASCS20    (ocf:heartbeat:IPAddr2):    Started node1
    * rsc_SAPStartSrv_S4H_ASCS20    (ocf:heartbeat:SAPStartSrv):    Started node1
    * rsc_SAPInstance_S4H_ASCS20    (ocf:heartbeat:SAPInstance):    Started node1
  * Resource Group: grp_S4H_ERS29:
    * rsc_vip_S4H_ERS29    (ocf:heartbeat:IPAddr2):    Started node1
    * rsc_SAPStartSrv_S4H_ERS29    (ocf:heartbeat:SAPStartSrv):    Started node1
    * rsc_SAPInstance_S4H_ERS29    (ocf:heartbeat:SAPInstance):    Started node1

Check that the cluster automatically moves the ERS instance to the previously failed node after the fenced node is running again:

[root]# pcs resource
  * Resource Group: grp_S4H_ASCS20:
    * rsc_vip_S4H_ASCS20    (ocf:heartbeat:IPAddr2):    Started node1
    * rsc_SAPStartSrv_S4H_ASCS20    (ocf:heartbeat:SAPStartSrv):    Started node1
    * rsc_SAPInstance_S4H_ASCS20    (ocf:heartbeat:SAPInstance):    Started node1
  * Resource Group: grp_S4H_ERS29:
    * rsc_vip_S4H_ERS29    (ocf:heartbeat:IPAddr2):    Started node2
    * rsc_SAPStartSrv_S4H_ERS29    (ocf:heartbeat:SAPStartSrv):    Started node2
    * rsc_SAPInstance_S4H_ERS29    (ocf:heartbeat:SAPInstance):    Started node2
Next steps
- Clear any failure notifications from the cluster that may be there from previous testing. For more information, see Cleaning up the failure history.
6.7. Testing failures of the ASCS or ERS instance in a 3-node cluster
Test that the pacemaker cluster recovers the ASCS or the ERS instance on a different node after consecutive failures.
The constraints that you configured in the ENSA2 setup keep the ASCS and ERS instances on separate nodes whenever possible, and the cluster uses any extra node for the recovery.
In a cluster with more than 2 nodes, the recovery of a failed ASCS or ERS instance works the same way. The following test demonstrates the example of a failing ASCS instance.
Prerequisites
- You have configured your ASCS and ERS instance in an ENSA2 setup.
- You have configured 3 or more cluster nodes in this cluster.
- You have configured the additional cluster node(s) to be able to run the instances.
- You have ensured that all cluster nodes are up and the resource groups for the ASCS and ERS are running on different nodes.
- You have no failures in the cluster status.
Procedure
1. Identify the process ID (PID) of the enqueue server on the node where the ASCS instance is running. Run sapcontrol with the GetProcessList function as user <sid>adm and find the PID in the last column:

   <sid>adm $ sapcontrol -nr <instance> -function GetProcessList
   name, description, dispstatus, textstatus, starttime, elapsedtime, pid
   enq_server, Enqueue Server 2, GREEN, Running, YYYY MM DD 13:20:07, 0:00:08, 161323

   In the example, the enqueue server PID is 161323.

2. Send a SIGKILL signal to the identified process to kill it instantly, in this example the enqueue server of the ASCS instance on node1:

   <sid>adm $ kill -9 <pid>

   - Replace <pid> with the PID of the enqueue server, for example, 161323.

3. Check that the ASCS instance recovers on the same node. With the default migration-threshold of all resources set to 3, the cluster restarts the resource on the same node twice before it recovers the resource on a different node.

4. Repeat steps 1 and 2 until you have killed the process 3 times. The cluster recovers the resource on another node after 3 consecutive failures of the same resource.
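The recovery logic described above — restart on the same node until the fail count reaches migration-threshold, then relocate — can be sketched as a simple counter. This is only a model of the cluster's decision for illustration, not cluster code:

```shell
# Sketch: model the migration-threshold decision. Failures 1 and 2
# lead to a local restart; failure 3 reaches the threshold of 3 and
# the resource is recovered on another node.
threshold=3
failcount=0
for kill_attempt in 1 2 3; do
    failcount=$((failcount + 1))
    if [ "$failcount" -lt "$threshold" ]; then
        echo "failure $failcount: restart on the same node"
    else
        echo "failure $failcount: recover on another node"
    fi
done
```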
Verification
Check that the ASCS instance is now running on the additional node:
[root]# pcs resource
  * Resource Group: grp_S4H_ASCS20:
    * rsc_vip_S4H_ASCS20    (ocf:heartbeat:IPAddr2):    Started node3
    * rsc_SAPStartSrv_S4H_ASCS20    (ocf:heartbeat:SAPStartSrv):    Started node3
    * rsc_SAPInstance_S4H_ASCS20    (ocf:heartbeat:SAPInstance):    Started node3
  * Resource Group: grp_S4H_ERS29:
    * rsc_vip_S4H_ERS29    (ocf:heartbeat:IPAddr2):    Started node2
    * rsc_SAPStartSrv_S4H_ERS29    (ocf:heartbeat:SAPStartSrv):    Started node2
    * rsc_SAPInstance_S4H_ERS29    (ocf:heartbeat:SAPInstance):    Started node2

In our example, the ASCS instance failed on node1 and has been moved to node3. The ERS instance stays in place on node2.

Check the fail count and notice that the resource failed 3 times, which triggered the recovery on the other node:

[root]# pcs resource failcount
Failcounts for resource 'rsc_SAPInstance_S4H_ASCS20'
  node1: 3
Next steps
- Clear any failure notifications from the cluster that may be there from previous testing. For more information, see Cleaning up the failure history.