Chapter 5. Testing the cluster configuration
Before the HA cluster setup is put into production, it is recommended to perform the following tests to ensure that it works as expected.
These tests should also be repeated later as part of regular HA/DR drills. This ensures that the cluster still works as expected, and that admins stay familiar with the procedures required to bring the setup back to a healthy state if an issue occurs during normal operation, or if manual maintenance of the setup is required.
5.1. Manually moving the ASCS instance using the pcs command
To verify that the pacemaker cluster is able to move the ASCS instance to the other HA cluster node on demand.
Test Preconditions
- Both cluster nodes are up, with the resource groups for the ASCS and ERS running on different HA cluster nodes:

  * Resource Group: S4H_ASCS20_group:
    * S4H_lvm_ascs20 (ocf:heartbeat:LVM-activate): Started node1
    * S4H_fs_ascs20 (ocf:heartbeat:Filesystem): Started node1
    * S4H_vip_ascs20 (ocf:heartbeat:IPaddr2): Started node1
    * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node1
  * Resource Group: S4H_ERS29_group:
    * S4H_lvm_ers29 (ocf:heartbeat:LVM-activate): Started node2
    * S4H_fs_ers29 (ocf:heartbeat:Filesystem): Started node2
    * S4H_vip_ers29 (ocf:heartbeat:IPaddr2): Started node2
    * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node2
- All failures for the resources and resource groups have been cleared and the failcounts have been reset.
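If earlier tests have left failures behind, a clean state can be restored and verified with standard pcs and crm_mon commands before starting; a minimal sketch:

[root@node1]# pcs resource cleanup
[root@node1]# crm_mon -1 --failcounts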
Test Procedure
Run the following command from any node to initiate the move of the ASCS instance to the other HA cluster node:

[root@node1]# pcs resource move S4H_ascs20
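Note that pcs resource move performs the move by adding a temporary location constraint for the resource group. A sketch to inspect it while the test runs; the listing should contain an entry whose id starts with cli-prefer or cli-ban, which is what the pcs resource clear command in the recovery procedure removes:

[root@node1]# pcs constraint location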
Monitoring
Run the following command in a separate terminal during the test:
[root@node2]# watch -n 1 pcs status
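As an alternative to watch with pcs status, crm_mon can be used; it refreshes continuously on its own and can also display inactive resources. A minimal sketch:

[root@node2]# crm_mon -r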
Expected behavior
- The ASCS resource group is moved to the other node.
- The ERS resource group stops after that and moves to the node where the ASCS resource group was running before.
Test Result
The ASCS resource group moves to the other node, in this scenario to node2, and the ERS resource group moves to node1:

  * Resource Group: S4H_ASCS20_group:
    * S4H_lvm_ascs20 (ocf:heartbeat:LVM-activate): Started node2
    * S4H_fs_ascs20 (ocf:heartbeat:Filesystem): Started node2
    * S4H_vip_ascs20 (ocf:heartbeat:IPaddr2): Started node2
    * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node2
  * Resource Group: S4H_ERS29_group:
    * S4H_lvm_ers29 (ocf:heartbeat:LVM-activate): Started node1
    * S4H_fs_ers29 (ocf:heartbeat:Filesystem): Started node1
    * S4H_vip_ers29 (ocf:heartbeat:IPaddr2): Started node1
    * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node1
Recovery Procedure
Remove the location constraints, if any:
[root@node1]# pcs resource clear S4H_ascs20
5.2. Manually moving the ASCS instance using sapcontrol (with SAP HA interface enabled)
To verify that the sapcontrol command is able to move the ASCS instance to the other HA cluster node when the SAP HA interface is enabled for the instance.
Test Preconditions
- The SAP HA interface is enabled for the SAP instance (a verification sketch follows this list).
- Both cluster nodes are up, with the resource groups for the ASCS and ERS running:

[root@node2]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node2
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node1
- All failures for the resources and resource groups have been cleared and the failcounts have been reset.
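Whether the SAP HA interface is actually enabled can be verified with sapcontrol itself; a minimal check, using the instance number 20 from this example setup (an output line HAActive: TRUE indicates that the HA interface is active):

[root@node2]# su - s4hadm
node2:s4hadm 51> sapcontrol -nr 20 -function HAGetFailoverConfig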
Test Procedure
- As the <sid>adm user, run the HAFailoverToNode function of sapcontrol to move the ASCS instance to the other node.
Monitoring
Run the following command in a separate terminal during the test:
[root@node2]# watch -n 1 pcs status
Expected behavior
- The ASCS instance should move to the other HA cluster node, with a temporary location constraint being created for the move to complete.
Test
[root@node2]# su - s4hadm
node2:s4hadm 52> sapcontrol -nr 20 -function HAFailoverToNode ""

06.12.2023 12:57:04
HAFailoverToNode
OK
Test Result
ASCS and ERS both move to the other node:

[root@node2]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node1
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node2
Constraints are created as shown below:
[root@node1]# pcs constraint
Location Constraints:
  Resource: S4H_ASCS20_group
    Constraint: cli-ban-S4H_ASCS20_group-on-node2
      Rule: boolean-op=and score=-INFINITY
        Expression: #uname eq string node1
        Expression: date lt xxxx-xx-xx xx:xx:xx +xx:xx
Recovery Procedure
- The constraint shown above is cleared automatically when the date lt mentioned in the Expression is reached. Alternatively, the constraint can be removed with the following command:
[root@node1]# pcs resource clear S4H_ascs20
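The same can also be achieved by removing the constraint by its id, as reported by pcs constraint; a sketch, assuming the id shown in the output above:

[root@node1]# pcs constraint remove cli-ban-S4H_ASCS20_group-on-node2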
5.3. Testing failure of the ASCS instance
To verify that the pacemaker cluster takes the necessary action when the enqueue server of the ASCS instance, or the whole ASCS instance, fails.
Test Preconditions
- Both cluster nodes are up, with the resource groups for the ASCS and ERS running:

[root@node2]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node1
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node2
- All failures for the resources and resource groups have been cleared and the failcounts have been reset.
Test Procedure
- Identify the PID of the enqueue server on the node where ASCS is running.
- Send a SIGKILL signal to the identified process.
Monitoring
Run the following command in a separate terminal during the test:
[root@node2]# watch -n 1 pcs status
Expected behavior
- The enqueue server process gets killed.
- The pacemaker cluster takes the required action as per the configuration, in this case moving the ASCS resource group to the other node.
Test
Switch to the <sid>adm user on the node where ASCS is running:

[root@node1]# su - s4hadm
Identify the PID of the enqueue server process, en.sap (NetWeaver) or enq.sap (S/4HANA):

node1:s4hadm 51> pgrep -af "(en|enq).sap"
31464 enq.sapS4H_ASCS20 pf=/usr/sap/S4H/SYS/profile/S4H_ASCS20_s4ascs
Kill the identified process:
node1:s4hadm 52> kill -9 31464
Notice the cluster "Failed Resource Actions":

[root@node2]# pcs status | grep "Failed Resource Actions" -A1
Failed Resource Actions:
  * S4H_ascs20 2m-interval monitor on node1 returned 'not running' at Wed Dec 6 15:37:24 2023
ASCS and ERS move to the other node:

[root@node2]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node2
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node1
  * S4H_ascs20 2m-interval monitor on node1 returned 'not running' at Wed Dec 6 15:37:24 2023
Recovery Procedure
Clear the failed action:
[root@node2]# pcs resource cleanup S4H_ascs20
…
Waiting for 1 reply from the controller
... got reply (done)
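To confirm that the cleanup also reset the failcount of the resource, it can be queried explicitly; a minimal check:

[root@node2]# pcs resource failcount show S4H_ascs20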
5.4. Testing failure of the ERS instance
To verify that the pacemaker cluster takes the necessary action when the enqueue replication server (ERS) of the ASCS instance fails.
Test Preconditions
- Both cluster nodes are up, with the resource groups for the ASCS and ERS running:

[root@node1]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node2
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node1
- All failures for the resources and resource groups have been cleared and the failcounts have been reset.
Test Procedure
- Identify the PID of the enqueue replication server process on the node where the ERS instance is running.
- Send a SIGKILL signal to the identified process.
Monitoring
Run the following command in a separate terminal during the test:
[root@node2]# watch -n 1 pcs status
Expected behavior
- The enqueue replication server process gets killed.
- The pacemaker cluster takes the required action as per the configuration, in this case restarting the ERS instance on the same node (see the sketch after this list).
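Whether the cluster restarts a failed instance in place or moves it elsewhere depends on how the resource is configured (for example, the migration-threshold meta attribute). A sketch to review the configured values, assuming a pcs version that provides pcs resource config (older releases use pcs resource show instead):

[root@node1]# pcs resource config S4H_ers29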
Test
Switch to the <sid>adm user:

[root@node1]# su - s4hadm
Identify the PID of enqr.sap:

node1:s4hadm 56> pgrep -af enqr.sap
532273 enqr.sapS4H_ERS29 pf=/usr/sap/S4H/SYS/profile/S4H_ERS29_s4ers
Kill the identified process:
node1:s4hadm 58> kill -9 532273
Notice the cluster "Failed Resource Actions":

[root@node1]# pcs status | grep "Failed Resource Actions" -A1
Failed Resource Actions:
  * S4H_ers29 2m-interval monitor on node1 returned 'not running' at Thu Dec 7 13:15:02 2023
ERS restarts on the same node, without disturbing the ASCS already running on the other node:

[root@node1]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node2
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node1
  * S4H_ers29 2m-interval monitor on node1 returned 'not running' at Thu Dec 7 13:15:02 2023
Recovery Procedure
Clear the failed action:
[root@node1]# pcs resource cleanup S4H_ers29
…
Waiting for 1 reply from the controller
... got reply (done)
5.5. Failover of ASCS instance due to node crash
To verify that the ASCS instance moves correctly in case of a node crash.
Test Preconditions
- Both cluster nodes are up, with the resource groups for the ASCS and ERS running:

[root@node1]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node2
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node1
- All failures for the resources and resource groups have been cleared and the failcounts have been reset.
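Because this test crashes a node, it only works reliably if fencing is configured and operational, so that the surviving node can safely take over the resources. A minimal pre-check of the fencing device:

[root@node1]# pcs stonith status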
Test Procedure
- Crash the node where ASCS is running.
Monitoring
Run the following command in a separate terminal on the other node during the test:
[root@node1]# watch -n 1 pcs status
Expected behavior
- The node where ASCS is running crashes, and shuts down or restarts as per the configuration.
- Meanwhile, ASCS moves to the other node.
- ERS starts on the previously crashed node after it comes back online.
Test
Run the following command as the root user on the node where ASCS is running:

[root@node2]# echo c > /proc/sysrq-trigger
ASCS moves to the other node:

[root@node1]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node1
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node1
ERS stops and then moves to the previously crashed node once it comes back online:

[root@node1]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node1
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Stopped

[root@node1]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node1
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node2
Recovery Procedure
Clean up failed actions, if any:
[root@node1]# pcs resource cleanup
5.6. Failure of ERS instance due to node crash
To verify that the ERS instance restarts on the same node after a crash of the node where it is running.
Test Preconditions
- Both cluster nodes are up, with the resource groups for the ASCS and ERS running:

[root@node1]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node1
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node2
- All failures for the resources and resource groups have been cleared and the failcounts have been reset.
Test Procedure
- Crash the node where ERS is running.
Monitoring
Run the following command in a separate terminal on the other node during the test:
[root@node1]# watch -n 1 pcs status
Expected behavior
- The node where ERS is running crashes, and shuts down or restarts as per the configuration.
- Meanwhile, ASCS continues to run on the other node.
- ERS restarts on the crashed node after it comes back online.
Test
Run the following command as the root user on the node where ERS is running:

[root@node2]# echo c > /proc/sysrq-trigger
ERS restarts on the crashed node after it comes back online, without disturbing the ASCS instance throughout the test:

[root@node1]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node1
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node2
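To double-check that the cluster restarted ERS in place rather than failing it over, the pacemaker log on the surviving node can be searched for the recovery actions; a sketch, assuming the default RHEL log location:

[root@node1]# grep "S4H_ers29" /var/log/pacemaker/pacemaker.log | tail -20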
Recovery Procedure
Clean up failed actions, if any:
[root@node2]# pcs resource cleanup
5.7. Failure of ASCS instance due to node crash (ENSA2)
In a 3-node ENSA2 cluster environment, the third node is considered during failover events of any instance.
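Whether a given system uses ENSA2 (the standalone enqueue server 2 used by SAP S/4HANA) can be checked from the name of the enqueue server process, which is enq.sap for ENSA2 and en.sap for the older ENSA1, as noted in section 5.3. A minimal check on the node where ASCS is running:

[root@node1]# pgrep -af enq.sap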
Test Preconditions
- A 3-node SAP S/4HANA cluster with the resource groups for the ASCS and ERS running.
- The 3rd node has access to all the file systems and can provision the required instance-specific IP addresses in the same way as the first 2 nodes (see the check after this list).

In the example setup, the underlying shared NFS filesystems are as follows:

Node List:
  * Online: [ node1 node2 node3 ]

Active Resources:
  * s4r9g2_fence (stonith:fence_rhevm): Started node1
  * Clone Set: s4h_fs_sapmnt-clone [fs_sapmnt]:
    * Started: [ node1 node2 node3 ]
  * Clone Set: s4h_fs_sap_trans-clone [fs_sap_trans]:
    * Started: [ node1 node2 node3 ]
  * Clone Set: s4h_fs_sap_SYS-clone [fs_sap_SYS]:
    * Started: [ node1 node2 node3 ]
  * Resource Group: S4H_ASCS20_group:
    * S4H_lvm_ascs20 (ocf:heartbeat:LVM-activate): Started node1
    * S4H_fs_ascs20 (ocf:heartbeat:Filesystem): Started node1
    * S4H_vip_ascs20 (ocf:heartbeat:IPaddr2): Started node1
    * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node1
  * Resource Group: S4H_ERS29_group:
    * S4H_lvm_ers29 (ocf:heartbeat:LVM-activate): Started node2
    * S4H_fs_ers29 (ocf:heartbeat:Filesystem): Started node2
    * S4H_vip_ers29 (ocf:heartbeat:IPaddr2): Started node2
    * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node2
- All failures for the resources and resource groups have been cleared and the failcounts have been reset.
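A quick way to confirm that the 3rd node has access to the shared filesystems is to check the mounts there; a sketch, assuming the mount points typical for this example setup:

[root@node3]# df -h /sapmnt /usr/sap/trans /usr/sap/S4H/SYS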
Test Procedure
- Crash the node where ASCS is running.
Monitoring
Run the following command in a separate terminal, during the test, on one of the nodes where the ASCS group is currently not running:

[root@node2]# watch -n 1 pcs status
Expected behavior
- ASCS moves to the 3rd node.
- ERS continues to run on the same node where it is already running.
Test
Crash the node where the ASCS group is currently running:

[root@node1]# echo c > /proc/sysrq-trigger
ASCS moves to the 3rd node, without disturbing the already running ERS instance on the 2nd node:

[root@node2]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node3
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node2
Recovery Procedure
Clean up failed actions, if any:
[root@node2]# pcs resource cleanup