Chapter 5. Testing the cluster configuration
Before putting the HA cluster setup into production, it is recommended to perform the following tests to ensure that the setup works as expected.
These tests should also be repeated later as part of regular HA/DR drills, both to verify that the cluster still behaves as expected and to keep admins familiar with the procedures for restoring the setup to a healthy state if an issue occurs during normal operation, or when manual maintenance is required.
5.1. Manually moving the ASCS instance using the pcs command
To verify that the pacemaker cluster is able to move the instances to the other HA cluster node on demand.
Test Preconditions
- Both cluster nodes are up, with the resource groups for the ASCS and ERS running on different HA cluster nodes.
- All failures for the resources and resource groups have been cleared and the failcounts have been reset.
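To confirm these preconditions before starting the test, resource placement and failcounts can be checked with pcs; a possible check, using the resource names from this example setup:

```
[root@node1]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
[root@node1]# pcs resource failcount show
```

If any failcounts remain, they can be reset with pcs resource cleanup.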
Test Procedure
Run the following command from any node to initiate the move of the ASCS instance to the other HA cluster node:

[root@node1]# pcs resource move S4H_ascs20
Monitoring
Run the following command in a separate terminal during the test:
[root@node2]# watch -n 1 pcs status
Expected behavior
- The ASCS resource group is moved to the other node.
- The ERS resource group stops after that and moves to the node where the ASCS resource group was running before.
Test Result
The ASCS resource group moves to the other node (node2 in this scenario), and the ERS resource group moves to node1.
Recovery Procedure
Remove the location constraints, if any:

[root@node1]# pcs resource clear S4H_ascs20
5.2. Manually moving the ASCS instance using sapcontrol (with SAP HA interface enabled)
To verify that the sapcontrol command is able to move the instances to the other HA cluster node when the SAP HA interface is enabled for the instance.
Test Preconditions
- The SAP HA interface is enabled for the SAP instance.
- Both cluster nodes are up with the resource groups for the ASCS and ERS running:

[root@node2]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node2
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node1

- All failures for the resources and resource groups have been cleared and the failcounts have been reset.
Test Procedure
- As the <sid>adm user, run the HAFailoverToNode function of sapcontrol to move the ASCS instance to the other node.
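The exact sapcontrol invocation is not reproduced above; a possible form of the command, assuming the ASCS instance number 20 used throughout this example and that an empty node name lets the HA solution choose the target node:

```
node2:s4hadm 55> sapcontrol -nr 20 -function HAFailoverToNode ""
```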
Monitoring
Run the following command in a separate terminal during the test:
[root@node2]# watch -n 1 pcs status
Expected behavior
- The ASCS instance should move to the other HA cluster node, creating a temporary location constraint for the move to complete.
Test result
ASCS and ERS both move to the other node:

[root@node2]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node1
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node2

A temporary location constraint for the move is also created.
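The temporary constraint created by the move can be inspected with pcs; for example (the exact constraint id and rule expression depend on the cluster state):

```
[root@node2]# pcs constraint --full
```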
Recovery Procedure
The temporary location constraint is cleared automatically when the date lt timestamp in its rule expression is reached. Alternatively, the constraint can be removed with the following command:

[root@node1]# pcs resource clear S4H_ascs20
5.3. Testing failure of the ASCS instance
To verify that the pacemaker cluster takes the necessary action when the enqueue server of the ASCS instance, or the whole ASCS instance, fails.
Test Preconditions
- Both cluster nodes are up with the resource groups for the ASCS and ERS running:

[root@node2]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node1
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node2

- All failures for the resources and resource groups have been cleared and the failcounts have been reset.
Test Procedure
- Identify the PID of the enqueue server on the node where ASCS is running.
- Send a SIGKILL signal to the identified process.
Monitoring
Run the following command in a separate terminal during the test:
[root@node2]# watch -n 1 pcs status
Expected behavior
- The enqueue server process gets killed.
- The pacemaker cluster takes the required action as per configuration, in this case moving the ASCS to the other node.
Test
Switch to the <sid>adm user on the node where ASCS is running:

[root@node1]# su - s4hadm

Identify the PID of en.sap (NetWeaver) or enq.sap (S/4HANA):

node1:s4hadm 51> pgrep -af "(en|enq).sap"
31464 enq.sapS4H_ASCS20 pf=/usr/sap/S4H/SYS/profile/S4H_ASCS20_s4ascs

Kill the identified process:

node1:s4hadm 52> kill -9 31464

Notice the cluster "Failed Resource Actions":

[root@node2]# pcs status | grep "Failed Resource Actions" -A1
Failed Resource Actions:
  * S4H_ascs20 2m-interval monitor on node1 returned 'not running' at Wed Dec 6 15:37:24 2023

ASCS and ERS move to the other node:

[root@node2]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node2
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node1
  * S4H_ascs20 2m-interval monitor on node1 returned 'not running' at Wed Dec 6 15:37:24 2023
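Before clearing the failure, the failcount recorded by the cluster for the ASCS resource can be inspected; for example:

```
[root@node2]# pcs resource failcount show S4H_ascs20
```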
Recovery Procedure
Clear the failed action:
[root@node2]# pcs resource cleanup S4H_ascs20
…
Waiting for 1 reply from the controller
... got reply (done)
5.4. Testing failure of the ERS instance
To verify that the pacemaker cluster takes the necessary action when the enqueue replication server (ERS) instance fails.
Test Preconditions
- Both cluster nodes are up with the resource groups for the ASCS and ERS running:

[root@node1]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node2
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node1

- All failures for the resources and resource groups have been cleared and the failcounts have been reset.
Test Procedure
- Identify the PID of the enqueue replication server process on the node where the ERS instance is running.
- Send a SIGKILL signal to the identified process.
Monitoring
Run the following command in a separate terminal during the test:
[root@node2]# watch -n 1 pcs status
Expected behavior
- The enqueue replication server process gets killed.
- The pacemaker cluster takes the required action as per configuration, in this case restarting the ERS instance on the same node.
Test
Switch to the <sid>adm user:

[root@node1]# su - s4hadm

Identify the PID of enqr.sap:

node1:s4hadm 56> pgrep -af enqr.sap
532273 enqr.sapS4H_ERS29 pf=/usr/sap/S4H/SYS/profile/S4H_ERS29_s4ers

Kill the identified process:

node1:s4hadm 58> kill -9 532273

Notice the cluster "Failed Resource Actions":

[root@node1]# pcs status | grep "Failed Resource Actions" -A1
Failed Resource Actions:
  * S4H_ers29 2m-interval monitor on node1 returned 'not running' at Thu Dec 7 13:15:02 2023

ERS restarts on the same node, without disturbing the ASCS already running on the other node:

[root@node1]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node2
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node1
  * S4H_ers29 2m-interval monitor on node1 returned 'not running' at Thu Dec 7 13:15:02 2023
Recovery Procedure
Clear the failed action:
[root@node1]# pcs resource cleanup S4H_ers29
…
Waiting for 1 reply from the controller
... got reply (done)
5.5. Failover of ASCS instance due to node crash
To verify that the ASCS instance moves correctly in case of a node crash.
Test Preconditions
- Both cluster nodes are up with the resource groups for the ASCS and ERS running:

[root@node1]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node2
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node1

- All failures for the resources and resource groups have been cleared and the failcounts have been reset.
Test Procedure
- Crash the node where ASCS is running.
Monitoring
Run the following command in a separate terminal on the other node during the test:
[root@node1]# watch -n 1 pcs status
Expected behavior
- The node where ASCS is running crashes and shuts down or restarts, as per configuration.
- Meanwhile, ASCS moves to the other node.
- ERS starts on the previously crashed node after it comes back online.
Test
Run the following command as the root user on the node where ASCS is running:

[root@node2]# echo c > /proc/sysrq-trigger

ASCS moves to the other node:

[root@node1]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node1
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node1

ERS stops and moves to the previously crashed node once it comes back online.
Recovery Procedure
Clean up failed actions, if any:
[root@node1]# pcs resource cleanup
5.6. Failure of ERS instance due to node crash
To verify that the ERS instance restarts on the same node.
Test Preconditions
- Both cluster nodes are up with the resource groups for the ASCS and ERS running:

[root@node1]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node1
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node2

- All failures for the resources and resource groups have been cleared and the failcounts have been reset.
Test Procedure
- Crash the node where ERS is running.
Monitoring
Run the following command in a separate terminal on the other node during the test:
[root@node1]# watch -n 1 pcs status
Expected behavior
- The node where ERS is running crashes and shuts down or restarts, as per configuration.
- Meanwhile, ASCS continues to run on the other node.
- ERS restarts on the crashed node after it comes back online.
Test
Run the following command as the root user on the node where ERS is running:

[root@node2]# echo c > /proc/sysrq-trigger

ERS restarts on the crashed node after it comes back online, without disturbing the ASCS instance throughout the test:

[root@node1]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node1
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node2
Recovery Procedure
Clean up failed actions if any:
[root@node2]# pcs resource cleanup
5.7. Failure of ASCS instance due to node crash (ENSA2)
In a 3-node ENSA2 cluster environment, the third node is also considered during failover events of any instance.
Test Preconditions
- A 3-node SAP S/4HANA cluster with the resource groups for the ASCS and ERS running.
- The 3rd node has access to all the file systems and can provision the required instance-specific IP addresses in the same way as the first 2 nodes. In the example setup, the instance file systems are provided via shared NFS.
- All failures for the resources and resource groups have been cleared and the failcounts have been reset.
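The NFS file system listing from the example setup is not reproduced here; one way to confirm that the shared file systems are mounted on the 3rd node (the node name is an assumption from this example):

```
[root@node3]# findmnt -t nfs,nfs4
```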
Test Procedure
- Crash the node where ASCS is running.
Monitoring
Run the following command in a separate terminal during the test, on one of the nodes where the ASCS group is currently not running:

[root@node2]# watch -n 1 pcs status
Expected behavior
- ASCS moves to the 3rd node.
- ERS continues to run on the node where it is already running.
Test
Crash the node where the ASCS group is currently running:

[root@node1]# echo c > /proc/sysrq-trigger

ASCS moves to the 3rd node without disturbing the ERS instance already running on the 2nd node:

[root@node2]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node3
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node2
Recovery Procedure
Clean up failed actions if any:
[root@node2]# pcs resource cleanup