Chapter 5. Testing the cluster configuration
Before the HA cluster setup is put into production, it is recommended to perform the following tests to ensure that it works as expected.
These tests should also be repeated later as part of regular HA/DR drills. This ensures that the cluster still works as expected, and that admins stay familiar with the procedures required to bring the setup back to a healthy state if an issue occurs during normal operation, or if manual maintenance of the setup is required.
5.1. Manually moving the ASCS instance using the pcs command
To verify that the pacemaker cluster is able to move the ASCS instance to the other HA cluster node on demand.
Test Preconditions
- Both cluster nodes are up, with the resource groups for the ASCS and ERS running on different HA cluster nodes:

  * Resource Group: S4H_ASCS20_group:
    * S4H_lvm_ascs20 (ocf:heartbeat:LVM-activate): Started node1
    * S4H_fs_ascs20 (ocf:heartbeat:Filesystem): Started node1
    * S4H_vip_ascs20 (ocf:heartbeat:IPaddr2): Started node1
    * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node1
  * Resource Group: S4H_ERS29_group:
    * S4H_lvm_ers29 (ocf:heartbeat:LVM-activate): Started node2
    * S4H_fs_ers29 (ocf:heartbeat:Filesystem): Started node2
    * S4H_vip_ers29 (ocf:heartbeat:IPaddr2): Started node2
    * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node2
- All failures for the resources and resource groups have been cleared and the failcounts have been reset.
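If earlier tests have left failures behind, a clean state can be restored and verified with standard pcs and crm_mon commands before starting; a minimal sketch:

[root@node1]# pcs resource cleanup
[root@node1]# crm_mon -1 --failcounts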
Test Procedure
Run the following command from any node to initiate the move of the ASCS instance to the other HA cluster node:

[root@node1]# pcs resource move S4H_ascs20
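Note that pcs resource move performs the move by adding a temporary location constraint for the resource group. A sketch to inspect it while the test runs; the listing should contain an entry whose id starts with cli-prefer or cli-ban, which is what the pcs resource clear command in the recovery procedure removes:

[root@node1]# pcs constraint location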
Monitoring
Run the following command in a separate terminal during the test:
[root@node2]# watch -n 1 pcs status
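As an alternative to watch with pcs status, crm_mon can be used; it refreshes continuously on its own and can also display inactive resources. A minimal sketch:

[root@node2]# crm_mon -r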
Expected behavior
- The ASCS resource group is moved to the other node.
- The ERS resource group stops after that and moves to the node where the ASCS resource group was running before.
Test Result
The ASCS resource group moves to the other node, in this scenario to node2, and the ERS resource group moves to node1:

  * Resource Group: S4H_ASCS20_group:
    * S4H_lvm_ascs20 (ocf:heartbeat:LVM-activate): Started node2
    * S4H_fs_ascs20 (ocf:heartbeat:Filesystem): Started node2
    * S4H_vip_ascs20 (ocf:heartbeat:IPaddr2): Started node2
    * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node2
  * Resource Group: S4H_ERS29_group:
    * S4H_lvm_ers29 (ocf:heartbeat:LVM-activate): Started node1
    * S4H_fs_ers29 (ocf:heartbeat:Filesystem): Started node1
    * S4H_vip_ers29 (ocf:heartbeat:IPaddr2): Started node1
    * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node1
Recovery Procedure
Remove the location constraints, if any:
[root@node1]# pcs resource clear S4H_ascs20
5.2. Manually moving the ASCS instance using sapcontrol (with SAP HA interface enabled)
To verify that the sapcontrol command is able to move the ASCS instance to the other HA cluster node when the SAP HA interface is enabled for the instance.
Test Preconditions
- The SAP HA interface is enabled for the SAP instance (a verification sketch follows this list).
- Both cluster nodes are up, with the resource groups for the ASCS and ERS running:

[root@node2]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node2
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node1
- All failures for the resources and resource groups have been cleared and the failcounts have been reset.
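Whether the SAP HA interface is actually enabled can be verified with sapcontrol itself; a minimal check, using the instance number 20 from this example setup (an output line HAActive: TRUE indicates that the HA interface is active):

[root@node2]# su - s4hadm
node2:s4hadm 51> sapcontrol -nr 20 -function HAGetFailoverConfig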
Test Procedure
- As the <sid>adm user, run the HAFailoverToNode function of sapcontrol to move the ASCS instance to the other node.
Monitoring
Run the following command in a separate terminal during the test:
[root@node2]# watch -n 1 pcs status
Expected behavior
- The ASCS instance should move to the other HA cluster node, with a temporary location constraint being created for the move to complete.
Test
[root@node2]# su - s4hadm
node2:s4hadm 52> sapcontrol -nr 20 -function HAFailoverToNode ""

06.12.2023 12:57:04
HAFailoverToNode
OK
Test Result
ASCS and ERS both move to the other node:

[root@node2]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node1
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node2
Constraints are created as shown below:
[root@node1]# pcs constraint
Location Constraints:
  Resource: S4H_ASCS20_group
    Constraint: cli-ban-S4H_ASCS20_group-on-node2
      Rule: boolean-op=and score=-INFINITY
        Expression: #uname eq string node1
        Expression: date lt xxxx-xx-xx xx:xx:xx +xx:xx
Recovery Procedure
- The constraint shown above is cleared automatically when the date lt mentioned in the Expression is reached. Alternatively, the constraint can be removed with the following command:
[root@node1]# pcs resource clear S4H_ascs20
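The same can also be achieved by removing the constraint by its id, as reported by pcs constraint; a sketch, assuming the id shown in the output above:

[root@node1]# pcs constraint remove cli-ban-S4H_ASCS20_group-on-node2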
5.3. Testing failure of the ASCS instance
To verify that the pacemaker cluster takes the necessary action when the enqueue server of the ASCS instance, or the whole ASCS instance, fails.
Test Preconditions
- Both cluster nodes are up, with the resource groups for the ASCS and ERS running:

[root@node2]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node1
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node2
- All failures for the resources and resource groups have been cleared and the failcounts have been reset.
Test Procedure
- Identify the PID of the enqueue server on the node where ASCS is running.
- Send a SIGKILL signal to the identified process.
Monitoring
Run the following command in a separate terminal during the test:
[root@node2]# watch -n 1 pcs status
Expected behavior
- The enqueue server process gets killed.
- The pacemaker cluster takes the required action as per the configuration, in this case moving the ASCS resource group to the other node.
Test
Switch to the <sid>adm user on the node where ASCS is running:

[root@node1]# su - s4hadm
Identify the PID of the enqueue server process, en.sap (NetWeaver) or enq.sap (S/4HANA):

node1:s4hadm 51> pgrep -af "(en|enq).sap"
31464 enq.sapS4H_ASCS20 pf=/usr/sap/S4H/SYS/profile/S4H_ASCS20_s4ascs
Kill the identified process:
node1:s4hadm 52> kill -9 31464
Notice the cluster "Failed Resource Actions":

[root@node2]# pcs status | grep "Failed Resource Actions" -A1
Failed Resource Actions:
  * S4H_ascs20 2m-interval monitor on node1 returned 'not running' at Wed Dec 6 15:37:24 2023
ASCS and ERS move to the other node:

[root@node2]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node2
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node1
  * S4H_ascs20 2m-interval monitor on node1 returned 'not running' at Wed Dec 6 15:37:24 2023
Recovery Procedure
Clear the failed action:
[root@node2]# pcs resource cleanup S4H_ascs20
…
Waiting for 1 reply from the controller
... got reply (done)
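To confirm that the cleanup also reset the failcount of the resource, it can be queried explicitly; a minimal check:

[root@node2]# pcs resource failcount show S4H_ascs20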
5.4. Testing failure of the ERS instance
To verify that the pacemaker cluster takes the necessary action when the enqueue replication server (ERS) of the ASCS instance fails.
Test Preconditions
- Both cluster nodes are up, with the resource groups for the ASCS and ERS running:

[root@node1]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node2
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node1
- All failures for the resources and resource groups have been cleared and the failcounts have been reset.
Test Procedure
- Identify the PID of the enqueue replication server process on the node where the ERS instance is running.
- Send a SIGKILL signal to the identified process.
Monitoring
Run the following command in a separate terminal during the test:
[root@node2]# watch -n 1 pcs status
Expected behavior
- The enqueue replication server process gets killed.
- The pacemaker cluster takes the required action as per the configuration, in this case restarting the ERS instance on the same node (see the sketch after this list).
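Whether the cluster restarts a failed instance in place or moves it elsewhere depends on how the resource is configured (for example, the migration-threshold meta attribute). A sketch to review the configured values, assuming a pcs version that provides pcs resource config (older releases use pcs resource show instead):

[root@node1]# pcs resource config S4H_ers29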
Test
Switch to the <sid>adm user:

[root@node1]# su - s4hadm
Identify the PID of enqr.sap:

node1:s4hadm 56> pgrep -af enqr.sap
532273 enqr.sapS4H_ERS29 pf=/usr/sap/S4H/SYS/profile/S4H_ERS29_s4ers
Kill the identified process:
node1:s4hadm 58> kill -9 532273
Notice the cluster "Failed Resource Actions":

[root@node1]# pcs status | grep "Failed Resource Actions" -A1
Failed Resource Actions:
  * S4H_ers29 2m-interval monitor on node1 returned 'not running' at Thu Dec 7 13:15:02 2023
ERS restarts on the same node, without disturbing the ASCS already running on the other node:

[root@node1]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node2
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node1
  * S4H_ers29 2m-interval monitor on node1 returned 'not running' at Thu Dec 7 13:15:02 2023
Recovery Procedure
Clear the failed action:
[root@node1]# pcs resource cleanup S4H_ers29
…
Waiting for 1 reply from the controller
... got reply (done)
5.5. Failover of ASCS instance due to node crash
To verify that the ASCS instance moves correctly in case of a node crash.
Test Preconditions
- Both cluster nodes are up, with the resource groups for the ASCS and ERS running:

[root@node1]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node2
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node1
- All failures for the resources and resource groups have been cleared and the failcounts have been reset.
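Because this test crashes a node, it only works reliably if fencing is configured and operational, so that the surviving node can safely take over the resources. A minimal pre-check of the fencing device:

[root@node1]# pcs stonith status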
Test Procedure
- Crash the node where ASCS is running.
Monitoring
Run the following command in a separate terminal on the other node during the test:
[root@node1]# watch -n 1 pcs status
Expected behavior
- The node where ASCS is running crashes, and shuts down or restarts as per the configuration.
- Meanwhile, ASCS moves to the other node.
- ERS starts on the previously crashed node after it comes back online.
Test
Run the following command as the root user on the node where ASCS is running:

[root@node2]# echo c > /proc/sysrq-trigger
ASCS moves to the other node:

[root@node1]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node1
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node1
ERS stops and then moves to the previously crashed node once it comes back online:

[root@node1]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node1
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Stopped

[root@node1]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node1
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node2
Recovery Procedure
Clean up failed actions, if any:
[root@node1]# pcs resource cleanup
5.6. Failure of ERS instance due to node crash
To verify that the ERS instance restarts on the same node after a crash of the node where it is running.
Test Preconditions
- Both cluster nodes are up, with the resource groups for the ASCS and ERS running:

[root@node1]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node1
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node2
- All failures for the resources and resource groups have been cleared and the failcounts have been reset.
Test Procedure
- Crash the node where ERS is running.
Monitoring
Run the following command in a separate terminal on the other node during the test:
[root@node1]# watch -n 1 pcs status
Expected behavior
- The node where ERS is running crashes, and shuts down or restarts as per the configuration.
- Meanwhile, ASCS continues to run on the other node.
- ERS restarts on the crashed node after it comes back online.
Test
Run the following command as the root user on the node where ERS is running:

[root@node2]# echo c > /proc/sysrq-trigger
ERS restarts on the crashed node after it comes back online, without disturbing the ASCS instance throughout the test:

[root@node1]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node1
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node2
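To double-check that the cluster restarted ERS in place rather than failing it over, the pacemaker log on the surviving node can be searched for the recovery actions; a sketch, assuming the default RHEL log location:

[root@node1]# grep "S4H_ers29" /var/log/pacemaker/pacemaker.log | tail -20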
Recovery Procedure
Clean up failed actions, if any:
[root@node2]# pcs resource cleanup
5.7. Failure of ASCS instance due to node crash (ENSA2)
In a 3-node ENSA2 cluster environment, the third node is considered during failover events of any instance.
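Whether a given system uses ENSA2 (the standalone enqueue server 2 used by SAP S/4HANA) can be checked from the name of the enqueue server process, which is enq.sap for ENSA2 and en.sap for the older ENSA1, as noted in section 5.3. A minimal check on the node where ASCS is running:

[root@node1]# pgrep -af enq.sap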
Test Preconditions
- A 3-node SAP S/4HANA cluster with the resource groups for the ASCS and ERS running.
- The 3rd node has access to all the file systems and can provision the required instance-specific IP addresses in the same way as the first 2 nodes (see the check after this list).

In the example setup, the underlying shared NFS filesystems are as follows:

Node List:
  * Online: [ node1 node2 node3 ]

Active Resources:
  * s4r9g2_fence (stonith:fence_rhevm): Started node1
  * Clone Set: s4h_fs_sapmnt-clone [fs_sapmnt]:
    * Started: [ node1 node2 node3 ]
  * Clone Set: s4h_fs_sap_trans-clone [fs_sap_trans]:
    * Started: [ node1 node2 node3 ]
  * Clone Set: s4h_fs_sap_SYS-clone [fs_sap_SYS]:
    * Started: [ node1 node2 node3 ]
  * Resource Group: S4H_ASCS20_group:
    * S4H_lvm_ascs20 (ocf:heartbeat:LVM-activate): Started node1
    * S4H_fs_ascs20 (ocf:heartbeat:Filesystem): Started node1
    * S4H_vip_ascs20 (ocf:heartbeat:IPaddr2): Started node1
    * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node1
  * Resource Group: S4H_ERS29_group:
    * S4H_lvm_ers29 (ocf:heartbeat:LVM-activate): Started node2
    * S4H_fs_ers29 (ocf:heartbeat:Filesystem): Started node2
    * S4H_vip_ers29 (ocf:heartbeat:IPaddr2): Started node2
    * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node2
- All failures for the resources and resource groups have been cleared and the failcounts have been reset.
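A quick way to confirm that the 3rd node has access to the shared filesystems is to check the mounts there; a sketch, assuming the mount points typical for this example setup:

[root@node3]# df -h /sapmnt /usr/sap/trans /usr/sap/S4H/SYS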
Test Procedure
- Crash the node where ASCS is running.
Monitoring
Run the following command in a separate terminal, during the test, on one of the nodes where the ASCS group is currently not running:

[root@node2]# watch -n 1 pcs status
Expected behavior
- ASCS moves to the 3rd node.
- ERS continues to run on the same node where it is already running.
Test
Crash the node where the ASCS group is currently running:

[root@node1]# echo c > /proc/sysrq-trigger
ASCS moves to the 3rd node, without disturbing the already running ERS instance on the 2nd node:

[root@node2]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
  * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node3
  * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node2
Recovery Procedure
Clean up failed actions, if any:
[root@node2]# pcs resource cleanup