Chapter 5. Testing the cluster configuration

Before the HA cluster setup is put in production, it is recommended to perform the following tests to ensure that the HA cluster setup works as expected.

These tests should also be repeated later as part of regular HA/DR drills, both to ensure that the cluster still works as expected and to keep administrators familiar with the procedures required to bring the setup back to a healthy state if an issue occurs during normal operation, or if manual maintenance of the setup is required.

5.1. Manually moving ASCS instance using pcs command

To verify that the pacemaker cluster is able to move the ASCS instance to the other HA cluster node on demand.

  • Test Preconditions

    • Both cluster nodes are up, with the resource groups for the ASCS and ERS running on different HA cluster nodes:

        * Resource Group: S4H_ASCS20_group:
          * S4H_lvm_ascs20    (ocf:heartbeat:LVM-activate):    Started node1
          * S4H_fs_ascs20     (ocf:heartbeat:Filesystem):      Started node1
          * S4H_vip_ascs20    (ocf:heartbeat:IPaddr2):         Started node1
          * S4H_ascs20        (ocf:heartbeat:SAPInstance):     Started node1
        * Resource Group: S4H_ERS29_group:
          * S4H_lvm_ers29     (ocf:heartbeat:LVM-activate):    Started node2
          * S4H_fs_ers29      (ocf:heartbeat:Filesystem):      Started node2
          * S4H_vip_ers29     (ocf:heartbeat:IPaddr2):         Started node2
          * S4H_ers29         (ocf:heartbeat:SAPInstance):     Started node2
    • All failures for the resources and resource groups have been cleared and the failcounts have been reset.
  • Test Procedure

    • Run the following command from any node to initiate the move of the ASCS instance to the other HA cluster node:

      [root@node1]# pcs resource move S4H_ascs20
  • Monitoring

    • Run the following command in a separate terminal during the test:

      [root@node2]# watch -n 1 pcs status
  • Expected behavior

    • The ASCS resource group is moved to the other node.
    • The ERS resource group stops after that and moves to the node where the ASCS resource group was running before.
  • Test Result

    • The ASCS resource group moves to the other node (in this scenario, node2), and the ERS resource group moves to node1:

        * Resource Group: S4H_ASCS20_group:
          * S4H_lvm_ascs20 (ocf:heartbeat:LVM-activate): Started node2
          * S4H_fs_ascs20 (ocf:heartbeat:Filesystem): Started node2
          * S4H_vip_ascs20 (ocf:heartbeat:IPaddr2): Started node2
          * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node2
        * Resource Group: S4H_ERS29_group:
          * S4H_lvm_ers29 (ocf:heartbeat:LVM-activate): Started node1
          * S4H_fs_ers29 (ocf:heartbeat:Filesystem): Started node1
          * S4H_vip_ers29 (ocf:heartbeat:IPaddr2): Started node1
          * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node1
  • Recovery Procedure:

    • Remove the location constraints, if any:

      [root@node1]# pcs resource clear S4H_ascs20
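For scripting these drills, the "Started on which node" checks above can be automated with a small text-matching helper. The following is a minimal sketch: the helper name is_started and the sample status text are hypothetical, while the resource and node names are the ones used in this chapter.

```shell
# Hypothetical helper for scripting these drills: check whether a resource is
# reported as Started on a given node in captured `pcs status` output.
is_started() {  # usage: is_started "<status text>" <resource> <node>
  printf '%s\n' "$1" | grep -Eq "${2}[[:space:]].*Started[[:space:]]+${3}"
}

# Sample text in the format shown in the test result above:
status='* S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node2
* S4H_ers29 (ocf:heartbeat:SAPInstance): Started node1'

is_started "$status" S4H_ascs20 node2 && echo "ASCS is on node2"
```

On a cluster node, the status text could come from pcs directly, for example is_started "$(pcs status)" S4H_ascs20 node2.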

5.2. Manually moving the ASCS instance using sapcontrol (with SAP HA interface enabled)

To verify that the sapcontrol command is able to move the instances to the other HA cluster node when the SAP HA interface is enabled for the instance.

  • Test Preconditions

    • The SAP HA interface is enabled for the SAP instance.
    • Both cluster nodes are up with the resource groups for the ASCS and ERS running.

      [root@node2]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
           * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node2
           * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node1
    • All failures for the resources and resource groups have been cleared and the failcounts have been reset.
  • Test Procedure

    • As the <sid>adm user, run the HAFailoverToNode function of sapcontrol to move the ASCS instance to the other node.
  • Monitoring

    • Run the following command in a separate terminal during the test:

      [root@node2]# watch -n 1 pcs status
  • Expected behavior

    • The ASCS instance should move to the other HA cluster node; a temporary location constraint is created for the duration of the move.
  • Test

    [root@node2]# su - s4hadm
    node2:s4hadm 52> sapcontrol -nr 20 -function HAFailoverToNode ""
    
    06.12.2023 12:57:04
    HAFailoverToNode
    OK
  • Test result

    • ASCS and ERS both move to the other node:

      [root@node2]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
          * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node1
          * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node2
    • Constraints are created as shown below:

      [root@node1]# pcs constraint
      Location Constraints:
        Resource: S4H_ASCS20_group
          Constraint: cli-ban-S4H_ASCS20_group-on-node2
            Rule: boolean-op=and score=-INFINITY
              Expression: #uname eq string node1
              Expression: date lt xxxx-xx-xx xx:xx:xx +xx:xx
  • Recovery Procedure

    • The constraint shown above is cleared automatically when the date specified in the "date lt" Expression is reached.
    • Alternatively, the constraint can be removed with the following command:

      [root@node1]# pcs resource clear S4H_ascs20
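The id of the temporary constraint can also be captured in a script, for example to remove it explicitly with pcs constraint remove <id>. The following minimal sketch parses a sample of the constraint output shown above; the exact formatting can vary between pcs versions.

```shell
# Sketch: extract the id of the temporary "cli-ban" location constraint from
# `pcs constraint` output, so that it could be removed explicitly (for
# example with `pcs constraint remove <id>`). The sample mirrors the output
# shown above; exact formatting can vary between pcs versions.
constraints='Location Constraints:
  Resource: S4H_ASCS20_group
    Constraint: cli-ban-S4H_ASCS20_group-on-node2'

ban_id=$(printf '%s\n' "$constraints" | sed -n 's/.*Constraint: \(cli-ban[^ ]*\).*/\1/p')
echo "$ban_id"
```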

5.3. Testing failure of the ASCS instance

To verify that the pacemaker cluster takes the necessary action when the enqueue server of the ASCS instance, or the whole ASCS instance, fails.

  • Test Preconditions

    • Both cluster nodes are up with the resource groups for the ASCS and ERS running:

      [root@node2]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
          * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node1
          * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node2
    • All failures for the resources and resource groups have been cleared and the failcounts have been reset.
  • Test Procedure

    • Identify the PID of the enqueue server on the node where ASCS is running.
    • Send a SIGKILL signal to the identified process.
  • Monitoring

    • Run the following command in a separate terminal during the test:

      [root@node2]# watch -n 1 pcs status
  • Expected behavior

    • The enqueue server process is killed.
    • The pacemaker cluster takes the required action as per the configuration, in this case moving the ASCS instance to the other node.
  • Test

    • Switch to the <sid>adm user on the node where ASCS is running:

      [root@node1]# su - s4hadm
    • Identify the PID of the enqueue server process (en.sap on NetWeaver, enq.sap on S/4HANA):

      node1:s4hadm 51> pgrep -af "(en|enq).sap"
      31464 enq.sapS4H_ASCS20 pf=/usr/sap/S4H/SYS/profile/S4H_ASCS20_s4ascs
    • Kill the identified process:

      node1:s4hadm 52> kill -9 31464
    • Notice the cluster Failed Resource Actions:

      [root@node2]# pcs status | grep "Failed Resource Actions" -A1
      Failed Resource Actions:
        * S4H_ascs20 2m-interval monitor on node1 returned 'not running' at Wed Dec  6 15:37:24 2023
    • ASCS and ERS move to the other node:

      [root@node2]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
          * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node2
          * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node1
        * S4H_ascs20 2m-interval monitor on node1 returned 'not running' at Wed Dec  6 15:37:24 2023
  • Recovery Procedure

    • Clear the failed action:

      [root@node2]# pcs resource cleanup S4H_ascs20
      …
      Waiting for 1 reply from the controller
      ... got reply (done)
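The identify/kill/verify pattern used in this test can be rehearsed safely outside the cluster. The sketch below applies it to a harmless placeholder process (sleep) instead of the real enqueue server process.

```shell
# The identify/kill/verify pattern from this test, rehearsed on a harmless
# placeholder process (sleep) instead of the real enq.sap process.
sleep 300 &                       # stand-in for the enqueue server
pid=$!                            # in the real test, the PID comes from pgrep -af "(en|enq).sap"
kill -9 "$pid"                    # simulate the enqueue server failure
wait "$pid" 2>/dev/null || true   # reap the killed process; exit status reflects the kill
kill -0 "$pid" 2>/dev/null || echo "process gone"
```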

5.4. Testing failure of the ERS instance

To verify that the pacemaker cluster takes the necessary action when the enqueue replication server (ERS) instance fails.

  • Test Preconditions

    • Both cluster nodes are up with the resource groups for the ASCS and ERS running:

      [root@node1]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
          * S4H_ascs20	(ocf:heartbeat:SAPInstance):	 Started node2
          * S4H_ers29	(ocf:heartbeat:SAPInstance):	 Started node1
    • All failures for the resources and resource groups have been cleared and the failcounts have been reset.
  • Test Procedure

    • Identify the PID of the enqueue replication server process on the node where the ERS instance is running.
    • Send a SIGKILL signal to the identified process.
  • Monitoring

    • Run the following command in a separate terminal during the test:

      [root@node2]# watch -n 1 pcs status
  • Expected behavior

    • The enqueue replication server process is killed.
    • The pacemaker cluster takes the required action as per the configuration, in this case restarting the ERS instance on the same node.
  • Test

    • Switch to the <sid>adm user:

      [root@node1]# su - s4hadm
    • Identify the PID of enqr.sap:

      node1:s4hadm 56> pgrep -af enqr.sap
      532273 enqr.sapS4H_ERS29 pf=/usr/sap/S4H/SYS/profile/S4H_ERS29_s4ers
    • Kill the identified process:

      node1:s4hadm 58> kill -9 532273
    • Notice the cluster "Failed Resource Actions":

      [root@node1]# pcs status | grep "Failed Resource Actions" -A1
      Failed Resource Actions:
        * S4H_ers29 2m-interval monitor on node1 returned 'not running' at Thu Dec  7 13:15:02 2023
    • ERS restarts on the same node without disturbing the ASCS already running on the other node:

      [root@node1]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
          * S4H_ascs20	(ocf:heartbeat:SAPInstance):	 Started node2
          * S4H_ers29	(ocf:heartbeat:SAPInstance):	 Started node1
        * S4H_ers29 2m-interval monitor on node1 returned 'not running' at Thu Dec  7 13:15:02 2023
  • Recovery Procedure

    • Clear the failed action:

      [root@node1]# pcs resource cleanup S4H_ers29
      …
      Waiting for 1 reply from the controller
      ... got reply (done)
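The difference between this test (ERS restarts in place) and the ASCS failure test (the resource moves) can be checked in a script by comparing the node field of the status lines before and after the failure. The following is a minimal sketch; node_of is a hypothetical helper that only parses captured text.

```shell
# Sketch: classify a recovery by comparing which node a resource reports
# before and after a failure. node_of is a hypothetical helper that only
# parses captured `pcs status` lines like the ones shown above.
node_of() {  # usage: node_of "<status text>" <resource>
  printf '%s\n' "$1" | awk -v r="$2" '$2 == r { print $NF }'
}

before='* S4H_ers29 (ocf:heartbeat:SAPInstance): Started node1'
after='* S4H_ers29 (ocf:heartbeat:SAPInstance): Started node1'

if [ "$(node_of "$before" S4H_ers29)" = "$(node_of "$after" S4H_ers29)" ]; then
  echo "restarted in place"   # expected for this ERS test
else
  echo "moved"                # expected for the ASCS failure test
fi
```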

5.5. Failover of ASCS instance due to node crash

To verify that the ASCS instance moves correctly in case of a node crash.

  • Test Preconditions

    • Both cluster nodes are up with the resource groups for the ASCS and ERS running:

      [root@node1]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
          * S4H_ascs20	(ocf:heartbeat:SAPInstance):	 Started node2
          * S4H_ers29	(ocf:heartbeat:SAPInstance):	 Started node1
    • All failures for the resources and resource groups have been cleared and the failcounts have been reset.
  • Test Procedure

    • Crash the node where ASCS is running.
  • Monitoring

    • Run the following command in a separate terminal on the other node during the test:

      [root@node1]# watch -n 1 pcs status
  • Expected behavior

    • The node where ASCS is running crashes, and shuts down or restarts as per the configuration.
    • Meanwhile, the ASCS instance moves to the other node.
    • ERS starts on the previously crashed node after it comes back online.
  • Test

    • Run the following command as the root user on the node where ASCS is running:

      [root@node2]# echo c > /proc/sysrq-trigger
    • ASCS moves to the other node:

      [root@node1]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
          * S4H_ascs20	(ocf:heartbeat:SAPInstance):	 Started node1
          * S4H_ers29	(ocf:heartbeat:SAPInstance):	 Started node1
    • ERS stops and moves to the previously crashed node once it comes back online:

      [root@node1]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
          * S4H_ascs20	(ocf:heartbeat:SAPInstance):	 Started node1
          * S4H_ers29	(ocf:heartbeat:SAPInstance):	 Stopped
      
      
      [root@node1]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
          * S4H_ascs20	(ocf:heartbeat:SAPInstance):	 Started node1
          * S4H_ers29	(ocf:heartbeat:SAPInstance):	 Started node2
  • Recovery Procedure

    • Clean up failed actions, if any:

      [root@node1]# pcs resource cleanup
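Waiting for the Stopped -> Started transition shown above lends itself to a small retry loop. The following is a minimal sketch; the wait_for name is hypothetical.

```shell
# Generic retry helper (hypothetical name wait_for), sketched for waiting on
# cluster state transitions such as the Stopped -> Started sequence above.
wait_for() {  # usage: wait_for <tries> <delay-seconds> <command...>
  tries=$1 delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$tries" ]; do
    if "$@"; then return 0; fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

wait_for 3 0 true && echo "condition met"
```

During a drill, it could be used on a cluster node, for example: wait_for 60 5 sh -c 'pcs status | grep -q "S4H_ers29 .*Started node2"'.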

5.6. Failure of ERS instance due to node crash

To verify that the ERS instance restarts on the same node after the crashed node comes back online.

  • Test Preconditions

    • Both cluster nodes are up with the resource groups for the ASCS and ERS running:

      [root@node1]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
          * S4H_ascs20	(ocf:heartbeat:SAPInstance):	 Started node1
          * S4H_ers29	(ocf:heartbeat:SAPInstance):	 Started node2
    • All failures for the resources and resource groups have been cleared and the failcounts have been reset.
  • Test Procedure

    • Crash the node where ERS is running.
  • Monitoring

    • Run the following command in a separate terminal on the other node during the test:

      [root@node1]# watch -n 1 pcs status
  • Expected behavior

    • The node where ERS is running crashes, and shuts down or restarts as per the configuration.
    • Meanwhile, ASCS continues to run on the other node. ERS restarts on the crashed node after it comes back online.
  • Test

    • Run the following command as the root user on the node where ERS is running:

      [root@node2]# echo c > /proc/sysrq-trigger
    • ERS restarts on the crashed node after it comes back online, without disturbing the ASCS instance throughout the test:

      [root@node1]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
          * S4H_ascs20	(ocf:heartbeat:SAPInstance):	 Started node1
          * S4H_ers29	(ocf:heartbeat:SAPInstance):	 Started node2
  • Recovery Procedure

    • Clean up failed actions if any:

      [root@node2]# pcs resource cleanup

5.7. Failure of ASCS instance due to node crash (ENSA2)

In a three-node ENSA2 cluster environment, the third node is considered during failover events of any instance.

  • Test Preconditions

    • A three-node SAP S/4HANA cluster with the resource groups for the ASCS and ERS running.
    • The third node has access to all the file systems and can provision the required instance-specific IP addresses in the same way as the first two nodes.
    • In the example setup, the underlying shared NFS file systems are managed as cloned resources, as shown in the following cluster status:

      Node List:
        * Online: [ node1 node2 node3 ]
      
      Active Resources:
        * s4r9g2_fence        (stonith:fence_rhevm):   Started node1
        * Clone Set: s4h_fs_sapmnt-clone [fs_sapmnt]:
          * Started: [ node1 node2 node3 ]
        * Clone Set: s4h_fs_sap_trans-clone [fs_sap_trans]:
          * Started: [ node1 node2 node3 ]
        * Clone Set: s4h_fs_sap_SYS-clone [fs_sap_SYS]:
          * Started: [ node1 node2 node3 ]
        * Resource Group: S4H_ASCS20_group:
          * S4H_lvm_ascs20    (ocf:heartbeat:LVM-activate):    Started node1
          * S4H_fs_ascs20     (ocf:heartbeat:Filesystem):      Started node1
          * S4H_vip_ascs20    (ocf:heartbeat:IPaddr2):         Started node1
          * S4H_ascs20        (ocf:heartbeat:SAPInstance):     Started node1
        * Resource Group: S4H_ERS29_group:
          * S4H_lvm_ers29     (ocf:heartbeat:LVM-activate):    Started node2
          * S4H_fs_ers29      (ocf:heartbeat:Filesystem):      Started node2
          * S4H_vip_ers29     (ocf:heartbeat:IPaddr2):         Started node2
          * S4H_ers29         (ocf:heartbeat:SAPInstance):     Started node2
    • All failures for the resources and resource groups have been cleared and the failcounts have been reset.
  • Test Procedure

    • Crash the node where ASCS is running.
  • Monitoring

    • During the test, run the following command in a separate terminal on one of the nodes where the ASCS group is currently not running:

      [root@node2]# watch -n 1 pcs status
  • Expected behavior

    • ASCS moves to the 3rd node.
    • ERS continues to run on the same node where it is already running.
  • Test

    • Crash the node where the ASCS group is currently running:

      [root@node1]# echo c > /proc/sysrq-trigger
    • ASCS moves to the 3rd node, without disturbing the ERS instance already running on the 2nd node:

      [root@node2]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
          * S4H_ascs20	(ocf:heartbeat:SAPInstance):	 Started node3
          * S4H_ers29	(ocf:heartbeat:SAPInstance):	 Started node2
  • Recovery Procedure

    • Clean up failed actions if any:

      [root@node2]# pcs resource cleanup
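The precondition that the cloned file systems are Started on all three nodes can also be checked in a script. The following minimal sketch parses a captured Clone Set snippet like the one shown in the preconditions above; only text parsing is done here.

```shell
# Sketch: check from captured `pcs status` text that a cloned file-system
# resource reports Started on all three nodes. The sample mirrors the
# Clone Set output above; only text parsing is done here.
clone='* Clone Set: s4h_fs_sapmnt-clone [fs_sapmnt]:
  * Started: [ node1 node2 node3 ]'

nodes=$(printf '%s\n' "$clone" | sed -n 's/.*Started: \[ \(.*\) \].*/\1/p')

for n in node1 node2 node3; do
  case " $nodes " in
    *" $n "*) echo "$n: ok" ;;
    *)        echo "$n: MISSING" ;;
  esac
done
```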
© 2024 Red Hat, Inc.