
Chapter 7. Testing the setup


Test your new HANA HA cluster thoroughly before you enable it for production workloads.

Enhance the basic example test cases with your specific requirements.

7.1. Detecting the system replication state changes

To test the correct functionality of the HanaSR HA/DR provider, monitor the sync state information in the logs and cluster attributes while you disrupt the system replication.

In this test, you use the primary site for monitoring the system replication status and for verifying the log messages. On a secondary instance you freeze the indexserver process to simulate a system replication issue while the primary remains fully intact.

Prerequisites

  • You have configured the mandatory HanaSR HA/DR provider.
  • Your HANA instances are in a healthy state on all cluster nodes and the system replication is in sync.

Procedure

  1. As user <sid>adm, go to the HANA Python directory on the primary site and check the current system replication state. Verify that it is ACTIVE and fully synced:

    rh1adm $ cdpy; python systemReplicationStatus.py
    …
    status system replication site "2": ACTIVE
    overall system replication status: ACTIVE
    …
  2. Verify that the srHook and srPoll cluster attributes are both SOK in the attributes summary of the secondary site. Run this command as the root user on any node in a separate terminal to keep track of the attribute changes:

    [root]# watch SAPHanaSR-showAttr
    ...
    Site lpt        lss mns      opMode    srHook srMode srPoll srr
    ----------------------------------------------------------------
    DC2  30         4   dc2hana1 logreplay SOK    sync   SOK    S
    DC1  1757076772 4   dc1hana1 logreplay PRIM   sync   PRIM   P
    ...

    The watch command reruns SAPHanaSR-showAttr in a loop at a default interval of 2 seconds, so you can keep this terminal open to track the attribute changes.

  3. On an instance on the secondary site, for example, dc2hana2, get the process ID (PID) of the hdbindexserver process. You can read it from the PID column of the HDB info output as user <sid>adm:

    rh1adm $ HDB info
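    If you prefer not to scan the HDB info table manually, a small pipeline can capture the PID directly. This is a sketch; it assumes that HDB info prints a ps-style table with the PID in the second column, which you must verify against your own output first:

```shell
# Capture the PID of the first hdbindexserver entry from "HDB info".
# Assumption: HDB info prints a ps-style table with the PID in column 2;
# verify this against your own output before using the value.
PID=$(HDB info | awk '/hdbindexserver/ {print $2; exit}')
echo "hdbindexserver PID: ${PID}"
```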
  4. On the same instance on the secondary site, use the PID to simulate a hanging hdbindexserver process by sending the STOP signal to the process. This freezes the process and blocks it from communicating and syncing the instance between the nodes:

    rh1adm $ kill -STOP <PID>
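    You can confirm that the signal took effect before you continue: a process frozen with SIGSTOP reports process state T in ps. A quick check, using the PID from the previous step:

```shell
# A process frozen with SIGSTOP shows process state "T" in ps.
# ${PID} is the hdbindexserver PID from the previous step.
ps -o stat= -p "${PID}"    # a state string starting with "T" confirms the freeze
```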

Verification

  1. On the primary site, watch the system replication status for the change on any primary instance. In the following example, the cut utility limits the output to certain fields for readability; remove it to see all columns of the table-formatted text output. In the example, freezing the indexserver on the secondary node dc2hana2 results in a replication error with that node’s counterpart on the primary site, dc1hana2:

    rh1adm $ cdpy; watch "python systemReplicationStatus.py | cut -d '|' -f 1-3,5,9,13-"
    ...
    |Database |Host     |Service Name |Secondary |Secondary     |Replication |Replication |Replication                   |Secondary    |
    |         |         |             |Host      |Active Status |Mode        |Status      |Status Details                |Fully Synced |
    |-------- |-------- |------------ |--------- |------------- |----------- |----------- |----------------------------- |------------ |
    |RH1      |dc1hana2 |indexserver  |dc2hana2  |YES           |SYNC        |ERROR       |Log shipping timeout occurred |       False |
    |SYSTEMDB |dc1hana1 |nameserver   |dc2hana1  |YES           |SYNC        |ACTIVE      |                              |        True |
    |RH1      |dc1hana1 |xsengine     |dc2hana1  |YES           |SYNC        |ACTIVE      |                              |        True |
    |RH1      |dc1hana1 |indexserver  |dc2hana1  |YES           |SYNC        |ACTIVE      |                              |        True |
    
    status system replication site "2": ERROR
    overall system replication status: ERROR
    ...

    The replication status changes to ERROR for the indexserver service after a short delay. An idle instance can take a while to react; wait a minute or more.

  2. On the primary site’s master name server node, check the HANA nameserver process log for the related messages as the <sid>adm user:

    rh1adm $ cdtrace; grep -he 'HanaSR.srConnectionChanged.*' nameserver_*
    ha_dr_HanaSR     HanaSR.py(00056) : HanaSR 1.001.1 HanaSR.srConnectionChanged method called with Dict={'hostname': 'dc1hana2', 'port': '30003', 'volume': 4, 'service_name': 'indexserver', 'database': 'RH1', 'status': 11, 'database_status': 11, 'system_status': 11, 'timestamp': '2025-09-12T11:15:08.003728+00:00', 'is_in_sync': False, 'system_is_in_sync': False, 'reason': '', 'siteName': 'DC2'}
    ha_dr_HanaSR     HanaSR.py(00065) : HanaSR HanaSR.srConnectionChanged system_status=11 SID=RH1 in_sync=False reason=
    ha_dr_HanaSR     HanaSR.py(00091) : HanaSR.srConnectionChanged() CALLING CRM: <sudo /usr/sbin/crm_attribute -n hana_rh1_site_srHook_DC2  -v SFAIL -t crm_config -s SAPHanaSR> ret_code=

    The nameserver process log contains the event that HANA triggers with details. It also includes the sudo command that the HanaSR hook script runs to update the srHook cluster attribute.

  3. Verify that both cluster attributes for the system replication status, srHook and srPoll, show the SFAIL status of the secondary site. Run the following as the root user on any HANA node or use the open terminal from the previous steps to watch the changes:

    [root]# SAPHanaSR-showAttr
    ...
    Site lpt        lss mns      opMode    srHook srMode srPoll srr
    ----------------------------------------------------------------
    DC2  10         4   dc2hana1 logreplay SFAIL  sync   SFAIL  S
    DC1  1757079061 4   dc1hana1 logreplay PRIM   sync   PRIM   P
    ...
  4. Unblock the previously frozen hdbindexserver PID to enable it again. Run this on the secondary instance on which you blocked the hdbindexserver process for the test:

    rh1adm $ kill -CONT <PID>
  5. Repeat the previous checks to verify that the system replication recovers fully. The cluster does not trigger any actions during this test because the resources remain running. Ensure that the system replication status is healthy and fully synced again, and that the cluster attributes for the secondary site are set back to SOK.
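To avoid rereading the attribute table manually, a small loop can wait until the secondary row reports SOK again. The awk field numbers assume the SAPHanaSR-showAttr column layout shown in the earlier example (srHook in column 6, srPoll in column 8); verify them against your output before relying on the loop:

```shell
# Poll SAPHanaSR-showAttr until one site row shows SOK for both the
# srHook (field 6) and srPoll (field 8) attributes.
until SAPHanaSR-showAttr | awk '$6 == "SOK" && $8 == "SOK" {found=1} END {exit !found}'; do
    sleep 5
done
echo "secondary site is back in sync"
```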

7.2. Triggering the indexserver crash recovery

Test the functionality of the ChkSrv HA/DR provider by simulating the crash of an hdbindexserver process. You can run this test on the primary or on the secondary site. The exact recovery actions depend on the overall configuration. The following steps demonstrate the activity when action_on_lost = stop is set in the hook configuration.

Prerequisites

  • You have configured the ChkSrv HA/DR provider. Skip this test if you have not configured this optional hook.
  • Your HANA instances have a healthy HANA system replication.
  • You have no failures in the cluster status.

Procedure

  1. Use a separate terminal to monitor the HANA processes as user <sid>adm on the instance on which you run this test:

    rh1adm $ watch "sapcontrol -nr ${TINSTANCE} -function GetProcessList | column -s ',' -t"
  2. In another terminal on the same HANA instance, kill the hdbindexserver process, using its PID from the GetProcessList output:

    rh1adm $ kill <PID>

Verification

  1. Check the dedicated HANA nameserver trace log on the same instance and identify the event and related action, as user <sid>adm:

    rh1adm $ cdtrace; less nameserver_chksrv.trc
    ...
    ChkSrv version 1.001.1. Method srServiceStateChanged method called.
    ChkSrv srServiceStateChanged method called with Dict={'hostname': 'dc2hana2',
     'service_name': 'indexserver', 'service_port': '30203', 'service_status': 'stopping',
     'service_previous_status': 'yes', 'timestamp': '2025-09-15T15:07:09.353198+00:00',
     'daemon_status': 'yes', 'database_id': '3', 'database_name': 'RH1',
     'database_status': 'yes', 'details': ''}
    ChkSrv srServiceStateChanged method called with SAPSYSTEMNAME=RH1
    srv:indexserver-30203-stopping-yes db:RH1-3-yes daem:yes
    LOST: indexserver event looks like a lost indexserver (status=stopping)
    LOST: stop instance. action_on_lost=stop
    ...
  2. Check the cluster status for resource failure information on any cluster node, as user root:

    [root]# pcs status --full
    ...
    
    Failed Resource Actions:
      * rsc_SAPHanaCon_RH1_HDB02_monitor_61000 on dc2hana1 'not running' (7): call=26, status='complete', ...
    
    ...
  3. Check the system log for the related cluster actions on the test node, for example, dc2hana2, as user root:

    [root]# grep rsc_SAPHanaCon_RH1_HDB02 /var/log/messages
    ...
    Sep 15 15:08:17 dc2hana1 pacemaker-controld[17045]: notice: Result of monitor operation for rsc_SAPHanaCon_RH1_HDB02 on dc2hana1: not running
    Sep 15 15:08:17 dc2hana1 pacemaker-controld[17045]: notice: rsc_SAPHanaCon_RH1_HDB02_monitor_61000@dc2hana1 output [ 10 ]
    Sep 15 15:08:17 dc2hana1 pacemaker-controld[17045]: notice: Transition 32 action 29 (rsc_SAPHanaCon_RH1_HDB02_monitor_61000 on dc2hana1): expected 'ok' but got 'not running'
    Sep 15 15:08:17 dc2hana1 pacemaker-attrd[17043]: notice: Setting last-failure-rsc_SAPHanaCon_RH1_HDB02#monitor_61000[dc2hana1] in instance_attributes: (unset) -> 1757948897
    Sep 15 15:08:17 dc2hana1 pacemaker-attrd[17043]: notice: Setting fail-count-rsc_SAPHanaCon_RH1_HDB02#monitor_61000[dc2hana1] in instance_attributes: (unset) -> 1
    Sep 15 15:08:17 dc2hana1 pacemaker-schedulerd[17044]: warning: Unexpected result (not running) was recorded for monitor of rsc_SAPHanaCon_RH1_HDB02:2 on dc2hana1 at Sep 15 15:08:17 2025
    Sep 15 15:08:17 dc2hana1 pacemaker-schedulerd[17044]: notice: Actions: Recover    rsc_SAPHanaCon_RH1_HDB02:2     (             Unpromoted dc2hana1 )
    Sep 15 15:08:17 dc2hana1 pacemaker-schedulerd[17044]: warning: Unexpected result (not running) was recorded for monitor of rsc_SAPHanaCon_RH1_HDB02:2 on dc2hana1 at Sep 15 15:08:17 2025
    Sep 15 15:08:17 dc2hana1 pacemaker-schedulerd[17044]: notice: Actions: Recover    rsc_SAPHanaCon_RH1_HDB02:2     (             Unpromoted dc2hana1 )
    Sep 15 15:08:17 dc2hana1 pacemaker-controld[17045]: notice: Initiating stop operation rsc_SAPHanaCon_RH1_HDB02_stop_0 locally on dc2hana1
    Sep 15 15:08:17 dc2hana1 pacemaker-controld[17045]: notice: Requesting local execution of stop operation for rsc_SAPHanaCon_RH1_HDB02 on dc2hana1
    ...

    The next SAPHanaController resource monitor reports the unexpectedly stopped HANA instance as a failure and initiates the recovery steps according to the configuration. If PREFER_SITE_TAKEOVER is enabled and you executed the test on a primary instance, it triggers a HANA takeover to the secondary site.
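    Before you continue with the next test, clear the recorded failure so that later tests start from a clean cluster status. This is a sketch; the resource name matches the examples in this chapter, so adjust it for your SID and instance number:

```shell
# If the cluster still records failed resource actions, reset the
# failure history and fail count of the SAPHanaController resource.
if pcs status --full | grep -q 'Failed Resource Actions'; then
    pcs resource cleanup rsc_SAPHanaCon_RH1_HDB02
fi
```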

7.3. Triggering a HANA takeover using cluster commands

Move the promoted resource to the other site with a cluster command to test a planned takeover from the primary site to the secondary site.

Prerequisites

  • Your HANA instances have a healthy HANA system replication.
  • You have no failures in the cluster status.

Procedure

  • Switch the primary site to the secondary site. Run the cluster command as user root on any node:

    [root]# pcs resource move cln_SAPHanaCon_<SID>_HDB<instance>
    Location constraint to move resource 'cln_SAPHanaCon_RH1_HDB02' has been created
    Waiting for the cluster to apply configuration changes...
    Location constraint created to move resource 'cln_SAPHanaCon_RH1_HDB02' has been removed
    Waiting for the cluster to apply configuration changes...
    resource 'cln_SAPHanaCon_RH1_HDB02' is promoted on node 'dc2hana1'

Verification

  • Verify that the SAPHanaController resource is now promoted on the other site:

    [root]# pcs resource status cln_SAPHanaCon_RH1_HDB02
      * Clone Set: cln_SAPHanaCon_RH1_HDB02 [rsc_SAPHanaCon_RH1_HDB02] (promotable):
        * Promoted: [ dc2hana1 ]
        * Unpromoted: [ dc1hana2 dc2hana2 ]
        * Stopped: [ dc1hana1 ]

    The status of the previous primary instance depends on the AUTOMATED_REGISTER parameter of the SAPHanaController resource. When AUTOMATED_REGISTER is false, the instance stays stopped until you intervene manually; otherwise, the instance restarts automatically and registers as the new secondary instance.
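    You can confirm which behavior to expect by inspecting the resource configuration. This is a sketch; the resource name follows the examples in this chapter:

```shell
# Show the SAPHanaController parameters that control the post-takeover
# behavior of the former primary instance.
pcs resource config rsc_SAPHanaCon_RH1_HDB02 \
    | grep -E 'AUTOMATED_REGISTER|PREFER_SITE_TAKEOVER'
```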

7.4. Triggering the SAPHanaFilesystem failure action

Block write access to the monitored directory to test the correct behavior of the SAPHanaFilesystem resource. You can run this test on any instance, but only a primary instance triggers a failure and recovery action; on a secondary node the resource does not trigger an action.

Prerequisites

  • You have configured the SAPHanaFilesystem resource. Skip this test if you have not configured this optional resource.

Procedure

  1. Create a temporary file on a local filesystem of the node you want to test:

    [root]# touch /tmp/test
  2. Set the local file to be immutable, which prevents write access:

    [root]# chattr +i /tmp/test
  3. Go to the hidden directory that the SAPHanaFilesystem resource uses to test read and write filesystem access, and change into the subdirectory of the node you want to test:

    [root]# cd /hana/shared/<SID>/.heartbeat_SAPHanaFilesystem/<node>
  4. Change the test file, which the SAPHanaFilesystem resource creates, to become a symbolic link that points to the temporary local file. Since the temporary target test file cannot be modified, the resource fails after the next monitor cycle:

    [root]# ln -sf /tmp/test test

    NFS filesystems do not support extended attributes by default. The symbolic link bridges this gap for the test.

  5. Verify the behavior during the simulated failure.

    1. You can check the /var/log/messages file for the related log message if the resource action is set to ignore:

      [root]# grep -e 'SAPHanaFil.*ON_FAIL_ACTION' /var/log/messages
      ... SAPHanaFilesystem(rsc_SAPHanaFil_RH1_HDB02)[715184]: INFO: -2- RA monitor() ON_FAIL_ACTION=ignore => ignore FS error, do not create poison pill file
    2. If the resource action is set to fence you can observe the fencing action:

      [root]# pcs status --full
      ...
      
      Failed Resource Actions:
        * rsc_SAPHanaFil_RH1_HDB02_stop_0 on dc1hana1 'error' (1): ...
      
      Pending Fencing Actions:
        * reboot of dc1hana1 pending: client=pacemaker-controld.1694, origin=dc1hana2
  6. Remove the blocker again after the test. If the node was fenced, delete the symbolic link after the node is running again. The resource re-creates the regular test file during the next check:

    [root]# rm -f /hana/shared/<SID>/.heartbeat_SAPHanaFilesystem/<node>/test
  7. Clean up the temporary local test file:

    [root]# chattr -i /tmp/test; rm -f /tmp/test

7.5. Crashing the node with a primary instance

Simulate the crash of the cluster node on which a primary instance is running to test the behavior of your HANA cluster resources.

Prerequisites

  • Your HANA instances have a healthy HANA system replication.
  • You have no failures in the cluster status.

Procedure

  • Trigger a crash on a HANA node on the primary site. This command immediately causes a crash of the node with no further warning:

    [root]# echo c > /proc/sysrq-trigger

Verification

  • The cluster detects the failed node and fences it. You can watch the cluster activity on any of the remaining nodes:

    [root]# pcs status --full
    ...
    Pending Fencing Actions:
      * reboot of dc1hana1 pending: client=pacemaker-controld.1685, origin=dc1hana2
    ...
  • The secondary site takes over and becomes promoted as the new primary.
  • The fenced former primary node recovers according to your fencing and SAPHanaController resource configuration.

7.6. Crashing the node with a secondary instance

Simulate the crash of the cluster node on which a secondary instance is running to test the behavior of your HANA cluster resources.

Procedure

  • Trigger a crash of a HANA node on the secondary site. This command immediately causes a crash of the node with no further warning:

    [root]# echo c > /proc/sysrq-trigger

Verification

  • The cluster detects the failed node and fences it. You can watch the cluster activity on any of the remaining nodes:

    [root]# pcs status --full
    ...
    Pending Fencing Actions:
      * reboot of dc2hana1 pending: client=pacemaker-controld.1694, origin=dc1hana1
    ...
  • The primary site remains running while the secondary node restarts and recovers. The fenced node recovery depends on your fencing configuration.

7.7. Stopping the primary site using SAP commands

Test the behavior of the cluster when you manage the primary HANA site outside of the cluster using HANA commands.

Since the cluster is not aware of the execution of HANA commands, it detects the change as a failure and triggers the configured recovery actions.

Prerequisites

  • Your HANA instances have a healthy HANA system replication.
  • You have no failures in the cluster status.

Procedure

  • Stop the primary HANA site as the <sid>adm user outside of the cluster. Run on one HANA instance on the primary site:

    rh1adm $ sapcontrol -nr ${TINSTANCE} -function StopSystem HDB

Verification

  • The cluster detects the stopped instance as a failure and initiates the recovery of the primary site:

    [root]# pcs status --full
    ...
    Migration Summary:
      * Node: dc1hana1 (1):
        * rsc_SAPHanaCon_RH1_HDB02: migration-threshold=5000 fail-count=1 last-failure=...
    
    Failed Resource Actions:
      * rsc_SAPHanaCon_RH1_HDB02_monitor_61000 on dc1hana1 'not running' ...

    If you configured and enabled both the PREFER_SITE_TAKEOVER and AUTOMATED_REGISTER parameters in the SAPHanaController resource, the cluster triggers a HANA takeover to the secondary site and automatically registers the failed primary as the new secondary. Otherwise it recovers the failed primary according to your configuration.
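    To see at a glance which site currently holds the primary role after the recovery, you can filter the SAPHanaSR-showAttr output. The awk field numbers assume the column layout shown earlier in this chapter (site name in column 1, srr in column 9); verify them against your output:

```shell
# Print the name of the site whose srr attribute (field 9) is "P",
# that is, the site that currently runs the HANA primary.
SAPHanaSR-showAttr | awk '$9 == "P" {print $1}'
```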

7.8. Stopping the secondary site using SAP commands

Test the behavior of the cluster when you manage the secondary HANA site outside of the cluster using HANA commands.

Since the cluster is not aware of the execution of HANA commands, it detects the change as a failure and triggers the configured recovery actions.

Prerequisites

  • You have no failures in the cluster status.

Procedure

  • Stop the secondary HANA site as the <sid>adm user outside of the cluster. Run on one HANA instance on the secondary site:

    rh1adm $ sapcontrol -nr ${TINSTANCE} -function StopSystem HDB

Verification

  • The cluster detects the stopped instance as a failure and recovers the secondary site:

    [root]# pcs status --full
    ...
    Migration Summary:
      * Node: dc2hana1 (2):
        * rsc_SAPHanaCon_RH1_HDB02: migration-threshold=5000 fail-count=1 last-failure=...
    
    Failed Resource Actions:
      * rsc_SAPHanaCon_RH1_HDB02_monitor_61000 on dc2hana1 'not running' ...
