
Chapter 7. Testing the setup


Test your new HANA HA cluster thoroughly before you enable it for production workloads.

Enhance the basic example test cases with your specific requirements.

7.1. Detecting the system replication state changes

To test the correct functionality of the HanaSR HA/DR provider, monitor the sync state information in the logs and cluster attributes while you disrupt the system replication.

In this test, you use the primary site for monitoring the system replication status and for verifying the log messages. On a secondary instance you freeze the indexserver process to simulate a system replication issue while the primary remains fully intact.

Prerequisites

  • You have configured the mandatory HanaSR HA/DR provider.
  • Your HANA instances are in a healthy state on all cluster nodes and the system replication is in sync.

Procedure

  1. As user <sid>adm, go to the HANA Python directory on the primary site and check the current system replication state. Verify that it is ACTIVE and fully synced:

    rh1adm $ cdpy; python systemReplicationStatus.py
    …
    status system replication site "2": ACTIVE
    overall system replication status: ACTIVE
    …
  2. Verify that the srHook and srPoll cluster attributes are both SOK in the attributes summary of the secondary site. Run this command as the root user on any node in a separate terminal to keep track of the attribute changes:

    [root]# watch SAPHanaSR-showAttr
    ...
    Site lpt        lss mns      opMode    srHook srMode srPoll srr
    ----------------------------------------------------------------
    DC2  30         4   dc2hana1 logreplay SOK    sync   SOK    S
    DC1  1757076772 4   dc1hana1 logreplay PRIM   sync   PRIM   P
    ...

    The watch command reruns SAPHanaSR-showAttr in a loop at a default interval of 2 seconds, so you can keep this terminal open to track the attribute changes.

  3. On an instance on the secondary site, for example, dc2hana2, get the process ID (PID) of the hdbindexserver process. You can read it from the PID column of the HDB info output as user <sid>adm:

    rh1adm $ HDB info
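    If you prefer not to scan the HDB info table manually, a small pipeline can capture the PID directly. This is a sketch; it assumes that HDB info prints a ps-style table with the PID in the second column, which you must verify against your own output first:

```shell
# Capture the PID of the first hdbindexserver entry from "HDB info".
# Assumption: HDB info prints a ps-style table with the PID in column 2;
# verify this against your own output before using the value.
PID=$(HDB info | awk '/hdbindexserver/ {print $2; exit}')
echo "hdbindexserver PID: ${PID}"
```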
  4. On the same instance on the secondary site, use the PID to simulate a hanging hdbindexserver process by sending the STOP signal to the process. This freezes the process and blocks it from communicating and syncing the instance between the nodes:

    rh1adm $ kill -STOP <PID>
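    You can confirm that the signal took effect before you continue: a process frozen with SIGSTOP reports process state T in ps. A quick check, using the PID from the previous step:

```shell
# A process frozen with SIGSTOP shows process state "T" in ps.
# ${PID} is the hdbindexserver PID from the previous step.
ps -o stat= -p "${PID}"    # a state string starting with "T" confirms the freeze
```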

Verification

  1. On the primary site, watch the system replication status for the change on any primary instance. In the following example, the cut utility limits the output to certain fields for readability; remove it to see all columns of the table-formatted text output. In the example, freezing the indexserver on the secondary node dc2hana2 results in a replication error with that node’s counterpart on the primary site, dc1hana2:

    rh1adm $ cdpy; watch "python systemReplicationStatus.py | cut -d '|' -f 1-3,5,9,13-"
    ...
    |Database |Host     |Service Name |Secondary |Secondary     |Replication |Replication |Replication                   |Secondary    |
    |         |         |             |Host      |Active Status |Mode        |Status      |Status Details                |Fully Synced |
    |-------- |-------- |------------ |--------- |------------- |----------- |----------- |----------------------------- |------------ |
    |RH1      |dc1hana2 |indexserver  |dc2hana2  |YES           |SYNC        |ERROR       |Log shipping timeout occurred |       False |
    |SYSTEMDB |dc1hana1 |nameserver   |dc2hana1  |YES           |SYNC        |ACTIVE      |                              |        True |
    |RH1      |dc1hana1 |xsengine     |dc2hana1  |YES           |SYNC        |ACTIVE      |                              |        True |
    |RH1      |dc1hana1 |indexserver  |dc2hana1  |YES           |SYNC        |ACTIVE      |                              |        True |
    
    status system replication site "2": ERROR
    overall system replication status: ERROR
    ...

    The replication status changes to ERROR for the indexserver service after a short delay. An idle instance can take a while to react; wait a minute or more.

  2. On the primary site’s master name server node, check the HANA nameserver process log for the related messages as the <sid>adm user:

    rh1adm $ cdtrace; grep -he 'HanaSR.srConnectionChanged.*' nameserver_*
    ha_dr_HanaSR     HanaSR.py(00056) : HanaSR 1.001.1 HanaSR.srConnectionChanged method called with Dict={'hostname': 'dc1hana2', 'port': '30003', 'volume': 4, 'service_name': 'indexserver', 'database': 'RH1', 'status': 11, 'database_status': 11, 'system_status': 11, 'timestamp': '2025-09-12T11:15:08.003728+00:00', 'is_in_sync': False, 'system_is_in_sync': False, 'reason': '', 'siteName': 'DC2'}
    ha_dr_HanaSR     HanaSR.py(00065) : HanaSR HanaSR.srConnectionChanged system_status=11 SID=RH1 in_sync=False reason=
    ha_dr_HanaSR     HanaSR.py(00091) : HanaSR.srConnectionChanged() CALLING CRM: <sudo /usr/sbin/crm_attribute -n hana_rh1_site_srHook_DC2  -v SFAIL -t crm_config -s SAPHanaSR> ret_code=

    The nameserver process log contains the event that HANA triggers with details. It also includes the sudo command that the HanaSR hook script runs to update the srHook cluster attribute.

  3. Verify that both cluster attributes for the system replication status, srHook and srPoll, show the SFAIL status of the secondary site. Run the following as the root user on any HANA node or use the open terminal from the previous steps to watch the changes:

    [root]# SAPHanaSR-showAttr
    ...
    Site lpt        lss mns      opMode    srHook srMode srPoll srr
    ----------------------------------------------------------------
    DC2  10         4   dc2hana1 logreplay SFAIL  sync   SFAIL  S
    DC1  1757079061 4   dc1hana1 logreplay PRIM   sync   PRIM   P
    ...
  4. Unblock the previously frozen hdbindexserver PID to enable it again. Run this on the secondary instance on which you blocked the hdbindexserver process for the test:

    rh1adm $ kill -CONT <PID>
  5. Repeat the previous checks to verify that the system replication recovers fully. The cluster does not trigger any actions during this test because the resources remain running. Ensure that the system replication status is healthy and fully synced again, and that the cluster attributes for the secondary site are set back to SOK.
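To avoid rereading the attribute table manually, a small loop can wait until the secondary row reports SOK again. The awk field numbers assume the SAPHanaSR-showAttr column layout shown in the earlier example (srHook in column 6, srPoll in column 8); verify them against your output before relying on the loop:

```shell
# Poll SAPHanaSR-showAttr until one site row shows SOK for both the
# srHook (field 6) and srPoll (field 8) attributes.
until SAPHanaSR-showAttr | awk '$6 == "SOK" && $8 == "SOK" {found=1} END {exit !found}'; do
    sleep 5
done
echo "secondary site is back in sync"
```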

7.2. Triggering the indexserver crash recovery

Test the functionality of the ChkSrv HA/DR provider by simulating the crash of an hdbindexserver process. You can run this test on the primary or on the secondary site. The exact recovery actions depend on the overall configuration. The following steps demonstrate the activity when action_on_lost = stop is set in the hook configuration.

Prerequisites

  • You have configured the ChkSrv HA/DR provider. Skip this test if you have not configured this optional hook.
  • Your HANA instances have a healthy HANA system replication.
  • You have no failures in the cluster status.

Procedure

  1. Use a separate terminal to monitor the HANA processes as user <sid>adm on the instance on which you run this test:

    rh1adm $ watch "sapcontrol -nr ${TINSTANCE} -function GetProcessList | column -s ',' -t"
  2. In another terminal on the same HANA instance, kill the hdbindexserver process, using its PID from the GetProcessList output:

    rh1adm $ kill <PID>

Verification

  1. Check the dedicated HANA nameserver trace log on the same instance and identify the event and related action, as user <sid>adm:

    rh1adm $ cdtrace; less nameserver_chksrv.trc
    ...
    ChkSrv version 1.001.1. Method srServiceStateChanged method called.
    ChkSrv srServiceStateChanged method called with Dict={'hostname': 'dc2hana2',
     'service_name': 'indexserver', 'service_port': '30203', 'service_status': 'stopping',
     'service_previous_status': 'yes', 'timestamp': '2025-09-15T15:07:09.353198+00:00',
     'daemon_status': 'yes', 'database_id': '3', 'database_name': 'RH1',
     'database_status': 'yes', 'details': ''}
    ChkSrv srServiceStateChanged method called with SAPSYSTEMNAME=RH1
    srv:indexserver-30203-stopping-yes db:RH1-3-yes daem:yes
    LOST: indexserver event looks like a lost indexserver (status=stopping)
    LOST: stop instance. action_on_lost=stop
    ...
  2. Check the cluster status for resource failure information on any cluster node, as user root:

    [root]# pcs status --full
    ...
    
    Failed Resource Actions:
      * rsc_SAPHanaCon_RH1_HDB02_monitor_61000 on dc2hana1 'not running' (7): call=26, status='complete', ...
    
    ...
  3. Check the system log for the related cluster actions on the test node, for example, dc2hana2, as user root:

    [root]# grep rsc_SAPHanaCon_RH1_HDB02 /var/log/messages
    ...
    Sep 15 15:08:17 dc2hana1 pacemaker-controld[17045]: notice: Result of monitor operation for rsc_SAPHanaCon_RH1_HDB02 on dc2hana1: not running
    Sep 15 15:08:17 dc2hana1 pacemaker-controld[17045]: notice: rsc_SAPHanaCon_RH1_HDB02_monitor_61000@dc2hana1 output [ 10 ]
    Sep 15 15:08:17 dc2hana1 pacemaker-controld[17045]: notice: Transition 32 action 29 (rsc_SAPHanaCon_RH1_HDB02_monitor_61000 on dc2hana1): expected 'ok' but got 'not running'
    Sep 15 15:08:17 dc2hana1 pacemaker-attrd[17043]: notice: Setting last-failure-rsc_SAPHanaCon_RH1_HDB02#monitor_61000[dc2hana1] in instance_attributes: (unset) -> 1757948897
    Sep 15 15:08:17 dc2hana1 pacemaker-attrd[17043]: notice: Setting fail-count-rsc_SAPHanaCon_RH1_HDB02#monitor_61000[dc2hana1] in instance_attributes: (unset) -> 1
    Sep 15 15:08:17 dc2hana1 pacemaker-schedulerd[17044]: warning: Unexpected result (not running) was recorded for monitor of rsc_SAPHanaCon_RH1_HDB02:2 on dc2hana1 at Sep 15 15:08:17 2025
    Sep 15 15:08:17 dc2hana1 pacemaker-schedulerd[17044]: notice: Actions: Recover    rsc_SAPHanaCon_RH1_HDB02:2     (             Unpromoted dc2hana1 )
    Sep 15 15:08:17 dc2hana1 pacemaker-schedulerd[17044]: warning: Unexpected result (not running) was recorded for monitor of rsc_SAPHanaCon_RH1_HDB02:2 on dc2hana1 at Sep 15 15:08:17 2025
    Sep 15 15:08:17 dc2hana1 pacemaker-schedulerd[17044]: notice: Actions: Recover    rsc_SAPHanaCon_RH1_HDB02:2     (             Unpromoted dc2hana1 )
    Sep 15 15:08:17 dc2hana1 pacemaker-controld[17045]: notice: Initiating stop operation rsc_SAPHanaCon_RH1_HDB02_stop_0 locally on dc2hana1
    Sep 15 15:08:17 dc2hana1 pacemaker-controld[17045]: notice: Requesting local execution of stop operation for rsc_SAPHanaCon_RH1_HDB02 on dc2hana1
    ...

    The next SAPHanaController resource monitor reports the unexpectedly stopped HANA instance as a failure and initiates the recovery steps according to the configuration. If PREFER_SITE_TAKEOVER is enabled and you executed the test on a primary instance, it triggers a HANA takeover to the secondary site.
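    Before you continue with the next test, clear the recorded failure so that later tests start from a clean cluster status. This is a sketch; the resource name matches the examples in this chapter, so adjust it for your SID and instance number:

```shell
# If the cluster still records failed resource actions, reset the
# failure history and fail count of the SAPHanaController resource.
if pcs status --full | grep -q 'Failed Resource Actions'; then
    pcs resource cleanup rsc_SAPHanaCon_RH1_HDB02
fi
```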

7.3. Triggering a HANA takeover using cluster commands

Move the promoted resource to the other site with a cluster command to test a planned takeover from the primary site to the secondary site.

Prerequisites

  • Your HANA instances have a healthy HANA system replication.
  • You have no failures in the cluster status.

Procedure

  • Switch the primary site to the secondary site. Run the cluster command as user root on any node:

    [root]# pcs resource move cln_SAPHanaCon_<SID>_HDB<instance>
    Location constraint to move resource 'cln_SAPHanaCon_RH1_HDB02' has been created
    Waiting for the cluster to apply configuration changes...
    Location constraint created to move resource 'cln_SAPHanaCon_RH1_HDB02' has been removed
    Waiting for the cluster to apply configuration changes...
    resource 'cln_SAPHanaCon_RH1_HDB02' is promoted on node 'dc2hana1'

Verification

  • Verify that the SAPHanaController resource is now promoted on the other site:

    [root]# pcs resource status cln_SAPHanaCon_RH1_HDB02
      * Clone Set: cln_SAPHanaCon_RH1_HDB02 [rsc_SAPHanaCon_RH1_HDB02] (promotable):
        * Promoted: [ dc2hana1 ]
        * Unpromoted: [ dc1hana2 dc2hana2 ]
        * Stopped: [ dc1hana1 ]

    The status of the previous primary instance depends on the AUTOMATED_REGISTER parameter of the SAPHanaController resource. When AUTOMATED_REGISTER is false, the instance stays stopped until you intervene manually; otherwise, the instance restarts automatically and registers as the new secondary instance.
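    You can confirm which behavior to expect by inspecting the resource configuration. This is a sketch; the resource name follows the examples in this chapter:

```shell
# Show the SAPHanaController parameters that control the post-takeover
# behavior of the former primary instance.
pcs resource config rsc_SAPHanaCon_RH1_HDB02 \
    | grep -E 'AUTOMATED_REGISTER|PREFER_SITE_TAKEOVER'
```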

7.4. Triggering the SAPHanaFilesystem failure action

Block write access to the monitored directory to test the correct behavior of the SAPHanaFilesystem resource. You can run this test on any instance, but only a primary instance triggers a failure and recovery action; on a secondary node the resource does not trigger an action.

Prerequisites

  • You have configured the SAPHanaFilesystem resource. Skip this test if you have not configured this optional resource.

Procedure

  1. Create a temporary file on a local filesystem of the node you want to test:

    [root]# touch /tmp/test
  2. Set the local file to be immutable, which prevents write access:

    [root]# chattr +i /tmp/test
  3. Go to the hidden directory that the SAPHanaFilesystem resource uses to test read and write filesystem access, and change into the subdirectory of the node you want to test:

    [root]# cd /hana/shared/<SID>/.heartbeat_SAPHanaFilesystem/<node>
  4. Change the test file, which the SAPHanaFilesystem resource creates, to become a symbolic link that points to the temporary local file. Since the temporary target test file cannot be modified, the resource fails after the next monitor cycle:

    [root]# ln -sf /tmp/test test

    NFS filesystems do not support extended attributes by default. The symbolic link bridges this gap for the test.

  5. Verify the behavior during the simulated failure.

    1. You can check the /var/log/messages file for the related log message if the resource action is set to ignore:

      [root]# grep -e 'SAPHanaFil.*ON_FAIL_ACTION' /var/log/messages
      ... SAPHanaFilesystem(rsc_SAPHanaFil_RH1_HDB02)[715184]: INFO: -2- RA monitor() ON_FAIL_ACTION=ignore => ignore FS error, do not create poison pill file
    2. If the resource action is set to fence you can observe the fencing action:

      [root]# pcs status --full
      ...
      
      Failed Resource Actions:
        * rsc_SAPHanaFil_RH1_HDB02_stop_0 on dc1hana1 'error' (1): ...
      
      Pending Fencing Actions:
        * reboot of dc1hana1 pending: client=pacemaker-controld.1694, origin=dc1hana2
  6. Remove the blocker again after the test. If the node was fenced, delete the symbolic link after the node is running again. The resource re-creates the regular test file during the next check:

    [root]# rm -f /hana/shared/<SID>/.heartbeat_SAPHanaFilesystem/<node>/test
  7. Clean up the temporary local test file:

    [root]# chattr -i /tmp/test; rm -f /tmp/test

7.5. Crashing the node with a primary instance

Simulate the crash of the cluster node on which a primary instance is running to test the behavior of your HANA cluster resources.

Prerequisites

  • Your HANA instances have a healthy HANA system replication.
  • You have no failures in the cluster status.

Procedure

  • Trigger a crash on a HANA node on the primary site. This command immediately causes a crash of the node with no further warning:

    [root]# echo c > /proc/sysrq-trigger

Verification

  • The cluster detects the failed node and fences it. You can watch the cluster activity on any of the remaining nodes:

    [root]# pcs status --full
    ...
    Pending Fencing Actions:
      * reboot of dc1hana1 pending: client=pacemaker-controld.1685, origin=dc1hana2
    ...
  • The secondary site takes over and becomes promoted as the new primary.
  • The fenced former primary node recovers according to your fencing and SAPHanaController resource configuration.

7.6. Crashing the node with a secondary instance

Simulate the crash of the cluster node on which a secondary instance is running to test the behavior of your HANA cluster resources.

Procedure

  • Trigger a crash of a HANA node on the secondary site. This command immediately causes a crash of the node with no further warning:

    [root]# echo c > /proc/sysrq-trigger

Verification

  • The cluster detects the failed node and fences it. You can watch the cluster activity on any of the remaining nodes:

    [root]# pcs status --full
    ...
    Pending Fencing Actions:
      * reboot of dc2hana1 pending: client=pacemaker-controld.1694, origin=dc1hana1
    ...
  • The primary site remains running while the secondary node restarts and recovers. The fenced node recovery depends on your fencing configuration.

7.7. Stopping the primary site using SAP commands

Test the behavior of the cluster when you manage the primary HANA site outside of the cluster using HANA commands.

Since the cluster is not aware of the execution of HANA commands, it detects the change as a failure and triggers the configured recovery actions.

Prerequisites

  • Your HANA instances have a healthy HANA system replication.
  • You have no failures in the cluster status.

Procedure

  • Stop the primary HANA site as the <sid>adm user outside of the cluster. Run on one HANA instance on the primary site:

    rh1adm $ sapcontrol -nr ${TINSTANCE} -function StopSystem HDB

Verification

  • The cluster detects the stopped instance as a failure and initiates the recovery of the primary site:

    [root]# pcs status --full
    ...
    Migration Summary:
      * Node: dc1hana1 (1):
        * rsc_SAPHanaCon_RH1_HDB02: migration-threshold=5000 fail-count=1 last-failure=...
    
    Failed Resource Actions:
      * rsc_SAPHanaCon_RH1_HDB02_monitor_61000 on dc1hana1 'not running' ...

    If you configured and enabled both the PREFER_SITE_TAKEOVER and AUTOMATED_REGISTER parameters in the SAPHanaController resource, the cluster triggers a HANA takeover to the secondary site and automatically registers the failed primary as the new secondary. Otherwise it recovers the failed primary according to your configuration.
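    To see at a glance which site currently holds the primary role after the recovery, you can filter the SAPHanaSR-showAttr output. The awk field numbers assume the column layout shown earlier in this chapter (site name in column 1, srr in column 9); verify them against your output:

```shell
# Print the name of the site whose srr attribute (field 9) is "P",
# that is, the site that currently runs the HANA primary.
SAPHanaSR-showAttr | awk '$9 == "P" {print $1}'
```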

7.8. Stopping the secondary site using SAP commands

Test the behavior of the cluster when you manage the secondary HANA site outside of the cluster using HANA commands.

Since the cluster is not aware of the execution of HANA commands, it detects the change as a failure and triggers the configured recovery actions.

Prerequisites

  • You have no failures in the cluster status.

Procedure

  • Stop the secondary HANA site as the <sid>adm user outside of the cluster. Run on one HANA instance on the secondary site:

    rh1adm $ sapcontrol -nr ${TINSTANCE} -function StopSystem HDB

Verification

  • The cluster detects the stopped instance as a failure and recovers the secondary site:

    [root]# pcs status --full
    ...
    Migration Summary:
      * Node: dc2hana1 (2):
        * rsc_SAPHanaCon_RH1_HDB02: migration-threshold=5000 fail-count=1 last-failure=...
    
    Failed Resource Actions:
      * rsc_SAPHanaCon_RH1_HDB02_monitor_61000 on dc2hana1 'not running' ...
