Chapter 7. Testing the setup
Test your new HANA HA cluster thoroughly before you enable it for production workloads.
Enhance the basic example test cases with your specific requirements.
7.1. Detecting the system replication state changes
To test the correct functionality of the HanaSR HA/DR provider, disrupt the system replication and monitor the resulting sync state changes in the logs and cluster attributes.
In this test, you use the primary site for monitoring the system replication status and for verifying the log messages. On a secondary instance you freeze the indexserver process to simulate a system replication issue while the primary remains fully intact.
Prerequisites
- You have configured the mandatory HanaSR HA/DR provider.
- Your HANA instances are in a healthy state on all cluster nodes and the system replication is in sync.
Procedure
1. As user <sid>adm, go to the HANA Python directory on the primary site and check the current system replication state. Verify that it is ACTIVE and fully synced:

      rh1adm $ cdpy; python systemReplicationStatus.py
      …
      status system replication site "2": ACTIVE
      overall system replication status: ACTIVE
      …

2. Verify that the srHook and srPoll cluster attributes are both SOK in the attributes summary of the secondary site. Run this command as the root user on any node in a separate terminal to keep track of the attribute changes:

      [root]# watch SAPHanaSR-showAttr
      ...
      Site lpt        lss mns      opMode    srHook srMode srPoll srr
      ----------------------------------------------------------------
      DC2  30         4   dc2hana1 logreplay SOK    sync   SOK    S
      DC1  1757076772 4   dc1hana1 logreplay PRIM   sync   PRIM   P
      ...

   You can use the watch command to run the command in a loop at a default interval of 2 seconds.

3. On an instance on the secondary site, for example, dc2hana2, get the process ID (PID) of the hdbindexserver process. For example, you can get it from the PID column of the HDB info output as user <sid>adm:

      rh1adm $ HDB info

4. On the same instance on the secondary site, use the PID to simulate a hanging hdbindexserver process by sending the STOP signal to the process. This freezes the process and blocks it from communicating and syncing the instance between the nodes:

      rh1adm $ kill -STOP <PID>
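The freeze and unfreeze mechanics used in this test are plain POSIX signal handling, so you can try them safely outside the cluster first. The following generic sketch uses a disposable sleep process as a stand-in for hdbindexserver and reads the process state from ps, where T means stopped:

```shell
# Generic demo of SIGSTOP/SIGCONT; a disposable 'sleep' stands in for hdbindexserver.
sleep 60 &
pid=$!

kill -STOP "$pid"                        # freeze the process, as done to hdbindexserver
sleep 1
frozen=$(ps -o stat= -p "$pid" | cut -c1)
echo "state while frozen: $frozen"       # T = stopped

kill -CONT "$pid"                        # unfreeze, as in the later recovery step
sleep 1
resumed=$(ps -o stat= -p "$pid" | cut -c1)
echo "state after CONT: $resumed"        # S = sleeping again

kill "$pid"                              # clean up the demo process
```

A frozen process is not terminated; it simply stops being scheduled, which is why the HANA instance appears hung rather than crashed.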
Verification
1. On the primary site, watch the system replication status for the change on any primary instance. In the following example, the system's cut utility limits the output to certain fields for readability; remove it to see all columns of the table-formatted text output. In the example, the frozen indexserver on the secondary node dc2hana2 results in a replication error with that node's counterpart on the primary site, dc1hana2:

      rh1adm $ cdpy; watch "python systemReplicationStatus.py | cut -d '|' -f 1-3,5,9,13-"
      ...
      |Database |Host     |Service Name |Secondary |Secondary     |Replication |Replication |Replication                   |Secondary    |
      |         |         |             |Host      |Active Status |Mode        |Status      |Status Details                |Fully Synced |
      |-------- |-------- |------------ |--------- |------------- |----------- |----------- |----------------------------- |------------ |
      |RH1      |dc1hana2 |indexserver  |dc2hana2  |YES           |SYNC        |ERROR       |Log shipping timeout occurred | False       |
      |SYSTEMDB |dc1hana1 |nameserver   |dc2hana1  |YES           |SYNC        |ACTIVE      |                              | True        |
      |RH1      |dc1hana1 |xsengine     |dc2hana1  |YES           |SYNC        |ACTIVE      |                              | True        |
      |RH1      |dc1hana1 |indexserver  |dc2hana1  |YES           |SYNC        |ACTIVE      |                              | True        |

      status system replication site "2": ERROR
      overall system replication status: ERROR
      ...

   The replication status changes to ERROR for the indexserver service after a short delay. An idle instance can take a minute or more to react.

2. On the primary site's master name server node, check the HANA nameserver process log for the related messages as the <sid>adm user:

      rh1adm $ cdtrace; grep -he 'HanaSR.srConnectionChanged.*' nameserver_*
      ha_dr_HanaSR HanaSR.py(00056) : HanaSR 1.001.1 HanaSR.srConnectionChanged method called with Dict={'hostname': 'dc1hana2', 'port': '30003', 'volume': 4, 'service_name': 'indexserver', 'database': 'RH1', 'status': 11, 'database_status': 11, 'system_status': 11, 'timestamp': '2025-09-12T11:15:08.003728+00:00', 'is_in_sync': False, 'system_is_in_sync': False, 'reason': '', 'siteName': 'DC2'}
      ha_dr_HanaSR HanaSR.py(00065) : HanaSR HanaSR.srConnectionChanged system_status=11 SID=RH1 in_sync=False reason=
      ha_dr_HanaSR HanaSR.py(00091) : HanaSR.srConnectionChanged() CALLING CRM: <sudo /usr/sbin/crm_attribute -n hana_rh1_site_srHook_DC2 -v SFAIL -t crm_config -s SAPHanaSR> ret_code=

   The nameserver process log contains the event that HANA triggers, with details. It also includes the sudo command that the HanaSR hook script runs to update the srHook cluster attribute.

3. Verify that both cluster attributes for the system replication status, srHook and srPoll, show the SFAIL status of the secondary site. Run the following as the root user on any HANA node, or use the open terminal from the previous steps to watch the changes:

      [root]# SAPHanaSR-showAttr
      ...
      Site lpt        lss mns      opMode    srHook srMode srPoll srr
      ----------------------------------------------------------------
      DC2  10         4   dc2hana1 logreplay SFAIL  sync   SFAIL  S
      DC1  1757079061 4   dc1hana1 logreplay PRIM   sync   PRIM   P
      ...

4. Unblock the previously frozen hdbindexserver PID to enable it again. Run this on the secondary instance on which you blocked the hdbindexserver process for the test:

      rh1adm $ kill -CONT <PID>

5. Repeat the previous checks to verify that the system replication recovers fully after a short time. The cluster does not trigger any actions during this test because the resources remain running. Ensure that the system replication status is healthy again and fully synced, and that the cluster attributes are set to SOK again for the secondary site.
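The cut invocation in the verification above only selects pipe-delimited columns; it does not interpret the replication data. You can try the field selection on a plain sample line (the line below mimics the table layout and is not real cluster output):

```shell
# Demo of the cut field selection used with systemReplicationStatus.py.
# Note that with a leading '|', field 1 is the empty string before the first delimiter.
line='|RH1 |dc1hana2 |indexserver |30003 |dc2hana2 |'
echo "$line" | cut -d '|' -f 2,3,6
# → RH1 |dc1hana2 |dc2hana2
```

Selected fields are re-joined with the same delimiter, which is why the output still reads as a pipe-separated row.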
7.2. Triggering the indexserver crash recovery
Test the functionality of the ChkSrv HA/DR provider by simulating the crash of an hdbindexserver process. You can run this test on the primary or on the secondary site. The exact recovery actions depend on the overall configuration. The following steps demonstrate the activity when using action_on_lost = stop in the hook configuration.
Prerequisites
- You have configured the ChkSrv HA/DR provider. Skip this test if you have not configured this optional hook.
- Your HANA instances have a healthy HANA system replication.
- You have no failures in the cluster status.
Procedure
1. Use a separate terminal to monitor the HANA processes as user <sid>adm on the instance on which you run this test:

      rh1adm $ watch "sapcontrol -nr ${TINSTANCE} -function GetProcessList | column -s ',' -t"

2. In another terminal on the same HANA instance, kill the hdbindexserver process:

      rh1adm $ kill <PID>
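The column invocation in the monitor command only reformats the comma-separated sapcontrol output into an aligned table; it does not change the data. The effect can be reproduced with sample text (the process names and PIDs below are made up for the demo):

```shell
# Demo of the 'column -s ',' -t' formatting used on the GetProcessList output.
printf 'name,pid,status\nhdbindexserver,31001,GREEN\nhdbnameserver,30901,GREEN\n' \
  | column -s ',' -t
```

This turns each comma into column padding, which makes it much easier to scan the PID and status fields while watching the processes.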
Verification
1. Check the dedicated HANA nameserver trace log on the same instance and identify the event and related action, as user <sid>adm:

      rh1adm $ cdtrace; less nameserver_chksrv.trc
      ...
      ChkSrv version 1.001.1. Method srServiceStateChanged method called.
      ChkSrv srServiceStateChanged method called with Dict={'hostname': 'dc2hana2', 'service_name': 'indexserver', 'service_port': '30203', 'service_status': 'stopping', 'service_previous_status': 'yes', 'timestamp': '2025-09-15T15:07:09.353198+00:00', 'daemon_status': 'yes', 'database_id': '3', 'database_name': 'RH1', 'database_status': 'yes', 'details': ''}
      ChkSrv srServiceStateChanged method called with SAPSYSTEMNAME=RH1 srv:indexserver-30203-stopping-yes db:RH1-3-yes daem:yes
      LOST: indexserver event looks like a lost indexserver (status=stopping)
      LOST: stop instance. action_on_lost=stop
      ...

2. Check the cluster status for resource failure information on any cluster node, as user root:

      [root]# pcs status --full
      ...
      Failed Resource Actions:
        * rsc_SAPHanaCon_RH1_HDB02_monitor_61000 on dc2hana1 'not running' (7): call=26, status='complete', ...
      ...

3. Check the system log for the related cluster actions on the test node, for example, dc2hana2, as user root:

      [root]# grep rsc_SAPHanaCon_RH1_HDB02 /var/log/messages
      ...
      Sep 15 15:08:17 dc2hana1 pacemaker-controld[17045]: notice: Result of monitor operation for rsc_SAPHanaCon_RH1_HDB02 on dc2hana1: not running
      Sep 15 15:08:17 dc2hana1 pacemaker-controld[17045]: notice: rsc_SAPHanaCon_RH1_HDB02_monitor_61000@dc2hana1 output [ 10 ]
      Sep 15 15:08:17 dc2hana1 pacemaker-controld[17045]: notice: Transition 32 action 29 (rsc_SAPHanaCon_RH1_HDB02_monitor_61000 on dc2hana1): expected 'ok' but got 'not running'
      Sep 15 15:08:17 dc2hana1 pacemaker-attrd[17043]: notice: Setting last-failure-rsc_SAPHanaCon_RH1_HDB02#monitor_61000[dc2hana1] in instance_attributes: (unset) -> 1757948897
      Sep 15 15:08:17 dc2hana1 pacemaker-attrd[17043]: notice: Setting fail-count-rsc_SAPHanaCon_RH1_HDB02#monitor_61000[dc2hana1] in instance_attributes: (unset) -> 1
      Sep 15 15:08:17 dc2hana1 pacemaker-schedulerd[17044]: warning: Unexpected result (not running) was recorded for monitor of rsc_SAPHanaCon_RH1_HDB02:2 on dc2hana1 at Sep 15 15:08:17 2025
      Sep 15 15:08:17 dc2hana1 pacemaker-schedulerd[17044]: notice: Actions: Recover rsc_SAPHanaCon_RH1_HDB02:2 ( Unpromoted dc2hana1 )
      Sep 15 15:08:17 dc2hana1 pacemaker-schedulerd[17044]: warning: Unexpected result (not running) was recorded for monitor of rsc_SAPHanaCon_RH1_HDB02:2 on dc2hana1 at Sep 15 15:08:17 2025
      Sep 15 15:08:17 dc2hana1 pacemaker-schedulerd[17044]: notice: Actions: Recover rsc_SAPHanaCon_RH1_HDB02:2 ( Unpromoted dc2hana1 )
      Sep 15 15:08:17 dc2hana1 pacemaker-controld[17045]: notice: Initiating stop operation rsc_SAPHanaCon_RH1_HDB02_stop_0 locally on dc2hana1
      Sep 15 15:08:17 dc2hana1 pacemaker-controld[17045]: notice: Requesting local execution of stop operation for rsc_SAPHanaCon_RH1_HDB02 on dc2hana1
      ...

   The next SAPHanaController resource monitor reports the unexpectedly stopped HANA instance as a failure and initiates the recovery steps according to the configuration. If PREFER_SITE_TAKEOVER is enabled and you executed the test on a primary instance, it triggers a HANA takeover to the secondary site.
Next steps
- When necessary, depending on the configuration, manually reregister the stopped former primary HANA site and start it using HANA tools. For more information, refer to Registering the former primary HANA site as a secondary HANA site after a takeover.
- Clear any failure notifications from the cluster that might be there from previous testing. For more information see Cleaning up the failure history.
7.3. Triggering a HANA takeover using cluster commands
Manually test a planned takeover of the primary to the secondary site by using the cluster command that moves the promoted resource to the other site.
Prerequisites
- Your HANA instances have a healthy HANA system replication.
- You have no failures in the cluster status.
Procedure
Switch the primary site to the secondary site. Run the cluster command as user root on any node:

   [root]# pcs resource move cln_SAPHanaCon_<SID>_HDB<instance>
   Location constraint to move resource 'cln_SAPHanaCon_RH1_HDB02' has been created
   Waiting for the cluster to apply configuration changes...
   Location constraint created to move resource 'cln_SAPHanaCon_RH1_HDB02' has been removed
   Waiting for the cluster to apply configuration changes...
   resource 'cln_SAPHanaCon_RH1_HDB02' is promoted on node 'dc2hana1'
Verification
Verify that the SAPHanaController resource is now promoted on the other site:

   [root]# pcs resource status cln_SAPHanaCon_RH1_HDB02
     * Clone Set: cln_SAPHanaCon_RH1_HDB02 [rsc_SAPHanaCon_RH1_HDB02] (promotable):
       * Promoted: [ dc2hana1 ]
       * Unpromoted: [ dc1hana2 dc2hana2 ]
       * Stopped: [ dc1hana1 ]

The status of the previous primary instance depends on the AUTOMATED_REGISTER parameter of the SAPHanaController resource. When AUTOMATED_REGISTER is false, the instance stays stopped until manual intervention; otherwise, the instance restarts automatically and reregisters as the new secondary instance.
Next steps
- When necessary, depending on the configuration, manually reregister the stopped former primary HANA site and start it using HANA tools. For more information, refer to Registering the former primary HANA site as a secondary HANA site after a takeover.
- Clear any failure notifications from the cluster that may be there from previous testing. For more information see Cleaning up the failure history.
7.4. Triggering the SAPHanaFilesystem failure action
Block write access to the monitored directory to test the correct behavior of the SAPHanaFilesystem resource. You can run this test on any instance, but only a primary instance triggers a failure and recovery action; on a secondary node the resource does not trigger an action.
Prerequisites
- You have configured the SAPHanaFilesystem resource. Skip this test if you have not configured this optional resource.
Procedure
1. Create a temporary file on a local filesystem of the node you want to test:

      [root]# touch /tmp/test

2. Set the local file to be immutable, which prevents write access:

      [root]# chattr +i /tmp/test

3. Go to the hidden directory which the SAPHanaFilesystem resource uses to test read and write filesystem access, and change into the subdirectory of the node you want to test:

      [root]# cd /hana/shared/<SID>/.heartbeat_SAPHanaFilesystem/<node>

4. Change the test file, which the SAPHanaFilesystem resource creates, to become a symbolic link that points to the temporary local file. Because the temporary target file cannot be modified, the resource fails after the next monitor cycle:

      [root]# ln -sf /tmp/test test

   NFS filesystems do not support extended attributes by default. The symbolic link bridges this gap for the test.

5. Verify the behavior during the simulated failure.

   If the resource action is set to ignore, you can check the /var/log/messages file for the related log message:

      [root]# grep -e 'SAPHanaFil.*ON_FAIL_ACTION' /var/log/messages
      ...
      SAPHanaFilesystem(rsc_SAPHanaFil_RH1_HDB02)[715184]: INFO: -2- RA monitor() ON_FAIL_ACTION=ignore => ignore FS error, do not create poison pill file

   If the resource action is set to fence, you can observe the fencing action:

      [root]# pcs status --full
      ...
      Failed Resource Actions:
        * rsc_SAPHanaFil_RH1_HDB02_stop_0 on dc1hana1 'error' (1): ...
      Pending Fencing Actions:
        * reboot of dc1hana1 pending: client=pacemaker-controld.1694, origin=dc1hana2

6. Remove the blocker again after the test. If the node is fenced, delete the symbolic link after the node is running again. The resource creates the regular test file again during the next check:

      [root]# rm -f /hana/shared/<SID>/.heartbeat_SAPHanaFilesystem/<node>/test

7. Clean up the temporary local test file:

      [root]# chattr -i /tmp/test; rm -f /tmp/test
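The symbolic link works as a blocker because writes through a link are redirected to its target; with the target made immutable, the resource's write attempt fails. The redirection itself can be demonstrated without chattr or root privileges, using throwaway files in a temporary directory:

```shell
# Demo: a write through a symbolic link lands in the link target,
# which is why pointing 'test' at an immutable file blocks the resource.
dir=$(mktemp -d)
echo original > "$dir/target"
ln -sf "$dir/target" "$dir/test"   # 'test' is now a link, as in the procedure

echo changed > "$dir/test"         # write goes through the link ...
cat "$dir/target"                  # ... and modifies the target: prints "changed"

rm -r "$dir"                       # clean up the demo directory
```

In the real test, the write fails instead of succeeding, because the immutable attribute on the target rejects the modification.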
Next steps
- When necessary, depending on the configuration, manually reregister the stopped former primary HANA site and start it using HANA tools. For more information, refer to Registering the former primary HANA site as a secondary HANA site after a takeover.
- Clear any failure notifications from the cluster that may be there from previous testing. For more information see Cleaning up the failure history.
7.5. Crashing the node with a primary instance
Simulate the crash of the cluster node on which a primary instance is running to test the behavior of your HANA cluster resources.
Prerequisites
- Your HANA instances have a healthy HANA system replication.
- You have no failures in the cluster status.
Procedure
Trigger a crash on a HANA node on the primary site. This command immediately causes a crash of the node with no further warning:
[root]# echo c > /proc/sysrq-trigger
Verification
The cluster detects the failed node and fences it. You can watch the cluster activity on any of the remaining nodes:
   [root]# pcs status --full
   ...
   Pending Fencing Actions:
     * reboot of dc1hana1 pending: client=pacemaker-controld.1685, origin=dc1hana2
   ...

- The secondary site takes over and is promoted as the new primary.
- The fenced former primary node recovers according to your fencing and SAPHanaController resource configuration.
Next steps
- When necessary, depending on the configuration, manually reregister the stopped former primary HANA site and start it using HANA tools. For more information, refer to Registering the former primary HANA site as a secondary HANA site after a takeover.
- Clear any failure notifications from the cluster that may be there from previous testing. For more information see Cleaning up the failure history.
7.6. Crashing the node with a secondary instance
Simulate the crash of the cluster node on which a secondary instance is running to test the behavior of your HANA cluster resources.
Procedure
Trigger a crash of a HANA node on the secondary site. This command immediately causes a crash of the node with no further warning:
[root]# echo c > /proc/sysrq-trigger
Verification
The cluster detects the failed node and fences it. You can watch the cluster activity on any of the remaining nodes:
   [root]# pcs status --full
   ...
   Pending Fencing Actions:
     * reboot of dc2hana1 pending: client=pacemaker-controld.1694, origin=dc1hana1
   ...

- The primary site remains running while the secondary node restarts and recovers. The recovery of the fenced node depends on your fencing configuration.
Next steps
- Clear any failure notifications from the cluster that may be there from previous testing. For more information see Cleaning up the failure history.
7.7. Stopping the primary site using SAP commands
Test the behavior of the cluster when you manage the primary HANA site outside of the cluster using HANA commands.
Since the cluster is not aware of the execution of HANA commands, it detects the change as a failure and triggers the configured recovery actions.
Prerequisites
- Your HANA instances have a healthy HANA system replication.
- You have no failures in the cluster status.
Procedure
Stop the primary HANA site as the <sid>adm user outside of the cluster. Run this on one HANA instance on the primary site:

   rh1adm $ sapcontrol -nr ${TINSTANCE} -function StopSystem HDB
Verification
The cluster notices the stopped instance as a failure and initiates the recovery of the primary site:
   [root]# pcs status --full
   ...
   Migration Summary:
     * Node: dc1hana1 (1):
       * rsc_SAPHanaCon_RH1_HDB02: migration-threshold=5000 fail-count=1 last-failure=...
   Failed Resource Actions:
     * rsc_SAPHanaCon_RH1_HDB02_monitor_61000 on dc1hana1 'not running'
   ...

If you configured and enabled both the PREFER_SITE_TAKEOVER and AUTOMATED_REGISTER parameters in the SAPHanaController resource, the cluster triggers a HANA takeover to the secondary site and automatically registers the failed primary as the new secondary. Otherwise, it recovers the failed primary according to your configuration.
Next steps
- When necessary, depending on the configuration, manually reregister the stopped former primary HANA site and start it using HANA tools. For more information, refer to Registering the former primary HANA site as a secondary HANA site after a takeover.
- Clear any failure notifications from the cluster that may be there from previous testing. For more information see Cleaning up the failure history.
7.8. Stopping the secondary site using SAP commands
Test the behavior of the cluster when you manage the secondary HANA site outside of the cluster using HANA commands.
Since the cluster is not aware of the execution of HANA commands, it detects the change as a failure and triggers the configured recovery actions.
Prerequisites
- You have no failures in the cluster status.
Procedure
Stop the secondary HANA site as the <sid>adm user outside of the cluster. Run this on one HANA instance on the secondary site:

   rh1adm $ sapcontrol -nr ${TINSTANCE} -function StopSystem HDB
Verification
The cluster notices the stopped instance as a failure and recovers the secondary site:
   [root]# pcs status --full
   ...
   Migration Summary:
     * Node: dc2hana1 (2):
       * rsc_SAPHanaCon_RH1_HDB02: migration-threshold=5000 fail-count=1 last-failure=...
   Failed Resource Actions:
     * rsc_SAPHanaCon_RH1_HDB02_monitor_61000 on dc2hana1 'not running'
   ...
Next steps
- Clear any failure notifications from the cluster that may be there from previous testing. For more information see Cleaning up the failure history.