Chapter 10. Troubleshooting
10.1. The srHook cluster attribute value is incorrect
When the srHook attribute value does not match the actual HANA system replication status, the cluster can behave unexpectedly when a primary instance fails.
Check and correct your sudo configuration when the srHook attribute of the secondary site and the HANA system replication status do not match:
- The srHook cluster attribute of the secondary is empty.
- The srHook cluster attribute of the secondary is set to SOK while the HANA system replication is not healthy.
- The srHook cluster attribute of the secondary is set to SFAIL while the system replication is in ACTIVE state.
The primary site receives the events of HANA system replication changes and stores the result as a cluster attribute for the secondary site.
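The three mismatch cases can be expressed as a small shell check. This is only a sketch: the function name srhook_mismatch is hypothetical, both values must be supplied by you, and it assumes that ACTIVE is the only replication status that counts as healthy.

```shell
# Hypothetical helper: classify the mismatch cases described in this section.
# $1 = srHook attribute value of the secondary site (may be empty)
# $2 = overall HANA system replication status, for example ACTIVE
# Assumption: ACTIVE is the only status treated as healthy.
srhook_mismatch() {
    srhook=$1
    repl=$2
    if [ -z "$srhook" ]; then
        echo "mismatch: srHook is empty"
    elif [ "$srhook" = "SOK" ] && [ "$repl" != "ACTIVE" ]; then
        echo "mismatch: srHook is SOK but replication is $repl"
    elif [ "$srhook" = "SFAIL" ] && [ "$repl" = "ACTIVE" ]; then
        echo "mismatch: srHook is SFAIL but replication is ACTIVE"
    else
        echo "consistent"
    fi
}

srhook_mismatch "SFAIL" "ACTIVE"   # prints: mismatch: srHook is SFAIL but replication is ACTIVE
srhook_mismatch "SOK" "ACTIVE"     # prints: consistent
```

Any result other than consistent is a reason to review the sudo configuration as described in the procedure.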
Procedure
- Check for crm_attribute update errors in the secure log, since the command is executed using sudo. The log shows the command that the hook script tries to execute and that potentially fails. Check on the primary instance node for an error such as command not allowed, as in this example:

  [root]# grep crm_attribute /var/log/secure
  ... rh1adm : command not allowed ; PWD=/hana/shared/RH1/HDB02/<node> ; USER=root ; COMMAND=/usr/sbin/crm_attribute -n hana_rh1_site_srHook_DC2 -v SFAIL -t crm_config -s SAPHanaSR

- Compare the logged COMMAND to your sudoers configuration. Check thoroughly and fix the sudoers file, so that you have a sudo entry that matches the command. As a temporary measure, you can verify that the sudo entry as such works by simplifying it with a wildcard, which excludes typos in the command parameters as the cause:

  [root]# cat /etc/sudoers.d/20-saphana
  Defaults:<sid>adm !requiretty
  <sid>adm ALL=(ALL) NOPASSWD: /usr/sbin/crm_attribute *

  - Replace <sid> with your lower-case HANA SID.

- Verify that the command path is correct:

  [root]# ls /usr/sbin/crm_attribute
  /usr/sbin/crm_attribute

- Fix the sudo configuration. For more information, see Configuring the SAPHanaSR HA/DR provider for the srConnectionChanged() hook method.
- Repeat any fixing steps on all nodes. The sudo configuration must be identical on all instances.
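To compare the rejected command with the sudoers entries, it can help to print only the COMMAND part of each denied log entry. The following is a minimal sketch; the function name denied_crm_commands is hypothetical, and it assumes the log line format shown in the example above.

```shell
# Hypothetical helper: print the command from each "command not allowed"
# sudo log entry that mentions crm_attribute.
# $1 = log file to scan (this section uses /var/log/secure)
# Assumption: log entries contain "COMMAND=" as in the example above.
denied_crm_commands() {
    grep 'command not allowed' "$1" | grep 'crm_attribute' |
        sed -n 's/.*COMMAND=//p'
}

# Example call, run as root on the primary instance node:
#   denied_crm_commands /var/log/secure
```

Each printed line is the exact command that the sudo entry for the <sid>adm user must permit.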
10.2. The HANA instance does not start after hook changes
You recently changed an HA/DR provider section in the global.ini file, and the HANA instance no longer starts.
Procedure
- Go to the HANA trace logs directory, as the <sid>adm user:

  rh1adm$ cdtrace

- Check for errors related to the HA/DR providers in the HANA nameserver process alert log:

  rh1adm$ grep ha_dr_provider nameserver_alert_*.trc
  ... ha_dr_provider PythonProxyImpl.cpp(00145) : import of saphanasr failed: No module named 'saphanasr'
  ... ha_dr_provider HADRProviderManager.cpp(00100) : could not load HA/DR Provider 'saphanasr' from /usr/share/SAPHanaSR-ScaleOut

- Identify the root cause, for example a misspelled HA/DR provider name or a wrong path. Check the path and the hook script name. In this example, the HA/DR provider name saphanasr does not match the hook script name SAPHanaSR:

  rh1adm$ ls /usr/share/SAPHanaSR-ScaleOut/
  ChkSrv.py  SAPHanaSR.py  SAPHanaSrMultiTarget.py  samples

- Correct the SAPHanaSR HA/DR provider configuration:

  [ha_dr_provider_SAPHanaSR]
  provider = SAPHanaSR
  path = /usr/share/SAPHanaSR-ScaleOut
  execution_order = 1

  - provider must match the name of the Python hook script. The name is case-sensitive and does not include the .py file suffix.
  - path must be the path in which the hook script is stored.
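The relationship between provider and path can be checked before you restart the instance. The following is a sketch only; the function name check_hadr_provider is hypothetical, and you pass the two values from your own global.ini section by hand.

```shell
# Hypothetical helper: check that a hook script named exactly
# "<provider>.py" exists in the configured path.
# $1 = provider value, $2 = path value from the ha_dr_provider section
check_hadr_provider() {
    provider=$1
    hook_path=$2
    # Compare file names as fixed, whole-line strings so that the match
    # is case-sensitive, like the provider setting itself.
    if ls "$hook_path" 2>/dev/null | grep -Fxq -- "$provider.py"; then
        echo "ok: $hook_path/$provider.py"
    else
        echo "not found: $hook_path/$provider.py"
        return 1
    fi
}

# Example call with the values from this section:
#   check_hadr_provider SAPHanaSR /usr/share/SAPHanaSR-ScaleOut
```

With the misspelled provider name saphanasr from the example log, the check reports not found, because only SAPHanaSR.py exists in the directory.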
10.3. A cluster node is reported as offline during maintenance
When maintenance-mode is set for the cluster, for example for a HANA update, the cluster can still detect issues between the nodes, but it does not trigger recovery actions while the maintenance mode is active.
If you encounter such a situation, you must first fix the cause of the issue before you lift the maintenance mode.
Example: the corosync communication between the nodes is blocked in an 8-node cluster
If the maintenance mode is removed in this situation, the cluster tries to recover from the issue by itself. This can have a severe impact on your ongoing HANA maintenance activity.
...
* Resource management is DISABLED *
The cluster will not attempt to start, stop or recover services
Node List:
* Node dc2hana3: UNCLEAN (offline)
* Online: [ dc1hana1 dc1hana2 dc1hana3 dc1hana4 dc2hana1 dc2hana2 dc2hana4 dc3mm ]
Full List of Resources:
* rsc_fence (stonith:<fence agent>): Started dc1hana1 (maintenance)
* Clone Set: cln_SAPHanaTop_RH1_HDB02 [rsc_SAPHanaTop_RH1_HDB02] (maintenance):
* rsc_SAPHanaTop_RH1_HDB02 (ocf:heartbeat:SAPHanaTopology): Started dc1hana2 (maintenance)
* rsc_SAPHanaTop_RH1_HDB02 (ocf:heartbeat:SAPHanaTopology): Started dc1hana3 (maintenance)
* rsc_SAPHanaTop_RH1_HDB02 (ocf:heartbeat:SAPHanaTopology): Started dc1hana4 (maintenance)
* rsc_SAPHanaTop_RH1_HDB02 (ocf:heartbeat:SAPHanaTopology): Started dc2hana1 (maintenance)
* rsc_SAPHanaTop_RH1_HDB02 (ocf:heartbeat:SAPHanaTopology): Started dc2hana2 (maintenance)
* rsc_SAPHanaTop_RH1_HDB02 (ocf:heartbeat:SAPHanaTopology): Started dc2hana3 (UNCLEAN, maintenance)
* rsc_SAPHanaTop_RH1_HDB02 (ocf:heartbeat:SAPHanaTopology): Started dc2hana4 (maintenance)
* rsc_SAPHanaTop_RH1_HDB02 (ocf:heartbeat:SAPHanaTopology): Started dc1hana1 (maintenance)
* Stopped: [ dc2hana3 dc3mm ]
* Clone Set: cln_SAPHanaCon_RH1_HDB02 [rsc_SAPHanaCon_RH1_HDB02] (promotable, maintenance):
* rsc_SAPHanaCon_RH1_HDB02 (ocf:heartbeat:SAPHanaController): Unpromoted dc1hana2 (maintenance)
* rsc_SAPHanaCon_RH1_HDB02 (ocf:heartbeat:SAPHanaController): Unpromoted dc1hana3 (maintenance)
* rsc_SAPHanaCon_RH1_HDB02 (ocf:heartbeat:SAPHanaController): Unpromoted dc1hana4 (maintenance)
* rsc_SAPHanaCon_RH1_HDB02 (ocf:heartbeat:SAPHanaController): Unpromoted dc2hana1 (maintenance)
* rsc_SAPHanaCon_RH1_HDB02 (ocf:heartbeat:SAPHanaController): Unpromoted dc2hana2 (maintenance)
* rsc_SAPHanaCon_RH1_HDB02 (ocf:heartbeat:SAPHanaController): Unpromoted dc2hana3 (UNCLEAN, maintenance)
* rsc_SAPHanaCon_RH1_HDB02 (ocf:heartbeat:SAPHanaController): Unpromoted dc2hana4 (maintenance)
* rsc_SAPHanaCon_RH1_HDB02 (ocf:heartbeat:SAPHanaController): Promoted dc1hana1 (maintenance)
* Stopped: [ dc2hana3 dc3mm ]
* rsc_vip_RH1_HDB02_primary (ocf:heartbeat:IPaddr2): Started dc1hana1 (maintenance)
* rsc_vip_RH1_HDB02_readonly (ocf:heartbeat:IPaddr2): Started dc2hana1 (maintenance)
...
Identify the root cause of the issue, for example:
- Planned network maintenance on the cluster communication connection in parallel to your HANA maintenance.
- Unplanned outage of network connections due to network device failures or misconfiguration on operating system or network level.
- Firewall configuration blocking cluster communication ports.
Fix any issue to prevent the cluster from taking recovery measures when the cluster maintenance is removed.
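Before you remove the maintenance mode, you can scan the cluster status for nodes that are still reported as UNCLEAN. The following is a minimal sketch; the function name unclean_nodes is hypothetical, and it assumes the node list format shown in the example output above.

```shell
# Hypothetical helper: read cluster status text on standard input and
# print the name of every node that is reported as UNCLEAN.
# Assumption: node lines look like "* Node dc2hana3: UNCLEAN (offline)".
unclean_nodes() {
    sed -n 's/^[[:space:]]*\*[[:space:]]*Node[[:space:]]\{1,\}\([^:]*\):[[:space:]]*UNCLEAN.*/\1/p'
}

# Example call (assumes that the status text comes from pcs status):
#   pcs status | unclean_nodes
```

An empty result means that no node line in the supplied status text is marked UNCLEAN. It does not replace a full review of the cluster status.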
10.4. The srHook attribute is SFAIL while the system replication is healthy
An inconsistency between the actual HANA system replication state and the srHook cluster node attribute can occur when the system replication fails while the cluster is running on the primary instance node, for example during maintenance. HANA triggers the hook, which updates the srHook attribute with the SFAIL value. If the cluster is then stopped on the primary instance node and the HANA system replication recovers to a healthy state, HANA executes the hook correctly, but the update of the cluster node attribute fails because the cluster is not running on that node.
The primary HANA instance only triggers the srConnectionChanged() hook when there is a new change of the system replication status.
The sync_state attribute is set based on an active check and functions as a fallback when the srHook value is empty. However, when both attributes are set and their values differ, the SAPHanaController resource uses the srHook attribute to decide whether a takeover is possible. As a result, if the srHook attribute is SFAIL despite a healthy HANA system replication state, the cluster does not trigger a takeover to the secondary site at the next failure on the primary site.
To solve this conflict, you can delete the incorrect srHook attribute. Afterwards the cluster uses the sync_state attribute for decisions, and the srHook attribute is updated and used again after the next change of the HANA system replication status.
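The precedence between the two attributes can be sketched in a few lines of shell. This is a simplified model of the behavior described above, not the actual logic of the SAPHanaController resource agent, and the function name effective_sr_state is hypothetical.

```shell
# Hypothetical helper: print the replication state that the takeover
# decision is based on, following the precedence described above.
# $1 = srHook value (may be empty), $2 = sync_state value
effective_sr_state() {
    if [ -n "$1" ]; then
        echo "$1"      # srHook is set and takes precedence
    else
        echo "$2"      # srHook is empty, fall back to sync_state
    fi
}

effective_sr_state "SFAIL" "SOK"   # prints: SFAIL (conflict, srHook wins)
effective_sr_state "" "SOK"        # prints: SOK (after srHook is deleted)
```

The second call illustrates why deleting the stale srHook attribute lets the cluster rely on the sync_state value again.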
Procedure
- Use the systemReplicationStatus.py script to check the status of the HANA system replication on the primary site:

  [root]# su - <sid>adm -c "HDBSettings.sh systemReplicationStatus.py \
  --sapcontrol=1 | grep -i replication_status="
  service/dc1hana3/30203/REPLICATION_STATUS=ACTIVE
  service/dc1hana2/30203/REPLICATION_STATUS=ACTIVE
  service/dc1hana1/30201/REPLICATION_STATUS=ACTIVE
  service/dc1hana1/30207/REPLICATION_STATUS=ACTIVE
  service/dc1hana1/30203/REPLICATION_STATUS=ACTIVE
  site/2/REPLICATION_STATUS=ACTIVE
  overall_replication_status=ACTIVE

  Before you proceed, ensure that the system replication is healthy and reported as ACTIVE.

- Review the sync_state and srHook attributes and the node score values during the conflict:

  [root]# SAPHanaSR-showAttr
  Global cib-time                 prim sec srHook sync_state upd
  ---------------------------------------------------------------
  RH1    Fri Dec 19 11:12:42 2025 DC1  DC1 SFAIL  SOK        ok

  Sites lpt        lss mns      srr
  ----------------------------------
  DC1   1766142750 4   dc1hana1 P
  DC2   10         4   dc2hana1 S

  Hosts    clone_state gra node_state roles                         score     site
  ---------------------------------------------------------------------------------
  dc1hana1 PROMOTED    2.0 online     master1:master:worker:master  150       DC1
  dc1hana2 DEMOTED     2.0 online     master2:slave:worker:slave    140       DC1
  dc1hana3 DEMOTED     2.0 online     slave:slave:worker:slave      -10000    DC1
  dc1hana4 DEMOTED     2.0 online     master3:slave:standby:standby 140       DC1
  dc2hana1 DEMOTED     2.0 online     master1:master:worker:master  -INFINITY DC2
  dc2hana2 DEMOTED     2.0 online     master2:slave:worker:slave    -32300    DC2
  dc2hana3 DEMOTED     2.0 online     slave:slave:worker:slave      -22200    DC2
  dc2hana4 DEMOTED     2.0 online     master3:slave:standby:standby -32300    DC2
  dc3mm                    online

  In this state, the sync_state attribute is correct, but the srHook attribute takes precedence. Therefore, the secondary site is excluded from taking over if the primary site fails.

- Delete the srHook attribute to solve the conflict:

  [root]# crm_attribute --type crm_config -n hana_<sid>_glob_srHook --delete
  Deleted crm_config option: id=SAPHanaSR-hana_rh1_glob_srHook name=hana_rh1_glob_srHook
Verification
- Check the attributes summary and note that the srHook attribute is missing and that the node scores are updated to enable an automatic takeover again using the sync_state attribute status:

  [root]# SAPHanaSR-showAttr
  Global cib-time                 prim sec sync_state upd
  --------------------------------------------------------
  RH1    Fri Dec 19 11:17:59 2025 DC1  DC1 SOK        ok

  Sites lpt        lss mns      srr
  ----------------------------------
  DC1   1766143077 4   dc1hana1 P
  DC2   30         4   dc2hana1 S

  Hosts    clone_state gra node_state roles                         score  site
  ------------------------------------------------------------------------------
  dc1hana1 PROMOTED    2.0 online     master1:master:worker:master  150    DC1
  dc1hana2 DEMOTED     2.0 online     master2:slave:worker:slave    140    DC1
  dc1hana3 DEMOTED     2.0 online     slave:slave:worker:slave      -10000 DC1
  dc1hana4 DEMOTED     2.0 online     master3:slave:standby:standby 140    DC1
  dc2hana1 DEMOTED     2.0 online     master1:master:worker:master  100    DC2
  dc2hana2 DEMOTED     2.0 online     master2:slave:worker:slave    80     DC2
  dc2hana3 DEMOTED     2.0 online     slave:slave:worker:slave      -12200 DC2
  dc2hana4 DEMOTED     2.0 online     master3:slave:standby:standby 80     DC2
  dc3mm                    online