Chapter 10. Troubleshooting


If the srHook attribute value does not match the actual HANA system replication status, the cluster can behave unexpectedly when the primary instance fails.

Check and correct your sudo configuration if the srHook attribute of the secondary site does not match the HANA system replication status in one of the following ways:

  • The srHook cluster attribute of the secondary is empty.
  • The srHook cluster attribute of the secondary is set to SOK while the HANA system replication is not healthy.
  • The srHook cluster attribute of the secondary is set to SFAIL while the system replication is in ACTIVE state.

The primary site receives the events of HANA system replication changes and stores the result as a cluster attribute for the secondary site.

Procedure

  1. Check the secure log for crm_attribute update errors, because the hook script runs the command through sudo. The log shows the command that the hook script tried to execute and that sudo rejected. On the primary instance node, look for a "command not allowed" error, as in this example:

    [root]# grep crm_attribute /var/log/secure
    ... rh1adm : command not allowed ; PWD=/hana/shared/RH1/HDB02/<node> ; USER=root ; COMMAND=/usr/sbin/crm_attribute -n hana_rh1_site_srHook_DC2 -v SFAIL -t crm_config -s SAPHanaSR
  2. Compare the logged COMMAND with your sudoers configuration and fix the sudoers file so that it contains an entry that matches the command. As a temporary diagnostic measure, you can simplify the entry with a wildcard to rule out typos in the command parameters as the cause:

    [root]# cat /etc/sudoers.d/20-saphana
    Defaults:<sid>adm !requiretty
    <sid>adm ALL=(ALL) NOPASSWD: /usr/sbin/crm_attribute *
    • Replace <sid> with your lower-case HANA SID.
  3. Verify that the command path is correct:

    [root]# ls /usr/sbin/crm_attribute
    /usr/sbin/crm_attribute
  4. Fix the sudo configuration. For more information, see Configuring the SAPHanaSR HA/DR provider for the srConnectionChanged() hook method.
  5. Repeat any fixing steps on all nodes. The sudo configuration must be identical on all instances.
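
The log check in step 1 can be narrowed to the rejected commands only, which makes the comparison with the sudoers entry in step 2 easier. The following is a minimal sketch that assumes the log line format shown in the example above. The function name denied_crm_commands is hypothetical and not part of any package.

```shell
# Hypothetical helper: print each distinct crm_attribute command that sudo
# rejected. Reads log lines from stdin and assumes the
# "command not allowed ; ... ; COMMAND=<command>" format shown above.
denied_crm_commands() {
    grep 'command not allowed' \
        | grep 'crm_attribute' \
        | sed -n 's/.*COMMAND=//p' \
        | sort -u
}
```

Run it as root with `denied_crm_commands < /var/log/secure`. Each printed line is a command that needs a matching sudo entry. To see which entries sudo currently grants the user, `sudo -l -U <sid>adm` lists them.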

You recently changed an HA/DR provider section in the global.ini file, and the HANA instance no longer starts.

Procedure

  1. Go to the HANA trace logs directory, as the <sid>adm user:

    rh1adm$ cdtrace
  2. Check for errors related to the HA/DR providers in the HANA nameserver process alert log:

    rh1adm$ grep ha_dr_provider nameserver_alert_*.trc
    ... ha_dr_provider   PythonProxyImpl.cpp(00145) : import of saphanasr failed: No module named 'saphanasr'
    ... ha_dr_provider   HADRProviderManager.cpp(00100) : could not load HA/DR Provider 'saphanasr' from /usr/share/SAPHanaSR-ScaleOut
  3. Identify the root cause, for example a misspelled HA/DR provider name or a wrong path. Check the path and the hook script name. In this example, the HA/DR provider name saphanasr does not match the hook script name SAPHanaSR:

    rh1adm$ ls /usr/share/SAPHanaSR-ScaleOut/
    ChkSrv.py  SAPHanaSR.py  SAPHanaSrMultiTarget.py  samples
  4. Correct the SAPHanaSR HA/DR provider configuration:

    [ha_dr_provider_SAPHanaSR]
    provider = SAPHanaSR
    path = /usr/share/SAPHanaSR-ScaleOut
    execution_order = 1
    • provider must match the name of the Python hook script without the .py file suffix. The name is case-sensitive.
    • path must be the path in which the hook script is stored.
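
The two rules above can be checked before you restart HANA. The following is a minimal sketch, assuming a global.ini section with provider and path keys as in the example. The function names ini_value and check_hadr_provider are hypothetical, and you pass the location of your own global.ini as an argument.

```shell
# Hypothetical helper: print the value of key $3 in section $2 of INI file $1.
ini_value() {
    awk -F= -v sec="[$2]" -v key="$3" '
        /^\[/ { in_sec = ($0 == sec) }      # track the current section
        in_sec {
            k = $1; gsub(/[ \t]/, "", k)    # key name without whitespace
            if (k == key) {
                v = $0
                sub(/^[^=]*=[ \t]*/, "", v) # drop "key = "
                sub(/[ \t]+$/, "", v)       # drop trailing whitespace
                print v; exit
            }
        }' "$1"
}

# Hypothetical helper: verify that <path>/<provider>.py exists for a section.
# Usage: check_hadr_provider <global.ini> <section name>
check_hadr_provider() {
    local ini=$1 section=$2 provider path
    provider=$(ini_value "$ini" "$section" provider)
    path=$(ini_value "$ini" "$section" path)
    if [ -n "$provider" ] && [ -f "$path/$provider.py" ]; then
        echo "OK: $path/$provider.py"
    else
        echo "MISSING: $path/$provider.py" >&2
        return 1
    fi
}
```

For the misconfigured example, `check_hadr_provider <your global.ini> ha_dr_provider_SAPHanaSR` reports /usr/share/SAPHanaSR-ScaleOut/saphanasr.py as missing, because file names on Linux file systems are normally case-sensitive.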

When maintenance-mode is set for the cluster, for example during a HANA update, the cluster still detects issues between the nodes but does not trigger recovery actions until maintenance mode is removed.

If you encounter such a situation, you must first fix the cause of the issue before you lift the maintenance mode.

Example: the corosync communication between the nodes is blocked in an 8-node cluster

If the maintenance mode is removed in this situation, the cluster tries to recover from the issue on its own, which can severely affect your ongoing HANA maintenance activity.

...
              * Resource management is DISABLED *
  The cluster will not attempt to start, stop or recover services

Node List:
  * Node dc2hana3: UNCLEAN (offline)
  * Online: [ dc1hana1 dc1hana2 dc1hana3 dc1hana4 dc2hana1 dc2hana2 dc2hana4 dc3mm ]

Full List of Resources:
  * rsc_fence       (stonith:<fence agent>):     Started dc1hana1 (maintenance)
  * Clone Set: cln_SAPHanaTop_RH1_HDB02 [rsc_SAPHanaTop_RH1_HDB02] (maintenance):
    * rsc_SAPHanaTop_RH1_HDB02  (ocf:heartbeat:SAPHanaTopology):         Started dc1hana2 (maintenance)
    * rsc_SAPHanaTop_RH1_HDB02  (ocf:heartbeat:SAPHanaTopology):         Started dc1hana3 (maintenance)
    * rsc_SAPHanaTop_RH1_HDB02  (ocf:heartbeat:SAPHanaTopology):         Started dc1hana4 (maintenance)
    * rsc_SAPHanaTop_RH1_HDB02  (ocf:heartbeat:SAPHanaTopology):         Started dc2hana1 (maintenance)
    * rsc_SAPHanaTop_RH1_HDB02  (ocf:heartbeat:SAPHanaTopology):         Started dc2hana2 (maintenance)
    * rsc_SAPHanaTop_RH1_HDB02  (ocf:heartbeat:SAPHanaTopology):         Started dc2hana3 (UNCLEAN, maintenance)
    * rsc_SAPHanaTop_RH1_HDB02  (ocf:heartbeat:SAPHanaTopology):         Started dc2hana4 (maintenance)
    * rsc_SAPHanaTop_RH1_HDB02  (ocf:heartbeat:SAPHanaTopology):         Started dc1hana1 (maintenance)
    * Stopped: [ dc2hana3 dc3mm ]
  * Clone Set: cln_SAPHanaCon_RH1_HDB02 [rsc_SAPHanaCon_RH1_HDB02] (promotable, maintenance):
    * rsc_SAPHanaCon_RH1_HDB02  (ocf:heartbeat:SAPHanaController):       Unpromoted dc1hana2 (maintenance)
    * rsc_SAPHanaCon_RH1_HDB02  (ocf:heartbeat:SAPHanaController):       Unpromoted dc1hana3 (maintenance)
    * rsc_SAPHanaCon_RH1_HDB02  (ocf:heartbeat:SAPHanaController):       Unpromoted dc1hana4 (maintenance)
    * rsc_SAPHanaCon_RH1_HDB02  (ocf:heartbeat:SAPHanaController):       Unpromoted dc2hana1 (maintenance)
    * rsc_SAPHanaCon_RH1_HDB02  (ocf:heartbeat:SAPHanaController):       Unpromoted dc2hana2 (maintenance)
    * rsc_SAPHanaCon_RH1_HDB02  (ocf:heartbeat:SAPHanaController):       Unpromoted dc2hana3 (UNCLEAN, maintenance)
    * rsc_SAPHanaCon_RH1_HDB02  (ocf:heartbeat:SAPHanaController):       Unpromoted dc2hana4 (maintenance)
    * rsc_SAPHanaCon_RH1_HDB02  (ocf:heartbeat:SAPHanaController):       Promoted dc1hana1 (maintenance)
    * Stopped: [ dc2hana3 dc3mm ]
  * rsc_vip_RH1_HDB02_primary   (ocf:heartbeat:IPaddr2):         Started dc1hana1 (maintenance)
  * rsc_vip_RH1_HDB02_readonly  (ocf:heartbeat:IPaddr2):         Started dc2hana1 (maintenance)


...

Identify the root cause of the issue, for example:

  • Planned network maintenance on the cluster communication connection in parallel to your HANA maintenance.
  • Unplanned outage of network connections due to network device failures or misconfiguration on operating system or network level.
  • Firewall configuration blocking cluster communication ports.

Fix every issue before you remove the cluster maintenance mode, so that the cluster does not start recovery actions.
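
Before you remove the maintenance mode, you can confirm that the status output no longer reports any node as UNCLEAN. The following is a minimal sketch that assumes the "* Node <name>: UNCLEAN" line format from the example output above. The function name unclean_nodes is hypothetical.

```shell
# Hypothetical helper: print the names of nodes reported as UNCLEAN.
# Reads cluster status text from stdin and assumes lines of the form
# "  * Node <name>: UNCLEAN (offline)" as in the example output.
unclean_nodes() {
    sed -n 's/^[[:space:]]*\* Node \([^:]*\): UNCLEAN.*/\1/p'
}
```

Assuming the status text comes from `pcs status`, run `pcs status | unclean_nodes` and remove the maintenance mode only when it prints nothing. For the example above it prints dc2hana3.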

An inconsistency between the actual HANA system replication state and the srHook cluster node attribute can occur when the system replication fails while the cluster is running on the primary instance node, for example during maintenance. HANA triggers the hook, which sets the srHook attribute to SFAIL. If the cluster is then stopped on the primary instance node and the HANA system replication recovers to a healthy state, HANA executes the hook correctly, but the update of the cluster node attribute fails because the cluster is not running.

The primary HANA instance triggers the srConnectionChanged() hook only when the system replication status changes, so the outdated SFAIL value remains until the next status change.

The sync_state attribute is set based on an active check and functions as a fallback when the srHook value is empty. However, when the two values differ, the SAPHanaController resource uses the srHook attribute to decide whether a takeover is possible. As a result, if the srHook attribute is SFAIL despite a healthy HANA system replication state, the cluster does not trigger a takeover to the secondary site at the next failure on the primary site.

To solve this conflict, you can delete the incorrect srHook attribute. Afterwards the cluster uses the sync_state attribute for decisions, and the srHook attribute is updated and used again after the next change of the HANA system replication status.
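
The precedence between the two attributes can be illustrated with a short function. This is only a sketch of the rule described above, not the code of the SAPHanaController resource agent, and the function name effective_sr_state is hypothetical.

```shell
# Illustration only: which value decides whether a takeover is possible.
# $1 = srHook value (may be empty), $2 = sync_state value.
effective_sr_state() {
    local srhook=$1 sync_state=$2
    if [ -n "$srhook" ]; then
        echo "$srhook"      # srHook takes precedence whenever it is set
    else
        echo "$sync_state"  # fallback when srHook is empty or deleted
    fi
}
```

With the conflicting values, `effective_sr_state SFAIL SOK` prints SFAIL, which blocks the takeover. After the srHook attribute is deleted, `effective_sr_state "" SOK` prints SOK.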

Procedure

  1. Use the systemReplicationStatus.py script to check the status of the HANA system replication on the primary site:

    [root]# su - <sid>adm -c "HDBSettings.sh systemReplicationStatus.py \
    --sapcontrol=1 | grep -i replication_status="
    service/dc1hana3/30203/REPLICATION_STATUS=ACTIVE
    service/dc1hana2/30203/REPLICATION_STATUS=ACTIVE
    service/dc1hana1/30201/REPLICATION_STATUS=ACTIVE
    service/dc1hana1/30207/REPLICATION_STATUS=ACTIVE
    service/dc1hana1/30203/REPLICATION_STATUS=ACTIVE
    site/2/REPLICATION_STATUS=ACTIVE
    overall_replication_status=ACTIVE

    Before you proceed, ensure that the system replication is healthy and reported as ACTIVE.

  2. Review the sync_state and srHook attributes and the node score values during the conflict:

    [root]# SAPHanaSR-showAttr
    Global cib-time                 prim sec srHook sync_state upd
    ---------------------------------------------------------------
    RH1    Fri Dec 19 11:12:42 2025 DC1  DC1 SFAIL  SOK        ok
    
    Sites lpt        lss mns      srr
    ----------------------------------
    DC1   1766142750 4   dc1hana1 P
    DC2   10         4   dc2hana1 S
    
    Hosts    clone_state gra node_state roles                         score     site
    ---------------------------------------------------------------------------------
    dc1hana1 PROMOTED    2.0 online     master1:master:worker:master  150       DC1
    dc1hana2 DEMOTED     2.0 online     master2:slave:worker:slave    140       DC1
    dc1hana3 DEMOTED     2.0 online     slave:slave:worker:slave      -10000    DC1
    dc1hana4 DEMOTED     2.0 online     master3:slave:standby:standby 140       DC1
    dc2hana1 DEMOTED     2.0 online     master1:master:worker:master  -INFINITY DC2
    dc2hana2 DEMOTED     2.0 online     master2:slave:worker:slave    -32300    DC2
    dc2hana3 DEMOTED     2.0 online     slave:slave:worker:slave      -22200    DC2
    dc2hana4 DEMOTED     2.0 online     master3:slave:standby:standby -32300    DC2
    dc3mm                    online

    In this state, the sync_state attribute is correct, but the srHook attribute takes precedence. Therefore, the secondary site is excluded from taking over if the primary site fails.

  3. Delete the srHook attribute to solve the conflict:

    [root]# crm_attribute --type crm_config -n hana_<sid>_glob_srHook --delete
    Deleted crm_config option: id=SAPHanaSR-hana_rh1_glob_srHook name=hana_rh1_glob_srHook
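
The check in step 1 can serve as a guard, so that the deletion in step 3 only runs while the system replication is healthy. The following is a minimal sketch that assumes the key=value output format shown in step 1. The function name replication_all_active is hypothetical.

```shell
# Hypothetical helper: succeed only if stdin contains at least one
# *REPLICATION_STATUS=<value> line and every such value is ACTIVE.
replication_all_active() {
    awk -F= '
        tolower($1) ~ /replication_status$/ {
            seen++
            if ($2 != "ACTIVE") bad++
        }
        END { exit !(seen > 0 && bad == 0) }'
}
```

One way to use it is to pipe the unfiltered script output into the function and continue only when it succeeds:

    [root]# su - <sid>adm -c "HDBSettings.sh systemReplicationStatus.py --sapcontrol=1" \
    | replication_all_active && echo "replication healthy"

The function treats missing status lines as a failure, so an empty or unexpected output does not count as healthy.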

Verification

  • Check the attribute summary. The srHook attribute is no longer listed, and the node scores are updated so that an automatic takeover is possible again based on the sync_state attribute:

    [root]# SAPHanaSR-showAttr
    Global cib-time                 prim sec sync_state upd
    --------------------------------------------------------
    RH1    Fri Dec 19 11:17:59 2025 DC1  DC1 SOK        ok
    
    Sites lpt        lss mns      srr
    ----------------------------------
    DC1   1766143077 4   dc1hana1 P
    DC2   30         4   dc2hana1 S
    
    Hosts    clone_state gra node_state roles                         score  site
    ------------------------------------------------------------------------------
    dc1hana1 PROMOTED    2.0 online     master1:master:worker:master  150    DC1
    dc1hana2 DEMOTED     2.0 online     master2:slave:worker:slave    140    DC1
    dc1hana3 DEMOTED     2.0 online     slave:slave:worker:slave      -10000 DC1
    dc1hana4 DEMOTED     2.0 online     master3:slave:standby:standby 140    DC1
    dc2hana1 DEMOTED     2.0 online     master1:master:worker:master  100    DC2
    dc2hana2 DEMOTED     2.0 online     master2:slave:worker:slave    80     DC2
    dc2hana3 DEMOTED     2.0 online     slave:slave:worker:slave      -12200 DC2
    dc2hana4 DEMOTED     2.0 online     master3:slave:standby:standby 80     DC2
    dc3mm                    online