Chapter 10. Troubleshooting
10.1. The srHook cluster attribute value is incorrect
When the srHook attribute value does not match the actual HANA system replication status, the cluster can behave unexpectedly if the primary instance fails.
Check and correct your sudo configuration when the srHook attribute of the secondary site and the HANA system replication status do not match, for example:
- The srHook cluster attribute of the secondary site is empty.
- The srHook cluster attribute of the secondary site is set to SOK while the HANA system replication is not healthy.
- The srHook cluster attribute of the secondary site is set to SFAIL while the system replication is in ACTIVE state.
The primary site receives the HANA system replication change events and stores the result as a cluster attribute for the secondary site.
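To check whether you are affected, compare the stored attribute with the actual replication state. The following is only a sketch that assumes the SID rh1 and the secondary site name DC2 used in the examples of this chapter; adjust the attribute name to your own SID and site names:
[root]# crm_attribute -G -t crm_config -n hana_rh1_site_srHook_DC2
rh1adm $ hdbnsutil -sr_state
The first command prints the value currently stored in the cluster, the second shows the replication state as reported by HANA itself.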
Procedure
Check for crm_attribute update errors in the secure log, since the command is executed using sudo. The log shows the command that the hook script tries to execute, but potentially fails. Check on the primary instance node for an error like command not allowed, like in this example:
[root]# grep crm_attribute /var/log/secure
...
rh1adm : command not allowed ; PWD=/hana/shared/RH1/HDB02/<node> ; USER=root ; COMMAND=/usr/sbin/crm_attribute -n hana_rh1_site_srHook_DC2 -v SFAIL -t crm_config -s SAPHanaSR
Compare the logged COMMAND to your sudoers configuration. Check thoroughly and fix the sudoers file, so that you have a sudo entry that matches the command. As a temporary measure, you can ensure that the sudo entry as such works by simplifying it with a wildcard, to exclude typos in the command parameters as the cause (a stricter example entry is sketched after this step):
[root]# cat /etc/sudoers.d/20-saphana
Defaults:<sid>adm !requiretty
<sid>adm ALL=(ALL) NOPASSWD: /usr/sbin/crm_attribute *
- Replace <sid> with your lower-case HANA SID.
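Once the wildcard entry works, narrow it down again. The following is only a sketch of stricter entries, assuming the SID rh1 and the site names DC1 and DC2 from the logged command above; the authoritative entries are described in the configuration document referenced below:
rh1adm ALL=(ALL) NOPASSWD: /usr/sbin/crm_attribute -n hana_rh1_site_srHook_DC1 -v SOK -t crm_config -s SAPHanaSR
rh1adm ALL=(ALL) NOPASSWD: /usr/sbin/crm_attribute -n hana_rh1_site_srHook_DC1 -v SFAIL -t crm_config -s SAPHanaSR
rh1adm ALL=(ALL) NOPASSWD: /usr/sbin/crm_attribute -n hana_rh1_site_srHook_DC2 -v SOK -t crm_config -s SAPHanaSR
rh1adm ALL=(ALL) NOPASSWD: /usr/sbin/crm_attribute -n hana_rh1_site_srHook_DC2 -v SFAIL -t crm_config -s SAPHanaSR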
Verify that the command path is correct:
[root]# ls /usr/sbin/crm_attribute
/usr/sbin/crm_attribute
Fix the sudo configuration. For more information, see Configuring the HanaSR HA/DR provider for the srConnectionChanged() hook method.
Repeat any fixing steps on all nodes. The sudo configuration must be identical on all instances.
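To confirm on each node that a matching entry is in place, you can list the sudo privileges of the <sid>adm user. This sketch assumes the SID rh1 from the examples above:
[root]# sudo -l -U rh1adm
The output should include the crm_attribute entry from your sudoers file.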
10.2. The HANA instance does not start after hook changes
You recently made changes to an HA/DR provider section in global.ini, and now the HANA instance no longer starts.
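You can confirm the state of the instance before and after any fix, for example with sapcontrol. This is only a quick check; the instance number 02 is taken from the HDB02 examples in this chapter:
rh1adm $ sapcontrol -nr 02 -function GetProcessList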
Procedure
Go to the HANA trace logs directory, as the <sid>adm user:
rh1adm $ cdtrace
Check for errors related to the HA/DR providers in the HANA nameserver process alert log:
rh1adm $ grep ha_dr_provider nameserver_alert_*.trc
...
ha_dr_provider PythonProxyImpl.cpp(00145) : import of hanasr failed: No module named 'hanasr'
...
ha_dr_provider HADRProviderManager.cpp(00100) : could not load HA/DR Provider 'hanasr' from /usr/share/sap-hana-ha/
Identify the root cause, for example a misspelled HA/DR provider name or a wrong path. Check the path and the hook script name. In this example, the HA/DR provider name hanasr does not match the hook script name HanaSR:
rh1adm $ ls /usr/share/sap-hana-ha/
ChkSrv.py  HanaSR.py  samples
Correct the HanaSR HA/DR provider configuration:
[ha_dr_provider_hanasr]
provider = HanaSR
path = /usr/share/sap-hana-ha/
execution_order = 1
- provider must match the name of the Python hook script. It is case-sensitive and is specified without the .py file suffix.
- path must be the path in which the hook script is stored.
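After correcting the configuration, a minimal check is to start the instance again and repeat the trace search from the beginning of this procedure; the import error should no longer appear. This assumes the same instance and paths as in the examples above:
rh1adm $ HDB start
rh1adm $ cdtrace
rh1adm $ grep ha_dr_provider nameserver_alert_*.trc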
10.3. A cluster node is reported as offline during maintenance
When maintenance-mode is set for the cluster, for example during a HANA update, the cluster still notices issues between the nodes, but does not trigger recovery actions yet.
If you encounter such a situation, you must first fix the cause of the issue before you lift the maintenance mode.
Example: the corosync communication between the nodes is blocked in a 2-node cluster
Both nodes report the other node as offline. If the maintenance mode is removed in this situation, the cluster tries to recover by fencing one node. This can have a severe impact on your ongoing HANA maintenance activity.
...
*** Resource management is DISABLED ***
The cluster will not attempt to start, stop or recover services
Node List:
* Node hana1 (1): online, feature set 3.19.0
* Node hana2 (2): UNCLEAN (offline)
Full List of Resources:
* Clone Set: cln_SAPHanaTop_RH1_HDB02 [rsc_SAPHanaTop_RH1_HDB02] (maintenance):
* rsc_SAPHanaTop_RH1_HDB02 (ocf:heartbeat:SAPHanaTopology): Started hana2 (UNCLEAN, maintenance)
* rsc_SAPHanaTop_RH1_HDB02 (ocf:heartbeat:SAPHanaTopology): Started hana1 (maintenance)
* Clone Set: cln_SAPHanaCon_RH1_HDB02 [rsc_SAPHanaCon_RH1_HDB02] (promotable, maintenance):
* rsc_SAPHanaCon_RH1_HDB02 (ocf:heartbeat:SAPHanaController): Unpromoted hana2 (UNCLEAN, maintenance)
* rsc_SAPHanaCon_RH1_HDB02 (ocf:heartbeat:SAPHanaController): Promoted hana1 (maintenance)
* Clone Set: cln_SAPHanaFil_RH1_HDB02 [rsc_SAPHanaFil_RH1_HDB02] (maintenance):
* rsc_SAPHanaFil_RH1_HDB02 (ocf:heartbeat:SAPHanaFilesystem): Started hana2 (UNCLEAN, maintenance)
* rsc_SAPHanaFil_RH1_HDB02 (ocf:heartbeat:SAPHanaFilesystem): Started hana1 (maintenance)
* rsc_vip_RH1_HDB02_primary (ocf:heartbeat:IPaddr2): Started hana1 (maintenance)
* rsc_vip_RH1_HDB02_readonly (ocf:heartbeat:IPaddr2): Started hana2 (UNCLEAN, maintenance)
...
Identify the root cause of the issue, for example:
- Planned network maintenance on the cluster communication connection in parallel to your HANA maintenance.
- Unplanned outage of network connections due to network device failures or misconfiguration at the operating system or network level.
- Firewall configuration blocking cluster communication ports (see the checks sketched after this list).
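To narrow down the last two causes, you can check the corosync link status and the local firewall configuration on both nodes. This is only a sketch using standard RHEL tooling; the high-availability firewalld service is relevant only if firewalld is in use:
[root]# corosync-cfgtool -s
[root]# firewall-cmd --list-services
If corosync reports its links as faulty or not connected, or if the high-availability service is missing from the firewall configuration, cluster communication between the nodes is blocked.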
Fix any issue to prevent the cluster from taking recovery measures when the cluster maintenance is removed.
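Once both nodes report each other as online again, for example in the pcs status output, you can lift the maintenance mode with the standard pcs property command:
[root]# pcs status
[root]# pcs property set maintenance-mode=false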