6.2. Triggering indexserver crash recovery
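Test the ChkSrv HA/DR provider by simulating a crash of the hdbindexserver process. You can run this test on either the primary or the secondary instance. The exact recovery action depends on the overall configuration.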
Prerequisites
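- You have configured the ChkSrv HA/DR provider. If you have not configured this optional hook, skip this test.
- Your HANA instances have a healthy HANA system replication.
- The cluster status shows no failures.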
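The replication prerequisite can be checked from the command line before you start. The following is a minimal sketch, assuming the exit codes that SAP documents for the systemReplicationStatus.py script (10 to 15, with 15 meaning active replication). The function name sr_status_text is hypothetical, and the codes should be verified against your HANA revision.

```shell
#!/bin/sh
# Map the exit code of systemReplicationStatus.py to a readable state.
# Assumption: exit codes 10..15 as documented by SAP; verify for your revision.
sr_status_text() {
    case "$1" in
        10) echo "NoHSR" ;;
        11) echo "Error" ;;
        12) echo "Unknown" ;;
        13) echo "Initializing" ;;
        14) echo "Syncing" ;;
        15) echo "Active" ;;
        *)  echo "Unrecognized" ;;
    esac
}

# Example use as <sid>adm on the primary (assumed script location, not run here):
#   python $DIR_EXECUTABLE/python_support/systemReplicationStatus.py >/dev/null
#   sr_status_text $?
```

Only proceed with the test when the reported state is Active.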
Procedure
- As the <sid>adm user, monitor the HANA instance processes in a separate terminal:

      rh1adm $ watch "sapcontrol -nr ${TINSTANCE} -function GetProcessList | column -s ',' -t"

- In another terminal, kill the hdbindexserver process:

      rh1adm $ kill <PID>
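The <PID> value is shown in the last column of the GetProcessList output. As a minimal sketch, assuming the usual comma-separated output format of sapcontrol, a small helper (the name indexserver_pid is hypothetical) can extract it:

```shell
#!/bin/sh
# Read "sapcontrol ... -function GetProcessList" output on stdin and print
# the PID (last comma-separated field) of every hdbindexserver row.
indexserver_pid() {
    awk -F',' '{
        name = $1
        pid = $NF
        gsub(/^[ \t]+|[ \t\r]+$/, "", name)   # trim the process name
        gsub(/[ \t\r]/, "", pid)              # strip padding around the PID
        if (name == "hdbindexserver") print pid
    }'
}

# Example use as <sid>adm (not run here):
#   sapcontrol -nr ${TINSTANCE} -function GetProcessList | indexserver_pid
```

If the instance runs several tenant databases, more than one hdbindexserver row may be listed, so review the output before passing a PID to kill.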
Verification
- As the <sid>adm user, check the dedicated HANA trace log on the same instance to identify the event and the resulting action:

      rh1adm $ cdtrace; less nameserver_chksrv.trc
      ...
      ChkSrv srServiceStateChanged method called with SAPSYSTEMNAME=RH1
      srv:indexserver-30203-stopping-yes db:RH1-3-yes daem:yes
      LOST: indexserver event looks like a lost indexserver (status=stopping)
      LOST: stop instance. action_on_lost=stop
      ChkSrv version 1.001.1. Method srServiceStateChanged method called.
      ...

- As the root user, check the cluster status for resource failure information:

      [root]# pcs status --full
      ...
      Failed Resource Actions:
        * rsc_SAPHanaCon_RH1_HDB02_start_0 on node1 'not running' (7): call=61, status='complete', ...
      ...

- As the root user, check the related actions on the cluster side:

      [root]# grep rsc_SAPHanaCon_RH1_HDB02 /var/log/messages
      ...
      SAPHanaController(rsc_SAPHanaCon_RH1_HDB02)[149199]: INFO: ##-2-## DEC: PRIMDEFECT (in DEMOTED status)
      SAPHanaController(rsc_SAPHanaCon_RH1_HDB02)[149206]: INFO: ##-2-## RA ==== end action monitor_clone with rc=7 (1.2.7) (3s; times=0m0.051s 0m0.079s 0m1.325s 0m1.137s)====
      pacemaker-controld[1694]: notice: Result of monitor operation for rsc_SAPHanaCon_RH1_HDB02 on node1: not running
      pacemaker-controld[1694]: notice: Transition 142 action 18 (rsc_SAPHanaCon_RH1_HDB02_monitor_61000 on node1): expected 'ok' but got 'not running'
      pacemaker-attrd[1692]: notice: Setting last-failure-rsc_SAPHanaCon_RH1_HDB02#monitor_61000[node1] in instance_attributes: (unset) -> 1746703980
      pacemaker-attrd[1692]: notice: Setting fail-count-rsc_SAPHanaCon_RH1_HDB02#monitor_61000[node1] in instance_attributes: (unset) -> 1
      pacemaker-attrd[1692]: notice: Setting master-rsc_SAPHanaCon_RH1_HDB02[node2] in instance_attributes: 100 -> 145
      pacemaker-schedulerd[1693]: warning: Unexpected result (not running) was recorded for monitor of rsc_SAPHanaCon_RH1_HDB02:0 on node1 at ...
      pacemaker-schedulerd[1693]: notice: Actions: Recover rsc_SAPHanaCon_RH1_HDB02:0 ( Unpromoted node1 )
      pacemaker-controld[1694]: notice: Initiating monitor operation rsc_SAPHanaCon_RH1_HDB02_monitor_59000 on node2
      pacemaker-controld[1694]: notice: Initiating stop operation rsc_SAPHanaCon_RH1_HDB02_stop_0 locally on node1
      pacemaker-controld[1694]: notice: Requesting local execution of stop operation for rsc_SAPHanaCon_RH1_HDB02 on node1
      ...

The next SAPHanaController resource monitor reports the unexpectedly stopped HANA instance as a failure and starts the recovery steps that match the configuration. If PREFER_SITE_TAKEOVER is enabled and the test was run on the primary instance, this triggers a HANA takeover to the secondary instance.
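To confirm quickly which action the hook took, you can filter the trace file instead of paging through it. The following is a minimal sketch, assuming the trace lines follow the "LOST: ... action_on_lost=<value>" pattern shown in the trace excerpt of this section; the function name chksrv_lost_actions is hypothetical.

```shell
#!/bin/sh
# Read ChkSrv trace text on stdin and print the action_on_lost value of
# every "LOST:" line that reports one.
chksrv_lost_actions() {
    sed -n 's/.*LOST:.*action_on_lost=\([A-Za-z_]*\).*/\1/p'
}

# Example use as <sid>adm from the trace directory (not run here):
#   chksrv_lost_actions < nameserver_chksrv.trc
```

An empty result means that no lost-indexserver action was recorded in the file that was read.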
Next steps
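- Clear any failure notifications in the cluster that may remain from earlier tests. For more information, see Cleaning up the failure history.
- If your configuration requires it, manually re-register the stopped former primary HANA instance and start it with HANA tools. For more information, see Registering the former primary after a takeover.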
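The follow-up commands depend on your setup, so it can help to print them for review before running anything. The following is a minimal sketch: the function name print_followup_cmds is hypothetical, and the replication mode, operation mode, host, and site name are illustrative placeholders, not values taken from this procedure. Manual registration is only needed when the cluster does not register the former primary for you.

```shell
#!/bin/sh
# Print (do not run) typical follow-up commands so they can be reviewed first.
# Arguments: cluster resource name, remote (new primary) host, instance number,
# and the site name of the instance being re-registered.
# Assumption: sync/logreplay are placeholders; use the modes of your landscape.
print_followup_cmds() {
    rsc=$1 remote_host=$2 inst=$3 site=$4
    echo "pcs resource cleanup ${rsc}"
    echo "hdbnsutil -sr_register --remoteHost=${remote_host} --remoteInstance=${inst} --replicationMode=sync --operationMode=logreplay --name=${site}"
    echo "HDB start"
}

# Example (hypothetical host and site names):
#   print_followup_cmds rsc_SAPHanaCon_RH1_HDB02 node2 02 DC1
```

Run the pcs command as root and the hdbnsutil and HDB commands as the <sid>adm user on the node of the stopped instance.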