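5.4. Testing failure of the ERS instance
Check that the pacemaker cluster takes the necessary action when the enqueue replication server (ERS) of the ASCS instance fails.
Test prerequisites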
- Both cluster nodes are up, with the resource groups for ASCS and ERS running:
    [root@node1]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
      * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node2
      * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node1
- All failures for the resources and resource groups are cleared, and the failcounts are reset.
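The first prerequisite can also be checked in a script rather than by eye. The following is a minimal sketch, assuming the resource lines of `pcs status` have the layout shown above (`* <resource> (<agent>): Started <node>`). The helper names `resource_node` and `check_prereqs` are hypothetical and not part of pcs.

```shell
# Hypothetical helper: reads `pcs status` output on stdin and prints the node
# on which the named resource is reported as Started (prints nothing otherwise).
# Assumes the line layout "* <resource> (<agent>): Started <node>".
resource_node() {
    awk -v res="$1" '$1 == "*" && $2 == res && $4 == "Started" { print $5 }'
}

# Hypothetical helper: succeeds only if both the ASCS and the ERS resource
# are reported as Started somewhere in the cluster.
check_prereqs() {
    status=$(cat)
    ascs_node=$(printf '%s\n' "$status" | resource_node S4H_ascs20)
    ers_node=$(printf '%s\n' "$status" | resource_node S4H_ers29)
    [ -n "$ascs_node" ] && [ -n "$ers_node" ]
}
```

On a cluster node, this could be used as `pcs status | check_prereqs && echo "prerequisites met"`.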
Test procedure
- Identify the PID of the enqueue replication server process on the node where the ERS instance is running.
- Send a SIGKILL signal to the identified process.
Monitoring
Run the following command in a separate terminal during the test:
    [root@node2]# watch -n 1 pcs status
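Expected behavior
- The enqueue replication server process is killed.
- The pacemaker cluster takes the necessary action according to its configuration. In this case, it restarts the ERS instance on the same node.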
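The expected restart can be waited for in a script instead of being watched interactively. The following is a rough sketch, not part of the documented procedure: the helper name `wait_for_started`, the one-second polling interval, and the default timeout of 120 seconds are all assumptions, as is the `pcs status` line layout `* <resource> (<agent>): Started <node>`.

```shell
# Hypothetical helper: polls `pcs status` once per second until the resource
# is reported as Started on the expected node, or until the timeout (in
# seconds, default 120; an assumed value) expires.
wait_for_started() {
    res="$1"
    node="$2"
    timeout="${3:-120}"
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        # awk exits 0 only if a matching "Started <node>" line was seen.
        if pcs status | awk -v res="$res" -v node="$node" \
            '$1 == "*" && $2 == res && $4 == "Started" && $5 == node { found = 1 }
             END { exit !found }'; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 1
}
```

For this test, a call such as `wait_for_started S4H_ers29 node1` would return successfully once the ERS resource is running on node1 again.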
Test
- Switch to the <sid>adm user:
    [root@node1]# su - s4hadm
- Identify the PID of enqr.sap:
    node1:s4hadm 56> pgrep -af enqr.sap
    532273 enqr.sapS4H_ERS29 pf=/usr/sap/S4H/SYS/profile/S4H_ERS29_s4ers
- Kill the identified process:
    node1:s4hadm 58> kill -9 532273
- Check the cluster "Failed Resource Actions":
    [root@node1]# pcs status | grep "Failed Resource Actions" -A1
    Failed Resource Actions:
      * S4H_ers29 2m-interval monitor on node1 returned 'not running' at Thu Dec 7 13:15:02 2023
- The ERS instance restarts on the same node, without disturbing the ASCS instance that is already running on the other node:
    [root@node1]# pcs status | egrep -e "S4H_ascs20|S4H_ers29"
      * S4H_ascs20 (ocf:heartbeat:SAPInstance): Started node2
      * S4H_ers29 (ocf:heartbeat:SAPInstance): Started node1
      * S4H_ers29 2m-interval monitor on node1 returned 'not running' at Thu Dec 7 13:15:02 2023
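The failed-action check can be narrowed to a single resource. The following is a minimal sketch, assuming the failed actions are listed as `* <resource> ...` lines directly under a `Failed Resource Actions:` header, as in the output above. The helper name `failed_actions_for` is hypothetical.

```shell
# Hypothetical helper: reads `pcs status` output on stdin and prints the
# failed-action lines that belong to the named resource.
# Assumes entries follow the "Failed Resource Actions:" header as "* <resource> ..."
# lines and that the section ends at the first line that is not such an entry.
failed_actions_for() {
    awk -v res="$1" '
        /Failed Resource Actions:/ { in_failed = 1; next }
        in_failed && $1 != "*"     { in_failed = 0 }
        in_failed && $2 == res     { print }
    '
}
```

For example, `pcs status | failed_actions_for S4H_ers29` would print only the monitor failure recorded for the ERS resource.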
Recovery procedure
Clear the failed action:
    [root@node1]# pcs resource cleanup S4H_ers29
    …
    Waiting for 1 reply from the controller
    ... got reply (done)
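After the cleanup, it can be useful to confirm that the cluster no longer reports any failed actions. The following is a minimal sketch, assuming that `pcs status` prints a `Failed Resource Actions` section only when failures are recorded. The helper name `no_failed_actions` is hypothetical.

```shell
# Hypothetical helper: reads `pcs status` output on stdin and succeeds only
# if it contains no "Failed Resource Actions" section.
no_failed_actions() {
    ! grep -q "Failed Resource Actions"
}
```

It could be used as `pcs status | no_failed_actions && echo "cleanup confirmed"`.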