5.6. Test 4: Failover of the primary node to the first site
| Subject of the test | Switch the primary back to a cluster node. Fail back and enable the cluster again. Re-register the third site as a secondary. |
| Test prerequisites | |
| Test steps | Check the expected primary of the cluster. Fail over from the DC3 node to the DC1 node. Check whether the former secondaries have switched to the new primary. Re-register remotehost3 as the new secondary. Take the cluster out of maintenance mode. |
| Monitoring the test | Start the monitors on the new primary and on the secondary (the commands are listed in the detailed description below). |
| Starting the test | Check the expected primary of the cluster: the VIP and the promoted SAP HANA resource should be running on the same node, which is the potential new primary. Run the takeover on this potential primary host, then re-register the former primary as the new secondary. |
| Expected result | The new primary starts SAP HANA. The replication status shows all 3 sites replicating. The second cluster site is automatically re-registered to the new primary site. The DR site becomes an additional copy of the database. |
| Way to return to the initial state | Run test 3. |
Detailed description
Check that the cluster is set to `maintenance-mode`:

```
[root@clusternode1]# pcs property config maintenance-mode
Cluster Properties:
 maintenance-mode: true
```

If `maintenance-mode` is not true, you can set it with:

```
[root@clusternode1]# pcs property set maintenance-mode=true
```
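The check-and-set step above can also be scripted. A minimal sketch, assuming a hypothetical helper name `maintenance_cmd` (not part of the official procedure): it only inspects the text printed by `pcs property config maintenance-mode` and decides whether a `pcs property set` call is still needed.

```shell
#!/bin/sh
# Sketch (assumption, not the official procedure): decide whether
# maintenance-mode still has to be set, based on the captured output of
# "pcs property config maintenance-mode".
# Prints the pcs command to run, or "nothing to do".
maintenance_cmd() {
    case "$1" in
        *"maintenance-mode: true"*) echo "nothing to do" ;;
        *)                          echo "pcs property set maintenance-mode=true" ;;
    esac
}

# Example: feed it the captured config output, e.g.
#   cfg=$(pcs property config maintenance-mode)
cfg="Cluster Properties:
 maintenance-mode: false"
maintenance_cmd "$cfg"    # prints: pcs property set maintenance-mode=true
```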
Check the system replication status and discover the primary database on all nodes.

First, discover the primary database with the following command:

```
clusternode1:rh2adm> hdbnsutil -sr_state | egrep -e "^mode:|primary masters"
```

The output should look like this.

On clusternode1:

```
clusternode1:rh2adm> hdbnsutil -sr_state | egrep -e "^mode:|primary masters"
mode: syncmem
primary masters: remotehost3
```

On clusternode2:

```
clusternode2:rh2adm> hdbnsutil -sr_state | egrep -e "^mode:|primary masters"
mode: syncmem
primary masters: remotehost3
```

On remotehost3:

```
remotehost3:rh2adm> hdbnsutil -sr_state | egrep -e "^mode:|primary masters"
mode: primary
```

On all three nodes, the primary database is remotehost3.
On this primary database, you must make sure that the system replication status is ACTIVE for all three nodes and that the return code is 15:

```
remotehost3:rh2adm> python /usr/sap/${SAPSYSTEMNAME}/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py
|Database |Host |Port |Service Name |Volume ID |Site ID |Site Name |Secondary |Secondary |Secondary |Secondary |Secondary |Replication |Replication |Replication |Secondary |
|         |     |     |             |          |        |          |Host      |Port      |Site ID   |Site Name |Active Status |Mode    |Status |Status Details |Fully Synced |
|-------- |------ |----- |------------ |--------- |------- |--------- |--------- |--------- |--------- |--------- |------------- |----------- |----------- |-------------- |------------ |
|SYSTEMDB |remotehost3 |30201 |nameserver  | 1 | 3 |DC3 |clusternode2 | 30201 | 2 |DC2 |YES |SYNCMEM |ACTIVE | | True |
|RH2      |remotehost3 |30207 |xsengine    | 2 | 3 |DC3 |clusternode2 | 30207 | 2 |DC2 |YES |SYNCMEM |ACTIVE | | True |
|RH2      |remotehost3 |30203 |indexserver | 3 | 3 |DC3 |clusternode2 | 30203 | 2 |DC2 |YES |SYNCMEM |ACTIVE | | True |
|SYSTEMDB |remotehost3 |30201 |nameserver  | 1 | 3 |DC3 |clusternode1 | 30201 | 1 |DC1 |YES |SYNCMEM |ACTIVE | | True |
|RH2      |remotehost3 |30207 |xsengine    | 2 | 3 |DC3 |clusternode1 | 30207 | 1 |DC1 |YES |SYNCMEM |ACTIVE | | True |
|RH2      |remotehost3 |30203 |indexserver | 3 | 3 |DC3 |clusternode1 | 30203 | 1 |DC1 |YES |SYNCMEM |ACTIVE | | True |

status system replication site "2": ACTIVE
status system replication site "1": ACTIVE
overall system replication status: ACTIVE

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: PRIMARY
site id: 3
site name: DC3
[rh2adm@remotehost3: python_support]# echo $?
15
```
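The exit code of `systemReplicationStatus.py` is used throughout this test: 15 means ACTIVE, and 11 (ERROR) appears later on the old primary. As a sketch, assuming SAP's documented 10-15 return-code convention, a hypothetical helper `sr_status_name` can translate the code into a readable name:

```shell
#!/bin/sh
# Sketch: map systemReplicationStatus.py exit codes to replication state
# names (10-15 per SAP's documented convention; names are an assumption).
# 15 (ACTIVE) is what this test expects; 11 (ERROR) shows up on the old
# primary after the takeover.
sr_status_name() {
    case "$1" in
        10) echo "NONE" ;;
        11) echo "ERROR" ;;
        12) echo "UNKNOWN" ;;
        13) echo "INITIALIZING" ;;
        14) echo "SYNCING" ;;
        15) echo "ACTIVE" ;;
        *)  echo "UNDEFINED" ;;
    esac
}

# Usage on the primary:
#   python .../systemReplicationStatus.py > /dev/null; sr_status_name $?
sr_status_name 15    # prints: ACTIVE
```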
Check that the `sr_state` is consistent on all three nodes. Run on each node:

```
clusternode1:rh2adm> hdbnsutil -sr_state --sapcontrol=1 | grep site.*Mode
clusternode2:rh2adm> hdbnsutil -sr_state --sapcontrol=1 | grep site.*Mode
remotehost3:rh2adm> hdbnsutil -sr_state --sapcontrol=1 | grep site.*Mode
```

The output should be identical on all nodes:

```
siteReplicationMode/DC3=primary
siteReplicationMode/DC1=syncmem
siteReplicationMode/DC2=syncmem
siteOperationMode/DC3=primary
siteOperationMode/DC1=logreplay
siteOperationMode/DC2=logreplay
```

Start the monitors in separate windows.
On clusternode1, start:

```
clusternode1:rh2adm> watch "python /usr/sap/${SAPSYSTEMNAME}/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py; echo \$?"
```

On remotehost3, start:

```
remotehost3:rh2adm> watch "python /usr/sap/${SAPSYSTEMNAME}/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py; echo \$?"
```

On clusternode2, start:

```
clusternode2:rh2adm> watch "hdbnsutil -sr_state --sapcontrol=1 |grep siteReplicationMode"
```

Start the test.

To switch the primary to clusternode1, run on clusternode1:

```
clusternode1:rh2adm> hdbnsutil -sr_takeover
done.
```

Check the output of the monitors.
The monitor on clusternode1 will change to:

```
Every 2.0s: python systemReplicationStatus.py; echo $?    clusternode1: Mon Sep  4 23:34:30 2023

|Database |Host |Port |Service Name |Volume ID |Site ID |Site Name |Secondary |Secondary |Secondary |Secondary |Secondary |Replication |Replication |Replication |Secondary |
|         |     |     |             |          |        |          |Host      |Port      |Site ID   |Site Name |Active Status |Mode    |Status |Status Details |Fully Synced |
|-------- |------ |----- |------------ |--------- |------- |--------- |--------- |--------- |--------- |--------- |------------- |----------- |----------- |-------------- |------------ |
|SYSTEMDB |clusternode1 |30201 |nameserver  | 1 | 1 |DC1 |clusternode2 | 30201 | 2 |DC2 |YES |SYNCMEM |ACTIVE | | True |
|RH2      |clusternode1 |30207 |xsengine    | 2 | 1 |DC1 |clusternode2 | 30207 | 2 |DC2 |YES |SYNCMEM |ACTIVE | | True |
|RH2      |clusternode1 |30203 |indexserver | 3 | 1 |DC1 |clusternode2 | 30203 | 2 |DC2 |YES |SYNCMEM |ACTIVE | | True |

status system replication site "2": ACTIVE
overall system replication status: ACTIVE

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: PRIMARY
site id: 1
site name: DC1
15
```

An important piece of information here is, again, the return code 15.
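Instead of watching interactively, waiting for the return code to reach 15 can be scripted. A sketch under assumptions: `wait_until_active` is a hypothetical helper, and `fake_status` below is a test stub standing in for the real `systemReplicationStatus.py` call.

```shell
#!/bin/sh
# Sketch: poll a status command until it exits with 15 (ACTIVE), the way
# "watch" is used interactively above. $1 names the command to poll
# (a stand-in for "python .../systemReplicationStatus.py > /dev/null").
wait_until_active() {
    check_cmd=$1 interval=${2:-5} tries=${3:-60}
    i=0
    while [ "$i" -lt "$tries" ]; do
        "$check_cmd" >/dev/null 2>&1
        [ $? -eq 15 ] && return 0
        i=$((i + 1))
        sleep "$interval"
    done
    return 1
}

# Demo with a stub that becomes ACTIVE (15) on the third call:
count=0
fake_status() {
    count=$((count + 1))
    [ "$count" -ge 3 ] && return 15 || return 14   # 14 = still syncing
}
wait_until_active fake_status 0 10 && echo "replication ACTIVE"
```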
The monitor on clusternode2 will change to:

```
Every 2.0s: hdbnsutil -sr_state --sapcontrol=1 |grep site.*Mode    clusternode2: Mon Sep  4 23:35:18 2023

siteReplicationMode/DC1=primary
siteReplicationMode/DC2=syncmem
siteOperationMode/DC1=primary
siteOperationMode/DC2=logreplay
```

DC3 has disappeared and needs to be re-registered.
On remotehost3, `systemReplicationStatus` reports an error and the return code changes to 11.

Check whether the cluster nodes have been re-registered:

```
clusternode1:rh2adm> hdbnsutil -sr_state

System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~

online: true

mode: primary
operation mode: primary
site id: 1
site name: DC1

is source system: true
is secondary/consumer system: false
has secondaries/consumers attached: true
is a takeover active: false
is primary suspended: false

Host Mappings:
~~~~~~~~~~~~~~

clusternode1 -> [DC2] clusternode2
clusternode1 -> [DC1] clusternode1

Site Mappings:
~~~~~~~~~~~~~~
DC1 (primary/primary)
    |---DC2 (syncmem/logreplay)

Tier of DC1: 1
Tier of DC2: 2

Replication mode of DC1: primary
Replication mode of DC2: syncmem

Operation mode of DC1: primary
Operation mode of DC2: logreplay

Mapping: DC1 -> DC2
done.
```

The Site Mappings show that clusternode2 (DC2) has been re-registered.
Check or enable the vip resource:

```
[root@clusternode1]# pcs resource
  * Clone Set: SAPHanaTopology_RH2_02-clone [SAPHanaTopology_RH2_02] (unmanaged):
    * SAPHanaTopology_RH2_02    (ocf::heartbeat:SAPHanaTopology):    Started clusternode2 (unmanaged)
    * SAPHanaTopology_RH2_02    (ocf::heartbeat:SAPHanaTopology):    Started clusternode1 (unmanaged)
  * Clone Set: SAPHana_RH2_02-clone [SAPHana_RH2_02] (promotable, unmanaged):
    * SAPHana_RH2_02    (ocf::heartbeat:SAPHana):    Master clusternode2 (unmanaged)
    * SAPHana_RH2_02    (ocf::heartbeat:SAPHana):    Slave clusternode1 (unmanaged)
  * vip_RH2_02_MASTER    (ocf::heartbeat:IPaddr2):    Stopped (disabled, unmanaged)
```

The vip resource `vip_RH2_02_MASTER` is stopped. To run it again:

```
[root@clusternode1]# pcs resource enable vip_RH2_02_MASTER
Warning: 'vip_RH2_02_MASTER' is unmanaged
```

The warning is correct, because the cluster will not start any resources until `maintenance-mode=false`.
Take the cluster out of maintenance mode.

Before unsetting `maintenance-mode`, we should start two monitors in separate windows to see the changes. On clusternode2, run:

```
[root@clusternode2]# watch pcs status --full
```

On clusternode1, run:

```
clusternode1:rh2adm> watch "python /usr/sap/${SAPSYSTEMNAME}/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py; echo \$?"
```

Now you can unset `maintenance-mode` on clusternode1:

```
[root@clusternode1]# pcs property set maintenance-mode=false
```

The `pcs status` monitor should show that everything is now running as expected:

```
Every 2.0s: pcs status --full             clusternode1: Tue Sep  5 00:01:17 2023

Cluster name: cluster1
Cluster Summary:
  * Stack: corosync
  * Current DC: clusternode1 (1) (version 2.1.2-4.el8_6.6-ada5c3b36e2) - partition with quorum
  * Last updated: Tue Sep  5 00:01:17 2023
  * Last change:  Tue Sep  5 00:00:30 2023 by root via crm_attribute on clusternode1
  * 2 nodes configured
  * 6 resource instances configured

Node List:
  * Online: [ clusternode1 (1) clusternode2 (2) ]

Full List of Resources:
  * auto_rhevm_fence1    (stonith:fence_rhevm):    Started clusternode1
  * Clone Set: SAPHanaTopology_RH2_02-clone [SAPHanaTopology_RH2_02]:
    * SAPHanaTopology_RH2_02    (ocf::heartbeat:SAPHanaTopology):    Started clusternode2
    * SAPHanaTopology_RH2_02    (ocf::heartbeat:SAPHanaTopology):    Started clusternode1
  * Clone Set: SAPHana_RH2_02-clone [SAPHana_RH2_02] (promotable):
    * SAPHana_RH2_02    (ocf::heartbeat:SAPHana):    Slave clusternode2
    * SAPHana_RH2_02    (ocf::heartbeat:SAPHana):    Master clusternode1
  * vip_RH2_02_MASTER    (ocf::heartbeat:IPaddr2):    Started clusternode1

Node Attributes:
  * Node: clusternode1 (1):
    * hana_rh2_clone_state            : PROMOTED
    * hana_rh2_op_mode                : logreplay
    * hana_rh2_remoteHost             : clusternode2
    * hana_rh2_roles                  : 4:P:master1:master:worker:master
    * hana_rh2_site                   : DC1
    * hana_rh2_sra                    : -
    * hana_rh2_srah                   : -
    * hana_rh2_srmode                 : syncmem
    * hana_rh2_sync_state             : PRIM
    * hana_rh2_version                : 2.00.062.00
    * hana_rh2_vhost                  : clusternode1
    * lpa_rh2_lpt                     : 1693872030
    * master-SAPHana_RH2_02           : 150
  * Node: clusternode2 (2):
    * hana_rh2_clone_state            : DEMOTED
    * hana_rh2_op_mode                : logreplay
    * hana_rh2_remoteHost             : clusternode1
    * hana_rh2_roles                  : 4:S:master1:master:worker:master
    * hana_rh2_site                   : DC2
    * hana_rh2_sra                    : -
    * hana_rh2_srah                   : -
    * hana_rh2_srmode                 : syncmem
    * hana_rh2_sync_state             : SOK
    * hana_rh2_version                : 2.00.062.00
    * hana_rh2_vhost                  : clusternode2
    * lpa_rh2_lpt                     : 30
    * master-SAPHana_RH2_02           : 100

Migration Summary:

Tickets:

PCSD Status:
  clusternode1: Online
  clusternode2: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
```

After a manual interaction, it is always a good idea to clean up the cluster, as described in Cluster Cleanup.
Re-register remotehost3 to the new primary on clusternode1.

remotehost3 needs to be re-registered. To monitor the progress, start on clusternode1:

```
clusternode1:rh2adm> watch -n 5 'python /usr/sap/${SAPSYSTEMNAME}/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py ; echo Status $?'
```

On remotehost3, start:

```
remotehost3:rh2adm> watch 'hdbnsutil -sr_state --sapcontrol=1 |grep siteReplicationMode'
```

Now you can re-register remotehost3 with this command:
```
remotehost3:rh2adm> hdbnsutil -sr_register --remoteHost=clusternode1 --remoteInstance=${TINSTANCE} --replicationMode=async --name=DC3 --remoteName=DC1 --operationMode=logreplay --online
```

The monitor on clusternode1 will change to:

```
Every 5.0s: python /usr/sap/${SAPSYSTEMNAME}/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py ; echo Status $?    clusternode1: Tue Sep  5 00:14:40 2023

|Database |Host |Port |Service Name |Volume ID |Site ID |Site Name |Secondary |Secondary |Secondary |Secondary |Secondary |Replication |Replication |Replication |Secondary |
|         |     |     |             |          |        |          |Host      |Port      |Site ID   |Site Name |Active Status |Mode    |Status |Status Details |Fully Synced |
|-------- |------ |----- |------------ |--------- |------- |--------- |--------- |--------- |--------- |--------- |------------- |----------- |----------- |-------------- |------------ |
|SYSTEMDB |clusternode1 |30201 |nameserver  | 1 | 1 |DC1 |remotehost3  | 30201 | 3 |DC3 |YES |ASYNC   |ACTIVE | | True |
|RH2      |clusternode1 |30207 |xsengine    | 2 | 1 |DC1 |remotehost3  | 30207 | 3 |DC3 |YES |ASYNC   |ACTIVE | | True |
|RH2      |clusternode1 |30203 |indexserver | 3 | 1 |DC1 |remotehost3  | 30203 | 3 |DC3 |YES |ASYNC   |ACTIVE | | True |
|SYSTEMDB |clusternode1 |30201 |nameserver  | 1 | 1 |DC1 |clusternode2 | 30201 | 2 |DC2 |YES |SYNCMEM |ACTIVE | | True |
|RH2      |clusternode1 |30207 |xsengine    | 2 | 1 |DC1 |clusternode2 | 30207 | 2 |DC2 |YES |SYNCMEM |ACTIVE | | True |
|RH2      |clusternode1 |30203 |indexserver | 3 | 1 |DC1 |clusternode2 | 30203 | 2 |DC2 |YES |SYNCMEM |ACTIVE | | True |

status system replication site "3": ACTIVE
status system replication site "2": ACTIVE
overall system replication status: ACTIVE

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: PRIMARY
site id: 1
site name: DC1
Status 15
```

The monitor on remotehost3 will change to:

```
Every 2.0s: hdbnsutil -sr_state --sapcontrol=1 |grep site.*Mode    remotehost3: Tue Sep  5 02:15:28 2023

siteReplicationMode/DC1=primary
siteReplicationMode/DC3=async
siteReplicationMode/DC2=syncmem
siteOperationMode/DC1=primary
siteOperationMode/DC3=logreplay
siteOperationMode/DC2=logreplay
```

Now we have 3 entries again, and remotehost3 (DC3) is once more a secondary site replicating from clusternode1 (DC1).
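The `siteReplicationMode` lines shown by the monitors can also be compared programmatically. A minimal sketch, assuming a hypothetical helper name `primary_site`: it extracts the site that reports `primary` from `--sapcontrol=1` style output.

```shell
#!/bin/sh
# Sketch: extract the primary site name from
# "hdbnsutil -sr_state --sapcontrol=1" style output read on stdin,
# relying on the "siteReplicationMode/<SITE>=<mode>" line format above.
primary_site() {
    sed -n 's#^siteReplicationMode/\([^=]*\)=primary$#\1#p'
}

# Example with the output expected after the re-registration:
printf '%s\n' \
    "siteReplicationMode/DC1=primary" \
    "siteReplicationMode/DC3=async" \
    "siteReplicationMode/DC2=syncmem" | primary_site    # prints: DC1
```

Running the same pipeline against each node's output and comparing the results is an easy way to confirm that all three nodes agree on the primary.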
Check that all nodes are part of the system replication status of clusternode1.

Run on all three nodes:

```
clusternode1:rh2adm> hdbnsutil -sr_state --sapcontrol=1 | grep site.*Mode
clusternode2:rh2adm> hdbnsutil -sr_state --sapcontrol=1 | grep site.*Mode
remotehost3:rh2adm> hdbnsutil -sr_state --sapcontrol=1 | grep site.*Mode
```

On all nodes, we should get the same output:

```
siteReplicationMode/DC1=primary
siteReplicationMode/DC3=async
siteReplicationMode/DC2=syncmem
siteOperationMode/DC1=primary
siteOperationMode/DC3=logreplay
siteOperationMode/DC2=logreplay
```

Check `pcs status --full` for PRIM and SOK.
Run:

```
[root@clusternode1]# pcs status --full | grep sync_state
```

The output should be either PRIM or SOK:

```
    * hana_rh2_sync_state             : PRIM
    * hana_rh2_sync_state             : SOK
```

Finally, the cluster status should look like this, including `sync_state` PRIM and SOK:

```
[root@clusternode1]# pcs status --full
Cluster name: cluster1
Cluster Summary:
  * Stack: corosync
  * Current DC: clusternode1 (1) (version 2.1.2-4.el8_6.6-ada5c3b36e2) - partition with quorum
  * Last updated: Tue Sep  5 00:18:52 2023
  * Last change:  Tue Sep  5 00:16:54 2023 by root via crm_attribute on clusternode1
  * 2 nodes configured
  * 6 resource instances configured

Node List:
  * Online: [ clusternode1 (1) clusternode2 (2) ]

Full List of Resources:
  * auto_rhevm_fence1    (stonith:fence_rhevm):    Started clusternode1
  * Clone Set: SAPHanaTopology_RH2_02-clone [SAPHanaTopology_RH2_02]:
    * SAPHanaTopology_RH2_02    (ocf::heartbeat:SAPHanaTopology):    Started clusternode2
    * SAPHanaTopology_RH2_02    (ocf::heartbeat:SAPHanaTopology):    Started clusternode1
  * Clone Set: SAPHana_RH2_02-clone [SAPHana_RH2_02] (promotable):
    * SAPHana_RH2_02    (ocf::heartbeat:SAPHana):    Slave clusternode2
    * SAPHana_RH2_02    (ocf::heartbeat:SAPHana):    Master clusternode1
  * vip_RH2_02_MASTER    (ocf::heartbeat:IPaddr2):    Started clusternode1

Node Attributes:
  * Node: clusternode1 (1):
    * hana_rh2_clone_state            : PROMOTED
    * hana_rh2_op_mode                : logreplay
    * hana_rh2_remoteHost             : clusternode2
    * hana_rh2_roles                  : 4:P:master1:master:worker:master
    * hana_rh2_site                   : DC1
    * hana_rh2_sra                    : -
    * hana_rh2_srah                   : -
    * hana_rh2_srmode                 : syncmem
    * hana_rh2_sync_state             : PRIM
    * hana_rh2_version                : 2.00.062.00
    * hana_rh2_vhost                  : clusternode1
    * lpa_rh2_lpt                     : 1693873014
    * master-SAPHana_RH2_02           : 150
  * Node: clusternode2 (2):
    * hana_rh2_clone_state            : DEMOTED
    * hana_rh2_op_mode                : logreplay
    * hana_rh2_remoteHost             : clusternode1
    * hana_rh2_roles                  : 4:S:master1:master:worker:master
    * hana_rh2_site                   : DC2
    * hana_rh2_sra                    : -
    * hana_rh2_srah                   : -
    * hana_rh2_srmode                 : syncmem
    * hana_rh2_sync_state             : SOK
    * hana_rh2_version                : 2.00.062.00
    * hana_rh2_vhost                  : clusternode2
    * lpa_rh2_lpt                     : 30
    * master-SAPHana_RH2_02           : 100

Migration Summary:

Tickets:

PCSD Status:
  clusternode1: Online
  clusternode2: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
```

- See Check cluster status and Check database to verify that everything is working again.
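As a closing sanity check, the PRIM/SOK grep can be wrapped in a small helper. A sketch, assuming the hypothetical name `sync_states_ok`: it fails if any `sync_state` attribute shows anything other than PRIM or SOK (for example SFAIL).

```shell
#!/bin/sh
# Sketch: read "pcs status --full" output on stdin; succeed only if every
# hana_*_sync_state line shows PRIM or SOK.
sync_states_ok() {
    ! grep "sync_state" | grep -v -e ": PRIM" -e ": SOK" | grep -q .
}

# Example with the healthy output shown above:
printf '%s\n' \
    "    * hana_rh2_sync_state : PRIM" \
    "    * hana_rh2_sync_state : SOK" | sync_states_ok && echo "replication healthy"
```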