5.6. Test 4: Failover of the primary node to the first site
Subject of the test | The primary switches back to a cluster node. Fail back and enable the cluster again. Re-register the third site as a secondary. |
Test preconditions | |
Test steps | Check the expected primary of the cluster. Fail over from the DC3 node to the DC1 node. Check whether the former secondary has switched to the new primary. Re-register remotehost3 as the new secondary. Set up the cluster. |
Monitoring the test | On the new primary, start: |
Starting the test | Check the expected primary of the cluster: the VIP and the promoted SAP HANA resource should run on the same node, which is the potential new primary. On this potential new primary, re-register the former primary as the new secondary. |
Expected result | The new primary starts SAP HANA. The replication status will show all 3 sites replicating. The second cluster site automatically re-registers with the new primary site. The disaster recovery (DR) site becomes an additional copy of the database. |
Way to return to the initial state | Run Test 3. |
Detailed description
Check whether the cluster is set to maintenance-mode:
[root@clusternode1]# pcs property config maintenance-mode
Cluster Properties:
 maintenance-mode: true
If maintenance-mode is not true, you can set it with:
[root@clusternode1]# pcs property set maintenance-mode=true
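If you script this prerequisite, a minimal sketch (assumption: it is run as root on one cluster node) sets the property only when it is not already true:
# Minimal sketch: enable maintenance-mode only if it is not already set to true.
if ! pcs property config maintenance-mode | grep -q "maintenance-mode: true"; then
    pcs property set maintenance-mode=true
fi
# Verify the result.
pcs property config maintenance-mode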
Check the system replication status and discover the primary database on all nodes. First, discover the primary database with the following command:
clusternode1:rh2adm> hdbnsutil -sr_state | egrep -e "^mode:|primary masters"
The output should look like the following:
On clusternode1:
clusternode1:rh2adm> hdbnsutil -sr_state | egrep -e "^mode:|primary masters"
mode: syncmem
primary masters: remotehost3
On clusternode2:
clusternode2:rh2adm> hdbnsutil -sr_state | egrep -e "^mode:|primary masters"
mode: syncmem
primary masters: remotehost3
On remotehost3:
remotehost3:rh2adm> hdbnsutil -sr_state | egrep -e "^mode:|primary masters"
mode: primary
On all three nodes, the primary database is remotehost3. On this primary, you must ensure that the system replication status is ACTIVE for all three nodes and that the return code is 15:
remotehost3:rh2adm> python /usr/sap/$SAPSYSTEMNAME/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py
|Database |Host |Port |Service Name |Volume ID |Site ID |Site Name |Secondary |Secondary |Secondary |Secondary |Secondary |Replication |Replication |Replication |Secondary |
| | | | | | | | |Host |Port |Site ID |Site Name |Active Status |Mode |Status |Status Details |Fully Synced |
|-------- |------ |----- |------------ |--------- |------- |--------- |--------- |--------- |--------- |--------- |------------- |----------- |----------- |-------------- |------------ |
|SYSTEMDB |remotehost3 |30201 |nameserver | 1 | 3 |DC3 |clusternode2 | 30201 | 2 |DC2 |YES |SYNCMEM |ACTIVE | | True |
|RH2 |remotehost3 |30207 |xsengine | 2 | 3 |DC3 |clusternode2 | 30207 | 2 |DC2 |YES |SYNCMEM |ACTIVE | | True |
|RH2 |remotehost3 |30203 |indexserver | 3 | 3 |DC3 |clusternode2 | 30203 | 2 |DC2 |YES |SYNCMEM |ACTIVE | | True |
|SYSTEMDB |remotehost3 |30201 |nameserver | 1 | 3 |DC3 |clusternode1 | 30201 | 1 |DC1 |YES |SYNCMEM |ACTIVE | | True |
|RH2 |remotehost3 |30207 |xsengine | 2 | 3 |DC3 |clusternode1 | 30207 | 1 |DC1 |YES |SYNCMEM |ACTIVE | | True |
|RH2 |remotehost3 |30203 |indexserver | 3 | 3 |DC3 |clusternode1 | 30203 | 1 |DC1 |YES |SYNCMEM |ACTIVE | | True |

status system replication site "2": ACTIVE
status system replication site "1": ACTIVE
overall system replication status: ACTIVE

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: PRIMARY
site id: 3
site name: DC3
[rh2adm@remotehost3: python_support]# echo $?
15
- Check that all three sr_states are consistent.
On all three nodes, run hdbnsutil -sr_state --sapcontrol=1 | grep site.*Mode:
clusternode1:rh2adm> hdbnsutil -sr_state --sapcontrol=1 | grep site.*Mode
clusternode2:rh2adm> hdbnsutil -sr_state --sapcontrol=1 | grep site.*Mode
remotehost3:rh2adm> hdbnsutil -sr_state --sapcontrol=1 | grep site.*Mode
The output should be identical on all nodes:
siteReplicationMode/DC1=primary
siteReplicationMode/DC3=async
siteReplicationMode/DC2=syncmem
siteOperationMode/DC1=primary
siteOperationMode/DC3=logreplay
siteOperationMode/DC2=logreplay
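Instead of logging in to each host, you can run the same check from one place. This is only a sketch and assumes password-less ssh to the rh2adm user on all three hosts and that hdbnsutil is available in that user's login environment:
# Sketch: compare the replication and operation modes reported by all three sites.
for node in clusternode1 clusternode2 remotehost3; do
    echo "### ${node}"
    ssh rh2adm@${node} "hdbnsutil -sr_state --sapcontrol=1 | grep site.*Mode"
done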
- Start the monitors in separate windows.
On clusternode1, start:
clusternode1:rh2adm> watch "python /usr/sap/$SAPSYSTEMNAME/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py; echo \$?"
On remotehost3, start:
remotehost3:rh2adm> watch "python /usr/sap/$SAPSYSTEMNAME/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py; echo \$?"
On clusternode2, start:
clusternode2:rh2adm> watch "hdbnsutil -sr_state --sapcontrol=1 |grep siteReplicationMode"
- Start the test.
To switch the primary over to clusternode1, run on clusternode1:
clusternode1:rh2adm> hdbnsutil -sr_takeover
done.
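Right after the takeover, you can run a quick sanity check on clusternode1; this is a sketch based on the hdbnsutil calls already used above, and after a successful takeover the local mode should be primary:
clusternode1:rh2adm> hdbnsutil -sr_state | grep "^mode:"
mode: primary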
- Check the output of the monitors.
The monitor on clusternode1 will change to:
Every 2.0s: python systemReplicationStatus.py; echo $?    clusternode1: Mon Sep 4 23:34:30 2023

|Database |Host |Port |Service Name |Volume ID |Site ID |Site Name |Secondary |Secondary |Secondary |Secondary |Secondary |Replication |Replication |Replication |Secondary |
| | | | | | | | |Host |Port |Site ID |Site Name |Active Status |Mode |Status |Status Details |Fully Synced |
|-------- |------ |----- |------------ |--------- |------- |--------- |--------- |--------- |--------- |--------- |------------- |----------- |----------- |-------------- |------------ |
|SYSTEMDB |clusternode1 |30201 |nameserver | 1 | 1 |DC1 |clusternode2 | 30201 | 2 |DC2 |YES |SYNCMEM |ACTIVE | | True |
|RH2 |clusternode1 |30207 |xsengine | 2 | 1 |DC1 |clusternode2 | 30207 | 2 |DC2 |YES |SYNCMEM |ACTIVE | | True |
|RH2 |clusternode1 |30203 |indexserver | 3 | 1 |DC1 |clusternode2 | 30203 | 2 |DC2 |YES |SYNCMEM |ACTIVE | | True |

status system replication site "2": ACTIVE
overall system replication status: ACTIVE

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: PRIMARY
site id: 1
site name: DC1
15
The important information is, again, the return code 15. The monitor on clusternode2 will change to:
Every 2.0s: hdbnsutil -sr_state --sapcontrol=1 |grep site.*Mode    clusternode2: Mon Sep 4 23:35:18 2023

siteReplicationMode/DC1=primary
siteReplicationMode/DC2=syncmem
siteOperationMode/DC1=primary
siteOperationMode/DC2=logreplay
DC3 has vanished and needs to be re-registered. On remotehost3, systemReplicationStatus reports an error and the return code changes to 11.
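If you prefer a one-shot check over the watch monitors, the return code can be printed directly. This sketch reuses the script path from above and can be run as rh2adm on any of the hosts:
# Sketch: print only the systemReplicationStatus.py return code
# (15 = all attached secondaries ACTIVE on the primary; 11 = error, as reported on remotehost3 here).
python /usr/sap/$SAPSYSTEMNAME/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py > /dev/null 2>&1
echo "returncode: $?"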
Check whether the cluster nodes have been re-registered:
clusternode1:rh2adm> hdbnsutil -sr_state

System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~

online: true

mode: primary
operation mode: primary
site id: 1
site name: DC1

is source system: true
is secondary/consumer system: false
has secondaries/consumers attached: true
is a takeover active: false
is primary suspended: false

Host Mappings:
~~~~~~~~~~~~~~

clusternode1 -> [DC2] clusternode2
clusternode1 -> [DC1] clusternode1

Site Mappings:
~~~~~~~~~~~~~~
DC1 (primary/primary)
    |---DC2 (syncmem/logreplay)

Tier of DC1: 1
Tier of DC2: 2

Replication mode of DC1: primary
Replication mode of DC2: syncmem

Operation mode of DC1: primary
Operation mode of DC2: logreplay

Mapping: DC1 -> DC2
done.
The Site Mappings show that clusternode2 (DC2) has been re-registered.
Check or enable the vip resource:
[root@clusternode1]# pcs resource
  * Clone Set: SAPHanaTopology_RH2_02-clone [SAPHanaTopology_RH2_02] (unmanaged):
    * SAPHanaTopology_RH2_02  (ocf::heartbeat:SAPHanaTopology):  Started clusternode2 (unmanaged)
    * SAPHanaTopology_RH2_02  (ocf::heartbeat:SAPHanaTopology):  Started clusternode1 (unmanaged)
  * Clone Set: SAPHana_RH2_02-clone [SAPHana_RH2_02] (promotable, unmanaged):
    * SAPHana_RH2_02  (ocf::heartbeat:SAPHana):  Master clusternode2 (unmanaged)
    * SAPHana_RH2_02  (ocf::heartbeat:SAPHana):  Slave clusternode1 (unmanaged)
  * vip_RH2_02_MASTER  (ocf::heartbeat:IPaddr2):  Stopped (disabled, unmanaged)
The vip resource vip_RH2_02_MASTER is stopped. To run it again:
[root@clusternode1]# pcs resource enable vip_RH2_02_MASTER
Warning: 'vip_RH2_02_MASTER' is unmanaged
The warning is correct, because the cluster will not start any resources unless maintenance-mode is set to false.
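As a quick check (a sketch reusing the pcs resource listing from above), the "(disabled)" flag should now be gone, while the resource stays Stopped (unmanaged) until maintenance-mode is removed:
[root@clusternode1]# pcs resource | grep vip_RH2_02_MASTER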
- Disable the cluster maintenance-mode.
Before we unset maintenance-mode, we should start two monitors in separate windows to see the changes. On clusternode2, run:
[root@clusternode2]# watch pcs status --full
On clusternode1, run:
clusternode1:rh2adm> watch "python /usr/sap/$SAPSYSTEMNAME/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py; echo \$?"
Now you can unset maintenance-mode on clusternode1 by running:
[root@clusternode1]# pcs property set maintenance-mode=false
The monitor on clusternode2 should show that everything is now running as expected:
Every 2.0s: pcs status --full    clusternode1: Tue Sep 5 00:01:17 2023

Cluster name: cluster1
Cluster Summary:
  * Stack: corosync
  * Current DC: clusternode1 (1) (version 2.1.2-4.el8_6.6-ada5c3b36e2) - partition with quorum
  * Last updated: Tue Sep 5 00:01:17 2023
  * Last change:  Tue Sep 5 00:00:30 2023 by root via crm_attribute on clusternode1
  * 2 nodes configured
  * 6 resource instances configured

Node List:
  * Online: [ clusternode1 (1) clusternode2 (2) ]

Full List of Resources:
  * auto_rhevm_fence1  (stonith:fence_rhevm):  Started clusternode1
  * Clone Set: SAPHanaTopology_RH2_02-clone [SAPHanaTopology_RH2_02]:
    * SAPHanaTopology_RH2_02  (ocf::heartbeat:SAPHanaTopology):  Started clusternode2
    * SAPHanaTopology_RH2_02  (ocf::heartbeat:SAPHanaTopology):  Started clusternode1
  * Clone Set: SAPHana_RH2_02-clone [SAPHana_RH2_02] (promotable):
    * SAPHana_RH2_02  (ocf::heartbeat:SAPHana):  Slave clusternode2
    * SAPHana_RH2_02  (ocf::heartbeat:SAPHana):  Master clusternode1
  * vip_RH2_02_MASTER  (ocf::heartbeat:IPaddr2):  Started clusternode1

Node Attributes:
  * Node: clusternode1 (1):
    * hana_rh2_clone_state            : PROMOTED
    * hana_rh2_op_mode                : logreplay
    * hana_rh2_remoteHost             : clusternode2
    * hana_rh2_roles                  : 4:P:master1:master:worker:master
    * hana_rh2_site                   : DC1
    * hana_rh2_sra                    : -
    * hana_rh2_srah                   : -
    * hana_rh2_srmode                 : syncmem
    * hana_rh2_sync_state             : PRIM
    * hana_rh2_version                : 2.00.062.00
    * hana_rh2_vhost                  : clusternode1
    * lpa_rh2_lpt                     : 1693872030
    * master-SAPHana_RH2_02           : 150
  * Node: clusternode2 (2):
    * hana_rh2_clone_state            : DEMOTED
    * hana_rh2_op_mode                : logreplay
    * hana_rh2_remoteHost             : clusternode1
    * hana_rh2_roles                  : 4:S:master1:master:worker:master
    * hana_rh2_site                   : DC2
    * hana_rh2_sra                    : -
    * hana_rh2_srah                   : -
    * hana_rh2_srmode                 : syncmem
    * hana_rh2_sync_state             : SOK
    * hana_rh2_version                : 2.00.062.00
    * hana_rh2_vhost                  : clusternode2
    * lpa_rh2_lpt                     : 30
    * master-SAPHana_RH2_02           : 100

Migration Summary:

Tickets:

PCSD Status:
  clusternode1: Online
  clusternode2: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
After manual interaction, it is best to clean up the cluster, as described in Cluster Cleanup.
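As a minimal sketch (see the Cluster Cleanup section for the full procedure), you can clear the resource operation history and failcounts with:
[root@clusternode1]# pcs resource cleanup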
- Re-register remotehost3 with the new primary on clusternode1.
Remotehost3 needs to be re-registered. To monitor the progress, start on clusternode1:
clusternode1:rh2adm> watch -n 5 'python /usr/sap/$SAPSYSTEMNAME/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py ; echo Status $?'
On remotehost3, start:
remotehost3:rh2adm> watch 'hdbnsutil -sr_state --sapcontrol=1 |grep siteReplicationMode'
Now you can re-register remotehost3 with this command:
remotehost3:rh2adm> hdbnsutil -sr_register --remoteHost=clusternode1 --remoteInstance=${TINSTANCE} --replicationMode=async --name=DC3 --remoteName=DC1 --operationMode=logreplay --online
The monitor on clusternode1 will change to:
Every 5.0s: python /usr/sap/$SAPSYSTEMNAME/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py ; echo Status $?    clusternode1: Tue Sep 5 00:14:40 2023

|Database |Host |Port |Service Name |Volume ID |Site ID |Site Name |Secondary |Secondary |Secondary |Secondary |Secondary |Replication |Replication |Replication |Secondary |
| | | | | | | | |Host |Port |Site ID |Site Name |Active Status |Mode |Status |Status Details |Fully Synced |
|-------- |------ |----- |------------ |--------- |------- |--------- |--------- |--------- |--------- |--------- |------------- |----------- |----------- |-------------- |------------ |
|SYSTEMDB |clusternode1 |30201 |nameserver | 1 | 1 |DC1 |remotehost3 | 30201 | 3 |DC3 |YES |ASYNC |ACTIVE | | True |
|RH2 |clusternode1 |30207 |xsengine | 2 | 1 |DC1 |remotehost3 | 30207 | 3 |DC3 |YES |ASYNC |ACTIVE | | True |
|RH2 |clusternode1 |30203 |indexserver | 3 | 1 |DC1 |remotehost3 | 30203 | 3 |DC3 |YES |ASYNC |ACTIVE | | True |
|SYSTEMDB |clusternode1 |30201 |nameserver | 1 | 1 |DC1 |clusternode2 | 30201 | 2 |DC2 |YES |SYNCMEM |ACTIVE | | True |
|RH2 |clusternode1 |30207 |xsengine | 2 | 1 |DC1 |clusternode2 | 30207 | 2 |DC2 |YES |SYNCMEM |ACTIVE | | True |
|RH2 |clusternode1 |30203 |indexserver | 3 | 1 |DC1 |clusternode2 | 30203 | 2 |DC2 |YES |SYNCMEM |ACTIVE | | True |

status system replication site "3": ACTIVE
status system replication site "2": ACTIVE
overall system replication status: ACTIVE

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: PRIMARY
site id: 1
site name: DC1
Status 15
The monitor on remotehost3 will change to:
Every 2.0s: hdbnsutil -sr_state --sapcontrol=1 |grep site.*Mode    remotehost3: Tue Sep 5 02:15:28 2023

siteReplicationMode/DC1=primary
siteReplicationMode/DC3=syncmem
siteReplicationMode/DC2=syncmem
siteOperationMode/DC1=primary
siteOperationMode/DC3=logreplay
siteOperationMode/DC2=logreplay
Now we have three entries again, and remotehost3 (DC3) is once more a secondary site replicating from clusternode1 (DC1).
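If you want to wait for this state in a script instead of watching, a sketch run as rh2adm on clusternode1 polls until site 3 reports ACTIVE:
# Sketch: block until system replication for site "3" (DC3) is ACTIVE again on the new primary.
until python /usr/sap/$SAPSYSTEMNAME/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py \
      | grep -q 'system replication site "3": ACTIVE'; do
    sleep 10
done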
- Check that all nodes are part of the system replication status on clusternode1.
On all three nodes, run hdbnsutil -sr_state --sapcontrol=1 | grep site.*Mode:
clusternode1:rh2adm> hdbnsutil -sr_state --sapcontrol=1 | grep site.*Mode
clusternode2:rh2adm> hdbnsutil -sr_state --sapcontrol=1 | grep site.*Mode
remotehost3:rh2adm> hdbnsutil -sr_state --sapcontrol=1 | grep site.*Mode
We should get the same output on all nodes:
siteReplicationMode/DC1=primary
siteReplicationMode/DC3=syncmem
siteReplicationMode/DC2=syncmem
siteOperationMode/DC1=primary
siteOperationMode/DC3=logreplay
siteOperationMode/DC2=logreplay
- Check pcs status --full for SOK. Run:
[root@clusternode1]# pcs status --full| grep sync_state
The output should show PRIM or SOK:
* hana_rh2_sync_state             : PRIM
* hana_rh2_sync_state             : SOK
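If the secondary has not caught up yet, you can keep polling this attribute with a monitor similar to the ones above; a minimal sketch reusing the same commands:
[root@clusternode1]# watch -n 5 'pcs status --full | grep sync_state'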
Finally, the cluster status should look like the following, including the sync_state values PRIM and SOK:
[root@clusternode1]# pcs status --full
Cluster name: cluster1
Cluster Summary:
  * Stack: corosync
  * Current DC: clusternode1 (1) (version 2.1.2-4.el8_6.6-ada5c3b36e2) - partition with quorum
  * Last updated: Tue Sep 5 00:18:52 2023
  * Last change:  Tue Sep 5 00:16:54 2023 by root via crm_attribute on clusternode1
  * 2 nodes configured
  * 6 resource instances configured

Node List:
  * Online: [ clusternode1 (1) clusternode2 (2) ]

Full List of Resources:
  * auto_rhevm_fence1  (stonith:fence_rhevm):  Started clusternode1
  * Clone Set: SAPHanaTopology_RH2_02-clone [SAPHanaTopology_RH2_02]:
    * SAPHanaTopology_RH2_02  (ocf::heartbeat:SAPHanaTopology):  Started clusternode2
    * SAPHanaTopology_RH2_02  (ocf::heartbeat:SAPHanaTopology):  Started clusternode1
  * Clone Set: SAPHana_RH2_02-clone [SAPHana_RH2_02] (promotable):
    * SAPHana_RH2_02  (ocf::heartbeat:SAPHana):  Slave clusternode2
    * SAPHana_RH2_02  (ocf::heartbeat:SAPHana):  Master clusternode1
  * vip_RH2_02_MASTER  (ocf::heartbeat:IPaddr2):  Started clusternode1

Node Attributes:
  * Node: clusternode1 (1):
    * hana_rh2_clone_state            : PROMOTED
    * hana_rh2_op_mode                : logreplay
    * hana_rh2_remoteHost             : clusternode2
    * hana_rh2_roles                  : 4:P:master1:master:worker:master
    * hana_rh2_site                   : DC1
    * hana_rh2_sra                    : -
    * hana_rh2_srah                   : -
    * hana_rh2_srmode                 : syncmem
    * hana_rh2_sync_state             : PRIM
    * hana_rh2_version                : 2.00.062.00
    * hana_rh2_vhost                  : clusternode1
    * lpa_rh2_lpt                     : 1693873014
    * master-SAPHana_RH2_02           : 150
  * Node: clusternode2 (2):
    * hana_rh2_clone_state            : DEMOTED
    * hana_rh2_op_mode                : logreplay
    * hana_rh2_remoteHost             : clusternode1
    * hana_rh2_roles                  : 4:S:master1:master:worker:master
    * hana_rh2_site                   : DC2
    * hana_rh2_sra                    : -
    * hana_rh2_srah                   : -
    * hana_rh2_srmode                 : syncmem
    * hana_rh2_sync_state             : SOK
    * hana_rh2_version                : 2.00.062.00
    * hana_rh2_vhost                  : clusternode2
    * lpa_rh2_lpt                     : 30
    * master-SAPHana_RH2_02           : 100

Migration Summary:

Tickets:

PCSD Status:
  clusternode1: Online
  clusternode2: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
See Check cluster status and Check database to verify that everything is working again.