5.6. Test 4: Failback of the primary to the first site
| Subject of the test | The primary moves back to a cluster node. Fail back and enable the cluster again. Re-register the third site as a secondary. |
| Test preconditions | The SAP HANA primary is running on the DC3 node (the end state of Test 3), and the cluster is in maintenance-mode. |
| Test steps | Check the expected primary of the cluster. Fail over from the DC3 node to the DC1 node. Check whether the former secondary has switched to the new primary. Re-register az3n1 as the new secondary. Set the cluster maintenance-mode back to false. |
| Monitoring the test | On the new primary, start: `watch "python /usr/sap/${SAPSYSTEMNAME}/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py; echo \$?"`. On the secondary, start: `watch 'hdbnsutil -sr_state --sapcontrol=1 \| grep siteReplicationMode'`. |
| Starting the test | Check the expected primary of the cluster: the VIP and the promoted SAP HANA resource should be running on the same node; this is the potential new primary. On this potential primary, run `hdbnsutil -sr_takeover`. Re-register the former primary as the new secondary with `hdbnsutil -sr_register`. Set `maintenance-mode=false`. |
| Expected result | The new primary starts SAP HANA. The replication status shows all 3 sites. The second cluster site automatically re-registers with the new primary site. The DR site becomes an additional replica of the database. |
| Way to return to the initial state | Run Test 3. |
Detailed description
Check whether the cluster is in maintenance mode:

```
[root@az1n1]# pcs property config maintenance-mode
Cluster Properties:
 maintenance-mode: true
```

If maintenance-mode is not true, you can set it with:

```
[root@az1n1]# pcs property set maintenance-mode=true
```
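If you want to script this precondition, the following minimal sketch (an illustration, not part of the original procedure) enables maintenance mode only when it is not already active:

```bash
#!/bin/bash
# Minimal sketch: make sure the cluster is in maintenance mode before the
# failback test. Assumes it is run as root on a cluster node with pcs.
if pcs property config maintenance-mode | grep -q 'maintenance-mode: true'; then
    echo "maintenance-mode is already true."
else
    pcs property set maintenance-mode=true
fi
```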
Check the system replication status and discover the primary database on all nodes. First, discover the primary database with the following command:

```
az1n1:rh2adm> hdbnsutil -sr_state | egrep -e "^mode:|primary masters"
```

The output should look like this:
On az1n1:

```
az1n1:rh2adm> hdbnsutil -sr_state | egrep -e "^mode:|primary masters"
mode: syncmem
primary masters: az3n1
```

On az2n1:

```
az2n1:rh2adm> hdbnsutil -sr_state | egrep -e "^mode:|primary masters"
mode: syncmem
primary masters: az3n1
```

On az3n1:

```
az3n1:rh2adm> hdbnsutil -sr_state | egrep -e "^mode:|primary masters"
mode: primary
```

On all three nodes, the primary database is az3n1.
On this primary database, you must ensure that the system replication status is ACTIVE for all three nodes and that the return code is 15:

```
az3n1:rh2adm> python /usr/sap/${SAPSYSTEMNAME}/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py
|Database |Host  |Port  |Service Name |Volume ID |Site ID |Site Name |Secondary |Secondary |Secondary |Secondary |Secondary     |Replication |Replication |Replication    |Secondary    |
|         |      |      |             |          |        |          |Host      |Port      |Site ID   |Site Name |Active Status |Mode        |Status      |Status Details |Fully Synced |
|-------- |----- |----- |------------ |--------- |------- |--------- |--------- |--------- |--------- |--------- |------------- |----------- |----------- |-------------- |------------ |
|SYSTEMDB |az3n1 |30201 |nameserver   |        1 |      3 |DC3       |az2n1     |    30201 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |az3n1 |30207 |xsengine     |        2 |      3 |DC3       |az2n1     |    30207 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |az3n1 |30203 |indexserver  |        3 |      3 |DC3       |az2n1     |    30203 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|SYSTEMDB |az3n1 |30201 |nameserver   |        1 |      3 |DC3       |az1n1     |    30201 |        1 |DC1       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |az3n1 |30207 |xsengine     |        2 |      3 |DC3       |az1n1     |    30207 |        1 |DC1       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |az3n1 |30203 |indexserver  |        3 |      3 |DC3       |az1n1     |    30203 |        1 |DC1       |YES           |SYNCMEM     |ACTIVE      |               |        True |

status system replication site "2": ACTIVE
status system replication site "1": ACTIVE
overall system replication status: ACTIVE

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: PRIMARY
site id: 3
site name: DC3

[rh2adm@az3n1: python_support]# echo $?
15
```
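Because systemReplicationStatus.py encodes the replication health in its return code (15 here means all secondaries are active; the detached az3n1 later in this test returns 11), a scripted check can key off that code. A minimal sketch, run as the rh2adm user on the primary:

```bash
#!/bin/bash
# Minimal sketch: verify the systemReplicationStatus.py return code.
# 15 = replication ACTIVE on all sites; other codes (such as the 11 seen
# later on the detached az3n1) indicate a problem.
python /usr/sap/${SAPSYSTEMNAME}/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py > /dev/null
rc=$?
if [ "$rc" -eq 15 ]; then
    echo "System replication is ACTIVE on all sites (rc=$rc)."
else
    echo "Unexpected systemReplicationStatus return code: $rc" >&2
    exit 1
fi
```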
Check whether all three sr_states are consistent. Run `hdbnsutil -sr_state --sapcontrol=1 | grep site.*Mode` on all three nodes:

```
az1n1:rh2adm> hdbnsutil -sr_state --sapcontrol=1 | grep site.*Mode
az2n1:rh2adm> hdbnsutil -sr_state --sapcontrol=1 | grep site.*Mode
az3n1:rh2adm> hdbnsutil -sr_state --sapcontrol=1 | grep site.*Mode
```

The output should be the same on all nodes:

```
siteReplicationMode/DC1=primary
siteReplicationMode/DC3=async
siteReplicationMode/DC2=syncmem
siteOperationMode/DC1=primary
siteOperationMode/DC3=logreplay
siteOperationMode/DC2=logreplay
```
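To avoid logging in to each node by hand, the comparison can be scripted. A minimal sketch, assuming passwordless ssh as the rh2adm user to all three nodes (an assumption, not part of the original setup):

```bash
#!/bin/bash
# Minimal sketch: collect the site*Mode lines from all three nodes and
# report whether they are identical. Assumes passwordless ssh as rh2adm.
nodes="az1n1 az2n1 az3n1"
for node in $nodes; do
    ssh "$node" "hdbnsutil -sr_state --sapcontrol=1" | grep 'site.*Mode' > "/tmp/srstate.$node"
done
if diff -q /tmp/srstate.az1n1 /tmp/srstate.az2n1 > /dev/null && \
   diff -q /tmp/srstate.az1n1 /tmp/srstate.az3n1 > /dev/null; then
    echo "sr_state is consistent on all three nodes."
else
    echo "sr_state differs between the nodes - investigate before the failback." >&2
fi
```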
Start the monitoring in separate windows. On az1n1, start:

```
az1n1:rh2adm> watch "python /usr/sap/${SAPSYSTEMNAME}/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py; echo \$?"
```

On az3n1, start:

```
az3n1:rh2adm> watch "python /usr/sap/${SAPSYSTEMNAME}/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py; echo \$?"
```

On az2n1, start:

```
az2n1:rh2adm> watch "hdbnsutil -sr_state --sapcontrol=1 |grep siteReplicationMode"
```
Start the test. To fail over to az1n1, run the following on az1n1:

```
az1n1:rh2adm> hdbnsutil -sr_takeover
done.
```
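hdbnsutil -sr_takeover returns almost immediately, while the database still needs a moment to come up in its new role. A minimal sketch (an assumption, not part of the original procedure) that waits on az1n1 until the local mode is reported as primary:

```bash
#!/bin/bash
# Minimal sketch: poll the local sr_state after the takeover until this
# node reports "mode: primary". Run as rh2adm on the takeover target.
until hdbnsutil -sr_state | grep -q '^mode: primary'; do
    echo "Waiting for the takeover to complete..."
    sleep 5
done
echo "This node is now the primary."
```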
Check the output of the monitors. The monitor on az1n1 changes to:

```
Every 2.0s: python systemReplicationStatus.py; echo $?    az1n1: Mon Sep  4 23:34:30 2023

|Database |Host  |Port  |Service Name |Volume ID |Site ID |Site Name |Secondary |Secondary |Secondary |Secondary |Secondary     |Replication |Replication |Replication    |Secondary    |
|         |      |      |             |          |        |          |Host      |Port      |Site ID   |Site Name |Active Status |Mode        |Status      |Status Details |Fully Synced |
|-------- |----- |----- |------------ |--------- |------- |--------- |--------- |--------- |--------- |--------- |------------- |----------- |----------- |-------------- |------------ |
|SYSTEMDB |az1n1 |30201 |nameserver   |        1 |      1 |DC1       |az2n1     |    30201 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |az1n1 |30207 |xsengine     |        2 |      1 |DC1       |az2n1     |    30207 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |az1n1 |30203 |indexserver  |        3 |      1 |DC1       |az2n1     |    30203 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |

status system replication site "2": ACTIVE
overall system replication status: ACTIVE

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: PRIMARY
site id: 1
site name: DC1
15
```

The return code 15 is also important here.
The monitor on az2n1 changes to:

```
Every 2.0s: hdbnsutil -sr_state --sapcontrol=1 |grep site.*Mode    az2n1: Mon Sep  4 23:35:18 2023

siteReplicationMode/DC1=primary
siteReplicationMode/DC2=syncmem
siteOperationMode/DC1=primary
siteOperationMode/DC2=logreplay
```

DC3 has disappeared and needs to be re-registered.
On az3n1, systemReplicationStatus reports an error, and the return code changes to 11. Check whether the cluster nodes have been re-registered:

```
az1n1:rh2adm> hdbnsutil -sr_state

System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~

online: true

mode: primary
operation mode: primary
site id: 1
site name: DC1

is source system: true
is secondary/consumer system: false
has secondaries/consumers attached: true
is a takeover active: false
is primary suspended: false

Host Mappings:
~~~~~~~~~~~~~~

az1n1 -> [DC2] az2n1
az1n1 -> [DC1] az1n1

Site Mappings:
~~~~~~~~~~~~~~
DC1 (primary/primary)
|---DC2 (syncmem/logreplay)

Tier of DC1: 1
Tier of DC2: 2

Replication mode of DC1: primary
Replication mode of DC2: syncmem

Operation mode of DC1: primary
Operation mode of DC2: logreplay

Mapping: DC1 -> DC2
done.
```

The Site Mappings show that az2n1 (DC2) has been re-registered with the new primary.
Check and enable the vip resource:

```
[root@az1n1]# pcs resource
  * Clone Set: SAPHanaTopology_RH2_02-clone [SAPHanaTopology_RH2_02] (unmanaged):
    * SAPHanaTopology_RH2_02    (ocf::heartbeat:SAPHanaTopology):        Started az2n1 (unmanaged)
    * SAPHanaTopology_RH2_02    (ocf::heartbeat:SAPHanaTopology):        Started az1n1 (unmanaged)
  * Clone Set: SAPHana_RH2_02-clone [SAPHana_RH2_02] (promotable, unmanaged):
    * SAPHana_RH2_02    (ocf::heartbeat:SAPHana):        Master az2n1 (unmanaged)
    * SAPHana_RH2_02    (ocf::heartbeat:SAPHana):        Slave az1n1 (unmanaged)
  * vip_RH2_02_MASTER   (ocf::heartbeat:IPaddr2):        Stopped (disabled, unmanaged)
```

The vip resource vip_RH2_02_MASTER is stopped. To start it again, run:

```
[root@az1n1]# pcs resource enable vip_RH2_02_MASTER
Warning: 'vip_RH2_02_MASTER' is unmanaged
```

The warning is correct, because the cluster does not start any resources until maintenance-mode is set back to false.
Take the cluster out of maintenance mode. Before unsetting maintenance-mode, start two monitors in separate windows to watch the changes. On az2n1, run:

```
[root@az2n1]# watch pcs status --full
```

On az1n1, run:

```
az1n1:rh2adm> watch "python /usr/sap/${SAPSYSTEMNAME}/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py; echo \$?"
```

Run this command on az1n1 to unset maintenance-mode:

```
[root@az1n1]# pcs property set maintenance-mode=false
```

The pcs status monitor should now show that everything is running as expected:
```
Every 2.0s: pcs status --full    az1n1: Tue Sep  5 00:01:17 2023

Cluster name: cluster1
Cluster Summary:
  * Stack: corosync
  * Current DC: az1n1 (1) (version 2.1.2-4.el8_6.6-ada5c3b36e2) - partition with quorum
  * Last updated: Tue Sep  5 00:01:17 2023
  * Last change:  Tue Sep  5 00:00:30 2023 by root via crm_attribute on az1n1
  * 2 nodes configured
  * 6 resource instances configured

Node List:
  * Online: [ az1n1 (1) az2n1 (2) ]

Full List of Resources:
  * auto_rhevm_fence1   (stonith:fence_rhevm):   Started az1n1
  * Clone Set: SAPHanaTopology_RH2_02-clone [SAPHanaTopology_RH2_02]:
    * SAPHanaTopology_RH2_02    (ocf::heartbeat:SAPHanaTopology):        Started az2n1
    * SAPHanaTopology_RH2_02    (ocf::heartbeat:SAPHanaTopology):        Started az1n1
  * Clone Set: SAPHana_RH2_02-clone [SAPHana_RH2_02] (promotable):
    * SAPHana_RH2_02    (ocf::heartbeat:SAPHana):        Slave az2n1
    * SAPHana_RH2_02    (ocf::heartbeat:SAPHana):        Master az1n1
  * vip_RH2_02_MASTER   (ocf::heartbeat:IPaddr2):        Started az1n1

Node Attributes:
  * Node: az1n1 (1):
    * hana_rh2_clone_state              : PROMOTED
    * hana_rh2_op_mode                  : logreplay
    * hana_rh2_remoteHost               : az2n1
    * hana_rh2_roles                    : 4:P:master1:master:worker:master
    * hana_rh2_site                     : DC1
    * hana_rh2_sra                      : -
    * hana_rh2_srah                     : -
    * hana_rh2_srmode                   : syncmem
    * hana_rh2_sync_state               : PRIM
    * hana_rh2_version                  : 2.00.062.00
    * hana_rh2_vhost                    : az1n1
    * lpa_rh2_lpt                       : 1693872030
    * master-SAPHana_RH2_02             : 150
  * Node: az2n1 (2):
    * hana_rh2_clone_state              : DEMOTED
    * hana_rh2_op_mode                  : logreplay
    * hana_rh2_remoteHost               : az1n1
    * hana_rh2_roles                    : 4:S:master1:master:worker:master
    * hana_rh2_site                     : DC2
    * hana_rh2_sra                      : -
    * hana_rh2_srah                     : -
    * hana_rh2_srmode                   : syncmem
    * hana_rh2_sync_state               : SOK
    * hana_rh2_version                  : 2.00.062.00
    * hana_rh2_vhost                    : az2n1
    * lpa_rh2_lpt                       : 30
    * master-SAPHana_RH2_02             : 100

Migration Summary:

Tickets:

PCSD Status:
  az1n1: Online
  az2n1: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
```

After any manual interaction, always clean up the cluster, as described in Cleaning up the cluster.
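The cleanup typically amounts to clearing stale failure records and re-probing the resource state. A minimal sketch is shown below; the authoritative steps are in the referenced Cleaning up the cluster section:

```bash
# Minimal sketch: clear recorded failures and re-detect the resource state
# after manual interaction. Run as root on a cluster node. See
# "Cleaning up the cluster" for the authoritative procedure.
pcs resource cleanup
pcs resource refresh
```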
Re-register az3n1 with the new primary on az1n1. To monitor the progress, start the following on az1n1:
```
az1n1:rh2adm> watch -n 5 'python /usr/sap/${SAPSYSTEMNAME}/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py ; echo Status $?'
```

On az3n1, start:

```
az3n1:rh2adm> watch 'hdbnsutil -sr_state --sapcontrol=1 |grep siteReplicationMode'
```

Now you can re-register az3n1 with the following command:

```
az3n1:rh2adm> hdbnsutil -sr_register --remoteHost=az1n1 --remoteInstance=${TINSTANCE} --replicationMode=async --name=DC3 --remoteName=DC1 --operationMode=logreplay --online
```
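For reference, the same call with each option annotated; the asynchronous replication mode is what makes DC3 a remote DR copy rather than a synchronous cluster member (an annotated restatement of the command above, not an additional step):

```bash
# Annotated restatement of the sr_register call above (not an extra step):
#   --remoteHost=az1n1         host of the current primary
#   --remoteInstance           instance number of the primary (${TINSTANCE})
#   --replicationMode=async    asynchronous replication for the remote DR site
#   --name=DC3                 site name of this re-registered secondary
#   --remoteName=DC1           site name of the primary
#   --operationMode=logreplay  continuously replay redo logs on the secondary
#   --online                   register while the instance is online
hdbnsutil -sr_register --remoteHost=az1n1 --remoteInstance=${TINSTANCE} \
    --replicationMode=async --name=DC3 --remoteName=DC1 \
    --operationMode=logreplay --online
```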
The monitor on az1n1 changes to:

```
Every 5.0s: python /usr/sap/${SAPSYSTEMNAME}/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py ; echo Status $?    az1n1: Tue Sep  5 00:14:40 2023

|Database |Host  |Port  |Service Name |Volume ID |Site ID |Site Name |Secondary |Secondary |Secondary |Secondary |Secondary     |Replication |Replication |Replication    |Secondary    |
|         |      |      |             |          |        |          |Host      |Port      |Site ID   |Site Name |Active Status |Mode        |Status      |Status Details |Fully Synced |
|-------- |----- |----- |------------ |--------- |------- |--------- |--------- |--------- |--------- |--------- |------------- |----------- |----------- |-------------- |------------ |
|SYSTEMDB |az1n1 |30201 |nameserver   |        1 |      1 |DC1       |az3n1     |    30201 |        3 |DC3       |YES           |ASYNC       |ACTIVE      |               |        True |
|RH2      |az1n1 |30207 |xsengine     |        2 |      1 |DC1       |az3n1     |    30207 |        3 |DC3       |YES           |ASYNC       |ACTIVE      |               |        True |
|RH2      |az1n1 |30203 |indexserver  |        3 |      1 |DC1       |az3n1     |    30203 |        3 |DC3       |YES           |ASYNC       |ACTIVE      |               |        True |
|SYSTEMDB |az1n1 |30201 |nameserver   |        1 |      1 |DC1       |az2n1     |    30201 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |az1n1 |30207 |xsengine     |        2 |      1 |DC1       |az2n1     |    30207 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |az1n1 |30203 |indexserver  |        3 |      1 |DC1       |az2n1     |    30203 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |

status system replication site "3": ACTIVE
status system replication site "2": ACTIVE
overall system replication status: ACTIVE

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: PRIMARY
site id: 1
site name: DC1
Status 15
```

The monitor on az3n1 changes to:
```
Every 2.0s: hdbnsutil -sr_state --sapcontrol=1 |grep site.*Mode    az3n1: Tue Sep  5 02:15:28 2023

siteReplicationMode/DC1=primary
siteReplicationMode/DC3=syncmem
siteReplicationMode/DC2=syncmem
siteOperationMode/DC1=primary
siteOperationMode/DC3=logreplay
siteOperationMode/DC2=logreplay
```

There are now three entries again, and az3n1 (DC3) is once more a secondary site replicating from az1n1 (DC1).
Check whether all nodes are part of the system replication status on az1n1. Run `hdbnsutil -sr_state --sapcontrol=1 | grep site.*Mode` on all three nodes:

```
az1n1:rh2adm> hdbnsutil -sr_state --sapcontrol=1 |grep site.*Mode
az2n1:rh2adm> hdbnsutil -sr_state --sapcontrol=1 |grep site.*Mode
az3n1:rh2adm> hdbnsutil -sr_state --sapcontrol=1 |grep site.*Mode
```

On all nodes, we should get the same output:
```
siteReplicationMode/DC1=primary
siteReplicationMode/DC3=syncmem
siteReplicationMode/DC2=syncmem
siteOperationMode/DC1=primary
siteOperationMode/DC3=logreplay
siteOperationMode/DC2=logreplay
```

Check `pcs status --full` for the sync_state SOK.
Run:

```
[root@az1n1]# pcs status --full | grep sync_state
```

The output should be either PRIM or SOK:

```
    * hana_rh2_sync_state               : PRIM
    * hana_rh2_sync_state               : SOK
```
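Right after the failback, the secondary may briefly report a state other than SOK while it catches up. A minimal sketch (an assumption, not part of the original procedure) that polls until the secondary reports SOK:

```bash
#!/bin/bash
# Minimal sketch: wait until the secondary's sync_state attribute is SOK.
# Run as root on a cluster node.
until pcs status --full | grep sync_state | grep -q SOK; do
    echo "Waiting for the secondary to report SOK..."
    sleep 10
done
echo "Secondary is in sync (SOK)."
```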
Finally, the cluster status should look similar to the following, including the sync_state values PRIM and SOK:

```
[root@az1n1]# pcs status --full
Cluster name: cluster1
Cluster Summary:
  * Stack: corosync
  * Current DC: az1n1 (1) (version 2.1.2-4.el8_6.6-ada5c3b36e2) - partition with quorum
  * Last updated: Tue Sep  5 00:18:52 2023
  * Last change:  Tue Sep  5 00:16:54 2023 by root via crm_attribute on az1n1
  * 2 nodes configured
  * 6 resource instances configured

Node List:
  * Online: [ az1n1 (1) az2n1 (2) ]

Full List of Resources:
  * auto_rhevm_fence1   (stonith:fence_rhevm):   Started az1n1
  * Clone Set: SAPHanaTopology_RH2_02-clone [SAPHanaTopology_RH2_02]:
    * SAPHanaTopology_RH2_02    (ocf::heartbeat:SAPHanaTopology):        Started az2n1
    * SAPHanaTopology_RH2_02    (ocf::heartbeat:SAPHanaTopology):        Started az1n1
  * Clone Set: SAPHana_RH2_02-clone [SAPHana_RH2_02] (promotable):
    * SAPHana_RH2_02    (ocf::heartbeat:SAPHana):        Slave az2n1
    * SAPHana_RH2_02    (ocf::heartbeat:SAPHana):        Master az1n1
  * vip_RH2_02_MASTER   (ocf::heartbeat:IPaddr2):        Started az1n1

Node Attributes:
  * Node: az1n1 (1):
    * hana_rh2_clone_state              : PROMOTED
    * hana_rh2_op_mode                  : logreplay
    * hana_rh2_remoteHost               : az2n1
    * hana_rh2_roles                    : 4:P:master1:master:worker:master
    * hana_rh2_site                     : DC1
    * hana_rh2_sra                      : -
    * hana_rh2_srah                     : -
    * hana_rh2_srmode                   : syncmem
    * hana_rh2_sync_state               : PRIM
    * hana_rh2_version                  : 2.00.062.00
    * hana_rh2_vhost                    : az1n1
    * lpa_rh2_lpt                       : 1693873014
    * master-SAPHana_RH2_02             : 150
  * Node: az2n1 (2):
    * hana_rh2_clone_state              : DEMOTED
    * hana_rh2_op_mode                  : logreplay
    * hana_rh2_remoteHost               : az1n1
    * hana_rh2_roles                    : 4:S:master1:master:worker:master
    * hana_rh2_site                     : DC2
    * hana_rh2_sra                      : -
    * hana_rh2_srah                     : -
    * hana_rh2_srmode                   : syncmem
    * hana_rh2_sync_state               : SOK
    * hana_rh2_version                  : 2.00.062.00
    * hana_rh2_vhost                    : az2n1
    * lpa_rh2_lpt                       : 30
    * master-SAPHana_RH2_02             : 100

Migration Summary:

Tickets:

PCSD Status:
  az1n1: Online
  az2n1: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
```

- Refer to Check cluster status and Check database to verify that everything works as expected.