5.5. Test 3: Failover of the primary site to the third site
Subject of the test | Failover of the primary site to the third site. The secondary will be re-registered to the third site.
Test preconditions | SAP HANA is running on DC1, DC2, and DC3, and system replication between the sites is active.
Test steps | Optionally put the cluster into maintenance-mode, then start the takeover on the third site using hdbnsutil -sr_takeover.
Start of the test | Execute the SAP HANA command on remotehost3: hdbnsutil -sr_takeover
Monitoring the test | On the third site, run: watch -n 5 'python /usr/sap/$SAPSYSTEMNAME/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py ; echo Status $?'
Expected result | remotehost3 becomes the new primary; clusternode2 is automatically re-registered to remotehost3; the former primary, clusternode1, has to be re-registered manually.
Way to return to the initial state | Run test 4.
Detailed description
Check that the databases are running, and check the replication status:
clusternode2:rh2adm> hdbnsutil -sr_state | egrep -e "^mode:|primary masters"
For example, the output is:
mode: syncmem
primary masters: clusternode1
In this case, the primary database is clusternode1. If you run this command on clusternode1, you will get:
mode: primary
On this primary node, you can also display the system replication status. It should look similar to this:
clusternode1:rh2adm> cdpy
clusternode1:rh2adm> python systemReplicationStatus.py
|Database |Host         |Port  |Service Name |Volume ID |Site ID |Site Name |Secondary    |Secondary |Secondary |Secondary |Secondary     |Replication |Replication |Replication    |Secondary    |
|         |             |      |             |          |        |          |Host         |Port      |Site ID   |Site Name |Active Status |Mode        |Status      |Status Details |Fully Synced |
|-------- |------------ |----- |------------ |--------- |------- |--------- |------------ |--------- |--------- |--------- |------------- |----------- |----------- |-------------- |------------ |
|SYSTEMDB |clusternode1 |30201 |nameserver   |        1 |      1 |DC1       |remotehost3  |    30201 |        3 |DC3       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |clusternode1 |30207 |xsengine     |        2 |      1 |DC1       |remotehost3  |    30207 |        3 |DC3       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |clusternode1 |30203 |indexserver  |        3 |      1 |DC1       |remotehost3  |    30203 |        3 |DC3       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|SYSTEMDB |clusternode1 |30201 |nameserver   |        1 |      1 |DC1       |clusternode2 |    30201 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |clusternode1 |30207 |xsengine     |        2 |      1 |DC1       |clusternode2 |    30207 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |clusternode1 |30203 |indexserver  |        3 |      1 |DC1       |clusternode2 |    30203 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |

status system replication site "3": ACTIVE
status system replication site "2": ACTIVE
overall system replication status: ACTIVE

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: PRIMARY
site id: 1
site name: DC1
- Now we have a proper environment, and we can start monitoring the system replication status on all three nodes.
The three monitors should be started before the test begins. The output will change while the test is executed, so keep them running as long as the test is not complete.
On the old primary node, clusternode1, run this in a separate window during the test:
clusternode1:rh2adm> watch -n 5 'python /usr/sap/$SAPSYSTEMNAME/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py ; echo Status $?'
The output on clusternode1 will be:
Every 5.0s: python /usr/sap/$SAPSYSTEMNAME/HDB${TINSTANCE}/exe/python_support/systemReplicati...  clusternode1: Tue XXX XX HH:MM:SS 2023

|Database |Host         |Port  |Service Name |Volume ID |Site ID |Site Name |Secondary    |Secondary |Secondary |Secondary |Secondary     |Replication |Replication |Replication    |Secondary    |
|         |             |      |             |          |        |          |Host         |Port      |Site ID   |Site Name |Active Status |Mode        |Status      |Status Details |Fully Synced |
|-------- |------------ |----- |------------ |--------- |------- |--------- |------------ |--------- |--------- |--------- |------------- |----------- |----------- |-------------- |------------ |
|SYSTEMDB |clusternode1 |30201 |nameserver   |        1 |      1 |DC1       |remotehost3  |    30201 |        3 |DC3       |YES           |ASYNC       |ACTIVE      |               |        True |
|RH2      |clusternode1 |30207 |xsengine     |        2 |      1 |DC1       |remotehost3  |    30207 |        3 |DC3       |YES           |ASYNC       |ACTIVE      |               |        True |
|RH2      |clusternode1 |30203 |indexserver  |        3 |      1 |DC1       |remotehost3  |    30203 |        3 |DC3       |YES           |ASYNC       |ACTIVE      |               |        True |
|SYSTEMDB |clusternode1 |30201 |nameserver   |        1 |      1 |DC1       |clusternode2 |    30201 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |clusternode1 |30207 |xsengine     |        2 |      1 |DC1       |clusternode2 |    30207 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |clusternode1 |30203 |indexserver  |        3 |      1 |DC1       |clusternode2 |    30203 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |

status system replication site "3": ACTIVE
status system replication site "2": ACTIVE
overall system replication status: ACTIVE

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: PRIMARY
site id: 1
site name: DC1
Status 15
Run the same command on remotehost3:
remotehost3:rh2adm> watch -n 5 'python /usr/sap/$SAPSYSTEMNAME/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py ; echo Status $?'
The response will be:
this system is either not running or not primary system replication site
After the failover is started by the test, this output will change and become similar to the example shown for the primary node above.
On the second node, start:
clusternode2:rh2adm> watch -n 10 'hdbnsutil -sr_state | grep masters'
This will show the current master, clusternode1, and will switch immediately after the failover is started.
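For illustration, the filtered line should change from the old to the new master once the takeover is done (a sketch based on the hosts of this setup):

primary masters: clusternode1

becomes, after the failover:

primary masters: remotehost3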
- To ensure that everything is configured correctly, also check the global.ini.
- Check the global.ini on DC1, DC2, and DC3.
On all three nodes, the global.ini should contain:
[persistence]
log_mode=normal
[system_replication]
register_secondaries_on_takeover=true
You can edit the global.ini with:
clusternode1:rh2adm> vim /usr/sap/$SAPSYSTEMNAME/SYS/global/hdb/custom/config/global.ini
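If you only want to verify the two parameters without opening an editor, a grep over the same file works as well. This is a minimal sketch; it assumes the file path used in the vim command above and parameters written without surrounding spaces:

clusternode1:rh2adm> grep -e log_mode -e register_secondaries_on_takeover /usr/sap/$SAPSYSTEMNAME/SYS/global/hdb/custom/config/global.ini
log_mode=normal
register_secondaries_on_takeover=true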
[Optional] Put the cluster into maintenance-mode:
[root@clusternode1]# pcs property set maintenance-mode=true
During the test, you will see that the failover works with and without setting maintenance-mode, so you can run the first test without it. Putting the cluster into maintenance-mode is an option for cases in which the primary is not accessible. Start the test: failover to DC3. On remotehost3, run:
remotehost3:rh2adm> hdbnsutil -sr_takeover
done.
The test has started; now check the output of the monitors started before.
On clusternode1, the system replication status will lose the relationship to remotehost3 and to clusternode2 (DC2):
Every 5.0s: python /usr/sap/RH2/HDB02/exe/python_support/systemReplicationStatus.py ; echo Status $?  clusternode1: Mon Sep 4 11:52:16 2023

|Database |Host         |Port  |Service Name |Volume ID |Site ID |Site Name |Secondary    |Secondary |Secondary |Secondary |Secondary     |Replication |Replication |Replication                  |Secondary    |
|         |             |      |             |          |        |          |Host         |Port      |Site ID   |Site Name |Active Status |Mode        |Status      |Status Details               |Fully Synced |
|-------- |------------ |----- |------------ |--------- |------- |--------- |------------ |--------- |--------- |--------- |------------- |----------- |----------- |---------------------------- |------------ |
|SYSTEMDB |clusternode1 |30201 |nameserver   |        1 |      1 |DC1       |clusternode2 |    30201 |        2 |DC2       |YES           |SYNCMEM     |ERROR       |Communication channel closed |       False |
|RH2      |clusternode1 |30207 |xsengine     |        2 |      1 |DC1       |clusternode2 |    30207 |        2 |DC2       |YES           |SYNCMEM     |ERROR       |Communication channel closed |       False |
|RH2      |clusternode1 |30203 |indexserver  |        3 |      1 |DC1       |clusternode2 |    30203 |        2 |DC2       |YES           |SYNCMEM     |ERROR       |Communication channel closed |       False |

status system replication site "2": ERROR
overall system replication status: ERROR

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: PRIMARY
site id: 1
site name: DC1
Status 11
The cluster still does not notice this behavior. If you check the return code of the system replication status, return code 11 means error, which tells you that something is wrong. If you have access, this is a good moment to put the cluster into maintenance-mode.
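If you are only interested in the return code, you can discard the table and print the exit status alone. A minimal sketch, using the same script as in the monitors above:

clusternode1:rh2adm> python /usr/sap/$SAPSYSTEMNAME/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py > /dev/null; echo $?
11

Here 11 is the error state shown above, while 15 means that the replication is active.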
remotehost3 becomes the new primary, and clusternode2 (DC2) is automatically registered as a secondary of the new primary, remotehost3.
Example output of the system replication status on remotehost3:
Every 5.0s: python /usr/sap/RH2/HDB02/exe/python_support/systemReplicationStatus.py ; echo Status $?  remotehost3: Mon Sep 4 13:55:29 2023

|Database |Host        |Port  |Service Name |Volume ID |Site ID |Site Name |Secondary    |Secondary |Secondary |Secondary |Secondary     |Replication |Replication |Replication    |Secondary    |
|         |            |      |             |          |        |          |Host         |Port      |Site ID   |Site Name |Active Status |Mode        |Status      |Status Details |Fully Synced |
|-------- |----------- |----- |------------ |--------- |------- |--------- |------------ |--------- |--------- |--------- |------------- |----------- |----------- |-------------- |------------ |
|SYSTEMDB |remotehost3 |30201 |nameserver   |        1 |      3 |DC3       |clusternode2 |    30201 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |remotehost3 |30207 |xsengine     |        2 |      3 |DC3       |clusternode2 |    30207 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |remotehost3 |30203 |indexserver  |        3 |      3 |DC3       |clusternode2 |    30203 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |

status system replication site "2": ACTIVE
overall system replication status: ACTIVE

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: PRIMARY
site id: 3
site name: DC3
Status 15
Return code 15 also says that everything is okay, but clusternode1 is missing. The former primary, clusternode1, is no longer listed, so the replication relationship to it is lost, and it has to be re-registered manually.
- Set the maintenance-mode.
If it was not already done before, set the maintenance-mode on one node of the cluster with the following command:
[root@clusternode1]# pcs property set maintenance-mode=true
You can check whether the maintenance-mode is active by running:
[root@clusternode1]# pcs resource
  * Clone Set: SAPHanaTopology_RH2_02-clone [SAPHanaTopology_RH2_02] (unmanaged):
    * SAPHanaTopology_RH2_02  (ocf::heartbeat:SAPHanaTopology):  Started clusternode2 (unmanaged)
    * SAPHanaTopology_RH2_02  (ocf::heartbeat:SAPHanaTopology):  Started clusternode1 (unmanaged)
  * Clone Set: SAPHana_RH2_02-clone [SAPHana_RH2_02] (promotable, unmanaged):
    * SAPHana_RH2_02  (ocf::heartbeat:SAPHana):  Slave clusternode2 (unmanaged)
    * SAPHana_RH2_02  (ocf::heartbeat:SAPHana):  Master clusternode1 (unmanaged)
  * vip_RH2_02_MASTER  (ocf::heartbeat:IPaddr2):  Started clusternode1 (unmanaged)
The resources are shown as unmanaged, which indicates that the cluster is in maintenance-mode=true. The virtual IP address is still started on clusternode1. If you want to use this IP on another node, disable vip_RH2_02_MASTER before setting maintenance-mode=true:
[root@clusternode1]# pcs resource disable vip_RH2_02_MASTER
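As an alternative to reading the resource list, you can also query the property itself. A minimal sketch, assuming a pcs version that still supports pcs property show (newer releases use pcs property config instead):

[root@clusternode1]# pcs property show maintenance-mode
Cluster Properties:
 maintenance-mode: true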
When checking the sr_state on clusternode1, you will see only a relationship to DC2:
clusternode1:rh2adm> hdbnsutil -sr_state

System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~

online: true

mode: primary
operation mode: primary
site id: 1
site name: DC1

is source system: true
is secondary/consumer system: false
has secondaries/consumers attached: true
is a takeover active: false
is primary suspended: false

Host Mappings:
~~~~~~~~~~~~~~

clusternode1 -> [DC2] clusternode2
clusternode1 -> [DC1] clusternode1

Site Mappings:
~~~~~~~~~~~~~~
DC1 (primary/primary)
|---DC2 (syncmem/logreplay)

Tier of DC1: 1
Tier of DC2: 2

Replication mode of DC1: primary
Replication mode of DC2: syncmem

Operation mode of DC1: primary
Operation mode of DC2: logreplay

Mapping: DC1 -> DC2
done.
However, when checking DC2, the primary database server is DC3. So the information on DC1 is incorrect:
clusternode2:rh2adm> hdbnsutil -sr_state
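Filtering for the relevant lines shows the changed primary. An abbreviated sketch, with values as expected after the takeover described above:

clusternode2:rh2adm> hdbnsutil -sr_state | egrep -e "^mode:|primary masters"
mode: syncmem
primary masters: remotehost3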
If we check the system replication status on DC1, the return code is 12 (unknown). So DC1 needs to be re-registered.
You can use this command to register the former primary, clusternode1, as a new secondary of remotehost3:
clusternode1:rh2adm> hdbnsutil -sr_register --remoteHost=remotehost3 --remoteInstance=${TINSTANCE} --replicationMode=async --name=DC1 --remoteName=DC3 --operationMode=logreplay --online
After the registration is finished, you will see all three sites replicating on remotehost3, and the status (return code) will change to 15. If this fails, you have to manually remove the replication relationships on DC1 and DC3. Please follow the instructions described in Register Secondary. For example, list the existing relationships with:
hdbnsutil -sr_state
An example of removing an existing relationship:
clusternode1:rh2adm> hdbnsutil -sr_unregister --name=DC2
This is usually not necessary.
We assume that test 4 will be executed after test 3, so the recovery step is to run test 4.