5.5. Test 3: Failover of the primary node to the third site

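Subject of the test	Failover of the primary to the third site. The third site becomes the primary. The secondary is re-registered to the third site.
Test preconditions	SAP HANA on DC1, DC2, and DC3 is running. The cluster is up and running without errors or warnings. System replication is in place and in sync (check with `python systemReplicationStatus.py`).
Test steps	Put the cluster into `maintenance-mode` to be able to recover. Take over the HANA database from the third node using `hdbnsutil -sr_takeover`.
Starting the test	Execute the SAP HANA command on remotehost3: `remotehost3:rh2adm> hdbnsutil -sr_takeover`
Monitoring the test	On the third site, run as sidadm: `watch hdbnsutil -sr_state`
Expected result	The third node becomes the primary. The secondary node changes its primary master to remotehost3. The former primary node needs to be re-registered to the new primary.
Ways to return to an initial state	Run Test 4: Failover of the primary node to the first site.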

Detailed description

Check that the database is running (see Check database) and check the replication status:

clusternode2:rh2adm> hdbnsutil -sr_state | egrep -e "^mode:|primary masters"

The output is:

mode: syncmem
primary masters: clusternode1

In this case, the primary database is clusternode1. If you run this command on clusternode1, you get:

mode: primary
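
The two checks above can be combined into one small filter. The following is a minimal sketch, assuming the `mode:` and `primary masters:` line formats shown in the outputs above; `sr_summary` is a hypothetical helper name and is not part of SAP HANA.

```shell
# Sketch: summarize `hdbnsutil -sr_state` output read from stdin.
# sr_summary is a hypothetical helper name, not an SAP HANA tool.
sr_summary() {
    awk -F': *' '
        /^mode:/            { mode = $2 }      # e.g. "mode: syncmem"
        /^primary masters:/ { masters = $2 }   # only printed on secondaries
        END {
            # A primary prints no "primary masters" line, so report "-".
            printf "mode=%s primary_masters=%s\n", mode, (masters == "" ? "-" : masters)
        }'
}
```

As the `<sid>adm` user it could be used as `hdbnsutil -sr_state | sr_summary`. On a secondary the summary names the current primary master; on the primary the second field is `-`.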

On this primary node, you can also display the system replication status. It should look similar to the following:

clusternode1:rh2adm> cdpy
clusternode1:rh2adm> python systemReplicationStatus.py
|Database |Host   |Port  |Service Name |Volume ID |Site ID |Site Name |Secondary |Secondary |Secondary |Secondary |Secondary     |Replication |Replication |Replication    |Secondary    |
|         |       |      |             |          |        |          |Host      |Port      |Site ID   |Site Name |Active Status |Mode        |Status      |Status Details |Fully Synced |
|-------- |------ |----- |------------ |--------- |------- |--------- |--------- |--------- |--------- |--------- |------------- |----------- |----------- |-------------- |------------ |
|SYSTEMDB |clusternode1 |30201 |nameserver   |        1 |      1 |DC1       |remotehost3    |    30201 |        3 |DC3       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |clusternode1 |30207 |xsengine     |        2 |      1 |DC1       |remotehost3    |    30207 |        3 |DC3       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |clusternode1 |30203 |indexserver  |        3 |      1 |DC1       |remotehost3    |    30203 |        3 |DC3       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|SYSTEMDB |clusternode1 |30201 |nameserver   |        1 |      1 |DC1       |clusternode2    |    30201 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |clusternode1 |30207 |xsengine     |        2 |      1 |DC1       |clusternode2    |    30207 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |clusternode1 |30203 |indexserver  |        3 |      1 |DC1       |clusternode2    |    30203 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |

status system replication site "3": ACTIVE
status system replication site "2": ACTIVE
overall system replication status: ACTIVE

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: PRIMARY
site id: 1
site name: DC1

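With a proper environment in place, we can start monitoring the system replication status on all three nodes. Start the three monitors before the test begins. Their output changes while the test runs, so keep them running until the test is complete.

On the old primary node, clusternode1, run the following in a separate window during the test: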

clusternode1:rh2adm> watch -n 5 'python /usr/sap/${SAPSYSTEMNAME}/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py ; echo Status $?'

The output on clusternode1 will be:

Every 5.0s: python /usr/sap/${SAPSYSTEMNAME}/HDB${TINSTANCE}/exe/python_support/systemReplicati...  clusternode1: Tue XXX XX HH:MM:SS 2023

|Database |Host   |Port  |Service Name |Volume ID |Site ID |Site Name |Secondary |Secondary |Secondary |Secondary |Secondary     |Replication |Replication |Replication    |Secondary    |
|         |       |      |             |          |        |          |Host      |Port      |Site ID   |Site Name |Active Status |Mode        |Status      |Status Details |Fully Synced |
|-------- |------ |----- |------------ |--------- |------- |--------- |--------- |--------- |--------- |--------- |------------- |----------- |----------- |-------------- |------------ |
|SYSTEMDB |clusternode1 |30201 |nameserver   |        1 |      1 |DC1       |remotehost3    |    30201 |        3 |DC3       |YES           |ASYNC       |ACTIVE      |               |        True |
|RH2      |clusternode1 |30207 |xsengine     |        2 |      1 |DC1       |remotehost3    |    30207 |        3 |DC3       |YES           |ASYNC       |ACTIVE      |               |        True |
|RH2      |clusternode1 |30203 |indexserver  |        3 |      1 |DC1       |remotehost3    |    30203 |        3 |DC3       |YES           |ASYNC       |ACTIVE      |               |        True |
|SYSTEMDB |clusternode1 |30201 |nameserver   |        1 |      1 |DC1       |clusternode2    |    30201 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |clusternode1 |30207 |xsengine     |        2 |      1 |DC1       |clusternode2    |    30207 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |clusternode1 |30203 |indexserver  |        3 |      1 |DC1       |clusternode2    |    30203 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |

status system replication site "3": ACTIVE
status system replication site "2": ACTIVE
overall system replication status: ACTIVE

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: PRIMARY
site id: 1
site name: DC1
Status 15

On remotehost3, run the same command:

remotehost3:rh2adm> watch -n 5 'python /usr/sap/${SAPSYSTEMNAME}/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py ; echo Status $?'

The response will be:

this system is either not running or is not primary system replication site

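This changes once the test initiates the failover. After the takeover, the output looks similar to the output shown for the primary node before the test was started.

On the second node, start: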

clusternode2:rh2adm> watch -n 10 'hdbnsutil -sr_state | grep masters'

This shows the current master, clusternode1, and switches immediately after the failover is initiated.
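
If you prefer a command that returns as soon as the master changes, rather than watching the output, the same check can be written as a polling loop. This is only a sketch under stated assumptions: `current_master` and `wait_for_master_change` are hypothetical helper names, and the loop assumes `hdbnsutil` is available to the `<sid>adm` user and prints a `primary masters:` line as shown earlier.

```shell
# Sketch: poll until the primary master reported on this secondary changes.
# current_master and wait_for_master_change are hypothetical helper names.
current_master() {
    hdbnsutil -sr_state | awk -F': *' '/^primary masters:/ { print $2 }'
}

# wait_for_master_change OLD_MASTER [INTERVAL_SECONDS] [MAX_TRIES]
wait_for_master_change() {
    wfm_old=$1
    wfm_interval=${2:-10}
    wfm_tries=${3:-60}
    while [ "$wfm_tries" -gt 0 ]; do
        wfm_now=$(current_master)
        # Report the new master as soon as it differs from the old one.
        if [ -n "$wfm_now" ] && [ "$wfm_now" != "$wfm_old" ]; then
            echo "$wfm_now"
            return 0
        fi
        wfm_tries=$((wfm_tries - 1))
        sleep "$wfm_interval"
    done
    return 1   # no change observed within the allowed number of tries
}
```

For example, `wait_for_master_change clusternode1` on clusternode2 would print the new master once the takeover has been picked up, or return a non-zero status if nothing changes within the default number of tries.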

To ensure that everything is configured correctly, also check global.ini.

Check global.ini on DC1, DC2, and DC3:

On all three nodes, global.ini should contain:

[persistent]
log_mode=normal
[system_replication]
register_secondaries_on_takeover=true

You can edit global.ini with:

clusternode1:rh2adm> vim /usr/sap/${SAPSYSTEMNAME}/SYS/global/hdb/custom/config/global.ini
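
Checking the two settings by eye on three nodes is error-prone, so a scripted check may help. The following is a minimal sketch, not an SAP or Red Hat tool: `ini_get` and `check_global_ini` are hypothetical helper names, and the parser assumes simple `key=value` lines under `[section]` headers as in the fragment above.

```shell
# Sketch: verify the two global.ini settings required for this test.
# ini_get and check_global_ini are hypothetical helper names.

# ini_get FILE SECTION KEY -> prints the value of KEY inside [SECTION], if present
ini_get() {
    awk -v section="[$2]" -v key="$3" '
        /^[ \t]*\[/ { gsub(/[ \t\r]/, ""); current = $0; next }   # section header
        current == section {
            line = $0
            gsub(/[ \t\r]/, "", line)          # tolerate "key = value"
            n = index(line, "=")
            if (n > 0 && substr(line, 1, n - 1) == key) print substr(line, n + 1)
        }' "$1"
}

# check_global_ini FILE -> returns 0 only if both settings have the expected value
check_global_ini() {
    cgi_rc=0
    [ "$(ini_get "$1" persistent log_mode)" = "normal" ] \
        || { echo "log_mode is not normal in [persistent]"; cgi_rc=1; }
    [ "$(ini_get "$1" system_replication register_secondaries_on_takeover)" = "true" ] \
        || { echo "register_secondaries_on_takeover is not true in [system_replication]"; cgi_rc=1; }
    return $cgi_rc
}
```

Run on each node as the `<sid>adm` user, for example `check_global_ini /usr/sap/${SAPSYSTEMNAME}/SYS/global/hdb/custom/config/global.ini`. A non-zero exit status indicates that at least one of the settings differs from the values listed above.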

[Optional] Put the cluster into maintenance-mode:
```
[root@clusternode1]# pcs property set maintenance-mode=true
```
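During the tests you will find that the failover works both with and without maintenance-mode set, so you can run the first test without it. During recovery, however, maintenance-mode should be set. Not setting it beforehand remains an option, for example if the primary is not accessible.

Start the test: Failover to DC3. On remotehost3, run: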

remotehost3:rh2adm> hdbnsutil -sr_takeover
done.

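The test has started; now check the output of the monitors started earlier. On clusternode1, the system replication status loses its relationship to remotehost3 and clusternode2 (DC2): remotehost3 is no longer listed, and the connection to clusternode2 reports an error.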

Every 5.0s: python /usr/sap/RH2/HDB02/exe/python_support/systemReplicationStatus.py ; echo Status $?                               clusternode1: Mon Sep  4 11:52:16 2023

|Database |Host   |Port  |Service Name |Volume ID |Site ID |Site Name |Secondary |Secondary |Secondary |Secondary |Secondary     |Replication |Replication |Replication                  |Secondary    |
|         |       |      |             |          |        |          |Host      |Port      |Site ID   |Site Name |Active Status |Mode        |Status      |Status Details               |Fully Synced |
|-------- |------ |----- |------------ |--------- |------- |--------- |--------- |--------- |--------- |--------- |------------- |----------- |----------- |---------------------------- |------------ |
|SYSTEMDB |clusternode1 |30201 |nameserver   |        1 |      1 |DC1       |clusternode2    |    30201 |        2 |DC2       |YES           |SYNCMEM     |ERROR       |Communication channel closed |       False |
|RH2      |clusternode1 |30207 |xsengine     |        2 |      1 |DC1       |clusternode2    |    30207 |        2 |DC2       |YES           |SYNCMEM     |ERROR       |Communication channel closed |       False |
|RH2      |clusternode1 |30203 |indexserver  |        3 |      1 |DC1       |clusternode2    |    30203 |        2 |DC2       |YES           |SYNCMEM     |ERROR       |Communication channel closed |       False |

status system replication site "2": ERROR
overall system replication status: ERROR

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: PRIMARY
site id: 1
site name: DC1
Status 11

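The cluster still does not notice this behavior. If you check the return code of the system replication status, return code 11 means error, which tells you that something is wrong. If you have access, it is a good idea to enter maintenance-mode now.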
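
Because the monitors print the return code as `Status $?`, it can help to translate the number into a word when scripting around `systemReplicationStatus.py`. The sketch below maps only the three codes that this test mentions (11, 12, and 15); `describe_sr_rc` is a hypothetical helper name, and any other code is reported without interpretation.

```shell
# Sketch: translate the return code of systemReplicationStatus.py.
# describe_sr_rc is a hypothetical helper; only the codes named in this
# test description are mapped, everything else is passed through.
describe_sr_rc() {
    case "$1" in
        15) echo "ACTIVE" ;;    # replication is active
        12) echo "UNKNOWN" ;;   # state cannot be determined
        11) echo "ERROR" ;;     # replication reports an error
        *)  echo "OTHER ($1)" ;;
    esac
}
```

A possible use, as the `<sid>adm` user, is `python systemReplicationStatus.py >/dev/null; describe_sr_rc $?`.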

remotehost3 becomes the new primary, and clusternode2 (DC2) is automatically re-registered as a secondary of remotehost3.

Example output of the system replication status on remotehost3:

Every 5.0s: python /usr/sap/RH2/HDB02/exe/python_support/systemReplicationStatus.py ; echo Status $?                               remotehost3: Mon Sep  4 13:55:29 2023

|Database |Host   |Port  |Service Name |Volume ID |Site ID |Site Name |Secondary |Secondary |Secondary |Secondary |Secondary     |Replication |Replication |Replication    |Secondary    |
|         |       |      |             |          |        |          |Host      |Port      |Site ID   |Site Name |Active Status |Mode        |Status      |Status Details |Fully Synced |
|-------- |------ |----- |------------ |--------- |------- |--------- |--------- |--------- |--------- |--------- |------------- |----------- |----------- |-------------- |------------ |
|SYSTEMDB |remotehost3 |30201 |nameserver   |        1 |      3 |DC3       |clusternode2    |    30201 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |remotehost3 |30207 |xsengine     |        2 |      3 |DC3       |clusternode2    |    30207 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |remotehost3 |30203 |indexserver  |        3 |      3 |DC3       |clusternode2    |    30203 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |

status system replication site "2": ACTIVE
overall system replication status: ACTIVE

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: PRIMARY
site id: 3
site name: DC3
Status 15

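Return code 15 also indicates that everything is okay, but clusternode1 is missing: the former primary is not listed, so its replication relationship has been lost. It must be re-registered manually.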
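
Since return code 15 alone does not reveal that a site is missing, it can be useful to list which secondary sites the primary actually reports. This is a minimal sketch that relies on the `status system replication site "N": STATUS` lines shown in the outputs above; `site_status` is a hypothetical helper name.

```shell
# Sketch: list the secondary sites reported by systemReplicationStatus.py.
# Reads the script output on stdin and prints "<site id> <status>" per site.
# site_status is a hypothetical helper name.
site_status() {
    sed -n 's/^status system replication site "\([0-9][0-9]*\)": *\([A-Z_]*\).*$/\1 \2/p'
}
```

Piping the output of `python systemReplicationStatus.py` on remotehost3 into `site_status` after the takeover would, given the output above, list only site 2. Site 1 (DC1) should appear only after clusternode1 has been re-registered.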

Set maintenance-mode.

If you have not done so already, set maintenance-mode on one node of the cluster with the following command:

[root@clusternode1]# pcs property set maintenance-mode=true

You can check whether maintenance-mode is active by running:

[root@clusternode1]# pcs resource
  * Clone Set: SAPHanaTopology_RH2_02-clone [SAPHanaTopology_RH2_02] (unmanaged):
    * SAPHanaTopology_RH2_02    (ocf::heartbeat:SAPHanaTopology):        Started clusternode2 (unmanaged)
    * SAPHanaTopology_RH2_02    (ocf::heartbeat:SAPHanaTopology):        Started clusternode1 (unmanaged)
  * Clone Set: SAPHana_RH2_02-clone [SAPHana_RH2_02] (promotable, unmanaged):
    * SAPHana_RH2_02    (ocf::heartbeat:SAPHana):        Slave clusternode2 (unmanaged)
    * SAPHana_RH2_02    (ocf::heartbeat:SAPHana):        Master clusternode1 (unmanaged)
  * vip_RH2_02_MASTER   (ocf::heartbeat:IPaddr2):        Started clusternode1 (unmanaged)

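The resources are displayed as unmanaged, which indicates that the cluster is in maintenance-mode=true. The virtual IP address is still started on clusternode1. If you want to use this IP on another node, disable vip_RH2_02_MASTER before you set maintenance-mode=true.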

[root@clusternode1]# pcs resource disable vip_RH2_02_MASTER
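
The check for unmanaged resources can also be scripted. The sketch below reads `pcs resource` output from stdin and succeeds only if every resource line carries the `(unmanaged)` flag. `all_unmanaged` is a hypothetical helper name, and the match on `(ocf:` assumes the output format shown above. Note that individual resources can also be unmanaged for other reasons, so treat a positive result as an indication rather than proof of maintenance-mode.

```shell
# Sketch: succeed only if all resources in `pcs resource` output are unmanaged.
# all_unmanaged is a hypothetical helper name. Resource lines are recognized
# by their agent specification, for example "(ocf::heartbeat:SAPHana)".
all_unmanaged() {
    awk '
        /\(ocf:/ {
            total++
            if ($0 ~ /\(unmanaged\)/) unmanaged++
        }
        END { exit !(total > 0 && total == unmanaged + 0) }'
}
```

As root, this could be used as `pcs resource | all_unmanaged && echo "all resources unmanaged"`.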

Re-register clusternode1.

When checking sr_state on clusternode1, you will see a relationship only to DC2:

clusternode1:rh2adm> hdbnsutil -sr_state

System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~

online: true

mode: primary
operation mode: primary
site id: 1
site name: DC1

is source system: true
is secondary/consumer system: false
has secondaries/consumers attached: true
is a takeover active: false
is primary suspended: false

Host Mappings:
~~~~~~~~~~~~~~

clusternode1 -> [DC2] clusternode2
clusternode1 -> [DC1] clusternode1


Site Mappings:
~~~~~~~~~~~~~~
DC1 (primary/primary)
    |---DC2 (syncmem/logreplay)

Tier of DC1: 1
Tier of DC2: 2

Replication mode of DC1: primary
Replication mode of DC2: syncmem

Operation mode of DC1: primary
Operation mode of DC2: logreplay

Mapping: DC1 -> DC2
done.

However, when we check DC2, the primary database server is DC3, so the information from DC1 is not correct.

clusternode2:rh2adm> hdbnsutil -sr_state

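If we check the system replication status on DC1, the return code is 12, which means unknown. DC1 therefore needs to be re-registered.

You can use the following command to register the former primary clusternode1 as a new secondary of remotehost3. For --replicationMode, specify exactly one of the modes that SAP HANA supports (sync, syncmem, or async), matching the mode used for this site in your landscape.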

clusternode1:rh2adm> hdbnsutil -sr_register --remoteHost=remotehost3 --remoteInstance=${TINSTANCE} --replicationMode=asyncsyncmem --name=DC1 --remoteName=DC3 --operationMode=logreplay --online

Once registration is complete, remotehost3 shows all three sites in replication, and the status (return code) is 15.
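
Because a mistyped replication mode is an easy error to make in the registration command, the command line can be assembled and validated before it is run. The following is a minimal sketch: `build_register_cmd` is a hypothetical helper name, it only prints the command and does not execute it, and the flags are exactly those used in the registration command above.

```shell
# Sketch: assemble the hdbnsutil registration command and reject an
# invalid replication mode before anything is executed.
# build_register_cmd is a hypothetical helper name.
# Usage: build_register_cmd REMOTE_HOST REMOTE_INSTANCE MODE LOCAL_SITE REMOTE_SITE
build_register_cmd() {
    case "$3" in
        sync|syncmem|async) ;;   # the replication modes SAP HANA accepts
        *) echo "invalid replication mode: $3" >&2; return 1 ;;
    esac
    printf 'hdbnsutil -sr_register --remoteHost=%s --remoteInstance=%s --replicationMode=%s --name=%s --remoteName=%s --operationMode=logreplay --online\n' \
        "$1" "$2" "$3" "$4" "$5"
}
```

For example, `build_register_cmd remotehost3 "${TINSTANCE}" syncmem DC1 DC3` prints the command for review. The mode `syncmem` is only an assumption for illustration; use the mode that applies to your landscape.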

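If this fails, you have to remove the replication relationship on DC1 and DC3 manually. Follow the instructions described in Register Secondary.

For example, list the existing relationships with: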

clusternode1:rh2adm> hdbnsutil -sr_state

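To remove an existing relationship, you can use:

clusternode1:rh2adm> hdbnsutil -sr_unregister --name=DC2

This is usually not necessary. We assume that Test 4 is performed after Test 3, so the recovery step is to run Test 4.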
