5.6. Test 4: Fail over the primary node to the first site


Topic of the test

The primary switches back to a cluster node.

Failback and enabling the cluster again.

Re-registering the third site as a secondary.

Test prerequisites

  • The SAP HANA primary is running on the third site.
  • The cluster nodes are running.
  • The cluster is in maintenance-mode.
  • The former cluster primary is accessible.
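
A quick way to verify these prerequisites is to reuse two commands that also appear in the detailed description below; this is a minimal check, not an extra step of the test itself:

[root@clusternode1]# pcs property config maintenance-mode
remotehost3:rh2adm> hdbnsutil -sr_state | egrep -e "^mode:|primary masters"

The first command should report maintenance-mode: true, and the second should report mode: primary on remotehost3.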

Test steps

Check the expected primary of the cluster.

Fail over from the DC3 node to the DC1 node.

Check whether the former secondary has switched to the new primary.

Re-register remotehost3 as the new secondary.

Set the cluster maintenance-mode=false so that the cluster continues to work.

Monitoring the test

On the new primary, start:

remotehost3:rh2adm> watch python $DIR_EXECUTABLE/python_support/systemReplicationStatus.py

[root@clusternode1]# watch pcs status --full

On the secondary, start:

clusternode:rh2adm> watch hdbnsutil -sr_state

Starting the test

Check the expected primary of the cluster:

[root@clusternode1]# pcs resource

The VIP and the promoted SAP HANA resource should be running on the same node, which is the potential new primary.

On this potential new primary, run:

clusternode1:rh2adm> hdbnsutil -sr_takeover

Re-register the former primary as the new secondary:

remotehost3:rh2adm> hdbnsutil -sr_register \
    --remoteHost=clusternode1 \
    --remoteInstance=${TINSTANCE} \
    --replicationMode=syncmem \
    --name=DC3 \
    --remoteName=DC1 \
    --operationMode=logreplay \
    --force_full_replica \
    --online

After setting maintenance-mode=false, the cluster will continue to work.

Expected results

The new primary starts SAP HANA.

The replication status will show replication to all 3 sites.

The second cluster site is automatically re-registered to the new primary.

The disaster recovery (DR) site becomes an additional copy of the database.

How to return to the initial state

Run test 3.

Detailed description

  • Check whether the cluster is set to maintenance-mode:

    [root@clusternode1]# pcs property config maintenance-mode
    Cluster Properties:
     maintenance-mode: true

    If maintenance-mode is not true, you can set it with:

    [root@clusternode1]# pcs property set maintenance-mode=true
  • Check the system replication status and discover the primary database on all nodes. First, identify the primary database using the following command:

    clusternode1:rh2adm> hdbnsutil -sr_state | egrep -e "^mode:|primary masters"

The output should look like this.

On clusternode1:

clusternode1:rh2adm> hdbnsutil -sr_state | egrep -e "^mode:|primary masters"
mode: syncmem
primary masters: remotehost3

On clusternode2:

clusternode2:rh2adm> hdbnsutil -sr_state | egrep -e "^mode:|primary masters"
mode: syncmem
primary masters: remotehost3

On remotehost3:

remotehost3:rh2adm> hdbnsutil -sr_state | egrep -e "^mode:|primary masters"
mode: primary

On all three nodes, the primary database is remotehost3. On this primary, you must ensure that the system replication status is ACTIVE for all three nodes and that the return code is 15:

remotehost3:rh2adm> python /usr/sap/$SAPSYSTEMNAME/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py
|Database |Host   |Port  |Service Name |Volume ID |Site ID |Site Name |Secondary |Secondary |Secondary |Secondary |Secondary     |Replication |Replication |Replication    |Secondary    |
|         |       |      |             |          |        |          |Host      |Port      |Site ID   |Site Name |Active Status |Mode        |Status      |Status Details |Fully Synced |
|-------- |------ |----- |------------ |--------- |------- |--------- |--------- |--------- |--------- |--------- |------------- |----------- |----------- |-------------- |------------ |
|SYSTEMDB |remotehost3 |30201 |nameserver   |        1 |      3 |DC3       |clusternode2    |    30201 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |remotehost3 |30207 |xsengine     |        2 |      3 |DC3       |clusternode2    |    30207 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |remotehost3 |30203 |indexserver  |        3 |      3 |DC3       |clusternode2    |    30203 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|SYSTEMDB |remotehost3 |30201 |nameserver   |        1 |      3 |DC3       |clusternode1    |    30201 |        1 |DC1       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |remotehost3 |30207 |xsengine     |        2 |      3 |DC3       |clusternode1    |    30207 |        1 |DC1       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |remotehost3 |30203 |indexserver  |        3 |      3 |DC3       |clusternode1    |    30203 |        1 |DC1       |YES           |SYNCMEM     |ACTIVE      |               |        True |

status system replication site "2": ACTIVE
status system replication site "1": ACTIVE
overall system replication status: ACTIVE

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: PRIMARY
site id: 3
site name: DC3
[rh2adm@remotehost3: python_support]# echo $?
15
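
If you prefer to wait for this state from a script instead of re-running the command manually, a minimal sketch could look like the following; it simply repeats the check until the Python script returns 15, using the same path and variables as elsewhere in this document:

remotehost3:rh2adm> while true; do
  # run the status script and keep only its return code
  python /usr/sap/$SAPSYSTEMNAME/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py > /dev/null
  rc=$?
  # 15 means all secondaries are ACTIVE
  [ "$rc" -eq 15 ] && break
  echo "return code $rc - replication not yet ACTIVE on all sites, waiting ..."
  sleep 10
done
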
  • Check that all three sr_states are consistent.

Run hdbnsutil -sr_state --sapcontrol=1 | grep site.*Mode on all three nodes:

clusternode1:rh2adm> hdbnsutil -sr_state --sapcontrol=1 | grep site.*Mode
clusternode2:rh2adm> hdbnsutil -sr_state --sapcontrol=1 | grep site.*Mode
remotehost3:rh2adm> hdbnsutil -sr_state --sapcontrol=1 | grep site.*Mode

The output should be identical on all nodes:

siteReplicationMode/DC1=primary
siteReplicationMode/DC3=async
siteReplicationMode/DC2=syncmem
siteOperationMode/DC1=primary
siteOperationMode/DC3=logreplay
siteOperationMode/DC2=logreplay
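
If you want to collect this comparison from a single host, a minimal sketch could look like the following; it assumes passwordless ssh for the rh2adm user between the three hosts and that hdbnsutil is found in the PATH of the non-interactive shell, both of which are assumptions of this sketch and not requirements of the setup:

clusternode1:rh2adm> for host in clusternode1 clusternode2 remotehost3; do
  # print the host name, then its sorted site*Mode lines for easy comparison
  echo "### $host"
  ssh $host "hdbnsutil -sr_state --sapcontrol=1" | grep 'site.*Mode' | sort
done
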
  • Start the monitoring in separate windows.

On clusternode1, start:

clusternode1:rh2adm> watch "python /usr/sap/$SAPSYSTEMNAME/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py; echo \$?"

On remotehost3, start:

remotehost3:rh2adm> watch "python /usr/sap/$SAPSYSTEMNAME/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py; echo \$?"

On clusternode2, start:

clusternode2:rh2adm> watch "hdbnsutil -sr_state --sapcontrol=1 |grep  siteReplicationMode"
  • Start the test.

To switch over to clusternode1, start on clusternode1:

clusternode1:rh2adm> hdbnsutil -sr_takeover
done.
  • Check the output of the monitors.

The monitor on clusternode1 will change to:

Every 2.0s: python systemReplicationStatus.py; echo $?                                                                                                                                                            clusternode1: Mon Sep  4 23:34:30 2023

|Database |Host   |Port  |Service Name |Volume ID |Site ID |Site Name |Secondary |Secondary |Secondary |Secondary |Secondary     |Replication |Replication |Replication    |Secondary    |
|         |       |      |             |          |        |          |Host      |Port      |Site ID   |Site Name |Active Status |Mode        |Status      |Status Details |Fully Synced |
|-------- |------ |----- |------------ |--------- |------- |--------- |--------- |--------- |--------- |--------- |------------- |----------- |----------- |-------------- |------------ |
|SYSTEMDB |clusternode1 |30201 |nameserver   |        1 |      1 |DC1       |clusternode2    |    30201 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |clusternode1 |30207 |xsengine     |        2 |      1 |DC1       |clusternode2    |    30207 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |clusternode1 |30203 |indexserver  |        3 |      1 |DC1       |clusternode2    |    30203 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |

status system replication site "2": ACTIVE
overall system replication status: ACTIVE

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: PRIMARY
site id: 1
site name: DC1
15

An important piece of information is also the return code 15. The monitor on clusternode2 will change to:

Every 2.0s: hdbnsutil -sr_state --sapcontrol=1 |grep  site.*Mode                                                                clusternode2: Mon Sep  4 23:35:18 2023

siteReplicationMode/DC1=primary
siteReplicationMode/DC2=syncmem
siteOperationMode/DC1=primary
siteOperationMode/DC2=logreplay

DC3 has vanished and needs to be re-registered. On remotehost3, systemReplicationStatus reports an error and the return code changes to 11.
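
To verify this on remotehost3 before re-registering it, you can run the same status script there; a minimal check based on the commands already used above:

remotehost3:rh2adm> python /usr/sap/$SAPSYSTEMNAME/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py; echo $?

As long as remotehost3 is not re-registered, the script reports an error and the return code is 11.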

  • Check whether the cluster node has been re-registered.

    clusternode1:rh2adm>  hdbnsutil -sr_state
    
    System Replication State
    ~~~~~~~~~~~~~~~~~~~~~~~~
    
    online: true
    
    mode: primary
    operation mode: primary
    site id: 1
    site name: DC1
    
    is source system: true
    is secondary/consumer system: false
    has secondaries/consumers attached: true
    is a takeover active: false
    is primary suspended: false
    
    Host Mappings:
    ~~~~~~~~~~~~~~
    
    clusternode1 -> [DC2] clusternode2
    clusternode1 -> [DC1] clusternode1
    
    
    Site Mappings:
    ~~~~~~~~~~~~~~
    DC1 (primary/primary)
        |---DC2 (syncmem/logreplay)
    
    Tier of DC1: 1
    Tier of DC2: 2
    
    Replication mode of DC1: primary
    Replication mode of DC2: syncmem
    
    Operation mode of DC1: primary
    Operation mode of DC2: logreplay
    
    Mapping: DC1 -> DC2
    done.

    The Site Mappings show that clusternode2 (DC2) has been re-registered.

  • Check and enable the vip resource:

    [root@clusternode1]# pcs resource
      * Clone Set: SAPHanaTopology_RH2_02-clone [SAPHanaTopology_RH2_02] (unmanaged):
        * SAPHanaTopology_RH2_02    (ocf::heartbeat:SAPHanaTopology):        Started clusternode2 (unmanaged)
        * SAPHanaTopology_RH2_02    (ocf::heartbeat:SAPHanaTopology):        Started clusternode1 (unmanaged)
      * Clone Set: SAPHana_RH2_02-clone [SAPHana_RH2_02] (promotable, unmanaged):
        * SAPHana_RH2_02    (ocf::heartbeat:SAPHana):        Master clusternode2 (unmanaged)
        * SAPHana_RH2_02    (ocf::heartbeat:SAPHana):        Slave clusternode1 (unmanaged)
      * vip_RH2_02_MASTER   (ocf::heartbeat:IPaddr2):        Stopped (disabled, unmanaged)

    The vip resource vip_RH2_02_MASTER is stopped. To start it again, run:

[root@clusternode1]# pcs resource enable vip_RH2_02_MASTER
Warning: 'vip_RH2_02_MASTER' is unmanaged

The warning is correct, because the cluster will not start any resource unless maintenance-mode=false.
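
If you want to confirm that the resource is now enabled but not yet started, you can re-check the resource list; a minimal check using the command already shown above:

[root@clusternode1]# pcs resource | grep vip_RH2_02_MASTER

While maintenance-mode is still true, the resource should still be listed as Stopped and unmanaged, but no longer as disabled.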

  • Stop the cluster maintenance-mode.

Before we disable maintenance-mode, we should start two monitors in separate windows to watch the changes. On clusternode2, run:

[root@clusternode2]# watch pcs status --full

On clusternode1, run:

clusternode1:rh2adm> watch "python /usr/sap/$SAPSYSTEMNAME/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py; echo \$?"

Now you can unset maintenance-mode on clusternode1 with the following command:

[root@clusternode1]# pcs property set maintenance-mode=false

The monitor on clusternode2 should show that everything is now running as expected:

Every 2.0s: pcs status --full                                                                                                                                                                                     clusternode1: Tue Sep  5 00:01:17 2023

Cluster name: cluster1
Cluster Summary:
  * Stack: corosync
  * Current DC: clusternode1 (1) (version 2.1.2-4.el8_6.6-ada5c3b36e2) - partition with quorum
  * Last updated: Tue Sep  5 00:01:17 2023
  * Last change:  Tue Sep  5 00:00:30 2023 by root via crm_attribute on clusternode1

  * 2 nodes configured
  * 6 resource instances configured

Node List:
  * Online: [ clusternode1 (1) clusternode2 (2) ]

Full List of Resources:
  * auto_rhevm_fence1   (stonith:fence_rhevm):   Started clusternode1

  * Clone Set: SAPHanaTopology_RH2_02-clone [SAPHanaTopology_RH2_02]:
    * SAPHanaTopology_RH2_02    (ocf::heartbeat:SAPHanaTopology):        Started clusternode2
    * SAPHanaTopology_RH2_02    (ocf::heartbeat:SAPHanaTopology):        Started clusternode1

  * Clone Set: SAPHana_RH2_02-clone [SAPHana_RH2_02] (promotable):
    * SAPHana_RH2_02    (ocf::heartbeat:SAPHana):        Slave clusternode2
    * SAPHana_RH2_02    (ocf::heartbeat:SAPHana):        Master clusternode1

  * vip_RH2_02_MASTER   (ocf::heartbeat:IPaddr2):        Started clusternode1


Node Attributes:
  * Node: clusternode1 (1):
    * hana_rh2_clone_state              : PROMOTED
    * hana_rh2_op_mode                  : logreplay
    * hana_rh2_remoteHost               : clusternode2
    * hana_rh2_roles                    : 4:P:master1:master:worker:master
    * hana_rh2_site                     : DC1
    * hana_rh2_sra                      : -
    * hana_rh2_srah                     : -
    * hana_rh2_srmode                   : syncmem
    * hana_rh2_sync_state               : PRIM
    * hana_rh2_version                  : 2.00.062.00
    * hana_rh2_vhost                    : clusternode1

    * lpa_rh2_lpt                       : 1693872030
    * master-SAPHana_RH2_02             : 150
  * Node: clusternode2 (2):
    * hana_rh2_clone_state              : DEMOTED
    * hana_rh2_op_mode                  : logreplay
    * hana_rh2_remoteHost               : clusternode1

    * hana_rh2_roles                    : 4:S:master1:master:worker:master
    * hana_rh2_site                     : DC2
    * hana_rh2_sra                      : -
    * hana_rh2_srah                     : -
    * hana_rh2_srmode                   : syncmem
    * hana_rh2_sync_state               : SOK
    * hana_rh2_version                  : 2.00.062.00
    * hana_rh2_vhost                    : clusternode2
    * lpa_rh2_lpt                       : 30
    * master-SAPHana_RH2_02             : 100

Migration Summary:

Tickets:

PCSD Status:
  clusternode1: Online
  clusternode2: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

After manual interaction, it is best to clean up the cluster, as described in Cluster Cleanup.
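
A minimal cleanup sketch, using the resource names from this document (refer to the Cluster Cleanup chapter for the full procedure):

[root@clusternode1]# pcs resource cleanup SAPHana_RH2_02-clone
[root@clusternode1]# pcs resource cleanup SAPHanaTopology_RH2_02-clone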

  • Re-register remotehost3 with the new primary on clusternode1.

Remotehost3 needs to be re-registered. To monitor the progress, start on clusternode1:

clusternode1:rh2adm> watch -n 5 'python /usr/sap/$SAPSYSTEMNAME/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py ; echo Status $?'

On remotehost3, start:

remotehost3:rh2adm> watch 'hdbnsutil -sr_state --sapcontrol=1 |grep  siteReplicationMode'

Now you can re-register remotehost3 with this command:

remotehost3:rh2adm> hdbnsutil -sr_register --remoteHost=clusternode1 --remoteInstance=${TINSTANCE} --replicationMode=async --name=DC3 --remoteName=DC1 --operationMode=logreplay --online

The monitor on clusternode1 will change to:

Every 5.0s: python /usr/sap/$SAPSYSTEMNAME/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py ; echo Status $?                                                                                         clusternode1: Tue Sep  5 00:14:40 2023

|Database |Host   |Port  |Service Name |Volume ID |Site ID |Site Name |Secondary |Secondary |Secondary |Secondary |Secondary     |Replication |Replication |Replication    |Secondary    |
|         |       |      |             |          |        |          |Host      |Port      |Site ID   |Site Name |Active Status |Mode        |Status      |Status Details |Fully Synced |
|-------- |------ |----- |------------ |--------- |------- |--------- |--------- |--------- |--------- |--------- |------------- |----------- |----------- |-------------- |------------ |
|SYSTEMDB |clusternode1 |30201 |nameserver   |        1 |      1 |DC1       |remotehost3    |    30201 |        3 |DC3       |YES           |ASYNC     |ACTIVE      |               |        True |
|RH2      |clusternode1 |30207 |xsengine     |        2 |      1 |DC1       |remotehost3    |    30207 |        3 |DC3       |YES           |ASYNC     |ACTIVE      |               |        True |
|RH2      |clusternode1 |30203 |indexserver  |        3 |      1 |DC1       |remotehost3    |    30203 |        3 |DC3       |YES           |ASYNC     |ACTIVE      |               |        True |
|SYSTEMDB |clusternode1 |30201 |nameserver   |        1 |      1 |DC1       |clusternode2    |    30201 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |clusternode1 |30207 |xsengine     |        2 |      1 |DC1       |clusternode2    |    30207 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |clusternode1 |30203 |indexserver  |        3 |      1 |DC1       |clusternode2    |    30203 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |

status system replication site "3": ACTIVE
status system replication site "2": ACTIVE
overall system replication status: ACTIVE

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: PRIMARY
site id: 1
site name: DC1
Status 15

The monitor on remotehost3 will change to:

Every 2.0s: hdbnsutil -sr_state --sapcontrol=1 |grep  site.*Mode                                                                remotehost3: Tue Sep  5 02:15:28 2023

siteReplicationMode/DC1=primary
siteReplicationMode/DC3=syncmem
siteReplicationMode/DC2=syncmem
siteOperationMode/DC1=primary
siteOperationMode/DC3=logreplay
siteOperationMode/DC2=logreplay

Now we have 3 entries again, and remotehost3 (DC3) is once more a secondary site replicating from clusternode1 (DC1).

  • Check whether all nodes are part of the system replication status on clusternode1.

Run hdbnsutil -sr_state --sapcontrol=1 | grep site.*Mode on all three nodes:

clusternode1:rh2adm> hdbnsutil -sr_state --sapcontrol=1 | grep site.*Mode
clusternode2:rh2adm> hdbnsutil -sr_state --sapcontrol=1 | grep site.*Mode
remotehost3:rh2adm> hdbnsutil -sr_state --sapcontrol=1 | grep site.*Mode

On all nodes, we should get the same output:

siteReplicationMode/DC1=primary
siteReplicationMode/DC3=syncmem
siteReplicationMode/DC2=syncmem
siteOperationMode/DC1=primary
siteOperationMode/DC3=logreplay
siteOperationMode/DC2=logreplay
  • Check that pcs status --full shows the sync_state SOK. Run:
[root@clusternode1]# pcs status --full| grep sync_state

The output should show PRIM and SOK:

    * hana_rh2_sync_state               : PRIM
    * hana_rh2_sync_state               : SOK
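
If you want to wait for the secondary to reach SOK from a script, a minimal sketch could be:

[root@clusternode1]# while ! pcs status --full | grep sync_state | grep -q SOK; do sleep 10; done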

Finally, the cluster status should look like the following, including the sync_state values PRIM and SOK:

[root@clusternode1]# pcs status --full
Cluster name: cluster1
Cluster Summary:
  * Stack: corosync
  * Current DC: clusternode1 (1) (version 2.1.2-4.el8_6.6-ada5c3b36e2) - partition with quorum
  * Last updated: Tue Sep  5 00:18:52 2023
  * Last change:  Tue Sep  5 00:16:54 2023 by root via crm_attribute on clusternode1

  * 2 nodes configured
  * 6 resource instances configured

Node List:
  * Online: [ clusternode1 (1) clusternode2 (2) ]

Full List of Resources:
  * auto_rhevm_fence1   (stonith:fence_rhevm):   Started clusternode1

  * Clone Set: SAPHanaTopology_RH2_02-clone [SAPHanaTopology_RH2_02]:
    * SAPHanaTopology_RH2_02    (ocf::heartbeat:SAPHanaTopology):        Started clusternode2
    * SAPHanaTopology_RH2_02    (ocf::heartbeat:SAPHanaTopology):        Started clusternode1

  * Clone Set: SAPHana_RH2_02-clone [SAPHana_RH2_02] (promotable):
    * SAPHana_RH2_02    (ocf::heartbeat:SAPHana):        Slave clusternode2
    * SAPHana_RH2_02    (ocf::heartbeat:SAPHana):        Master clusternode1

  * vip_RH2_02_MASTER   (ocf::heartbeat:IPaddr2):        Started clusternode1


Node Attributes:
  * Node: clusternode1 (1):
    * hana_rh2_clone_state              : PROMOTED
    * hana_rh2_op_mode                  : logreplay
    * hana_rh2_remoteHost               : clusternode2
    * hana_rh2_roles                    : 4:P:master1:master:worker:master
    * hana_rh2_site                     : DC1
    * hana_rh2_sra                      : -
    * hana_rh2_srah                     : -
    * hana_rh2_srmode                   : syncmem
    * hana_rh2_sync_state               : PRIM
    * hana_rh2_version                  : 2.00.062.00
    * hana_rh2_vhost                    : clusternode1

    * lpa_rh2_lpt                       : 1693873014
    * master-SAPHana_RH2_02             : 150
  * Node: clusternode2 (2):
    * hana_rh2_clone_state              : DEMOTED
    * hana_rh2_op_mode                  : logreplay
    * hana_rh2_remoteHost               : clusternode1

    * hana_rh2_roles                    : 4:S:master1:master:worker:master
    * hana_rh2_site                     : DC2
    * hana_rh2_sra                      : -
    * hana_rh2_srah                     : -
    * hana_rh2_srmode                   : syncmem
    * hana_rh2_sync_state               : SOK
    * hana_rh2_version                  : 2.00.062.00
    * hana_rh2_vhost                    : clusternode2
    * lpa_rh2_lpt                       : 30
    * master-SAPHana_RH2_02             : 100

Migration Summary:

Tickets:

PCSD Status:
  clusternode1: Online
  clusternode2: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

Refer to Check cluster status and Check database to verify that everything is working again.
