Chapter 5. Test cases

After finishing the installation, it is recommended to run some basic tests to check the installation and verify how SAP HANA multitarget system replication is working and how it recovers from a failure. It is always a good practice to run these test cases before starting production. If possible, you can also prepare a test environment to verify the changes before applying them in production.

Each test case describes:

Subject of the test
Test preconditions
Test steps
Monitoring the test
Starting the test
Expected result(s)
Ways to return to an initial state

To automatically register a former primary HANA replication site as a new secondary HANA replication site on the HANA instances that the cluster manages, you can use the option AUTOMATED_REGISTER=true in the SAPHana resource.

Note

What happens to the former SAP HANA primary database after the takeover completes and the constraint is removed depends on the setting of the AUTOMATED_REGISTER parameter of the SAPHana resource: If AUTOMATED_REGISTER=true, then the former SAP HANA primary database is registered as the new secondary, and SAP HANA system replication becomes active again. If AUTOMATED_REGISTER=false, then it is up to the operator to decide what should happen with the former SAP HANA primary database after the takeover.

The names of the HA cluster nodes and the HANA replication sites (in brackets) used in the examples are:

az1n1 (DC1)
az1n2 (DC1)
az2n1 (DC2)
az3n1 (DC3)
az3n2 (DC3)

The following parameters are used for configuring the HANA instances and the cluster:

SID=RH2
INSTANCENUMBER=02
CLUSTERNAME=cluster1

You can use az1n1, az1n2, az2n1, az2n2, az3n1 and az3n2 as aliases in the /etc/hosts of all nodes in your test environment.

The tests are described in more detail, including examples and additional checks of preconditions. At the end, there are examples of how to clean up the environment to be prepared for further testing.

In some cases, if the distance between DC1, DC2 and DC3 is too long, you should use --replicationMode=async instead of --replicationMode=syncmem. Consult your SAP HANA administrator before choosing the option.

5.1. Preparing the tests

Before we run a test, the complete environment needs to be in a correct and healthy state. We have to check the cluster and the database via:

pcs status --full
python systemReplicationStatus.py
df -h

An example for pcs status --full can be found in Checking cluster status with pcs status. If there are warnings or previous failures in the "Migration Summary", you should clean up the cluster before you start your test.

[root@az1n1]# pcs resource clear SAPHana_RH2_02-clone

Cleaning up cluster describes some more ways to do it. It is important that the cluster and all the resources be started.

Besides the cluster, the database should also be up and running and in sync. The easiest way to verify the proper status of the database is to check the system replication status. See also Checking the replication status. This should be checked on one instance of the primary database.

To discover the primary node, you can check Discovering primary database or use:

pcs status | grep -E "Promoted|Master"
hdbnsutil -sr_stateConfiguration

Check if there is enough space on the file systems:

# df -h

Follow the guidelines for system check before you continue. If the environment is clean, it is ready to run the tests. During the test, monitoring is helpful to observe progress.

5.2. Monitoring the environment

In this section, we are focusing on monitoring the environment during the tests. This section only covers the necessary monitors to see the changes. It is recommended to run the monitors from a dedicated terminal. To be able to detect changes during the test, it is recommended to start monitoring before starting the test.

In the Useful commands section, more examples are shown.

5.2.1. Discovering the primary node

You need to discover the primary node to monitor a failover or run certain commands that only provide information about the replication status when executed on the primary node.

To discover the primary node, you can run the following commands as the <sid>adm user:

az1n1:rh2adm> watch -n 5 'hdbnsutil -sr_stateConfiguration | egrep -e "primary masters|^mode"'

If the node is running an instance of the primary database the output looks like:

mode: primary

If the node is running an instance of the secondary database the output looks like:

mode: sync
primary masters: az1n1
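
If you want to evaluate this output in a script rather than by eye, the following is a minimal sketch in Python. It assumes the output consists of `key: value` lines as in the examples above; the function names `parse_sr_state` and `is_primary` are hypothetical helpers, not part of any SAP or Red Hat tool.

```python
def parse_sr_state(text):
    """Collect 'key: value' lines from hdbnsutil output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        # Skip lines without a colon and section titles with an empty value.
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def is_primary(fields):
    """True if the parsed output belongs to a primary database instance."""
    return fields.get("mode") == "primary"


# Sample text copied from the secondary example above.
sample_secondary = "mode: sync\nprimary masters: az1n1\n"
state = parse_sr_state(sample_secondary)
print(is_primary(state), state.get("primary masters"))
```

On a secondary, `is_primary` returns False and the `primary masters` entry names the node to check next.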

5.2.2. Checking the replication status

The replication status shows the relationship between primary and secondary database nodes and the current status of the replication.

To discover the replication status, you can run as the <sid>adm user:

az1n1:rh2adm> hdbnsutil -sr_stateConfiguration

If you want to continuously monitor changes in the system replication status, run the following command on the node that runs the primary database:

az1n1:rh2adm> watch -n 5 'python /usr/sap/${SAPSYSTEMNAME}/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py ; echo Status $?'

This example repeatedly captures the replication status and also determines the current return code.
As long as the return code (status) is 15, the replication status is fine. The other return codes are:

10: NoHSR
11: Error
12: Unknown
13: Initializing
14: Syncing
15: Active

If you register a new secondary, you can run it in a separate window on one of the primary nodes, and you can see the progress of the replication. If you want to monitor a failover, you can run it in parallel on the old primary as well as on the new primary database servers. For more information, refer to Checking SAP HANA system replication status.
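
The return codes listed above can also be interpreted in a small script, for example when you collect the status from several nodes. The following Python sketch only encodes the table from this section; the names `SR_RETURN_CODES`, `describe_sr_status`, and `replication_ok` are hypothetical.

```python
# Return codes of systemReplicationStatus.py as listed in this section.
SR_RETURN_CODES = {
    10: "NoHSR",
    11: "Error",
    12: "Unknown",
    13: "Initializing",
    14: "Syncing",
    15: "Active",
}


def describe_sr_status(return_code):
    """Translate a return code into its name, or flag it as unexpected."""
    return SR_RETURN_CODES.get(return_code, f"unexpected return code {return_code}")


def replication_ok(return_code):
    """Only return code 15 (Active) means the replication status is fine."""
    return return_code == 15


for code in (15, 11):
    print(code, describe_sr_status(code), replication_ok(code))
```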

5.2.3. Checking /var/log/messages entries

Pacemaker writes a lot of information into the /var/log/messages and /var/log/pacemaker/pacemaker.log files. During a failover, a huge number of messages are written into these files. To follow only the important messages related to the SAP HANA resource agent, it is useful to filter for the detailed activities of the pacemaker SAP resources. It is enough to check the message file on a single cluster node.

For example, you can use this alias:

# alias tmsl='tail -1000f /var/log/messages | egrep -s "Setting master-rsc_SAPHana_${SAPSYSTEMNAME}_HDB${TINSTANCE}|sr_register|WAITING4LPA|PROMOTED|DEMOTED|UNDEFINED|master_walk|SWAIT|WaitforStopped|FAILED|LPT"'

Run this alias in a separate window to monitor the progress of the test. Also check the example Monitoring failover and sync state.
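
If you prefer to post-process a saved copy of the log instead of tailing it live, the same keyword filter can be sketched in Python. This is an illustration only: the keywords are copied from the alias above, while `build_filter`, `relevant_lines`, and the sample log lines are hypothetical.

```python
import re


def build_filter(sid, instance):
    """Compile the keyword filter used by the alias for one SID and instance number."""
    keywords = [
        f"Setting master-rsc_SAPHana_{sid}_HDB{instance}",
        "sr_register", "WAITING4LPA", "PROMOTED", "DEMOTED", "UNDEFINED",
        "master_walk", "SWAIT", "WaitforStopped", "FAILED", "LPT",
    ]
    # Escape each keyword so it is matched literally, then join with "|".
    return re.compile("|".join(re.escape(k) for k in keywords))


def relevant_lines(lines, pattern):
    """Keep only the lines that contain at least one keyword."""
    return [line for line in lines if pattern.search(line)]


# Hypothetical log lines, for illustration only.
log = [
    "node az1n1 SAPHana: running hdbnsutil -sr_register",
    "node az1n1 systemd: Started Session 42",
    "node az2n1 SAPHana: clone_state DEMOTED",
]
for line in relevant_lines(log, build_filter("RH2", "02")):
    print(line)
```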

5.2.4. Cluster status

There are several ways to check the cluster status.

Check if the cluster is running:
- pcs cluster status
Check the cluster and all resources:
- pcs status
Check the cluster, all resources and all node attributes:
- pcs status --full
Check the resources only:
- pcs resource

The pcs status --full command gives you all the necessary information. To monitor changes, you can run this command together with watch.

# pcs status --full

If you want to see changes, you can run, in a separate window, the command watch:

# watch pcs status --full

An output example and further options can be found in Checking cluster status.

5.2.5. Discovering leftovers

To ensure that your environment is ready to run the next test, leftovers from previous tests need to be fixed or removed.

stonith is used to fence a node in the cluster:
- Detect: [root@az1n1]# pcs stonith history
- Fix: [root@az1n1]# pcs stonith cleanup
Multiple primary databases:
- Detect: az1n1:rh2adm> hdbnsutil -sr_stateConfiguration | grep -i primary
  All nodes with the same primary need to be identified.
- Fix: on the node with the wrong primary, as the <sid>adm user, re-register the wrong primary with the option --force_full_replica
Location Constraints caused by move:
- Detect: [root@az1n1]# pcs constraint location
  Check the warning section.
- Fix: [root@az1n1]# pcs resource clear <clone-resource-which was moved>
Secondary replication relationship:
- Detect: on one of the nodes running a primary database instance az1n1:rh2adm> python ${DIR_EXECUTABLES}/python_support/systemReplicationStatus.py
- Fix: unregister and re-register the secondary databases.
Check siteReplicationMode (same output on all SAP HANA nodes):
- az1n1:rh2adm> hdbnsutil -sr_state --sapcontrol=1 |grep site.*Mode
Pcs property:
- Detect: [root@az1n1]# pcs property config
- Fix: [root@az1n1]# pcs property set <key=value>
- Clear maintenance-mode:
- [root@az1n1]# pcs property set maintenance-mode=false
log_mode:
- Detect: az1n1:rh2adm> python systemReplicationStatus.py
  Responds in the replication status that log_mode normal is required. log_mode can be detected as described in Using hdbsql to check Inifile contents.
- Fix: change the log_mode to normal and restart the primary database.
CIB entries:
- Detect: SFAIL entries in the cluster information base.
  Refer to Checking cluster consistency, to find and remove CIB entries.
Cleanup/clear:
- Detect: [root@az1n1]# pcs status --full
  Sometimes it shows errors or warnings. You can cleanup/clear resources and if everything is fine, nothing happens. Before running the next test, you can cleanup your environment. Check if all nodes are online and not offline or in standby mode.
- Examples to cleanup:
  [root@az1n1]# pcs resource clear <name-of-the-clone-resource>
  [root@az1n1]# pcs resource cleanup <name-of-the-clone-resource>
PCS resource status:
- Update: [root@az1n1]# pcs resource refresh

This is also useful if you want to check if there is an issue in an existing environment. For more information, refer to Useful commands.

5.3. Test 1: Failover of the primary site with an active third site

Subject of the test	Automatic re-registration of the third site. Sync state changes to SOK after clearing.
Test preconditions	SAP HANA on DC1, DC2, and DC3 is running. Cluster is up and running without errors or warnings.
Test steps	Move the SAPHana resource using the `[root@az1n1]# pcs resource move <sap-clone-resource> <target-node>` command.
Monitoring the test	On the third site, run as `sidadm` the command provided at the end of the table (*). On the secondary node, run as root: `[root@az1n1]# watch pcs status --full`
Starting the test	Execute the cluster commands: `[root@az1n1]# pcs resource move SAPHana_RH2_02-clone` `[root@az1n1]# pcs resource clear SAPHana_RH2_02-clone`
Expected result	`az3n1:rh2adm> hdbnsutil -sr_state --sapcontrol=1 | egrep -e 'site.*Mode|primary masters'` The output should change from primary masters=az1n1 to primary masters=az2n1.
Ways to return to an initial state	Run the test twice.

(*)

az3n1:rh2adm> watch hdbnsutil -sr_state
[root@az1n1]# tail -1000f /var/log/messages | egrep -e 'SOK|SWAIT|SFAIL'

Detailed description

Check the initial state of your cluster as root on az1n1 or az2n1:
```
[root@az1n1]# pcs status --full
```
This command shows, for example:
Cluster name
Cluster summary with the DC=Designated Controller
Node List
Full List of Resources
Node Attributes
PCSD Status
Daemon Status

The Node Attributes show the promoted clone state, and the Node List shows if a node is, for example, stopped.

This output shows you that HANA is promoted on az1n1, which is the primary SAP HANA server, and that the name of the clone resource is SAPHana_RH2_02-clone, which is promotable.

You can run this in a separate window during the test to see the changes.

[root@az1n1]# watch pcs status --full

Another way to identify the name of the SAP HANA clone resource is:

[root@az2n1]# pcs resource
  * rsc_ip_MASTER1	(ocf:heartbeat:IPaddr2):	 Started az1n1
  * Clone Set: rsc_SAPHanaTopology_RH1_10-clone [rsc_SAPHanaTopology_RH1_10]:
    * Started: [ az1n1 az1n2 az2n1 az2n2 ]
  * Clone Set: rsc_SAPHanaFilesystem_RH1_10-clone [rsc_SAPHanaController_RH1_10] (promotable):
    * Promoted: [ az1n1 ]
    * Unpromoted: [ az1n2 az2n1 az2n2 ]

To see the change of the primary server, start monitoring on az3n1 in a separate terminal window before you start the test.

az3n1:rh2adm> watch 'hdbnsutil -sr_state | grep "primary masters"'

The output looks like:

Every 2.0s: hdbnsutil -sr_state | grep "primary masters"                                                                                 az3n1: Mon Sep  4 08:47:21 2023

primary masters: az1n1

During the test the expected output changes to az2n1.

Move the clone resource discovered above to az2n1, to start the test:
```
[root@az1n1]# pcs resource move SAPHana_RH2_02-clone az2n1
```
The output of the monitor on az3n1 changes to:
```
Every 2.0s: hdbnsutil -sr_state | grep "primary masters"

primary masters: az2n1
```
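
If you record the monitor output at intervals instead of watching it, the switch of the primary can be detected by comparing consecutive samples. The following Python sketch assumes each sample contains a `primary masters:` line as shown above; the function name `primary_changes` is hypothetical.

```python
def primary_changes(samples):
    """Return (old, new) pairs whenever 'primary masters' differs between samples."""
    changes = []
    previous = None
    for sample in samples:
        current = None
        for line in sample.splitlines():
            if line.strip().startswith("primary masters:"):
                current = line.split(":", 1)[1].strip()
        if current is None:
            continue  # sample without the expected line, for example an empty capture
        if previous is not None and current != previous:
            changes.append((previous, current))
        previous = current
    return changes


# Three captures taken before and after the move, using the values from this test.
captures = [
    "primary masters: az1n1",
    "primary masters: az1n1",
    "primary masters: az2n1",
]
print(primary_changes(captures))
```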
Pacemaker creates a location constraint for moving the clone resource. This needs to be manually removed. You can see the constraint using:
```
[root@az1n1]# pcs constraint location
```
Execute the following steps to remove this constraint.

Clear the clone resource to remove the location constraint:

[root@az1n1]# pcs resource clear SAPHana_RH2_02-clone
Removing constraint: cli-prefer-SAPHana_RH2_02-clone

Cleanup the resource:

[root@az1n1]# pcs resource cleanup SAPHana_RH2_02-clone
Cleaned up SAPHana_RH2_02:0 on az2n1
Cleaned up SAPHana_RH2_02:1 on az1n1
Waiting for 1 reply from the controller
... got reply (done)

Result of the test

The "primary masters" monitor on az3n1 should show an immediate switch to the new primary node.
If you check the cluster status, the former secondary is promoted, the former primary gets re-registered, and its clone_state changes from PROMOTED to UNDEFINED to WAITING4LPA to DEMOTED.
The secondary changes the sync_state to SFAIL when the SAPHana monitor is started for the first time after the failover. Because of the existing location constraint, the resource needs to be cleared, and after a short time, the sync_state of the secondary changes to SOK again.
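
If you collect the clone state values of the former primary during the test, for example from repeated `pcs status --full` captures, you can check that they appear in the expected order. The following Python sketch is an illustration under assumptions: the state names follow the spelling used in the log filter in section 5.2.3 (for example WAITING4LPA), and `follows_expected_order` is a hypothetical helper.

```python
# Expected order of clone states on the former primary, as described above.
EXPECTED_STATES = ["PROMOTED", "UNDEFINED", "WAITING4LPA", "DEMOTED"]


def follows_expected_order(observed, expected=EXPECTED_STATES):
    """True if the expected states occur in 'observed' in that order.

    Repeated or additional values between the expected states are tolerated.
    """
    remaining = iter(observed)
    # 'state in remaining' advances the iterator until the state is found.
    return all(state in remaining for state in expected)


observed_states = ["PROMOTED", "PROMOTED", "UNDEFINED", "WAITING4LPA", "DEMOTED", "DEMOTED"]
print(follows_expected_order(observed_states))
```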

To restore the initial state, you can simply run the next test. After finishing the tests, clean up the cluster as described in Cleaning up cluster.

5.4. Test 2: Failover of the primary node with passive third site

Subject of the test	No registration of the third site. Failover works even if the third site is down.
Test preconditions	SAP HANA on DC1, DC2 is running and is stopped on DC3. Cluster is up and running without errors or warnings.
Test steps	Move the SAPHana resource using the `pcs resource move` command.
Starting the test	Execute the cluster command: `[root@az1n1]# pcs resource move SAPHana_RH2_02-clone`
Monitoring the test	On the third site run as `sidadm`: `% watch hdbnsutil -sr_stateConfiguration` On the cluster nodes run as root: `[root@az1n1]# watch pcs status`
Expected result	No change on DC3. Replication stays on old relationship.
Ways to return to an initial state	Re-register DC3 on new primary and start SAP HANA.

Detailed description

Check the initial state of your cluster as root on az1n1 or az2n1:
```
[root@az1n1]# pcs status --full
```
The output of this example shows you that HANA is promoted on az1n1, which is the primary SAP HANA server, and that the name of the clone resource is SAPHana_RH2_02-clone, which is promotable. If you ran test 3 before, HANA might be promoted on az2n1.

Stop the database on az3n1:

az3n1:rh2adm> HDB stop
hdbdaemon will wait maximal 300 seconds for NewDB services finishing.
Stopping instance using: /usr/sap/RH2/SYS/exe/hdb/sapcontrol -prot NI_HTTP -nr 02 -function Stop 400

12.07.2023 11:33:14
Stop
OK
Waiting for stopped instance using:
/usr/sap/RH2/SYS/exe/hdb/sapcontrol -prot NI_HTTP -nr 02 -function WaitforStopped 600 2

12.07.2023 11:33:30
WaitforStopped
OK

Check the primary database on az3n1:

az3n1:rh2adm> hdbnsutil -sr_stateConfiguration| grep -i "primary masters"

primary masters: az2n1

Check the current primary in the cluster on a cluster node:

[root@az1n1]# pcs resource | grep Masters
    * Masters: [ az2n1 ]

Check the sr_state to see the SAP HANA system replication relationships:

az2n1:rh2adm> hdbnsutil -sr_state

System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~

online: true

mode: primary
operation mode: primary
site id: 2
site name: DC1

is source system: true
is secondary/consumer system: false
has secondaries/consumers attached: true
is a takeover active: false
is primary suspended: false

Host Mappings:
~~~~~~~~~~~~~~

az1n1 -> [DC3] az3n1
az1n1 -> [DC1] az1n1
az1n1 -> [DC2] az2n1


Site Mappings:
~~~~~~~~~~~~~~
DC1 (primary/primary)
    |---DC3 (syncmem/logreplay)
    |---DC2 (syncmem/logreplay)

Tier of DC1: 1
Tier of DC3: 2
Tier of DC2: 2

Replication mode of DC1: primary
Replication mode of DC3: syncmem
Replication mode of DC2: syncmem

Operation mode of DC1: primary
Operation mode of DC3: logreplay
Operation mode of DC2: logreplay

Mapping: DC1 -> DC3
Mapping: DC1 -> DC2
done.

The SAP HANA system replication relations still have one primary (DC1), which is replicated to DC2 and DC3.
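
To compare the replication topology before and after a test, the `Mapping:` lines at the end of the `hdbnsutil -sr_state` output can be collected into a simple structure. The following Python sketch assumes the `Mapping: <source> -> <target>` format shown above; `parse_site_mappings` is a hypothetical helper.

```python
def parse_site_mappings(text):
    """Map each source site to the list of sites it replicates to."""
    targets = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Mapping:"):
            source, _, target = line[len("Mapping:"):].partition("->")
            targets.setdefault(source.strip(), []).append(target.strip())
    return targets


# Lines copied from the end of the sample output above.
sample_output = "Mapping: DC1 -> DC3\nMapping: DC1 -> DC2\ndone.\n"
print(parse_site_mappings(sample_output))
```

In the sample, DC1 is the only source site and replicates to both DC3 and DC2, which matches the multitarget setup described in this test.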

The replication relationship on az3n1, which is down, can be displayed using:

az3n1:rh2adm> hdbnsutil -sr_stateConfiguration

System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~

mode: syncmem
site id: 3
site name: DC3
active primary site: 1

primary masters: az1n1
done.

Because the database on az3n1 is offline, this information is read from the entries in the global.ini file.

Starting the test: Initiate a failover in the cluster by moving the SAPHana clone resource, for example:
```
[root@az1n1]# pcs resource move SAPHana_RH2_02-clone az2n1
```
Note
If SAPHana is promoted on az2n1, you have to move the clone resource to az1n1. The example expects that SAPHana is promoted on az1n1.

The command produces no output. Similar to the previous test, a location constraint is created, which can be displayed with:
```
[root@az1n1]# pcs constraint location
Location Constraints:
  Resource: SAPHana_RH2_02-clone
    Enabled on:
      Node: az1n1 (score:INFINITY) (role:Started)
```
Even if the cluster looks fine again, this constraint prevents another failover until it is removed. One way to remove it is to clear the resource.
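
To spot such leftover constraints in a script, the output of `pcs constraint location` can be parsed. The following Python sketch assumes the output layout shown above, which can differ between pcs versions; `parse_location_constraints` is a hypothetical helper.

```python
import re

# Matches lines such as "Node: az1n1 (score:INFINITY) (role:Started)".
NODE_RE = re.compile(r"Node:\s+(\S+)\s+\(score:([^)]+)\)")


def parse_location_constraints(text):
    """Return a list of (resource, node, score) tuples found in the output."""
    constraints = []
    resource = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Resource:"):
            resource = stripped.split(":", 1)[1].strip()
            continue
        match = NODE_RE.search(stripped)
        if match and resource:
            constraints.append((resource, match.group(1), match.group(2)))
    return constraints


# Sample text copied from the output above.
sample_constraints = """Location Constraints:
  Resource: SAPHana_RH2_02-clone
    Enabled on:
      Node: az1n1 (score:INFINITY) (role:Started)
"""
print(parse_location_constraints(sample_constraints))
```

An empty list means that no location constraint in this layout was found.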

Clear the resource:

[root@az1n1]# pcs constraint location
Location Constraints:
  Resource: SAPHana_RH2_02-clone
    Enabled on:
      Node: az1n1 (score:INFINITY) (role:Started)
[root@az1n1]# pcs resource clear SAPHana_RH2_02-clone

Cleanup the resource:

[root@az1n1]# pcs resource cleanup SAPHana_RH2_02-clone
Cleaned up SAPHana_RH2_02:0 on az2n1
Cleaned up SAPHana_RH2_02:1 on az1n1
Waiting for 1 reply from the controller
... got reply (done)

Check the current status.

There are three ways to display the replication status, which needs to be in sync. Start with the primary as seen from az3n1:

az3n1:rh2adm>  hdbnsutil -sr_stateConfiguration| grep -i primary
active primary site: 1
primary masters: az1n1

The output shows site 1 or az1n1, which was the primary before starting the test to move the primary to az2n1.

Next check the system replication status on the new primary.

First detect the new primary:

[root@az1n1]# pcs resource | grep  Master
    * Masters: [ az2n1 ]

Here we have an inconsistency, which requires us to re-register az3n1. You might think that running the test again would simply switch the primary back to the original az1n1. Instead of relying on that, there is a third way to identify whether system replication is working. On the primary node, run:

az2n1:rh2adm> cdpy
az2n1:rh2adm> python ${DIR_EXECUTABLES}/python_support/systemReplicationStatus.py
|Database |Host   |Port  |Service Name |Volume ID |Site ID |Site Name |Secondary |Secondary |Secondary |Secondary |Secondary     |Replication |Replication |Replication    |Secondary    |
|         |       |      |             |          |        |          |Host      |Port      |Site ID   |Site Name |Active Status |Mode        |Status      |Status Details |Fully Synced |
|-------- |------ |----- |------------ |--------- |------- |--------- |--------- |--------- |--------- |--------- |------------- |----------- |----------- |-------------- |------------ |
|SYSTEMDB |az2n1 |30201 |nameserver   |        1 |      2 |DC2       |az1n1    |    30201 |        1 |DC1       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |az2n1 |30207 |xsengine     |        2 |      2 |DC2       |az1n1    |    30207 |        1 |DC1       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |az2n1 |30203 |indexserver  |        3 |      2 |DC2       |az1n1    |    30203 |        1 |DC1       |YES           |SYNCMEM     |ACTIVE      |               |        True |

status system replication site "1": ACTIVE
overall system replication status: ACTIVE

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: PRIMARY
site id: 2
site name: DC2
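
The table above can also be evaluated in a script to list which secondary hosts are attached and in which state. The following Python sketch assumes the 16-column, pipe-separated layout shown in this output; `secondary_status` is a hypothetical helper.

```python
def secondary_status(table_text):
    """Map each secondary host to the set of replication states reported for it."""
    status = {}
    for line in table_text.splitlines():
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Data rows have 16 columns and a numeric port in the third column;
        # this skips the two header rows and the separator row.
        if len(cells) < 16 or not cells[2].isdigit():
            continue
        host, replication_status = cells[7], cells[13]
        status.setdefault(host, set()).add(replication_status)
    return status


# One header row and one data row copied from the output above.
sample_table = (
    "|Database |Host   |Port  |Service Name |Volume ID |Site ID |Site Name |Secondary |Secondary |Secondary |Secondary |Secondary     |Replication |Replication |Replication    |Secondary    |\n"
    "|SYSTEMDB |az2n1 |30201 |nameserver   |        1 |      2 |DC2       |az1n1    |    30201 |        1 |DC1       |YES           |SYNCMEM     |ACTIVE      |               |        True |\n"
)
result = secondary_status(sample_table)
print(result, "az3n1" in result)
```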

If you don’t see az3n1 in this output, you have to re-register az3n1. Before registering, run the following on the primary node to watch the progress of the registration:

az2n1:rh2adm> watch python ${DIR_EXECUTABLES}/python_support/systemReplicationStatus.py

Now you can re-register az3n1 using this command:

az3n1:rh2adm> hdbnsutil -sr_register --remoteHost=az2n1 --remoteInstance=${TINSTANCE} --replicationMode=async --name=DC3 --remoteName=DC2 --operationMode=logreplay --online
adding site ...
collecting information ...
updating local ini files ...
done.
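
Because the registration command has many options that are easy to mistype, it can help to assemble it from named parameters. The following Python sketch only builds the argument list for the command shown above and does not execute it; `sr_register_command` is a hypothetical helper, and the check of the replication mode is an added safeguard, not something the source prescribes.

```python
VALID_REPLICATION_MODES = {"sync", "syncmem", "async"}


def sr_register_command(remote_host, remote_instance, replication_mode,
                        site_name, remote_site_name,
                        operation_mode="logreplay", online=True):
    """Build the hdbnsutil -sr_register argument list used in this section."""
    if replication_mode not in VALID_REPLICATION_MODES:
        raise ValueError(f"unsupported replication mode: {replication_mode}")
    command = [
        "hdbnsutil", "-sr_register",
        f"--remoteHost={remote_host}",
        f"--remoteInstance={remote_instance}",
        f"--replicationMode={replication_mode}",
        f"--name={site_name}",
        f"--remoteName={remote_site_name}",
        f"--operationMode={operation_mode}",
    ]
    if online:
        command.append("--online")
    return command


# Values from the example above: register DC3 on the new primary az2n1 (DC2).
print(" ".join(sr_register_command("az2n1", "02", "async", "DC3", "DC2")))
```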

Even if the database on az3n1 is not started yet, you are able to see the third site in the system replication status output. Start the database on az3n1 to finish the registration:

az3n1:rh2adm> HDB start


StartService
Impromptu CCC initialization by 'rscpCInit'.
  See SAP note 1266393.
OK
OK
Starting instance using: /usr/sap/RH2/SYS/exe/hdb/sapcontrol -prot NI_HTTP -nr 02 -function StartWait 2700 2


04.09.2023 11:36:47
Start
OK

The monitor started above immediately shows the synchronization of az3n1.

To switch back, run the test again. An optional test is to switch the primary to the node that is configured in the global.ini on az3n1 and then start the database. The database might come up, but it never shows up in the output of the system replication status unless it is re-registered.
Once it is re-registered, the missing entry is created immediately, and system replication starts as soon as the SAP HANA database is started.

Execute the following to check this:

sidadm@az1n1% hdbnsutil -sr_state
sidadm@az1n1% python systemReplicationStatus.py ; echo $?

You can find more information in Checking SAP HANA system replication status.

5.5. Test 3: Failover of the primary database to the third site

Subject of the test	Failover of the primary to the third site. Third site becomes primary. Secondary re-registers to third site.
Test preconditions	SAP HANA on DC1, DC2, DC3 is running. Cluster is up and running without errors or warnings. System replication is in place and in sync (check `% python systemReplicationStatus.py`).
Test steps	Put the cluster into `maintenance-mode` to be able to recover. Take over the HANA database from the third node using: `% hdbnsutil -sr_takeover`
Starting the test	Execute the SAP HANA command on az3n1: `az3n1:rh2adm> hdbnsutil -sr_takeover`
Monitoring the test	On the third site, run as `sidadm`: `% watch hdbnsutil -sr_state`
Expected result	Third site runs the primary database. Secondary site changes the primary master to az3n1. Former primary site needs to be re-registered to the new primary site.
Ways to return to an initial state	Run Test 4: Failback of the primary node to the first site.

Detailed description

Check if the databases are running using Checking database and check the replication status:

az2n1:rh2adm> hdbnsutil -sr_state | egrep -e "^mode:|primary masters"

The output is, for example:

mode: syncmem
primary masters: az1n1

In this case, the primary database is az1n1. If you run this command on az1n1, you get:

mode: primary

On this primary site, you can also display the system replication status. It should look like this:

az1n1:rh2adm> cdpy
az1n1:rh2adm> python systemReplicationStatus.py
|Database |Host   |Port  |Service Name |Volume ID |Site ID |Site Name |Secondary |Secondary |Secondary |Secondary |Secondary     |Replication |Replication |Replication    |Secondary    |
|         |       |      |             |          |        |          |Host      |Port      |Site ID   |Site Name |Active Status |Mode        |Status      |Status Details |Fully Synced |
|-------- |------ |----- |------------ |--------- |------- |--------- |--------- |--------- |--------- |--------- |------------- |----------- |----------- |-------------- |------------ |
|SYSTEMDB |az1n1 |30201 |nameserver   |        1 |      1 |DC1       |az3n1    |    30201 |        3 |DC3       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |az1n1 |30207 |xsengine     |        2 |      1 |DC1       |az3n1    |    30207 |        3 |DC3       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |az1n1 |30203 |indexserver  |        3 |      1 |DC1       |az3n1    |    30203 |        3 |DC3       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|SYSTEMDB |az1n1 |30201 |nameserver   |        1 |      1 |DC1       |az2n1    |    30201 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |az1n1 |30207 |xsengine     |        2 |      1 |DC1       |az2n1    |    30207 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |az1n1 |30203 |indexserver  |        3 |      1 |DC1       |az2n1    |    30203 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |

status system replication site "3": ACTIVE
status system replication site "2": ACTIVE
overall system replication status: ACTIVE

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: PRIMARY
site id: 1
site name: DC1

Now we have a proper environment and can start monitoring the system replication status on all 3 nodes in separate windows. Start the 3 monitors before starting the test. The output changes while the test executes, so keep them running until the test is completed.

On the old primary node, az1n1, run the following in a separate window during the test:

az1n1:rh2adm> watch -n 5 'python /usr/sap/${SAPSYSTEMNAME}/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py ; echo Status $?'

The output on az1n1 is:

Every 5.0s: python /usr/sap/${SAPSYSTEMNAME}/HDB${TINSTANCE}/exe/python_support/systemReplicati...  az1n1: Tue XXX XX HH:MM:SS 2023

|Database |Host   |Port  |Service Name |Volume ID |Site ID |Site Name |Secondary |Secondary |Secondary |Secondary |Secondary     |Replication |Replication |Replication    |Secondary    |
|         |       |      |             |          |        |          |Host      |Port      |Site ID   |Site Name |Active Status |Mode        |Status      |Status Details |Fully Synced |
|-------- |------ |----- |------------ |--------- |------- |--------- |--------- |--------- |--------- |--------- |------------- |----------- |----------- |-------------- |------------ |
|SYSTEMDB |az1n1 |30201 |nameserver   |        1 |      1 |DC1       |az3n1    |    30201 |        3 |DC3       |YES           |ASYNC       |ACTIVE      |               |        True |
|RH2      |az1n1 |30207 |xsengine     |        2 |      1 |DC1       |az3n1    |    30207 |        3 |DC3       |YES           |ASYNC       |ACTIVE      |               |        True |
|RH2      |az1n1 |30203 |indexserver  |        3 |      1 |DC1       |az3n1    |    30203 |        3 |DC3       |YES           |ASYNC       |ACTIVE      |               |        True |
|SYSTEMDB |az1n1 |30201 |nameserver   |        1 |      1 |DC1       |az2n1    |    30201 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |az1n1 |30207 |xsengine     |        2 |      1 |DC1       |az2n1    |    30207 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |az1n1 |30203 |indexserver  |        3 |      1 |DC1       |az2n1    |    30203 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |

status system replication site "3": ACTIVE
status system replication site "2": ACTIVE
overall system replication status: ACTIVE

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: PRIMARY
site id: 1
site name: DC1
Status 15

On az3n1, run the same command:

az3n1:rh2adm> watch -n 5 'python /usr/sap/${SAPSYSTEMNAME}/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py ; echo Status $?'

The response is:

this system is either not running or is not primary system replication site

This changes after the test initiates the failover. The output looks similar to the example of the primary node before the test was started.

On the second node, start:

az2n1:rh2adm> watch -n 10 'hdbnsutil -sr_state | grep masters'

This shows the current master az1n1 and switches immediately after the failover is initiated.

To ensure that everything is configured correctly, check the global.ini.

Check global.ini on DC1, DC2, and DC3:

On all three nodes, the global.ini should contain:

[persistence]
log_mode=normal
[system_replication]
register_secondaries_on_takeover=true
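
To verify these two settings on several nodes without opening each file by hand, the contents of global.ini can be checked with a short script. The following Python sketch uses the standard `configparser` module and looks the keys up by name across all sections; it assumes the file is plain INI text that `configparser` can read, and `missing_settings` is a hypothetical helper.

```python
import configparser

# Settings required on all three sites, as listed above.
REQUIRED_SETTINGS = {
    "log_mode": "normal",
    "register_secondaries_on_takeover": "true",
}


def missing_settings(ini_text, required=REQUIRED_SETTINGS):
    """Return the required settings that are absent or have a different value."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(ini_text)
    found = {}
    for section in parser.sections():
        for key, value in parser.items(section):
            found[key] = value.strip()
    return {key: value for key, value in required.items() if found.get(key) != value}


# Example with only one of the two settings present: log_mode is reported.
example_ini = "[system_replication]\nregister_secondaries_on_takeover=true\n"
print(missing_settings(example_ini))
```

An empty result means both settings are present with the expected values.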

You can edit the global.ini with:

az1n1:rh2adm> vi /usr/sap/${SAPSYSTEMNAME}/SYS/global/hdb/custom/config/global.ini

[Optional] Put the cluster into maintenance-mode:
```
[root@az1n1]# pcs property set maintenance-mode=true
```
During the tests, you will find that the failover works both with and without maintenance-mode set, so you can run the first test without it. While recovering, maintenance-mode should be set. This is an option if the primary is not accessible.

Start the test: Failover to DC3. On az3n1, run:

az3n1:rh2adm> hdbnsutil -sr_takeover
done.

The test has started. Now check the output of the previously started monitors. On az1n1, the system replication status loses its relationship to az3n1 and az2n1 (DC2):

Every 5.0s: python /usr/sap/RH2/HDB02/exe/python_support/systemReplicationStatus.py ; echo Status $?                               az1n1: Mon Sep  4 11:52:16 2023

|Database |Host   |Port  |Service Name |Volume ID |Site ID |Site Name |Secondary |Secondary |Secondary |Secondary |Secondary     |Replication |Replication |Replication                  |Secondary    |
|         |       |      |             |          |        |          |Host      |Port      |Site ID   |Site Name |Active Status |Mode        |Status      |Status Details               |Fully Synced |
|-------- |------ |----- |------------ |--------- |------- |--------- |--------- |--------- |--------- |--------- |------------- |----------- |----------- |---------------------------- |------------ |
|SYSTEMDB |az1n1 |30201 |nameserver   |        1 |      1 |DC1       |az2n1    |    30201 |        2 |DC2       |YES           |SYNCMEM     |ERROR       |Communication channel closed |       False |
|RH2      |az1n1 |30207 |xsengine     |        2 |      1 |DC1       |az2n1    |    30207 |        2 |DC2       |YES           |SYNCMEM     |ERROR       |Communication channel closed |       False |
|RH2      |az1n1 |30203 |indexserver  |        3 |      1 |DC1       |az2n1    |    30203 |        2 |DC2       |YES           |SYNCMEM     |ERROR       |Communication channel closed |       False |

status system replication site "2": ERROR
overall system replication status: ERROR

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: PRIMARY
site id: 1
site name: DC1
Status 11

The cluster still does not notice this behavior. The return code of the system replication status is 11 (Error), which tells you that something is wrong. If you have access, it is a good idea to enter maintenance-mode now.

az3n1 becomes the new primary, and az2n1 (DC2) is automatically registered as a secondary on az3n1.

Example output of the system replication state of az3n1:

Every 5.0s: python /usr/sap/RH2/HDB02/exe/python_support/systemReplicationStatus.py ; echo Status $?                               az3n1: Mon Sep  4 13:55:29 2023

|Database |Host   |Port  |Service Name |Volume ID |Site ID |Site Name |Secondary |Secondary |Secondary |Secondary |Secondary     |Replication |Replication |Replication    |Secondary    |
|         |       |      |             |          |        |          |Host      |Port      |Site ID   |Site Name |Active Status |Mode        |Status      |Status Details |Fully Synced |
|-------- |------ |----- |------------ |--------- |------- |--------- |--------- |--------- |--------- |--------- |------------- |----------- |----------- |-------------- |------------ |
|SYSTEMDB |az3n1 |30201 |nameserver   |        1 |      3 |DC3       |az2n1    |    30201 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |az3n1 |30207 |xsengine     |        2 |      3 |DC3       |az2n1    |    30207 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |az3n1 |30203 |indexserver  |        3 |      3 |DC3       |az2n1    |    30203 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |

status system replication site "2": ACTIVE
overall system replication status: ACTIVE

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: PRIMARY
site id: 3
site name: DC3
Status 15

The return code 15 indicates that the replication from az3n1 is active, but az1n1 is missing from the list. The replication relationship to the former primary az1n1 is lost, so az1n1 must be re-registered manually.

Set maintenance-mode.

If you have not already done so, set maintenance-mode by running the following command on one cluster node:

[root@az1n1]# pcs property set maintenance-mode=true

Run this command to check if the maintenance-mode is active:

[root@az1n1]# pcs resource
  * Clone Set: SAPHanaTopology_RH2_02-clone [SAPHanaTopology_RH2_02] (unmanaged):
    * SAPHanaTopology_RH2_02    (ocf::heartbeat:SAPHanaTopology):        Started az2n1 (unmanaged)
    * SAPHanaTopology_RH2_02    (ocf::heartbeat:SAPHanaTopology):        Started az1n1 (unmanaged)
  * Clone Set: SAPHana_RH2_02-clone [SAPHana_RH2_02] (promotable, unmanaged):
    * SAPHana_RH2_02    (ocf::heartbeat:SAPHana):        Slave az2n1 (unmanaged)
    * SAPHana_RH2_02    (ocf::heartbeat:SAPHana):        Master az1n1 (unmanaged)
  * vip_RH2_02_MASTER   (ocf::heartbeat:IPaddr2):        Started az1n1 (unmanaged)

The resources are displayed as unmanaged, which indicates that the cluster is in maintenance-mode=true. The virtual IP address is still started on az1n1. If you want to use this IP on another node, disable vip_RH2_02_MASTER before you set maintenance-mode=true:

[root@az1n1]# pcs resource disable vip_RH2_02_MASTER
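As a quick cross-check of the unmanaged state, the following sketch filters `pcs resource` output for resource lines that are not flagged as unmanaged. The helper name managed_resources is hypothetical, and the filter assumes that resource lines contain the `(ocf:` agent prefix, as in the example output of this chapter.

```shell
# Read `pcs resource` output on stdin and print every resource line that is
# NOT flagged as unmanaged. Empty output means all resources are unmanaged.
# managed_resources is a hypothetical helper name.
managed_resources() {
    grep '(ocf:' | grep -v 'unmanaged' || true
}

# Example call on a cluster node (commented out so the sketch runs anywhere):
# pcs resource | managed_resources
```

If the helper prints nothing, every resource is unmanaged and the cluster does not react to the manual steps that follow.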

Re-register az1n1.

When you check the sr_state on az1n1, you see a relationship only to DC2:

az1n1:rh2adm> hdbnsutil -sr_state

System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~

online: true

mode: primary
operation mode: primary
site id: 1
site name: DC1

is source system: true
is secondary/consumer system: false
has secondaries/consumers attached: true
is a takeover active: false
is primary suspended: false

Host Mappings:
~~~~~~~~~~~~~~

az1n1 -> [DC2] az2n1
az1n1 -> [DC1] az1n1


Site Mappings:
~~~~~~~~~~~~~~
DC1 (primary/primary)
    |---DC2 (syncmem/logreplay)

Tier of DC1: 1
Tier of DC2: 2

Replication mode of DC1: primary
Replication mode of DC2: syncmem

Operation mode of DC1: primary
Operation mode of DC2: logreplay

Mapping: DC1 -> DC2
done.

But when you check DC2, its primary database server is DC3, so the information from DC1 is not correct.

az2n1:rh2adm> hdbnsutil -sr_state

If you check the system replication status on DC1, the return code is 12, which means unknown. So DC1 needs to be re-registered.

You can use this command to register the former primary az1n1 as a new secondary of az3n1.

az1n1:rh2adm> hdbnsutil -sr_register --remoteHost=az3n1 --remoteInstance=${TINSTANCE} --replicationMode=syncmem --name=DC1 --remoteName=DC3 --operationMode=logreplay --online

After the registration is done, all three sites are shown as replicated on az3n1, and the status (return code) changes to 15.
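The registration command takes the same set of options each time; only the host, the site names and the replication mode change. A minimal sketch of a dry-run helper that prints the command instead of executing it could look like this. The function name build_sr_register is an assumption for this example, and the helper only uses options that appear in this chapter.

```shell
# Print (do not execute) the hdbnsutil registration command for a site.
# Arguments: primary host, instance number, replication mode,
#            local site name, primary site name.
# build_sr_register is a hypothetical helper name.
build_sr_register() {
    local remote_host=$1 instance=$2 repl_mode=$3 site=$4 remote_site=$5
    # Reject anything that is not one of the three replication modes.
    case "$repl_mode" in
        sync|syncmem|async) ;;
        *) echo "unsupported replication mode: $repl_mode" >&2; return 1 ;;
    esac
    printf 'hdbnsutil -sr_register --remoteHost=%s --remoteInstance=%s --replicationMode=%s --name=%s --remoteName=%s --operationMode=logreplay --online\n' \
        "$remote_host" "$instance" "$repl_mode" "$site" "$remote_site"
}

# Example: print the command for registering DC1 to the primary on az3n1.
# build_sr_register az3n1 02 syncmem DC1 DC3
```

Printing the command first makes it easy to review the site names and the replication mode before running it as the rh2adm user.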

If this fails, you have to manually remove the replication relationships on DC1 and DC3. Follow the instructions described in Registering secondary node.

For example, list the existing relationships with:

az1n1:rh2adm> hdbnsutil -sr_state

To remove the existing relationships you can use:

az1n1:rh2adm> hdbnsutil -sr_unregister --name=DC2

This is usually not necessary. We assume that test 4 is performed after test 3, so the recovery step is to run test 4.

5.6. Test 4: Failback of the primary site to the first site

Subject of the test	Switch the primary back to a cluster node. Fail back and enable the cluster again. Re-register the third site as secondary.
Test preconditions	SAP HANA primary node is running on the third site. Cluster is partly running. Cluster is put into `maintenance-mode`. Former cluster primary is detectable.
Test steps	Check the expected primary of the cluster. Fail over from the DC3 node to the DC1 node. Check if the former secondary has switched to the new primary. Re-register az3n1 as a new secondary. Set cluster `maintenance-mode=false` and the cluster continues to work.
Monitoring the test	On the new primary start: `az3n1:rh2adm> watch python ${DIR_EXECUTABLES}/python_support/systemReplicationStatus.py` and `[root@az1n1]# watch pcs status --full`. On the secondary start: `clusternode:rh2adm> watch hdbnsutil -sr_state`
Starting the test	Check the expected primary of the cluster: `[root@az1n1]# pcs resource`. VIP and promoted SAP HANA resources should run on the same node, which is the potential new primary. On this potential primary, run as `sidadm`: `az1n1:rh2adm> hdbnsutil -sr_takeover`. Re-register the former primary as a new secondary: `az3n1:rh2adm> hdbnsutil -sr_register --remoteHost=az1n1 --remoteInstance=${TINSTANCE} --replicationMode=syncmem --name=DC3 --remoteName=DC1 --operationMode=logreplay --force_full_replica --online`. The cluster continues to work after setting `maintenance-mode=false`.
Expected result	New primary is starting SAP HANA. The replication status shows that all 3 sites are replicated. Second cluster site gets automatically re-registered to the new primary. DR site becomes an additional replica of the database.
Ways to return to an initial state	Run test 3.

Detailed description

Check if the cluster is put into maintenance-mode:

[root@az1n1]# pcs property config maintenance-mode
Cluster Properties:
 maintenance-mode: true

If maintenance-mode is not true, you can set it with:

[root@az1n1]# pcs property set maintenance-mode=true

Check the system replication status and discover the primary database on all nodes.

First of all, discover the primary database using:

az1n1:rh2adm> hdbnsutil -sr_state | egrep -e "^mode:|primary masters"

The output should be as follows:

On az1n1:

az1n1:rh2adm> hdbnsutil -sr_state | egrep -e "^mode:|primary masters"
mode: syncmem
primary masters: az3n1

On az2n1:

az2n1:rh2adm> hdbnsutil -sr_state | egrep -e "^mode:|primary masters"
mode: syncmem
primary masters: az3n1

On az3n1:

az3n1:rh2adm> hdbnsutil -sr_state | egrep -e "^mode:|primary masters"
mode: primary

On all three nodes, the primary database is az3n1.
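The discovery step can also be sketched as a filter over the `hdbnsutil -sr_state` output. The helper name primary_from_sr_state is hypothetical. It assumes the `mode:` and `primary masters:` lines look like the example output above, and it takes the local host name as an argument because the primary itself prints no `primary masters` line.

```shell
# Read `hdbnsutil -sr_state` output on stdin and print the primary host.
# $1 is the local host name, printed when the local mode is "primary".
# primary_from_sr_state is a hypothetical helper name.
primary_from_sr_state() {
    awk -v self="$1" '
        /^mode:/           { mode = $2 }
        /primary masters:/ { master = $NF }
        END {
            if (mode == "primary") print self
            else if (master != "") print master
            else exit 1
        }'
}

# Example call on a HANA node (commented out so the sketch runs anywhere):
# hdbnsutil -sr_state | primary_from_sr_state "$(hostname -s)"
```

Running the helper on each node should print the same host name, az3n1 in this state of the test.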

On this primary database, you have to ensure that the system replication status is active for all three nodes and the return code is 15:

az3n1:rh2adm> python /usr/sap/${SAPSYSTEMNAME}/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py
|Database |Host   |Port  |Service Name |Volume ID |Site ID |Site Name |Secondary |Secondary |Secondary |Secondary |Secondary     |Replication |Replication |Replication    |Secondary    |
|         |       |      |             |          |        |          |Host      |Port      |Site ID   |Site Name |Active Status |Mode        |Status      |Status Details |Fully Synced |
|-------- |------ |----- |------------ |--------- |------- |--------- |--------- |--------- |--------- |--------- |------------- |----------- |----------- |-------------- |------------ |
|SYSTEMDB |az3n1 |30201 |nameserver   |        1 |      3 |DC3       |az2n1    |    30201 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |az3n1 |30207 |xsengine     |        2 |      3 |DC3       |az2n1    |    30207 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |az3n1 |30203 |indexserver  |        3 |      3 |DC3       |az2n1    |    30203 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|SYSTEMDB |az3n1 |30201 |nameserver   |        1 |      3 |DC3       |az1n1    |    30201 |        1 |DC1       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |az3n1 |30207 |xsengine     |        2 |      3 |DC3       |az1n1    |    30207 |        1 |DC1       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |az3n1 |30203 |indexserver  |        3 |      3 |DC3       |az1n1    |    30203 |        1 |DC1       |YES           |SYNCMEM     |ACTIVE      |               |        True |

status system replication site "2": ACTIVE
status system replication site "1": ACTIVE
overall system replication status: ACTIVE

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: PRIMARY
site id: 3
site name: DC3
[rh2adm@az3n1: python_support]# echo $?
15
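Besides the return code, the per-site status lines at the end of the output can be checked. The following sketch prints the id of every site whose status is not ACTIVE; the helper name inactive_sites is an assumption, and the pattern relies on the `status system replication site "N": STATE` lines shown above.

```shell
# Read systemReplicationStatus.py output on stdin and print the id of every
# replication site whose status line is not ACTIVE.
# inactive_sites is a hypothetical helper name.
inactive_sites() {
    sed -n 's/^status system replication site "\([0-9]*\)": \(.*\)$/\1 \2/p' |
        awk '$2 != "ACTIVE" { print $1 }'
}

# Example call on the primary (commented out so the sketch runs anywhere):
# python /usr/sap/${SAPSYSTEMNAME}/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py | inactive_sites
```

An empty result together with return code 15 means that all listed sites are replicating.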

Check if all three sr_states are consistent.

Run hdbnsutil -sr_state --sapcontrol=1 | grep 'site.*Mode' on all three nodes:

az1n1:rh2adm> hdbnsutil -sr_state --sapcontrol=1 | grep 'site.*Mode'


az2n1:rh2adm> hdbnsutil -sr_state --sapcontrol=1 | grep 'site.*Mode'


az3n1:rh2adm> hdbnsutil -sr_state --sapcontrol=1 | grep 'site.*Mode'

The output should be the same on all nodes:

siteReplicationMode/DC1=primary
siteReplicationMode/DC3=async
siteReplicationMode/DC2=syncmem
siteOperationMode/DC1=primary
siteOperationMode/DC3=logreplay
siteOperationMode/DC2=logreplay
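Comparing the three outputs by eye is error-prone, so the check can be sketched as a comparison of saved outputs. The helper name sr_views_consistent and the file names in the example are assumptions. Each file is expected to hold the filtered sr_state output of one node, and the lines are sorted first because their order may differ between nodes.

```shell
# Compare saved `hdbnsutil -sr_state --sapcontrol=1 | grep 'site.*Mode'`
# outputs, one file per node. Returns 0 when all files agree.
# sr_views_consistent is a hypothetical helper name.
sr_views_consistent() {
    local first=$1 f
    shift
    for f in "$@"; do
        # Sort both views so that line order does not matter.
        if [ "$(sort "$first")" != "$(sort "$f")" ]; then
            echo "mismatch: $first vs $f"
            return 1
        fi
    done
    echo "consistent"
}

# Example with hypothetical file names, one per node:
# sr_views_consistent az1n1.modes az2n1.modes az3n1.modes
```

A mismatch usually means that one site still holds an outdated view of the replication topology.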

Start monitoring in separate windows.

On az1n1, start:

az1n1:rh2adm> watch "python /usr/sap/${SAPSYSTEMNAME}/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py; echo \$?"

On az3n1, start:

az3n1:rh2adm> watch "python /usr/sap/${SAPSYSTEMNAME}/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py; echo \$?"

On az2n1, start:

az2n1:rh2adm> watch "hdbnsutil -sr_state --sapcontrol=1 |grep  siteReplicationMode"

Start the test.
To fail over to az1n1, run the takeover on az1n1:
```
az1n1:rh2adm> hdbnsutil -sr_takeover
done.
```

Check the output of the monitors.

The monitor on az1n1 changes to:

Every 2.0s: python systemReplicationStatus.py; echo $?                                                                                                                                                            az1n1: Mon Sep  4 23:34:30 2023

|Database |Host   |Port  |Service Name |Volume ID |Site ID |Site Name |Secondary |Secondary |Secondary |Secondary |Secondary     |Replication |Replication |Replication    |Secondary    |
|         |       |      |             |          |        |          |Host      |Port      |Site ID   |Site Name |Active Status |Mode        |Status      |Status Details |Fully Synced |
|-------- |------ |----- |------------ |--------- |------- |--------- |--------- |--------- |--------- |--------- |------------- |----------- |----------- |-------------- |------------ |
|SYSTEMDB |az1n1 |30201 |nameserver   |        1 |      1 |DC1       |az2n1    |    30201 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |az1n1 |30207 |xsengine     |        2 |      1 |DC1       |az2n1    |    30207 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |az1n1 |30203 |indexserver  |        3 |      1 |DC1       |az2n1    |    30203 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |

status system replication site "2": ACTIVE
overall system replication status: ACTIVE

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: PRIMARY
site id: 1
site name: DC1
15

Note the return code 15, which confirms that the replication is active.

The monitor on az2n1 changes to:

Every 2.0s: hdbnsutil -sr_state --sapcontrol=1 |grep  site.*Mode                                                az2n1: Mon Sep  4 23:35:18 2023

siteReplicationMode/DC1=primary
siteReplicationMode/DC2=syncmem
siteOperationMode/DC1=primary
siteOperationMode/DC2=logreplay

DC3 is no longer listed and needs to be re-registered.

On az3n1, systemReplicationStatus.py reports an error, and the return code changes to 11.
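To see at a glance which site dropped out, the sapcontrol output can be compared against the list of expected site names. This is a minimal sketch; the helper name missing_sites is hypothetical, and it assumes the `siteReplicationMode/<site>=<mode>` line format shown in the monitor output.

```shell
# Read `hdbnsutil -sr_state --sapcontrol=1` output on stdin and print each
# expected site name (given as arguments) that has no siteReplicationMode entry.
# missing_sites is a hypothetical helper name.
missing_sites() {
    local input site
    input=$(cat)
    for site in "$@"; do
        printf '%s\n' "$input" | grep -q "^siteReplicationMode/$site=" || echo "$site"
    done
}

# Example call on a HANA node (commented out so the sketch runs anywhere):
# hdbnsutil -sr_state --sapcontrol=1 | missing_sites DC1 DC2 DC3
```

Directly after the takeover, the helper would report DC3 when fed the output of az1n1 or az2n1.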

Check if cluster nodes get re-registered:

az1n1:rh2adm>  hdbnsutil -sr_state

System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~

online: true

mode: primary
operation mode: primary
site id: 1
site name: DC1

is source system: true
is secondary/consumer system: false
has secondaries/consumers attached: true
is a takeover active: false
is primary suspended: false

Host Mappings:
~~~~~~~~~~~~~~

az1n1 -> [DC2] az2n1
az1n1 -> [DC1] az1n1


Site Mappings:
~~~~~~~~~~~~~~
DC1 (primary/primary)
    |---DC2 (syncmem/logreplay)

Tier of DC1: 1
Tier of DC2: 2

Replication mode of DC1: primary
Replication mode of DC2: syncmem

Operation mode of DC1: primary
Operation mode of DC2: logreplay

Mapping: DC1 -> DC2
done.

The Site Mapping shows that az2n1 (DC2) was re-registered.

Check or enable the vip resource:

[root@az1n1]# pcs resource
  * Clone Set: SAPHanaTopology_RH2_02-clone [SAPHanaTopology_RH2_02] (unmanaged):
    * SAPHanaTopology_RH2_02    (ocf::heartbeat:SAPHanaTopology):        Started az2n1 (unmanaged)
    * SAPHanaTopology_RH2_02    (ocf::heartbeat:SAPHanaTopology):        Started az1n1 (unmanaged)
  * Clone Set: SAPHana_RH2_02-clone [SAPHana_RH2_02] (promotable, unmanaged):
    * SAPHana_RH2_02    (ocf::heartbeat:SAPHana):        Master az2n1 (unmanaged)
    * SAPHana_RH2_02    (ocf::heartbeat:SAPHana):        Slave az1n1 (unmanaged)
  * vip_RH2_02_MASTER   (ocf::heartbeat:IPaddr2):        Stopped (disabled, unmanaged)

The vip resource vip_RH2_02_MASTER is stopped.

To start it again run:

[root@az1n1]# pcs resource enable vip_RH2_02_MASTER
Warning: 'vip_RH2_02_MASTER' is unmanaged

The warning is expected because the cluster does not start any resources until maintenance-mode is set to false.

Stop cluster maintenance-mode.

Before you stop maintenance-mode, start two monitors in separate windows to see the changes.

On az2n1, run:

[root@az2n1]# watch pcs status --full

On az1n1, run:

az1n1:rh2adm> watch "python /usr/sap/${SAPSYSTEMNAME}/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py; echo \$?"

Run this command to unset the maintenance-mode on az1n1:

[root@az1n1]# pcs property set maintenance-mode=false

The pcs status --full monitor should show that everything is now running as expected:

Every 2.0s: pcs status --full                                                                                                                                                                                     az1n1: Tue Sep  5 00:01:17 2023

Cluster name: cluster1
Cluster Summary:
  * Stack: corosync
  * Current DC: az1n1 (1) (version 2.1.2-4.el8_6.6-ada5c3b36e2) - partition with quorum
  * Last updated: Tue Sep  5 00:01:17 2023
  * Last change:  Tue Sep  5 00:00:30 2023 by root via crm_attribute on az1n1
  * 2 nodes configured
  * 6 resource instances configured

Node List:
  * Online: [ az1n1 (1) az2n1 (2) ]

Full List of Resources:
  * auto_rhevm_fence1   (stonith:fence_rhevm):   Started az1n1
  * Clone Set: SAPHanaTopology_RH2_02-clone [SAPHanaTopology_RH2_02]:
    * SAPHanaTopology_RH2_02    (ocf::heartbeat:SAPHanaTopology):        Started az2n1
    * SAPHanaTopology_RH2_02    (ocf::heartbeat:SAPHanaTopology):        Started az1n1
  * Clone Set: SAPHana_RH2_02-clone [SAPHana_RH2_02] (promotable):
    * SAPHana_RH2_02    (ocf::heartbeat:SAPHana):        Slave az2n1
    * SAPHana_RH2_02    (ocf::heartbeat:SAPHana):        Master az1n1
  * vip_RH2_02_MASTER   (ocf::heartbeat:IPaddr2):        Started az1n1

Node Attributes:
  * Node: az1n1 (1):
    * hana_rh2_clone_state              : PROMOTED
    * hana_rh2_op_mode                  : logreplay
    * hana_rh2_remoteHost               : az2n1
    * hana_rh2_roles                    : 4:P:master1:master:worker:master
    * hana_rh2_site                     : DC1
    * hana_rh2_sra                      : -
    * hana_rh2_srah                     : -
    * hana_rh2_srmode                   : syncmem
    * hana_rh2_sync_state               : PRIM
    * hana_rh2_version                  : 2.00.062.00
    * hana_rh2_vhost                    : az1n1
    * lpa_rh2_lpt                       : 1693872030
    * master-SAPHana_RH2_02             : 150
  * Node: az2n1 (2):
    * hana_rh2_clone_state              : DEMOTED
    * hana_rh2_op_mode                  : logreplay
    * hana_rh2_remoteHost               : az1n1
    * hana_rh2_roles                    : 4:S:master1:master:worker:master
    * hana_rh2_site                     : DC2
    * hana_rh2_sra                      : -
    * hana_rh2_srah                     : -
    * hana_rh2_srmode                   : syncmem
    * hana_rh2_sync_state               : SOK
    * hana_rh2_version                  : 2.00.062.00
    * hana_rh2_vhost                    : az2n1
    * lpa_rh2_lpt                       : 30
    * master-SAPHana_RH2_02             : 100

Migration Summary:

Tickets:

PCSD Status:
  az1n1: Online
  az2n1: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

After manual interaction, it is always a good idea to clean up the cluster, as described in Cleaning up cluster.

Re-register az3n1 to the new primary on az1n1.

az3n1 needs to be re-registered. To monitor the progress, start on az1n1:

az1n1:rh2adm> watch -n 5 'python /usr/sap/${SAPSYSTEMNAME}/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py ; echo Status $?'

On az3n1, start:

az3n1:rh2adm> watch 'hdbnsutil -sr_state --sapcontrol=1 |grep  siteReplicationMode'

Now you can re-register az3n1 with this command:

az3n1:rh2adm> hdbnsutil -sr_register --remoteHost=az1n1 --remoteInstance=${TINSTANCE} --replicationMode=async --name=DC3 --remoteName=DC1 --operationMode=logreplay --online

The monitor on az1n1 changes to:

Every 5.0s: python /usr/sap/${SAPSYSTEMNAME}/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py ; echo Status $?                                                                                         az1n1: Tue Sep  5 00:14:40 2023

|Database |Host   |Port  |Service Name |Volume ID |Site ID |Site Name |Secondary |Secondary |Secondary |Secondary |Secondary     |Replication |Replication |Replication    |Secondary    |
|         |       |      |             |          |        |          |Host      |Port      |Site ID   |Site Name |Active Status |Mode        |Status      |Status Details |Fully Synced |
|-------- |------ |----- |------------ |--------- |------- |--------- |--------- |--------- |--------- |--------- |------------- |----------- |----------- |-------------- |------------ |
|SYSTEMDB |az1n1 |30201 |nameserver   |        1 |      1 |DC1       |az3n1    |    30201 |        3 |DC3       |YES           |ASYNC       |ACTIVE      |               |        True |
|RH2      |az1n1 |30207 |xsengine     |        2 |      1 |DC1       |az3n1    |    30207 |        3 |DC3       |YES           |ASYNC       |ACTIVE      |               |        True |
|RH2      |az1n1 |30203 |indexserver  |        3 |      1 |DC1       |az3n1    |    30203 |        3 |DC3       |YES           |ASYNC       |ACTIVE      |               |        True |
|SYSTEMDB |az1n1 |30201 |nameserver   |        1 |      1 |DC1       |az2n1    |    30201 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |az1n1 |30207 |xsengine     |        2 |      1 |DC1       |az2n1    |    30207 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |
|RH2      |az1n1 |30203 |indexserver  |        3 |      1 |DC1       |az2n1    |    30203 |        2 |DC2       |YES           |SYNCMEM     |ACTIVE      |               |        True |

status system replication site "3": ACTIVE
status system replication site "2": ACTIVE
overall system replication status: ACTIVE

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: PRIMARY
site id: 1
site name: DC1
Status 15

And the monitor of az3n1 changes to:

Every 2.0s: hdbnsutil -sr_state --sapcontrol=1 |grep  site.*Mode                                                                az3n1: Tue Sep  5 02:15:28 2023

siteReplicationMode/DC1=primary
siteReplicationMode/DC3=syncmem
siteReplicationMode/DC2=syncmem
siteOperationMode/DC1=primary
siteOperationMode/DC3=logreplay
siteOperationMode/DC2=logreplay

Now there are three entries again, and az3n1 (DC3) is again a secondary site replicated from az1n1 (DC1).
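Instead of watching the monitor until the status changes, the wait for return code 15 can be sketched as a polling loop. The helper name wait_for_rc and the retry values in the example are assumptions; the loop simply re-runs the given command until it exits with the wanted code.

```shell
# Re-run a command until it exits with the wanted code or attempts run out.
# Usage: wait_for_rc WANTED_RC MAX_TRIES SLEEP_SECONDS command [args...]
# wait_for_rc is a hypothetical helper name.
wait_for_rc() {
    local want=$1 tries=$2 pause=$3 rc=0 i=0
    shift 3
    while [ "$i" -lt "$tries" ]; do
        rc=0
        "$@" >/dev/null 2>&1 || rc=$?
        if [ "$rc" -eq "$want" ]; then
            return 0
        fi
        i=$((i + 1))
        sleep "$pause"
    done
    echo "gave up after $tries tries, last return code $rc" >&2
    return 1
}

# Example on the primary: poll every 5 seconds, up to 60 times.
# wait_for_rc 15 60 5 python /usr/sap/${SAPSYSTEMNAME}/HDB${TINSTANCE}/exe/python_support/systemReplicationStatus.py
```

The helper returns 0 as soon as the status script reports 15, so it can be chained with the next step of a test script.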

Check if all nodes are part of the system replication status on az1n1.

Run hdbnsutil -sr_state --sapcontrol=1 | grep 'site.*Mode' on all three nodes:

az1n1:rh2adm> hdbnsutil -sr_state --sapcontrol=1 | grep 'site.*Mode'


az2n1:rh2adm> hdbnsutil -sr_state --sapcontrol=1 | grep 'site.*Mode'


az3n1:rh2adm> hdbnsutil -sr_state --sapcontrol=1 | grep 'site.*Mode'

On all nodes, we should get the same output:

siteReplicationMode/DC1=primary
siteReplicationMode/DC3=syncmem
siteReplicationMode/DC2=syncmem
siteOperationMode/DC1=primary
siteOperationMode/DC3=logreplay
siteOperationMode/DC2=logreplay

Check pcs status --full and SOK.

Run:

[root@az1n1]# pcs status --full | grep sync_state

Each sync_state value should be either PRIM or SOK:

 * hana_rh2_sync_state             	: PRIM
 * hana_rh2_sync_state             	: SOK
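The same check can be sketched as a filter that succeeds only when every sync_state attribute is PRIM or SOK. The helper name sync_states_ok is hypothetical, and it assumes the attribute value is the last field on the line, as in the output above.

```shell
# Read `pcs status --full` output on stdin. Succeed only if at least one
# sync_state attribute is present and every value is PRIM or SOK.
# sync_states_ok is a hypothetical helper name.
sync_states_ok() {
    awk '
        /_sync_state/ {
            seen++
            if ($NF != "PRIM" && $NF != "SOK") bad++
        }
        END {
            if (seen > 0 && bad == 0) exit 0
            exit 1
        }'
}

# Example call on a cluster node (commented out so the sketch runs anywhere):
# pcs status --full | sync_states_ok && echo "replication in sync"
```

Any other value, or no sync_state attribute at all, makes the helper return a non-zero exit code.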

Finally, the cluster status should look like this, including the sync_state PRIM and SOK:

[root@az1n1]# pcs status --full
Cluster name: cluster1
Cluster Summary:
  * Stack: corosync
  * Current DC: az1n1 (1) (version 2.1.2-4.el8_6.6-ada5c3b36e2) - partition with quorum
  * Last updated: Tue Sep  5 00:18:52 2023
  * Last change:  Tue Sep  5 00:16:54 2023 by root via crm_attribute on az1n1
  * 2 nodes configured
  * 6 resource instances configured

Node List:
  * Online: [ az1n1 (1) az2n1 (2) ]

Full List of Resources:
  * auto_rhevm_fence1   (stonith:fence_rhevm):   Started az1n1
  * Clone Set: SAPHanaTopology_RH2_02-clone [SAPHanaTopology_RH2_02]:
    * SAPHanaTopology_RH2_02    (ocf::heartbeat:SAPHanaTopology):        Started az2n1
    * SAPHanaTopology_RH2_02    (ocf::heartbeat:SAPHanaTopology):        Started az1n1
  * Clone Set: SAPHana_RH2_02-clone [SAPHana_RH2_02] (promotable):
    * SAPHana_RH2_02    (ocf::heartbeat:SAPHana):        Slave az2n1
    * SAPHana_RH2_02    (ocf::heartbeat:SAPHana):        Master az1n1
  * vip_RH2_02_MASTER   (ocf::heartbeat:IPaddr2):        Started az1n1

Node Attributes:
  * Node: az1n1 (1):
    * hana_rh2_clone_state              : PROMOTED
    * hana_rh2_op_mode                  : logreplay
    * hana_rh2_remoteHost               : az2n1
    * hana_rh2_roles                    : 4:P:master1:master:worker:master
    * hana_rh2_site                     : DC1
    * hana_rh2_sra                      : -
    * hana_rh2_srah                     : -
    * hana_rh2_srmode                   : syncmem
    * hana_rh2_sync_state               : PRIM
    * hana_rh2_version                  : 2.00.062.00
    * hana_rh2_vhost                    : az1n1
    * lpa_rh2_lpt                       : 1693873014
    * master-SAPHana_RH2_02             : 150
  * Node: az2n1 (2):
    * hana_rh2_clone_state              : DEMOTED
    * hana_rh2_op_mode                  : logreplay
    * hana_rh2_remoteHost               : az1n1
    * hana_rh2_roles                    : 4:S:master1:master:worker:master
    * hana_rh2_site                     : DC2
    * hana_rh2_sra                      : -
    * hana_rh2_srah                     : -
    * hana_rh2_srmode                   : syncmem
    * hana_rh2_sync_state               : SOK
    * hana_rh2_version                  : 2.00.062.00
    * hana_rh2_vhost                    : az2n1
    * lpa_rh2_lpt                       : 30
    * master-SAPHana_RH2_02             : 100

Migration Summary:

Tickets:

PCSD Status:
  az1n1: Online
  az2n1: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

Refer to Checking cluster status and Checking database to verify that everything works again.
