Chapter 10. Troubleshooting clusters in stretch mode
You can replace a failed tiebreaker monitor and remove it from the cluster. You can also force the cluster into recovery or healthy mode if needed.
10.1. Replacing the tiebreaker with a monitor in quorum
If your tiebreaker monitor fails, you can replace it with an existing monitor in quorum and remove the failed monitor from the cluster.
Prerequisites
- A running Red Hat Ceph Storage cluster
- Stretch mode is enabled on a cluster
Procedure
Disable automated monitor deployment:
Example
[ceph: root@host01 /]# ceph orch apply mon --unmanaged
Scheduled mon update…
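Optionally, confirm that the orchestrator no longer manages the mon service by listing it; the placement should be reported as unmanaged. This extra check is not part of the original procedure and the exact output format depends on your release:
Example
[ceph: root@host01 /]# ceph orch ls mon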
View the monitors in quorum:
Example
[ceph: root@host01 /]# ceph -s
mon: 5 daemons, quorum host01, host02, host04, host05 (age 30s), out of quorum: host07
Set the monitor in quorum as a new tiebreaker:
Syntax
ceph mon set_new_tiebreaker NEW_HOST
Example
[ceph: root@host01 /]# ceph mon set_new_tiebreaker host02
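If the command succeeds, you can optionally confirm the new tiebreaker without dumping the full monitor map; this additional check simply filters the mon dump output shown later in this procedure:
Example
[ceph: root@host01 /]# ceph mon dump | grep tiebreaker_mon
tiebreaker_mon host02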
Important: You get an error message if the monitor is in the same location as existing non-tiebreaker monitors:
Example
[ceph: root@host01 /]# ceph mon set_new_tiebreaker host02
Error EINVAL: mon.host02 has location DC1, which matches mons host02 on the datacenter dividing bucket for stretch mode.
If that happens, change the location of the monitor:
Syntax
ceph mon set_location HOST datacenter=DATACENTER
Example
[ceph: root@host01 /]# ceph mon set_location host02 datacenter=DC3
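Optionally, confirm the updated location; this extra check filters the monitor map for the monitor you moved, and the entry number and addresses shown here are taken from the verification example later in this procedure:
Example
[ceph: root@host01 /]# ceph mon dump | grep mon.host02
0: [v2:132.224.169.63:3300/0,v1:132.224.169.63:6789/0] mon.host02; crush_location {datacenter=DC3}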
Remove the failed tiebreaker monitor:
Syntax
ceph orch daemon rm FAILED_TIEBREAKER_MONITOR --force
Example
[ceph: root@host01 /]# ceph orch daemon rm mon.host07 --force
Removed mon.host07 from host 'host07'
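Optionally, verify that the failed monitor daemon is gone; mon.host07 should no longer appear in the orchestrator daemon listing. This is a suggested extra check, not part of the original procedure:
Example
[ceph: root@host01 /]# ceph orch ps --daemon_type mon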
Once the monitor is removed from the host, redeploy the monitor:
Syntax
ceph mon add HOST IP_ADDRESS datacenter=DATACENTER
ceph orch daemon add mon HOST
Example
[ceph: root@host01 /]# ceph mon add host07 213.222.226.50 datacenter=DC1
[ceph: root@host01 /]# ceph orch daemon add mon host07
Ensure there are five monitors in quorum:
Example
[ceph: root@host01 /]# ceph -s
mon: 5 daemons, quorum host01, host02, host04, host05, host07 (age 15s)
Verify that everything is configured properly:
Example
[ceph: root@host01 /]# ceph mon dump
epoch 19
fsid 1234ab78-1234-11ed-b1b1-de456ef0a89d
last_changed 2023-01-17T04:12:05.709475+0000
created 2023-01-16T05:47:25.631684+0000
min_mon_release 16 (pacific)
election_strategy: 3
stretch_mode_enabled 1
tiebreaker_mon host02
disallowed_leaders host02
0: [v2:132.224.169.63:3300/0,v1:132.224.169.63:6789/0] mon.host02; crush_location {datacenter=DC3}
1: [v2:220.141.179.34:3300/0,v1:220.141.179.34:6789/0] mon.host04; crush_location {datacenter=DC2}
2: [v2:40.90.220.224:3300/0,v1:40.90.220.224:6789/0] mon.host01; crush_location {datacenter=DC1}
3: [v2:60.140.141.144:3300/0,v1:60.140.141.144:6789/0] mon.host07; crush_location {datacenter=DC1}
4: [v2:186.184.61.92:3300/0,v1:186.184.61.92:6789/0] mon.host03; crush_location {datacenter=DC2}
dumped monmap epoch 19
Redeploy the monitors:
Syntax
ceph orch apply mon --placement="HOST_1, HOST_2, HOST_3, HOST_4, HOST_5"
Example
[ceph: root@host01 /]# ceph orch apply mon --placement="host01, host02, host04, host05, host07"
Scheduled mon update...
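Optionally, confirm that the mon service is managed again and that the placement lists the five expected hosts; this is a suggested follow-up check and the output format depends on your release:
Example
[ceph: root@host01 /]# ceph orch ls mon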
10.2. Replacing the tiebreaker with a new monitor
If your tiebreaker monitor fails, you can replace it with a new monitor and remove the failed monitor from the cluster.
Prerequisites
- A running Red Hat Ceph Storage cluster
- Stretch mode is enabled on a cluster
Procedure
Add a new monitor to the cluster:
Manually add the crush_location to the new monitor:
Syntax
ceph mon add NEW_HOST IP_ADDRESS datacenter=DATACENTER
Example
[ceph: root@host01 /]# ceph mon add host06 213.222.226.50 datacenter=DC3
adding mon.host06 at [v2:213.222.226.50:3300/0,v1:213.222.226.50:6789/0]
Note: The new monitor has to be in a different location than the existing non-tiebreaker monitors.
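Optionally, confirm that the new monitor and its crush_location are present in the monitor map; the entry number and addresses shown here are taken from the verification example later in this procedure:
Example
[ceph: root@host01 /]# ceph mon dump | grep mon.host06
0: [v2:213.222.226.50:3300/0,v1:213.222.226.50:6789/0] mon.host06; crush_location {datacenter=DC3}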
Disable automated monitor deployment:
Example
[ceph: root@host01 /]# ceph orch apply mon --unmanaged
Scheduled mon update…
Deploy the new monitor:
Syntax
ceph orch daemon add mon NEW_HOST
Example
[ceph: root@host01 /]# ceph orch daemon add mon host06
Ensure that there are six monitors, five of which are in quorum:
Example
[ceph: root@host01 /]# ceph -s
mon: 6 daemons, quorum host01, host02, host04, host05, host06 (age 30s), out of quorum: host07
Set the new monitor as a new tiebreaker:
Syntax
ceph mon set_new_tiebreaker NEW_HOST
Example
[ceph: root@host01 /]# ceph mon set_new_tiebreaker host06
Remove the failed tiebreaker monitor:
Syntax
ceph orch daemon rm FAILED_TIEBREAKER_MONITOR --force
Example
[ceph: root@host01 /]# ceph orch daemon rm mon.host07 --force
Removed mon.host07 from host 'host07'
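Optionally, check the quorum again; with the failed tiebreaker removed, five monitors should remain and all of them should be in quorum. This is a suggested extra check, and the age value will differ in your cluster:
Example
[ceph: root@host01 /]# ceph -s
mon: 5 daemons, quorum host01, host02, host04, host05, host06 (age 20s)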
Verify that everything is configured properly:
Example
[ceph: root@host01 /]# ceph mon dump
epoch 19
fsid 1234ab78-1234-11ed-b1b1-de456ef0a89d
last_changed 2023-01-17T04:12:05.709475+0000
created 2023-01-16T05:47:25.631684+0000
min_mon_release 16 (pacific)
election_strategy: 3
stretch_mode_enabled 1
tiebreaker_mon host06
disallowed_leaders host06
0: [v2:213.222.226.50:3300/0,v1:213.222.226.50:6789/0] mon.host06; crush_location {datacenter=DC3}
1: [v2:220.141.179.34:3300/0,v1:220.141.179.34:6789/0] mon.host04; crush_location {datacenter=DC2}
2: [v2:40.90.220.224:3300/0,v1:40.90.220.224:6789/0] mon.host01; crush_location {datacenter=DC1}
3: [v2:60.140.141.144:3300/0,v1:60.140.141.144:6789/0] mon.host02; crush_location {datacenter=DC1}
4: [v2:186.184.61.92:3300/0,v1:186.184.61.92:6789/0] mon.host05; crush_location {datacenter=DC2}
dumped monmap epoch 19
Redeploy the monitors:
Syntax
ceph orch apply mon --placement="HOST_1, HOST_2, HOST_3, HOST_4, HOST_5"
Example
[ceph: root@host01 /]# ceph orch apply mon --placement="host01, host02, host04, host05, host06"
Scheduled mon update…
10.3. Forcing stretch cluster into recovery or healthy mode
When in stretch degraded mode, the cluster goes into recovery mode automatically after the disconnected data center comes back online. If that does not happen, or if you want to enable recovery mode early, you can force the stretch cluster into recovery mode.
Prerequisites
- A running Red Hat Ceph Storage cluster
- Stretch mode is enabled on a cluster
Procedure
Force the stretch cluster into the recovery mode:
Example
[ceph: root@host01 /]# ceph osd force_recovery_stretch_mode --yes-i-really-mean-it
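Optionally, monitor the recovery progress by watching the overall cluster status, which shows the health state and the placement group states as they recover. This is a suggested extra check and the output depends on your cluster:
Example
[ceph: root@host01 /]# ceph -s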
Note: The recovery state puts the cluster in the HEALTH_WARN state.
When in recovery mode, the cluster should go back into normal stretch mode after the placement groups are healthy. If that does not happen, you can force the stretch cluster into healthy mode:
Example
[ceph: root@host01 /]# ceph osd force_healthy_stretch_mode --yes-i-really-mean-it
Note: You can also run this command if you want to force cross-data-center peering early and you are willing to risk data downtime, or if you have verified separately that all the placement groups can peer, even if they are not fully recovered. You might also want to invoke healthy mode to remove the HEALTH_WARN state that is generated by the recovery state.
Note: The force_recovery_stretch_mode and force_healthy_stretch_mode commands should not be necessary; they are included to deal with unanticipated situations.
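After forcing healthy mode, you can optionally confirm that the warning generated by the recovery state has cleared. This extra check is not part of the original procedure:
Example
[ceph: root@host01 /]# ceph health
HEALTH_OK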