Chapter 10. Troubleshooting clusters in stretch mode


If the tiebreaker monitor fails, you can replace it with another monitor and remove the failed monitor from the cluster. You can also force the cluster into recovery or healthy mode if needed.
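
Before you begin either replacement procedure, you can confirm that stretch mode is enabled and identify the current tiebreaker monitor by filtering the monitor map. In the example scenario used in this chapter, the failing tiebreaker is host07; your values will differ:

Example

[ceph: root@host01 /]# ceph mon dump | grep -E 'stretch_mode_enabled|tiebreaker_mon'

stretch_mode_enabled 1
tiebreaker_mon host07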

10.1. Replacing the tiebreaker with a monitor in quorum

If your tiebreaker monitor fails, you can replace it with one of the existing monitors in quorum and remove the failed monitor from the cluster.

Prerequisites

  • A running Red Hat Ceph Storage cluster
  • Stretch mode is enabled on the cluster

Procedure

  1. Disable automated monitor deployment:

    Example

    [ceph: root@host01 /]# ceph orch apply mon --unmanaged
    
    Scheduled mon update…
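
    Optionally, verify that the mon service is no longer managed by the orchestrator. In most releases, the PLACEMENT column reports the service as unmanaged:

    Example

    [ceph: root@host01 /]# ceph orch ls mon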

  2. View the monitors in quorum:

    Example

    [ceph: root@host01 /]# ceph -s
    
    mon: 5 daemons, quorum host01, host02, host04, host05 (age 30s), out of quorum: host07
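
    You can also list the quorum members in JSON format, which is easier to parse in scripts. The quorum_names array in the output lists the monitors currently in quorum:

    Example

    [ceph: root@host01 /]# ceph quorum_status --format json-pretty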

  3. Set the monitor in quorum as a new tiebreaker:

    Syntax

    ceph mon set_new_tiebreaker NEW_HOST

    Example

    [ceph: root@host01 /]# ceph mon set_new_tiebreaker host02

    Important

    The command returns an error if the new tiebreaker monitor is in the same location as the existing non-tiebreaker monitors:

    Example

    [ceph: root@host01 /]# ceph mon set_new_tiebreaker host02
    
    Error EINVAL: mon.host02 has location DC1, which matches mons host02 on the datacenter dividing bucket for stretch mode.

    If that happens, change the location of the monitor:

    Syntax

    ceph mon set_location HOST datacenter=DATACENTER

    Example

    [ceph: root@host01 /]# ceph mon set_location host02 datacenter=DC3
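
    You can confirm that the monitor now reports the new location before retrying the set_new_tiebreaker command, for example by filtering the monitor map for the monitor you moved. The crush_location for host02 should now show datacenter=DC3:

    Example

    [ceph: root@host01 /]# ceph mon dump | grep host02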

  4. Remove the failed tiebreaker monitor:

    Syntax

    ceph orch daemon rm FAILED_TIEBREAKER_MONITOR --force

    Example

    [ceph: root@host01 /]# ceph orch daemon rm mon.host07 --force
    
    Removed mon.host07 from host 'host07'
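
    Optionally, confirm that the failed monitor daemon is no longer listed by the orchestrator:

    Example

    [ceph: root@host01 /]# ceph orch ps --daemon-type mon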

  5. After the monitor is removed from the host, add it back to the monitor map with its data center location and redeploy it:

    Syntax

    ceph mon add HOST IP_ADDRESS datacenter=DATACENTER
    ceph orch daemon add mon HOST

    Example

    [ceph: root@host01 /]# ceph mon add host07 213.222.226.50 datacenter=DC1
    [ceph: root@host01 /]# ceph orch daemon add mon host07

  6. Ensure there are five monitors in quorum:

    Example

    [ceph: root@host01 /]# ceph -s
    
    mon: 5 daemons, quorum host01, host02, host04, host05, host07 (age 15s)

  7. Verify that everything is configured properly:

    Example

    [ceph: root@host01 /]# ceph mon dump
    
    epoch 19
    fsid 1234ab78-1234-11ed-b1b1-de456ef0a89d
    last_changed 2023-01-17T04:12:05.709475+0000
    created 2023-01-16T05:47:25.631684+0000
    min_mon_release 16 (pacific)
    election_strategy: 3
    stretch_mode_enabled 1
    tiebreaker_mon host02
    disallowed_leaders host02
    0: [v2:132.224.169.63:3300/0,v1:132.224.169.63:6789/0] mon.host02; crush_location {datacenter=DC3}
    1: [v2:220.141.179.34:3300/0,v1:220.141.179.34:6789/0] mon.host04; crush_location {datacenter=DC2}
    2: [v2:40.90.220.224:3300/0,v1:40.90.220.224:6789/0] mon.host01; crush_location {datacenter=DC1}
    3: [v2:60.140.141.144:3300/0,v1:60.140.141.144:6789/0] mon.host07; crush_location {datacenter=DC1}
    4: [v2:186.184.61.92:3300/0,v1:186.184.61.92:6789/0] mon.host05; crush_location {datacenter=DC2}
    dumped monmap epoch 19

  8. Redeploy the monitors:

    Syntax

    ceph orch apply mon --placement="HOST_1, HOST_2, HOST_3, HOST_4, HOST_5"

    Example

    [ceph: root@host01 /]# ceph orch apply mon --placement="host01, host02, host04, host05, host07"
    
    Scheduled mon update...

10.2. Replacing the tiebreaker with a new monitor

If your tiebreaker monitor fails, you can replace it with a new monitor and remove the failed monitor from the cluster.

Prerequisites

  • A running Red Hat Ceph Storage cluster
  • Stretch mode is enabled on the cluster

Procedure

  1. Add a new monitor to the cluster:

    1. Manually add the crush_location to the new monitor:

      Syntax

      ceph mon add NEW_HOST IP_ADDRESS datacenter=DATACENTER

      Example

      [ceph: root@host01 /]# ceph mon add host06 213.222.226.50 datacenter=DC3
      
      adding mon.host06 at [v2:213.222.226.50:3300/0,v1:213.222.226.50:6789/0]

      Note

      The new monitor must be in a different location than the existing non-tiebreaker monitors.
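
      For example, you can review the locations that are already assigned to the existing monitors before choosing the data center for the new monitor:

      Example

      [ceph: root@host01 /]# ceph mon dump | grep crush_location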

    2. Disable automated monitor deployment:

      Example

      [ceph: root@host01 /]# ceph orch apply mon --unmanaged
      
      Scheduled mon update…

    3. Deploy the new monitor:

      Syntax

      ceph orch daemon add mon NEW_HOST

      Example

      [ceph: root@host01 /]# ceph orch daemon add mon host06

  2. Ensure that there are six monitors, of which five are in quorum:

    Example

    [ceph: root@host01 /]# ceph -s
    
    mon: 6 daemons, quorum host01, host02, host04, host05, host06 (age 30s), out of quorum: host07

  3. Set the new monitor as a new tiebreaker:

    Syntax

    ceph mon set_new_tiebreaker NEW_HOST

    Example

    [ceph: root@host01 /]# ceph mon set_new_tiebreaker host06

  4. Remove the failed tiebreaker monitor:

    Syntax

    ceph orch daemon rm FAILED_TIEBREAKER_MONITOR --force

    Example

    [ceph: root@host01 /]# ceph orch daemon rm mon.host07 --force
    
    Removed mon.host07 from host 'host07'

  5. Verify that everything is configured properly:

    Example

    [ceph: root@host01 /]# ceph mon dump
    
    epoch 19
    fsid 1234ab78-1234-11ed-b1b1-de456ef0a89d
    last_changed 2023-01-17T04:12:05.709475+0000
    created 2023-01-16T05:47:25.631684+0000
    min_mon_release 16 (pacific)
    election_strategy: 3
    stretch_mode_enabled 1
    tiebreaker_mon host06
    disallowed_leaders host06
    0: [v2:213.222.226.50:3300/0,v1:213.222.226.50:6789/0] mon.host06; crush_location {datacenter=DC3}
    1: [v2:220.141.179.34:3300/0,v1:220.141.179.34:6789/0] mon.host04; crush_location {datacenter=DC2}
    2: [v2:40.90.220.224:3300/0,v1:40.90.220.224:6789/0] mon.host01; crush_location {datacenter=DC1}
    3: [v2:60.140.141.144:3300/0,v1:60.140.141.144:6789/0] mon.host02; crush_location {datacenter=DC1}
    4: [v2:186.184.61.92:3300/0,v1:186.184.61.92:6789/0] mon.host05; crush_location {datacenter=DC2}
    dumped monmap epoch 19

  6. Redeploy the monitors:

    Syntax

    ceph orch apply mon --placement="HOST_1, HOST_2, HOST_3, HOST_4, HOST_5"

    Example

    [ceph: root@host01 /]# ceph orch apply mon --placement="host01, host02, host04, host05, host06"
    
    Scheduled mon update…

10.3. Forcing stretch cluster into recovery or healthy mode

When the cluster is in degraded stretch mode, it goes into recovery mode automatically after the disconnected data center comes back online. If that does not happen, or if you want to enable recovery mode early, you can force the stretch cluster into recovery mode.

Prerequisites

  • A running Red Hat Ceph Storage cluster
  • Stretch mode is enabled on the cluster

Procedure

  1. Force the stretch cluster into the recovery mode:

    Example

    [ceph: root@host01 /]# ceph osd force_recovery_stretch_mode --yes-i-really-mean-it

    Note

    The recovery state puts the cluster in the HEALTH_WARN state.
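
    You can review the warning details while the cluster is recovering; the exact message text can vary between releases:

    Example

    [ceph: root@host01 /]# ceph health detail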

  2. When the cluster is in recovery mode, it should go back into normal stretch mode after the placement groups are healthy. If that does not happen, you can force the stretch cluster into healthy mode:

    Example

    [ceph: root@host01 /]# ceph osd force_healthy_stretch_mode --yes-i-really-mean-it

    Note

    You can also run this command if you want to force the cross-data-center peering early and you are willing to risk data downtime, or you have verified separately that all the placement groups can peer, even if they are not fully recovered.

    You might also wish to invoke the healthy mode to remove the HEALTH_WARN state, which is generated by the recovery state.
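
    For example, you can review a summary of the placement group states before forcing the cluster into healthy mode; the exact counts depend on your cluster:

    Example

    [ceph: root@host01 /]# ceph pg stat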

    Note

    The force_recovery_stretch_mode and force_healthy_stretch_mode commands should not be necessary during normal operation; they are included to manage unanticipated situations.
