
Chapter 4. Stretch clusters for Ceph storage


As a storage administrator, you can configure a two-site stretched cluster by enabling stretch mode in Ceph.

Red Hat Ceph Storage systems offer the option to expand the failure domain beyond the OSD level to a datacenter or cloud zone level.

The following diagram depicts a simplified representation of a Ceph cluster operating in stretch mode, where the tiebreaker host is provisioned in data center (DC) 3.

Figure 4.1. Stretch clusters for Ceph storage

A stretch cluster operates over a Wide Area Network (WAN), unlike a typical Ceph cluster, which operates over a Local Area Network (LAN). For illustration purposes, a data center is chosen as the failure domain, though this could also represent a cloud availability zone. Data Center 1 (DC1) and Data Center 2 (DC2) contain OSDs and Monitors within their respective domains, while Data Center 3 (DC3) contains only a single monitor. The latency between DC1 and DC2 should not exceed 10 ms RTT, as higher latency can significantly impact Ceph performance in terms of replication, recovery, and related operations. However, DC3—a non-data site typically hosted on a virtual machine—can tolerate higher latency compared to the two data sites. A stretch cluster, like the one in the diagram, can withstand a complete data center failure or a network partition between data centers as long as at least two sites remain connected.

Note

There are no additional steps to power down a stretch cluster. For more information, see Powering down and rebooting Red Hat Ceph Storage cluster.

4.1. Stretch mode for a storage cluster

To improve availability in stretched clusters (geographically distributed deployments), you must enable stretch mode. When stretch mode is enabled, Ceph OSDs only take placement groups (PGs) active when the PGs peer across data centers, or across whichever other CRUSH bucket type you specified, assuming both are active. Pools increase in size from the default three to four, with two copies on each site.

In stretch mode, Ceph OSDs are only allowed to connect to monitors within the same data center. New monitors are not allowed to join the cluster without a specified location.

If all the OSDs and monitors from a data center become inaccessible at once, the surviving data center will enter a degraded stretch mode. This issues a warning, reduces the min_size to 1, and allows the cluster to reach an active state with the data from the remaining site.

Stretch mode is designed to handle netsplit scenarios between two data centers and the loss of one data center. Stretch mode handles the netsplit scenario by choosing the surviving data center with a better connection to the tiebreaker monitor. Stretch mode handles the loss of one data center by reducing the min_size of all pools to 1, allowing the cluster to continue operating with the remaining data center. When the lost data center comes back, the cluster will recover the lost data and return to normal operation.

Note
In a stretch cluster, when a site goes down and the cluster enters a degraded state, the min_size of the pool may be temporarily reduced (for example, to 1) to allow the placement groups (PGs) to become active and continue serving I/O. However, the size of the pool remains unchanged. The peering_crush_bucket_count stretch mode flag ensures that PGs do not become active unless they are backed by OSDs in a minimum number of distinct CRUSH buckets (for example, different data centers). This mechanism prevents the system from creating redundant copies solely within the surviving site, ensuring that data is only fully replicated once the downed site recovers.
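You can observe these values for any replicated pool with the ceph osd pool get command. The pool name rbdpool is only an illustrative placeholder; use any replicated pool in your cluster.

Example

[ceph: root@host01 /]# ceph osd pool get rbdpool size
[ceph: root@host01 /]# ceph osd pool get rbdpool min_size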

When the missing data center becomes accessible again, the cluster enters recovery stretch mode. This changes the warning and allows peering, but still requires only the OSDs from the data center that was up the whole time.

When all PGs are in a known state, and are not degraded or incomplete, the cluster returns to regular stretch mode, ends the warning, and restores min_size to its starting value of 2. The cluster again requires both sites to peer, not only the site that stayed up the whole time, so that you can fail over to the other site, if necessary.

Stretch mode limitations

  • It is not possible to exit from stretch mode once it is entered.
  • You cannot use erasure-coded pools with clusters in stretch mode. You can neither enter stretch mode with erasure-coded pools nor create an erasure-coded pool while stretch mode is active.
  • Device classes are not supported in stretch mode. For example, a rule that specifies the class hdd, as shown in the following example, is not supported.

    Example

    rule stretch_replicated_rule {
        id 2
        type replicated class hdd
        step take default
        step choose firstn 0 type datacenter
        step chooseleaf firstn 2 type host
        step emit
    }

    To achieve the same weight at both sites, the Ceph OSDs deployed in the two sites should be of equal size; that is, the storage capacity in the first site is equivalent to the storage capacity in the second site. (See the weight check example after this list.)

  • While it is not enforced, you should run two Ceph monitors on each site and a tiebreaker, for a total of five. This is because OSDs can only connect to monitors in their own site when in stretch mode.
  • You have to create your own CRUSH rule, which provides two copies on each site, for a total of four copies across both sites.
  • You cannot enable stretch mode if you have existing pools with non-default size or min_size.
  • Because the cluster runs with min_size 1 when degraded, you should only use stretch mode with all-flash OSDs. This minimizes the time needed to recover once connectivity is restored, and minimizes the potential for data loss.
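To compare the weights of the two sites, you can inspect the CRUSH hierarchy and confirm that the two data center buckets report roughly equal weights. This is a hedged check that assumes the buckets are named DC1 and DC2, as elsewhere in this chapter.

Example

[ceph: root@host01 /]# ceph osd crush tree | grep -E 'root|datacenter'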

Stretch peering rule

In Ceph stretch cluster mode, a critical safeguard is enforced through the stretch peering rule, which ensures that a Placement Group (PG) cannot become active if all acting replicas reside within a single failure domain, such as a single data center or cloud availability zone.

This behavior is essential for protecting data integrity during site failures. If a PG were allowed to go active with all replicas confined to one site, write operations could be falsely acknowledged without true redundancy. In the event of a site outage, this would result in complete data loss for those PGs. By enforcing zone diversity in the acting set, Ceph stretch clusters maintain high availability while minimizing the risk of data inconsistency or loss.
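One way to see this safeguard in effect is to inspect the acting set of a placement group and confirm that it contains OSDs from both data centers. This is a hedged example; 1.0 is a placeholder PG ID, and you can map the reported OSD IDs to hosts and data centers with the ceph osd tree command.

Example

[ceph: root@host01 /]# ceph pg map 1.0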

4.2. Deployment requirements

This information details important hardware, software, and network requirements that are needed for deploying a generalized stretch cluster configuration for three availability zones.

Software requirements

Red Hat Ceph Storage 8.1

Hardware requirements

Ensure that the following minimum hardware requirements are met before deploying a stretch cluster configuration.

Table 4.1. ceph-osd hardware requirements
Hardware criteria | Minimum and recommended

Processor

  • 1 core minimum, 2 recommended.
  • 1 core per 200-500 MB/s throughput.
  • 1 core per 1000-3000 IOPS.
  • Results are before replication.
  • Results can vary across CPU and drive models and Ceph configuration (erasure coding and compression).
  • ARM processors specifically can require more cores for performance.
  • SSD OSDs, especially NVMe, benefit from extra cores per OSD.
  • Actual performance depends on various factors, including drives, network, and client throughput and latency. Benchmarking is recommended to assess performance accurately.

RAM

  • 4 GB or more per daemon is required (higher is recommended).
  • 2-4 GB can work, but performance might be slower.
  • Less than 2 GB is not recommended for optimal performance.

Network

A single 1 Gb/s (bonded 10+ Gb/s recommended).

Table 4.2. ceph-mon hardware requirements

Hardware criteria | Minimum and recommended
Processor | 2 cores minimum
Storage drives | 100 GB per daemon. SSD is recommended.
Network | A single 1 Gb/s (10+ Gb/s recommended)

Table 4.3. ceph-mds hardware requirements

Hardware criteria | Minimum and recommended
Processor | 2 cores minimum
RAM | 2 GB per daemon (more for production)
Disk space | 1 GB per daemon
Network | A single 1 Gb/s (10+ Gb/s recommended)

Daemon placement

The following table lists the daemon placement details across various hosts and data centers.

Table 4.4. Daemon placement

Hostname | Data center | Services
host01 | DC1 | OSD+MON+MGR
host02 | DC1 | OSD+MON+MGR
host03 | DC1 | OSD+MDS+RGW
host04 | DC2 | OSD+MON+MGR
host05 | DC2 | OSD+MON+MGR
host06 | DC2 | OSD+MDS+RGW
host07 | DC3 (Tiebreaker) | MON

Network configuration requirements

Ensure that the following network configuration requirements are met before deploying a stretch cluster configuration.

Note

You can use different subnets for each of the data centers.

  • Have two separate networks, one public network and one cluster network.
  • The latencies between data centers that run the Ceph Object Storage Devices (OSDs) cannot exceed 10 ms RTT.

The following is an example of a basic network configuration:

  • DC1

    Ceph public/private network: 10.0.40.0/24

  • DC2

    Ceph public/private network: 10.0.40.0/24

  • Tiebreaker

    Ceph public/private network: 10.0.40.0/24
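To confirm that the latency between the two data sites stays within the 10 ms RTT requirement, you can run a simple round-trip test between OSD hosts in the two data centers. This is a hedged check; host04 is a placeholder for a host in the other data center.

Example

[root@host01 ~]# ping -c 20 host04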

Cluster setup requirements

Ensure that the hostname is configured by using the bare or short hostname on all hosts.

Syntax

hostnamectl set-hostname SHORT_NAME

Important

The hostname command must return only the short hostname when run on all nodes. If the FQDN is returned, the cluster configuration will not be successful.
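As a quick check, run the hostname command on each node and confirm that only the short name is returned.

Example

[root@host01 ~]# hostname
host01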

4.3. Setting the CRUSH location for the daemons

Before you enter stretch mode, you need to prepare the cluster by setting the CRUSH location for the daemons in the Red Hat Ceph Storage cluster. There are two ways to do this:

  • Bootstrap the cluster through a service configuration file, where the locations are added to the hosts as part of deployment.
  • Set the locations manually through ceph osd crush add-bucket and ceph osd crush move commands after the cluster is deployed.

Method 1: Bootstrapping the cluster

Prerequisites

  • Root-level access to the nodes.

Procedure

  1. If you are bootstrapping your new storage cluster, you can create the service configuration .yaml file that adds the nodes to the Red Hat Ceph Storage cluster and also sets specific labels for where the services should run:

    Example

    service_type: host
    addr: host01
    hostname: host01
    location:
      root: default
      datacenter: DC1
    labels:
      - osd
      - mon
      - mgr
    ---
    service_type: host
    addr: host02
    hostname: host02
    location:
      datacenter: DC1
    labels:
      - osd
      - mon
    ---
    service_type: host
    addr: host03
    hostname: host03
    location:
      datacenter: DC1
    labels:
      - osd
      - mds
      - rgw
    ---
    service_type: host
    addr: host04
    hostname: host04
    location:
      root: default
      datacenter: DC2
    labels:
      - osd
      - mon
      - mgr
    ---
    service_type: host
    addr: host05
    hostname: host05
    location:
      datacenter: DC2
    labels:
      - osd
      - mon
    ---
    service_type: host
    addr: host06
    hostname: host06
    location:
      datacenter: DC2
    labels:
      - osd
      - mds
      - rgw
    ---
    service_type: host
    addr: host07
    hostname: host07
    labels:
      - mon
    ---
    service_type: mon
    placement:
      label: "mon"
    ---
    service_id: cephfs
    placement:
      label: "mds"
    ---
    service_type: mgr
    service_name: mgr
    placement:
      label: "mgr"
    ---
    service_type: osd
    service_id: all-available-devices
    service_name: osd.all-available-devices
    placement:
      label: "osd"
    spec:
      data_devices:
        all: true
    ---
    service_type: rgw
    service_id: objectgw
    service_name: rgw.objectgw
    placement:
      count: 2
      label: "rgw"
    spec:
      rgw_frontend_port: 8080

  2. Bootstrap the storage cluster with the --apply-spec option:

    Syntax

    cephadm bootstrap --apply-spec CONFIGURATION_FILE_NAME --mon-ip MONITOR_IP_ADDRESS --ssh-private-key PRIVATE_KEY --ssh-public-key PUBLIC_KEY --registry-url REGISTRY_URL --registry-username USER_NAME --registry-password PASSWORD

    Example

    [root@host01 ~]# cephadm bootstrap --apply-spec initial-config.yaml --mon-ip 10.10.128.68 --ssh-private-key /home/ceph/.ssh/id_rsa --ssh-public-key /home/ceph/.ssh/id_rsa.pub --registry-url registry.redhat.io --registry-username myuser1 --registry-password mypassword1

    Important

    You can use different command options with the cephadm bootstrap command. However, always include the --apply-spec option to use the service configuration file and configure the host locations.
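    After the bootstrap completes, you can confirm that the hosts, labels, and CRUSH locations from the service configuration file were applied. This is a hedged check that uses a standard orchestrator command.

    [ceph: root@host01 /]# ceph orch host ls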

Method 2: Setting the locations after the deployment

Prerequisites

  • Root-level access to the nodes.

Procedure

  1. Add two buckets to which you plan to set the location of your non-tiebreaker monitors to the CRUSH map, specifying the bucket type as datacenter:

    Syntax

    ceph osd crush add-bucket BUCKET_NAME BUCKET_TYPE

    Example

    [ceph: root@host01 /]# ceph osd crush add-bucket DC1 datacenter
    [ceph: root@host01 /]# ceph osd crush add-bucket DC2 datacenter

  2. Move the buckets under root=default:

    Syntax

    ceph osd crush move BUCKET_NAME root=default

    Example

    [ceph: root@host01 /]# ceph osd crush move DC1 root=default
    [ceph: root@host01 /]# ceph osd crush move DC2 root=default

  3. Move the OSD hosts according to the required CRUSH placement:

    Syntax

    ceph osd crush move HOST datacenter=DATACENTER

    Example

    [ceph: root@host01 /]# ceph osd crush move host01 datacenter=DC1
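    Repeat the command for each remaining OSD host so that the placement matches the daemon placement table, for example:

    [ceph: root@host01 /]# ceph osd crush move host02 datacenter=DC1
    [ceph: root@host01 /]# ceph osd crush move host03 datacenter=DC1
    [ceph: root@host01 /]# ceph osd crush move host04 datacenter=DC2
    [ceph: root@host01 /]# ceph osd crush move host05 datacenter=DC2
    [ceph: root@host01 /]# ceph osd crush move host06 datacenter=DC2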

4.3.1. Entering the stretch mode

Stretch mode is designed to handle two sites. There is a lower risk of component availability outages with two-site clusters.

Prerequisites

  • Root-level access to the nodes.
  • The CRUSH location is set to the hosts.

Procedure

  1. Set the location of each monitor, matching your CRUSH map:

    Syntax

    ceph mon set_location HOST datacenter=DATACENTER

    Example

    [ceph: root@host01 /]# ceph mon set_location host01 datacenter=DC1
    [ceph: root@host01 /]# ceph mon set_location host02 datacenter=DC1
    [ceph: root@host01 /]# ceph mon set_location host04 datacenter=DC2
    [ceph: root@host01 /]# ceph mon set_location host05 datacenter=DC2
    [ceph: root@host01 /]# ceph mon set_location host07 datacenter=DC3

  2. Generate a CRUSH rule which places two copies on each data center:

    Syntax

    ceph osd getcrushmap > COMPILED_CRUSHMAP_FILENAME
    crushtool -d COMPILED_CRUSHMAP_FILENAME -o DECOMPILED_CRUSHMAP_FILENAME

    Example

    [ceph: root@host01 /]# ceph osd getcrushmap > crush.map.bin
    [ceph: root@host01 /]# crushtool -d crush.map.bin -o crush.map.txt

    1. Edit the decompiled CRUSH map file to add a new rule:

      Example

      rule stretch_rule {
              id 1
              type replicated
              min_size 1
              max_size 10
              step take DC1
              step chooseleaf firstn 2 type host
              step emit
              step take DC2
              step chooseleaf firstn 2 type host
              step emit
      }

      The rule id has to be unique. In this example, there is only one other rule, with id 0, so id 1 is used. You might need to use a different rule ID depending on the number of existing rules.

      In this example, there are two data center buckets, named DC1 and DC2, and the rule takes each of them in turn.
      Note

      This rule gives the cluster read affinity towards data center DC1. Therefore, all reads and writes happen through the Ceph OSDs placed in DC1.

      If this is not desirable, and reads and writes are to be distributed evenly across the zones, use the following CRUSH rule instead:

      Example

      rule stretch_rule {
      id 1
      type replicated
      min_size 1
      max_size 10
      step take default
      step choose firstn 0 type datacenter
      step chooseleaf firstn 2 type host
      step emit
      }

      In this rule, the data center is selected randomly and automatically.

      See CRUSH rules for more information on firstn and indep options.

  3. Inject the CRUSH map to make the rule available to the cluster:

    Syntax

    crushtool -c DECOMPILED_CRUSHMAP_FILENAME -o COMPILED_CRUSHMAP_FILENAME
    ceph osd setcrushmap -i COMPILED_CRUSHMAP_FILENAME

    Example

    [ceph: root@host01 /]# crushtool -c crush.map.txt -o crush2.map.bin
    [ceph: root@host01 /]# ceph osd setcrushmap -i crush2.map.bin
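    Optionally, you can sanity-check the placements that the new rule produces by running crushtool in test mode against the compiled map. This is a hedged sketch: --rule 1 corresponds to the rule ID used in the example above, and --num-rep 4 corresponds to the four copies that the rule places.

    [ceph: root@host01 /]# crushtool -i crush2.map.bin --test --rule 1 --num-rep 4 --show-mappings | head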

  4. If you do not run the monitors in connectivity mode, set the election strategy to connectivity:

    Example

    [ceph: root@host01 /]# ceph mon set election_strategy connectivity

  5. Set the location of the tiebreaker monitor and enter stretch mode, specifying the CRUSH bucket type that divides the data centers:

    Syntax

    ceph mon set_location HOST datacenter=DATACENTER
    ceph mon enable_stretch_mode HOST stretch_rule datacenter

    Example

    [ceph: root@host01 /]# ceph mon set_location host07 datacenter=DC3
    [ceph: root@host01 /]# ceph mon enable_stretch_mode host07 stretch_rule datacenter

    In this example the monitor mon.host07 is the tiebreaker.

    Important

    The location of the tiebreaker monitor should differ from the data centers to which you previously set the non-tiebreaker monitors. In the example above, it is data center DC3.

    Important

    Do not add this data center to the CRUSH map as it results in the following error when you try to enter stretch mode:

    Error EINVAL: there are 3 datacenters in the cluster but stretch mode currently only works with 2!
    Note

    If you are writing your own tooling for deploying Ceph, you can use a new --set-crush-location option when booting monitors, instead of running the ceph mon set_location command. This option accepts only a single bucket=location pair, for example ceph-mon --set-crush-location 'datacenter=DC1', which must match the bucket type you specified when running the enable_stretch_mode command.

  6. Verify that the stretch mode is enabled successfully:

    Example

    [ceph: root@host01 /]# ceph osd dump
    
    epoch 361
    fsid 1234ab78-1234-11ed-b1b1-de456ef0a89d
    created 2023-01-16T05:47:28.482717+0000
    modified 2023-01-17T17:36:50.066183+0000
    flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
    crush_version 31
    full_ratio 0.95
    backfillfull_ratio 0.92
    nearfull_ratio 0.85
    require_min_compat_client luminous
    min_compat_client luminous
    require_osd_release quincy
    stretch_mode_enabled true
    stretch_bucket_count 2
    degraded_stretch_mode 0
    recovering_stretch_mode 0
    stretch_mode_bucket 8

    The stretch_mode_enabled field should be set to true. You can also see the number of stretch buckets, the stretch mode bucket, and whether the stretch mode is degraded or recovering.

  7. Verify that the monitors are in the appropriate locations:

    Example

    [ceph: root@host01 /]# ceph mon dump
    
    epoch 19
    fsid 1234ab78-1234-11ed-b1b1-de456ef0a89d
    last_changed 2023-01-17T04:12:05.709475+0000
    created 2023-01-16T05:47:25.631684+0000
    min_mon_release 16 (pacific)
    election_strategy: 3
    stretch_mode_enabled 1
    tiebreaker_mon host07
    disallowed_leaders host07
    0: [v2:132.224.169.63:3300/0,v1:132.224.169.63:6789/0] mon.host07; crush_location {datacenter=DC3}
    1: [v2:220.141.179.34:3300/0,v1:220.141.179.34:6789/0] mon.host04; crush_location {datacenter=DC2}
    2: [v2:40.90.220.224:3300/0,v1:40.90.220.224:6789/0] mon.host01; crush_location {datacenter=DC1}
    3: [v2:60.140.141.144:3300/0,v1:60.140.141.144:6789/0] mon.host02; crush_location {datacenter=DC1}
    4: [v2:186.184.61.92:3300/0,v1:186.184.61.92:6789/0] mon.host05; crush_location {datacenter=DC2}
    dumped monmap epoch 19

    You can also see which monitor is the tiebreaker, and the monitor election strategy.

4.3.2. Configuring a CRUSH map for stretch mode

Use this information to configure a CRUSH map for stretch mode.

Prerequisites

Before you begin, make sure that you have the following prerequisites in place:

  • Root-level access to the nodes.
  • The CRUSH location is set to the hosts.

Procedure

  1. Install the ceph-base RPM package in order to use the crushtool command, which is required to create a CRUSH rule that uses this OSD CRUSH topology.

    Syntax

    dnf -y install ceph-base

  2. Get the compiled CRUSH map from the cluster.

    Syntax

    ceph osd getcrushmap > /etc/ceph/crushmap.bin

  3. Decompile the CRUSH map and convert it to a text file to edit it.

    Syntax

    crushtool -d /etc/ceph/crushmap.bin -o /etc/ceph/crushmap.txt

  4. Add the following rule at the end of the /etc/ceph/crushmap.txt file. This rule distributes reads and writes evenly across the data centers.

    Syntax

    rule stretch_rule {
            id 1
            type replicated
            step take default
            step choose firstn 0 type datacenter
            step chooseleaf firstn 2 type host
            step emit
     }

    1. Optionally, configure the cluster with read/write affinity towards data center 1 (DC1) by using the following rule instead.

      Syntax

      rule stretch_rule {
               id 1
               type replicated
               step take DC1
               step chooseleaf firstn 2 type host
               step emit
               step take DC2
               step chooseleaf firstn 2 type host
               step emit
       }

      The declared CRUSH rule contains the following information:

      Rule name
          Description: A unique name for identifying the rule.
          Value: stretch_rule
      id
          Description: A unique whole number for identifying the rule.
          Value: 1
      type
          Description: Describes whether the rule is for a replicated or erasure-coded pool.
          Value: replicated
      step take default
          Description: Takes the root bucket called default, and begins iterating down the tree.
      step take DC1
          Description: Takes the bucket called DC1, and begins iterating down the tree.
      step choose firstn 0 type datacenter
          Description: Selects the datacenter buckets, and goes into their subtrees.
      step chooseleaf firstn 2 type host
          Description: Selects the number of buckets of the given type. In this case, it is two different hosts located in the data center it entered at the previous level.
      step emit
          Description: Outputs the current value and empties the stack. Typically used at the end of a rule, but may also be used to pick from different trees in the same rule.
  5. Compile the new CRUSH map from /etc/ceph/crushmap.txt and convert it to a binary file /etc/ceph/crushmap2.bin.

    Syntax

    crushtool -c /path/to/crushmap.txt -o /path/to/crushmap2.bin

    Example

    [ceph: root@host01 /]# crushtool -c /etc/ceph/crushmap.txt -o /etc/ceph/crushmap2.bin

  6. Inject the newly created CRUSH map back into the cluster.

    Syntax

    ceph osd setcrushmap -i /path/to/compiled_crushmap

    Example

    [ceph: root@host01 /]# ceph osd setcrushmap -i /path/to/compiled_crushmap
    17

    Note

    The number 17 is a counter and increases (18, 19, and so on) depending on the changes that are made to the CRUSH map.

Verifying

Verify that the newly created stretch_rule is available for use.

Syntax

ceph osd crush rule ls

Example

[ceph: root@host01 /]# ceph osd crush rule ls

replicated_rule
stretch_rule
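Optionally, you can inspect the rule definition by dumping it with the ceph osd crush rule dump command; this is a hedged additional check.

Example

[ceph: root@host01 /]# ceph osd crush rule dump stretch_rule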

4.3.2.1. Entering stretch mode

Stretch mode is designed to handle two sites. There is a lower risk of component availability outages with two-site clusters.

Prerequisites

Before you begin, make sure that you have the following prerequisites in place:

  • Root-level access to the nodes.
  • The CRUSH location is set to the hosts.
  • The CRUSH map configured to include stretch rule.
  • No erasure coded pools in the cluster.
  • Weights of the two sites are the same.

Procedure

  1. Check the current election strategy being used by the monitors.

    Syntax

    ceph mon dump | grep election_strategy

    Note

    The Ceph cluster election_strategy is set to 1, by default.

    Example

    [ceph: root@host01 /]# ceph mon dump | grep election_strategy
    
    dumped monmap epoch 9
    election_strategy: 1

  2. Change the election strategy to connectivity.

    Syntax

    ceph mon set election_strategy connectivity

    For more information about configuring the election strategy, see Configuring monitor election strategy.

  3. Use the ceph mon dump command to verify that the election strategy was updated to 3.

    Example

    [ceph: root@host01 /]# ceph mon dump | grep election_strategy
    
    dumped monmap epoch 22
    election_strategy: 3

  4. Set the location of the tiebreaker monitor. The tiebreaker must be in a location that is different from the two data centers.

    Syntax

    ceph mon set_location TIEBREAKER_HOST datacenter=DC3

    Example

    [ceph: root@host01 /]# ceph mon set_location host07 datacenter=DC3

  5. Verify that the tiebreaker monitor is set as expected.

    Syntax

    ceph mon dump

    Example

    [ceph: root@host01 /]# ceph mon dump
    
    epoch 8
    fsid 4158287e-169e-11f0-b1ad-fa163e98b991
    last_changed 2025-04-11T07:14:48.652801+0000
    created 2025-04-11T06:29:24.974553+0000
    min_mon_release 19 (squid)
    election_strategy: 3
    0: [v2:10.0.57.33:3300/0,v1:10.0.57.33:6789/0] mon.host07; crush_location {datacenter=DC3}
    1: [v2:10.0.58.200:3300/0,v1:10.0.58.200:6789/0] mon.host05; crush_location {datacenter=DC2}
    2: [v2:10.0.58.47:3300/0,v1:10.0.58.47:6789/0] mon.host02; crush_location {datacenter=DC1}
    3: [v2:10.0.58.104:3300/0,v1:10.0.58.104:6789/0] mon.host04; crush_location {datacenter=DC2}
    4: [v2:10.0.58.38:3300/0,v1:10.0.58.38:6789/0] mon.host01; crush_location {datacenter=DC1}
    dumped monmap epoch 8

  6. Enter stretch mode.

    Syntax

    ceph mon enable_stretch_mode TIEBREAKER_HOST STRETCH_RULE STRETCH_BUCKET

    In the following example:

    • The tiebreaker node is set as host07.
    • The stretch rule is stretch_rule, as created in Configuring a CRUSH map for stretch mode.
    • The stretch bucket is set as datacenter.
[ceph: root@host01 /]# ceph mon enable_stretch_mode host07 stretch_rule datacenter

Verifying

Verify that stretch mode was implemented correctly by continuing to Verifying stretch mode.

4.3.2.2. Verifying stretch mode

Use this information to verify that stretch mode was created correctly with the implemented CRUSH rules.

Procedure

  1. Verify that all pools are using the CRUSH rule that was created in the Ceph cluster. In these examples, the CRUSH rule is set as stretch_rule, per the settings that were created in Configuring a CRUSH map for stretch mode.

    Syntax

    for pool in $(rados lspools);do echo -n "Pool: ${pool}; ";ceph osd pool get ${pool} crush_rule;done

    Example

    [ceph: root@host01 /]# for pool in $(rados lspools);do echo -n "Pool: ${pool}; ";ceph osd pool get ${pool} crush_rule;done
    Pool: device_health_metrics; crush_rule: stretch_rule
    Pool: cephfs.cephfs.meta; crush_rule: stretch_rule
    Pool: cephfs.cephfs.data; crush_rule: stretch_rule
    Pool: .rgw.root; crush_rule: stretch_rule
    Pool: default.rgw.log; crush_rule: stretch_rule
    Pool: default.rgw.control; crush_rule: stretch_rule
    Pool: default.rgw.meta; crush_rule: stretch_rule
    Pool: rbdpool; crush_rule: stretch_rule
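    If a pool does not use the expected rule, you can assign the stretch rule to it with the ceph osd pool set command. This is a hedged example; POOL_NAME is a placeholder for the pool that you want to change.

    [ceph: root@host01 /]# ceph osd pool set POOL_NAME crush_rule stretch_rule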

  2. Verify that stretch mode is enabled. Ensure that stretch_mode_enabled is set to true.

    Syntax

    ceph osd dump

    The output includes the following information:

    stretch_mode_enabled
    Set to true if stretch mode is enabled.
    stretch_bucket_count
    The number of data centers with OSDs.
    degraded_stretch_mode
    Output of 0 if not degraded. If the stretch mode is degraded, this outputs the number of up sites.
    recovering_stretch_mode
    Output of 0 if not recovering. If the stretch mode is recovering, the output is 1.
    stretch_mode_bucket
    A unique value set for each CRUSH bucket type. This value is usually set to 8, for data center.

    Example

    "stretch_mode": {
                "stretch_mode_enabled": true,
                "stretch_bucket_count": 2,
                "degraded_stretch_mode": 0,
                "recovering_stretch_mode": 1,
                "stretch_mode_bucket": 8
    Copy to Clipboard Toggle word wrap

  3. Verify the stretch mode settings in the monitor map by using the ceph mon dump command.

    Ensure the following:

    • stretch_mode_enabled is set to 1
    • The correct mon host is set as tiebreaker_mon
    • The correct mon host is set as disallowed_leaders

      Syntax

      ceph mon dump

      Example

      [ceph: root@host01 /]# ceph mon dump
      epoch 16
      fsid ff19789c-f5c7-11ef-8e1c-fa163e4e1f7e
      last_changed 2025-02-28T12:12:51.089706+0000
      created 2025-02-28T11:34:59.325503+0000
      min_mon_release 19 (squid)
      election_strategy: 3
      stretch_mode_enabled 1
      tiebreaker_mon host07
      disallowed_leaders host07
      0: [v2:10.0.56.37:3300/0,v1:10.0.56.37:6789/0] mon.host01; crush_location {datacenter=DC1}
      1: [v2:10.0.59.188:3300/0,v1:10.0.59.188:6789/0] mon.host05; crush_location {datacenter=DC2}
      2: [v2:10.0.59.35:3300/0,v1:10.0.59.35:6789/0] mon.host02; crush_location {datacenter=DC1}
      3: [v2:10.0.56.189:3300/0,v1:10.0.56.189:6789/0] mon.host07; crush_location {datacenter=DC3}
      4: [v2:10.0.56.13:3300/0,v1:10.0.56.13:6789/0] mon.host04; crush_location {datacenter=DC2}
      dumped monmap epoch 16

What to do next

  1. Deploy, configure, and administer a Ceph Object Gateway. For more information, see Ceph Object Gateway.
  2. Manage, create, configure, and use Ceph Block Devices. For more information, see Ceph block devices.
  3. Create, mount, and work with the Ceph File System (CephFS). For more information, see Ceph File Systems.

4.4. Using and maintaining stretch mode

Use and maintain stretch mode by adding OSD hosts, managing data center monitor service hosts, and replacing the tiebreaker with either an existing monitor in quorum or a new monitor.

4.4.1. Adding OSD hosts in stretch mode

You can add Ceph OSDs in stretch mode. The procedure is similar to adding OSD hosts on a cluster where stretch mode is not enabled.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Stretch mode is enabled on the cluster.
  • Root-level access to the nodes.

Procedure

  1. List the available devices to deploy OSDs:

    Syntax

    ceph orch device ls [--hostname=HOST_1 HOST_2] [--wide] [--refresh]

    Example

    [ceph: root@host01 /]# ceph orch device ls

  2. Deploy the OSDs on specific hosts or on all the available devices:

    • Create an OSD from a specific device on a specific host:

      Syntax

      ceph orch daemon add osd HOST:DEVICE_PATH

      Example

      [ceph: root@host01 /]# ceph orch daemon add osd host03:/dev/sdb

    • Deploy OSDs on any available and unused devices:

      Important

      This command creates collocated WAL and DB devices. If you want to create non-collocated devices, do not use this command.

      Example

      [ceph: root@host01 /]# ceph orch apply osd --all-available-devices

  3. Move the OSD hosts under the CRUSH bucket:

    Syntax

    ceph osd crush move HOST datacenter=DATACENTER

    Example

    [ceph: root@host01 /]# ceph osd crush move host03 datacenter=DC1
    [ceph: root@host01 /]# ceph osd crush move host06 datacenter=DC2

    Note

    Ensure you add the same topology nodes on both sites. Issues might arise if hosts are added only on one site.
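    After moving the hosts, you can confirm that the new OSDs appear under the expected data center buckets and that the two sites keep roughly equal weight. This is a hedged check.

    Example

    [ceph: root@host01 /]# ceph osd tree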

4.4.2. Managing data center monitor service hosts in stretch mode

Use this information to add and remove data center monitor service (mon) hosts in stretch mode. You can manage the mon hosts by using a service specification file or directly on the Ceph cluster with the command-line interface.

Prerequisites

Before you begin, make sure that you have the following prerequisites in place:

  • A running Red Hat Ceph Storage cluster
  • Stretch mode is enabled on the cluster
  • Root-level access to the nodes.

4.4.2.1. Managing a mon service with a service specification file

These steps detail how to add a mon service. To remove a mon service, use the same steps to update the service specification file, removing the relevant entries.

Procedure

  1. Export the specification file for mon and save the output to mon-spec.yaml.

    Syntax

    ceph orch ls mon --export > mon-spec.yaml

    After the file is exported, the YAML file can be edited.

  2. Add the new host details. In the following example, host08 is being added to the cluster into the DC2 data center bucket.

    Syntax

    service_type: host
    addr: 10.1.172.225
    hostname: host08
    labels:
    - mon
    ---
    service_type: mon
    service_name: mon
    placement:
      label: mon
    spec:
      crush_locations:
        host01:
        - datacenter=DC1
        host02:
        - datacenter=DC1
        host03:
        - datacenter=DC1
        host04:
        - datacenter=DC2
        host05:
        - datacenter=DC2
        host06:
        - datacenter=DC2
        host08:
        - datacenter=DC2

  3. Apply the specification file.

    Syntax

    ceph orch apply -i mon-spec.yaml

    Example

    [ceph: root@host01 /]# ceph orch apply -i mon-spec.yaml
    Added host 'host08' with addr '10.1.172.225'
    Scheduled mon update...

Verifying

  1. Use the ceph mon dump command to verify that the mon service was deployed and that the appropriate CRUSH location was added to the monitor.

    Example

    [ceph: root@host01 /]# ceph mon dump
    epoch 16
    fsid ff19789c-f5c7-11ef-8e1c-fa163e4e1f7e
    last_changed 2025-02-28T12:12:51.089706+0000
    created 2025-02-28T11:34:59.325503+0000
    min_mon_release 19 (squid)
    election_strategy: 3
    stretch_mode_enabled 1
    tiebreaker_mon host07
    disallowed_leaders host07
    0: [v2:10.0.56.37:3300/0,v1:10.0.56.37:6789/0] mon.host01; crush_location {datacenter=DC1}
    1: [v2:10.0.59.188:3300/0,v1:10.0.59.188:6789/0] mon.host05; crush_location {datacenter=DC2}
    2: [v2:10.0.59.35:3300/0,v1:10.0.59.35:6789/0] mon.host02; crush_location {datacenter=DC1}
    3: [v2:10.0.56.189:3300/0,v1:10.0.56.189:6789/0] mon.host07; crush_location {datacenter=DC3}
    4: [v2:10.0.56.13:3300/0,v1:10.0.56.13:6789/0] mon.host04; crush_location {datacenter=DC2}
    dumped monmap epoch 16

  2. Use the ceph orch host ls command to verify that the host was added to the cluster.

    Example

    [ceph: root@host01 /]# ceph orch host ls
    HOST                                        ADDR         LABELS       STATUS
    host01            10.0.56.37   mgr,mon,osd
    host02            10.0.59.35   mgr,mon,osd
    host03            10.0.58.106  osd,mds,rgw
    host04            10.0.56.13   osd,mon,mgr
    host05            10.0.59.188  mgr,mon,osd
    host06            10.0.56.223  rgw,mds,osd
    host07            10.0.56.189  _admin,mon
    7 hosts in cluster

4.4.2.2. Managing a mon service with the command-line interface

These steps detail how to add a mon service. To remove a mon service, use the same steps with the command-line interface, removing the relevant information.

Procedure

  1. Set the monitor service to unmanaged.

    Syntax

    ceph orch set-unmanaged mon

  2. Optional: Use the ceph orch ls command to verify that the service was set to unmanaged, as expected.

    Example

    [ceph: root@host01 /]# ceph orch ls
    NAME                                 PORTS             RUNNING  REFRESHED  AGE  PLACEMENT
    mon                                                        8/8  10m ago    19s  <unmanaged>

  3. Add a new host with the mon label.

    Syntax

    ceph orch host add HOST_NAME IP_ADDRESS_OF_HOST [--label=LABEL_NAME_1,LABEL_NAME_2]

    Example

    [ceph: root@host01 /]# ceph orch host add host08 10.1.172.205 --labels=mon

  4. Add a monitor service with CRUSH locations.

    Note

    At this point, the mon is not running and is not managed by Cephadm.

    Syntax

    ceph mon add NODE:IP_ADDRESS datacenter=DATACENTER

    Example

    [ceph: root@host01 /]# ceph mon add host08:10.1.172.205 datacenter=DC2

  5. Deploy the monitor daemon using Cephadm.

    Syntax

    ceph orch daemon add mon host08

    Example

    [ceph: root@host01 /]# ceph orch daemon add mon host08
    Deployed mon.host08 on host 'host08'

  6. Enable Cephadm management for the monitor service.

    Syntax

    ceph orch set-managed mon

  7. Start the newly added mon daemon.

    Syntax

    ceph orch daemon start DAEMON_NAME

    Example

    [ceph: root@host01 /]# ceph orch daemon start mon.host08

Verifying

Verify that the service, monitor, and host are added and running.

  1. Use the ceph orch ls command to verify that the service is running.

    Example

    [ceph: root@host01 /]# ceph orch ls
    NAME                                 PORTS             RUNNING  REFRESHED  AGE  PLACEMENT
    mon                                                        8/8  7m ago     4d   label:mon

  2. Use the ceph mon dump command to verify that the mon service was deployed and that the appropriate CRUSH location was added to the monitor.

    Example

    [ceph: root@host01 /]# ceph mon dump
    epoch 16
    fsid ff19789c-f5c7-11ef-8e1c-fa163e4e1f7e
    last_changed 2025-02-28T12:12:51.089706+0000
    created 2025-02-28T11:34:59.325503+0000
    min_mon_release 19 (squid)
    election_strategy: 3
    stretch_mode_enabled 1
    tiebreaker_mon host07
    disallowed_leaders host07
    0: [v2:10.0.56.37:3300/0,v1:10.0.56.37:6789/0] mon.host01; crush_location {datacenter=DC1}
    1: [v2:10.0.59.188:3300/0,v1:10.0.59.188:6789/0] mon.host05; crush_location {datacenter=DC2}
    2: [v2:10.0.59.35:3300/0,v1:10.0.59.35:6789/0] mon.host02; crush_location {datacenter=DC1}
    3: [v2:10.0.56.189:3300/0,v1:10.0.56.189:6789/0] mon.host07; crush_location {datacenter=DC3}
    4: [v2:10.0.56.13:3300/0,v1:10.0.56.13:6789/0] mon.host04; crush_location {datacenter=DC2}
    dumped monmap epoch 16

  3. Use the ceph orch host ls command to verify that the host was added to the cluster.

    Example

    [ceph: root@host01 /]# ceph orch host ls
    HOST                                        ADDR         LABELS       STATUS
    host01            10.0.56.37   mgr,mon,osd
    host02            10.0.59.35   mgr,mon,osd
    host03            10.0.58.106  osd,mds,rgw
    host04            10.0.56.13   osd,mon,mgr
    host05            10.0.59.188  mgr,mon,osd
    host06            10.0.56.223  rgw,mds,osd
    host07            10.0.56.189  _admin,mon
    7 hosts in cluster

4.4.3. Replacing the tiebreaker with a monitor in quorum

If your tiebreaker monitor fails, you can replace it with an existing monitor in quorum and remove it from the cluster.

Prerequisites

  • A running Red Hat Ceph Storage cluster
  • Stretch mode is enabled on a cluster

Procedure

  1. Disable automated monitor deployment:

    Example

    [ceph: root@host01 /]# ceph orch apply mon --unmanaged
    
    Scheduled mon update…

  2. View the monitors in quorum:

    Example

    [ceph: root@host01 /]# ceph -s
    
    mon: 5 daemons, quorum host01, host02, host04, host05 (age 30s), out of quorum: host07

  3. Set the monitor in quorum as a new tiebreaker:

    Syntax

    ceph mon set_new_tiebreaker NEW_HOST

    Example

    [ceph: root@host01 /]# ceph mon set_new_tiebreaker host02

    Important

    You get an error message if the monitor is in the same location as existing non-tiebreaker monitors:

    Example

    [ceph: root@host01 /]# ceph mon set_new_tiebreaker host02
    
    Error EINVAL: mon.host02 has location DC1, which matches mons host02 on the datacenter dividing bucket for stretch mode.

    If that happens, change the location of the monitor:

    Syntax

    ceph mon set_location HOST datacenter=DATACENTER

    Example

    [ceph: root@host01 /]# ceph mon set_location host02 datacenter=DC3

  4. Remove the failed tiebreaker monitor:

    Syntax

    ceph orch daemon rm FAILED_TIEBREAKER_MONITOR --force

    Example

    [ceph: root@host01 /]# ceph orch daemon rm mon.host07 --force
    
    Removed mon.host07 from host 'host07'

  5. Once the monitor is removed from the host, redeploy the monitor:

    Syntax

    ceph mon add HOST IP_ADDRESS datacenter=DATACENTER
    ceph orch daemon add mon HOST

    Example

    [ceph: root@host01 /]# ceph mon add host07 213.222.226.50 datacenter=DC1
    [ceph: root@host01 /]# ceph orch daemon add mon host07

  6. Ensure there are five monitors in quorum:

    Example

    [ceph: root@host01 /]# ceph -s
    
    mon: 5 daemons, quorum host01, host02, host04, host05, host07 (age 15s)

  7. Verify that everything is configured properly:

    Example

    [ceph: root@host01 /]# ceph mon dump
    
    epoch 19
    fsid 1234ab78-1234-11ed-b1b1-de456ef0a89d
    last_changed 2023-01-17T04:12:05.709475+0000
    created 2023-01-16T05:47:25.631684+0000
    min_mon_release 16 (pacific)
    election_strategy: 3
    stretch_mode_enabled 1
    tiebreaker_mon host02
    disallowed_leaders host02
    0: [v2:132.224.169.63:3300/0,v1:132.224.169.63:6789/0] mon.host02; crush_location {datacenter=DC3}
    1: [v2:220.141.179.34:3300/0,v1:220.141.179.34:6789/0] mon.host04; crush_location {datacenter=DC2}
    2: [v2:40.90.220.224:3300/0,v1:40.90.220.224:6789/0] mon.host01; crush_location {datacenter=DC1}
    3: [v2:60.140.141.144:3300/0,v1:60.140.141.144:6789/0] mon.host07; crush_location {datacenter=DC1}
    4: [v2:186.184.61.92:3300/0,v1:186.184.61.92:6789/0] mon.host03; crush_location {datacenter=DC2}
    dumped monmap epoch 19
    Copy to Clipboard Toggle word wrap

  8. Redeploy the monitors:

    Syntax

    ceph orch apply mon --placement="HOST_1, HOST_2, HOST_3, HOST_4, HOST_5"

    Example

    [ceph: root@host01 /]# ceph orch apply mon --placement="host01, host02, host04, host05, host07"
    
    Scheduled mon update...

4.4.4. Replacing the tiebreaker with a new monitor

If your tiebreaker monitor fails, you can replace it with a new monitor and remove it from the cluster.

Prerequisites

Before you begin, make sure that you have the following prerequisites in place:

  • A running Red Hat Ceph Storage cluster
  • Stretch mode is enabled on the cluster

Procedure

  1. Add a new monitor to the cluster:

    1. Manually add the crush_location to the new monitor:

      Syntax

      ceph mon add NEW_HOST IP_ADDRESS datacenter=DATACENTER

      Example

      [ceph: root@host01 /]# ceph mon add host06 213.222.226.50 datacenter=DC3
      
      adding mon.host06 at [v2:213.222.226.50:3300/0,v1:213.222.226.50:6789/0]

      Note

      The new monitor has to be in a different location than existing non-tiebreaker monitors.

    2. Disable automated monitor deployment:

      Example

      [ceph: root@host01 /]# ceph orch apply mon --unmanaged
      
      Scheduled mon update…

    3. Deploy the new monitor:

      Syntax

      ceph orch daemon add mon NEW_HOST

      Example

      [ceph: root@host01 /]# ceph orch daemon add mon host06

  2. Ensure there are six monitors, of which five are in quorum:

    Example

    [ceph: root@host01 /]# ceph -s
    
    mon: 6 daemons, quorum host01, host02, host04, host05, host06 (age 30s), out of quorum: host07

  3. Set the new monitor as a new tiebreaker:

    Syntax

    ceph mon set_new_tiebreaker NEW_HOST

    Example

    [ceph: root@host01 /]# ceph mon set_new_tiebreaker host06

  4. Remove the failed tiebreaker monitor:

    Syntax

    ceph orch daemon rm FAILED_TIEBREAKER_MONITOR --force

    Example

    [ceph: root@host01 /]# ceph orch daemon rm mon.host07 --force
    
    Removed mon.host07 from host 'host07'

  5. Verify that everything is configured properly:

    Example

    [ceph: root@host01 /]# ceph mon dump
    
    epoch 19
    fsid 1234ab78-1234-11ed-b1b1-de456ef0a89d
    last_changed 2023-01-17T04:12:05.709475+0000
    created 2023-01-16T05:47:25.631684+0000
    min_mon_release 16 (pacific)
    election_strategy: 3
    stretch_mode_enabled 1
    tiebreaker_mon host06
    disallowed_leaders host06
    0: [v2:213.222.226.50:3300/0,v1:213.222.226.50:6789/0] mon.host06; crush_location {datacenter=DC3}
    1: [v2:220.141.179.34:3300/0,v1:220.141.179.34:6789/0] mon.host04; crush_location {datacenter=DC2}
    2: [v2:40.90.220.224:3300/0,v1:40.90.220.224:6789/0] mon.host01; crush_location {datacenter=DC1}
    3: [v2:60.140.141.144:3300/0,v1:60.140.141.144:6789/0] mon.host02; crush_location {datacenter=DC1}
    4: [v2:186.184.61.92:3300/0,v1:186.184.61.92:6789/0] mon.host05; crush_location {datacenter=DC2}
    dumped monmap epoch 19

  6. Redeploy the monitors:

    Syntax

    ceph orch apply mon --placement="HOST_1, HOST_2, HOST_3, HOST_4, HOST_5"

    Example

    [ceph: root@host01 /]# ceph orch apply mon --placement="host01, host02, host04, host05, host06"
    
    Scheduled mon update…

4.5. Read affinity in stretch clusters

Read affinity reduces cross-zone traffic by keeping data access within the respective data center.

For stretched clusters deployed in multi-zone environments, the read affinity topology implementation provides a mechanism to help keep traffic within the data center it originated from. Ceph Object Gateway volumes have the ability to read data from an OSD in proximity to the client, according to OSD locations defined in the CRUSH map and topology labels on nodes.

For example, a stretch cluster contains a Ceph Object Gateway primary OSD and replicated OSDs spread across two data centers, A and B. If a GET action is performed on an object in data center A, the READ operation is performed on the data of the OSDs closest to the client in data center A.

4.5.1. Performing localized reads

You can perform a localized read on a replicated pool in a stretch cluster. When a localized read request is made on a replicated pool, Ceph selects the local OSDs closest to the client based on the client location specified in crush_location.

Prerequisites

  • A stretch cluster with two data centers and Ceph Object Gateway configured on both.
  • A user created with a bucket having primary and replicated OSDs.

Procedure

  • To perform a localized read, set rados_replica_read_policy to localize in the Ceph Object Gateway client configuration by using the ceph config set command. A check of the applied setting is shown after this procedure.

    [ceph: root@host01 /]# ceph config set client.rgw.rgw.1 rados_replica_read_policy localize
  • Verification: Perform the following steps to verify that localized reads are served from the OSD set.

    1. Run the ceph osd tree command to view the OSDs and the data centers.

      Example

      [ceph: root@host01 /]# ceph osd tree
      
      ID  CLASS  WEIGHT   TYPE NAME                                 STATUS  REWEIGHT  PRI-AFF
      -1         0.58557  root default
      -3         0.29279      datacenter DC1
      -2         0.09760          host ceph-ci-fbv67y-ammmck-node2
       2    hdd  0.02440              osd.2                             up   1.00000  1.00000
      11    hdd  0.02440              osd.11                            up   1.00000  1.00000
      17    hdd  0.02440              osd.17                            up   1.00000  1.00000
      22    hdd  0.02440              osd.22                            up   1.00000  1.00000
      -4         0.09760          host ceph-ci-fbv67y-ammmck-node3
       0    hdd  0.02440              osd.0                             up   1.00000  1.00000
       6    hdd  0.02440              osd.6                             up   1.00000  1.00000
      12    hdd  0.02440              osd.12                            up   1.00000  1.00000
      18    hdd  0.02440              osd.18                            up   1.00000  1.00000
      -5         0.09760          host ceph-ci-fbv67y-ammmck-node4
       5    hdd  0.02440              osd.5                             up   1.00000  1.00000
      10    hdd  0.02440              osd.10                            up   1.00000  1.00000
      16    hdd  0.02440              osd.16                            up   1.00000  1.00000
      23    hdd  0.02440              osd.23                            up   1.00000  1.00000
      -7         0.29279      datacenter DC2
      -6         0.09760          host ceph-ci-fbv67y-ammmck-node5
       3    hdd  0.02440              osd.3                             up   1.00000  1.00000
       8    hdd  0.02440              osd.8                             up   1.00000  1.00000
      14    hdd  0.02440              osd.14                            up   1.00000  1.00000
      20    hdd  0.02440              osd.20                            up   1.00000  1.00000
      -8         0.09760          host ceph-ci-fbv67y-ammmck-node6
       4    hdd  0.02440              osd.4                             up   1.00000  1.00000
       9    hdd  0.02440              osd.9                             up   1.00000  1.00000
      15    hdd  0.02440              osd.15                            up   1.00000  1.00000
      21    hdd  0.02440              osd.21                            up   1.00000  1.00000
      -9         0.09760          host ceph-ci-fbv67y-ammmck-node7
       1    hdd  0.02440              osd.1                             up   1.00000  1.00000
       7    hdd  0.02440              osd.7                             up   1.00000  1.00000
      13    hdd  0.02440              osd.13                            up   1.00000  1.00000
      19    hdd  0.02440              osd.19                            up   1.00000  1.00000

    2. Run the ceph orch command to identify the Ceph Object Gateway daemons in the data centers.

      Example

      [ceph: root@host01 /]# ceph orch ps | grep rg
      
      rgw.rgw.1.ceph-ci-fbv67y-ammmck-node4.dmsmex         ceph-ci-fbv67y-ammmck-node4            *:80              running (4h)     10m ago  22h    93.3M        -  19.1.0-55.el9cp  0ee0a0ad94c7  34f27723ccd2
      rgw.rgw.1.ceph-ci-fbv67y-ammmck-node7.pocecp         ceph-ci-fbv67y-ammmck-node7            *:80              running (4h)     10m ago  22h    96.4M        -  19.1.0-55.el9cp  0ee0a0ad94c7  40e4f2a6d4c4

    3. Verify that a localized read has happened by viewing the Ceph Object Gateway logs, for example with the vim command.

      Example

      [ceph: root@host01 /]# vim /var/log/ceph/<fsid>/<ceph-client-rgw>.log
      
      2024-08-26T08:07:45.471+0000 7fc623e63640  1 ====== starting new request req=0x7fc5b93694a0 =====
      2024-08-26T08:07:45.471+0000 7fc623e63640  1 -- 10.0.67.142:0/279982082 --> [v2:10.0.66.23:6816/73244434,v1:10.0.66.23:6817/73244434] -- osd_op(unknown.0.0:9081 11.55 11:ab26b168:::3acf4091-c54c-43b5-a495-c505fe545d25.27842.1_f1:head [getxattrs,stat] snapc 0=[] ondisk+read+localize_reads+known_if_redirected+supports_pool_eio e3533) -- 0x55f781bd2000 con 0x55f77f0e8c00

      You can see in the logs that a localized read has taken place.

      Important

      To be able to view the debug logs, you must first enable debug_ms 1 in the configuration by running the ceph config set command.

      [ceph: root@host01 /]# ceph config set client.rgw.rgw.1.ceph-ci-gune2w-mysx73-node4.dgvrmx debug_ms 1/1
      
      [ceph: root@host01 /]# ceph config set client.rgw.rgw.1.ceph-ci-gune2w-mysx73-node7.rfkqqq debug_ms 1/1
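      To confirm that the read policy was applied to the Ceph Object Gateway client, you can query the configuration database. This is a hedged check; the client name must match the one used when setting the option.

      [ceph: root@host01 /]# ceph config get client.rgw.rgw.1 rados_replica_read_policy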

4.5.2. Performing balanced reads

You can perform a balanced read on a pool to distribute read operations evenly across the OSDs in both data centers. When a balanced READ is issued on a pool, the read operations are distributed evenly across all OSDs that are spread across the data centers.

Prerequisites

  • A stretch cluster with two data centers and Ceph Object Gateway configured on both.
  • A user created with a bucket having primary and replicated OSDs.

Procedure

  • To perform a balanced read, set rados_replica_read_policy to balance in the Ceph Object Gateway client configuration by using the ceph config set command.

    [ceph: root@host01 /]# ceph config set client.rgw.rgw.1 rados_replica_read_policy balance
  • Verification: Perform the following steps to verify that balanced reads are served from the OSD set.

    1. Run the ceph osd tree command to view the OSDs and the data centers.

      Example

      [ceph: root@host01 /]# ceph osd tree
      
      ID  CLASS  WEIGHT   TYPE NAME                                 STATUS  REWEIGHT  PRI-AFF
      -1         0.58557  root default
      -3         0.29279      datacenter DC1
      -2         0.09760          host ceph-ci-fbv67y-ammmck-node2
       2    hdd  0.02440              osd.2                             up   1.00000  1.00000
      11    hdd  0.02440              osd.11                            up   1.00000  1.00000
      17    hdd  0.02440              osd.17                            up   1.00000  1.00000
      22    hdd  0.02440              osd.22                            up   1.00000  1.00000
      -4         0.09760          host ceph-ci-fbv67y-ammmck-node3
       0    hdd  0.02440              osd.0                             up   1.00000  1.00000
       6    hdd  0.02440              osd.6                             up   1.00000  1.00000
      12    hdd  0.02440              osd.12                            up   1.00000  1.00000
      18    hdd  0.02440              osd.18                            up   1.00000  1.00000
      -5         0.09760          host ceph-ci-fbv67y-ammmck-node4
       5    hdd  0.02440              osd.5                             up   1.00000  1.00000
      10    hdd  0.02440              osd.10                            up   1.00000  1.00000
      16    hdd  0.02440              osd.16                            up   1.00000  1.00000
      23    hdd  0.02440              osd.23                            up   1.00000  1.00000
      -7         0.29279      datacenter DC2
      -6         0.09760          host ceph-ci-fbv67y-ammmck-node5
       3    hdd  0.02440              osd.3                             up   1.00000  1.00000
       8    hdd  0.02440              osd.8                             up   1.00000  1.00000
      14    hdd  0.02440              osd.14                            up   1.00000  1.00000
      20    hdd  0.02440              osd.20                            up   1.00000  1.00000
      -8         0.09760          host ceph-ci-fbv67y-ammmck-node6
       4    hdd  0.02440              osd.4                             up   1.00000  1.00000
       9    hdd  0.02440              osd.9                             up   1.00000  1.00000
      15    hdd  0.02440              osd.15                            up   1.00000  1.00000
      21    hdd  0.02440              osd.21                            up   1.00000  1.00000
      -9         0.09760          host ceph-ci-fbv67y-ammmck-node7
       1    hdd  0.02440              osd.1                             up   1.00000  1.00000
       7    hdd  0.02440              osd.7                             up   1.00000  1.00000
      13    hdd  0.02440              osd.13                            up   1.00000  1.00000
      19    hdd  0.02440              osd.19                            up   1.00000  1.00000

    2. Run the ceph orch command to identify the Ceph Object Gateway daemons in the data centers.

      Example

      [ceph: root@host01 /]# ceph orch ps | grep rgw
      
      rgw.rgw.1.ceph-ci-fbv67y-ammmck-node4.dmsmex         ceph-ci-fbv67y-ammmck-node4            *:80              running (4h)     10m ago  22h    93.3M        -  19.1.0-55.el9cp  0ee0a0ad94c7  34f27723ccd2
      rgw.rgw.1.ceph-ci-fbv67y-ammmck-node7.pocecp         ceph-ci-fbv67y-ammmck-node7            *:80              running (4h)     10m ago  22h    96.4M        -  19.1.0-55.el9cp  0ee0a0ad94c7  40e4f2a6d4c4

    3. Verify that a balanced read has taken place by opening the Ceph Object Gateway log, for example with the vim command.

      Example

      [ceph: root@host01 /]# vim /var/log/ceph/<fsid>/<ceph-client-rgw>.log
      
      2024-08-27T09:32:25.510+0000 7f2a7a284640  1 ====== starting new request req=0x7f2a31fcf4a0 =====
      2024-08-27T09:32:25.510+0000 7f2a7a284640  1 -- 10.0.67.142:0/3116867178 --> [v2:10.0.64.146:6816/2838383288,v1:10.0.64.146:6817/2838383288] -- osd_op(unknown.0.0:268731 11.55 11:ab26b168:::3acf4091-c54c-43b5-a495-c505fe545d25.27842.1_f1:head [getxattrs,stat] snapc 0=[] ondisk+read+balance_reads+known_if_redirected+supports_pool_eio e3554) -- 0x55cd1b88dc00 con 0x55cd18dd6000

      The balance_reads flag in the osd_op entry shows that a balanced read has taken place. A grep-based check is sketched at the end of this procedure.

      Important

      To view these debug messages, you must first set debug_ms to 1 for the Ceph Object Gateway clients by running the ceph config set command.

      [ceph: root@host01 /]# ceph config set client.rgw.rgw.1.ceph-ci-gune2w-mysx73-node4.dgvrmx debug_ms 1/1

      [ceph: root@host01 /]# ceph config set client.rgw.rgw.1.ceph-ci-gune2w-mysx73-node7.rfkqqq debug_ms 1/1
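
      You can also check how evenly the balanced reads are spread from the command line. The following is a minimal sketch, assuming the log path placeholders shown above and that debug_ms is already set to 1; it counts the read operations that carry the balance_reads flag and groups them by target OSD address.

      Example

      # Count OSD read operations that carry the balance_reads flag.
      [ceph: root@host01 /]# grep -c "balance_reads" /var/log/ceph/<fsid>/<ceph-client-rgw>.log

      # Group those reads by target OSD address to see how evenly they are distributed.
      [ceph: root@host01 /]# grep "balance_reads" /var/log/ceph/<fsid>/<ceph-client-rgw>.log | grep -o "v2:[0-9.]*:[0-9]*" | sort | uniq -c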

4.5.3. Performing default reads

You can perform default reads on a pool to retrieve data from the primary OSDs. When a default READ is issued on a pool, each read operation is served by the primary OSD of the placement group, regardless of which data center that OSD is in.

Prerequisites

  • A stretch cluster with two data centers and Ceph Object Gateway configured on both.
  • A user created with a bucket whose objects are stored on both primary and replica OSDs.

Procedure

  • To perform default reads, set rados_replica_read_policy to default for the Ceph Object Gateway client by using the ceph config set command.

    Example

    [ceph: root@host01 /]# ceph config set client.rgw.rgw.1 rados_replica_read_policy default

    When a GET operation is performed, the read is served by the primary OSD of the placement group.

  • Verification: Perform the following steps to verify that default reads are served from the OSD set.

    1. Run the ceph osd tree command to view the OSDs and the data centers.

      Example

      [ceph: root@host01 /]# ceph osd tree
      
      ID  CLASS  WEIGHT   TYPE NAME                                 STATUS  REWEIGHT  PRI-AFF
      -1         0.58557  root default
      -3         0.29279      datacenter DC1
      -2         0.09760          host ceph-ci-fbv67y-ammmck-node2
       2    hdd  0.02440              osd.2                             up   1.00000  1.00000
      11    hdd  0.02440              osd.11                            up   1.00000  1.00000
      17    hdd  0.02440              osd.17                            up   1.00000  1.00000
      22    hdd  0.02440              osd.22                            up   1.00000  1.00000
      -4         0.09760          host ceph-ci-fbv67y-ammmck-node3
       0    hdd  0.02440              osd.0                             up   1.00000  1.00000
       6    hdd  0.02440              osd.6                             up   1.00000  1.00000
      12    hdd  0.02440              osd.12                            up   1.00000  1.00000
      18    hdd  0.02440              osd.18                            up   1.00000  1.00000
      -5         0.09760          host ceph-ci-fbv67y-ammmck-node4
       5    hdd  0.02440              osd.5                             up   1.00000  1.00000
      10    hdd  0.02440              osd.10                            up   1.00000  1.00000
      16    hdd  0.02440              osd.16                            up   1.00000  1.00000
      23    hdd  0.02440              osd.23                            up   1.00000  1.00000
      -7         0.29279      datacenter DC2
      -6         0.09760          host ceph-ci-fbv67y-ammmck-node5
       3    hdd  0.02440              osd.3                             up   1.00000  1.00000
       8    hdd  0.02440              osd.8                             up   1.00000  1.00000
      14    hdd  0.02440              osd.14                            up   1.00000  1.00000
      20    hdd  0.02440              osd.20                            up   1.00000  1.00000
      -8         0.09760          host ceph-ci-fbv67y-ammmck-node6
       4    hdd  0.02440              osd.4                             up   1.00000  1.00000
       9    hdd  0.02440              osd.9                             up   1.00000  1.00000
      15    hdd  0.02440              osd.15                            up   1.00000  1.00000
      21    hdd  0.02440              osd.21                            up   1.00000  1.00000
      -9         0.09760          host ceph-ci-fbv67y-ammmck-node7
       1    hdd  0.02440              osd.1                             up   1.00000  1.00000
       7    hdd  0.02440              osd.7                             up   1.00000  1.00000
      13    hdd  0.02440              osd.13                            up   1.00000  1.00000
      19    hdd  0.02440              osd.19                            up   1.00000  1.00000

    2. Run the ceph orch command to identify the Ceph Object Gateway daemons in the data centers.

      Example

      [ceph: root@host01 /]# ceph orch ps | grep rgw
      
      rgw.rgw.1.ceph-ci-fbv67y-ammmck-node4.dmsmex         ceph-ci-fbv67y-ammmck-node4            *:80              running (4h)     10m ago  22h    93.3M        -  19.1.0-55.el9cp  0ee0a0ad94c7  34f27723ccd2
      rgw.rgw.1.ceph-ci-fbv67y-ammmck-node7.pocecp         ceph-ci-fbv67y-ammmck-node7            *:80              running (4h)     10m ago  22h    96.4M        -  19.1.0-55.el9cp  0ee0a0ad94c7  40e4f2a6d4c4

    3. Verify that a default read has taken place by opening the Ceph Object Gateway log, for example with the vim command.

      Example

      [ceph: root@host01 /]# vim /var/log/ceph/<fsid>/<ceph-client-rgw>.log
      
      2024-08-28T10:26:05.155+0000 7fe6b03dd640  1 ====== starting new request req=0x7fe6879674a0 =====
      2024-08-28T10:26:05.156+0000 7fe6b03dd640  1 -- 10.0.64.251:0/2235882725 --> [v2:10.0.65.171:6800/4255735352,v1:10.0.65.171:6801/4255735352] -- osd_op(unknown.0.0:1123 11.6d 11:b69767fc:::699c2d80-5683-43c5-bdcd-e8912107c176.24827.3_f1:head [getxattrs,stat] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e4513) -- 0x5639da653800 con 0x5639d804d800

      The osd_op entry carries neither the balance_reads nor the localize_reads flag, which shows that a default read has taken place. A grep-based check is sketched at the end of this procedure.

      Important

      To view these debug messages, you must first set debug_ms to 1 for the Ceph Object Gateway clients by running the ceph config set command.

      [ceph: root@host01 /]# ceph config set client.rgw.rgw.1.ceph-ci-gune2w-mysx73-node4.dgvrmx debug_ms 1/1

      [ceph: root@host01 /]# ceph config set client.rgw.rgw.1.ceph-ci-gune2w-mysx73-node7.rfkqqq debug_ms 1/1
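
      You can also confirm default reads from the command line. The following is a minimal sketch, assuming the log path placeholders shown above and that debug_ms is already set to 1; it lists recent osd_op read entries that carry neither the balance_reads nor the localize_reads flag, which is how default reads appear in the log.

      Example

      # Show recent OSD read operations that carry neither the balance_reads nor the localize_reads flag.
      [ceph: root@host01 /]# grep "osd_op" /var/log/ceph/<fsid>/<ceph-client-rgw>.log | grep -v -e "balance_reads" -e "localize_reads" | tail -n 5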