Chapter 4. Stretch clusters for Ceph storage
As a storage administrator, you can configure a two-site stretched cluster by enabling stretch mode in Ceph.
Red Hat Ceph Storage systems offer the option to expand the failure domain beyond the OSD level to a datacenter or cloud zone level.
The following diagram depicts a simplified representation of a Ceph cluster operating in stretch mode, where the tiebreaker host is provisioned in data center (DC) 3.
Figure 4.1. Stretch clusters for Ceph storage
A stretch cluster operates over a Wide Area Network (WAN), unlike a typical Ceph cluster, which operates over a Local Area Network (LAN). For illustration purposes, a data center is chosen as the failure domain, though this could also represent a cloud availability zone. Data Center 1 (DC1) and Data Center 2 (DC2) contain OSDs and Monitors within their respective domains, while Data Center 3 (DC3) contains only a single monitor. The latency between DC1 and DC2 should not exceed 10 ms RTT, as higher latency can significantly impact Ceph performance in terms of replication, recovery, and related operations. However, DC3—a non-data site typically hosted on a virtual machine—can tolerate higher latency compared to the two data sites. A stretch cluster, like the one in the diagram, can withstand a complete data center failure or a network partition between data centers as long as at least two sites remain connected.
There are no additional steps for powering down a stretch cluster. For more information, see Powering down and rebooting a Red Hat Ceph Storage cluster.
4.1. Stretch mode for a storage cluster
To improve availability in stretch clusters (geographically distributed deployments), you must enable stretch mode. When stretch mode is enabled, Ceph OSDs only take placement groups (PGs) active when the PGs peer across data centers, or whichever other CRUSH bucket type you specified, assuming both sites are active. Pool size increases from the default three to four, with two copies on each site.
In stretch mode, Ceph OSDs are only allowed to connect to monitors within the same data center. New monitors are not allowed to join the cluster without a specified location.
If all the OSDs and monitors from a data center become inaccessible at once, the surviving data center will enter a degraded stretch mode. This issues a warning, reduces the min_size to 1, and allows the cluster to reach an active state with the data from the remaining site.
Stretch mode is designed to handle netsplit scenarios between two data centers and the loss of one data center. Stretch mode handles the netsplit scenario by choosing the surviving data center with a better connection to the tiebreaker monitor. Stretch mode handles the loss of one data center by reducing the min_size of all pools to 1, allowing the cluster to continue operating with the remaining data center. When the lost data center comes back, the cluster will recover the lost data and return to normal operation.
In a stretch cluster, when a site goes down and the cluster enters a degraded state, the min_size of the pool may be temporarily reduced (for example, to 1) to allow the placement groups (PGs) to become active and continue serving I/O. However, the size of the pool remains unchanged. The peering_crush_bucket_count stretch mode flag ensures that PGs do not become active unless they are backed by OSDs in a minimum number of distinct CRUSH buckets (for example, different data centers). This mechanism prevents the system from creating redundant copies solely within the surviving site, ensuring that data is only fully replicated once the downed site recovers.
When the missing data center becomes accessible again, the cluster enters recovery stretch mode. This changes the warning and allows peering, but still requires only the OSDs from the data center that was up the whole time.
When all PGs are in a known state and are not degraded or incomplete, the cluster goes back to regular stretch mode, ends the warning, and restores min_size to its starting value of 2. The cluster again requires both sites to peer, not only the site that stayed up the whole time, so that you can fail over to the other site, if necessary.
Stretch mode limitations
- Exiting stretch mode requires moving all pools to a non-stretch CRUSH rule. For more information, see Exiting stretch mode.
- You cannot use erasure-coded pools with clusters in stretch mode. You can neither enter stretch mode with erasure-coded pools present, nor create an erasure-coded pool while stretch mode is active.
- Device classes are not supported in stretch mode. In the following example, the class hdd is not supported.

  Example

  rule stretch_replicated_rule {
      id 2
      type replicated
      class hdd
      step take default
      step choose firstn 0 type datacenter
      step chooseleaf firstn 2 type host
      step emit
  }

- To achieve the same weights on both sites, the Ceph OSDs deployed in the two sites should be of equal size; that is, the storage capacity in the first site is equivalent to the storage capacity in the second site.
- While it is not enforced, you should run two Ceph monitors on each site and a tiebreaker, for a total of five. This is because OSDs can only connect to monitors in their own site when in stretch mode.
- You must create your own CRUSH rule that provides two copies on each site, for a total of four copies across both sites.
- You cannot enable stretch mode if you have existing pools with non-default size or min_size.
- Because the cluster runs with min_size 1 when degraded, you should only use stretch mode with all-flash OSDs. This minimizes the time needed to recover once connectivity is restored, and minimizes the potential for data loss.
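The equal-weight expectation above can be spot-checked from `ceph osd tree` output. The following is a minimal sketch that compares the CRUSH weights of the two datacenter buckets; the sample lines and weight values are illustrative assumptions, and a real check would pipe in live `ceph osd tree` output instead:

```shell
# Compare CRUSH weights of the two datacenter buckets. The sample lines
# mimic the ID/WEIGHT/TYPE/NAME columns of `ceph osd tree` bucket rows
# (illustrative only; not live cluster output).
tree='-3 0.17505 datacenter DC1
-7 0.17505 datacenter DC2'
printf '%s\n' "$tree" | awk '
  $3 == "datacenter" { w[$4] = $2 }
  END {
    if (w["DC1"] == w["DC2"]) print "weights match: " w["DC1"]
    else print "weights differ: DC1=" w["DC1"] " DC2=" w["DC2"]
  }'
```

If the weights differ, rebalance by adding or resizing OSDs on the lighter site before enabling stretch mode.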
Stretch peering rule
In Ceph stretch cluster mode, a critical safeguard is enforced through the stretch peering rule, which ensures that a Placement Group (PG) cannot become active if all acting replicas reside within a single failure domain, such as a single data center or cloud availability zone.
This behavior is essential for protecting data integrity during site failures. If a PG were allowed to go active with all replicas confined to one site, write operations could be falsely acknowledged without true redundancy. In the event of a site outage, this would result in complete data loss for those PGs. By enforcing zone diversity in the acting set, Ceph stretch clusters maintain high availability while minimizing the risk of data inconsistency or loss.
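The effect of the stretch peering rule can be sketched as a count of distinct failure domains in a PG's acting set. The following illustration is a simplified model, not a Ceph interface; the OSD-to-datacenter mapping and the required bucket count are hypothetical values standing in for what CRUSH and peering_crush_bucket_count enforce internally:

```shell
# A PG may go active only if its acting replicas span at least the required
# number of distinct CRUSH buckets (here, datacenters). Hypothetical mapping.
acting_buckets='DC1 DC1 DC2 DC2'   # datacenter of each acting replica
required=2                         # e.g. peering_crush_bucket_count
distinct=$(printf '%s\n' $acting_buckets | sort -u | wc -l)
if [ "$distinct" -ge "$required" ]; then
  echo "PG can become active ($distinct distinct buckets)"
else
  echo "PG blocked: replicas confined to $distinct bucket(s)"
fi
```

With all four replicas in DC1, the count would be 1 and the PG would stay inactive, which is exactly the false-acknowledgement scenario the rule prevents.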
4.2. Deployment requirements
This section details the hardware, software, and network requirements for deploying a generalized stretch cluster configuration across three availability zones.
Software requirements
Red Hat Ceph Storage 8.1
Hardware requirements
Ensure that the following minimum requirements are met before deploying a stretch cluster configuration.
| Hardware criteria | Minimum and recommended |
|---|---|
| Processor | |
| RAM | |
| Network | A single 1 Gb/s (bonded 10+ Gb/s recommended). |
| Hardware criteria | Minimum and recommended |
|---|---|
| Processor | 2 cores minimum |
| Storage drives | 100 GB per daemon. SSD is recommended. |
| Network | A single 1 Gb/s (10+ Gb/s recommended) |
| Hardware criteria | Minimum and recommended |
|---|---|
| Processor | 2 cores minimum |
| RAM | 2 GB per daemon (more for production) |
| Disk space | 1 GB per daemon |
| Network | A single 1 Gb/s (10+ Gb/s recommended) |
Daemon placement
The following table lists the daemon placement details across various hosts and data centers.
| Hostname | Data center | Services |
|---|---|---|
| host01 | DC1 | OSD+MON+MGR |
| host02 | DC1 | OSD+MON+MGR |
| host03 | DC1 | OSD+MDS+RGW |
| host04 | DC2 | OSD+MON+MGR |
| host05 | DC2 | OSD+MON+MGR |
| host06 | DC2 | OSD+MDS+RGW |
| host07 | DC3 (Tiebreaker) | MON |
Network configuration requirements
Ensure that the following network configuration requirements are met before deploying a stretch cluster configuration.
You can use different subnets for each of the data centers.
- Have two separate networks, one public network and one cluster network.
- The latencies between data centers that run the Ceph Object Storage Devices (OSDs) cannot exceed 10 ms RTT.
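The 10 ms RTT limit can be sanity-checked by parsing the average RTT from a ping summary. The following sketch operates on a sample summary line for testability; in practice you would run something like `ping -c 20 <peer-host>` from a DC1 node to a DC2 node and parse its final line (the figures shown are illustrative):

```shell
# Parse the average RTT (ms) from a `ping` summary line and compare it
# against the 10 ms stretch-cluster limit. Sample text, not live output.
summary='rtt min/avg/max/mdev = 0.412/0.617/0.853/0.141 ms'
avg=$(printf '%s\n' "$summary" | awk -F'/' '{print $5}')
if awk -v a="$avg" 'BEGIN { exit !(a < 10) }'; then
  echo "OK: average RTT ${avg} ms is under 10 ms"
else
  echo "WARNING: average RTT ${avg} ms exceeds 10 ms"
fi
```

Measure during peak traffic as well as idle periods, because sustained latency above the limit degrades replication and recovery.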
The following is an example of a basic network configuration:
DC1
Ceph public/private network: 10.0.40.0/24
DC2
Ceph public/private network: 10.0.40.0/24
Tiebreaker
Ceph public/private network: 10.0.40.0/24
Cluster setup requirements
Ensure that the hostname is configured by using the bare or short hostname on all hosts.
Syntax
hostnamectl set-hostname SHORT_NAME
When run on any node, the hostname command must return only the short hostname. If the FQDN is returned, the cluster configuration will not succeed.
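The short-hostname requirement can be checked with a simple pattern test before bootstrapping. A minimal sketch (the host names shown are examples, not hosts from your cluster):

```shell
# Flag FQDNs: a short host name contains no dots. On a failing node, rerun
# `hostnamectl set-hostname` with the short form.
check_hostname() {
  case "$1" in
    *.*) echo "$1: FQDN detected - reconfigure with the short name" ;;
    *)   echo "$1: OK (short name)" ;;
  esac
}
check_hostname host01                  # short name, passes
check_hostname host01.example.com     # FQDN, flagged
```

On a live node you would call `check_hostname "$(hostname)"` instead of the literal examples.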
4.3. Setting the CRUSH location for the daemons
Before you enter stretch mode, you need to prepare the cluster by setting the CRUSH location for the daemons in the Red Hat Ceph Storage cluster. There are two ways to do this:
- Bootstrap the cluster through a service configuration file, where the locations are added to the hosts as part of deployment.
- Set the locations manually through the ceph osd crush add-bucket and ceph osd crush move commands after the cluster is deployed.
Method 1: Bootstrapping the cluster
Prerequisites
- Root-level access to the nodes.
Procedure
If you are bootstrapping your new storage cluster, you can create the service configuration .yaml file that adds the nodes to the Red Hat Ceph Storage cluster and also sets specific labels for where the services should run:

Example

service_type: host
addr: host01
hostname: host01
location:
  root: default
  datacenter: DC1
labels:
  - osd
  - mon
  - mgr
---
service_type: host
addr: host02
hostname: host02
location:
  datacenter: DC1
labels:
  - osd
  - mon
---
service_type: host
addr: host03
hostname: host03
location:
  datacenter: DC1
labels:
  - osd
  - mds
  - rgw
---
service_type: host
addr: host04
hostname: host04
location:
  root: default
  datacenter: DC2
labels:
  - osd
  - mon
  - mgr
---
service_type: host
addr: host05
hostname: host05
location:
  datacenter: DC2
labels:
  - osd
  - mon
---
service_type: host
addr: host06
hostname: host06
location:
  datacenter: DC2
labels:
  - osd
  - mds
  - rgw
---
service_type: host
addr: host07
hostname: host07
labels:
  - mon
---
service_type: mon
placement:
  label: "mon"
---
service_type: mds
service_id: cephfs
placement:
  label: "mds"
---
service_type: mgr
service_name: mgr
placement:
  label: "mgr"
---
service_type: osd
service_id: all-available-devices
service_name: osd.all-available-devices
placement:
  label: "osd"
spec:
  data_devices:
    all: true
---
service_type: rgw
service_id: objectgw
service_name: rgw.objectgw
placement:
  count: 2
  label: "rgw"
spec:
  rgw_frontend_port: 8080

Bootstrap the storage cluster with the --apply-spec option:

Syntax
cephadm bootstrap --apply-spec CONFIGURATION_FILE_NAME --mon-ip MONITOR_IP_ADDRESS --ssh-private-key PRIVATE_KEY --ssh-public-key PUBLIC_KEY --registry-url REGISTRY_URL --registry-username USER_NAME --registry-password PASSWORD

Example

[root@host01 ~]# cephadm bootstrap --apply-spec initial-config.yaml --mon-ip 10.10.128.68 --ssh-private-key /home/ceph/.ssh/id_rsa --ssh-public-key /home/ceph/.ssh/id_rsa.pub --registry-url registry.redhat.io --registry-username myuser1 --registry-password mypassword1

Important

You can use different command options with the cephadm bootstrap command. However, always include the --apply-spec option to use the service configuration file and configure the host locations.
Method 2: Setting the locations after the deployment
Prerequisites
- Root-level access to the nodes.
Procedure
Add the two buckets to which you plan to set the location of your non-tiebreaker monitors to the CRUSH map, specifying the bucket type as datacenter:

Syntax

ceph osd crush add-bucket BUCKET_NAME BUCKET_TYPE

Example

[ceph: root@host01 /]# ceph osd crush add-bucket DC1 datacenter
[ceph: root@host01 /]# ceph osd crush add-bucket DC2 datacenter

Move the buckets under root=default:

Syntax

ceph osd crush move BUCKET_NAME root=default

Example

[ceph: root@host01 /]# ceph osd crush move DC1 root=default
[ceph: root@host01 /]# ceph osd crush move DC2 root=default

Move the OSD hosts according to the required CRUSH placement:

Syntax

ceph osd crush move HOST datacenter=DATACENTER

Example

[ceph: root@host01 /]# ceph osd crush move host01 datacenter=DC1
4.3.1. Setting the CRUSH location during bootstrap
For hosts with multiple public networks, specify the public networks in CIDR format in the configuration file. Set the monitor CRUSH location in the specification file. Provide this information during the bootstrap procedure.
For more information about Ceph bootstrapping and different cephadm bootstrap command options, see Bootstrapping a new storage cluster.
Prerequisites
Before you begin, be sure that you have root-level access to the nodes.
Procedure
Create a cluster-spec.yaml file. The specification file adds the nodes to the Red Hat Ceph Storage cluster and also sets specific labels for where the services run.

Example

service_type: host
addr: <host03 address>
hostname: host03
location:
  datacenter: DC1
labels:
  - osd
  - mds
  - rgw
---
service_type: host
addr: <host04 address>
hostname: host04
location:
  root: default
  datacenter: DC2
labels:
  - osd
  - mon
  - mgr
---
service_type: host
addr: host05
hostname: host05
location:
  datacenter: DC2
labels:
  - osd
  - mon
  - mgr
---
service_type: host
addr: <host06 address>
hostname: host06
location:
  datacenter: DC2
labels:
  - osd
  - mds
  - rgw
---
service_type: host
addr: <host07 address>
hostname: host07
labels:
  - mon
---
service_type: mon
spec:
  crush_locations:
    host01:
      - datacenter=DC1
    host02:
      - datacenter=DC1
    host04:
      - datacenter=DC2
    host05:
      - datacenter=DC2
placement:
  label: mon
---
service_type: mgr
service_name: mgr
placement:
  label: "mgr"
---
service_type: osd
service_id: all-available-devices
service_name: osd.all-available-devices
placement:
  label: "osd"
spec:
  data_devices:
    all: true
---
service_type: rgw
service_id: objectgw
service_name: rgw.objectgw
placement:
  count: 2
  label: "rgw"
spec:
  rgw_frontend_port: 8080

Run the cephadm bootstrap command as the root user on the node that will serve as the initial Monitor node in the cluster. The MONITOR_IP_ADDRESS value is the IP address of the node on which you run the cephadm bootstrap command. Run one of the following commands, based on your configuration needs.

For hosts that are present on the same network:

Syntax

cephadm bootstrap --apply-spec CONFIGURATION_FILE_NAME --mon-ip MONITOR_IP_ADDRESS --ssh-private-key PRIVATE_KEY --ssh-public-key PUBLIC_KEY --registry-url REGISTRY_URL --registry-username USER_NAME --registry-password PASSWORD

For hosts that are present on different networks:

Syntax

cephadm bootstrap --apply-spec CONFIGURATION_FILE_NAME --mon-ip MONITOR_IP_ADDRESS --ssh-private-key PRIVATE_KEY --ssh-public-key PUBLIC_KEY --registry-url REGISTRY_URL --registry-username USER_NAME --registry-password PASSWORD --config ceph.conf
ceph.conffile defines the public network settings for the cluster:[global] public_network = 10.1.172.0/23,10.0.64.0/22After the bootstrap process completes, the following output is emitted.
Or, if you are only running a single cluster on this host: sudo /usr/sbin/cephadm shell Please consider enabling telemetry to help improve Ceph: ceph telemetry on For more information see: https://docs.ceph.com/en/latest/mgr/telemetry/ Bootstrap complete.
Verification
Verify the network configuration.

Syntax

ceph config dump | grep network

In the following example, the output confirms that there are multiple networks and that 10.1.172.0/23 and 10.0.64.0/22 are properly configured.

Example

[ceph: root@host01 /]# ceph config dump | grep network
global advanced public_network 10.1.172.0/23,10.0.64.0/22 *

Verify that the status of the storage cluster deployment is in the HEALTH_OK state.

Example

[ceph: root@host01 /]# ceph -s
  cluster:
    id:     ff19789c-f5c7-11ef-8e1c-fa163e4e1f7e
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum host01,host05,host02,host07-installer,host04 (age 10m)
    mgr: host05.aswlzq(active, since 43m), standbys: host02.ctajlt, host01.napqyw, host04.wdglem
    mds: 1/1 daemons up, 1 standby
    osd: 24 osds: 24 up (since 31m), 24 in (since 32m)
    rgw: 4 daemons active (2 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   7 pools, 1019 pgs
    objects: 216 objects, 456 KiB
    usage:   2.7 GiB used, 357 GiB / 360 GiB avail

Verify that all the nodes that were added in the cluster-spec.yaml file are added to the cluster.

Syntax

ceph orch host ls

Example

[ceph: root@host01 /]# ceph orch host ls
HOST    ADDR         LABELS       STATUS
host01  10.0.56.37   mgr,mon,osd
host02  10.0.59.35   mgr,mon,osd
host03  10.0.58.106  osd,mds,rgw
host04  10.0.56.13   osd,mon,mgr
host05  10.0.59.188  mgr,mon,osd
host06  10.0.56.223  rgw,mds,osd
host07  10.0.56.189  _admin,mon
7 hosts in cluster

Use the ceph osd tree command to verify the following:

- CRUSH locations for the OSD hosts
- That each host has one OSD configured
- That each host OSD is in the up state
- That each node is in the correct data center bucket

Example

[ceph: root@host01 /]# ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME            STATUS  REWEIGHT  PRI-AFF
 -1         0.35010  root default
 -3         0.17505      datacenter DC1
 -2         0.05835          host host01
  5    hdd  0.01459              osd.5        up   1.00000  1.00000
 11    hdd  0.01459              osd.11       up   1.00000  1.00000
 17    hdd  0.01459              osd.17       up   1.00000  1.00000
 23    hdd  0.01459              osd.23       up   1.00000  1.00000
 -4         0.05835          host host02
  1    hdd  0.01459              osd.1        up   1.00000  1.00000
  6    hdd  0.01459              osd.6        up   1.00000  1.00000
 12    hdd  0.01459              osd.12       up   1.00000  1.00000
 18    hdd  0.01459              osd.18       up   1.00000  1.00000
 -5         0.05835          host host03
  3    hdd  0.01459              osd.3        up   1.00000  1.00000
 10    hdd  0.01459              osd.10       up   1.00000  1.00000
 16    hdd  0.01459              osd.16       up   1.00000  1.00000
 22    hdd  0.01459              osd.22       up   1.00000  1.00000
 -7         0.17505      datacenter DC2
 -6         0.05835          host host04
  2    hdd  0.01459              osd.2        up   1.00000  1.00000
  8    hdd  0.01459              osd.8        up   1.00000  1.00000
 14    hdd  0.01459              osd.14       up   1.00000  1.00000
 20    hdd  0.01459              osd.20       up   1.00000  1.00000
 -8         0.05835          host host05
  0    hdd  0.01459              osd.0        up   1.00000  1.00000
  7    hdd  0.01459              osd.7        up   1.00000  1.00000
 13    hdd  0.01459              osd.13       up   1.00000  1.00000
 19    hdd  0.01459              osd.19       up   1.00000  1.00000
 -9         0.05835          host host06
  4    hdd  0.01459              osd.4        up   1.00000  1.00000
  9    hdd  0.01459              osd.9        up   1.00000  1.00000
 15    hdd  0.01459              osd.15       up   1.00000  1.00000
 21    hdd  0.01459              osd.21       up   1.00000  1.00000

Verify the CRUSH locations for the MON hosts. Check the mon map to ensure that each MON host has a crush_location specified.

Syntax

ceph mon dump

The output displays details about the MON map, including the crush_location for each host.

Example

[ceph: root@host01 /]# ceph mon dump
epoch 5
fsid 4158287e-169e-11f0-b1ad-fa163e98b991
last_changed 2025-04-11T06:32:20.332479+0000
created 2025-04-11T06:29:24.974553+0000
min_mon_release 19 (squid)
election_strategy: 1
0: [v2:10.0.57.33:3300/0,v1:10.0.57.33:6789/0] mon.host07
1: [v2:10.0.58.200:3300/0,v1:10.0.58.200:6789/0] mon.host05; crush_location {datacenter=DC2}
2: [v2:10.0.58.47:3300/0,v1:10.0.58.47:6789/0] mon.host02; crush_location {datacenter=DC1}
3: [v2:10.0.58.104:3300/0,v1:10.0.58.104:6789/0] mon.host04; crush_location {datacenter=DC2}
4: [v2:10.0.58.38:3300/0,v1:10.0.58.38:6789/0] mon.host01; crush_location {datacenter=DC1}
Alternatively, set the locations manually through the ceph osd crush add-bucket and ceph osd crush move commands after the cluster is deployed.
Prerequisites
Before you begin, be sure that you have root-level access to the nodes.
Procedure
Add the two buckets to which you plan to set the location of your non-tiebreaker monitors to the CRUSH map. Specify the bucket type as datacenter.

Syntax

ceph osd crush add-bucket BUCKET_NAME datacenter

Example

[ceph: root@host01 /]# ceph osd crush add-bucket DC1 datacenter
[ceph: root@host01 /]# ceph osd crush add-bucket DC2 datacenter

Move each of the buckets to root=default.

Syntax

ceph osd crush move BUCKET_NAME root=default

Example

[ceph: root@host01 /]# ceph osd crush move DC1 root=default
[ceph: root@host01 /]# ceph osd crush move DC2 root=default

Move the OSD hosts according to the required CRUSH placement.

Syntax

ceph osd crush move HOST datacenter=DATACENTER

Example

[ceph: root@host01 /]# ceph osd crush move host01 datacenter=DC1

Verify the CRUSH locations for the OSD hosts.

Example

[ceph: root@host01 /]# ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME            STATUS  REWEIGHT  PRI-AFF
 -1         0.35010  root default
 -3         0.17505      datacenter DC1
 -2         0.05835          host host01
  5    hdd  0.01459              osd.5        up   1.00000  1.00000
 11    hdd  0.01459              osd.11       up   1.00000  1.00000
 17    hdd  0.01459              osd.17       up   1.00000  1.00000
 23    hdd  0.01459              osd.23       up   1.00000  1.00000
 -4         0.05835          host host02
  1    hdd  0.01459              osd.1        up   1.00000  1.00000
  6    hdd  0.01459              osd.6        up   1.00000  1.00000
 12    hdd  0.01459              osd.12       up   1.00000  1.00000
 18    hdd  0.01459              osd.18       up   1.00000  1.00000
 -5         0.05835          host host03
  3    hdd  0.01459              osd.3        up   1.00000  1.00000
 10    hdd  0.01459              osd.10       up   1.00000  1.00000
 16    hdd  0.01459              osd.16       up   1.00000  1.00000
 22    hdd  0.01459              osd.22       up   1.00000  1.00000
 -7         0.17505      datacenter DC2
 -6         0.05835          host host04
  2    hdd  0.01459              osd.2        up   1.00000  1.00000
  8    hdd  0.01459              osd.8        up   1.00000  1.00000
 14    hdd  0.01459              osd.14       up   1.00000  1.00000
 20    hdd  0.01459              osd.20       up   1.00000  1.00000
 -8         0.05835          host host05
  0    hdd  0.01459              osd.0        up   1.00000  1.00000
  7    hdd  0.01459              osd.7        up   1.00000  1.00000
 13    hdd  0.01459              osd.13       up   1.00000  1.00000
 19    hdd  0.01459              osd.19       up   1.00000  1.00000
 -9         0.05835          host host06
  4    hdd  0.01459              osd.4        up   1.00000  1.00000
  9    hdd  0.01459              osd.9        up   1.00000  1.00000
 15    hdd  0.01459              osd.15       up   1.00000  1.00000
 21    hdd  0.01459              osd.21       up   1.00000  1.00000

Set the location of each monitor, matching your CRUSH map.

Syntax

ceph mon set_location HOST datacenter=DATACENTER

Example

[ceph: root@host01 /]# ceph mon set_location host01 datacenter=DC1
[ceph: root@host01 /]# ceph mon set_location host02 datacenter=DC1
[ceph: root@host01 /]# ceph mon set_location host04 datacenter=DC2
[ceph: root@host01 /]# ceph mon set_location host05 datacenter=DC2

Verify the CRUSH locations for the MON hosts. Check the mon map to ensure that each MON host has a crush_location specified.

Syntax

ceph mon dump

The output displays details about the MON map, including the crush_location for each host.

Example

[ceph: root@host01 /]# ceph mon dump
epoch 5
fsid 4158287e-169e-11f0-b1ad-fa163e98b991
last_changed 2025-04-11T06:32:20.332479+0000
created 2025-04-11T06:29:24.974553+0000
min_mon_release 19 (squid)
election_strategy: 1
0: [v2:10.0.57.33:3300/0,v1:10.0.57.33:6789/0] mon.host07
1: [v2:10.0.58.200:3300/0,v1:10.0.58.200:6789/0] mon.host05; crush_location {datacenter=DC2}
2: [v2:10.0.58.47:3300/0,v1:10.0.58.47:6789/0] mon.host02; crush_location {datacenter=DC1}
3: [v2:10.0.58.104:3300/0,v1:10.0.58.104:6789/0] mon.host04; crush_location {datacenter=DC2}
4: [v2:10.0.58.38:3300/0,v1:10.0.58.38:6789/0] mon.host01; crush_location {datacenter=DC1}
4.4. Configuring a CRUSH map for stretch mode
Use this information to configure a CRUSH map for stretch mode.
Prerequisites
Before you begin, make sure that you have the following prerequisites in place:
- Root-level access to the nodes.
- The CRUSH location is set to the hosts.
Procedure
Install the ceph-base RPM package in order to use the crushtool command, which you need to create a CRUSH rule that makes use of this OSD CRUSH topology.

Syntax

dnf -y install ceph-base

Get the compiled CRUSH map from the cluster.

Syntax

ceph osd getcrushmap > /etc/ceph/crushmap.bin

Decompile the CRUSH map and convert it to a text file so that you can edit it.

Syntax

crushtool -d /etc/ceph/crushmap.bin -o /etc/ceph/crushmap.txt

Add the following rule at the end of the /etc/ceph/crushmap.txt file. This rule distributes reads and writes evenly across the data centers.

Syntax

rule stretch_rule {
    id 1
    type replicated
    step take default
    step choose firstn 0 type datacenter
    step chooseleaf firstn 2 type host
    step emit
}

Optionally, give the cluster a read/write affinity towards data center 1.

Syntax

rule stretch_rule {
    id 1
    type replicated
    step take DC1
    step chooseleaf firstn 2 type host
    step emit
    step take DC2
    step chooseleaf firstn 2 type host
    step emit
}

The declared CRUSH rule contains the following information:

rule name
  Description: A unique name for identifying the rule.
  Value: stretch_rule

id
  Description: A unique whole number for identifying the rule.
  Value: 1

type
  Description: Describes whether the rule is for a replicated or erasure-coded pool.
  Value: replicated

step take default
  Description: Takes the root bucket called default, and begins iterating down the tree.

step take DC1
  Description: Takes the bucket called DC1, and begins iterating down the tree.

step choose firstn 0 type datacenter
  Description: Selects the datacenter buckets, and goes into their subtrees.

step chooseleaf firstn 2 type host
  Description: Selects the number of buckets of the given type. In this case, it is two different hosts located in the datacenter it entered at the previous level.

step emit
  Description: Outputs the current value and empties the stack. Typically used at the end of a rule, but may also be used to pick from different trees in the same rule.
Compile the new CRUSH map from /etc/ceph/crushmap.txt and convert it to a binary file /etc/ceph/crushmap2.bin.

Syntax

crushtool -c /path/to/crushmap.txt -o /path/to/crushmap2.bin

Example

[ceph: root@host01 /]# crushtool -c /etc/ceph/crushmap.txt -o /etc/ceph/crushmap2.bin

Inject the newly created CRUSH map back into the cluster.

Syntax

ceph osd setcrushmap -i /path/to/compiled_crushmap

Example

[ceph: root@host01 /]# ceph osd setcrushmap -i /path/to/compiled_crushmap
17

Note

The number 17 is a counter and increases (18, 19, and so on) depending on the changes that are made to the CRUSH map.
Verifying
Verify that the newly created stretch_rule is available for use.
Syntax
ceph osd crush rule ls
Example
[ceph: root@host01 /]# ceph osd crush rule ls
replicated_rule
stretch_rule
4.4.1. Changing stretch mode
Change the stretch mode state by entering or exiting stretch mode as needed to support your cluster's availability and data-placement requirements.
4.4.1.1. Entering stretch mode
Stretch mode is designed to handle two sites. With two-site clusters, there is a lesser risk of component availability outages.
Prerequisites
Before you begin, make sure that you have the following prerequisites in place:
- Root-level access to the nodes.
- The CRUSH location is set to the hosts.
- The CRUSH map is configured to include the stretch rule.
- No erasure-coded pools exist in the cluster.
- The weights of the two sites are the same.
Procedure
Check the current election strategy being used by the monitors.

Syntax

ceph mon dump | grep election_strategy

Note

The Ceph cluster election_strategy is set to 1, by default.

Example

[ceph: root@host01 /]# ceph mon dump | grep election_strategy
dumped monmap epoch 9
election_strategy: 1

Change the election strategy to connectivity.

Syntax

ceph mon set election_strategy connectivity

For more information about configuring the election strategy, see Configuring monitor election strategy.

Use the ceph mon dump command to verify that the election strategy was updated to 3.

Example

[ceph: root@host01 /]# ceph mon dump | grep election_strategy
dumped monmap epoch 22
election_strategy: 3

Set the location of the tiebreaker monitor in a third data center, separate from the two data sites.
Syntax

ceph mon set_location TIEBREAKER_HOST datacenter=DC3

Example

[ceph: root@host01 /]# ceph mon set_location host07 datacenter=DC3

Verify that the tiebreaker monitor is set as expected.

Syntax

ceph mon dump

Example

[ceph: root@host01 /]# ceph mon dump
epoch 8
fsid 4158287e-169e-11f0-b1ad-fa163e98b991
last_changed 2025-04-11T07:14:48.652801+0000
created 2025-04-11T06:29:24.974553+0000
min_mon_release 19 (squid)
election_strategy: 3
0: [v2:10.0.57.33:3300/0,v1:10.0.57.33:6789/0] mon.host07; crush_location {datacenter=DC3}
1: [v2:10.0.58.200:3300/0,v1:10.0.58.200:6789/0] mon.host05; crush_location {datacenter=DC2}
2: [v2:10.0.58.47:3300/0,v1:10.0.58.47:6789/0] mon.host02; crush_location {datacenter=DC1}
3: [v2:10.0.58.104:3300/0,v1:10.0.58.104:6789/0] mon.host04; crush_location {datacenter=DC2}
4: [v2:10.0.58.38:3300/0,v1:10.0.58.38:6789/0] mon.host01; crush_location {datacenter=DC1}
dumped monmap epoch 8

Enter stretch mode.
Syntax
ceph mon enable_stretch_mode TIEBREAKER_HOST STRETCH_RULE STRETCH_BUCKET

In the following example:

- The tiebreaker node is set as host07.
- The stretch rule is stretch_rule, as created in Configuring a CRUSH map for stretch mode.
- The stretch bucket is set as datacenter.

Example
[ceph: root@host01 /]# ceph mon enable_stretch_mode host07 stretch_rule datacenter
Verifying
Verify that stretch mode was implemented correctly by continuing to Verifying stretch mode.
4.4.1.2. Exiting stretch mode
Disable stretch mode by moving pools to a specified CRUSH rule or to the default replicated rule.
Procedure
Disable stretch mode. You can specify a CRUSH rule to move all pools to. If you do not specify a rule, Ceph moves the pools to the default replicated CRUSH rule.
Syntax
ceph mon disable_stretch_mode CRUSH_RULE --yes-i-really-mean-it
4.4.2. Verifying stretch mode
Use this information to verify that stretch mode was created correctly with the implemented CRUSH rules.
Procedure
Verify that all pools are using the CRUSH rule that was created in the Ceph cluster. In these examples, the CRUSH rule is set as stretch_rule, per the settings that were created in Configuring a CRUSH map for stretch mode.

Syntax

for pool in $(rados lspools);do echo -n "Pool: ${pool}; ";ceph osd pool get ${pool} crush_rule;done

Example

[ceph: root@host01 /]# for pool in $(rados lspools);do echo -n "Pool: ${pool}; ";ceph osd pool get ${pool} crush_rule;done
Pool: device_health_metrics; crush_rule: stretch_rule
Pool: cephfs.cephfs.meta; crush_rule: stretch_rule
Pool: cephfs.cephfs.data; crush_rule: stretch_rule
Pool: .rgw.root; crush_rule: stretch_rule
Pool: default.rgw.log; crush_rule: stretch_rule
Pool: default.rgw.control; crush_rule: stretch_rule
Pool: default.rgw.meta; crush_rule: stretch_rule
Pool: rbdpool; crush_rule: stretch_rule

Verify that stretch mode is enabled. Ensure that stretch_mode_enabled is set to true.

Syntax

ceph osd dump

The output includes the following information:

stretch_mode_enabled
  Set to true if stretch mode is enabled.

stretch_bucket_count
  The number of data centers with OSDs.

degraded_stretch_mode
  Output of 0 if not degraded. If the stretch mode is degraded, this outputs the number of up sites.

recovering_stretch_mode
  Output of 0 if not recovering. If the stretch mode is recovering, the output is 1.

stretch_mode_bucket
  A unique value set for each CRUSH bucket type. This value is usually set to 8, for data center.

Example

"stretch_mode": {
    "stretch_mode_enabled": true,
    "stretch_bucket_count": 2,
    "degraded_stretch_mode": 0,
    "recovering_stretch_mode": 1,
    "stretch_mode_bucket": 8
}
Verify that stretch mode is reflected in the mon map by using the ceph mon dump command.

Ensure the following:

- stretch_mode_enabled is set to 1
- The correct mon host is set as tiebreaker_mon
- The correct mon host is set as disallowed_leaders

Syntax

ceph mon dump

Example

[ceph: root@host01 /]# ceph mon dump
epoch 16
fsid ff19789c-f5c7-11ef-8e1c-fa163e4e1f7e
last_changed 2025-02-28T12:12:51.089706+0000
created 2025-02-28T11:34:59.325503+0000
min_mon_release 19 (squid)
election_strategy: 3
stretch_mode_enabled 1
tiebreaker_mon host07
disallowed_leaders host07
0: [v2:10.0.56.37:3300/0,v1:10.0.56.37:6789/0] mon.host01; crush_location {datacenter=DC1}
1: [v2:10.0.59.188:3300/0,v1:10.0.59.188:6789/0] mon.host05; crush_location {datacenter=DC2}
2: [v2:10.0.59.35:3300/0,v1:10.0.59.35:6789/0] mon.host02; crush_location {datacenter=DC1}
3: [v2:10.0.56.189:3300/0,v1:10.0.56.189:6789/0] mon.host07; crush_location {datacenter=DC3}
4: [v2:10.0.56.13:3300/0,v1:10.0.56.13:6789/0] mon.host04; crush_location {datacenter=DC2}
dumped monmap epoch 16
What to do next
- Deploy, configure, and administer a Ceph Object Gateway. For more information, see Ceph Object Gateway.
- Manage, create, configure, and use Ceph Block Devices. For more information, see Ceph block devices.
- Create, mount, and work with the Ceph File System (CephFS). For more information, see Ceph File Systems.
4.5. Using and maintaining stretch mode
Use and maintain stretch mode by adding OSD hosts, managing data center monitor service (mon) hosts, and replacing the tiebreaker monitor, either with a monitor in quorum or with a new monitor.
4.5.1. Adding OSD hosts in stretch mode
You can add Ceph OSDs in stretch mode. The procedure is similar to adding OSD hosts on a cluster where stretch mode is not enabled.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Stretch mode is enabled on the cluster.
- Root-level access to the nodes.
Procedure
List the available devices to deploy OSDs:
Syntax
ceph orch device ls [--hostname=HOST_1 HOST_2] [--wide] [--refresh]
Example
[ceph: root@host01 /]# ceph orch device ls
Deploy the OSDs on specific hosts or on all the available devices:
Create an OSD from a specific device on a specific host:
Syntax
ceph orch daemon add osd HOST:DEVICE_PATH
Example
[ceph: root@host01 /]# ceph orch daemon add osd host03:/dev/sdb
Deploy OSDs on any available and unused devices:
Important
This command creates collocated WAL and DB devices. If you want to create non-collocated devices, do not use this command.
Example
[ceph: root@host01 /]# ceph orch apply osd --all-available-devices
Move the OSD hosts under the CRUSH bucket:
Syntax
ceph osd crush move HOST datacenter=DATACENTER
Example
[ceph: root@host01 /]# ceph osd crush move host03 datacenter=DC1
[ceph: root@host01 /]# ceph osd crush move host06 datacenter=DC2
Note
Ensure you add the same topology nodes on both sites. Issues might arise if hosts are added only on one site.
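Because both sites must end up with matching topology, it can help to derive the crush move commands from a host-to-data-center map and review them before running anything. A small sketch (the hostnames and the map are illustrative); it only prints the commands:

```shell
# Sketch: print the `ceph osd crush move` commands from a host/datacenter
# map so an asymmetric layout is easy to spot before applying it.
# Hostnames and the mapping are illustrative.
while read -r host dc; do
  printf 'ceph osd crush move %s datacenter=%s\n' "$host" "$dc"
done <<'EOF'
host03 DC1
host06 DC2
EOF
```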
4.5.2. Managing data center monitor service hosts in stretch mode
Use this information to add and remove data center monitor service (mon) hosts in stretch mode. You can manage these hosts by using the service specification file or directly on the Ceph cluster.
Prerequisites
Before you begin, make sure that you have the following prerequisites in place:
- A running Red Hat Ceph Storage cluster
- Stretch mode is enabled on the cluster
- Root-level access to the nodes.
These steps detail how to add a mon service by using the service specification file. To remove a service, follow the same steps, updating the specification file to remove the relevant host information.
Procedure
Export the specification file for mon and save the output to mon-spec.yaml.
Syntax
ceph orch ls mon --export > mon-spec.yaml
After the file is exported, the YAML file can be edited.
Add the new host details. In the following example, host08 is being added to the cluster into the DC2 data center bucket.
Syntax
service_type: host
addr: 10.1.172.225
hostname: host08
labels:
- mon
---
service_type: mon
service_name: mon
placement:
  label: mon
spec:
  crush_locations:
    host01:
    - datacenter=DC1
    host02:
    - datacenter=DC1
    host03:
    - datacenter=DC1
    host04:
    - datacenter=DC2
    host05:
    - datacenter=DC2
    host06:
    - datacenter=DC2
    host08:
    - datacenter=DC2
Apply the specification file.
Syntax
ceph orch apply -i mon-spec.yaml
Example
[ceph: root@host01 /]# ceph orch apply -i mon-spec.yaml
Added host 'host08' with addr '10.1.172.225'
Scheduled mon update...
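Before applying an edited specification, it can be worth confirming that every host listed under crush_locations has a data center assigned, since a host without a location will not peer correctly across sites. A sketch against an embedded, abbreviated spec fragment (the filename mon-spec-sample.yaml and the fragment are illustrative):

```shell
# Sketch: check that each host entry under crush_locations carries a
# datacenter=... CRUSH location. The fragment is abbreviated and the
# filename is illustrative; run the greps against your real spec file.
cat > mon-spec-sample.yaml <<'EOF'
spec:
  crush_locations:
    host01:
    - datacenter=DC1
    host04:
    - datacenter=DC2
    host08:
    - datacenter=DC2
EOF
hosts=$(grep -c '^    host' mon-spec-sample.yaml)
locs=$(grep -c 'datacenter=' mon-spec-sample.yaml)
echo "hosts=$hosts locations=$locs"   # the two counts should match
```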
Verifying
Use the ceph mon dump command to verify that the mon service was deployed and that the appropriate CRUSH location was added to the monitor.
Example
[ceph: root@host01 /]# ceph mon dump
epoch 16
fsid ff19789c-f5c7-11ef-8e1c-fa163e4e1f7e
last_changed 2025-02-28T12:12:51.089706+0000
created 2025-02-28T11:34:59.325503+0000
min_mon_release 19 (squid)
election_strategy: 3
stretch_mode_enabled 1
tiebreaker_mon host07
disallowed_leaders host07
0: [v2:10.0.56.37:3300/0,v1:10.0.56.37:6789/0] mon.host01; crush_location {datacenter=DC1}
1: [v2:10.0.59.188:3300/0,v1:10.0.59.188:6789/0] mon.host05; crush_location {datacenter=DC2}
2: [v2:10.0.59.35:3300/0,v1:10.0.59.35:6789/0] mon.host02; crush_location {datacenter=DC1}
3: [v2:10.0.56.189:3300/0,v1:10.0.56.189:6789/0] mon.host07; crush_location {datacenter=DC3}
4: [v2:10.0.56.13:3300/0,v1:10.0.56.13:6789/0] mon.host04; crush_location {datacenter=DC2}
dumped monmap epoch 16
Use the ceph orch host ls command to verify that the host was added to the cluster.
Example
[ceph: root@host01 /]# ceph orch host ls
HOST    ADDR         LABELS       STATUS
host01  10.0.56.37   mgr,mon,osd
host02  10.0.59.35   mgr,mon,osd
host03  10.0.58.106  osd,mds,rgw
host04  10.0.56.13   osd,mon,mgr
host05  10.0.59.188  mgr,mon,osd
host06  10.0.56.223  rgw,mds,osd
host07  10.0.56.189  _admin,mon
7 hosts in cluster
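The host listing can also be filtered to show only the hosts carrying the mon label, which is a quick way to confirm the label landed where intended. A sketch over a saved, header-stripped `ceph orch host ls` listing (the sample rows are abbreviated from the output above; the filename hosts.txt is illustrative):

```shell
# Sketch: list hosts whose LABELS column contains "mon", from a saved
# `ceph orch host ls` listing. Sample rows are abbreviated/illustrative.
cat > hosts.txt <<'EOF'
host01 10.0.56.37 mgr,mon,osd
host02 10.0.59.35 mgr,mon,osd
host03 10.0.58.106 osd,mds,rgw
host07 10.0.56.189 _admin,mon
EOF
# Match "mon" as a whole label, not as a substring of another label
awk '$3 ~ /(^|,)mon(,|$)/ {print $1}' hosts.txt
```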
These steps detail how to add a mon service directly on the cluster with the CLI. To remove a service, follow the corresponding removal commands in the same order.
Procedure
Set the monitor service to unmanaged.
Syntax
ceph orch set-unmanaged mon
Optional: Use the ceph orch ls command to verify that the service was set, as expected.
Example
[ceph: root@host01 /]# ceph orch ls
NAME  PORTS  RUNNING  REFRESHED  AGE  PLACEMENT
mon          8/8      10m ago    19s  <unmanaged>
Add a new host with the mon label.
Syntax
ceph orch host add HOST_NAME IP_ADDRESS_OF_HOST [--labels=LABEL_NAME_1,LABEL_NAME_2]
Example
[ceph: root@host01 /]# ceph orch host add host08 10.1.172.205 --labels=mon
Add a monitor service with CRUSH locations.
Note
At this point, the mon is not running and is not managed by Cephadm.
Syntax
ceph mon add NODE:IP_ADDRESS datacenter=DATACENTER
Example
[ceph: root@host01 /]# ceph mon add host08:10.1.172.205 datacenter=DC2
Deploy the monitor daemon using Cephadm.
Syntax
ceph orch daemon add mon HOST_NAME
Example
[ceph: root@host01 /]# ceph orch daemon add mon host08
Deployed mon.host08 on host 'host08'
Enable Cephadm management for the monitor service.
Syntax
ceph orch set-managed mon
Start the newly added mon daemon.
Syntax
ceph orch daemon start MON_DAEMON_NAME
Verifying
Verify that the service, monitor, and host are added and running.
Use the ceph orch ls command to verify that the service is running.
Example
[ceph: root@host01 /]# ceph orch ls
NAME  PORTS  RUNNING  REFRESHED  AGE  PLACEMENT
mon          8/8      7m ago     4d   label:mon
Use the ceph mon dump command to verify that the mon service was deployed and that the appropriate CRUSH location was added to the monitor.
Example
[ceph: root@host01 /]# ceph mon dump
epoch 16
fsid ff19789c-f5c7-11ef-8e1c-fa163e4e1f7e
last_changed 2025-02-28T12:12:51.089706+0000
created 2025-02-28T11:34:59.325503+0000
min_mon_release 19 (squid)
election_strategy: 3
stretch_mode_enabled 1
tiebreaker_mon host07
disallowed_leaders host07
0: [v2:10.0.56.37:3300/0,v1:10.0.56.37:6789/0] mon.host01; crush_location {datacenter=DC1}
1: [v2:10.0.59.188:3300/0,v1:10.0.59.188:6789/0] mon.host05; crush_location {datacenter=DC2}
2: [v2:10.0.59.35:3300/0,v1:10.0.59.35:6789/0] mon.host02; crush_location {datacenter=DC1}
3: [v2:10.0.56.189:3300/0,v1:10.0.56.189:6789/0] mon.host07; crush_location {datacenter=DC3}
4: [v2:10.0.56.13:3300/0,v1:10.0.56.13:6789/0] mon.host04; crush_location {datacenter=DC2}
dumped monmap epoch 16
Use the ceph orch host ls command to verify that the host was added to the cluster.
Example
[ceph: root@host01 /]# ceph orch host ls
HOST    ADDR         LABELS       STATUS
host01  10.0.56.37   mgr,mon,osd
host02  10.0.59.35   mgr,mon,osd
host03  10.0.58.106  osd,mds,rgw
host04  10.0.56.13   osd,mon,mgr
host05  10.0.59.188  mgr,mon,osd
host06  10.0.56.223  rgw,mds,osd
host07  10.0.56.189  _admin,mon
7 hosts in cluster
4.5.3. Replacing the tiebreaker with a monitor in quorum
If your tiebreaker monitor fails, you can replace it with an existing monitor in quorum and remove the failed monitor from the cluster.
Prerequisites
- A running Red Hat Ceph Storage cluster
- Stretch mode is enabled on a cluster
Procedure
Disable automated monitor deployment:
Example
[ceph: root@host01 /]# ceph orch apply mon --unmanaged
Scheduled mon update...
View the monitors in quorum:
Example
[ceph: root@host01 /]# ceph -s
mon: 5 daemons, quorum host01, host02, host04, host05 (age 30s), out of quorum: host07
Set the monitor in quorum as a new tiebreaker:
Syntax
ceph mon set_new_tiebreaker NEW_HOST
Example
[ceph: root@host01 /]# ceph mon set_new_tiebreaker host02
Important
You get an error message if the monitor is in the same location as existing non-tiebreaker monitors:
Example
[ceph: root@host01 /]# ceph mon set_new_tiebreaker host02
Error EINVAL: mon.host02 has location DC1, which matches mons host02 on the datacenter dividing bucket for stretch mode.
If that happens, change the location of the monitor:
Syntax
ceph mon set_location HOST datacenter=DATACENTER
Example
[ceph: root@host01 /]# ceph mon set_location host02 datacenter=DC3
Remove the failed tiebreaker monitor:
Syntax
ceph orch daemon rm FAILED_TIEBREAKER_MONITOR --force
Example
[ceph: root@host01 /]# ceph orch daemon rm mon.host07 --force
Removed mon.host07 from host 'host07'
Once the monitor is removed from the host, redeploy the monitor:
Syntax
ceph mon add HOST IP_ADDRESS datacenter=DATACENTER
ceph orch daemon add mon HOST
Example
[ceph: root@host01 /]# ceph mon add host07 213.222.226.50 datacenter=DC1
[ceph: root@host01 /]# ceph orch daemon add mon host07
Ensure there are five monitors in quorum:
Example
[ceph: root@host01 /]# ceph -s
mon: 5 daemons, quorum host01, host02, host04, host05, host07 (age 15s)
Verify that everything is configured properly:
Example
[ceph: root@host01 /]# ceph mon dump
epoch 19
fsid 1234ab78-1234-11ed-b1b1-de456ef0a89d
last_changed 2023-01-17T04:12:05.709475+0000
created 2023-01-16T05:47:25.631684+0000
min_mon_release 16 (pacific)
election_strategy: 3
stretch_mode_enabled 1
tiebreaker_mon host02
disallowed_leaders host02
0: [v2:132.224.169.63:3300/0,v1:132.224.169.63:6789/0] mon.host02; crush_location {datacenter=DC3}
1: [v2:220.141.179.34:3300/0,v1:220.141.179.34:6789/0] mon.host04; crush_location {datacenter=DC2}
2: [v2:40.90.220.224:3300/0,v1:40.90.220.224:6789/0] mon.host01; crush_location {datacenter=DC1}
3: [v2:60.140.141.144:3300/0,v1:60.140.141.144:6789/0] mon.host07; crush_location {datacenter=DC1}
4: [v2:186.184.61.92:3300/0,v1:186.184.61.92:6789/0] mon.host03; crush_location {datacenter=DC2}
dumped monmap epoch 19
Redeploy the monitors:
Syntax
ceph orch apply mon --placement="HOST_1, HOST_2, HOST_3, HOST_4, HOST_5"
Example
[ceph: root@host01 /]# ceph orch apply mon --placement="host01, host02, host04, host05, host07" Scheduled mon update...
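The replacement steps above can be condensed into a reviewable command list before anything is executed. A sketch that only prints the sequence (the hostnames and data center are illustrative, matching the example above); setting the location before promoting the new tiebreaker avoids the EINVAL error shown earlier:

```shell
# Sketch: print the tiebreaker-replacement sequence for review.
# NEW is the in-quorum monitor taking over, FAILED is the old tiebreaker;
# all values are illustrative.
NEW=host02 FAILED=host07 DC=DC3
printf '%s\n' \
  "ceph mon set_location $NEW datacenter=$DC" \
  "ceph mon set_new_tiebreaker $NEW" \
  "ceph orch daemon rm mon.$FAILED --force"
```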
4.5.4. Replacing the tiebreaker with a new monitor
If your tiebreaker monitor fails, you can replace it with a new monitor and remove the failed monitor from the cluster.
Prerequisites
Before you begin, make sure that you have the following prerequisites in place:
- A running Red Hat Ceph Storage cluster
- Stretch mode is enabled on the cluster
Procedure
Add a new monitor to the cluster:
Manually add the crush_location to the new monitor:
Syntax
ceph mon add NEW_HOST IP_ADDRESS datacenter=DATACENTER
Example
[ceph: root@host01 /]# ceph mon add host06 213.222.226.50 datacenter=DC3
adding mon.host06 at [v2:213.222.226.50:3300/0,v1:213.222.226.50:6789/0]
Note
The new monitor has to be in a different location than existing non-tiebreaker monitors.
Disable automated monitor deployment:
Example
[ceph: root@host01 /]# ceph orch apply mon --unmanaged
Scheduled mon update...
Deploy the new monitor:
Syntax
ceph orch daemon add mon NEW_HOST
Example
[ceph: root@host01 /]# ceph orch daemon add mon host06
Ensure there are six monitors, of which five are in quorum:
Example
[ceph: root@host01 /]# ceph -s
mon: 6 daemons, quorum host01, host02, host04, host05, host06 (age 30s), out of quorum: host07
Set the new monitor as a new tiebreaker:
Syntax
ceph mon set_new_tiebreaker NEW_HOST
Example
[ceph: root@host01 /]# ceph mon set_new_tiebreaker host06
Remove the failed tiebreaker monitor:
Syntax
ceph orch daemon rm FAILED_TIEBREAKER_MONITOR --force
Example
[ceph: root@host01 /]# ceph orch daemon rm mon.host07 --force
Removed mon.host07 from host 'host07'
Verify that everything is configured properly:
Example
[ceph: root@host01 /]# ceph mon dump
epoch 19
fsid 1234ab78-1234-11ed-b1b1-de456ef0a89d
last_changed 2023-01-17T04:12:05.709475+0000
created 2023-01-16T05:47:25.631684+0000
min_mon_release 16 (pacific)
election_strategy: 3
stretch_mode_enabled 1
tiebreaker_mon host06
disallowed_leaders host06
0: [v2:213.222.226.50:3300/0,v1:213.222.226.50:6789/0] mon.host06; crush_location {datacenter=DC3}
1: [v2:220.141.179.34:3300/0,v1:220.141.179.34:6789/0] mon.host04; crush_location {datacenter=DC2}
2: [v2:40.90.220.224:3300/0,v1:40.90.220.224:6789/0] mon.host01; crush_location {datacenter=DC1}
3: [v2:60.140.141.144:3300/0,v1:60.140.141.144:6789/0] mon.host02; crush_location {datacenter=DC1}
4: [v2:186.184.61.92:3300/0,v1:186.184.61.92:6789/0] mon.host05; crush_location {datacenter=DC2}
dumped monmap epoch 19
Redeploy the monitors:
Syntax
ceph orch apply mon --placement="HOST_1, HOST_2, HOST_3, HOST_4, HOST_5"
Example
[ceph: root@host01 /]# ceph orch apply mon --placement="host01, host02, host04, host05, host06" Scheduled mon update…
4.6. Read affinity in stretch clusters
Read Affinity reduces cross-zone traffic by keeping the data access within the respective data centers.
For stretched clusters deployed in multi-zone environments, the read affinity topology implementation provides a mechanism to help keep traffic within the data center where it originated. The Ceph Object Gateway can read data from an OSD in proximity to the client, according to the OSD locations defined in the CRUSH map and the topology labels on nodes.
For example, a stretch cluster contains a Ceph Object Gateway primary OSD and replicated OSDs spread across two data centers, A and B. If a GET action is performed on an object in data center A, the READ operation is performed on the data of the OSDs closest to the client in data center A.
4.6.1. Performing localized reads
You can perform a localized read on a replicated pool in a stretch cluster. When a localized read request is made on a replicated pool, Ceph selects the local OSDs closest to the client based on the client location specified in crush_location.
Prerequisites
- A stretch cluster with two data centers and Ceph Object Gateway configured on both.
- A user created with a bucket having primary and replicated OSDs.
Procedure
To perform a localized read, set rados_replica_read_policy to localize in the Ceph Object Gateway client configuration by using the ceph config set command.
Example
[ceph: root@host01 /]# ceph config set client.rgw.rgw.1 rados_replica_read_policy localize
Verification: Perform the following steps to verify a localized read from an OSD set.
Run the ceph osd tree command to view the OSDs and the data centers.
Example
[ceph: root@host01 /]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME                                STATUS  REWEIGHT  PRI-AFF
-1         0.58557  root default
-3         0.29279      datacenter DC1
-2         0.09760          host ceph-ci-fbv67y-ammmck-node2
 2    hdd  0.02440              osd.2                            up   1.00000  1.00000
11    hdd  0.02440              osd.11                           up   1.00000  1.00000
17    hdd  0.02440              osd.17                           up   1.00000  1.00000
22    hdd  0.02440              osd.22                           up   1.00000  1.00000
-4         0.09760          host ceph-ci-fbv67y-ammmck-node3
 0    hdd  0.02440              osd.0                            up   1.00000  1.00000
 6    hdd  0.02440              osd.6                            up   1.00000  1.00000
12    hdd  0.02440              osd.12                           up   1.00000  1.00000
18    hdd  0.02440              osd.18                           up   1.00000  1.00000
-5         0.09760          host ceph-ci-fbv67y-ammmck-node4
 5    hdd  0.02440              osd.5                            up   1.00000  1.00000
10    hdd  0.02440              osd.10                           up   1.00000  1.00000
16    hdd  0.02440              osd.16                           up   1.00000  1.00000
23    hdd  0.02440              osd.23                           up   1.00000  1.00000
-7         0.29279      datacenter DC2
-6         0.09760          host ceph-ci-fbv67y-ammmck-node5
 3    hdd  0.02440              osd.3                            up   1.00000  1.00000
 8    hdd  0.02440              osd.8                            up   1.00000  1.00000
14    hdd  0.02440              osd.14                           up   1.00000  1.00000
20    hdd  0.02440              osd.20                           up   1.00000  1.00000
-8         0.09760          host ceph-ci-fbv67y-ammmck-node6
 4    hdd  0.02440              osd.4                            up   1.00000  1.00000
 9    hdd  0.02440              osd.9                            up   1.00000  1.00000
15    hdd  0.02440              osd.15                           up   1.00000  1.00000
21    hdd  0.02440              osd.21                           up   1.00000  1.00000
-9         0.09760          host ceph-ci-fbv67y-ammmck-node7
 1    hdd  0.02440              osd.1                            up   1.00000  1.00000
 7    hdd  0.02440              osd.7                            up   1.00000  1.00000
13    hdd  0.02440              osd.13                           up   1.00000  1.00000
19    hdd  0.02440              osd.19                           up   1.00000  1.00000
Run the ceph orch command to identify the Ceph Object Gateway daemons in the data centers.
Example
[ceph: root@host01 /]# ceph orch ps | grep rg
rgw.rgw.1.ceph-ci-fbv67y-ammmck-node4.dmsmex  ceph-ci-fbv67y-ammmck-node4  *:80  running (4h)  10m ago  22h  93.3M  -  19.1.0-55.el9cp  0ee0a0ad94c7  34f27723ccd2
rgw.rgw.1.ceph-ci-fbv67y-ammmck-node7.pocecp  ceph-ci-fbv67y-ammmck-node7  *:80  running (4h)  10m ago  22h  96.4M  -  19.1.0-55.el9cp  0ee0a0ad94c7  40e4f2a6d4c4
Verify that a localized read has happened by viewing the Ceph Object Gateway logs.
Example
[ceph: root@host01 /]# vim /var/log/ceph/<fsid>/<ceph-client-rgw>.log
2024-08-26T08:07:45.471+0000 7fc623e63640 1 ====== starting new request req=0x7fc5b93694a0 =====
2024-08-26T08:07:45.471+0000 7fc623e63640 1 -- 10.0.67.142:0/279982082 --> [v2:10.0.66.23:6816/73244434,v1:10.0.66.23:6817/73244434] -- osd_op(unknown.0.0:9081 11.55 11:ab26b168:::3acf4091-c54c-43b5-a495-c505fe545d25.27842.1_f1:head [getxattrs,stat] snapc 0=[] ondisk+read+localize_reads+known_if_redirected+supports_pool_eio e3533) -- 0x55f781bd2000 con 0x55f77f0e8c00
You can see in the logs that a localized read has taken place.
Important
To be able to view the debug logs, you must first enable debug_ms 1 in the configuration by running the ceph config set command.
[ceph: root@host01 /]# ceph config set client.rgw.rgw.1.ceph-ci-gune2w-mysx73-node4.dgvrmx debug_ms 1/1
[ceph: root@host01 /]# ceph config set client.rgw.rgw.1.ceph-ci-gune2w-mysx73-node7.rfkqqq debug_ms 1/1
4.6.2. Performing balanced reads
You can perform a balanced read on a pool to spread read operations evenly across the data centers. When a balanced READ is issued on a pool, the read operations are distributed evenly across all OSDs in both data centers.
Prerequisites
- A stretch cluster with two data centers and Ceph Object Gateway configured on both.
- A user created with a bucket having primary and replicated OSDs.
Procedure
To perform a balanced read, set rados_replica_read_policy to balance in the Ceph Object Gateway client configuration by using the ceph config set command.
Example
[ceph: root@host01 /]# ceph config set client.rgw.rgw.1 rados_replica_read_policy balance
Verification: Perform the following steps to verify a balanced read from an OSD set.
Run the ceph osd tree command to view the OSDs and the data centers.
Example
[ceph: root@host01 /]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME                                STATUS  REWEIGHT  PRI-AFF
-1         0.58557  root default
-3         0.29279      datacenter DC1
-2         0.09760          host ceph-ci-fbv67y-ammmck-node2
 2    hdd  0.02440              osd.2                            up   1.00000  1.00000
11    hdd  0.02440              osd.11                           up   1.00000  1.00000
17    hdd  0.02440              osd.17                           up   1.00000  1.00000
22    hdd  0.02440              osd.22                           up   1.00000  1.00000
-4         0.09760          host ceph-ci-fbv67y-ammmck-node3
 0    hdd  0.02440              osd.0                            up   1.00000  1.00000
 6    hdd  0.02440              osd.6                            up   1.00000  1.00000
12    hdd  0.02440              osd.12                           up   1.00000  1.00000
18    hdd  0.02440              osd.18                           up   1.00000  1.00000
-5         0.09760          host ceph-ci-fbv67y-ammmck-node4
 5    hdd  0.02440              osd.5                            up   1.00000  1.00000
10    hdd  0.02440              osd.10                           up   1.00000  1.00000
16    hdd  0.02440              osd.16                           up   1.00000  1.00000
23    hdd  0.02440              osd.23                           up   1.00000  1.00000
-7         0.29279      datacenter DC2
-6         0.09760          host ceph-ci-fbv67y-ammmck-node5
 3    hdd  0.02440              osd.3                            up   1.00000  1.00000
 8    hdd  0.02440              osd.8                            up   1.00000  1.00000
14    hdd  0.02440              osd.14                           up   1.00000  1.00000
20    hdd  0.02440              osd.20                           up   1.00000  1.00000
-8         0.09760          host ceph-ci-fbv67y-ammmck-node6
 4    hdd  0.02440              osd.4                            up   1.00000  1.00000
 9    hdd  0.02440              osd.9                            up   1.00000  1.00000
15    hdd  0.02440              osd.15                           up   1.00000  1.00000
21    hdd  0.02440              osd.21                           up   1.00000  1.00000
-9         0.09760          host ceph-ci-fbv67y-ammmck-node7
 1    hdd  0.02440              osd.1                            up   1.00000  1.00000
 7    hdd  0.02440              osd.7                            up   1.00000  1.00000
13    hdd  0.02440              osd.13                           up   1.00000  1.00000
19    hdd  0.02440              osd.19                           up   1.00000  1.00000
Run the ceph orch command to identify the Ceph Object Gateway daemons in the data centers.
Example
[ceph: root@host01 /]# ceph orch ps | grep rg
rgw.rgw.1.ceph-ci-fbv67y-ammmck-node4.dmsmex  ceph-ci-fbv67y-ammmck-node4  *:80  running (4h)  10m ago  22h  93.3M  -  19.1.0-55.el9cp  0ee0a0ad94c7  34f27723ccd2
rgw.rgw.1.ceph-ci-fbv67y-ammmck-node7.pocecp  ceph-ci-fbv67y-ammmck-node7  *:80  running (4h)  10m ago  22h  96.4M  -  19.1.0-55.el9cp  0ee0a0ad94c7  40e4f2a6d4c4
Verify that a balanced read has happened by viewing the Ceph Object Gateway logs.
Example
[ceph: root@host01 /]# vim /var/log/ceph/<fsid>/<ceph-client-rgw>.log
2024-08-27T09:32:25.510+0000 7f2a7a284640 1 ====== starting new request req=0x7f2a31fcf4a0 =====
2024-08-27T09:32:25.510+0000 7f2a7a284640 1 -- 10.0.67.142:0/3116867178 --> [v2:10.0.64.146:6816/2838383288,v1:10.0.64.146:6817/2838383288] -- osd_op(unknown.0.0:268731 11.55 11:ab26b168:::3acf4091-c54c-43b5-a495-c505fe545d25.27842.1_f1:head [getxattrs,stat] snapc 0=[] ondisk+read+balance_reads+known_if_redirected+supports_pool_eio e3554) -- 0x55cd1b88dc00 con 0x55cd18dd6000
You can see in the logs that a balanced read has taken place.
Important
To be able to view the debug logs, you must first enable debug_ms 1 in the configuration by running the ceph config set command.
[ceph: root@host01 /]# ceph config set client.rgw.rgw.1.ceph-ci-gune2w-mysx73-node4.dgvrmx debug_ms 1/1
[ceph: root@host01 /]# ceph config set client.rgw.rgw.1.ceph-ci-gune2w-mysx73-node7.rfkqqq debug_ms 1/1
4.6.3. Performing default reads
You can perform a default read on a pool to restore the standard read behavior. When a default READ is issued on a pool, each read operation is served by the primary OSD of the placement group, regardless of which data center it is in.
Prerequisites
- A stretch cluster with two data centers and Ceph Object Gateway configured on both.
- A user created with a bucket having primary and replicated OSDs.
Procedure
To perform a default read, set rados_replica_read_policy to default in the Ceph Object Gateway client configuration by using the ceph config set command.
Example
[ceph: root@host01 /]# ceph config set client.rgw.rgw.1 rados_replica_read_policy default
When a GET operation is performed, the read is served by the primary OSD.
Verification: Perform the following steps to verify a default read from an OSD set.
Run the ceph osd tree command to view the OSDs and the data centers.
Example
[ceph: root@host01 /]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME                                STATUS  REWEIGHT  PRI-AFF
-1         0.58557  root default
-3         0.29279      datacenter DC1
-2         0.09760          host ceph-ci-fbv67y-ammmck-node2
 2    hdd  0.02440              osd.2                            up   1.00000  1.00000
11    hdd  0.02440              osd.11                           up   1.00000  1.00000
17    hdd  0.02440              osd.17                           up   1.00000  1.00000
22    hdd  0.02440              osd.22                           up   1.00000  1.00000
-4         0.09760          host ceph-ci-fbv67y-ammmck-node3
 0    hdd  0.02440              osd.0                            up   1.00000  1.00000
 6    hdd  0.02440              osd.6                            up   1.00000  1.00000
12    hdd  0.02440              osd.12                           up   1.00000  1.00000
18    hdd  0.02440              osd.18                           up   1.00000  1.00000
-5         0.09760          host ceph-ci-fbv67y-ammmck-node4
 5    hdd  0.02440              osd.5                            up   1.00000  1.00000
10    hdd  0.02440              osd.10                           up   1.00000  1.00000
16    hdd  0.02440              osd.16                           up   1.00000  1.00000
23    hdd  0.02440              osd.23                           up   1.00000  1.00000
-7         0.29279      datacenter DC2
-6         0.09760          host ceph-ci-fbv67y-ammmck-node5
 3    hdd  0.02440              osd.3                            up   1.00000  1.00000
 8    hdd  0.02440              osd.8                            up   1.00000  1.00000
14    hdd  0.02440              osd.14                           up   1.00000  1.00000
20    hdd  0.02440              osd.20                           up   1.00000  1.00000
-8         0.09760          host ceph-ci-fbv67y-ammmck-node6
 4    hdd  0.02440              osd.4                            up   1.00000  1.00000
 9    hdd  0.02440              osd.9                            up   1.00000  1.00000
15    hdd  0.02440              osd.15                           up   1.00000  1.00000
21    hdd  0.02440              osd.21                           up   1.00000  1.00000
-9         0.09760          host ceph-ci-fbv67y-ammmck-node7
 1    hdd  0.02440              osd.1                            up   1.00000  1.00000
 7    hdd  0.02440              osd.7                            up   1.00000  1.00000
13    hdd  0.02440              osd.13                           up   1.00000  1.00000
19    hdd  0.02440              osd.19                           up   1.00000  1.00000
Run the ceph orch command to identify the Ceph Object Gateway daemons in the data centers.
Example
[ceph: root@host01 /]# ceph orch ps | grep rg
rgw.rgw.1.ceph-ci-fbv67y-ammmck-node4.dmsmex  ceph-ci-fbv67y-ammmck-node4  *:80  running (4h)  10m ago  22h  93.3M  -  19.1.0-55.el9cp  0ee0a0ad94c7  34f27723ccd2
rgw.rgw.1.ceph-ci-fbv67y-ammmck-node7.pocecp  ceph-ci-fbv67y-ammmck-node7  *:80  running (4h)  10m ago  22h  96.4M  -  19.1.0-55.el9cp  0ee0a0ad94c7  40e4f2a6d4c4
Verify that a default read has happened by viewing the Ceph Object Gateway logs.
Example
[ceph: root@host01 /]# vim /var/log/ceph/<fsid>/<ceph-client-rgw>.log
2024-08-28T10:26:05.155+0000 7fe6b03dd640 1 ====== starting new request req=0x7fe6879674a0 =====
2024-08-28T10:26:05.156+0000 7fe6b03dd640 1 -- 10.0.64.251:0/2235882725 --> [v2:10.0.65.171:6800/4255735352,v1:10.0.65.171:6801/4255735352] -- osd_op(unknown.0.0:1123 11.6d 11:b69767fc:::699c2d80-5683-43c5-bdcd-e8912107c176.24827.3_f1:head [getxattrs,stat] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e4513) -- 0x5639da653800 con 0x5639d804d800
You can see in the logs that a default read has taken place.
Important
To be able to view the debug logs, you must first enable debug_ms 1 in the configuration by running the ceph config set command.
[ceph: root@host01 /]# ceph config set client.rgw.rgw.1.ceph-ci-gune2w-mysx73-node4.dgvrmx debug_ms 1/1
[ceph: root@host01 /]# ceph config set client.rgw.rgw.1.ceph-ci-gune2w-mysx73-node7.rfkqqq debug_ms 1/1