Chapter 6. Management of OSDs using the Ceph Orchestrator
As a storage administrator, you can use the Ceph Orchestrator to manage OSDs of a Red Hat Ceph Storage cluster.
6.1. Ceph OSDs
When a Red Hat Ceph Storage cluster is up and running, you can add OSDs to the storage cluster at runtime.
A Ceph OSD generally consists of one ceph-osd daemon for one storage drive and its associated journal within a node. If a node has multiple storage drives, map one ceph-osd daemon to each drive.
Red Hat recommends checking the capacity of a cluster regularly to see if it is reaching the upper end of its storage capacity. As a storage cluster reaches its near full ratio, add one or more OSDs to expand the storage cluster’s capacity.
When you want to reduce the size of a Red Hat Ceph Storage cluster or replace the hardware, you can also remove an OSD at runtime. If the node has multiple storage drives, you might need to remove only the ceph-osd daemon for one of those drives. Generally, it is a good idea to check the capacity of the storage cluster to see if you are reaching the upper end of its capacity. When you remove an OSD, ensure that the storage cluster is not at its near full ratio.
Do not let a storage cluster reach the full ratio before adding an OSD. OSD failures that occur after the storage cluster reaches the near full ratio can cause the storage cluster to exceed the full ratio. Ceph blocks write access to protect the data until you resolve the storage capacity issues. Do not remove OSDs without considering the impact on the full ratio first.
6.2. Ceph OSD node configuration
Configure Ceph OSDs and their supporting hardware according to the storage strategy for the pools that will use the OSDs. Ceph prefers uniform hardware across pools for a consistent performance profile. For best performance, consider a CRUSH hierarchy with drives of the same type or size.
If you add drives of dissimilar size, adjust their weights accordingly. When you add the OSD to the CRUSH map, consider the weight for the new OSD. Hard drive capacity grows approximately 40% per year, so newer OSD nodes might have larger hard drives than older nodes in the storage cluster, and therefore a greater weight.
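When drives of dissimilar size share a CRUSH hierarchy, a common convention, and the way Ceph sets the initial CRUSH weight of a new OSD by default, is to make the weight equal to the drive capacity in TiB. The following sketch computes that value from a size in bytes so that you can compare it with the WEIGHT column of ceph osd tree. The crush_weight_for_bytes function name is hypothetical; it is not part of Ceph.

```bash
#!/usr/bin/env bash
# Hypothetical helper: convert a drive size in bytes to a CRUSH weight,
# assuming the convention that the weight equals the capacity in TiB.
# Usage: crush_weight_for_bytes SIZE_IN_BYTES
crush_weight_for_bytes() {
  local bytes="$1"
  case "$bytes" in
    ''|*[!0-9]*) echo "size must be a whole number of bytes" >&2; return 1 ;;
  esac
  # 1 TiB = 2^40 = 1099511627776 bytes; print five decimals, like the
  # WEIGHT column of ceph osd tree.
  awk -v b="$bytes" 'BEGIN { printf "%.5f\n", b / 1099511627776 }'
}
```

For example, for a drive that lsblk -b reports as 4000787030016 bytes, the function prints a weight of about 3.64. After reviewing the value, you could apply it with ceph osd crush reweight osd.ID WEIGHT.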
Before doing a new installation, review the Requirements for Installing Red Hat Ceph Storage chapter in the Installation Guide.
6.3. Automatically tuning OSD memory
The OSD daemons adjust their memory consumption based on the osd_memory_target configuration option. The osd_memory_target option sets the OSD memory based on the available RAM in the system.
If Red Hat Ceph Storage is deployed on dedicated nodes that do not share memory with other services, cephadm automatically adjusts the per-OSD consumption based on the total amount of RAM and the number of deployed OSDs.
By default, the osd_memory_target_autotune parameter is set to true in the Red Hat Ceph Storage cluster.
Syntax
ceph config set osd osd_memory_target_autotune true
Cephadm starts with a fraction, mgr/cephadm/autotune_memory_target_ratio, of the total RAM in the system; this ratio defaults to 0.7. It then subtracts the memory consumed by daemons that are not autotuned, such as non-OSD daemons and OSDs for which osd_memory_target_autotune is false, and divides the remainder by the number of remaining OSDs.
The osd_memory_target parameter is calculated as follows:
Syntax
osd_memory_target = (TOTAL_RAM_OF_THE_OSD_NODE_IN_MIB * 1048576 * autotune_memory_target_ratio - SPACE_ALLOCATED_FOR_OTHER_DAEMONS) / NUMBER_OF_OSDS_IN_THE_OSD_NODE
SPACE_ALLOCATED_FOR_OTHER_DAEMONS may optionally include the following daemon space allocations:
- Alertmanager: 1 GB
- Grafana: 1 GB
- Ceph Manager: 4 GB
- Ceph Monitor: 2 GB
- Node-exporter: 1 GB
- Prometheus: 1 GB
For example, if a node has 24 OSDs, 251 GiB (257024 MiB) of RAM, and no space allocated for other daemons, then osd_memory_target is 257024 * 1048576 * 0.7 / 24, which truncates to 7860684936 bytes.
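The procedure described in this section (take the configured ratio of the total RAM, subtract the memory reserved for other daemons, then divide by the number of autotuned OSDs) can be reproduced with a short shell function, for example to predict the target before you add OSDs to a node. This is a sketch of the arithmetic, not the cephadm source: the estimate_osd_memory_target name is hypothetical, the RAM size is assumed to be given in MiB, and the memory for other daemons in bytes.

```bash
#!/usr/bin/env bash
# Hypothetical helper that follows the osd_memory_target calculation:
#   (TOTAL_RAM_MIB * 1048576 * RATIO - OTHER_DAEMONS_BYTES) / NUMBER_OF_OSDS
# Arguments:
#   $1 total RAM of the OSD node, in MiB
#   $2 autotune_memory_target_ratio (0.7 by default)
#   $3 number of autotuned OSDs on the node
#   $4 memory reserved for other daemons, in bytes (optional, defaults to 0)
estimate_osd_memory_target() {
  local total_mib="$1" ratio="$2" num_osds="$3" other_bytes="${4:-0}"
  if [ "$num_osds" -le 0 ]; then
    echo "number of OSDs must be positive" >&2
    return 1
  fi
  # awk handles the fractional ratio; int() truncates to whole bytes and
  # %.0f prints the integer without switching to exponent notation.
  awk -v t="$total_mib" -v r="$ratio" -v n="$num_osds" -v o="$other_bytes" \
    'BEGIN { printf "%.0f\n", int((t * 1048576 * r - o) / n) }'
}
```

For example, estimate_osd_memory_target 257024 0.7 24 prints 7860684936. The value that cephadm actually applies can differ, for example when other daemons run on the node.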
The final targets are stored as options in the configuration database. You can view the limits and the current memory consumed by each daemon in the ceph orch ps output; the limits appear under the MEM LIMIT column.
The default setting of osd_memory_target_autotune, true, is unsuitable for hyperconverged infrastructures where compute and Ceph storage services are colocated. In a hyperconverged infrastructure, you can set autotune_memory_target_ratio to 0.2 to reduce the memory consumption of Ceph.
Example
[ceph: root@host01 /]# ceph config set mgr mgr/cephadm/autotune_memory_target_ratio 0.2
You can manually set a specific memory target for an OSD in the storage cluster.
Example
[ceph: root@host01 /]# ceph config set osd.123 osd_memory_target 7860684936
You can manually set a specific memory target for an OSD host in the storage cluster.
Syntax
ceph config set osd/host:HOSTNAME osd_memory_target TARGET_BYTES
Example
[ceph: root@host01 /]# ceph config set osd/host:host01 osd_memory_target 1000000000
Enabling osd_memory_target_autotune overwrites existing manual OSD memory target settings. To prevent daemon memory from being tuned even when the osd_memory_target_autotune option or other similar options are enabled, set the _no_autotune_memory label on the host.
Syntax
ceph orch host label add HOSTNAME _no_autotune_memory
You can exclude an OSD from memory autotuning by disabling the autotune option and setting a specific memory target.
Example
[ceph: root@host01 /]# ceph config set osd.123 osd_memory_target_autotune false
[ceph: root@host01 /]# ceph config set osd.123 osd_memory_target 16G
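If you exclude several OSDs from autotuning, a small loop over the two ceph config set commands shown above saves typing. This is a sketch; the pin_osd_memory_target function name is hypothetical, and it assumes that it runs inside the Cephadm shell, where the ceph command is available.

```bash
#!/usr/bin/env bash
# Hypothetical helper: disable memory autotuning for the given OSD IDs and
# pin each of them to a fixed osd_memory_target.
# Usage: pin_osd_memory_target TARGET OSD_ID [OSD_ID ...]
pin_osd_memory_target() {
  local target="${1:-}"
  shift || true
  if [ -z "$target" ] || [ "$#" -eq 0 ]; then
    echo "usage: pin_osd_memory_target TARGET OSD_ID [OSD_ID ...]" >&2
    return 1
  fi
  local id
  for id in "$@"; do
    # Disable autotuning first so that it does not overwrite the manual value.
    ceph config set "osd.$id" osd_memory_target_autotune false || return 1
    ceph config set "osd.$id" osd_memory_target "$target" || return 1
  done
}
```

For example, pin_osd_memory_target 16G 123 124 applies the same settings as the example above to osd.123 and osd.124.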
6.4. Listing devices for Ceph OSD deployment
You can check the list of available devices before deploying OSDs using the Ceph Orchestrator. The ceph orch device ls command prints a list of the devices that Cephadm can discover. A storage device is considered available if all of the following conditions are met:
- The device must have no partitions.
- The device must not have any LVM state.
- The device must not be mounted.
- The device must not contain a file system.
- The device must not contain a Ceph BlueStore OSD.
- The device must be larger than 5 GB.
Ceph will not provision an OSD on a device that is not available.
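Before you run ceph orch device ls, you can approximate some of these conditions locally on a host with lsblk. This is only a sketch: Cephadm and ceph-volume make the real availability decision, the precheck_device function name is hypothetical, it only inspects what lsblk reports, and the 5 GB threshold is assumed here to mean 5 GiB (5368709120 bytes).

```bash
#!/usr/bin/env bash
# Hypothetical local pre-check that approximates some of the availability
# rules above with lsblk. Cephadm and ceph-volume make the real decision.
# Usage: precheck_device /dev/sdX
precheck_device() {
  local dev="$1"
  local min_bytes=5368709120   # assumption: "larger than 5 GB" read as 5 GiB
  local lines mountpoint fstype size

  if ! lsblk -d -n "$dev" >/dev/null 2>&1; then
    echo "$dev: not found"; return 1
  fi

  # lsblk prints the device plus any partitions or LVM volumes stacked on it,
  # so more than one line means the device is already in use.
  lines=$(lsblk -n -o NAME "$dev" | wc -l | tr -d ' ')
  if [ "$lines" -gt 1 ]; then
    echo "$dev: has partitions or LVM volumes"; return 1
  fi

  # -d restricts the output to the device itself.
  mountpoint=$(lsblk -d -n -o MOUNTPOINT "$dev" | tr -d ' ')
  if [ -n "$mountpoint" ]; then
    echo "$dev: mounted on $mountpoint"; return 1
  fi

  # A non-empty FSTYPE covers file systems and signatures such as LVM2_member.
  fstype=$(lsblk -d -n -o FSTYPE "$dev" | tr -d ' ')
  if [ -n "$fstype" ]; then
    echo "$dev: contains $fstype"; return 1
  fi

  # -b prints the size in bytes instead of a human-readable value.
  size=$(lsblk -b -d -n -o SIZE "$dev" | tr -d ' ')
  case "$size" in
    ''|*[!0-9]*) echo "$dev: unknown size"; return 1 ;;
  esac
  if [ "$size" -le "$min_bytes" ]; then
    echo "$dev: too small ($size bytes)"; return 1
  fi

  echo "$dev: looks available"
}
```

A device that passes this check can still be rejected by Cephadm, so treat the result as a hint and confirm with ceph orch device ls --wide.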
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Hosts are added to the cluster.
- All manager and monitor daemons are deployed.
Procedure
Log into the Cephadm shell:
Example
[root@host01 ~]# cephadm shell
List the available devices to deploy OSDs:
Syntax
ceph orch device ls [--hostname=HOSTNAME_1 HOSTNAME_2] [--wide] [--refresh]
Example
[ceph: root@host01 /]# ceph orch device ls --wide --refresh
Using the --wide option provides all details relating to the device, including any reasons that the device might not be eligible for use as an OSD. This option does not support NVMe devices.
Optional: To enable the Health, Ident, and Fault fields in the output of ceph orch device ls, run the following commands:
Note
These fields are supported by the libstoragemgmt library, which currently supports SCSI, SAS, and SATA devices.
As the root user outside the Cephadm shell, check your hardware’s compatibility with the libstoragemgmt library to avoid unplanned interruption to services:
Example
[root@host01 ~]# cephadm shell lsmcli ldl
In the output, you see the Health Status as Good with the respective SCSI VPD 0x83 ID.
Note
If you do not get this information, then enabling the fields might cause erratic behavior of devices.
Log back into the Cephadm shell and enable libstoragemgmt support:
Example
[root@host01 ~]# cephadm shell
[ceph: root@host01 /]# ceph config set mgr mgr/cephadm/device_enhanced_scan true
Once this is enabled, the output of ceph orch device ls shows Good in the Health field.
Verification
List the devices:
Example
[ceph: root@host01 /]# ceph orch device ls
6.5. Zapping devices for Ceph OSD deployment
You need to check the list of available devices before deploying OSDs. If there is no space available on the devices, you can clear the data on the devices by zapping them.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Hosts are added to the cluster.
- All manager and monitor daemons are deployed.
Procedure
Log into the Cephadm shell:
Example
[root@host01 ~]# cephadm shell
List the available devices to deploy OSDs:
Syntax
ceph orch device ls [--hostname=HOSTNAME_1 HOSTNAME_2] [--wide] [--refresh]
Example
[ceph: root@host01 /]# ceph orch device ls --wide --refresh
Clear the data of a device:
Syntax
ceph orch device zap HOSTNAME DEVICE_PATH --force
Example
[ceph: root@host01 /]# ceph orch device zap host02 /dev/sdb --force
Verification
Verify the space is available on the device:
Example
[ceph: root@host01 /]# ceph orch device ls
You will see that the field under Available is Yes.
Additional Resources
- See the Listing devices for Ceph OSD deployment section in the Red Hat Ceph Storage Operations Guide for more information.
6.6. Deploying Ceph OSDs on all available devices
You can deploy OSDs on all the available devices. Cephadm allows the Ceph Orchestrator to discover and deploy the OSDs on any available and unused storage device.
To deploy OSDs on all available devices, run the command without the unmanaged parameter, and then re-run the command with the parameter to prevent the creation of future OSDs.
The deployment of OSDs with --all-available-devices is generally used for smaller clusters. For larger clusters, use the OSD specification file.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Hosts are added to the cluster.
- All manager and monitor daemons are deployed.
Procedure
Log into the Cephadm shell:
Example
[root@host01 ~]# cephadm shell
List the available devices to deploy OSDs:
Syntax
ceph orch device ls [--hostname=HOSTNAME_1 HOSTNAME_2] [--wide] [--refresh]
Example
[ceph: root@host01 /]# ceph orch device ls --wide --refresh
Deploy OSDs on all available devices:
Example
[ceph: root@host01 /]# ceph orch apply osd --all-available-devices
The effect of ceph orch apply is persistent, which means that the Orchestrator automatically finds the device, adds it to the cluster, and creates new OSDs. This occurs under the following conditions:
- New disks or drives are added to the system.
- Existing disks or drives are zapped.
- An OSD is removed and the devices are zapped.
You can disable automatic creation of OSDs on all the available devices by using the --unmanaged parameter.
Example
[ceph: root@host01 /]# ceph orch apply osd --all-available-devices --unmanaged=true
Setting the --unmanaged parameter to true disables the creation of OSDs, and no change occurs even if you apply a new OSD service.
Note
The ceph orch daemon add command creates new OSDs, but does not add an OSD service.
Verification
List the service:
Example
[ceph: root@host01 /]# ceph orch ls
View the details of the node and devices:
Example
[ceph: root@host01 /]# ceph osd tree
Additional Resources
- See the Listing devices for Ceph OSD deployment section in the Red Hat Ceph Storage Operations Guide.
6.7. Deploying Ceph OSDs on specific devices and hosts
You can deploy Ceph OSDs on specific devices and hosts using the Ceph Orchestrator.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Hosts are added to the cluster.
- All manager and monitor daemons are deployed.
Procedure
Log into the Cephadm shell:
Example
[root@host01 ~]# cephadm shell
List the available devices to deploy OSDs:
Syntax
ceph orch device ls [--hostname=HOSTNAME_1 HOSTNAME_2] [--wide] [--refresh]
Example
[ceph: root@host01 /]# ceph orch device ls --wide --refresh
Deploy OSDs on specific devices and hosts:
Syntax
ceph orch daemon add osd HOSTNAME:DEVICE_PATH
Example
[ceph: root@host01 /]# ceph orch daemon add osd host02:/dev/sdb
To deploy OSDs on a raw physical device, without an LVM layer, use the --method raw option.
Syntax
ceph orch daemon add osd --method raw HOSTNAME:DEVICE_PATH
Example
[ceph: root@host01 /]# ceph orch daemon add osd --method raw host02:/dev/sdb
Note
If you have separate DB or WAL devices, the ratio of block to DB or WAL devices MUST be 1:1.
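To add OSDs on several devices of the same host, you can loop over the ceph orch daemon add osd command shown above. The add_osds_on_host function and its --raw switch are hypothetical conventions of this sketch; the switch only adds the documented --method raw option to each call.

```bash
#!/usr/bin/env bash
# Hypothetical helper: create one OSD per listed device on a single host.
# Usage: add_osds_on_host [--raw] HOSTNAME DEVICE_PATH [DEVICE_PATH ...]
add_osds_on_host() {
  local method_args=()
  if [ "${1:-}" = "--raw" ]; then
    # Deploy on raw physical devices, without an LVM layer.
    method_args=(--method raw)
    shift
  fi
  local host="${1:-}"
  shift || true
  if [ -z "$host" ] || [ "$#" -eq 0 ]; then
    echo "usage: add_osds_on_host [--raw] HOSTNAME DEVICE_PATH [DEVICE_PATH ...]" >&2
    return 1
  fi
  local dev
  for dev in "$@"; do
    # Stop at the first failure so that you can inspect the device.
    ceph orch daemon add osd "${method_args[@]}" "$host:$dev" || return 1
  done
}
```

For example, add_osds_on_host host02 /dev/sdb /dev/sdc runs ceph orch daemon add osd once for each device.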
Verification
List the service:
Example
[ceph: root@host01 /]# ceph orch ls osd
View the details of the node and devices:
Example
[ceph: root@host01 /]# ceph osd tree
List the hosts, daemons, and processes:
Syntax
ceph orch ps --service_name=SERVICE_NAME
Example
[ceph: root@host01 /]# ceph orch ps --service_name=osd
Additional Resources
- See the Listing devices for Ceph OSD deployment section in the Red Hat Ceph Storage Operations Guide.
6.8. Advanced service specifications and filters for deploying OSDs
A service specification of type OSD is a way to describe a cluster layout using the properties of disks. It gives the user an abstract way to tell Ceph which disks should turn into an OSD with the required configuration, without knowing the specifics of device names and paths. For each device and each host, define a YAML or JSON file.
General settings for OSD specifications
- service_type: 'osd': This is mandatory to create OSDs.
- service_id: Use the service name or identification you prefer. A set of OSDs is created using the specification file. This name is used to manage all the OSDs together and represent an Orchestrator service.
- placement: This is used to define the hosts on which the OSDs need to be deployed. You can use the following options:
  - host_pattern: '*' - A host name pattern used to select hosts.
  - label: 'osd_host' - A label used in the hosts where OSDs need to be deployed.
  - hosts: 'host01', 'host02' - An explicit list of host names where OSDs need to be deployed.
- selection of devices: The devices where OSDs are created. This allows you to separate the components of an OSD onto different devices. You can create only BlueStore OSDs, which have three components:
  - OSD data: contains all the OSD data
  - WAL: BlueStore internal journal or write-ahead log
  - DB: BlueStore internal metadata
- data_devices: Defines the devices on which to deploy OSDs. In this case, OSDs are created in a collocated schema. You can use filters to select devices and folders.
- wal_devices: Defines the devices used for the WAL of the OSDs. You can use filters to select devices and folders.
- db_devices: Defines the devices used for the DB of the OSDs. You can use filters to select devices and folders.
- encrypted: An optional parameter to encrypt information on the OSD. It can be set to either True or False.
- unmanaged: An optional parameter, set to False by default. You can set it to True if you do not want the Orchestrator to manage the OSD service.
- block_wal_size: User-defined value, in bytes.
- block_db_size: User-defined value, in bytes.
- osds_per_device: User-defined value for deploying more than one OSD per device.
- method: An optional parameter to specify whether an OSD is created with an LVM layer. Set it to raw if you want to create OSDs on raw physical devices that do not include an LVM layer. If you have separate DB or WAL devices, the ratio of block to DB or WAL devices MUST be 1:1.
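The settings above can be combined into a specification file without hand-editing YAML. The following sketch writes a minimal paths-based OSD specification to standard output. The write_osd_spec function name and its argument order are hypothetical, and it covers only placement by an explicit host list, data_devices, and an optional db_devices path.

```bash
#!/usr/bin/env bash
# Hypothetical generator for a minimal OSD service specification.
# Usage: write_osd_spec SERVICE_ID HOSTS DATA_PATH [DB_PATH]
#   HOSTS is a comma-separated list, for example "host01,host02".
write_osd_spec() {
  local service_id="$1" hosts="$2" data_path="$3" db_path="${4:-}"
  local host
  printf 'service_type: osd\n'
  printf 'service_id: %s\n' "$service_id"
  printf 'placement:\n  hosts:\n'
  # Split the comma-separated host list into YAML list items.
  local IFS=','
  for host in $hosts; do
    printf '    - %s\n' "$host"
  done
  printf 'data_devices:\n  paths:\n    - %s\n' "$data_path"
  # db_devices is written only when a DB path is given.
  if [ -n "$db_path" ]; then
    printf 'db_devices:\n  paths:\n    - %s\n' "$db_path"
  fi
}
```

For example, write_osd_spec osd_using_paths host01,host02 /dev/sdb /dev/sdc > osd_spec.yaml creates a file that you can review and then preview with ceph orch apply -i osd_spec.yaml --dry-run.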
Filters for specifying devices
Filters are used in conjunction with the data_devices, wal_devices, and db_devices parameters.
Name of the filter | Description | Syntax | Example |
Model | Target specific disks. You can get details of the model by running | model: DISK_MODEL_NAME | model: MC-55-44-XZ |
Vendor | Target specific disks | vendor: DISK_VENDOR_NAME | vendor: Vendor Cs |
Size Specification | Includes disks of an exact size | size: EXACT | size: '10G' |
Size Specification | Includes disks whose size is within the range | size: LOW:HIGH | size: '10G:40G' |
Size Specification | Includes disks less than or equal to the given size | size: :HIGH | size: ':10G' |
Size Specification | Includes disks equal to or greater than the given size | size: LOW: | size: '40G:' |
Rotational | Rotational attribute of the disk. 1 matches all disks that are rotational and 0 matches all the disks that are non-rotational. If rotational is 0, the OSD is configured with SSD or NVMe devices. If rotational is 1, the OSD is configured with HDDs. | rotational: 0 or 1 | rotational: 0 |
All | Considers all the available disks | all: true | all: true |
Limiter | When you have specified valid filters but want to limit the number of matching disks, you can use the limit directive. Use it only as a last resort. | limit: NUMBER | limit: 2 |
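To reason about which disks a size filter selects, the following sketch reimplements the four size forms in the table for whole-gigabyte values. It is an illustration only: the size_filter_matches function is hypothetical, it accepts only integer values with a G suffix, and it treats both bounds of a range as inclusive, which is an assumption; Ceph's own matcher also accepts other units.

```bash
#!/usr/bin/env bash
# Hypothetical illustration of the size filter forms in the table above.
# Only whole numbers with a G suffix are handled, for example 10G or 10G:40G.
# Usage: size_filter_matches FILTER DISK_SIZE   (DISK_SIZE like 40 or 40G)
size_filter_matches() {
  local filter="${1//G/}" disk="${2%G}"
  local low high
  case "$filter" in
    *:*)
      low="${filter%%:*}"    # empty for the ':HIGH' form
      high="${filter#*:}"    # empty for the 'LOW:' form
      if [ -n "$low" ] && [ "$disk" -lt "$low" ]; then return 1; fi
      if [ -n "$high" ] && [ "$disk" -gt "$high" ]; then return 1; fi
      return 0
      ;;
    *)
      # EXACT form: the disk must have exactly this size.
      [ "$disk" -eq "$filter" ]
      ;;
  esac
}
```

For example, size_filter_matches 10G:40G 40G succeeds and size_filter_matches :10G 11G fails.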
To create an OSD with non-collocated components in the same host, you have to specify the different types of devices used and the devices should be on the same host.
The devices used for deploying OSDs must be supported by libstoragemgmt.
Additional Resources
- See the Deploying Ceph OSDs using the advanced specifications section in the Red Hat Ceph Storage Operations Guide.
- For more information on libstoragemgmt, see the Listing devices for Ceph OSD deployment section in the Red Hat Ceph Storage Operations Guide.
6.9. Deploying Ceph OSDs using advanced service specifications
The service specification of type OSD is a way to describe a cluster layout using the properties of disks. It gives the user an abstract way to tell Ceph which disks should turn into an OSD with the required configuration without knowing the specifics of device names and paths.
You can deploy the OSD for each device and each host by defining a YAML or JSON file.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Hosts are added to the cluster.
- All manager and monitor daemons are deployed.
Procedure
On the monitor node, create the osd_spec.yaml file:
Example
[root@host01 ~]# touch osd_spec.yaml
Edit the osd_spec.yaml file to include the following details:
Syntax
service_type: osd
service_id: SERVICE_ID
placement:
  host_pattern: '*' # optional
data_devices: # optional
  model: DISK_MODEL_NAME # optional
  paths:
    - /DEVICE_PATH
osds_per_device: NUMBER_OF_DEVICES # optional
db_devices: # optional
  size: # optional
  all: true # optional
  paths:
    - /DEVICE_PATH
encrypted: true
Simple scenarios: In these cases, all the nodes have the same set-up.
Example
service_type: osd
service_id: osd_spec_default
placement:
  host_pattern: '*'
data_devices:
  all: true
  paths:
    - /dev/sdb
encrypted: true
Example
service_type: osd
service_id: osd_spec_default
placement:
  host_pattern: '*'
data_devices:
  size: '80G'
db_devices:
  size: '40G:'
  paths:
    - /dev/sdc
Simple scenario: In this case, all the nodes have the same setup with OSD devices created in raw mode, without an LVM layer.
Example
service_type: osd
service_id: all-available-devices
encrypted: "true"
method: raw
placement:
  host_pattern: "*"
data_devices:
  all: "true"
Advanced scenario: This creates the desired layout by using all HDDs as data_devices, with two SSDs assigned as dedicated DB or WAL devices. The remaining SSDs are used as data_devices, with the NVMe devices from the named vendor assigned as dedicated DB or WAL devices.
Example
service_type: osd
service_id: osd_spec_hdd
placement:
  host_pattern: '*'
data_devices:
  rotational: 1
db_devices:
  model: Model-name
  limit: 2
---
service_type: osd
service_id: osd_spec_ssd
placement:
  host_pattern: '*'
data_devices:
  model: Model-name
db_devices:
  vendor: Vendor-name
Advanced scenario with non-uniform nodes: This applies different OSD specs to different hosts depending on the host_pattern key.
Example
service_type: osd
service_id: osd_spec_node_one_to_five
placement:
  host_pattern: 'node[1-5]'
data_devices:
  rotational: 1
db_devices:
  rotational: 0
---
service_type: osd
service_id: osd_spec_six_to_ten
placement:
  host_pattern: 'node[6-10]'
data_devices:
  model: Model-name
db_devices:
  model: Model-name
Advanced scenario with dedicated WAL and DB devices:
Example
service_type: osd
service_id: osd_using_paths
placement:
  hosts:
    - host01
    - host02
data_devices:
  paths:
    - /dev/sdb
db_devices:
  paths:
    - /dev/sdc
wal_devices:
  paths:
    - /dev/sdd
Advanced scenario with multiple OSDs per device:
Example
service_type: osd
service_id: multiple_osds
placement:
  hosts:
    - host01
    - host02
osds_per_device: 4
data_devices:
  paths:
    - /dev/sdb
For pre-created volumes, edit the osd_spec.yaml file to include the following details:
Syntax
service_type: osd
service_id: SERVICE_ID
placement:
  hosts:
    - HOSTNAME
data_devices: # optional
  model: DISK_MODEL_NAME # optional
  paths:
    - /DEVICE_PATH
db_devices: # optional
  size: # optional
  all: true # optional
  paths:
    - /DEVICE_PATH
Example
service_type: osd
service_id: osd_spec
placement:
  hosts:
    - machine1
data_devices:
  paths:
    - /dev/vg_hdd/lv_hdd
db_devices:
  paths:
    - /dev/vg_nvme/lv_nvme
For OSDs by ID, edit the osd_spec.yaml file to include the following details:
Note
This configuration is applicable for Red Hat Ceph Storage 5.3z1 and later releases. For earlier releases, use pre-created LVM.
Syntax
service_type: osd
service_id: OSD_BY_ID_HOSTNAME
placement:
  hosts:
    - HOSTNAME
data_devices: # optional
  model: DISK_MODEL_NAME # optional
  paths:
    - /DEVICE_PATH
db_devices: # optional
  size: # optional
  all: true # optional
  paths:
    - /DEVICE_PATH
Example
service_type: osd
service_id: osd_by_id_host01
placement:
  hosts:
    - host01
data_devices:
  paths:
    - /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi0-0-0-5
db_devices:
  paths:
    - /dev/disk/by-id/nvme-nvme.1b36-31323334-51454d55204e564d65204374726c-00000001
For OSDs by path, edit the osd_spec.yaml file to include the following details:
Note
This configuration is applicable for Red Hat Ceph Storage 5.3z1 and later releases. For earlier releases, use pre-created LVM.
Syntax
service_type: osd
service_id: OSD_BY_PATH_HOSTNAME
placement:
  hosts:
    - HOSTNAME
data_devices: # optional
  model: DISK_MODEL_NAME # optional
  paths:
    - /DEVICE_PATH
db_devices: # optional
  size: # optional
  all: true # optional
  paths:
    - /DEVICE_PATH
Example
service_type: osd
service_id: osd_by_path_host01
placement:
  hosts:
    - host01
data_devices:
  paths:
    - /dev/disk/by-path/pci-0000:0d:00.0-scsi-0:0:0:4
db_devices:
  paths:
    - /dev/disk/by-path/pci-0000:00:02.0-nvme-1
Mount the YAML file under a directory in the container:
Example
[root@host01 ~]# cephadm shell --mount osd_spec.yaml:/var/lib/ceph/osd/osd_spec.yaml
Navigate to the directory:
Example
[ceph: root@host01 /]# cd /var/lib/ceph/osd/
Before deploying OSDs, do a dry run:
Note
This step gives a preview of the deployment, without deploying the daemons.
Example
[ceph: root@host01 osd]# ceph orch apply -i osd_spec.yaml --dry-run
Deploy OSDs using service specification:
Syntax
ceph orch apply -i FILE_NAME.yml
Example
[ceph: root@host01 osd]# ceph orch apply -i osd_spec.yaml
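The dry run and the apply steps can be chained so that nothing is deployed until you have reviewed the preview. This is a sketch; the preview_then_apply_spec function name and its --yes confirmation argument are hypothetical conventions of this example, not options of ceph orch apply.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: always preview an OSD specification with --dry-run,
# and apply it only when the caller passes --yes as the second argument.
# Usage: preview_then_apply_spec FILE [--yes]
preview_then_apply_spec() {
  local spec="$1" confirm="${2:-}"
  if [ ! -r "$spec" ]; then
    echo "cannot read $spec" >&2
    return 1
  fi
  # Preview the deployment without creating any daemons.
  ceph orch apply -i "$spec" --dry-run || return 1
  if [ "$confirm" != "--yes" ]; then
    echo "Preview only. Re-run with --yes to apply $spec." >&2
    return 0
  fi
  ceph orch apply -i "$spec"
}
```

Run it once without --yes, review the preview, and then run it again with --yes.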
Verification
List the service:
Example
[ceph: root@host01 /]# ceph orch ls osd
View the details of the node and devices:
Example
[ceph: root@host01 /]# ceph osd tree
Additional Resources
- See the Advanced service specifications and filters for deploying OSDs section in the Red Hat Ceph Storage Operations Guide.
6.10. Removing the OSD daemons using the Ceph Orchestrator
You can remove the OSD from a cluster by using Cephadm.
Removing an OSD from a cluster involves two steps:
- Evacuating all placement groups (PGs) from the OSD.
- Removing the PG-free OSD from the cluster.
The --zap option removes the volume groups, logical volumes, and the LVM metadata.
After removing OSDs, if the drives the OSDs were deployed on become available again, cephadm might automatically try to deploy more OSDs on these drives if they match an existing drivegroup specification. If you deployed the OSDs you are removing with a specification and do not want any new OSDs deployed on the drives after removal, modify the drivegroup specification before removal. If you used the --all-available-devices option while deploying OSDs, set unmanaged: true to stop it from picking up new drives at all. For other deployments, modify the specification. See the Deploying Ceph OSDs using advanced service specifications section for more details.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Hosts are added to the cluster.
- Ceph Monitor, Ceph Manager and Ceph OSD daemons are deployed on the storage cluster.
Procedure
Log into the Cephadm shell:
Example
[root@host01 ~]# cephadm shell
Check the device and the node from which the OSD has to be removed:
Example
[ceph: root@host01 /]# ceph osd tree
Remove the OSD:
Syntax
ceph orch osd rm OSD_ID [--replace] [--force] --zap
Example
[ceph: root@host01 /]# ceph orch osd rm 0 --zap
Note
If you remove the OSD from the storage cluster without an option, such as --replace, the device is removed from the storage cluster completely. If you want to use the same device for deploying OSDs, you have to first zap the device before adding it to the storage cluster.
Optional: To remove multiple OSDs from a specific node, run the following command:
Syntax
ceph orch osd rm OSD_ID OSD_ID --zap
Example
[ceph: root@host01 /]# ceph orch osd rm 2 5 --zap
Check the status of the OSD removal:
Example
[ceph: root@host01 /]# ceph orch osd rm status
OSD  HOST    STATE                    PGS  REPLACE  FORCE  ZAP   DRAIN STARTED AT
9    host01  done, waiting for purge  0    False    False  True  2023-06-06 17:50:50.525690
10   host03  done, waiting for purge  0    False    False  True  2023-06-06 17:49:38.731533
11   host02  done, waiting for purge  0    False    False  True  2023-06-06 17:48:36.641105
When no PGs are left on the OSD, it is decommissioned and removed from the cluster.
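When you remove many OSDs, a script may need to wait until the removal queue no longer lists an OSD. The following sketch parses the ceph orch osd rm status output shown above, assuming that the first line is a header and that the first column of every other line holds the OSD ID. The function names and the polling interval are hypothetical.

```bash
#!/usr/bin/env bash
# Reads `ceph orch osd rm status` output on stdin and succeeds if the given
# OSD ID still appears in the first column of a non-header line.
osd_removal_pending() {
  awk -v id="$1" 'NR > 1 && $1 == id { found = 1 } END { exit (found ? 0 : 1) }'
}

# Hypothetical polling loop: wait until OSD $1 leaves the removal queue,
# checking every $2 seconds (30 by default).
wait_for_osd_removal() {
  local id="$1" interval="${2:-30}"
  while ceph orch osd rm status | osd_removal_pending "$id"; do
    echo "osd.$id is still being removed; checking again in ${interval}s"
    sleep "$interval"
  done
  echo "osd.$id is no longer in the removal queue"
}
```

For example, wait_for_osd_removal 9 60 polls once a minute until osd.9 disappears from the status output, after which you can run ceph osd tree to verify the result.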
Verification
Verify the details of the devices and the nodes from which the Ceph OSDs are removed:
Example
[ceph: root@host01 /]# ceph osd tree
Additional Resources
- See the Deploying Ceph OSDs on all available devices section in the Red Hat Ceph Storage Operations Guide for more information.
- See the Deploying Ceph OSDs on specific devices and hosts section in the Red Hat Ceph Storage Operations Guide for more information.
- See the Zapping devices for Ceph OSD deployment section in the Red Hat Ceph Storage Operations Guide for more information on clearing space on devices.
6.11. Replacing the OSDs using the Ceph Orchestrator
When disks fail, you can replace the physical storage device and reuse the same OSD ID to avoid having to reconfigure the CRUSH map.
You can replace the OSDs from the cluster using the --replace option.
If you want to replace a single OSD, see Deploying Ceph OSDs on specific devices and hosts. If you want to deploy OSDs on all available devices, see Deploying Ceph OSDs on all available devices.
This option preserves the OSD ID when used with the ceph orch osd rm command. The OSD is not permanently removed from the CRUSH hierarchy, but is assigned the destroyed flag. This flag is used to determine which OSD IDs can be reused in the next OSD deployment.
Similar to the rm command, replacing an OSD from a cluster involves two steps:
- Evacuating all placement groups (PGs) from the OSD.
- Removing the PG-free OSD from the cluster.
If you use an OSD specification for deployment, the newly added disk is assigned the OSD ID of its replaced counterpart.
After removing OSDs, if the drives the OSDs were deployed on become available again, cephadm might automatically try to deploy more OSDs on these drives if they match an existing drivegroup specification. If you deployed the OSDs you are removing with a specification and do not want any new OSDs deployed on the drives after removal, modify the drivegroup specification before removal. If you used the --all-available-devices option while deploying OSDs, set unmanaged: true to stop it from picking up new drives at all. For other deployments, modify the specification. See the Deploying Ceph OSDs using advanced service specifications section for more details.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Hosts are added to the cluster.
- Monitor, Manager, and OSD daemons are deployed on the storage cluster.
- A new OSD that replaces the removed OSD must be created on the same host from which the OSD was removed.
Procedure
Log into the Cephadm shell:
Example
[root@host01 ~]# cephadm shell
Dump and save a mapping of your OSD configurations for future reference:
Example
[ceph: root@node /]# ceph osd metadata -f plain | grep device_paths
"device_paths": "sde=/dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:0:1,sdi=/dev/disk/by-path/pci-0000:03:00.0-scsi-0:1:0:1",
"device_paths": "sde=/dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:0:1,sdf=/dev/disk/by-path/pci-0000:03:00.0-scsi-0:1:0:1",
"device_paths": "sdd=/dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:0:2,sdg=/dev/disk/by-path/pci-0000:03:00.0-scsi-0:1:0:2",
"device_paths": "sdd=/dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:0:2,sdh=/dev/disk/by-path/pci-0000:03:00.0-scsi-0:1:0:2",
"device_paths": "sdd=/dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:0:2,sdk=/dev/disk/by-path/pci-0000:03:00.0-scsi-0:1:0:2",
"device_paths": "sdc=/dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:0:3,sdl=/dev/disk/by-path/pci-0000:03:00.0-scsi-0:1:0:3",
"device_paths": "sdc=/dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:0:3,sdj=/dev/disk/by-path/pci-0000:03:00.0-scsi-0:1:0:3",
"device_paths": "sdc=/dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:0:3,sdm=/dev/disk/by-path/pci-0000:03:00.0-scsi-0:1:0:3",
[.. output omitted ..]
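The saved mapping is also useful later, for example when you need the short device names to zap after the OSDs are removed. The following sketch extracts those names from the device_paths lines. The device_names_from_paths function name is hypothetical, and it assumes that each entry has the NAME=PATH form shown in the output above.

```bash
#!/usr/bin/env bash
# Reads `ceph osd metadata -f plain | grep device_paths` output on stdin and
# prints one short device name (for example sde) per line.
device_names_from_paths() {
  # Keep only the quoted value, split it at commas, and drop the =PATH part.
  grep -o '"device_paths": "[^"]*"' |
    sed -e 's/^"device_paths": "//' -e 's/"$//' |
    tr ',' '\n' |
    cut -d '=' -f 1
}
```

For example, ceph osd metadata -f plain | grep device_paths | device_names_from_paths | sort -u lists each device name once.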
Check the device and the node from which the OSD has to be replaced:
Example
[ceph: root@host01 /]# ceph osd tree
Replace the OSD:
Important
If the storage cluster has health_warn or other errors associated with it, check and try to fix any errors before replacing the OSD to avoid data loss.
Syntax
ceph orch osd rm OSD_ID --replace [--force]
The --force option can be used when there are ongoing operations on the storage cluster.
Example
[ceph: root@host01 /]# ceph orch osd rm 0 --replace
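The Important note above can be enforced in a script by refusing to start the replacement unless the cluster reports HEALTH_OK. The replace_osd_if_healthy function is a hypothetical sketch; it relies only on ceph health printing the overall status at the start of its output, and you might deliberately bypass it when a warning is unrelated to the OSD being replaced.

```bash
#!/usr/bin/env bash
# Hypothetical guard: start an OSD replacement only when `ceph health`
# reports HEALTH_OK, as a precaution against data loss.
# Usage: replace_osd_if_healthy OSD_ID
replace_osd_if_healthy() {
  local id="$1" health
  health=$(ceph health) || return 1
  case "$health" in
    HEALTH_OK*)
      ceph orch osd rm "$id" --replace
      ;;
    *)
      echo "Refusing to replace osd.$id: cluster reports '$health'" >&2
      return 1
      ;;
  esac
}
```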
Check the status of the OSD replacement:
Example
[ceph: root@host01 /]# ceph orch osd rm status
Pause the orchestrator so that it does not apply any existing OSD specification:
Example
[ceph: root@node /]# ceph orch pause
[ceph: root@node /]# ceph orch status
Backend: cephadm
Available: Yes
Paused: Yes
Zap the OSD devices that have been removed:
Example
[ceph: root@node /]# ceph orch device zap node.example.com /dev/sdi --force
zap successful for /dev/sdi on node.example.com
[ceph: root@node /]# ceph orch device zap node.example.com /dev/sdf --force
zap successful for /dev/sdf on node.example.com
Resume the Orchestrator from pause mode:
Example
[ceph: root@node /]# ceph orch resume
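Pausing, zapping, and resuming belong together: if a zap fails partway, the orchestrator should not be left paused. The following sketch always runs ceph orch resume after attempting every zap. The zap_while_paused function name is hypothetical, and it assumes that you have already confirmed, for example from the saved device_paths mapping, which devices belonged to the removed OSDs.

```bash
#!/usr/bin/env bash
# Hypothetical helper: pause the orchestrator, zap the given devices on one
# host, and resume the orchestrator even if a zap fails.
# Usage: zap_while_paused HOSTNAME DEVICE_PATH [DEVICE_PATH ...]
zap_while_paused() {
  local host="${1:-}"
  shift || true
  if [ -z "$host" ] || [ "$#" -eq 0 ]; then
    echo "usage: zap_while_paused HOSTNAME DEVICE_PATH [DEVICE_PATH ...]" >&2
    return 1
  fi
  local dev rc=0
  ceph orch pause || return 1
  for dev in "$@"; do
    # Keep going after a failure so that the orchestrator is always resumed.
    ceph orch device zap "$host" "$dev" --force || rc=1
  done
  ceph orch resume || rc=1
  return "$rc"
}
```

For example, zap_while_paused node.example.com /dev/sdi /dev/sdf performs the pause, zap, and resume steps above in one call and returns a non-zero status if any step failed.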
Check the status of the OSD replacement:
Example
[ceph: root@node /]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME      STATUS  REWEIGHT  PRI-AFF
-1         0.77112  root default
-3         0.77112      host node
 0    hdd  0.09639          osd.0      up   1.00000  1.00000
 1    hdd  0.09639          osd.1      up   1.00000  1.00000
 2    hdd  0.09639          osd.2      up   1.00000  1.00000
 3    hdd  0.09639          osd.3      up   1.00000  1.00000
 4    hdd  0.09639          osd.4      up   1.00000  1.00000
 5    hdd  0.09639          osd.5      up   1.00000  1.00000
 6    hdd  0.09639          osd.6      up   1.00000  1.00000
 7    hdd  0.09639          osd.7      up   1.00000  1.00000
[.. output omitted ..]
Verification
Verify the details of the devices and the nodes from which the Ceph OSDs are replaced:
Example
[ceph: root@host01 /]# ceph osd tree
You can see an OSD with the same ID as the one you replaced running on the same host.
Verify that the db_device for the newly deployed OSDs is the replaced db_device:
Example
[ceph: root@host01 /]# ceph osd metadata 0 | grep bluefs_db_devices
    "bluefs_db_devices": "nvme0n1",
[ceph: root@host01 /]# ceph osd metadata 1 | grep bluefs_db_devices
    "bluefs_db_devices": "nvme0n1",
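To check many OSDs at once, you can extract bluefs_db_devices from the ceph osd metadata output for each ID and compare it with the device you expect. The db_device_of_osd and check_db_device functions are a hypothetical sketch that assumes the "bluefs_db_devices": "NAME" layout shown above.

```bash
#!/usr/bin/env bash
# Print the bluefs_db_devices value reported for one OSD.
db_device_of_osd() {
  ceph osd metadata "$1" | sed -n 's/.*"bluefs_db_devices": "\([^"]*\)".*/\1/p'
}

# Hypothetical check: succeed only if every listed OSD uses EXPECTED as its
# DB device, and report each OSD that does not.
# Usage: check_db_device EXPECTED OSD_ID [OSD_ID ...]
check_db_device() {
  local expected="$1"; shift
  local id actual rc=0
  for id in "$@"; do
    actual=$(db_device_of_osd "$id")
    if [ "$actual" != "$expected" ]; then
      echo "osd.$id: DB device is '$actual', expected '$expected'" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

For example, check_db_device nvme0n1 0 1 succeeds for the output shown above.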
Additional Resources
- See the Deploying Ceph OSDs on all available devices section in the Red Hat Ceph Storage Operations Guide for more information.
- See the Deploying Ceph OSDs on specific devices and hosts section in the Red Hat Ceph Storage Operations Guide for more information.
6.12. Replacing the OSDs with pre-created LVM
After purging the OSD with the ceph-volume lvm zap command, if the directory is not present, you can replace the OSDs with the OSD service specification file with the pre-created LVM.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- A failed OSD.
Procedure
Log into the Cephadm shell:
Example
[root@host01 ~]# cephadm shell
Remove the OSD:
Syntax
ceph orch osd rm OSD_ID [--replace]
Example
[ceph: root@host01 /]# ceph orch osd rm 8 --replace
Scheduled OSD(s) for removal
Verify the OSD is destroyed:
Example
[ceph: root@host01 /]# ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME        STATUS     REWEIGHT  PRI-AFF
 -1         0.32297  root default
 -9         0.05177      host host10
  3    hdd  0.01520          osd.3           up   1.00000  1.00000
 13    hdd  0.02489          osd.13          up   1.00000  1.00000
 17    hdd  0.01169          osd.17          up   1.00000  1.00000
-13         0.05177      host host11
  2    hdd  0.01520          osd.2           up   1.00000  1.00000
 15    hdd  0.02489          osd.15          up   1.00000  1.00000
 19    hdd  0.01169          osd.19          up   1.00000  1.00000
 -7         0.05835      host host12
 20    hdd  0.01459          osd.20          up   1.00000  1.00000
 21    hdd  0.01459          osd.21          up   1.00000  1.00000
 22    hdd  0.01459          osd.22          up   1.00000  1.00000
 23    hdd  0.01459          osd.23          up   1.00000  1.00000
 -5         0.03827      host host04
  1    hdd  0.01169          osd.1           up   1.00000  1.00000
  6    hdd  0.01129          osd.6           up   1.00000  1.00000
  7    hdd  0.00749          osd.7           up   1.00000  1.00000
  9    hdd  0.00780          osd.9           up   1.00000  1.00000
 -3         0.03816      host host05
  0    hdd  0.01169          osd.0           up   1.00000  1.00000
  8    hdd  0.01129          osd.8    destroyed         0  1.00000
 12    hdd  0.00749          osd.12          up   1.00000  1.00000
 16    hdd  0.00769          osd.16          up   1.00000  1.00000
-15         0.04237      host host06
  5    hdd  0.01239          osd.5           up   1.00000  1.00000
 10    hdd  0.01540          osd.10          up   1.00000  1.00000
 11    hdd  0.01459          osd.11          up   1.00000  1.00000
-11         0.04227      host host07
  4    hdd  0.01239          osd.4           up   1.00000  1.00000
 14    hdd  0.01529          osd.14          up   1.00000  1.00000
 18    hdd  0.01459          osd.18          up   1.00000  1.00000
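In a large tree, finding the destroyed OSD by eye is error-prone. The following sketch prints the STATUS value for one OSD by looking for the osd.ID name in the ceph osd tree output and printing the field that follows it. The osd_status function name is hypothetical, and the approach assumes the column order shown above, where STATUS directly follows NAME.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the STATUS column (for example up, down, or
# destroyed) of osd.ID from `ceph osd tree`; fail if the OSD is not listed.
# Usage: osd_status OSD_ID
osd_status() {
  ceph osd tree | awk -v name="osd.$1" '
    { for (i = 1; i < NF; i++) if ($i == name) { print $(i + 1); found = 1; exit } }
    END { exit (found ? 0 : 1) }'
}
```

For example, osd_status 8 prints destroyed for the output shown above, which confirms that the OSD can be zapped and recreated.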
Zap and remove the OSD using the ceph-volume command:
Syntax
ceph-volume lvm zap --osd-id OSD_ID
Example
[ceph: root@host01 /]# ceph-volume lvm zap --osd-id 8
Zapping: /dev/vg1/data-lv2
Closing encrypted path /dev/mapper/l4D6ql-Prji-IzH4-dfhF-xzuf-5ETl-jNRcXC
Running command: /usr/sbin/cryptsetup remove /dev/mapper/l4D6ql-Prji-IzH4-dfhF-xzuf-5ETl-jNRcXC
Running command: /usr/bin/dd if=/dev/zero of=/dev/vg1/data-lv2 bs=1M count=10 conv=fsync
stderr: 10+0 records in
10+0 records out
stderr: 10485760 bytes (10 MB, 10 MiB) copied, 0.034742 s, 302 MB/s
Zapping successful for OSD: 8
Check the OSD topology:
Example
[ceph: root@host01 /]# ceph-volume lvm list
Recreate the OSD with a specification file corresponding to that specific OSD topology:
Example
[ceph: root@host01 /]# cat osd.yml
service_type: osd
service_id: osd_service
placement:
  hosts:
  - host03
data_devices:
  paths:
  - /dev/vg1/data-lv2
db_devices:
  paths:
  - /dev/vg1/db-lv1
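If you repeat this procedure for several OSDs, generating the specification can reduce copy errors. The sketch below defines a hypothetical helper, write_lvm_osd_spec, that writes a file with the same structure as the osd.yml example for one host, one data LV, and one DB LV. The service_id value osd_service comes from the example. The function itself is an assumption and not a cephadm tool.

```shell
# Hypothetical helper: write an OSD service specification that reuses
# pre-created logical volumes, in the same shape as the osd.yml example.
# Arguments: host, data LV path, DB LV path, output file.
write_lvm_osd_spec() {
  host=$1 data_lv=$2 db_lv=$3 out=$4
  cat > "$out" <<EOF
service_type: osd
service_id: osd_service
placement:
  hosts:
  - $host
data_devices:
  paths:
  - $data_lv
db_devices:
  paths:
  - $db_lv
EOF
}

# Usage, inside the cephadm shell:
#   write_lvm_osd_spec host03 /dev/vg1/data-lv2 /dev/vg1/db-lv1 osd.yml
#   ceph orch apply -i osd.yml
```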
Apply the updated specification file:
Example
[ceph: root@host01 /]# ceph orch apply -i osd.yml
Scheduled osd.osd_service update...
Verify the OSD is back:
Example
[ceph: root@host01 /]# ceph -s
[ceph: root@host01 /]# ceph osd tree
6.13. Replacing the OSDs in a non-colocated scenario
When an OSD fails in a non-colocated scenario, you can replace the WAL/DB devices. The procedure is the same for DB and WAL devices: edit the paths under db_devices for DB devices and the paths under wal_devices for WAL devices.
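The following sketch shows where each set of paths goes. It writes a non-colocated specification that lists DB and WAL devices under separate keys. It assumes a host with a dedicated WAL device; the /dev/sdi path and the osds-wal.yml file name are hypothetical. To replace a WAL device, you would edit only the paths under wal_devices.

```shell
# Write a hypothetical non-colocated OSD spec with separate DB and WAL devices.
# When you replace a DB or a WAL device, only the paths under db_devices or
# wal_devices change; /dev/sdi is an assumed WAL device for illustration.
cat > osds-wal.yml <<'EOF'
service_type: osd
service_id: non-colocated
placement:
  host_pattern: 'ceph*'
data_devices:
  paths:
  - /dev/sdb
  - /dev/sdc
db_devices:
  paths:
  - /dev/sdd
wal_devices:
  paths:
  - /dev/sdi
EOF
```

Previewing such a file with ceph orch apply -i osds-wal.yml --dry-run before applying it shows which devices cephadm would use.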
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Daemons are non-colocated.
- A failed OSD.
Procedure
Identify the devices in the cluster:
Example
[root@host01 ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 20G 0 disk
├─sda1 8:1 0 1G 0 part /boot
└─sda2 8:2 0 19G 0 part
  ├─rhel-root 253:0 0 17G 0 lvm /
  └─rhel-swap 253:1 0 2G 0 lvm [SWAP]
sdb 8:16 0 10G 0 disk
└─ceph--5726d3e9--4fdb--4eda--b56a--3e0df88d663f-osd--block--3ceb89ec--87ef--46b4--99c6--2a56bac09ff0 253:2 0 10G 0 lvm
sdc 8:32 0 10G 0 disk
└─ceph--d7c9ab50--f5c0--4be0--a8fd--e0313115f65c-osd--block--37c370df--1263--487f--a476--08e28bdbcd3c 253:4 0 10G 0 lvm
sdd 8:48 0 10G 0 disk
├─ceph--1774f992--44f9--4e78--be7b--b403057cf5c3-osd--db--31b20150--4cbc--4c2c--9c8f--6f624f3bfd89 253:7 0 2.5G 0 lvm
└─ceph--1774f992--44f9--4e78--be7b--b403057cf5c3-osd--db--1bee5101--dbab--4155--a02c--e5a747d38a56 253:9 0 2.5G 0 lvm
sde 8:64 0 10G 0 disk
sdf 8:80 0 10G 0 disk
└─ceph--412ee99b--4303--4199--930a--0d976e1599a2-osd--block--3a99af02--7c73--4236--9879--1fad1fe6203d 253:6 0 10G 0 lvm
sdg 8:96 0 10G 0 disk
└─ceph--316ca066--aeb6--46e1--8c57--f12f279467b4-osd--block--58475365--51e7--42f2--9681--e0c921947ae6 253:8 0 10G 0 lvm
sdh 8:112 0 10G 0 disk
├─ceph--d7064874--66cb--4a77--a7c2--8aa0b0125c3c-osd--db--0dfe6eca--ba58--438a--9510--d96e6814d853 253:3 0 5G 0 lvm
└─ceph--d7064874--66cb--4a77--a7c2--8aa0b0125c3c-osd--db--26b70c30--8817--45de--8843--4c0932ad2429 253:5 0 5G 0 lvm
sr0
Log into the Cephadm shell:
Example
[root@host01 ~]# cephadm shell
Identify the OSDs and their DB device:
Example
[ceph: root@host01 /]# ceph-volume lvm list /dev/sdh
====== osd.2 =======
  [db]          /dev/ceph-d7064874-66cb-4a77-a7c2-8aa0b0125c3c/osd-db-0dfe6eca-ba58-438a-9510-d96e6814d853
      block device              /dev/ceph-5726d3e9-4fdb-4eda-b56a-3e0df88d663f/osd-block-3ceb89ec-87ef-46b4-99c6-2a56bac09ff0
      block uuid                GkWLoo-f0jd-Apj2-Zmwj-ce0h-OY6J-UuW8aD
      cephx lockbox secret
      cluster fsid              fa0bd9dc-e4c4-11ed-8db4-001a4a00046e
      cluster name              ceph
      crush device class
      db device                 /dev/ceph-d7064874-66cb-4a77-a7c2-8aa0b0125c3c/osd-db-0dfe6eca-ba58-438a-9510-d96e6814d853
      db uuid                   6gSPoc-L39h-afN3-rDl6-kozT-AX9S-XR20xM
      encrypted                 0
      osd fsid                  3ceb89ec-87ef-46b4-99c6-2a56bac09ff0
      osd id                    2
      osdspec affinity          non-colocated
      type                      db
      vdo                       0
      devices                   /dev/sdh
====== osd.5 =======
  [db]          /dev/ceph-d7064874-66cb-4a77-a7c2-8aa0b0125c3c/osd-db-26b70c30-8817-45de-8843-4c0932ad2429
      block device              /dev/ceph-d7c9ab50-f5c0-4be0-a8fd-e0313115f65c/osd-block-37c370df-1263-487f-a476-08e28bdbcd3c
      block uuid                Eay3I7-fcz5-AWvp-kRcI-mJaH-n03V-Zr0wmJ
      cephx lockbox secret
      cluster fsid              fa0bd9dc-e4c4-11ed-8db4-001a4a00046e
      cluster name              ceph
      crush device class
      db device                 /dev/ceph-d7064874-66cb-4a77-a7c2-8aa0b0125c3c/osd-db-26b70c30-8817-45de-8843-4c0932ad2429
      db uuid                   mwSohP-u72r-DHcT-BPka-piwA-lSwx-w24N0M
      encrypted                 0
      osd fsid                  37c370df-1263-487f-a476-08e28bdbcd3c
      osd id                    5
      osdspec affinity          non-colocated
      type                      db
      vdo                       0
      devices                   /dev/sdh
In the osds.yml file, set the unmanaged parameter to true; otherwise, cephadm redeploys the OSDs:
Example
[ceph: root@host01 /]# cat osds.yml
service_type: osd
service_id: non-colocated
unmanaged: true
placement:
  host_pattern: 'ceph*'
data_devices:
  paths:
  - /dev/sdb
  - /dev/sdc
  - /dev/sdf
  - /dev/sdg
db_devices:
  paths:
  - /dev/sdd
  - /dev/sdh
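This procedure sets unmanaged to true here and back to false later, so a small helper can make the edit repeatable. The following minimal sketch assumes that unmanaged is a top-level key at the start of a line, as in the example. The function name set_spec_unmanaged is hypothetical. The helper uses sed -i with a backup suffix, which works with both GNU and BSD sed.

```shell
# Hypothetical helper: set the top-level unmanaged key of an OSD spec file
# to true or false, adding the key after service_id if it is missing.
set_spec_unmanaged() {
  file=$1
  value=$2
  case $value in
    true|false) ;;
    *) echo "usage: set_spec_unmanaged FILE true|false" >&2; return 2 ;;
  esac
  if grep -q '^unmanaged:' "$file"; then
    sed -i.bak "s/^unmanaged:.*/unmanaged: $value/" "$file"
  else
    sed -i.bak "/^service_id:/a\\
unmanaged: $value" "$file"
  fi
  rm -f "$file.bak"
  # Succeed only if the file now carries the requested value.
  grep -q "^unmanaged: $value\$" "$file"
}

# Usage, before and after replacing the DB device:
#   set_spec_unmanaged osds.yml true
#   set_spec_unmanaged osds.yml false
```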
Apply the updated specification file:
Example
[ceph: root@host01 /]# ceph orch apply -i osds.yml
Scheduled osd.non-colocated update...
Check the status:
Example
[ceph: root@host01 /]# ceph orch ls
NAME               PORTS        RUNNING  REFRESHED  AGE  PLACEMENT
alertmanager       ?:9093,9094      1/1  9m ago     4d   count:1
crash                               3/4  4d ago     4d   *
grafana            ?:3000           1/1  9m ago     4d   count:1
mgr                                 1/2  4d ago     4d   count:2
mon                                 3/5  4d ago     4d   count:5
node-exporter      ?:9100           3/4  4d ago     4d   *
osd.non-colocated                     8  4d ago     5s   <unmanaged>
prometheus         ?:9095           1/1  9m ago     4d   count:1
Remove the OSDs. Use the --zap option to zap the backing devices and the --replace option to retain the OSD IDs:
Example
[ceph: root@host01 /]# ceph orch osd rm 2 5 --zap --replace
Scheduled OSD(s) for removal
Check the status:
Example
[ceph: root@host01 /]# ceph osd df tree | egrep -i "ID|host02|osd.2|osd.5"
ID  CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  DATA     OMAP  META    AVAIL   %USE   VAR   PGS  STATUS     TYPE NAME
-5         0.04877         -  55 GiB   15 GiB  4.1 MiB   0 B  60 MiB  40 GiB  27.27  1.17    -             host02
 2    hdd  0.01219   1.00000  15 GiB  5.0 GiB  996 KiB   0 B  15 MiB  10 GiB  33.33  1.43    0  destroyed      osd.2
 5    hdd  0.01219   1.00000  15 GiB  5.0 GiB  1.0 MiB   0 B  15 MiB  10 GiB  33.33  1.43    0  destroyed      osd.5
Edit the osds.yml specification file to change the unmanaged parameter to false, and replace the path to the DB device if it changed after the device was physically replaced:
Example
[ceph: root@host01 /]# cat osds.yml
service_type: osd
service_id: non-colocated
unmanaged: false
placement:
  host_pattern: 'ceph01*'
data_devices:
  paths:
  - /dev/sdb
  - /dev/sdc
  - /dev/sdf
  - /dev/sdg
db_devices:
  paths:
  - /dev/sdd
  - /dev/sde
In the above example, /dev/sdh is replaced with /dev/sde.
Important
If you use the same host specification file to replace the faulty DB device on a single OSD node, modify the host_pattern option to specify only that OSD node. Otherwise, the deployment fails and the new DB device is not found on the other hosts.
Reapply the specification file with the --dry-run option to ensure that the OSDs will be deployed with the new DB device:
Example
[ceph: root@host01 /]# ceph orch apply -i osds.yml --dry-run
WARNING! Dry-Runs are snapshots of a certain point in time and are bound to the current inventory setup. If any of these conditions change, the preview will be invalid. Please make sure to have a minimal timeframe between planning and applying the specs.
####################
SERVICESPEC PREVIEWS
####################
+---------+------+--------+-------------+
|SERVICE  |NAME  |ADD_TO  |REMOVE_FROM  |
+---------+------+--------+-------------+
+---------+------+--------+-------------+
################
OSDSPEC PREVIEWS
################
+---------+-------+-------+----------+----------+-----+
|SERVICE  |NAME   |HOST   |DATA      |DB        |WAL  |
+---------+-------+-------+----------+----------+-----+
|osd      |non-colocated  |host02 |/dev/sdb  |/dev/sde  |-    |
|osd      |non-colocated  |host02 |/dev/sdc  |/dev/sde  |-    |
+---------+-------+-------+----------+----------+-----+
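You can parse the OSDSPEC rows to check the preview without reading the table by eye. The sketch below assumes the pipe-delimited table layout shown above (SERVICE, NAME, HOST, DATA, DB, WAL), which might change between releases. The function names dryrun_osd_rows and dryrun_all_db_on are hypothetical.

```shell
# Hypothetical helper: read `ceph orch apply --dry-run` output on stdin and
# print "HOST DATA DB" for every row of the OSDSPEC preview table.
dryrun_osd_rows() {
  awk -F'|' '
    {
      svc = $2; gsub(/[ \t]+/, "", svc)
      if (svc != "osd") next
      host = $4; data = $5; db = $6
      gsub(/[ \t]+/, "", host); gsub(/[ \t]+/, "", data); gsub(/[ \t]+/, "", db)
      print host, data, db
    }'
}

# Hypothetical check: succeed only if at least one OSD row is previewed and
# every previewed OSD uses the given DB device.
dryrun_all_db_on() {
  dryrun_osd_rows | awk -v want="$1" '
    { n++; if ($3 != want) bad = 1 }
    END { exit (bad || n == 0) }'
}

# Usage, inside the cephadm shell:
#   ceph orch apply -i osds.yml --dry-run | dryrun_all_db_on /dev/sde
```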
Apply the specification file:
Example
[ceph: root@host01 /]# ceph orch apply -i osds.yml
Scheduled osd.non-colocated update...
Check that the OSDs are redeployed:
Example
[ceph: root@host01 /]# ceph osd df tree | egrep -i "ID|host02|osd.2|osd.5"
ID  CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  DATA     OMAP  META    AVAIL   %USE   VAR   PGS  STATUS  TYPE NAME
-5         0.04877         -  55 GiB   15 GiB  4.5 MiB   0 B  60 MiB  40 GiB  27.27  1.17    -          host host02
 2    hdd  0.01219   1.00000  15 GiB  5.0 GiB  1.1 MiB   0 B  15 MiB  10 GiB  33.33  1.43    0      up      osd.2
 5    hdd  0.01219   1.00000  15 GiB  5.0 GiB  1.1 MiB   0 B  15 MiB  10 GiB  33.33  1.43    0      up      osd.5
Verification
From the OSD host where the OSDs are redeployed, verify that they are on the new DB device:
Example
[ceph: root@host01 /]# ceph-volume lvm list /dev/sde
====== osd.2 =======
  [db]          /dev/ceph-15ce813a-8a4c-46d9-ad99-7e0845baf15e/osd-db-1998a02e-5e67-42a9-b057-e02c22bbf461
      block device              /dev/ceph-a4afcb78-c804-4daf-b78f-3c7ad1ed0379/osd-block-564b3d2f-0f85-4289-899a-9f98a2641979
      block uuid                ITPVPa-CCQ5-BbFa-FZCn-FeYt-c5N4-ssdU41
      cephx lockbox secret
      cluster fsid              fa0bd9dc-e4c4-11ed-8db4-001a4a00046e
      cluster name              ceph
      crush device class
      db device                 /dev/ceph-15ce813a-8a4c-46d9-ad99-7e0845baf15e/osd-db-1998a02e-5e67-42a9-b057-e02c22bbf461
      db uuid                   HF1bYb-fTK7-0dcB-CHzW-xvNn-dCym-KKdU5e
      encrypted                 0
      osd fsid                  564b3d2f-0f85-4289-899a-9f98a2641979
      osd id                    2
      osdspec affinity          non-colocated
      type                      db
      vdo                       0
      devices                   /dev/sde
====== osd.5 =======
  [db]          /dev/ceph-15ce813a-8a4c-46d9-ad99-7e0845baf15e/osd-db-6c154191-846d-4e63-8c57-fc4b99e182bd
      block device              /dev/ceph-b37c8310-77f9-4163-964b-f17b4c29c537/osd-block-b42a4f1f-8e19-4416-a874-6ff5d305d97f
      block uuid                0LuPoz-ao7S-UL2t-BDIs-C9pl-ct8J-xh5ep4
      cephx lockbox secret
      cluster fsid              fa0bd9dc-e4c4-11ed-8db4-001a4a00046e
      cluster name              ceph
      crush device class
      db device                 /dev/ceph-15ce813a-8a4c-46d9-ad99-7e0845baf15e/osd-db-6c154191-846d-4e63-8c57-fc4b99e182bd
      db uuid                   SvmXms-iWkj-MTG7-VnJj-r5Mo-Moiw-MsbqVD
      encrypted                 0
      osd fsid                  b42a4f1f-8e19-4416-a874-6ff5d305d97f
      osd id                    5
      osdspec affinity          non-colocated
      type                      db
      vdo                       0
      devices                   /dev/sde
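The following sketch scripts this verification. It extracts the OSD IDs from the section headers of ceph-volume lvm list output, such as ====== osd.2 =======, and checks that every expected ID uses the device. The function names osd_ids_from_lvm_list and device_hosts_osds are hypothetical. The parsing assumes the plain-text header format shown above.

```shell
# Hypothetical helper: read `ceph-volume lvm list` output on stdin and print
# the ID from every OSD section header, such as "====== osd.2 =======".
osd_ids_from_lvm_list() {
  sed -n 's/^[[:space:]]*=\{1,\} osd\.\([0-9]\{1,\}\) =\{1,\}[[:space:]]*$/\1/p'
}

# Hypothetical check: succeed only if every expected OSD ID uses the device.
# Arguments: device path, then one or more OSD IDs.
device_hosts_osds() {
  device=$1
  shift
  ids=$(ceph-volume lvm list "$device" | osd_ids_from_lvm_list)
  for want in "$@"; do
    printf '%s\n' "$ids" | grep -qx "$want" || {
      echo "osd.$want not found on $device" >&2
      return 1
    }
  done
}

# Usage, on the OSD host inside the cephadm shell:
#   device_hosts_osds /dev/sde 2 5
```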
6.14. Stopping the removal of the OSDs using the Ceph Orchestrator
You can stop the removal only for OSDs that are queued for removal. Stopping the removal resets the OSD to its initial state and takes it off the removal queue.
If the OSD is in the process of removal, then you cannot stop the process.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Hosts are added to the cluster.
- Monitor, Manager and OSD daemons are deployed on the cluster.
- An OSD removal process is initiated.
Procedure
Log into the Cephadm shell:
Example
[root@host01 ~]# cephadm shell
Check the device and the node from which the OSD removal was initiated:
Example
[ceph: root@host01 /]# ceph osd tree
Stop the removal of the queued OSD:
Syntax
ceph orch osd rm stop OSD_ID
Example
[ceph: root@host01 /]# ceph orch osd rm stop 0
Check the status of the OSD removal:
Example
[ceph: root@host01 /]# ceph orch osd rm status
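When several OSDs are queued, you can repeat the stop command for each ID. The following is a minimal sketch. The function name stop_osd_removals is hypothetical. It runs ceph orch osd rm stop once per OSD ID instead of assuming that the command accepts several IDs at once, and then shows the remaining removal queue.

```shell
# Hypothetical helper: stop the queued removal of one or more OSDs, then show
# what is still left in the removal queue.
stop_osd_removals() {
  if [ "$#" -eq 0 ]; then
    echo "usage: stop_osd_removals OSD_ID [OSD_ID...]" >&2
    return 2
  fi
  for osd_id in "$@"; do
    ceph orch osd rm stop "$osd_id" || return 1
  done
  ceph orch osd rm status
}

# Usage, inside the cephadm shell:
#   stop_osd_removals 0 3
```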
Verification
Verify the details of the devices and the nodes from which the Ceph OSDs were queued for removal:
Example
[ceph: root@host01 /]# ceph osd tree
Additional Resources
- See Removing the OSD daemons using the Ceph Orchestrator section in the Red Hat Ceph Storage Operations Guide for more information.
6.15. Activating the OSDs using the Ceph Orchestrator
You can activate the OSDs in the cluster in cases where the operating system of the host was reinstalled.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Hosts are added to the cluster.
- Monitor, Manager and OSD daemons are deployed on the storage cluster.
Procedure
Log into the Cephadm shell:
Example
[root@host01 ~]# cephadm shell
After the operating system of the host is reinstalled, activate the OSDs:
Syntax
ceph cephadm osd activate HOSTNAME
Example
[ceph: root@host01 /]# ceph cephadm osd activate host03
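If several hosts were reinstalled, you can run the activation step for each of them in turn. This sketch wraps the documented ceph cephadm osd activate and ceph orch ps --service_name=osd commands in a loop. The function name activate_osds_on_hosts and the host names in the usage comment are hypothetical.

```shell
# Hypothetical helper: activate the existing OSDs on each reinstalled host in
# turn, stopping at the first failure.
activate_osds_on_hosts() {
  for host in "$@"; do
    echo "Activating OSDs on $host"
    ceph cephadm osd activate "$host" || return 1
  done
  # List the OSD daemons so that the result can be checked.
  ceph orch ps --service_name=osd
}

# Usage, inside the cephadm shell:
#   activate_osds_on_hosts host03 host04
```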
Verification
List the service:
Example
[ceph: root@host01 /]# ceph orch ls
List the hosts, daemons, and processes:
Syntax
ceph orch ps --service_name=SERVICE_NAME
Example
[ceph: root@host01 /]# ceph orch ps --service_name=osd
6.16. Observing the data migration
When you add an OSD to or remove an OSD from the CRUSH map, Ceph begins rebalancing the data by migrating placement groups to the new or existing OSDs. You can observe the data migration by using the ceph -w command.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- An OSD was recently added or removed.
Procedure
To observe the data migration:
Example
[ceph: root@host01 /]# ceph -w
- Watch as the placement group states change from active+clean to active, some degraded objects, and finally active+clean when migration completes.
- To exit the utility, press Ctrl + C.
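Because ceph -w is interactive, you might prefer polling in scripts. The following sketch assumes the one-line summary that ceph pg stat prints, in which a total of the form N pgs: is followed by per-state counts such as 87 active+clean. That format can vary between releases. The sketch treats any state that starts with active+clean, for example active+clean+scrubbing, as clean. The function names pg_stat_all_clean and wait_for_active_clean and the 30-second interval are hypothetical.

```shell
# Hypothetical helper: read a `ceph pg stat` summary line on stdin and succeed
# only if the PGs in states starting with active+clean add up to the total.
pg_stat_all_clean() {
  awk '
    NR == 1 {
      for (i = 2; i <= NF; i++) {
        if ($i == "pgs:") total = $(i - 1)
        if ($i ~ /^active\+clean/) clean += $(i - 1)
      }
    }
    END {
      ok = (total != "" && total + 0 > 0 && total + 0 == clean + 0)
      exit (ok ? 0 : 1)
    }'
}

# Hypothetical wait loop: check every 30 seconds until migration completes.
wait_for_active_clean() {
  until ceph pg stat | pg_stat_all_clean; do
    sleep 30
  done
  echo "All placement groups are active+clean"
}
```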
6.17. Recalculating the placement groups
Placement groups (PGs) define how the data of a pool is spread across the available OSDs. A placement group is built on the redundancy algorithm that the pool uses. For 3-way replication, the redundancy is defined as three different OSDs. For erasure-coded pools, the number of OSDs to use is defined by the number of chunks.
When you define a pool, the number of placement groups determines the granularity with which the data is spread across all available OSDs. The higher the number, the more evenly the capacity load can be balanced. However, because the number of placement groups also matters when data is reconstructed, choose it carefully up front. A tool is available to support the calculation.
Over the lifetime of a storage cluster, a pool might grow beyond the initially anticipated limits. As the number of drives grows, recalculating the number of placement groups is recommended. The number of placement groups per OSD should be around 100. As you add more OSDs to the storage cluster, the number of PGs per OSD decreases over time. For example, starting with 120 drives in the storage cluster and setting the pg_num of the pool to 4000 results in 100 PGs per OSD, given a replication factor of three. If the storage cluster later grows to ten times as many OSDs, the number of PGs per OSD drops to only ten. Because a small number of PGs per OSD tends to distribute capacity unevenly, consider adjusting the number of PGs per pool.
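The following sketch reproduces the arithmetic in the example above. It assumes a single replicated pool, so that PGs per OSD equals pg_num multiplied by the replica count and divided by the number of OSDs. With several pools, the per-pool values add up. The function names pgs_per_osd and target_pg_num are hypothetical, and the default target of 100 comes from the guideline above.

```shell
# Hypothetical helpers for the single-pool arithmetic described above.
# PGs per OSD = pg_num * replica count / number of OSDs (integer division).
pgs_per_osd() {
  pg_num=$1 size=$2 osds=$3
  echo $(( pg_num * size / osds ))
}

# pg_num needed to reach a target number of PGs per OSD (default 100).
target_pg_num() {
  osds=$1 size=$2 target=${3:-100}
  echo $(( osds * target / size ))
}

pgs_per_osd 4000 3 120    # 120 OSDs: prints 100
pgs_per_osd 4000 3 1200   # ten times the OSDs: prints 10
target_pg_num 1200 3      # pg_num for about 100 PGs per OSD: prints 40000
```

You apply a new value online with ceph osd pool set POOL_NAME pg_num VALUE, where POOL_NAME and VALUE are placeholders. The PG calculator referenced in the Additional Resources remains the recommended way to choose the value.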
You can adjust the number of placement groups online. Changing it involves not only recalculating the PG numbers but also relocating data, which can be a lengthy process. However, data remains available throughout.
Avoid very high numbers of PGs per OSD, because reconstruction of all PGs on a failed OSD starts at once. Timely reconstruction requires a high number of IOPS, which might not be available. The result can be deep I/O queues and high latency that render the storage cluster unusable, or long healing times.
Additional Resources
- See the PG calculator for calculating the values by a given use case.
- See the Erasure Code Pools chapter in the Red Hat Ceph Storage Strategies Guide for more information.