Chapter 13. Handling a node failure
As a storage administrator, you can experience a whole node failing within the storage cluster, and handling a node failure is similar to handling a disk failure. With a node failure, instead of Ceph recovering placement groups (PGs) for only one disk, all PGs on the disks within that node must be recovered. Ceph will detect that the OSDs are all down and automatically start the recovery process, known as self-healing.
There are three node failure scenarios.
- Replacing the node by using the root and Ceph OSD disks from the failed node.
- Replacing the node by reinstalling the operating system and using the Ceph OSD disks from the failed node.
- Replacing the node by reinstalling the operating system and using all new Ceph OSD disks.
For a high-level workflow for each node replacement scenario, see Workflow for replacing a node in the Operations Guide (https://docs.redhat.com/en/documentation/red_hat_ceph_storage/8/html-single/operations_guide/index).
Prerequisites
- A running Red Hat Ceph Storage cluster.
- A failed node.
13.1. Considerations before adding or removing a node
One of the outstanding features of Ceph is the ability to add or remove Ceph OSD nodes at run time. This means that you can resize the storage cluster capacity or replace hardware without taking down the storage cluster.
The ability to serve Ceph clients while the storage cluster is in a degraded state also has operational benefits. For example, you can add, remove, or replace hardware during regular business hours, rather than working overtime or on weekends. However, adding and removing Ceph OSD nodes can have a significant impact on performance.
Before you add or remove Ceph OSD nodes, consider the effects on storage cluster performance:
- Whether you are expanding or reducing the storage cluster capacity, adding or removing Ceph OSD nodes induces backfilling as the storage cluster rebalances. During that rebalancing time period, Ceph uses additional resources, which can impact storage cluster performance.
- In a production Ceph storage cluster, a Ceph OSD node has a particular hardware configuration that facilitates a particular type of storage strategy.
- Since a Ceph OSD node is part of a CRUSH hierarchy, the performance impact of adding or removing a node typically affects the performance of pools that use the CRUSH ruleset.
Additional Resources
- See the Red Hat Ceph Storage Storage Strategies Guide for more details.
13.2. Workflow for replacing a node
There are three node failure scenarios. Use these high-level workflows for each scenario when replacing a node.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- A failed node.
13.2.1. Replacing the node by using the root and Ceph OSD disks from the failed node
Use the root and Ceph OSD disks from the failed node to replace the node.
Procedure
Disable backfilling.
Syntax
ceph osd set noout
ceph osd set noscrub
ceph osd set nodeep-scrub
Example
[ceph: root@host01 /]# ceph osd set noout
[ceph: root@host01 /]# ceph osd set noscrub
[ceph: root@host01 /]# ceph osd set nodeep-scrub
- Replace the node, taking the disks from the old node, and adding them to the new node.
Enable backfilling.
Syntax
ceph osd unset noout
ceph osd unset noscrub
ceph osd unset nodeep-scrub
Example
[ceph: root@host01 /]# ceph osd unset noout
[ceph: root@host01 /]# ceph osd unset noscrub
[ceph: root@host01 /]# ceph osd unset nodeep-scrub
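The set and unset steps above differ only in the verb, so they can be wrapped in one helper. The following is a minimal shell sketch, not part of the documented procedure: the function name `maintenance_flags` and the `CEPH` override variable are hypothetical, and only the `ceph osd set` and `ceph osd unset` commands come from the steps above.

```shell
# Command used to reach the cluster. Override it (for example CEPH=echo)
# to preview the commands without running them. Hypothetical convention.
CEPH="${CEPH:-ceph}"

# maintenance_flags set|unset
# Applies or clears the three flags used in the procedure above.
maintenance_flags() {
  case "$1" in
    set|unset) ;;
    *) echo "usage: maintenance_flags set|unset" >&2; return 2 ;;
  esac
  for flag in noout noscrub nodeep-scrub; do
    # Stop at the first failure so a partial change is noticed.
    $CEPH osd "$1" "$flag" || return 1
  done
}
```

Running `maintenance_flags set` before the hardware swap and `maintenance_flags unset` afterwards issues the same commands as the Syntax blocks above.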
13.2.2. Replacing the node by reinstalling the operating system and using the Ceph OSD disks from the failed node
Reinstall the operating system and use the Ceph OSD disks from the failed node to replace the node.
Procedure
Disable backfilling.
Syntax
ceph osd set noout
ceph osd set noscrub
ceph osd set nodeep-scrub
Example
[ceph: root@host01 /]# ceph osd set noout
[ceph: root@host01 /]# ceph osd set noscrub
[ceph: root@host01 /]# ceph osd set nodeep-scrub
Create a backup of the Ceph configuration.
Syntax
cp /etc/ceph/ceph.conf /PATH_TO_BACKUP_LOCATION/ceph.conf
Example
[ceph: root@host01 /]# cp /etc/ceph/ceph.conf /some/backup/location/ceph.conf
- Replace the node and add the Ceph OSD disks from the failed node.
Configure disks as JBOD.
Note: This should be done by the storage administrator.
Install the operating system. For more information about operating system requirements, see Operating system requirements for Red Hat Ceph Storage. For more information about installing the operating system, see the Red Hat Enterprise Linux product documentation.
Note: This should be done by the system administrator.
Restore the Ceph configuration.
Syntax
cp /PATH_TO_BACKUP_LOCATION/ceph.conf /etc/ceph/ceph.conf
Example
[ceph: root@host01 /]# cp /some/backup/location/ceph.conf /etc/ceph/ceph.conf
- Add the new node to the storage cluster using the Ceph Orchestrator commands. Ceph daemons are placed automatically on the respective node. For more information, see Adding a Ceph OSD node.
Enable backfilling.
Syntax
ceph osd unset noout
ceph osd unset noscrub
ceph osd unset nodeep-scrub
Example
[ceph: root@host01 /]# ceph osd unset noout
[ceph: root@host01 /]# ceph osd unset noscrub
[ceph: root@host01 /]# ceph osd unset nodeep-scrub
13.2.3. Replacing the node by reinstalling the operating system and using all new Ceph OSD disks
Reinstall the operating system and use all new Ceph OSD disks to replace the node.
Procedure
Disable backfilling.
Syntax
ceph osd set noout
ceph osd set noscrub
ceph osd set nodeep-scrub
Example
[ceph: root@host01 /]# ceph osd set noout
[ceph: root@host01 /]# ceph osd set noscrub
[ceph: root@host01 /]# ceph osd set nodeep-scrub
- Remove all OSDs on the failed node from the storage cluster. For more information, see Removing a Ceph OSD node.
Create a backup of the Ceph configuration.
Syntax
cp /etc/ceph/ceph.conf /PATH_TO_BACKUP_LOCATION/ceph.conf
Example
[ceph: root@host01 /]# cp /etc/ceph/ceph.conf /some/backup/location/ceph.conf
- Replace the node and install the new Ceph OSD disks.
Configure disks as JBOD.
Note: This should be done by the storage administrator.
Install the operating system. For more information about operating system requirements, see Operating system requirements for Red Hat Ceph Storage. For more information about installing the operating system, see the Red Hat Enterprise Linux product documentation.
Note: This should be done by the system administrator.
- Add the new node to the storage cluster using the Ceph Orchestrator commands. Ceph daemons are placed automatically on the respective node. For more information, see Adding a Ceph OSD node.
Enable backfilling.
Syntax
ceph osd unset noout
ceph osd unset noscrub
ceph osd unset nodeep-scrub
Example
[ceph: root@host01 /]# ceph osd unset noout
[ceph: root@host01 /]# ceph osd unset noscrub
[ceph: root@host01 /]# ceph osd unset nodeep-scrub
13.3. Performance considerations
The following factors typically affect a storage cluster’s performance when adding or removing Ceph OSD nodes:
- Ceph clients place load on the I/O interface to Ceph; that is, the clients place load on a pool. A pool maps to a CRUSH ruleset. The underlying CRUSH hierarchy allows Ceph to place data across failure domains. If the underlying Ceph OSD node involves a pool that is experiencing high client load, the client load could significantly affect recovery time and reduce performance. Because write operations require data replication for durability, write-intensive client loads in particular can increase the time for the storage cluster to recover.
- Generally, the capacity you are adding or removing affects the storage cluster’s time to recover. In addition, the storage density of the node you add or remove might also affect recovery times. For example, a node with 36 OSDs typically takes longer to recover than a node with 12 OSDs.
- When removing nodes, you MUST ensure that you have sufficient spare capacity so that you will not reach full ratio or near full ratio. If the storage cluster reaches full ratio, Ceph will suspend write operations to prevent data loss.
- A Ceph OSD node maps to at least one Ceph CRUSH hierarchy, and the hierarchy maps to at least one pool. Each pool that uses a CRUSH ruleset experiences a performance impact when Ceph OSD nodes are added or removed.
- Replication pools tend to use more network bandwidth to replicate deep copies of the data, whereas erasure coded pools tend to use more CPU to calculate k+m coding chunks. The more copies that exist of the data, the longer it takes for the storage cluster to recover. For example, a larger pool or one that has a greater number of k+m chunks will take longer to recover than a replication pool with fewer copies of the same data.
- Drives, controllers and network interface cards all have throughput characteristics that might impact the recovery time. Generally, nodes with higher throughput characteristics, such as 10 Gbps and SSDs, recover more quickly than nodes with lower throughput characteristics, such as 1 Gbps and SATA drives.
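The effect of link speed and node density on recovery time can be roughed out with simple arithmetic. The sketch below gives a lower bound only: it assumes the network link is the sole bottleneck and ignores client load, drive throughput, and replication or erasure-coding overhead, all of which the factors above say matter. The function name and the sample figures are hypothetical.

```shell
# min_recovery_hours DATA_TB LINK_GBPS
# Hours needed to move DATA_TB terabytes (10^12 bytes) over one link of
# LINK_GBPS gigabits per second, with no other limits applied.
min_recovery_hours() {
  awk -v tb="$1" -v gbps="$2" 'BEGIN {
    if (gbps <= 0) exit 1
    seconds = (tb * 8 * 1000) / gbps   # terabytes -> gigabits, then divide by Gb/s
    printf "%.1f\n", seconds / 3600
  }'
}
```

For example, `min_recovery_hours 72 10` prints `16.0` and `min_recovery_hours 72 1` prints `160.0`: the same 72 TB needs at least ten times longer on a 1 Gbps link than on a 10 Gbps link.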
13.4. Recommendations for adding or removing nodes
Red Hat recommends adding or removing one OSD at a time within a node and allowing the storage cluster to recover before proceeding to the next OSD. This helps to minimize the impact on storage cluster performance. Note that if a node fails, you might need to change the entire node at once, rather than one OSD at a time.
To remove an OSD:
To add an OSD:
When adding or removing Ceph OSD nodes, consider that other ongoing processes also affect storage cluster performance. To reduce the impact on client I/O, Red Hat recommends the following:
Calculate capacity
Before removing a Ceph OSD node, ensure that the storage cluster can backfill the contents of all its OSDs without reaching the full ratio. Reaching the full ratio will cause the storage cluster to refuse write operations.
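The capacity check above amounts to dividing the data already stored by the raw capacity that remains once the node is gone. The following is a minimal sketch of that arithmetic, assuming you read the used and total raw capacity from `ceph df` yourself. The function names are hypothetical, and the 0.85 limit is an assumed near full ratio, so substitute the value configured on your cluster.

```shell
# projected_ratio USED TOTAL NODE
# Projected raw utilization after removing a node of raw capacity NODE,
# with all three values in the same unit (for example TiB).
projected_ratio() {
  awk -v used="$1" -v total="$2" -v node="$3" 'BEGIN {
    remaining = total - node
    if (remaining <= 0) { print "inf"; exit }
    printf "%.3f\n", used / remaining
  }'
}

# fits_after_removal USED TOTAL NODE [LIMIT]
# Exit status 0 when the projected ratio stays below LIMIT.
# LIMIT defaults to 0.85, an assumed near full ratio.
fits_after_removal() {
  ratio=$(projected_ratio "$1" "$2" "$3")
  [ "$ratio" != "inf" ] || return 1
  awk -v r="$ratio" -v l="${4:-0.85}" 'BEGIN { exit !(r + 0 < l + 0) }'
}
```

With 60 TiB used out of 120 TiB and a 24 TiB node being removed, `projected_ratio 60 120 24` prints `0.625`, so `fits_after_removal 60 120 24` succeeds. With 90 TiB used, the same removal would push utilization above the assumed limit and the check fails.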
Temporarily disable scrubbing
Scrubbing is essential to ensuring the durability of the storage cluster’s data; however, it is resource intensive. Before adding or removing a Ceph OSD node, disable scrubbing and deep-scrubbing and let the current scrubbing operations complete before proceeding.
ceph osd set noscrub
ceph osd set nodeep-scrub
Once you have added or removed a Ceph OSD node and the storage cluster has returned to an active+clean state, unset the noscrub and nodeep-scrub settings.
ceph osd unset noscrub
ceph osd unset nodeep-scrub
Limit backfill and recovery
If you have reasonable data durability, there is nothing wrong with operating in a degraded state. For example, you can operate the storage cluster with osd_pool_default_size = 3 and osd_pool_default_min_size = 2. You can tune the storage cluster for the fastest possible recovery time, but doing so significantly affects Ceph client I/O performance. To maintain the highest Ceph client I/O performance, limit the backfill and recovery operations and allow them to take longer.
osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_op_priority = 1
You can also consider setting the sleep and delay parameters, such as osd_recovery_sleep.
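The three throttle settings above are written in configuration-file form. As a hedged sketch, they could also be stored centrally with `ceph config set`. Whether that is the appropriate mechanism for your release is something to confirm in the Configuration Guide. The `limit_recovery` function and the `CEPH` override variable are hypothetical names.

```shell
# Command used to reach the cluster. Override it (for example CEPH=echo)
# to preview the commands without running them. Hypothetical convention.
CEPH="${CEPH:-ceph}"

# limit_recovery [VALUE]
# Sets each throttle named in the text above to VALUE (default 1)
# for all OSD daemons.
limit_recovery() {
  value="${1:-1}"
  for opt in osd_max_backfills osd_recovery_max_active osd_recovery_op_priority; do
    $CEPH config set osd "$opt" "$value" || return 1
  done
}
```

Calling `limit_recovery` with `CEPH=echo` set prints the three commands without contacting the cluster, which is a convenient way to review them first.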
Increase the number of placement groups
Finally, if you are expanding the size of the storage cluster, you may need to increase the number of placement groups. If you determine that you need to expand the number of placement groups, Red Hat recommends making incremental increases in the number of placement groups. Increasing the number of placement groups by a significant amount will cause a considerable degradation in performance.
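One way to read "incremental increases" is to raise the placement group count in stages and let the storage cluster settle between stages. The sketch below lists such stages by doubling. The doubling policy and the `pg_steps` name are assumptions for illustration, not a documented Red Hat recommendation.

```shell
# pg_steps CURRENT TARGET
# Prints the intermediate placement group counts, doubling each time
# and finishing exactly on TARGET.
pg_steps() {
  current="$1"
  target="$2"
  [ "$current" -gt 0 ] || return 1
  while [ "$current" -lt "$target" ]; do
    current=$((current * 2))
    if [ "$current" -gt "$target" ]; then
      current="$target"
    fi
    echo "$current"
  done
}
```

For example, `pg_steps 128 1024` prints 256, 512, and 1024 on separate lines. Each value would then be applied with `ceph osd pool set POOL_NAME pg_num VALUE`, waiting for the storage cluster to return to an active+clean state before moving to the next one.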
13.5. Adding a Ceph OSD node
To expand the capacity of the Red Hat Ceph Storage cluster, you can add an OSD node.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- A provisioned node with a network connection.
Procedure
- Verify that other nodes in the storage cluster can reach the new node by its short host name.
Temporarily disable scrubbing:
Example
[ceph: root@host01 /]# ceph osd set noscrub
[ceph: root@host01 /]# ceph osd set nodeep-scrub
Limit the backfill and recovery features:
Syntax
ceph tell DAEMON_TYPE.* injectargs --OPTION_NAME VALUE [--OPTION_NAME VALUE]
Example
[ceph: root@host01 /]# ceph tell osd.* injectargs --osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1
Extract the cluster’s public SSH keys to a folder:
Syntax
ceph cephadm get-pub-key > ~/PATH
Example
[ceph: root@host01 /]# ceph cephadm get-pub-key > ~/ceph.pub
Copy the Ceph cluster’s public SSH keys to the root user’s authorized_keys file on the new host:
Syntax
ssh-copy-id -f -i ~/PATH root@HOST_NAME_2
Example
[ceph: root@host01 /]# ssh-copy-id -f -i ~/ceph.pub root@host02
Add the new node to the CRUSH map:
Syntax
ceph orch host add NODE_NAME IP_ADDRESS
Example
[ceph: root@host01 /]# ceph orch host add host02 10.10.128.70
Add an OSD for each disk on the node to the storage cluster.
When adding an OSD node to a Red Hat Ceph Storage cluster, Red Hat recommends adding one OSD daemon at a time and allowing the cluster to recover to an active+clean state before proceeding to the next OSD.
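Waiting for an active+clean state between OSDs can be scripted as a polling loop. The following is a rough sketch that assumes the summary line printed by `ceph pg stat` has the form `N pgs: N active+clean; ...`. Check that assumption against the output of your release before relying on it. The function names and the `CEPH` and `POLL_SECONDS` variables are hypothetical.

```shell
# all_pgs_clean
# Reads one "ceph pg stat" summary line on standard input and succeeds
# only when the active+clean count equals the total PG count.
all_pgs_clean() {
  awk '
    BEGIN { ok = 0 }
    NR == 1 {
      total = ""; clean = 0
      for (i = 1; i < NF; i++) {
        if ($(i + 1) == "pgs:") total = $i
        state = $(i + 1)
        sub(/[,;]$/, "", state)          # drop the trailing separator
        if (state == "active+clean") clean = $i
      }
      ok = (total != "" && total + 0 == clean + 0)
    }
    END { exit !ok }
  '
}

# wait_for_clean
# Polls until every placement group reports active+clean.
wait_for_clean() {
  until "${CEPH:-ceph}" pg stat | all_pgs_clean; do
    sleep "${POLL_SECONDS:-30}"
  done
}
```

Calling `wait_for_clean` after adding each OSD blocks until the storage cluster has settled, at which point the next OSD can be added.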
Additional Resources
- See the Setting a Specific Configuration Setting at Runtime section in the Red Hat Ceph Storage Configuration Guide for more details.
- See Adding a Bucket and Moving a Bucket sections in the Red Hat Ceph Storage Storage Strategies Guide for details on placing the node at an appropriate location in the CRUSH hierarchy.
13.6. Removing a Ceph OSD node
To reduce the capacity of a storage cluster, remove an OSD node.
Before removing a Ceph OSD node, ensure that the storage cluster can backfill the contents of all OSDs without reaching the full ratio. Reaching the full ratio will cause the storage cluster to refuse write operations.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to all nodes in the storage cluster.
Procedure
Check the storage cluster’s capacity:
Syntax
ceph df
rados df
ceph osd df
Temporarily disable scrubbing:
Syntax
ceph osd set noscrub
ceph osd set nodeep-scrub
Limit the backfill and recovery features:
Syntax
ceph tell DAEMON_TYPE.* injectargs --OPTION_NAME VALUE [--OPTION_NAME VALUE]
Example
[ceph: root@host01 /]# ceph tell osd.* injectargs --osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1
Remove each OSD on the node from the storage cluster:
See Removing the OSD daemons using the Ceph Orchestrator.
Important: When removing an OSD node from the storage cluster, Red Hat recommends removing one OSD at a time within the node and allowing the cluster to recover to an active+clean state before proceeding to remove the next OSD.
After you remove an OSD, verify that the storage cluster is not approaching the near-full ratio:
Syntax
ceph -s
ceph df
- Repeat this step until all OSDs on the node are removed from the storage cluster.
Once all OSDs are removed, remove the host:
Additional Resources
- See the Setting a specific configuration at runtime section in the Red Hat Ceph Storage Configuration Guide for more details.
13.7. Simulating a node failure
To simulate a hard node failure, power off the node and reinstall the operating system.
Prerequisites
- A healthy running Red Hat Ceph Storage cluster.
- Root-level access to all nodes on the storage cluster.
Procedure
Check the storage cluster’s capacity to understand the impact of removing the node:
Example
[ceph: root@host01 /]# ceph df
[ceph: root@host01 /]# rados df
[ceph: root@host01 /]# ceph osd df
Optionally, disable recovery and backfilling:
Example
[ceph: root@host01 /]# ceph osd set noout
[ceph: root@host01 /]# ceph osd set noscrub
[ceph: root@host01 /]# ceph osd set nodeep-scrub
- Shut down the node.
If you are changing the host name, remove the node from the CRUSH map:
Example
[ceph: root@host01 /]# ceph osd crush rm host03
Check the status of the storage cluster:
Example
[ceph: root@host01 /]# ceph -s
- Reinstall the operating system on the node.
Add the new node:
- See Adding hosts using the Ceph Orchestrator.
Optionally, enable recovery and backfilling:
Example
[ceph: root@host01 /]# ceph osd unset noout
[ceph: root@host01 /]# ceph osd unset noscrub
[ceph: root@host01 /]# ceph osd unset nodeep-scrub
Check Ceph’s health:
Example
[ceph: root@host01 /]# ceph -s
Additional Resources
- See the Red Hat Ceph Storage Installation Guide for more details.