
Chapter 2. Handling a disk failure


As a storage administrator, you will have to deal with a disk failure at some point over the lifetime of the storage cluster. Testing and simulating a disk failure before a real failure happens ensures that you are ready when the real thing does happen.

Here is the high-level workflow for replacing a failed disk:

  1. Find the failed OSD.
  2. Take OSD out.
  3. Stop the OSD daemon on the node.
  4. Check Ceph’s status.
  5. Remove the OSD from the CRUSH map.
  6. Delete the OSD authorization.
  7. Remove the OSD from the storage cluster.
  8. Unmount the filesystem on node.
  9. Replace the failed drive.
  10. Add the OSD back to the storage cluster.
  11. Check Ceph’s status.

2.1. Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • A failed disk.

2.2. Disk failures

Ceph is designed for fault tolerance, which means that Ceph can operate in a degraded state without losing data. Ceph can still operate even if a data storage drive fails. The degraded state means that the extra copies of the data stored on other OSDs will backfill automatically to other OSDs in the storage cluster. When an OSD gets marked down, this can mean the drive has failed.

When a drive fails, the OSD status will initially be down, but the OSD will still be in the storage cluster. Networking issues can also mark an OSD as down even if it is really up. First, check for any network issues in the environment. If the networking checks out okay, then it is likely the OSD drive has failed.

Modern servers typically deploy with hot-swappable drives, allowing you to pull a failed drive and replace it with a new one without bringing down the node. However, with Ceph you also have to remove the software-defined part of the OSD.
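
To tell a failed drive apart from a transient network problem, compare what the monitors report with what you can reach on the network. The following is a minimal sketch of the checks, where $OSD_NODE is a placeholder for the host carrying the suspect OSD:

# ceph health detail
# ceph osd tree | grep -i down
# ping -c 3 $OSD_NODE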

2.2.1. Replacing a failed OSD disk

The general procedure for replacing an OSD involves removing the OSD from the storage cluster, replacing the drive and then recreating the OSD.

Important

When replacing the BlueStore block.db disk that contains the BlueStore OSD’s database partitions, Red Hat only supports redeploying all OSDs by using Ansible. A corrupt block.db file impacts all OSDs that are included in that block.db file.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • A failed disk.

Procedure

  1. Check storage cluster health:

    # ceph health
  2. Identify the OSD location in the CRUSH hierarchy:

    # ceph osd tree | grep -i down
  3. On the OSD node, try to start the OSD:

    # systemctl start ceph-osd@$OSD_ID

    If the command indicates that the OSD is already running, there might be a heartbeat or networking issue. If you cannot restart the OSD, then the drive might have failed.

    Note

    If the OSD is down, then the OSD will eventually get marked out. This is normal behavior for Ceph Storage. When the OSD gets marked out, other OSDs with copies of the failed OSD’s data will begin backfilling to ensure that the required number of copies exist within the storage cluster. While the storage cluster is backfilling, the cluster will be in a degraded state.

  4. For containerized deployments of Ceph, try to start the OSD container by referencing the drive associated with the OSD:

    # systemctl start ceph-osd@$OSD_DRIVE

    If the command indicates that the OSD is already running, there might be a heartbeat or networking issue. If you cannot restart the OSD, then the drive might have failed.

    Note

    The drive associated with the OSD can be determined by Mapping a container OSD ID to a drive.

  5. Check the failed OSD’s mount point:

    Note

    For containerized deployments of Ceph, if the OSD is down the container will be down and the OSD drive will be unmounted, so you cannot run df to check its mount point. Use another method to determine if the OSD drive has failed. For example, run smartctl on the drive from the container node.

    # df -h

    If you cannot restart the OSD, you can check the mount point. If the mount point no longer appears, then you can try remounting the OSD drive and restarting the OSD. If you cannot restore the mount point, then you might have a failed OSD drive.

    Using the smartctl utility can help determine if the drive is healthy. For example:

    # yum install smartmontools
    # smartctl -H /dev/$DRIVE

    If the drive has failed, you will need to replace it.

  6. Stop the OSD process:

    # systemctl stop ceph-osd@$OSD_ID
    1. If using FileStore, then flush the journal to disk:

      # ceph-osd -i $OSD_ID --flush-journal
  7. For containerized deployments of Ceph, stop the OSD container by referencing the drive associated with the OSD:

    # systemctl stop ceph-osd@$OSD_DRIVE
  8. Take the OSD out of the storage cluster:

    # ceph osd out $OSD_ID
  9. Watch the storage cluster to ensure that the failed OSD’s data is backfilling to other OSDs:

    # ceph -w
  10. Remove the OSD from the CRUSH Map:

    # ceph osd crush remove osd.$OSD_ID
    Note

    This step is only needed if you are permanently removing the OSD and not redeploying it.

  11. Remove the OSD’s authentication keys:

    # ceph auth del osd.$OSD_ID
  12. Verify that the keys for the OSD are not listed:

    # ceph auth list
  13. Remove the OSD from the storage cluster:

    # ceph osd rm osd.$OSD_ID
  14. Unmount the failed drive path:

    Note

    For containerized deployments of Ceph, if the OSD is down the container will be down and the OSD drive will be unmounted. In this case there is nothing to unmount and this step can be skipped.

    # umount /var/lib/ceph/osd/$CLUSTER_NAME-$OSD_ID
  15. Replace the physical drive. Refer to the hardware vendor’s documentation for the node. If the drive is hot-swappable, simply replace the failed drive with a new drive. If the drive is not hot-swappable and the node contains multiple OSDs, you might need to bring the node down to replace the physical drive. If you need to bring the node down temporarily, set the noout flag on the cluster to prevent backfilling:

    # ceph osd set noout

    Once you have replaced the drive and brought the node and its OSDs back online, remove the noout setting:

    # ceph osd unset noout

    Allow the new drive to appear under the /dev/ directory and make a note of the drive path before proceeding further.

  16. Find the OSD drive and format the disk.
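
    For example, a minimal sketch, assuming the replacement drive appeared as /dev/sdb (a placeholder for your environment), is to locate the new drive with lsblk and clear any leftover metadata before recreating the OSD:

    # lsblk
    # ceph-volume lvm zap /dev/sdb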
  17. Recreate the OSD:

    1. Using Ansible.
    2. Using the command-line interface.
  18. Check the CRUSH hierarchy to ensure it is accurate:

    # ceph osd tree

    If you are not satisfied with the location of the OSD in the CRUSH hierarchy, you can move it with the move command:

    # ceph osd crush move $BUCKET_TO_MOVE $BUCKET_TYPE=$PARENT_BUCKET
  19. Verify the OSD is online.
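
    One way to do this, assuming $OSD_ID is the ID assigned to the recreated OSD, is to confirm that it reports up in the OSD tree and that the cluster returns to a healthy state once backfilling completes:

    # ceph osd tree | grep osd.$OSD_ID
    # ceph -s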

2.2.2. Replacing an OSD drive while retaining the OSD ID

When replacing a failed OSD drive, you can keep the original OSD ID and CRUSH map entry.

Note

The ceph-volume lvm commands default to BlueStore for OSDs. To use FileStore OSDs, use the --filestore, --data, and --journal options.

See the Preparing the OSD Data and Journal Drives section for more details.
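
For example, a FileStore variant of the create command used in the procedure below might look like the following, assuming /dev/sdb as the data device and /dev/sdc1 as the journal partition (both placeholders):

$ ceph-volume lvm create --filestore --osd-id $OSD_ID --data /dev/sdb --journal /dev/sdc1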

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • A failed disk.

Procedure

  1. Destroy the OSD:

    ceph osd destroy $OSD_ID --yes-i-really-mean-it

    Example

    $ ceph osd destroy 1 --yes-i-really-mean-it

  2. Optional: If the replacement disk was used previously, then you need to zap the disk:

    ceph-volume lvm zap $DEVICE

    Example

    $ ceph-volume lvm zap /dev/sdb

  3. Create the new OSD with the existing OSD ID:

    ceph-volume lvm create --osd-id $OSD_ID --data $DEVICE

    Example

    $ ceph-volume lvm create --osd-id 1 --data /dev/sdb
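
    The new OSD starts with the original ID and keeps its existing CRUSH map entry. A quick check, assuming OSD ID 1 as in the example above:

    $ ceph osd tree | grep osd.1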

2.3. Simulating a disk failure

There are two disk failure scenarios: hard and soft. A hard failure means the disk must be replaced. A soft failure might be an issue with the device driver or some other software component.

In the case of a soft failure, replacing the disk might not be needed. If a disk is replaced, then steps need to be followed to remove the failed disk and add the replacement disk to Ceph. To simulate a soft disk failure, delete the device from the system. Choose a device and delete it:

echo 1 > /sys/block/$DEVICE/device/delete

Example

[root@ceph1 ~]# echo 1 > /sys/block/sdb/device/delete

In the Ceph OSD log on the OSD node, you can see that Ceph detected the failure and started the recovery process automatically.

Example

[root@ceph1 ~]# tail -50 /var/log/ceph/ceph-osd.1.log
2017-02-02 12:15:27.490889 7f3e1fa3d800 -1 ^[[0;31m ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-1: (5) Input/output error^[[0m
2017-02-02 12:34:17.777898 7fb7df1e7800  0 set uid:gid to 167:167 (ceph:ceph)
2017-02-02 12:34:17.777933 7fb7df1e7800  0 ceph version 10.2.3-17.el7cp (ca9d57c0b140eb5cea9de7f7133260271e57490e), process ceph-osd, pid 1752
2017-02-02 12:34:17.788885 7fb7df1e7800  0 pidfile_write: ignore empty --pid-file
2017-02-02 12:34:17.870322 7fb7df1e7800  0 filestore(/var/lib/ceph/osd/ceph-1) backend xfs (magic 0x58465342)
2017-02-02 12:34:17.871028 7fb7df1e7800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2017-02-02 12:34:17.871035 7fb7df1e7800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2017-02-02 12:34:17.871059 7fb7df1e7800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features: splice is supported
2017-02-02 12:34:17.897839 7fb7df1e7800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2017-02-02 12:34:17.897985 7fb7df1e7800  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_feature: extsize is disabled by conf
2017-02-02 12:34:17.921162 7fb7df1e7800  1 leveldb: Recovering log #22
2017-02-02 12:34:17.947335 7fb7df1e7800  1 leveldb: Level-0 table #24: started
2017-02-02 12:34:18.001952 7fb7df1e7800  1 leveldb: Level-0 table #24: 810464 bytes OK
2017-02-02 12:34:18.044554 7fb7df1e7800  1 leveldb: Delete type=0 #22
2017-02-02 12:34:18.045383 7fb7df1e7800  1 leveldb: Delete type=3 #20
2017-02-02 12:34:18.058061 7fb7df1e7800  0 filestore(/var/lib/ceph/osd/ceph-1) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2017-02-02 12:34:18.105482 7fb7df1e7800  1 journal _open /var/lib/ceph/osd/ceph-1/journal fd 18: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 1
2017-02-02 12:34:18.130293 7fb7df1e7800  1 journal _open /var/lib/ceph/osd/ceph-1/journal fd 18: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 1
2017-02-02 12:34:18.130992 7fb7df1e7800  1 filestore(/var/lib/ceph/osd/ceph-1) upgrade
2017-02-02 12:34:18.136547 7fb7df1e7800  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
2017-02-02 12:34:18.142863 7fb7df1e7800  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
2017-02-02 12:34:18.255019 7fb7df1e7800  0 osd.1 51 crush map has features 2200130813952, adjusting msgr requires for clients
2017-02-02 12:34:18.255041 7fb7df1e7800  0 osd.1 51 crush map has features 2200130813952 was 8705, adjusting msgr requires for mons
2017-02-02 12:34:18.255048 7fb7df1e7800  0 osd.1 51 crush map has features 2200130813952, adjusting msgr requires for osds
2017-02-02 12:34:18.296256 7fb7df1e7800  0 osd.1 51 load_pgs
2017-02-02 12:34:18.561604 7fb7df1e7800  0 osd.1 51 load_pgs opened 152 pgs
2017-02-02 12:34:18.561648 7fb7df1e7800  0 osd.1 51 using 0 op queue with priority op cut off at 64.
2017-02-02 12:34:18.562603 7fb7df1e7800 -1 osd.1 51 log_to_monitors {default=true}
2017-02-02 12:34:18.650204 7fb7df1e7800  0 osd.1 51 done with init, starting boot process
2017-02-02 12:34:19.274937 7fb7b78ba700  0 -- 192.168.122.83:6801/1752 >> 192.168.122.81:6801/2620 pipe(0x7fb7ec4d1400 sd=127 :6801 s=0 pgs=0 cs=0 l=0 c=0x7fb7ec42e480).accept connect_seq 0 vs existing 0 state connecting

Looking at the OSD tree, you can also see that the OSD is marked down.

[root@ceph1 ~]# ceph osd tree
ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.28976 root default
-2 0.09659     host ceph3
 1 0.09659         osd.1       down 1.00000          1.00000
-3 0.09659     host ceph1
 2 0.09659         osd.2       up  1.00000          1.00000
-4 0.09659     host ceph2
 0 0.09659         osd.0       up  1.00000          1.00000
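
To return the environment to normal after the simulation, you can rediscover the deleted device by rescanning its SCSI host. The following is a minimal sketch, assuming the device was attached to host0 (adjust the host number for your environment):

# echo "- - -" > /sys/class/scsi_host/host0/scan

Once the device reappears under the /dev/ directory, restart the OSD, or follow the replacement procedure above if the OSD does not recover.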