12.8. Troubleshooting LVM RAID
You can troubleshoot various issues in LVM RAID devices to correct data errors, recover devices, or replace failed devices.
12.8.1. Checking data coherency in a RAID logical volume (RAID scrubbing)
LVM provides scrubbing support for RAID logical volumes. RAID scrubbing is the process of reading all the data and parity blocks in an array and checking to see whether they are coherent.
Procedure
Optional: Limit the I/O bandwidth that the scrubbing process uses.
When you perform a RAID scrubbing operation, the background I/O required by the sync operations can crowd out other I/O to LVM devices, such as updates to volume group metadata. This might cause the other LVM operations to slow down. You can control the rate of the scrubbing operation by implementing recovery throttling. Add the following options to the lvchange --syncaction commands in the next steps:
--maxrecoveryrate Rate[bBsSkKmMgG]
- Sets the maximum recovery rate so that the operation does not crowd out nominal I/O operations. Setting the recovery rate to 0 means that the operation is unbounded.
--minrecoveryrate Rate[bBsSkKmMgG]
- Sets the minimum recovery rate to ensure that I/O for sync operations achieves a minimum throughput, even when heavy nominal I/O is present.
Specify the Rate value as an amount per second for each device in the array. If you provide no suffix, the options assume kiB per second per device.
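For example, a throttled scrub might look like the following. The volume group and logical volume names (my_vg/my_lv) and the rate values are illustrative, not part of the procedure:
# lvchange --syncaction check --maxrecoveryrate 100M --minrecoveryrate 1M my_vg/my_lv
This caps the background sync I/O at 100 MiB per second per device while guaranteeing at least 1 MiB per second per device, so that metadata updates and regular workload I/O are not starved during the scrub.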
Display the number of discrepancies in the array, without repairing them:
# lvchange --syncaction check vg/raid_lv
Correct the discrepancies in the array:
# lvchange --syncaction repair vg/raid_lv
Note: The lvchange --syncaction repair operation does not perform the same function as the lvconvert --repair operation:
- The lvchange --syncaction repair operation initiates a background synchronization operation on the array.
- The lvconvert --repair operation repairs or replaces failed devices in a mirror or RAID logical volume.
Optional: Display information about the scrubbing operation:
# lvs -o +raid_sync_action,raid_mismatch_count vg/lv
The raid_sync_action field displays the current synchronization operation that the RAID volume is performing. It can be one of the following values:
idle
- All sync operations complete (doing nothing)
resync
- Initializing an array or recovering after a machine failure
recover
- Replacing a device in the array
check
- Looking for array inconsistencies
repair
- Looking for and repairing inconsistencies
The raid_mismatch_count field displays the number of discrepancies found during a check operation.
The Cpy%Sync field displays the progress of the sync operations.
The lv_attr field provides additional indicators. Bit 9 of this field displays the health of the logical volume, and it supports the following indicators:
- m (mismatches) indicates that there are discrepancies in a RAID logical volume. This character is shown after a scrubbing operation has detected that portions of the RAID are not coherent.
- r (refresh) indicates that a device in a RAID array has suffered a failure and the kernel regards it as failed, even though LVM can read the device label and considers the device to be operational. Refresh the logical volume to notify the kernel that the device is now available, or replace the device if you suspect that it failed.
Additional resources
- For more information, see the lvchange(8) and lvmraid(7) man pages.
12.8.2. Failed devices in LVM RAID
RAID is not like traditional LVM mirroring. LVM mirroring required failed devices to be removed; otherwise, the mirrored logical volume would hang. RAID arrays can keep running with failed devices. In fact, for RAID types other than RAID1, removing a device would mean converting to a lower-level RAID (for example, from RAID6 to RAID5, or from RAID4 or RAID5 to RAID0).
Therefore, rather than removing a failed device unconditionally and potentially allocating a replacement, LVM allows you to replace a failed device in a RAID volume in a one-step solution by using the --repair argument of the lvconvert command.
12.8.3. Recovering a failed RAID device in a logical volume
If the LVM RAID device failure is a transient failure or you are able to repair the device that failed, you can initiate recovery of the failed device.
Prerequisites
- The previously failed device is now working.
Procedure
Refresh the logical volume that contains the RAID device:
# lvchange --refresh my_vg/my_lv
Verification steps
Examine the logical volume with the recovered device:
# lvs --all --options name,devices,lv_attr,lv_health_status my_vg
12.8.4. Replacing a failed RAID device in a logical volume
This procedure replaces a failed device that serves as a physical volume in an LVM RAID logical volume.
Prerequisites
- The volume group includes a physical volume that provides enough free capacity to replace the failed device.
- If no physical volume with sufficient free extents is available in the volume group, add a new, sufficiently large physical volume by using the vgextend utility.
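For example, assuming a new disk is available as /dev/sdf (an illustrative device name), you could initialize it and add it to the volume group as follows:
# pvcreate /dev/sdf
# vgextend my_vg /dev/sdf
The repair operation in the following procedure can then allocate extents from the new physical volume.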
Procedure
In the following example, a RAID logical volume is laid out as follows:
# lvs --all --options name,copy_percent,devices my_vg
  LV               Cpy%Sync Devices
  my_lv            100.00   my_lv_rimage_0(0),my_lv_rimage_1(0),my_lv_rimage_2(0)
  [my_lv_rimage_0]          /dev/sde1(1)
  [my_lv_rimage_1]          /dev/sdc1(1)
  [my_lv_rimage_2]          /dev/sdd1(1)
  [my_lv_rmeta_0]           /dev/sde1(0)
  [my_lv_rmeta_1]           /dev/sdc1(0)
  [my_lv_rmeta_2]           /dev/sdd1(0)
If the /dev/sdc device fails, the output of the lvs command is as follows:
# lvs --all --options name,copy_percent,devices my_vg
  /dev/sdc: open failed: No such device or address
  Couldn't find device with uuid A4kRl2-vIzA-uyCb-cci7-bOod-H5tX-IzH4Ee.
  WARNING: Couldn't find all devices for LV my_vg/my_lv_rimage_1 while checking used and assumed devices.
  WARNING: Couldn't find all devices for LV my_vg/my_lv_rmeta_1 while checking used and assumed devices.
  LV               Cpy%Sync Devices
  my_lv            100.00   my_lv_rimage_0(0),my_lv_rimage_1(0),my_lv_rimage_2(0)
  [my_lv_rimage_0]          /dev/sde1(1)
  [my_lv_rimage_1]          [unknown](1)
  [my_lv_rimage_2]          /dev/sdd1(1)
  [my_lv_rmeta_0]           /dev/sde1(0)
  [my_lv_rmeta_1]           [unknown](0)
  [my_lv_rmeta_2]           /dev/sdd1(0)
Replace the failed device and display the logical volume:
# lvconvert --repair my_vg/my_lv
  /dev/sdc: open failed: No such device or address
  Couldn't find device with uuid A4kRl2-vIzA-uyCb-cci7-bOod-H5tX-IzH4Ee.
  WARNING: Couldn't find all devices for LV my_vg/my_lv_rimage_1 while checking used and assumed devices.
  WARNING: Couldn't find all devices for LV my_vg/my_lv_rmeta_1 while checking used and assumed devices.
Attempt to replace failed RAID images (requires full device resync)? [y/n]: y
  Faulty devices in my_vg/my_lv successfully replaced.
Optional: To manually specify the physical volume that replaces the failed device, add the physical volume at the end of the command:
# lvconvert --repair my_vg/my_lv replacement_pv
Examine the logical volume with the replacement:
# lvs --all --options name,copy_percent,devices my_vg
  /dev/sdc: open failed: No such device or address
  /dev/sdc1: open failed: No such device or address
  Couldn't find device with uuid A4kRl2-vIzA-uyCb-cci7-bOod-H5tX-IzH4Ee.
  LV               Cpy%Sync Devices
  my_lv            43.79    my_lv_rimage_0(0),my_lv_rimage_1(0),my_lv_rimage_2(0)
  [my_lv_rimage_0]          /dev/sde1(1)
  [my_lv_rimage_1]          /dev/sdb1(1)
  [my_lv_rimage_2]          /dev/sdd1(1)
  [my_lv_rmeta_0]           /dev/sde1(0)
  [my_lv_rmeta_1]           /dev/sdb1(0)
  [my_lv_rmeta_2]           /dev/sdd1(0)
Until you remove the failed device from the volume group, LVM utilities still indicate that LVM cannot find the failed device.
Remove the failed device from the volume group:
# vgreduce --removemissing VG
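For example, if the volume group is named my_vg:
# vgreduce --removemissing my_vg
After this command completes, LVM utilities no longer report the missing device.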