Chapter 5. Troubleshooting disaster recovery
5.1. Troubleshooting Regional-DR
5.1.1. RBD mirroring scheduling is getting stopped for some images
- Problem
There are a few common causes for RBD mirroring scheduling getting stopped for some images.
If an image is not replicated after the applications are marked for mirroring, use the toolbox pod and run the following command to identify the image whose scheduling has stopped:
$ rbd snap ls <poolname/imagename> --all
- Resolution
- Restart the manager daemon on the primary cluster
- Disable and immediately re-enable mirroring on the affected images on the primary cluster
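The following is a minimal sketch of these two resolution steps on the primary cluster. It assumes the default openshift-storage namespace, the Rook manager pod label app=rook-ceph-mgr, and snapshot-based mirroring (as used for Regional-DR); the rbd commands are run from the toolbox pod, with the pool and image placeholders taken from the earlier rbd snap ls output:
$ oc delete pod -l app=rook-ceph-mgr -n openshift-storage
$ rbd mirror image disable <poolname>/<imagename>
$ rbd mirror image enable <poolname>/<imagename> snapshot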
5.1.2. Relocation failure
- Problem
- Relocation stalls forever when the relocation is initiated before the peer (target cluster) is in a clean state.
- Resolution
Check the condition Status by running the following command:
$ oc get drpc -A -o wide
Change DRPC.Spec.Action back to Failover, and wait until the PeerReady condition status is TRUE. Use these commands to change the action:
$ oc patch drpc <drpc_name> --type json -p "[{'op': 'add', 'path': '/spec/failoverCluster', 'value': "<failoverCluster_name>"}]" -n <application_namespace>
$ oc patch drpc <drpc_name> --type json -p "[{'op': 'add', 'path': '/spec/action', 'value': 'Failover'}]" -n <application_namespace>
Failover verification
To verify whether the workload has failed over, run the following command to check the status of the Available condition in the DRPC resource:
JSONPATH='{range @.status.conditions[*]}{@.type}={@.status};{end}' && oc get drpc busybox-drpc -n busybox-sample -o jsonpath="$JSONPATH" | grep "Available=True"
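The same pattern can be used to confirm that the PeerReady condition is TRUE before attempting a subsequent relocate; busybox-drpc and busybox-sample are the sample name and namespace used above:
JSONPATH='{range @.status.conditions[*]}{@.type}={@.status};{end}' && oc get drpc busybox-drpc -n busybox-sample -o jsonpath="$JSONPATH" | grep "PeerReady=True"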
BZ reference: [2056871]
5.1.3. rbd-mirror daemon health is in warning state
- Problem
There appear to be numerous cases where WARNING gets reported when the mirror service ::get_mirror_service_status calls the Ceph monitor to get the service status for rbd-mirror.
Following a network disconnection, the rbd-mirror daemon health is in the warning state while the connectivity between both the managed clusters is fine.
- Resolution
Run the following command in the toolbox and look for leader: false:
$ rbd mirror pool status --verbose ocs-storagecluster-cephblockpool | grep 'leader:'
If you see the following in the output:
leader: false
It indicates that there is a daemon startup issue, and the most likely root cause is a problem reliably connecting to the secondary cluster.
Workaround: Move the rbd-mirror pod to a different node by deleting the pod (a sketch is shown below) and verify that it has been rescheduled on another node.
If you instead see leader: true or no output, contact Red Hat Support.
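A minimal sketch of this workaround, assuming the default openshift-storage namespace and the Rook label app=rook-ceph-rbd-mirror on the mirror daemon pod; the -o wide output shows the node that the replacement pod was scheduled on:
$ oc delete pod -l app=rook-ceph-rbd-mirror -n openshift-storage
$ oc get pods -l app=rook-ceph-rbd-mirror -n openshift-storage -o wide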
BZ reference: [2118627]
5.1.4. statefulset application stuck after failover
- Problem
Application is in terminating state after failover or relocate.
- Resolution
Delete the application’s persistent volume claim (PVC) using this command:
$ oc delete pvc <pvc_name> -n <namespace>
BZ reference: [2087782]
5.1.5. Application is not running after failover
- Problem
- After failing over an application, the mirrored RBDs can be attached but the filesystems that are still in use cannot be mounted.
- Resolution
Scale down the RBD mirror daemon deployment to 0 until the application pods can recover from the above error:
$ oc scale deployment rook-ceph-rbd-mirror-a -n openshift-storage --replicas=0
Post recovery, scale the RBD mirror daemon deployment back to 1:
$ oc scale deployment rook-ceph-rbd-mirror-a -n openshift-storage --replicas=1
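As a quick sanity check before considering the recovery complete, assuming the default namespace and pool name used above, confirm that the daemon pod is running and that the pool reports a healthy mirroring status from the toolbox:
$ oc get pods -n openshift-storage | grep rbd-mirror
$ rbd mirror pool status ocs-storagecluster-cephblockpool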
BZ reference: [2007376]
5.2. Troubleshooting Metro-DR
5.2.1. A statefulset application stuck after failover
- Problem
- Application is in terminating state after failover or relocate.
- Resolution
If the workload uses StatefulSets, then run the following command before failing back or relocating to another cluster:
$ oc get drpc -n <namespace> -o wide
- If PeerReady is TRUE then you can proceed with the failback or relocation.
If PeerReady is FALSE then run the following command on the peer cluster:
$ oc get pvc -n <namespace>
For each bound PVC in that namespace that belongs to the StatefulSet, run:
$ oc delete pvc <pvcname> -n <namespace>
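As a sketch, if every bound PVC in <namespace> belongs to the StatefulSet being failed back, the per-PVC deletion can be done in a single pass:
$ oc get pvc -n <namespace> -o name | xargs -r oc delete -n <namespace>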
Once all PVCs are deleted, Volume Replication Group (VRG) transitions to secondary, and then gets deleted.
Run the following command again:
$ oc get drpc -n <namespace> -o wide
After a few seconds to a few minutes, the PeerReady column changes to TRUE. Then you can proceed with the failback or relocation.
BZ reference: [2118270]
5.2.2. DR policies protect all applications in the same namespace
- Problem
Although only a single application is selected for use with a DR policy, all applications in the same namespace are protected. PVCs that match the DRPlacementControl spec.pvcSelector across multiple workloads, or all PVCs if the selector is missing, can then be managed by replication management multiple times, which can cause data corruption or invalid operations based on the individual DRPlacementControl actions.
- Resolution
Label the PVCs that belong to a workload uniquely, and use the selected label as the DRPlacementControl spec.pvcSelector to disambiguate which DRPlacementControl protects and manages which subset of PVCs within a namespace. It is not possible to specify the spec.pvcSelector field for the DRPlacementControl using the user interface, hence the DRPlacementControl for such applications must be deleted and created using the command line.
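A minimal sketch of the labeling step, using a hypothetical label key and value (appname=busybox) and placeholder PVC names; the recreated DRPlacementControl would then set spec.pvcSelector.matchLabels to the same key and value:
$ oc label pvc <pvc_name_1> <pvc_name_2> appname=busybox -n <application_namespace>
$ oc get pvc -l appname=busybox -n <application_namespace>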
BZ reference: [2111163]
5.2.3. During failback of an application stuck in Relocating state
- Problem
This issue might occur after performing failover and failback of an application (all nodes or clusters are up). When performing failback, the application is stuck in the Relocating state with a message of Waiting for PV restore to complete.
- Resolution
- Use an S3 client or equivalent to clean up the duplicate PV objects from the s3 store. Keep only the PV object that has a timestamp closer to the failover or relocate time.
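A minimal sketch of this cleanup using the AWS CLI, where the endpoint, bucket name, and object keys are hypothetical placeholders to be taken from the s3 store profile configured for DR; list the objects for the affected PV, compare timestamps, and remove the older duplicate:
$ aws s3 ls s3://<bucket_name>/ --recursive --endpoint-url <s3_endpoint> | grep <pv_name>
$ aws s3 rm s3://<bucket_name>/<object_key_of_older_duplicate> --endpoint-url <s3_endpoint>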
BZ reference: [2120201]