Chapter 7. Troubleshooting disaster recovery
This troubleshooting section provides guidance and workarounds for fixing some of the disaster recovery configuration issues.
7.1. Troubleshooting Metro-DR
Administrators can use this information to troubleshoot and fix their Metro-DR solution.
7.1.1. A statefulset application stuck after failover
- Problem
While relocating to a preferred cluster, DRPlacementControl is stuck reporting PROGRESSION as "MovingToSecondary".
Previously, before Kubernetes v1.23, the Kubernetes control plane never cleaned up the PVCs created for StatefulSets. This activity was left to the cluster administrator or a software operator managing the StatefulSets. Due to this, the PVCs of the StatefulSets were left untouched when their Pods were deleted. This prevents Ramen from relocating an application to its preferred cluster.
- Resolution
If the workload uses StatefulSets, and relocation is stuck with PROGRESSION as "MovingToSecondary", then run:
$ oc get pvc -n <namespace>
For each bound PVC in that namespace that belongs to the StatefulSet, run:
$ oc delete pvc <pvcname> -n <namespace>
Once all PVCs are deleted, the Volume Replication Group (VRG) transitions to secondary, and is then deleted.
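To optionally watch this transition, you can list the VolumeReplicationGroup resources in the workload namespace. This is a minimal check and assumes the VRG created by Ramen resides in the same namespace as the workload:
$ oc get volumereplicationgroup -n <namespace>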
Run the following command:
$ oc get drpc -n <namespace> -o wide
After a few seconds to a few minutes, the PROGRESSION reports "Completed" and the relocation is complete.
- Result
- The workload is relocated to the preferred cluster
BZ reference: [2118270]
7.1.2. DR policies protect all applications in the same namespace
- Problem
While only a single application is selected to be used by a DR policy, all applications in the same namespace will be protected. This results in PVCs that match the DRPlacementControl spec.pvcSelector across multiple workloads, or across all workloads if the selector is missing, being managed by replication management potentially multiple times, which can cause data corruption or invalid operations based on individual DRPlacementControl actions.
- Resolution
Label the PVCs that belong to a workload uniquely, and use the selected label as the DRPlacementControl spec.pvcSelector to disambiguate which DRPlacementControl protects and manages which subset of PVCs within a namespace. It is not possible to specify the spec.pvcSelector field for the DRPlacementControl using the user interface, hence the DRPlacementControl for such applications must be deleted and created using the command line.
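For example, the following is a minimal sketch; the label key appname and the placeholder names are assumptions and should be adapted to your workload:
$ oc label pvc <pvc_name> -n <namespace> appname=<workload_name>
The DRPlacementControl recreated from the command line can then set spec.pvcSelector with matchLabels of appname: <workload_name>, so that it only manages the PVCs of that workload.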
BZ reference: [2128860]
7.1.3. During failback of an application stuck in Relocating state
- Problem
This issue might occur after performing failover and failback of an application (all nodes or clusters are up). When performing failback, the application is stuck in the Relocating state with a message of Waiting for PV restore to complete.
- Resolution
- Use an S3 client or equivalent to clean up the duplicate PV objects from the S3 store. Keep only the one that has a timestamp closer to the failover or relocate time.
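The following is a minimal sketch using the AWS CLI as the S3 client; the bucket name, endpoint URL, and object key are placeholders that depend on your Ramen s3StoreProfiles configuration:
$ aws s3 ls s3://<ramen_bucket>/<prefix>/ --recursive --endpoint-url <s3_endpoint>
$ aws s3 rm s3://<ramen_bucket>/<prefix>/<duplicate_pv_object_key> --endpoint-url <s3_endpoint>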
BZ reference: [2120201]
7.1.4. Relocate or failback might be stuck in Initiating state
- Problem
When a primary cluster is down and comes back online while the secondary goes down, relocate or failback might be stuck in the Initiating state.
- Resolution
To avoid this situation, cut off all access from the old active hub to the managed clusters.
Alternatively, you can scale down the ApplicationSet controller on the old active hub cluster either before moving workloads or when they are in the clean-up phase.
On the old active hub, scale down the OpenShift GitOps operator Deployment and the application controller StatefulSet using the following commands:
$ oc scale deploy -n openshift-gitops-operator openshift-gitops-operator-controller-manager --replicas=0
$ oc scale statefulset -n openshift-gitops openshift-gitops-application-controller --replicas=0
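To confirm the scale-down, and to restore the controllers later once the new hub is active, a sketch such as the following can be used; the replica count of 1 is an assumption and should match your original configuration:
$ oc get deploy -n openshift-gitops-operator openshift-gitops-operator-controller-manager
$ oc get statefulset -n openshift-gitops openshift-gitops-application-controller
$ oc scale deploy -n openshift-gitops-operator openshift-gitops-operator-controller-manager --replicas=1
$ oc scale statefulset -n openshift-gitops openshift-gitops-application-controller --replicas=1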
BZ reference: [2243804]
7.2. Troubleshooting Regional-DR
Administrators can use this information to troubleshoot and fix their Regional-DR solution.
7.2.1. rbd-mirror daemon health is in warning state
- Problem
There appear to be numerous cases where WARNING gets reported if the mirror service ::get_mirror_service_status calls the Ceph monitor to get the service status for rbd-mirror.
Following a network disconnection, the rbd-mirror daemon health is in the warning state while the connectivity between both the managed clusters is fine.
- Resolution
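If you need a shell in the Ceph toolbox first, the following is a minimal sketch; it assumes the rook-ceph-tools deployment is enabled in the default openshift-storage namespace:
$ oc rsh -n openshift-storage deploy/rook-ceph-tools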
Run the following command in the toolbox and look for leader: false.
rbd mirror pool status --verbose ocs-storagecluster-cephblockpool | grep 'leader:'
If you see leader: false in the output, it indicates that there is a daemon startup issue and the most likely root cause could be due to problems reliably connecting to the secondary cluster.
Workaround: Move the rbd-mirror pod to a different node by simply deleting the pod, and verify that it has been rescheduled on another node.
If you see leader: true or no output, contact Red Hat Support.
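A minimal sketch of this workaround, assuming the default openshift-storage namespace; the pod name is a placeholder:
$ oc get pods -n openshift-storage -o wide | grep rbd-mirror   # note the current node
$ oc delete pod <rbd_mirror_pod_name> -n openshift-storage
$ oc get pods -n openshift-storage -o wide | grep rbd-mirror   # confirm it is rescheduled on another node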
BZ reference: [2118627]
7.2.2. volsync-rsync-src pod is in error state as it is unable to resolve the destination hostname
- Problem
The VolSync source pod is unable to resolve the hostname of the VolSync destination pod. The log of the VolSync pod consistently shows an error message over an extended period of time similar to the following log snippet.
$ oc logs -n busybox-workloads-3-2 volsync-rsync-src-dd-io-pvc-1-p25rz
Example output
VolSync rsync container version: ACM-0.6.0-ce9a280
Syncing data to volsync-rsync-dst-dd-io-pvc-1.busybox-workloads-3-2.svc.clusterset.local:22 ...
ssh: Could not resolve hostname volsync-rsync-dst-dd-io-pvc-1.busybox-workloads-3-2.svc.clusterset.local: Name or service not known
- Resolution
Restart submariner-lighthouse-agent on both managed clusters.
$ oc delete pod -l app=submariner-lighthouse-agent -n submariner-operator
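To verify, a quick check such as the following can be used on each managed cluster; the workload namespace and pod name reuse the earlier example and are placeholders for your own workload:
$ oc get pods -l app=submariner-lighthouse-agent -n submariner-operator
$ oc logs -n busybox-workloads-3-2 volsync-rsync-src-dd-io-pvc-1-p25rz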
7.3. Troubleshooting 2-site stretch cluster with Arbiter
Administrators can use this information to troubleshoot and fix their 2-site stretch cluster with Arbiter environment.
7.3.1. Recovering workload pods stuck in ContainerCreating state post zone recovery
- Problem
After performing complete zone failure and recovery, the workload pods are sometimes stuck in the ContainerCreating state with any of the following errors:
- MountDevice failed to create newCsiDriverClient: driver name openshift-storage.rbd.csi.ceph.com not found in the list of registered CSI drivers
- MountDevice failed for volume <volume_name> : rpc error: code = Aborted desc = an operation with the given Volume ID <volume_id> already exists
- MountVolume.SetUp failed for volume <volume_name> : rpc error: code = Internal desc = staging path <path> for volume <volume_id> is not a mountpoint
- Resolution
If the workload pods are stuck with any of the previously mentioned errors, perform the following workarounds:
For ceph-fs workloads stuck in ContainerCreating (a command sketch follows this list):
- Restart the nodes where the stuck pods are scheduled
- Delete these stuck pods
- Verify that the new pods are running
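The following is a minimal sketch of these steps; the namespace, node, and pod names are placeholders, and the node restart shown with oc debug is only one possible method:
$ oc get pods -n <namespace> -o wide                          # note the NODE of each stuck pod
$ oc debug node/<node_name> -- chroot /host systemctl reboot  # restart that node
$ oc delete pod <stuck_pod_name> -n <namespace>
$ oc get pods -n <namespace>                                  # confirm the replacement pods reach Running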
For ceph-rbd workloads stuck in ContainerCreating that do not self-recover after some time (a command sketch follows this list):
- Restart the csi-rbd plugin pods on the nodes where the stuck pods are scheduled
- Verify that the new pods are running
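A minimal sketch of this workaround, assuming the CSI pods run in the default openshift-storage namespace; the workload namespace and pod names are placeholders:
$ oc get pods -n <namespace> -o wide                             # note the node of the stuck workload pod
$ oc get pods -n openshift-storage -o wide | grep csi-rbdplugin  # find the csi-rbdplugin pod on that node
$ oc delete pod <csi_rbdplugin_pod_name> -n openshift-storage
$ oc get pods -n <namespace>                                     # verify the new workload pods are running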