Chapter 7. Troubleshooting disaster recovery
7.1. Troubleshooting Metro-DR
7.1.1. A statefulset application stuck after failover
- Problem
While relocating to a preferred cluster, DRPlacementControl is stuck reporting PROGRESSION as "MovingToSecondary".
Prior to Kubernetes v1.23, the Kubernetes control plane never cleaned up the PVCs created for StatefulSets. That activity was left to the cluster administrator or to a software operator managing the StatefulSets. Due to this, the PVCs of a StatefulSet were left untouched when its Pods were deleted. This prevents Ramen from relocating an application to its preferred cluster.
- Resolution
If the workload uses StatefulSets, and relocation is stuck with PROGRESSION as "MovingToSecondary", then run:
$ oc get pvc -n <namespace>
For each bound PVC in that namespace that belongs to the StatefulSet, run:
$ oc delete pvc <pvcname> -n <namespace>
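If the StatefulSet's PVCs share a common label (the app label below is an assumed example; use whatever label your StatefulSet's volumeClaimTemplates apply), they can be deleted in one pass:
$ oc get pvc -n <namespace> -l app=<statefulset-label> -o name | xargs -r oc delete -n <namespace>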
Once all PVCs are deleted, the Volume Replication Group (VRG) transitions to secondary, and is then deleted. Then run the following command:
$ oc get drpc -n <namespace> -o wide
After a few seconds to a few minutes, the PROGRESSION reports "Completed" and the relocation is complete.
- Result
- The workload is relocated to the preferred cluster.
BZ reference: [2118270]
7.1.2. DR policies protect all applications in the same namespace
- Problem
While only a single application is selected to be used by a DR policy, all applications in the same namespace are protected. This results in PVCs that match the DRPlacementControl spec.pvcSelector across multiple workloads, or all PVCs in the namespace when the selector is missing, being managed by replication management potentially multiple times, causing data corruption or invalid operations based on individual DRPlacementControl actions.
- Resolution
Label the PVCs that belong to a workload uniquely, and use the selected label as the DRPlacementControl spec.pvcSelector to disambiguate which DRPlacementControl protects and manages which subset of PVCs within a namespace. It is not possible to specify the spec.pvcSelector field for the DRPlacementControl using the user interface, hence the DRPlacementControl for such applications must be deleted and recreated using the command line.
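For example, label a workload's PVCs and reference that label from the DRPlacementControl. This is a minimal sketch; the resource name, DRPolicy, and placement references below are illustrative, not prescribed by this procedure:
$ oc label pvc <pvcname> -n <namespace> appname=busybox
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPlacementControl
metadata:
  name: busybox-drpc # illustrative name
  namespace: <namespace>
spec:
  drPolicyRef:
    name: <drpolicy-name> # the DRPolicy protecting this workload
  placementRef:
    kind: PlacementRule
    name: <placement-rule-name> # illustrative placement reference
  preferredCluster: <preferred-cluster-name>
  pvcSelector:
    matchLabels:
      appname: busybox # matches only the PVCs labeled above
With a unique appname value per workload, each DRPlacementControl manages a disjoint set of PVCs within the namespace.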
BZ reference: [2128860]
7.1.3. Relocate or failback might be stuck in Initiating state
- Problem
When a primary cluster is down and comes back online while the secondary goes down, relocate or failback might be stuck in the Initiating state.
- Resolution
To avoid this situation, cut off all access from the old active hub to the managed clusters.
Alternatively, you can scale down the ApplicationSet controller on the old active hub cluster either before moving workloads or when they are in the clean-up phase.
On the old active hub, scale down the GitOps operator deployment and the Application controller StatefulSet using the following commands:
$ oc scale deploy -n openshift-gitops-operator openshift-gitops-operator-controller-manager --replicas=0
$ oc scale statefulset -n openshift-gitops openshift-gitops-application-controller --replicas=0
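Once the workloads have been moved and cleaned up, the two controllers can be scaled back up on the hub (assuming the default replica count of 1 for each):
$ oc scale deploy -n openshift-gitops-operator openshift-gitops-operator-controller-manager --replicas=1
$ oc scale statefulset -n openshift-gitops openshift-gitops-application-controller --replicas=1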
BZ reference: [2243804]
7.2. Troubleshooting Regional-DR
7.2.1. rbd-mirror daemon health is in warning state
- Problem
There appear to be numerous cases where WARNING gets reported if the mirror service ::get_mirror_service_status calls the Ceph monitor to get the service status for rbd-mirror.
Following a network disconnection, the rbd-mirror daemon health is in the warning state while the connectivity between both managed clusters is fine.
- Resolution
Run the following command in the toolbox and look for leader: false in the output:
rbd mirror pool status --verbose ocs-storagecluster-cephblockpool | grep 'leader:'
If you see leader: false in the output, it indicates that there is a daemon startup issue, and the most likely root cause is a problem reliably connecting to the secondary cluster.
Workaround: Move the rbd-mirror pod to a different node by simply deleting the pod, and verify that it has been rescheduled on another node.
If you see leader: true or no output, contact Red Hat Support.
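For example, assuming the default openshift-storage namespace and the app=rook-ceph-rbd-mirror label that Rook applies to the mirror daemon pod (verify the label on your cluster first), delete the pod and confirm where the replacement is scheduled:
$ oc delete pod -n openshift-storage -l app=rook-ceph-rbd-mirror
$ oc get pod -n openshift-storage -l app=rook-ceph-rbd-mirror -o wide
The -o wide output includes the node that the new pod landed on.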
BZ reference: [2118627]
7.2.2. volsync-rsync-src pod is in error state as it is unable to resolve the destination hostname
- Problem
The VolSync source pod is unable to resolve the hostname of the VolSync destination pod. The log of the VolSync pod consistently shows an error message over an extended period of time, similar to the following log snippet.
$ oc logs -n busybox-workloads-3-2 volsync-rsync-src-dd-io-pvc-1-p25rz
Example output
VolSync rsync container version: ACM-0.6.0-ce9a280
Syncing data to volsync-rsync-dst-dd-io-pvc-1.busybox-workloads-3-2.svc.clusterset.local:22 ...
ssh: Could not resolve hostname volsync-rsync-dst-dd-io-pvc-1.busybox-workloads-3-2.svc.clusterset.local: Name or service not known
- Resolution
Restart submariner-lighthouse-agent on both the managed clusters:
$ oc delete pod -l app=submariner-lighthouse-agent -n submariner-operator
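To confirm that the replacement pods come back up, list them using the same label selector:
$ oc get pod -l app=submariner-lighthouse-agent -n submariner-operator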
7.2.3. Cleanup and data sync for ApplicationSet workloads remain stuck after older primary managed cluster is recovered post failover
- Problem
ApplicationSet-based workload deployments to managed clusters are not garbage collected when the hub cluster fails and is recovered to a standby hub cluster while the workload has been failed over to a surviving managed cluster. The cluster that the workload was failed over from then rejoins the new recovered standby hub.
ApplicationSets that are DR protected with a regional DRPolicy hence start firing the VolumeSynchronizationDelay alert. Further, such DR-protected workloads can neither be failed over to the peer cluster nor relocated to the peer cluster, as data is out of sync between the two clusters.
- Resolution
The workaround requires that the openshift-gitops operators can own the workload resources that are orphaned on the managed cluster that rejoined the hub after a failover of the workload was performed from the new recovered hub. To achieve this, take the following steps:
1. Determine the Placement that is in use by the ArgoCD ApplicationSet resource on the hub cluster in the openshift-gitops namespace. Inspect the placement label value for the ApplicationSet in the field spec.generators.clusterDecisionResource.labelSelector.matchLabels. This is the name of the Placement resource, <placement-name>.
2. Ensure that there exists a PlacementDecision for the Placement referenced by the ApplicationSet:
$ oc get placementdecision -n openshift-gitops --selector cluster.open-cluster-management.io/placement=<placement-name>
This results in a single PlacementDecision that places the workload in the currently desired failover cluster.
3. Create a new PlacementDecision for the ApplicationSet, pointing to the cluster where it should be cleaned up.
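For example, a minimal sketch of such a manifest, reusing the placement label queried in step (2); copy any additional labels from the existing PlacementDecision on your hub:
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: PlacementDecision
metadata:
  name: <placement-name>-decision-<n> # <n> is one higher than that of the existing PlacementDecision from step (2)
  namespace: openshift-gitops
  labels:
    cluster.open-cluster-management.io/placement: <placement-name>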
4. Update the newly created PlacementDecision with a status subresource, using a patch file such as decision-status.yaml:
status:
  decisions:
  - clusterName: <managedcluster-name-to-clean-up> # This is the cluster from where the workload was failed over, NOT the current workload cluster
    reason: FailoverCleanup
$ oc patch placementdecision -n openshift-gitops <placement-name>-decision-<n> --patch-file=decision-status.yaml --subresource=status --type=merge
5. Watch and ensure that the Application resource for the ApplicationSet has been placed on the desired cluster:
$ oc get application -n openshift-gitops <applicationset-name>-<managedcluster-name-to-clean-up>
In the output, check that the SYNC STATUS shows as Synced and the HEALTH STATUS shows as Healthy.
6. Delete the PlacementDecision that was created in step (3), so that ArgoCD can garbage collect the workload resources on the <managedcluster-name-to-clean-up> cluster:
$ oc delete placementdecision -n openshift-gitops <placement-name>-decision-<n>
ApplicationSets that are DR protected with a regional DRPolicy stop firing the VolumeSynchronizationDelay alert.
BZ reference: [2268594]
7.3. Troubleshooting 2-site stretch cluster with Arbiter
7.3.1. Recovering workload pods stuck in ContainerCreating state post zone recovery
- Problem
After performing complete zone failure and recovery, the workload pods are sometimes stuck in the ContainerCreating state with any of the following errors:
- MountDevice failed to create newCsiDriverClient: driver name openshift-storage.rbd.csi.ceph.com not found in the list of registered CSI drivers
- MountDevice failed for volume <volume_name> : rpc error: code = Aborted desc = an operation with the given Volume ID <volume_id> already exists
- MountVolume.SetUp failed for volume <volume_name> : rpc error: code = Internal desc = staging path <path> for volume <volume_id> is not a mountpoint
- Resolution
If the workload pods are stuck with any of the above-mentioned errors, perform the following workarounds:
For ceph-fs workloads stuck in ContainerCreating:
- Restart the nodes where the stuck pods are scheduled
- Delete these stuck pods
- Verify that the new pods are running
For ceph-rbd workloads stuck in ContainerCreating that do not self-recover after some time:
- Restart the csi-rbd plugin pods on the nodes where the stuck pods are scheduled
- Verify that the new pods are running
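For example, assuming the default openshift-storage namespace and the app=csi-rbdplugin label carried by the RBD CSI plugin daemonset pods (verify the label on your cluster first), the plugin pods on an affected node can be restarted and then re-listed with:
$ oc delete pod -n openshift-storage -l app=csi-rbdplugin --field-selector spec.nodeName=<node-name>
$ oc get pod -n openshift-storage -l app=csi-rbdplugin --field-selector spec.nodeName=<node-name>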