Chapter 5. Troubleshooting disaster recovery


5.1. Troubleshooting Regional-DR

5.1.1. RBD mirroring scheduling is getting stopped for some images

Problem

RBD mirroring scheduling can stop for some images, and there are a few common causes.

If an application is marked for mirroring but its images are not replicated, run the following command from the toolbox pod to see which image has stopped scheduling:

$ rbd snap ls <poolname/imagename> --all
Resolution
  • Restart the manager daemon on the primary cluster
  • Disable and immediately re-enable mirroring on the affected images on the primary cluster
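
The two resolution steps can be sketched as follows. This is a minimal sketch, not a verified procedure. It assumes that the Ceph manager runs as a pod labeled app=rook-ceph-mgr in the openshift-storage namespace and that the images use snapshot-based mirroring. The function names and the DRY_RUN switch are hypothetical helpers introduced only for this illustration.

```shell
# Hypothetical helpers. With DRY_RUN=1 (the default) commands are only printed.
DRY_RUN="${DRY_RUN:-1}"

run() {
  # Print the command in dry-run mode, otherwise execute it.
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Restart the manager daemon on the primary cluster by deleting its pod.
# Assumption: the manager pod carries the label app=rook-ceph-mgr.
restart_mgr() {
  run oc delete pod -l app=rook-ceph-mgr -n openshift-storage
}

# Disable and immediately re-enable mirroring for one image (toolbox pod).
# Assumption: snapshot-based mirroring. Argument: <poolname>/<imagename>.
reset_image_mirroring() {
  image_spec="$1"
  run rbd mirror image disable "$image_spec"
  run rbd mirror image enable "$image_spec" snapshot
}
```

Running reset_image_mirroring <poolname>/<imagename> with the default DRY_RUN=1 prints the two rbd commands without executing them, so they can be reviewed before setting DRY_RUN=0.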

BZ reference: [2067095 and 2121514]

5.1.2. Relocation failure

Problem
Relocation stalls forever when the relocation is initiated before the peer (target cluster) is in a clean state.
Resolution
  1. Check the condition Status by running the following command:

    $ oc get drpc -A -o wide
  2. Change DRPC.Spec.Action back to Failover, and wait until the PeerReady condition status is TRUE. Use these commands to change the action:

    $ oc patch drpc <drpc_name> --type json -p "[{'op': 'add', 'path': '/spec/failoverCluster', 'value': '<failoverCluster_name>'}]" -n <application_namespace>

    $ oc patch drpc <drpc_name> --type json -p "[{'op': 'add', 'path': '/spec/action', 'value': 'Failover'}]" -n <application_namespace>
  3. Verify the failover.

    To verify that the workload has failed over, run the following command to check the status of the Available condition in the DRPC resource. In this example, busybox-drpc is the DRPC name and busybox-sample is the application namespace:

    $ JSONPATH='{range @.status.conditions[*]}{@.type}={@.status};{end}' && oc get drpc busybox-drpc -n busybox-sample -o jsonpath="$JSONPATH" | grep "Available=True"
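
The same Type=Status; output can also be used to check the PeerReady condition from step 2. The following is a minimal sketch. The helper name drpc_condition_true is hypothetical, and it assumes the condition string has exactly the format produced by the JSONPath expression above.

```shell
# Hypothetical helper: succeeds when the named condition has status True in a
# "Type=Status;Type=Status;" string. The leading ";" added below makes the
# match exact, so "Available" does not match a type such as "NotAvailable".
drpc_condition_true() {
  condition="$1"
  conditions="$2"
  case ";${conditions}" in
    *";${condition}=True;"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example use (not executed here):
#   JSONPATH='{range @.status.conditions[*]}{@.type}={@.status};{end}'
#   conditions="$(oc get drpc <drpc_name> -n <application_namespace> -o jsonpath="$JSONPATH")"
#   drpc_condition_true PeerReady "$conditions" && echo "PeerReady is True"
```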

BZ reference: [2056871]

5.1.3. rbd-mirror daemon health is in warning state

Problem

There appear to be numerous cases in which a WARNING is reported when the mirror service ::get_mirror_service_status calls the Ceph monitor to get the service status for rbd-mirror.

Following a network disconnection, the rbd-mirror daemon health is in the warning state even though the connectivity between the two managed clusters is fine.

Resolution

Run the following command in the toolbox and look for leader: false

$ rbd mirror pool status --verbose ocs-storagecluster-cephblockpool | grep 'leader:'

Depending on the output, proceed as follows:

  • leader: false

    This indicates a daemon startup issue. The most likely root cause is a problem reliably connecting to the secondary cluster.

    Workaround: Move the rbd-mirror pod to a different node by deleting the pod, and verify that it has been rescheduled on another node.

  • leader: true or no output

    Contact Red Hat Support.
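
The decision on the leader output can be sketched as a small filter. This is an illustrative helper only: the name leader_state is hypothetical, and the pod label app=rook-ceph-rbd-mirror mentioned in the comments is an assumption about how the rbd-mirror pod is labeled.

```shell
# Hypothetical helper: classify the 'leader:' lines read from standard input.
# Prints "false" if any line reports leader: false, "true" if only
# leader: true is seen, and "none" if there is no leader line at all.
leader_state() {
  lines="$(grep 'leader:' || true)"
  if [ -z "$lines" ]; then
    echo "none"
  elif printf '%s\n' "$lines" | grep -q 'leader: *false'; then
    echo "false"
  else
    echo "true"
  fi
}

# Example use in the toolbox (not executed here):
#   rbd mirror pool status --verbose ocs-storagecluster-cephblockpool | leader_state
#
# If the result is "false", the workaround is to delete the rbd-mirror pod,
# for example (assumed label):
#   oc delete pod -l app=rook-ceph-rbd-mirror -n openshift-storage
#   oc get pods -l app=rook-ceph-rbd-mirror -n openshift-storage -o wide
```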

BZ reference: [2118627]

5.1.4. StatefulSet application stuck after failover

Problem
The application is stuck in the Terminating state after failover or relocation.
Resolution

Delete the application’s persistent volume claim (PVC) using this command:

$ oc delete pvc <pvc_name> -n <namespace>

BZ reference: [2087782]

5.1.5. Application is not running after failover

Problem
After failing over an application, the mirrored RBDs can be attached but the filesystems that are still in use cannot be mounted.
Resolution
  1. Scale down the RBD mirror daemon deployment to 0 until the application pods can recover from the above error.

    $ oc scale deployment rook-ceph-rbd-mirror-a -n openshift-storage --replicas=0
  2. Post recovery, scale the RBD mirror daemon deployment back to 1.

    $ oc scale deployment rook-ceph-rbd-mirror-a -n openshift-storage --replicas=1
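
The two steps can be combined into one helper that waits for the application pods between the scale operations. This is a sketch under stated assumptions: the function name, the OC variable, and the idea of selecting the application pods by a label are hypothetical, and pod readiness is used here as a stand-in for "the application pods have recovered".

```shell
# Assumption: OC names the client binary. Set OC=echo to preview the commands.
OC="${OC:-oc}"

# Hypothetical helper: pause the RBD mirror daemon, wait for the application
# pods selected by a label to become Ready, then resume the daemon.
# Arguments: <application_namespace> <pod_label_selector> [timeout]
recover_with_mirror_paused() {
  app_ns="$1"
  app_selector="$2"
  timeout="${3:-600s}"

  # Step 1: scale the RBD mirror daemon deployment down to 0.
  "$OC" scale deployment rook-ceph-rbd-mirror-a -n openshift-storage --replicas=0 || return 1

  # Wait until the application pods report Ready. Stop here if they do not.
  if ! "$OC" wait --for=condition=Ready pod -l "$app_selector" -n "$app_ns" --timeout="$timeout"; then
    echo "application pods not Ready; rbd-mirror is still scaled to 0" >&2
    return 1
  fi

  # Step 2: scale the RBD mirror daemon deployment back to 1.
  "$OC" scale deployment rook-ceph-rbd-mirror-a -n openshift-storage --replicas=1
}
```

If the wait times out, the helper leaves the deployment at 0 replicas and reports this on standard error, so the daemon must then be scaled back manually once the application has recovered.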

BZ reference: [2007376]

5.2. Troubleshooting Metro-DR

5.2.1. StatefulSet application stuck after failover

Problem
The application is stuck in the Terminating state after failover or relocation.
Resolution
  1. If the workload uses StatefulSets, then run the following command before failing back or relocating to another cluster:

    $ oc get drpc -n <namespace> -o wide
    • If PeerReady is TRUE then you can proceed with the failback or relocation.
    • If PeerReady is FALSE then run the following command on the peer cluster:

      $ oc get pvc -n <namespace>

      For each bound PVC in that namespace that belongs to the StatefulSet, run:

      $ oc delete pvc <pvcname> -n <namespace>

      Once all PVCs are deleted, the Volume Replication Group (VRG) transitions to secondary and is then deleted.

  2. Run the following command again:

    $ oc get drpc -n <namespace> -o wide

    After a few seconds to a few minutes, the PeerReady column changes to TRUE. Then you can proceed with the failback or relocation.
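
Selecting the bound PVCs that belong to a StatefulSet can be sketched as a filter over the oc get pvc output. The helper name is hypothetical, and the sketch assumes that the PVC names follow the usual StatefulSet naming pattern <volumeClaimTemplate>-<statefulset>-<ordinal> and that the first two columns of the output are NAME and STATUS.

```shell
# Hypothetical helper: read 'oc get pvc --no-headers' output from standard
# input and print the names of Bound PVCs created for the given StatefulSet.
# Assumption: names look like <volumeClaimTemplate>-<statefulset>-<ordinal>.
bound_statefulset_pvcs() {
  awk -v sts="$1" '
    $2 == "Bound" {
      name = $1
      base = name
      sub(/-[0-9]+$/, "", base)          # strip the trailing ordinal, if any
      suffix = "-" sts
      # Keep the PVC only if an ordinal was stripped and what remains ends
      # with "-<statefulset>" (literal comparison, no regex on the name).
      if (base != name && length(base) > length(suffix) &&
          substr(base, length(base) - length(suffix) + 1) == suffix) {
        print name
      }
    }'
}

# Example use on the peer cluster (not executed here):
#   oc get pvc -n <namespace> --no-headers | bound_statefulset_pvcs <statefulset_name> |
#     while read -r pvc; do oc delete pvc "$pvc" -n <namespace>; done
```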

BZ reference: [2118270]

5.2.2. DR policies protect all applications in the same namespace

Problem
Although only a single application is selected to be used by a DR policy, all applications in the same namespace are protected. As a result, PVCs that match the DRPlacementControl spec.pvcSelector across multiple workloads (or, if the selector is missing, all PVCs across all workloads) can be managed by replication management multiple times, which can cause data corruption or invalid operations based on individual DRPlacementControl actions.
Resolution
Label PVCs that belong to a workload uniquely, and use the selected label as the DRPlacementControl spec.pvcSelector to disambiguate which DRPlacementControl protects and manages which subset of PVCs within a namespace. It is not possible to specify the spec.pvcSelector field for the DRPlacementControl using the user interface, hence the DRPlacementControl for such applications must be deleted and created using the command line.
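
Labeling the PVCs of one workload from the command line can be sketched as follows. The function names, the OC variable, and the example label appname=busybox are hypothetical. The chosen label is then the value to use in the DRPlacementControl spec.pvcSelector.

```shell
# Assumption: OC names the client binary. Set OC=echo to preview the commands.
OC="${OC:-oc}"

# Hypothetical helper: apply one label to the PVCs of a single workload.
# Arguments: <namespace> <key=value> <pvc_name>...
label_workload_pvcs() {
  ns="$1"
  label="$2"
  shift 2
  "$OC" label pvc "$@" "$label" -n "$ns"
}

# Hypothetical helper: list the PVCs that a label selector would match, to
# confirm that only the intended workload is selected.
# Arguments: <namespace> <key=value>
list_labeled_pvcs() {
  "$OC" get pvc -n "$1" -l "$2"
}

# Example use (not executed here):
#   label_workload_pvcs <namespace> appname=busybox <pvc_name_1> <pvc_name_2>
#   list_labeled_pvcs <namespace> appname=busybox
```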

BZ reference: [2111163]

5.2.3. Application stuck in Relocating state during failback

Problem
This issue might occur after performing failover and failback of an application (all nodes or clusters are up). When performing failback, the application gets stuck in the Relocating state with the message Waiting for PV restore to complete.
Resolution
Use an S3 client or equivalent to clean up the duplicate PV objects from the S3 store. Keep only the one whose timestamp is closest to the failover or relocate time.
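
Choosing which duplicate to keep can be sketched as a filter over an object listing. This is an illustration only: the helper name is hypothetical, it assumes GNU date, and it assumes a listing with one object per line in the form "<date> <time> <size> <key>" (the layout printed by aws s3 ls), already narrowed down to the duplicates of a single PV. The reference time must be given in the same time zone as the listing, and keys containing spaces are not handled.

```shell
# Hypothetical helper: from a listing on standard input, print the key whose
# timestamp is closest to the reference time given as the first argument.
# Assumptions: GNU date, and lines of the form "<date> <time> <size> <key>".
keep_closest() {
  ref_epoch="$(date -u -d "$1" +%s)" || return 1
  best_key=""
  best_diff=""
  while read -r d t size key; do
    [ -n "$key" ] || continue
    ts="$(date -u -d "$d $t" +%s)" || continue
    # Absolute distance in seconds between the object and the reference time.
    if [ "$ts" -ge "$ref_epoch" ]; then
      diff=$((ts - ref_epoch))
    else
      diff=$((ref_epoch - ts))
    fi
    if [ -z "$best_diff" ] || [ "$diff" -lt "$best_diff" ]; then
      best_diff="$diff"
      best_key="$key"
    fi
  done
  [ -n "$best_key" ] && echo "$best_key"
}

# Example use (not executed here; bucket and prefix are placeholders):
#   aws s3 ls s3://<bucket>/<prefix> --recursive | keep_closest "<failover_time>"
```

The printed key is the object to keep. The remaining duplicates in the same listing are the candidates for deletion, which should be reviewed manually before removing anything.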

BZ reference: [2120201]

© 2024 Red Hat, Inc.