Chapter 7. Known issues


This section describes the known issues in Red Hat OpenShift Data Foundation 4.14.

7.1. Disaster recovery

  • Failover action reports RADOS block device image mount failed on the pod with RPC error still in use

    Failing over a disaster recovery (DR) protected workload might result in pods using the volume on the failover cluster to be stuck in reporting RADOS block device (RBD) image is still in use. This prevents the pods from starting up for a long duration (upto several hours).

    (BZ#2007376)

  • Creating an application namespace for the managed clusters

    Application namespace needs to exist on RHACM managed clusters for disaster recovery (DR) related pre-deployment actions and hence is pre-created when an application is deployed at the RHACM hub cluster. However, if an application is deleted at the hub cluster and its corresponding namespace is deleted on the managed clusters, they reappear on the managed cluster.

    Workaround: openshift-dr maintains a namespace manifestwork resource in the managed cluster namespace at the RHACM hub. These resources need to be deleted after the application deletion. For example, as a cluster administrator, execute the following command on the hub cluster: oc delete manifestwork -n <managedCluster namespace> <drPlacementControl name>-<namespace>-ns-mw.

    (BZ#2059669)

  • ceph df reports an invalid MAX AVAIL value when the cluster is in stretch mode

    When a crush rule for a Red Hat Ceph Storage cluster has multiple "take" steps, the ceph df report shows the wrong maximum available size for the map. The issue will be fixed in an upcoming release.

    (BZ#2100920)

  • Both the DRPCs protect all the persistent volume claims created on the same namespace

    The namespaces that host multiple disaster recovery (DR) protected workloads, protect all the persistent volume claims (PVCs) within the namespace for each DRPlacementControl resource in the same namespace on the hub cluster that does not specify and isolate PVCs based on the workload using its spec.pvcSelector field.

    This results in PVCs, that match the DRPlacementControl spec.pvcSelector across multiple workloads. Or, if the selector is missing across all workloads, replication management to potentially manage each PVC multiple times and cause data corruption or invalid operations based on individual DRPlacementControl actions.

    Workaround: Label PVCs that belong to a workload uniquely, and use the selected label as the DRPlacementControl spec.pvcSelector to disambiguate which DRPlacementControl protects and manages which subset of PVCs within a namespace. It is not possible to specify the spec.pvcSelector field for the DRPlacementControl using the user interface, hence the DRPlacementControl for such applications must be deleted and created using the command line.

    Result: PVCs are no longer managed by multiple DRPlacementControl resources and do not cause any operation and data inconsistencies.

    (BZ#2111163)

  • MongoDB pod is in CrashLoopBackoff because of permission errors reading data in cephrbd volume

    The OpenShift projects across different managed clusters have different security context constraints (SCC), which specifically differ in the specified UID range and/or FSGroups. This leads to certain workload pods and containers failing to start post failover or relocate operations within these projects, due to filesystem access errors in their logs.

    Workaround: Ensure workload projects are created on all managed clusters with the same project-level SCC labels, allowing them to use the same filesystem context when failed over or relocated. Pods will no longer fail post-DR actions on filesystem-related access errors.

    (BZ#2114573)

  • Application is stuck in Relocating state during relocate

    Multicloud Object Gateway allowed multiple persistent volume (PV) objects of the same name or namespace to be added to the S3 store on the same path. Due to this, Ramen does not restore the PV because it detected multiple versions pointing to the same claimRef.

    Workaround: Use S3 CLI or equivalent to clean up the duplicate PV objects from the S3 store. Keep only the one that has a timestamp closer to the failover or relocate time.

    Result: The restore operation will proceed to completion and the failover or relocate operation proceeds to the next step.

    (BZ#2120201)

  • Disaster recovery workloads remain stuck when deleted

    When deleting a workload from a cluster, the corresponding pods might not terminate with events such as FailedKillPod. This might cause delay or failure in garbage collecting dependent DR resources such as the PVC, VolumeReplication, and VolumeReplicationGroup. It would also prevent a future deployment of the same workload to the cluster as the stale resources are not yet garbage collected.

    Workaround: Reboot the worker node on which the pod is currently running and stuck in a terminating state. This results in successful pod termination and subsequently related DR API resources are also garbage collected.

    (BZ#2159791)

  • Application failover hangs in FailingOver state when the managed clusters are on different versions of OpenShift Container Platform and OpenShift Data Foundation

    Disaster Recovery solution with OpenShift Data Foundation 4.14 protects and restores persistent volume claim (PVC) data in addition to the persistent volume (PV) data. If the primary cluster is on an older OpenShift Data Foundation version and the target cluster is updated to 4.14 then the failover will be stuck as the S3 store will not have the PVC data.

    Workaround: When upgrading the Disaster Recovery clusters, the primary cluster must be upgraded first and then the post-upgrade steps must be run.

    (BZ#2214306)

  • When DRPolicy is applied to multiple applications under same namespace, volume replication group is not created

    When a DRPlacementControl (DRPC) is created for applications that are co-located with other applications in the namespace, the DRPC has no label selector set for the applications. If any subsequent changes are made to the label selector, the validating admission webhook in the OpenShift Data Foundation Hub controller rejects the changes.

    Workaround: Until the admission webhook is changed to allow such changes, the DRPC validatingwebhookconfigurations can be patched to remove the webhook:

    $ oc patch validatingwebhookconfigurations vdrplacementcontrol.kb.io-lq2kz --type=json --patch='[{"op": "remove", "path": "/webhooks"}]'

    (BZ#2210762)

  • Failover of apps from c1 to c2 cluster hang in FailingOver

    The failover action is not disabled by Ramen when data is not uploaded to the s3 store due to s3 store misconfiguration.This means the cluster data is not available on the failover cluster during the failover. Therefore, failover cannot be completed.

    Workaround: Inspect the ramen logs after initial deployment to insure there are no s3 configuration errors reported.

    $ oc get drpc -o yaml

    (BZ#2248723)

  • Potential risk of data loss after hub recovery

    A potential data loss risk exists following hub recovery due to an eviction routine designed to clean up orphaned resources. This routine identifies and marks AppliedManifestWorks instances lacking corresponding ManifestWorks for collection. A hardcoded grace period of one hour is provided. After this period elapses, any resources associated with the AppliedManifestWork become subject to garbage collection.

    If the hub cluster fails to regenerate corresponding ManifestWorks within the initial one hour window, data loss could occur. This highlights the importance of promptly addressing any issues that might prevent the recreation of ManifestWorks post-hub recovery to minimize the risk of data loss.

    (BZ-2252933)

7.1.1. DR upgrade

This section describes the issues and workarounds related to upgrading Red Hat OpenShift Data Foundation from version 4.13 to 4.14 in disaster recovery environment.

  • Incorrect value cached status.preferredDecision.ClusterNamespace

    When OpenShift Data Foundation is upgraded from version 4.13 to 4.14, the disaster recovery placement control (DRPC) might have incorrect value cached in status.preferredDecision.ClusterNamespace. As a result, the DRPC incorrectly enters the WaitForFencing PROGRESSION instead of detecting that the failover is already complete. The workload on the managed clusters is not affected by this issue.

    Workaround:

    1. To identify the affected DRPCs, check for any DRPC that is in the state FailedOver as CURRENTSTATE and are stuck in the WaitForFencing PROGRESSION.
    2. To clear the incorrect value edit the DRPC subresource and delete the line, status.PreferredCluster.ClusterNamespace:

      $ oc edit --subresource=status drpc -n <namespace>  <name>
    3. To verify the DRPC status, check if the PROGRESSION is in COMPLETED state and FailedOver as CURRENTSTATE.

      (BZ#2215442)

7.2. Ceph

  • Poor performance of the stretch clusters on CephFS

    Workloads with many small metadata operations might exhibit poor performance because of the arbitrary placement of metadata server (MDS) on multi-site Data Foundation clusters.

    (BZ#1982116)

  • SELinux relabelling issue with a very high number of files

    When attaching volumes to pods in Red Hat OpenShift Container Platform, the pods sometimes do not start or take an excessive amount of time to start. This behavior is generic and it is tied to how SELinux relabelling is handled by the Kubelet. This issue is observed with any filesystem based volumes having very high file counts. In OpenShift Data Foundation, the issue is seen when using CephFS based volumes with a very high number of files. There are different ways to workaround this issue. Depending on your business needs you can choose one of the workarounds from the knowledgebase solution https://access.redhat.com/solutions/6221251.

    (Jira#3327)

  • Ceph is inaccessible after crash or shutdown tests are run

    In a stretch cluster, when a monitor is revived and is in the probing stage for other monitors to receive the latest information such as MonitorMap or OSDMap, it is unable to enter stretch_mode at the time it is in the probing stage. This prevents it from correctly setting the elector’s disallowed_leaders list.

    Assuming that the revived monitor actually has the best score, it will think that it is best fit to be a leader in the current election round and will cause the election phase of the monitors to get stuck because it will keep proposing itself and will keep getting rejected by the surviving monitors because of the disallowed_leaders list. This leads to the monitors getting stuck in election, and Ceph eventually becomes unresponsive.

    To workaround this issue, when stuck in election and Ceph becomes unresponsive, reset the Connectivity Scores of each monitor by using the command:

    `ceph daemon mon.{name} connection scores reset`

    If this doesn’t work, restart the monitors one by one. Election will then be unstuck, monitors will be able to elect a leader, form a quorum, and Ceph will become responsive again.

    (BZ#2241937)

  • Ceph reports no active mgr after workload deployment

    After workload deployment, Ceph manager loses connectivity to MONs or is unable to respond to its liveness probe.

    This causes the ODF cluster status to report that there is "no active mgr". This causes multiple operations that use the Ceph manager for request processing to fail. For example, volume provisioning, creating CephFS snapshots, and others.

    To check the status of the ODF cluster, use the command oc get cephcluster -n openshift-storage. In the status output, the status.ceph.details.MGR_DOWN field will have the message "no active mgr" if your cluster has this issue.

    To workaround this issue, restart the Ceph manager pods using the following commands:

    # oc scale deployment -n openshift-storage rook-ceph-mgr-a --replicas=0
    # oc scale deployment -n openshift-storage rook-ceph-mgr-a --replicas=1

    After running these commands, the ODF cluster status reports a healthy cluster, with no warnings or errors regarding MGR_DOWN.

    (BZ#2244873)

  • CephBlockPool creation fails when custom deviceClass is used in StorageCluster

    Due to a known issue, CephBlockPool creation fails when custom deviceClass is used in StorageCluster.

    (BZ#2248487)

7.3. CSI Driver

  • Automatic flattening of snapshots does not work

    When there is a single common parent RBD PVC, if volume snapshot, restore, and delete snapshot are performed in a sequence more than 450 times, it is further not possible to take volume snapshot or clone of the common parent RBD PVC.

    To workaround this issue, instead of performing volume snapshot, restore, and delete snapshot in a sequence, you can use PVC to PVC clone to completely avoid this issue.

    If you hit this issue, contact customer support to perform manual flattening of the final restore PVCs to continue to take volume snapshot or clone of the common parent PVC again.

    (BZ#2232163)

7.4. OpenShift Data Foundation console

  • Missing NodeStageVolume RPC call blocks new pods from going into Running state

    NodeStageVolume RPC call is not being issued blocking some pods from going into Running state. The new pods are stuck in Pending forever.

    To workaround this issue, scale down all the affected pods at once or do a node reboot. After applying the workaround, all pods should go into Running state.

    (BZ#2244353)

  • Backups are failing to transfer data

    In some situations, backups fail to transfer data, and snapshot PVC is stuck in Pending state.

    (BZ#2248117)

Red Hat logoGithubRedditYoutubeTwitter

Learn

Try, buy, & sell

Communities

About Red Hat Documentation

We help Red Hat users innovate and achieve their goals with our products and services with content they can trust.

Making open source more inclusive

Red Hat is committed to replacing problematic language in our code, documentation, and web properties. For more details, see the Red Hat Blog.

About Red Hat

We deliver hardened solutions that make it easier for enterprises to work across platforms and environments, from the core datacenter to the network edge.

© 2024 Red Hat, Inc.