7장. Known issues


This section describes the known issues in Red Hat OpenShift Data Foundation 4.15.

7.1. Disaster recovery

  • Creating an application namespace for the managed clusters

    Application namespace needs to exist on RHACM managed clusters for disaster recovery (DR) related pre-deployment actions and hence is pre-created when an application is deployed at the RHACM hub cluster. However, if an application is deleted at the hub cluster and its corresponding namespace is deleted on the managed clusters, they reappear on the managed cluster.

    Workaround: openshift-dr maintains a namespace manifestwork resource in the managed cluster namespace at the RHACM hub. These resources need to be deleted after the application deletion. For example, as a cluster administrator, execute the following command on the hub cluster:

    $ oc delete manifestwork -n <managedCluster namespace> <drPlacementControl name>-<namespace>-ns-mw

    (BZ#2059669)

  • ceph df reports an invalid MAX AVAIL value when the cluster is in stretch mode

    When a crush rule for a Red Hat Ceph Storage cluster has multiple "take" steps, the ceph df report shows the wrong maximum available size for the map. The issue will be fixed in an upcoming release.

    (BZ#2100920)

  • Both the DRPCs protect all the persistent volume claims created on the same namespace

    The namespaces that host multiple disaster recovery (DR) protected workloads, protect all the persistent volume claims (PVCs) within the namespace for each DRPlacementControl resource in the same namespace on the hub cluster that does not specify and isolate PVCs based on the workload using its spec.pvcSelector field.

    This results in PVCs, that match the DRPlacementControl spec.pvcSelector across multiple workloads. Or, if the selector is missing across all workloads, replication management to potentially manage each PVC multiple times and cause data corruption or invalid operations based on individual DRPlacementControl actions.

    Workaround: Label PVCs that belong to a workload uniquely, and use the selected label as the DRPlacementControl spec.pvcSelector to disambiguate which DRPlacementControl protects and manages which subset of PVCs within a namespace. It is not possible to specify the spec.pvcSelector field for the DRPlacementControl using the user interface, hence the DRPlacementControl for such applications must be deleted and created using the command line.

    Result: PVCs are no longer managed by multiple DRPlacementControl resources and do not cause any operation and data inconsistencies.

    (BZ#2128860)

  • MongoDB pod is in CrashLoopBackoff because of permission errors reading data in cephrbd volume

    The OpenShift projects across different managed clusters have different security context constraints (SCC), which specifically differ in the specified UID range and/or FSGroups. This leads to certain workload pods and containers failing to start post failover or relocate operations within these projects, due to filesystem access errors in their logs.

    Workaround: Ensure workload projects are created on all managed clusters with the same project-level SCC labels, allowing them to use the same filesystem context when failed over or relocated. Pods will no longer fail post-DR actions on filesystem-related access errors.

    (BZ#2081855)

  • Disaster recovery workloads remain stuck when deleted

    When deleting a workload from a cluster, the corresponding pods might not terminate with events such as FailedKillPod. This might cause delay or failure in garbage collecting dependent DR resources such as the PVC, VolumeReplication, and VolumeReplicationGroup. It would also prevent a future deployment of the same workload to the cluster as the stale resources are not yet garbage collected.

    Workaround: Reboot the worker node on which the pod is currently running and stuck in a terminating state. This results in successful pod termination and subsequently related DR API resources are also garbage collected.

    (BZ#2159791)

  • When DRPolicy is applied to multiple applications under same namespace, volume replication group is not created

    When a DRPlacementControl (DRPC) is created for applications that are co-located with other applications in the namespace, the DRPC has no label selector set for the applications. If any subsequent changes are made to the label selector, the validating admission webhook in the OpenShift Data Foundation Hub controller rejects the changes.

    Workaround: Until the admission webhook is changed to allow such changes, the DRPC validatingwebhookconfigurations can be patched to remove the webhook:

    $ oc patch validatingwebhookconfigurations vdrplacementcontrol.kb.io-lq2kz --type=json --patch='[{"op": "remove", "path": "/webhooks"}]'

    (BZ#2210762)

  • Application failover hangs in FailingOver state when the managed clusters are on different versions of OpenShift Container Platform and OpenShift Data Foundation

    Disaster Recovery solution with OpenShift Data Foundation protects and restores persistent volume claim (PVC) data in addition to the persistent volume (PV) data. If the primary cluster is on an older OpenShift Data Foundation version and the target cluster is updated to 4.15 then the failover will be stuck as the S3 store will not have the PVC data.

    Workaround: When upgrading the Disaster Recovery clusters, the primary cluster must be upgraded first and then the post-upgrade steps must be run.

    (BZ#2215462)

  • Failover of apps from c1 to c2 cluster hang in FailingOver

    The failover action is not disabled by Ramen when data is not uploaded to the s3 store due to s3 store misconfiguration. This means the cluster data is not available on the failover cluster during the failover. Therefore, failover cannot be completed.

    Workaround: Inspect the Ramen logs after initial deployment to ensure there are no s3 configuration errors reported.

    $ oc get drpc -o yaml

    (BZ#2248723)

  • Potential risk of data loss after hub recovery

    A potential data loss risk exists following hub recovery due to an eviction routine designed to clean up orphaned resources. This routine identifies and marks AppliedManifestWorks instances lacking corresponding ManifestWorks for collection. A hardcoded grace period of one hour is provided. After this period elapses, any resources associated with the AppliedManifestWork become subject to garbage collection.

    If the hub cluster fails to regenerate corresponding ManifestWorks within the initial one hour window, data loss could occur. This highlights the importance of promptly addressing any issues that might prevent the recreation of ManifestWorks post-hub recovery to minimize the risk of data loss.

  • Regional DR Cephfs based application failover show warning about subscription

    After the application is failed over or relocated, the hub subscriptions show up errors stating, "Some resources failed to deploy. Use View status YAML link to view the details." This is because the application persistent volume claims (PVCs) that use CephFS as the backing storage provisioner, deployed using Red Hat Advanced Cluster Management for Kubernetes (RHACM) subscriptions, and are DR protected are owned by the respective DR controllers.

    Workaround: There are no workarounds to rectify the errors in the subscription status. However, the subscription resources that failed to deploy can be checked to make sure they are PVCs. This ensures that the other resources do not have problems. If the only resources in the subscription that fail to deploy are the ones that are DR protected, the error can be ignored.

    (BZ-2264445)

  • Disabled PeerReady flag prevents changing the action to Failover

    The DR controller executes full reconciliation as and when needed. When a cluster becomes inaccessible, the DR controller performs a sanity check. If the workload is already relocated, this sanity check causes the PeerReady flag associated with the workload to be disabled, and the sanity check does not complete due to the cluster being offline. As a result, the disabled PeerReady flag prevents you from changing the action to Failover.

    Workaround: Use the command-line interface to change the DR action to Failover despite the disabled PeerReady flag.

    (BZ-2264765)

  • Ceph becomes inaccessible and IO is paused when connection is lost between the two data centers in stretch cluster

    When two data centers lose connection with each other but are still connected to the Arbiter node, there is a flaw in the election logic that causes an infinite election between the monitors. As a result, the monitors are unable to elect a leader and the Ceph cluster becomes unavailable. Also, IO is paused during the connection loss.

    Workaround: Shut down the monitors in one of the data centers where monitors are out of quorum (you can find this by running ceph -s command) and reset the connection scores of the remaining monitors.

    As a result, monitors can form a quorum and Ceph becomes available again and IOs resume.

    (Partner BZ#2265992)

  • Cleanup and data synchronization for ApplicationSet workloads remain stuck after older primary managed cluster is recovered post the failover

    ApplicationSet based workload deployments to the managed clusters are not garbage collected in cases when the hub cluster fails. It is recovered to a standby hub cluster while the workload has been failed over to a surviving managed cluster. The cluster that the workload failed over from, rejoins the new recovered standby hub.

    ApplicationSets that are disaster recovery (DR) protected and with a regional DRPolicy starts firing the VolumeSynchronizationDelay alert. Further such DR protected workloads cannot be failed over to the peer cluster or relocated to the peer cluster as data is out of sync between the two clusters.

    For a workaround, see the Troubleshooting section for Regional-DR in Configuring OpenShift Data Foundation Disaster Recovery for OpenShift Workloads.

    (BZ#2268594)

Red Hat logoGithubredditYoutubeTwitter

자세한 정보

평가판, 구매 및 판매

커뮤니티

Red Hat 소개

Red Hat은 기업이 핵심 데이터 센터에서 네트워크 에지에 이르기까지 플랫폼과 환경 전반에서 더 쉽게 작업할 수 있도록 강화된 솔루션을 제공합니다.

보다 포괄적 수용을 위한 오픈 소스 용어 교체

Red Hat은 코드, 문서, 웹 속성에서 문제가 있는 언어를 교체하기 위해 최선을 다하고 있습니다. 자세한 내용은 다음을 참조하세요.Red Hat 블로그.

Red Hat 문서 정보

Legal Notice

Theme

© 2026 Red Hat
맨 위로 이동