Chapter 7. Known issues

This section describes the known issues in Red Hat OpenShift Data Foundation 4.16.

7.1. Disaster recovery

Creating an application namespace for the managed clusters
The application namespace needs to exist on RHACM managed clusters for disaster recovery (DR) related pre-deployment actions, and hence is pre-created when an application is deployed at the RHACM hub cluster. However, if an application is deleted at the hub cluster and its corresponding namespace is deleted on the managed clusters, the namespace reappears on the managed clusters.
Workaround: openshift-dr maintains a namespace manifestwork resource in the managed cluster namespace at the RHACM hub. These resources need to be deleted after the application deletion. For example, as a cluster administrator, execute the following command on the hub cluster:
```
$ oc delete manifestwork -n <managedCluster namespace> <drPlacementControl name>-<namespace>-ns-mw
```
(BZ#2059669)
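To illustrate how the placeholders in the command above expand, the following sketch composes the ManifestWork name from its parts. The values cluster1, busybox-drpc, and busybox are hypothetical, and the script only prints the command so that it can be reviewed before it is run on the hub cluster.

```shell
# Hypothetical example values; substitute the names from your environment.
managed_cluster_ns="cluster1"   # managed cluster namespace on the hub cluster
drpc_name="busybox-drpc"        # DRPlacementControl name
app_namespace="busybox"         # application namespace

# Name pattern from the workaround: <drPlacementControl name>-<namespace>-ns-mw
manifestwork_name="${drpc_name}-${app_namespace}-ns-mw"

# Print the command for review instead of executing it.
echo "oc delete manifestwork -n ${managed_cluster_ns} ${manifestwork_name}"
```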

ceph df reports an invalid MAX AVAIL value when the cluster is in stretch mode
When a crush rule for a Red Hat Ceph Storage cluster has multiple "take" steps, the ceph df report shows the wrong maximum available size for the map. The issue will be fixed in an upcoming release.
(BZ#2100920)

Both the DRPCs protect all the persistent volume claims created on the same namespace
In a namespace that hosts multiple disaster recovery (DR) protected workloads, every DRPlacementControl resource in that namespace on the hub cluster that does not isolate PVCs by workload through its spec.pvcSelector field protects all the persistent volume claims (PVCs) within the namespace.
As a result, PVCs can match the spec.pvcSelector of DRPlacementControl resources across multiple workloads. If the selector is missing for all workloads, replication management might manage each PVC multiple times, which can cause data corruption or invalid operations based on individual DRPlacementControl actions.
Workaround: Label PVCs that belong to a workload uniquely, and use the selected label as the DRPlacementControl spec.pvcSelector to disambiguate which DRPlacementControl protects and manages which subset of PVCs within a namespace. It is not possible to specify the spec.pvcSelector field for the DRPlacementControl using the user interface, hence the DRPlacementControl for such applications must be deleted and created using the command line.
Result: PVCs are no longer managed by multiple DRPlacementControl resources and do not cause any operation and data inconsistencies.
(BZ#2128860)
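The labeling workaround can be sketched as follows. This is a minimal illustration, not a definitive procedure: the namespace, PVC name, and the appname=app1 label are hypothetical, and the fragment assumes that spec.pvcSelector accepts a standard Kubernetes label selector with matchLabels. The script only prints the command and the fragment.

```shell
# Hypothetical example values; substitute your own.
app_namespace="shared-apps"
pvc_name="app1-data"
label_key="appname"
label_value="app1"

# Step 1: label each PVC that belongs to the workload
# (run on the managed cluster where the PVC exists).
echo "oc label pvc ${pvc_name} -n ${app_namespace} ${label_key}=${label_value}"

# Step 2: fragment for that workload's DRPlacementControl on the hub cluster.
# Assumes pvcSelector is a standard label selector (matchLabels).
pvc_selector=$(cat <<EOF
spec:
  pvcSelector:
    matchLabels:
      ${label_key}: ${label_value}
EOF
)
echo "${pvc_selector}"
```

Each workload in the namespace would use a different label value, so that no PVC matches more than one DRPlacementControl.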

MongoDB pod is in CrashLoopBackoff because of permission errors reading data in cephrbd volume
The OpenShift projects across different managed clusters have different security context constraints (SCC), which specifically differ in the specified UID range and/or FSGroups. This leads to certain workload pods and containers failing to start post failover or relocate operations within these projects, due to filesystem access errors in their logs.
Workaround: Ensure workload projects are created on all managed clusters with the same project-level SCC labels, allowing them to use the same filesystem context when failed over or relocated. Pods will no longer fail post-DR actions on filesystem-related access errors.
(BZ#2081855)

Disaster recovery workloads remain stuck when deleted
When deleting a workload from a cluster, the corresponding pods might not terminate with events such as FailedKillPod. This might cause delay or failure in garbage collecting dependent DR resources such as the PVC, VolumeReplication, and VolumeReplicationGroup. It would also prevent a future deployment of the same workload to the cluster as the stale resources are not yet garbage collected.
Workaround: Reboot the worker node on which the pod is currently running and stuck in a terminating state. This results in successful pod termination and subsequently related DR API resources are also garbage collected.
(BZ#2159791)

Regional DR CephFS based application failover shows a warning about subscription
After the application is failed over or relocated, the hub subscriptions show errors stating, "Some resources failed to deploy. Use View status YAML link to view the details." This is because the application persistent volume claims (PVCs) that use CephFS as the backing storage provisioner, are deployed using Red Hat Advanced Cluster Management for Kubernetes (RHACM) subscriptions, and are DR protected are owned by the respective DR controllers.
Workaround: There are no workarounds to rectify the errors in the subscription status. However, the subscription resources that failed to deploy can be checked to make sure they are PVCs. This ensures that the other resources do not have problems. If the only resources in the subscription that fail to deploy are the ones that are DR protected, the error can be ignored.
(BZ-2264445)

Disabled PeerReady flag prevents changing the action to Failover
The DR controller executes full reconciliation as and when needed. When a cluster becomes inaccessible, the DR controller performs a sanity check. If the workload is already relocated, this sanity check causes the PeerReady flag associated with the workload to be disabled, and the sanity check does not complete due to the cluster being offline. As a result, the disabled PeerReady flag prevents you from changing the action to Failover.
Workaround: Use the command-line interface to change the DR action to Failover despite the disabled PeerReady flag.
(BZ-2264765)
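One way to change the DR action from the command line is a merge patch on the DRPlacementControl. The following is a sketch under stated assumptions: the DRPC name, namespace, and target cluster are hypothetical, and it assumes that the action and target cluster are set through the spec.action and spec.failoverCluster fields. The script prints the command for review rather than running it.

```shell
# Hypothetical example values; substitute your own.
drpc_name="busybox-drpc"
drpc_namespace="busybox"
failover_cluster="cluster2"   # surviving managed cluster to fail over to

# Merge patch that sets the action and the target cluster
# (field names are assumptions; verify them against your DRPC).
patch="{\"spec\":{\"action\":\"Failover\",\"failoverCluster\":\"${failover_cluster}\"}}"

# Print the command for review; run it on the hub cluster.
echo "oc patch drpc ${drpc_name} -n ${drpc_namespace} --type merge --patch '${patch}'"
```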

Ceph becomes inaccessible and IO is paused when connection is lost between the two data centers in stretch cluster
When two data centers lose connection with each other but are still connected to the Arbiter node, there is a flaw in the election logic that causes an infinite election between the monitors. As a result, the monitors are unable to elect a leader and the Ceph cluster becomes unavailable. Also, IO is paused during the connection loss.
Workaround: Shut down the monitors in one of the data centers where monitors are out of quorum (you can find this by running ceph -s command) and reset the connection scores of the remaining monitors.
As a result, monitors can form a quorum and Ceph becomes available again and IOs resume.
(Partner BZ#2265992)

RBD applications fail to Relocate when using stale Ceph pool IDs from replacement cluster
For applications created before the new peer cluster is created, the RBD PVC cannot be mounted because, when a peer cluster is replaced, the CephBlockPoolID mapping in the CSI configmap cannot be updated.
Workaround: Update the rook-ceph-csi-mapping-config configmap with the cephBlockPoolID mapping on the peer cluster that is not replaced. This enables mounting the RBD PVC for the application.
(BZ#2267731)

Information about lastGroupSyncTime is lost after hub recovery for the workloads which are primary on the unavailable managed cluster
Applications that were previously failed over to a managed cluster do not report a lastGroupSyncTime, which triggers the VolumeSynchronizationDelay alert. This is because when the ACM hub and a managed cluster that are part of the DRPolicy are unavailable, a new ACM hub cluster is reconstructed from the backup.
Workaround: If the managed cluster to which the workload was failed over is unavailable, you can still fail over to a surviving managed cluster.
(BZ#2275320)

MCO operator reconciles the veleroNamespaceSecretKeyRef and CACertificates fields
When the OpenShift Data Foundation operator is upgraded, the CACertificates and veleroNamespaceSecretKeyRef fields under s3StoreProfiles in the Ramen config are lost.
Workaround: If the Ramen config has the custom values for the CACertificates and veleroNamespaceSecretKeyRef fields, then set those custom values after the upgrade is performed.
(BZ#2277941)

Instability of the token-exchange-agent pod after upgrade
The token-exchange-agent pod on the managed cluster is unstable as the old deployment resources are not cleaned up properly. This might cause application failover action to fail.
Workaround: Refer to the knowledgebase article, "token-exchange-agent" pod on managed cluster is unstable after upgrade to ODF 4.16.0.
Result: If the workaround is followed, "token-exchange-agent" pod is stabilized and failover action works as expected.
(BZ#2293611)

virtualmachines.kubevirt.io resource fails restore due to MAC allocation failure on relocate
When a virtual machine is relocated to the preferred cluster, it might fail to complete relocation because the MAC address is unavailable. This happens if the virtual machine is not fully cleaned up on the preferred cluster when it is failed over to the failover cluster.
Workaround: Ensure that the workload is completely removed from the preferred cluster before relocating the workload.
(BZ#2295404)

Post hub recovery, subscription app pods are not coming up after Failover
Post hub recovery, the subscription application pods do not come up after failover from the primary to the secondary managed cluster. An RBAC error occurs in the AppSub subscription resource on the managed cluster. This is due to a timing issue in the backup and restore scenario. When the application-manager pod is restarted on each managed cluster, the hub subscription and channel resources are not recreated in the new hub. As a result, the child AppSub subscription resource is reconciled with an error.
Workaround:
Fetch the name of the appsub using the following command:
```
% oc get appsub -n <namespace of sub app>
```
Add a new label with any value to the AppSub on the hub using the following command:
```
% oc edit appsub -n <appsub-namespace> <appsub>-subscription-1
```
If the child appsub error still exists and shows an unknown certificate issue, restart the application-manager pod on the managed cluster to which the workloads are failed over.
```
% oc delete pods -n open-cluster-management-agent-addon application-manager-<>-<>
```
(BZ#2295782)
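As a non-interactive alternative to editing the AppSub by hand, the label from the second step could be added with oc label. This is a sketch only: the namespace, the AppSub name, and the dr-refresh=1 label are hypothetical, and any label key and value would serve. The script prints the command for review rather than running it.

```shell
# Hypothetical example values; substitute your own.
appsub_namespace="busybox"
appsub_name="busybox-subscription-1"   # follows the <appsub>-subscription-1 pattern
label="dr-refresh=1"                   # arbitrary label; any key and value works

# Print the command for review; run it on the hub cluster.
label_cmd="oc label appsub ${appsub_name} -n ${appsub_namespace} ${label}"
echo "${label_cmd}"
```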

Failover process fails when the ReplicationDestination resource has not been created yet
If the user initiates a failover before the LastGroupSyncTime is updated, the failover process might fail. This failure is accompanied by an error message indicating that the ReplicationDestination does not exist.
Workaround:
Edit the ManifestWork for the VRG on the hub cluster.
Delete the following section from the manifest:
```
/spec/workload/manifests/0/spec/volsync
```
Save the changes.
Applying this workaround correctly ensures that the VRG skips attempting to restore the PVC using the ReplicationDestination resource. If the PVC already exists, the application uses it as is. If the PVC does not exist, a new PVC is created.
(BZ#2283038)
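Because the section to delete is given as a path, the same change could also be expressed as a JSON patch with a remove operation instead of a manual edit. The following sketch assumes hypothetical values for the managed cluster namespace and the ManifestWork name; look up the real name of the VRG ManifestWork on the hub cluster first. The script prints the command for review rather than running it.

```shell
# Hypothetical example values; look up the real VRG ManifestWork on the hub cluster.
managed_cluster_ns="cluster1"
manifestwork_name="example-vrg-mw"

# JSON patch that removes the volsync section named in the workaround.
patch='[{"op":"remove","path":"/spec/workload/manifests/0/spec/volsync"}]'

# Print the command for review; run it on the hub cluster.
echo "oc patch manifestwork ${manifestwork_name} -n ${managed_cluster_ns} --type json --patch '${patch}'"
```

A remove operation fails if the path does not exist, so the command reports an error rather than silently doing nothing when the section is already gone.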

7.2. Multicloud Object Gateway

OpenShift Data Foundation does not support automatic data integrity checks through AWS SDKs
AWS Software Development Kits (SDKs) support data integrity checks by default. Previously, these checks were opt-in. OpenShift Data Foundation does not support the default behavior of automatic data integrity checks.
(DFBUGS-1537)

Multicloud Object Gateway instance fails to finish initialization
Due to a race condition between the start of the pod code and OpenShift loading the Certificate Authority (CA) bundle into the pod, the pod is unable to communicate with the cloud storage service. As a result, the default backing store cannot be created.
Workaround: Restart the Multicloud Object Gateway (MCG) operator pod:
```
$ oc delete pod noobaa-operator-<ID>
```
After the operator pod restarts, the backing store is reconciled and works.
(BZ#2271580)

Upgrade to OpenShift Data Foundation 4.16 results in noobaa-db pod CrashLoopBackOff state
Upgrading from OpenShift Data Foundation 4.15 to OpenShift Data Foundation 4.16 fails when the PostgreSQL upgrade fails in Multicloud Object Gateway, which always starts with PostgreSQL version 15. If there is a PostgreSQL upgrade failure, the noobaa-db-pg-0 pod fails to start.
Workaround: Refer to the knowledgebase article Recover NooBaa’s PostgreSQL upgrade failure in OpenShift Data Foundation 4.16.
(BZ#2298152)

7.3. Ceph

Poor performance of the stretch clusters on CephFS
Workloads with many small metadata operations might exhibit poor performance because of the arbitrary placement of metadata server (MDS) on multi-site Data Foundation clusters.
(BZ#1982116)

SELinux relabelling issue with a very high number of files
When attaching volumes to pods in Red Hat OpenShift Container Platform, the pods sometimes do not start or take an excessive amount of time to start. This behavior is generic and is tied to how SELinux relabelling is handled by the Kubelet. This issue is observed with any filesystem based volumes that have very high file counts. In OpenShift Data Foundation, the issue is seen when using CephFS based volumes with a very high number of files. There are different ways to work around this issue. Depending on your business needs, you can choose one of the workarounds from the knowledgebase solution https://access.redhat.com/solutions/6221251.
(Jira#3327)

Ceph reports no active mgr after workload deployment
After workload deployment, Ceph manager loses connectivity to MONs or is unable to respond to its liveness probe.
This causes the OpenShift Data Foundation cluster status to report that there is "no active mgr". As a result, multiple operations that use the Ceph manager for request processing fail, for example, volume provisioning and creating CephFS snapshots.
To check the status of the OpenShift Data Foundation cluster, use the command oc get cephcluster -n openshift-storage. In the status output, the status.ceph.details.MGR_DOWN field will have the message "no active mgr" if your cluster has this issue.
Workaround: Restart the Ceph manager pods using the following commands:
```
# oc scale deployment -n openshift-storage rook-ceph-mgr-a --replicas=0
```
```
# oc scale deployment -n openshift-storage rook-ceph-mgr-a --replicas=1
```
After running these commands, the OpenShift Data Foundation cluster status reports a healthy cluster, with no warnings or errors regarding MGR_DOWN.
(BZ#2244873)
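The status check described above can be scripted. The following is a minimal sketch that looks for the "no active mgr" message in text on standard input. The function name has_no_active_mgr and the sample status text are hypothetical; the sample stands in for real output so that the sketch runs without a cluster.

```shell
# Succeeds (exit status 0) if the text on standard input contains the MGR_DOWN message.
has_no_active_mgr() {
  grep -q "no active mgr"
}

# Abbreviated, hypothetical sample standing in for the cephcluster status.
sample_status='status:
  ceph:
    details:
      MGR_DOWN:
        message: no active mgr'

if printf '%s\n' "${sample_status}" | has_no_active_mgr; then
  echo "MGR_DOWN detected: restart the Ceph manager pods"
fi

# Against a live cluster (not executed here):
#   oc get cephcluster -n openshift-storage -o yaml | has_no_active_mgr
```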

7.4. CSI Driver

Automatic flattening of snapshots is not working
When there is a single common parent RBD PVC, if volume snapshot, restore, and delete snapshot are performed in sequence more than 450 times, it is no longer possible to take a volume snapshot or clone of the common parent RBD PVC.
To work around this issue, use PVC to PVC clone instead of performing volume snapshot, restore, and delete snapshot in sequence. This avoids the issue completely.
If you hit this issue, contact customer support to perform manual flattening of the final restored PVCs to continue to take volume snapshot or clone of the common parent PVC again.
(BZ#2232163)

7.5. OpenShift Data Foundation console

Optimize DRPC creation when multiple workloads are deployed in a single namespace
When multiple applications refer to the same placement, then enabling DR for any of the applications enables it for all the applications that refer to the placement.
If the applications are created after the creation of the DRPC, the PVC label selector in the DRPC might not match the labels of the newer applications.
Workaround: In such cases, disabling DR and enabling it again with the right label selector is recommended.
(BZ#2294704)

Last snapshot synced is missing for appset based applications on the DR monitoring dashboard
ApplicationSet type applications do not display last volume snapshot sync time on the monitoring dashboard.
Workaround: Go to Applications navigation under ACM perspective and filter the desired application from the list. Then from the Data policy column (popover) note the "Sync status".
(BZ#2295324)

7.6. OCS operator

Incorrect unit for the ceph_mds_mem_rss metric in the graph
When you search for the ceph_mds_mem_rss metric in the OpenShift user interface (UI), the graphs show the y-axis in megabytes (MB) because Ceph returns the ceph_mds_mem_rss metric in kilobytes (KB). This can cause confusion when comparing the results with the MDSCacheUsageHigh alert.
Workaround: Use ceph_mds_mem_rss * 1000 when searching for this metric in the OpenShift UI to see the y-axis of the graph in gigabytes (GB). This makes it easier to compare the results with those shown in the MDSCacheUsageHigh alert.
(BZ#2261881)
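The effect of the multiplication can be checked with a small worked example. The sample value of 4000000 is hypothetical, and the sketch assumes, consistent with the description above, that the graph interprets the raw number as bytes.

```shell
# Hypothetical sample: the metric reports 4000000, which is in kilobytes.
rss_kb=4000000

# Scale to bytes as in the workaround (ceph_mds_mem_rss * 1000).
rss_bytes=$((rss_kb * 1000))

# Unscaled, 4000000 read as bytes is plotted in the MB range;
# scaled, the same memory usage is plotted in the GB range.
echo "raw: ${rss_kb}, scaled: ${rss_bytes} (about $((rss_bytes / 1000000000)) GB)"
```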

Increasing MDS memory is erasing CPU values when pods are in CLBO state
When the metadata server (MDS) memory is increased while the MDS pods are in a crash loop back off (CLBO) state, CPU request or limit for the MDS pods is removed. As a result, the CPU request or the limit that is set for the MDS changes.
Workaround: Run the oc patch command to adjust the CPU limits.
For example:
```
$ oc patch -n openshift-storage storagecluster ocs-storagecluster \
    --type merge \
    --patch '{"spec": {"resources": {"mds": {"limits": {"cpu": "3"},
    "requests": {"cpu": "3"}}}}}'
```
(BZ#2265563)

7.7. Non-availability of IBM Z platform

The IBM Z platform is not available with the OpenShift Data Foundation 4.16 release. IBM Z will be available with full features and functionality in an upcoming release.

(BZ#2279527)
