Chapter 8. Known issues
This section describes the known issues in Red Hat OpenShift Data Foundation 4.18.
8.1. Disaster recovery
Node crash results in kubelet service failure causing Data Foundation in error state
An unexpected node crash in an OpenShift cluster might leave the node stuck in the NotReady state and affect the storage cluster.
Workaround:
Get the pending certificate signing requests (CSRs):
$ oc get csr | grep Pending
Approve the pending CSRs.
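The approval command itself is not shown above; a minimal sketch using the standard oc certificate command, where the CSR name is a placeholder taken from the output of the previous step:

```shell
# Approve a single pending CSR by name (NAME column from "oc get csr")
oc adm certificate approve <csr_name>

# Or approve every currently pending CSR in one pass
oc get csr | grep Pending | awk '{print $1}' | xargs -r oc adm certificate approve
```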
CIDR range does not persist in csiaddonsnode object when the respective node is down
When a node is down, the Classless Inter-Domain Routing (CIDR) information disappears from the csiaddonsnode object. This impacts the fencing mechanism when it is required to fence the impacted nodes.
Workaround: Collect the CIDR information immediately after the NetworkFenceClass object is created.
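One way to collect that information while the nodes are still healthy is to save the csiaddonsnode objects to a file; the openshift-storage namespace is an assumption and may differ in your deployment:

```shell
# Record the CIDR details from all csiaddonsnode objects before any node goes down
oc get csiaddonsnode -n openshift-storage -o yaml > csiaddonsnode-cidrs.yaml
```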
After node replacement, new mon pod is failing to schedule
After node replacement, the new mon pod fails to schedule on the newly added node. As a result, the mon pod is stuck in the Pending state, which impacts the storage cluster status with a mon being unavailable.
Workaround: Manually update the new mon deployment with the correct nodeSelector.
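A hedged sketch of that manual update; the mon deployment suffix and the node name are placeholders, and the patch assumes the usual hostname-based nodeSelector used by Rook mon deployments:

```shell
# Point the stuck mon deployment at the replacement node
# <id> is the mon letter (a, b, c, ...); <new-node-name> is the added node's hostname
oc patch deployment rook-ceph-mon-<id> -n openshift-storage --type merge \
    -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/hostname":"<new-node-name>"}}}}}'
```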
ceph df reports an invalid MAX AVAIL value when the cluster is in stretch mode
When a CRUSH rule in a Red Hat Ceph Storage cluster has multiple take steps, the ceph df report shows the wrong maximum available size for the associated pools.
DRPCs protect all persistent volume claims created on the same namespace
When a namespace hosts multiple disaster recovery (DR) protected workloads, each DRPlacementControl resource in that namespace on the hub cluster protects all the persistent volume claims (PVCs) in the namespace unless its spec.pvcSelector field is used to isolate PVCs per workload.
This results in PVCs matching the spec.pvcSelector of multiple DRPlacementControl resources, or, if the selector is missing across all workloads, in replication management potentially managing each PVC multiple times, causing data corruption or invalid operations based on individual DRPlacementControl actions.
Workaround: Label the PVCs that belong to a workload uniquely, and use the selected label as the DRPlacementControl spec.pvcSelector to disambiguate which DRPlacementControl protects and manages which subset of PVCs within a namespace. It is not possible to specify the spec.pvcSelector field for the DRPlacementControl using the user interface, hence the DRPlacementControl for such applications must be deleted and created using the command line.
Result: PVCs are no longer managed by multiple DRPlacementControl resources and do not cause any operation and data inconsistencies.
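The labeling step might look like the following; the label key and value, PVC name, and namespace are illustrative only:

```shell
# Uniquely label the PVCs that belong to one workload (key/value are examples)
oc label pvc <pvc-name> -n <app-namespace> appname=myapp-1

# The recreated DRPlacementControl then selects only those PVCs, for example:
#   spec:
#     pvcSelector:
#       matchLabels:
#         appname: myapp-1
```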
Disabled PeerReady flag prevents changing the action to Failover
The DR controller executes full reconciliation as and when needed. When a cluster becomes inaccessible, the DR controller performs a sanity check. If the workload is already relocated, this sanity check causes the PeerReady flag associated with the workload to be disabled, and the sanity check does not complete because the cluster is offline. As a result, the disabled PeerReady flag prevents you from changing the action to Failover.
Workaround: Use the command-line interface to change the DR action to Failover despite the disabled PeerReady flag.
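Changing the action from the CLI can be done by patching the DRPlacementControl on the hub cluster; the resource name and namespace are placeholders:

```shell
# Set the DR action to Failover directly on the DRPlacementControl resource
oc patch drpc <drpc-name> -n <drpc-namespace> \
    --type merge -p '{"spec":{"action":"Failover"}}'
```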
Ceph becomes inaccessible and IO is paused when connection is lost between the two data centers in stretch cluster
When two data centers lose connection with each other but are still connected to the Arbiter node, there is a flaw in the election logic that causes an infinite election among Ceph Monitors. As a result, the Monitors are unable to elect a leader and the Ceph cluster becomes unavailable. Also, IO is paused during the connection loss.
Workaround: Shut down the monitors of one data zone by bringing down that zone's nodes. Additionally, you can reset the connection scores of the surviving Monitor pods.
As a result, the Monitors can form a quorum, Ceph becomes available again, and IO resumes.
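Resetting the connection scores can be done from inside each surviving monitor pod; this sketch assumes the Ceph admin-socket interface available in Rook mon containers, with the mon id as a placeholder:

```shell
# Reset the connection scores on a surviving monitor (repeat per mon; <x> is the mon id, e.g. a)
oc exec -n openshift-storage deploy/rook-ceph-mon-<x> -c mon -- \
    ceph daemon mon.<x> connection scores reset
```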
RBD applications fail to Relocate when using stale Ceph pool IDs from replacement cluster
For applications created before the new peer cluster is created, it is not possible to mount the RBD PVC because, when a peer cluster is replaced, the CephBlockPoolID mapping in the CSI configmap is not updated.
Workaround: Update the rook-ceph-csi-mapping-config configmap with the CephBlockPoolID mapping on the peer cluster that is not replaced. This enables mounting the RBD PVC for the application.
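The update can be made interactively; the openshift-storage namespace is the usual location for this configmap but may differ in your deployment:

```shell
# On the surviving (non-replaced) peer cluster, edit the pool-ID mapping
oc edit configmap rook-ceph-csi-mapping-config -n openshift-storage
```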
Information about lastGroupSyncTime is lost after hub recovery for the workloads which are primary on the unavailable managed cluster
Applications that were previously failed over to a managed cluster do not report a lastGroupSyncTime, which triggers the VolumeSynchronizationDelay alert. This happens because, when the ACM hub and a managed cluster that are part of the DRPolicy are unavailable, a new ACM hub cluster is reconstructed from the backup.
Workaround: If the managed cluster to which the workload was failed over is unavailable, you can still fail over to a surviving managed cluster.
MCO operator reconciles the veleroNamespaceSecretKeyRef and CACertificates fields
When the OpenShift Data Foundation operator is upgraded, the CACertificates and veleroNamespaceSecretKeyRef fields under s3StoreProfiles in the Ramen config are lost.
Workaround: If the Ramen config has custom values for the CACertificates and veleroNamespaceSecretKeyRef fields, set those custom values again after the upgrade is performed.
For discovered apps with CephFS, sync stops after failover
For CephFS-based workloads, synchronization of discovered applications may stop at some point after a failover or relocation. This can occur with a Permission Denied error reported in the ReplicationSource status.
Workaround:
For Non-Discovered Applications
Delete the VolumeSnapshot:
$ oc delete volumesnapshot -n <vrg-namespace> <volumesnapshot-name>
The snapshot name usually starts with the PVC name followed by a timestamp.
Delete the VolSync Job:
$ oc delete job -n <vrg-namespace> <pvc-name>
The job name matches the PVC name.
For Discovered Applications
Use the same steps as above, except that <namespace> refers to the application workload namespace, not the VRG namespace.
For Workloads Using Consistency Groups
Delete the ReplicationGroupSource:
$ oc delete replicationgroupsource -n <namespace> <name>
Delete all VolSync Jobs in that namespace:
$ oc delete jobs --all -n <namespace>
In this case, <namespace> refers to the namespace of the workload (either discovered or not), and <name> refers to the name of the ReplicationGroupSource resource.
Remove DR option is not available for discovered apps on the Virtual machines page
The Remove DR option is not available for discovered applications listed on the Virtual machines page.
Workaround:
Add the missing label to the DRPlacementControl:
$ oc label drplacementcontrol <drpcname> \
    odf.console.selector/resourcetype=virtualmachine \
    -n openshift-dr-ops
Add the PROTECTED_VMS recipe parameter with the virtual machine name as its value:
$ oc patch drplacementcontrol <drpcname> \
    -n openshift-dr-ops \
    --type='merge' \
    -p '{"spec":{"kubeObjectProtection":{"recipeParameters":{"PROTECTED_VMS":["<vm-name>"]}}}}'
DR Status is not displayed for discovered apps on the Virtual machines page
DR Status is not displayed for discovered applications listed on the Virtual machines page.
Workaround:
Add the missing label to the DRPlacementControl:
$ oc label drplacementcontrol <drpcname> \
    odf.console.selector/resourcetype=virtualmachine \
    -n openshift-dr-ops
Add the PROTECTED_VMS recipe parameter with the virtual machine name as its value:
$ oc patch drplacementcontrol <drpcname> \
    -n openshift-dr-ops \
    --type='merge' \
    -p '{"spec":{"kubeObjectProtection":{"recipeParameters":{"PROTECTED_VMS":["<vm-name>"]}}}}'
PVCs deselected after failover do not clean up the stale entries in the secondary VRG, causing the subsequent relocate to fail
If PVCs were deselected after a workload failover, and a subsequent relocate operation is performed back to the preferredCluster, stale PVCs may still be reported in the VRG. As a result, the DRPC may report its Protected condition as False, with a message similar to the following: VolumeReplicationGroup (/) on cluster is not reporting any lastGroupSyncTime as primary, retrying till status is met.
Workaround:
To resolve this issue, manually clean up the stale PVCs (that is, those deselected after failover) from the VRG status.
Identify the stale PVCs that were deselected after failover and are no longer intended to be protected.
Edit the VRG status on the ManagedCluster named <managed-cluster-name>:
$ oc edit vrg <vrg-name> -n <vrg-namespace> --subresource=status
Remove the stale PVC entries from the status.protectedPVCs section.
Once the stale entries are removed, the DRPC recovers and reports as healthy.
Secondary PVCs aren’t removed when DR protection is removed for discovered apps
On the secondary cluster, CephFS PVCs linked to a workload are usually managed by the VolumeReplicationGroup (VRG). However, when a workload is discovered using the Discovered Applications feature, the associated CephFS PVCs are not marked as VRG-owned. As a result, when the workload is disabled, these PVCs are not automatically cleaned up and become orphaned.
Workaround: To clean up the orphaned CephFS PVCs after disabling DR protection for a discovered workload, manually delete them using the following command:
$ oc delete pvc <pvc-name> -n <pvc-namespace>
DRPC in Relocating state after minor upgrade
After upgrading from version 4.19 to 4.20, the DRPC (DRPlacementControl) may enter a Relocating state. During this process, a new VolumeGroupReplication (VGR) is created with a different naming convention, resulting in two VGRs attempting to claim the same PVC. This conflict can cause temporary instability in the DRPC status.
Workaround: Delete the old VGR (the one with the previous naming convention). The new VGR will then successfully claim the PVC, and the DRPC will return to a healthy state after some time.
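A sketch of the cleanup; the resource short name, workload namespace, and VGR name are placeholders, and identifying the old VGR relies on spotting the previous naming convention in the listing:

```shell
# List the VolumeGroupReplication resources and identify the one with the old name
oc get volumegroupreplication -n <workload-namespace>

# Delete the stale VGR so that the new one can claim the PVC
oc delete volumegroupreplication <old-vgr-name> -n <workload-namespace>
```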
Ceph in warning state after adding capacity to cluster
After device replacement or capacity addition, Ceph might be in the HEALTH_WARN state with a mon reporting slow ops. However, there is no impact to the usability of the cluster.
OSD pods restart during add capacity
OSD pods restart after performing cluster expansion by adding capacity to the cluster. However, no impact to the cluster is observed apart from the pod restarts.
Sync stops after PVC deselection
When a PersistentVolumeClaim (PVC) is added to or removed from a group by modifying its label to match or unmatch the group criteria, sync operations may unexpectedly stop. This occurs due to stale protected PVC entries remaining in the VolumeReplicationGroup (VRG) status.
Workaround:
Manually edit the VRG status field to remove the stale protected PVC entries:
$ oc edit vrg <vrg-name> -n <vrg-namespace> --subresource=status
PVs remain stuck in Released state after workload deletion
After the final sync, all temporary PVs/PVCs are deleted; however, for some PVs, the persistentVolumeReclaimPolicy remains set to Retain, causing the PVs to stay in the Released state.
Workaround:
Edit the PV's persistentVolumeReclaimPolicy using the command:
$ oc edit pv <pv-name>
Change persistentVolumeReclaimPolicy to Delete. The stuck PVs then disappear.
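As a non-interactive alternative to oc edit, the reclaim policy can also be changed with a one-line patch; the PV name is a placeholder:

```shell
# Switch the reclaim policy so the released PV is cleaned up
oc patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'
```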
8.2. Multicloud Object Gateway
Unable to create new OBCs using NooBaa
When provisioning an NSFS bucket via an ObjectBucketClaim (OBC), the default filesystem path is expected to use the bucket name. However, if a path is set in OBC.Spec.AdditionalConfig, it should take precedence. This behavior is currently inconsistent, resulting in failures when creating new OBCs.
8.3. Ceph
Poor CephFS performance on stretch clusters
Workloads with many small metadata operations might exhibit poor performance because of the arbitrary placement of metadata server pods (MDS) on multi-site Data Foundation clusters.
SELinux relabelling issue with a very high number of files
When attaching volumes to pods in Red Hat OpenShift Container Platform, the pods sometimes do not start or take an excessive amount of time to start. This behavior is generic and it is tied to how SELinux relabelling is handled by Kubelet. This issue is observed with any filesystem based volumes having very high file counts. In OpenShift Data Foundation, the issue is seen when using CephFS based volumes with a very high number of files. There are multiple ways to work around this issue. Depending on your business needs you can choose one of the workarounds from the knowledgebase solution https://access.redhat.com/solutions/6221251.
8.4. OpenShift Data Foundation console
UI temporarily shows an "Unauthorized" error and a blank loading screen during ODF operator installation
During OpenShift Data Foundation operator installation, the InstallPlan sometimes transiently goes missing, which causes the page to show an unknown status. This does not happen regularly. As a result, the messages and title go missing for a few seconds.
Optimize DRPC creation when multiple workloads are deployed in a single namespace
When multiple applications refer to the same placement, then enabling DR for any of the applications enables it for all the applications that refer to the placement.
If the applications are created after the creation of the DRPC, the PVC label selector in the DRPC might not match the labels of the newer applications.
Workaround: In such cases, disabling DR and enabling it again with the right label selector is recommended.
8.5. OCS operator
8.6. ODF-CLI
ODF-CLI tools misidentify stale volumes
The stale subvolume CLI tool misidentifies valid CephFS persistent volume claims (PVCs) as stale due to an issue in the stale subvolume identification tool. As a result, the stale subvolume identification functionality is not available until the issue is fixed.