Chapter 7. Known issues
This section describes the known issues in Red Hat OpenShift Data Foundation 4.18.
7.1. Disaster recovery
- After relocation of a consistency-group-based workload, synchronization stops
  When applications that use CephRBD volumes with volume consistency groups enabled are running and the secondary managed cluster goes offline, replication for these volumes might halt indefinitely. The issue can persist even after the secondary cluster comes back online.
  The VolumeSynchronizationDelay alert is triggered, initially with a Warning status and later escalating to Critical. This indicates that replication has stopped for the CephRBD volumes within the volume consistency groups of the affected applications.
  Workaround: Contact Red Hat Support.
- Node crash results in kubelet service failure, leaving Data Foundation in an error state
  An unexpected node crash in an OpenShift cluster might leave the node stuck in the NotReady state and affect the storage cluster.
  Workaround:
  - Get the pending CSR:
      $ oc get csr | grep Pending
  - Approve the pending CSR.
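The approval command is not shown in the steps above. A minimal sketch, assuming the standard oc adm certificate approve subcommand and cluster-admin rights: the helper below only filters the oc get csr listing, so the names can be reviewed before anything is approved.

```shell
#!/bin/sh
# Sketch: read `oc get csr` output on stdin and print the names of the
# requests whose last column (CONDITION) is exactly "Pending".
pending_csrs() {
    awk '$NF == "Pending" { print $1 }'
}

# Example (assumes cluster-admin access; review the names first):
#   oc get csr | pending_csrs | while read -r name; do
#       oc adm certificate approve "$name"
#   done
```

Matching on the last column avoids approving a request whose name merely contains the word Pending.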
- Missing s3StoreProfile in ramen-hub-operator-config after upgrading from 4.18 to 4.19
  When a configmap is overridden with the default values, the custom S3Profiles and other details added by the Multicluster Orchestrator (MCO) operator are lost. This happens because, after the Ramen-DR hub operator is upgraded, OLM overwrites the existing ramen-hub-operator-config configmap with the default values provided by the Ramen-hub CSV.
  Workaround: Restart the MCO operator on the hub cluster. As a result, the required values, such as S3Profiles, are updated in the configmap.
- CIDR range does not persist in the csiaddonsnode object when the respective node is down
  When a node is down, the Classless Inter-Domain Routing (CIDR) information disappears from the csiaddonsnode object. This impacts the fencing mechanism when the affected nodes need to be fenced.
  Workaround: Collect the CIDR information immediately after the NetworkFenceClass object is created.
- After node replacement, the new mon pod fails to schedule
  After node replacement, the new mon pod fails to schedule itself on the newly added node. As a result, the mon pod is stuck in the Pending state, which affects the storagecluster status because a mon is unavailable.
  Workaround: Manually update the new mon deployment with the correct nodeSelector.
- Disaster Recovery is misconfigured after upgrade from v4.17.z to v4.18
  When ODF Multicluster Orchestrator and OpenShift DR Hub Operator are upgraded from 4.17.z to 4.18, certain Disaster Recovery resources are misconfigured in internal mode deployments. This impacts Disaster Recovery of workloads that use the ocs-storagecluster-ceph-rbd and ocs-storagecluster-ceph-rbd-virtualization StorageClasses.
  Workaround: Follow the instructions in this knowledgebase article.
- ceph df reports an invalid MAX AVAIL value when the cluster is in stretch mode
  When a CRUSH rule in a Red Hat Ceph Storage cluster has multiple take steps, the ceph df report shows the wrong maximum available size for the associated pools.
- DRPCs protect all persistent volume claims created on the same namespace
  In a namespace that hosts multiple disaster recovery (DR) protected workloads, every DRPlacementControl resource on the hub cluster that does not isolate PVCs by workload through its spec.pvcSelector field protects all the persistent volume claims (PVCs) in that namespace.
  As a result, PVCs match the spec.pvcSelector of several DRPlacementControl resources. If the selector is missing for all workloads, replication management might manage each PVC multiple times and cause data corruption or invalid operations based on individual DRPlacementControl actions.
  Workaround: Label the PVCs that belong to a workload uniquely, and use that label as the DRPlacementControl spec.pvcSelector to disambiguate which DRPlacementControl protects and manages which subset of PVCs within a namespace. The spec.pvcSelector field cannot be set for the DRPlacementControl from the user interface, so the DRPlacementControl for such applications must be deleted and re-created using the command line.
  Result: PVCs are no longer managed by multiple DRPlacementControl resources and do not cause operation or data inconsistencies.
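The labeling part of this workaround can be sketched as follows. This is a dry-run helper under stated assumptions: the namespace, label, and PVC names are hypothetical, and the helper only prints oc label commands rather than running them.

```shell
#!/bin/sh
# Sketch: print (do not run) one `oc label` command per PVC so that all PVCs
# of a single workload carry the same, workload-specific label.
# Usage: print_pvc_label_cmds <namespace> <key=value> <pvc>...
print_pvc_label_cmds() {
    ns=$1
    label=$2
    shift 2
    for pvc in "$@"; do
        printf 'oc label pvc %s %s -n %s\n' "$pvc" "$label" "$ns"
    done
}

# Example with hypothetical names; pipe the output to `sh` to apply it:
#   print_pvc_label_cmds my-namespace workload=app-a data-pvc logs-pvc
```

The same key and value would then be used as the label in the spec.pvcSelector of the re-created DRPlacementControl.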
- Disaster recovery workloads remain stuck when deleted
  When a workload is deleted from a cluster, the corresponding pods might not terminate and might report events such as FailedKillPod. This can delay or prevent garbage collection of dependent DR resources such as the PVC, VolumeReplication, and VolumeReplicationGroup. It also prevents a future deployment of the same workload to the cluster because the stale resources are not yet garbage collected.
  Workaround: Reboot the worker node on which the pod is currently running and stuck in a terminating state. The pod then terminates successfully, and the related DR API resources are garbage collected.
- Regional-DR CephFS-based application failover shows a warning about subscription
  After the application is failed over or relocated, the hub subscriptions show errors stating, "Some resources failed to deploy. Use the View status YAML link to view the details." This is because the application persistent volume claims (PVCs) that use CephFS as the backing storage provisioner, are deployed using Red Hat Advanced Cluster Management for Kubernetes (RHACM) subscriptions, and are DR protected are owned by the respective DR controllers.
  Workaround: There is no workaround to rectify the errors in the subscription status. However, you can check the subscription resources that failed to deploy to make sure that they are PVCs, which confirms that the other resources do not have problems. If the only resources in the subscription that fail to deploy are the DR-protected ones, the error can be ignored.
- Disabled PeerReady flag prevents changing the action to Failover
  The DR controller executes full reconciliation as and when needed. When a cluster becomes inaccessible, the DR controller performs a sanity check. If the workload is already relocated, this sanity check causes the PeerReady flag associated with the workload to be disabled, and the sanity check does not complete because the cluster is offline. As a result, the disabled PeerReady flag prevents you from changing the action to Failover.
  Workaround: Use the command-line interface to change the DR action to Failover despite the disabled PeerReady flag.
- Ceph becomes inaccessible and IO is paused when connection is lost between the two data centers in stretch cluster
  When two data centers lose connection with each other but are still connected to the Arbiter node, a flaw in the election logic causes an infinite election among the Ceph Monitors. As a result, the Monitors are unable to elect a leader and the Ceph cluster becomes unavailable. IO is also paused during the connection loss.
  Workaround: Shut down the Monitors of any one data zone by bringing down the zone nodes. Additionally, you can reset the connection scores of the surviving Monitor pods.
  As a result, the Monitors can form a quorum, Ceph becomes available again, and IO resumes.
- RBD applications fail to relocate when using stale Ceph pool IDs from the replacement cluster
  For applications created before the new peer cluster is created, the RBD PVC cannot be mounted, because the CephBlockPoolID mapping in the CSI configmap cannot be updated when a peer cluster is replaced.
  Workaround: Update the rook-ceph-csi-mapping-config configmap with the CephBlockPoolID mapping on the peer cluster that is not replaced. This enables mounting the RBD PVC for the application.
- Information about lastGroupSyncTime is lost after hub recovery for the workloads which are primary on the unavailable managed cluster
  Applications that were previously failed over to a managed cluster do not report a lastGroupSyncTime, which triggers the VolumeSynchronizationDelay alert. This is because when the ACM hub and a managed cluster that are part of the DRPolicy are unavailable, a new ACM hub cluster is reconstructed from the backup.
  Workaround: If the managed cluster to which the workload was failed over is unavailable, you can still fail over to a surviving managed cluster.
- MCO operator reconciles the veleroNamespaceSecretKeyRef and CACertificates fields
  When the OpenShift Data Foundation operator is upgraded, the CACertificates and veleroNamespaceSecretKeyRef fields under s3StoreProfiles in the Ramen config are lost.
  Workaround: If the Ramen config has custom values for the CACertificates and veleroNamespaceSecretKeyRef fields, set those custom values again after the upgrade is performed.
- Instability of the token-exchange-agent pod after upgrade
  The token-exchange-agent pod on the managed cluster is unstable because the old deployment resources are not cleaned up properly. This might cause the application failover action to fail.
  Workaround: Refer to the knowledgebase article, "token-exchange-agent" pod on managed cluster is unstable after upgrade to ODF 4.17.0.
  Result: If the workaround is followed, the token-exchange-agent pod is stabilized and the failover action works as expected.
- virtualmachines.kubevirt.io resource fails restore due to MAC allocation failure on relocate
  When a virtual machine is relocated to the preferred cluster, the relocation might fail to complete because its MAC address is unavailable. This happens if the virtual machine is not fully cleaned up on the preferred cluster when it is failed over to the failover cluster.
  Workaround: Ensure that the workload is completely removed from the preferred cluster before relocating the workload.
- Disabling DR for a CephFS application with consistency groups enabled may leave some resources behind
  Disabling DR for a CephFS application with consistency groups enabled may leave some resources behind. In such cases, manual cleanup might be required.
  Workaround: Clean up the resources manually by following these steps:
  On the secondary cluster:
  - Manually delete the ReplicationGroupDestination:
      $ oc delete rgd -n <namespace>
  - Confirm that the following resources have been deleted:
    - ReplicationGroupDestination
    - VolumeSnapshot
    - VolumeSnapshotContent
    - ReplicationDestination
    - VolumeReplicationGroup
  On the primary cluster:
  - Manually delete the ReplicationGroupSource:
      $ oc delete rgs -n <namespace>
  - Confirm that the following resources have been deleted:
    - ReplicationGroupSource
    - VolumeGroupSnapshot
    - VolumeGroupSnapshotContent
    - VolumeSnapshot
    - VolumeSnapshotContent
    - ReplicationSource
    - VolumeReplicationGroup
- For discovered apps with CephFS, sync stops after failover
  For CephFS-based workloads, synchronization of discovered applications may stop at some point after a failover or relocation. This can occur with a Permission Denied error reported in the ReplicationSource status.
  Workaround:
  For non-discovered applications:
  - Delete the VolumeSnapshot:
      $ oc delete volumesnapshot -n <vrg-namespace> <volumesnapshot-name>
    The snapshot name usually starts with the PVC name followed by a timestamp.
  - Delete the VolSync job:
      $ oc delete job -n <vrg-namespace> <pvc-name>
    The job name matches the PVC name.
  For discovered applications:
  - Use the same steps as above, except that the namespace refers to the application workload namespace, not the VRG namespace.
  For workloads using consistency groups:
  - Delete the ReplicationGroupSource:
      $ oc delete replicationgroupsource -n <namespace> <name>
  - Delete all VolSync jobs in that namespace:
      $ oc delete jobs --all -n <namespace>
    In this case, <namespace> refers to the namespace of the workload (either discovered or not), and <name> refers to the name of the ReplicationGroupSource resource.
- Remove DR option is not available for discovered apps on the Virtual machines page
  The Remove DR option is not available for discovered applications listed on the Virtual machines page.
  Workaround:
  - Add the missing label to the DRPlacementControl:
      $ oc label drplacementcontrol <drpcname> \
          odf.console.selector/resourcetype=virtualmachine \
          -n openshift-dr-ops
  - Add the PROTECTED_VMS recipe parameter with the virtual machine name as its value:
      $ oc patch drplacementcontrol <drpcname> \
          -n openshift-dr-ops \
          --type='merge' \
          -p '{"spec":{"kubeObjectProtection":{"recipeParameters":{"PROTECTED_VMS":["<vm-name>"]}}}}'
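The merge patch body used for PROTECTED_VMS can be generated instead of typed by hand. A minimal sketch: the function name is hypothetical, and passing more than one virtual machine name is an assumption based on the value being a JSON array; the steps above show a single name.

```shell
#!/bin/sh
# Sketch: build the merge patch that sets the PROTECTED_VMS recipe parameter.
# Accepting several VM names is an assumption; the documented command shows one.
# Usage: protected_vms_patch <vm-name>...
protected_vms_patch() {
    list=""
    for vm in "$@"; do
        # Append the quoted name, comma-separated after the first entry.
        list="${list:+$list,}\"$vm\""
    done
    printf '{"spec":{"kubeObjectProtection":{"recipeParameters":{"PROTECTED_VMS":[%s]}}}}\n' "$list"
}

# Example:
#   oc patch drplacementcontrol <drpcname> -n openshift-dr-ops \
#     --type='merge' -p "$(protected_vms_patch <vm-name>)"
```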
 
- DR Status is not displayed for discovered apps on the Virtual machines page
  DR Status is not displayed for discovered applications listed on the Virtual machines page.
  Workaround:
  - Add the missing label to the DRPlacementControl:
      $ oc label drplacementcontrol <drpcname> \
          odf.console.selector/resourcetype=virtualmachine \
          -n openshift-dr-ops
  - Add the PROTECTED_VMS recipe parameter with the virtual machine name as its value:
      $ oc patch drplacementcontrol <drpcname> \
          -n openshift-dr-ops \
          --type='merge' \
          -p '{"spec":{"kubeObjectProtection":{"recipeParameters":{"PROTECTED_VMS":["<vm-name>"]}}}}'
- PVCs deselected after failover do not clean up the stale entries in the secondary VRG, causing the subsequent relocate to fail
  If PVCs were deselected after a workload failover, and a subsequent relocate operation is performed back to the preferredCluster, stale PVCs may still be reported in the VRG. As a result, the DRPC may report its Protected condition as False, with a message similar to the following:
    VolumeReplicationGroup (/) on cluster is not reporting any lastGroupSyncTime as primary, retrying till status is met.
  Workaround: Manually clean up the stale PVCs (that is, those deselected after failover) from the VRG status.
  - Identify the stale PVCs that were deselected after failover and are no longer intended to be protected.
  - Edit the VRG status on the ManagedCluster named <managed-cluster-name>:
      $ oc edit vrg --subresource=status -n <vrg-namespace> <vrg-name>
  - Remove the stale PVC entries from the status.protectedPVCs section.
  Once the stale entries are removed, the DRPC recovers and reports as healthy.
- Secondary PVCs are not removed when DR protection is removed for discovered apps
  On the secondary cluster, CephFS PVCs linked to a workload are usually managed by the VolumeReplicationGroup (VRG). However, when a workload is discovered using the Discovered Applications feature, the associated CephFS PVCs are not marked as VRG-owned. As a result, when DR protection for the workload is disabled, these PVCs are not automatically cleaned up and become orphaned.
  Workaround: To clean up the orphaned CephFS PVCs after disabling DR protection for a discovered workload, manually delete them using the following command:
    $ oc delete pvc <pvc-name> -n <pvc-namespace>
- Failover process fails when the ReplicationDestination resource has not been created yet
  If the user initiates a failover before the LastGroupSyncTime is updated, the failover process might fail. This failure is accompanied by an error message indicating that the ReplicationDestination does not exist.
  Workaround:
  - Edit the ManifestWork for the VRG on the hub cluster.
  - Delete the following section from the manifest:
      /spec/workload/manifests/0/spec/volsync
  - Save the changes.
  Applying this workaround ensures that the VRG skips attempting to restore the PVC using the ReplicationDestination resource. If the PVC already exists, the application uses it as is. If the PVC does not exist, a new PVC is created.
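The manual edit can also be expressed as a JSON Patch remove operation. This is a sketch, not a documented procedure: the helper name is hypothetical, and it assumes that the ManifestWork lives in the managed cluster's namespace on the hub and that the change is not reverted by a later reconciliation.

```shell
#!/bin/sh
# Sketch: print a JSON Patch (RFC 6902) document that removes a single path.
# Usage: json_remove_patch <json-pointer-path>
json_remove_patch() {
    printf '[{"op":"remove","path":"%s"}]\n' "$1"
}

# Example; <manifestwork-name> and <managed-cluster-namespace> are placeholders:
#   oc patch manifestwork <manifestwork-name> -n <managed-cluster-namespace> \
#     --type json -p "$(json_remove_patch /spec/workload/manifests/0/spec/volsync)"
```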
- Ceph in warning state after adding capacity to cluster
  After device replacement or capacity addition, Ceph is observed in the HEALTH_WARN state with a mon reporting slow ops. However, there is no impact on the usability of the cluster.
- OSD pods restart during add capacity
  OSD pods restart after cluster expansion is performed by adding capacity to the cluster. However, no impact on the cluster is observed apart from the pod restarts.
7.2. Multicloud Object Gateway
- NooBaa Core cannot assume role with web identity due to a missing entry in the role's trust policy
  For OpenShift Data Foundation deployments on AWS that use AWS Security Token Service (STS), you need to add another entry in the trust policy for the noobaa-core account. This is because, with the release of OpenShift Data Foundation 4.17, the service account changed from noobaa to noobaa-core.
  For instructions on adding an entry in the trust policy for the noobaa-core account, see the final bullet in the prerequisites section of Updating Red Hat OpenShift Data Foundation 4.16 to 4.17.
- Upgrade to OpenShift Data Foundation 4.17 results in noobaa-db pod in CrashLoopBackOff state
  Upgrading to OpenShift Data Foundation 4.17 from OpenShift Data Foundation 4.15 fails when the PostgreSQL upgrade fails in Multicloud Object Gateway, which always starts with PostgreSQL version 15. If there is a PostgreSQL upgrade failure, the noobaa-db-pg-0 pod fails to start.
  Workaround: Refer to the knowledgebase article Recover NooBaa's PostgreSQL upgrade failure in OpenShift Data Foundation 4.17.
7.3. Ceph
- Poor CephFS performance on stretch clusters
  Workloads with many small metadata operations might exhibit poor performance because of the arbitrary placement of metadata server (MDS) pods on multi-site Data Foundation clusters.
- SELinux relabelling issue with a very high number of files
  When attaching volumes to pods in Red Hat OpenShift Container Platform, the pods sometimes do not start or take an excessive amount of time to start. This behavior is generic and is tied to how SELinux relabelling is handled by the Kubelet. The issue is observed with any filesystem-based volumes that have very high file counts. In OpenShift Data Foundation, the issue is seen when using CephFS-based volumes with a very high number of files.
  Workaround: There are multiple ways to work around this issue. Depending on your business needs, you can choose one of the workarounds from the knowledgebase solution https://access.redhat.com/solutions/6221251.
7.4. CSI Driver
- Automatic flattening of snapshots is not working
  When there is a single common parent RBD PVC and the volume snapshot, restore, and delete snapshot operations are performed in sequence more than 450 times, it is no longer possible to take additional volume snapshots or clones of the common parent RBD PVC.
  Workaround: Instead of performing volume snapshot, restore, and delete operations in sequence, use PVC to PVC cloning to avoid this issue completely.
  If you encounter this issue, contact customer support to perform manual flattening of the final restored PVCs so that you can again take volume snapshots or clones of the common parent PVC.
7.5. OpenShift Data Foundation console
- UI shows "Unauthorized" error and blank screen with loading temporarily during ODF operator installation
  During OpenShift Data Foundation operator installation, the InstallPlan sometimes transiently goes missing, which causes the page to show an unknown status. This does not happen regularly. As a result, the messages and title go missing for a few seconds.
- Warning message in the UI right after creation of StorageCluster
  A popup warning is seen when a StorageSystem or StorageCluster is created from the user interface (UI). This is because the Virtualization StorageClass is not annotated with storageclass.kubevirt.io/is-default-virt-class: "true" by default after the deployment.
  Workaround: After the deployment, annotate the StorageClass from the command-line interface (CLI) as follows:
    $ oc patch storagecluster ocs-storagecluster -n openshift-storage --type json \
        -p '[{"path": "/spec/managedResources/cephBlockPools/defaultVirtualizationStorageClass", "op": "add", "value": true}]'
  Result: The Virtualization StorageClass is annotated and can now be used. However, the popup warning is still seen during the deployment because the UI sets the wrong field in the StorageCluster CR.
- Optimize DRPC creation when multiple workloads are deployed in a single namespace
  When multiple applications refer to the same placement, enabling DR for any one of the applications enables it for all the applications that refer to that placement.
  If the applications are created after the creation of the DRPC, the PVC label selector in the DRPC might not match the labels of the newer applications.
  Workaround: In such cases, disabling DR and enabling it again with the right label selector is recommended.
7.6. OCS operator
- Increasing MDS memory erases CPU values when pods are in CLBO state
  When the metadata server (MDS) memory is increased while the MDS pods are in a crash loop back off (CLBO) state, the CPU request or limit for the MDS pods is removed. As a result, the CPU request or limit that was set for the MDS changes.
  Workaround: Run the oc patch command to adjust the CPU limits. For example:
    $ oc patch -n openshift-storage storagecluster ocs-storagecluster \
        --type merge \
        --patch '{"spec": {"resources": {"mds": {"limits": {"cpu": "3"}, "requests": {"cpu": "3"}}}}}'
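The example above hard-codes a CPU value of 3. A minimal sketch of a helper that builds the same merge patch for any value; the function name is hypothetical, and using one value for both the limit and the request simply mirrors the example.

```shell
#!/bin/sh
# Sketch: build the merge patch that sets the MDS CPU limit and request to the
# same value, mirroring the documented example (which uses "3").
# Usage: mds_cpu_patch <cpu>
mds_cpu_patch() {
    printf '{"spec":{"resources":{"mds":{"limits":{"cpu":"%s"},"requests":{"cpu":"%s"}}}}}\n' "$1" "$1"
}

# Example:
#   oc patch -n openshift-storage storagecluster ocs-storagecluster \
#     --type merge --patch "$(mds_cpu_patch 3)"
```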
- Error while reconciling: Service "ocs-provider-server" is invalid: spec.ports[0].nodePort: Invalid value: 31659: provided port is already allocated
  From OpenShift Data Foundation 4.18, the ocs-operator deploys a service with the port 31659, which might conflict with the nodePort of an existing service. If the port is already in use, the ocs-operator always errors out while deploying the service, which causes the upgrade reconciliation to be stuck.
  Workaround: Change the service type from NodePort to ClusterIP to avoid the collision:
    $ oc patch -n openshift-storage storagecluster ocs-storagecluster --type merge -p '{"spec": {"providerAPIServerServiceType": "ClusterIP"}}'
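To confirm which service already holds the port before changing the service type, the service list can be filtered by nodePort. A minimal sketch: the helper name and the three-column input layout are assumptions of this example, and the jsonpath expression in the usage comment produces that layout.

```shell
#!/bin/sh
# Sketch: read lines of "<namespace> <name> <nodePort>..." on stdin and print
# namespace/name for every service that uses the given nodePort.
# Usage: services_using_nodeport <port>
services_using_nodeport() {
    awk -v port="$1" '{
        for (i = 3; i <= NF; i++) {
            if ($i == port) { print $1 "/" $2; break }
        }
    }'
}

# Example:
#   oc get svc -A -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{" "}{.spec.ports[*].nodePort}{"\n"}{end}' \
#     | services_using_nodeport 31659
```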
- prometheus-operator pod is missing toleration in Red Hat OpenShift Service on AWS (ROSA) with hosted control planes (HCP) deployments
  Due to a known issue during Red Hat OpenShift Data Foundation on ROSA HCP deployment, the toleration needs to be applied manually for prometheus-operator after pod creation.
  Workaround: To apply the toleration, run the following patch command:
    $ oc patch csv odf-prometheus-operator.v4.18.0-rhodf -n odf-storage --type=json \
        -p='[{"op": "add", "path": "/spec/install/spec/deployments/0/spec/template/spec/tolerations", "value": [{"key": "node.ocs.openshift.io/storage", "operator": "Equal", "value": "true", "effect": "NoSchedule"}]}]'
7.7. ODF-CLI
- ODF-CLI tools misidentify stale volumes
  The stale subvolume CLI tool misidentifies valid CephFS persistent volume claims (PVCs) as stale because of an issue in the stale subvolume identification tool. As a result, the stale subvolume identification functionality will not be available until the issue is fixed.