OpenShift Container Storage is now OpenShift Data Foundation starting with version 4.9.
Chapter 7. Known issues
This section describes the known issues in Red Hat OpenShift Data Foundation 4.11.
7.1. Disaster recovery
Creating an application namespace for the managed clusters
The application namespace must exist on RHACM managed clusters for disaster recovery (DR) related pre-deployment actions, so it is pre-created when an application is deployed at the RHACM hub cluster. However, if an application is deleted at the hub cluster and its corresponding namespace is deleted on the managed clusters, the namespace reappears on the managed clusters.
Workaround:
openshift-dr maintains a namespace manifestwork resource in the managed cluster namespace on the RHACM hub. These resources need to be deleted after the application is deleted. For example, as a cluster administrator, execute the following command on the hub cluster: oc delete manifestwork -n <managedCluster namespace> <drPlacementControl name>-<namespace>-ns-mw.
Failover action reports RADOS block device image mount failed on the pod with RPC error still in use
Failing over a disaster recovery (DR) protected workload might result in pods that use the volume on the failover cluster being stuck reporting that the RADOS block device (RBD) image is still in use. This prevents the pods from starting for a long duration (up to several hours).
Failover action reports RADOS block device image mount failed on the pod with RPC error fsck
Failing over a disaster recovery (DR) protected workload might result in pods failing to start with volume mount errors that state the volume has file system consistency check (fsck) errors. This prevents the workload from failing over to the failover cluster.
Relocation fails when failover and relocate are performed within a few minutes of each other
When the user starts relocating an application from one cluster to another before the PeerReady condition status is TRUE, the condition status can be seen in the DRPC YAML file or by running the following oc command:

$ oc get drpc -o yaml -n <application-namespace>

where <application-namespace> is the namespace where the workloads are present for deploying the application. If the relocation is initiated before the peer (target cluster) is in a clean state, the relocation stalls forever.
Workaround: Change the DRPC .Spec.Action back to Failover, and wait until the PeerReady condition status is TRUE. After applying the workaround, change the Action to Relocate, and the relocation takes effect.
User can set the Sync schedule to zero minutes while creating a DR policy, which reports 'Sync' as the replication policy and gets validated on a Regional-DR setup
The DRPolicyList page uses the sync interval value to display the replication type. If it is set to zero, the replication type is considered Sync (synchronous) for metro as well as regional clusters. This confuses users because the backend considers the policy Async even when the user interface shows the scheduling type as Sync.

Workaround: Fetch the Ceph fsid from the DRCluster CR status to decide sync or async.
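As a sketch of that workaround, the decision reduces to comparing the Ceph fsid reported for each cluster: a shared fsid means both DRClusters point at the same Ceph cluster (Sync/Metro-DR), while distinct fsids mean separate clusters (Async/Regional-DR). The helper name and sample fsids below are illustrative, not a documented interface; fetch the real fsid values from the DRCluster CR status on your hub cluster.

```shell
# Illustrative helper: classify the replication type from two Ceph fsids.
# Obtain each fsid from the corresponding DRCluster CR status on the hub
# (the exact status field depends on the DRCluster CRD version in use).
replication_type() {
  if [ "$1" = "$2" ]; then
    echo "Sync"    # same Ceph cluster backing both sites (Metro-DR)
  else
    echo "Async"   # distinct Ceph clusters per site (Regional-DR)
  fi
}

replication_type "aaaa-1111" "aaaa-1111"   # prints: Sync
replication_type "aaaa-1111" "bbbb-2222"   # prints: Async
```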
Deletion of the Application deletes the pods but not PVCs
When deleting an application from the RHACM console, the DRPC does not get deleted. Not deleting the DRPC leads to the VRG, as well as the VR, not being deleted. If the VRG/VR is not deleted, the PVC finalizer list is not cleaned up, causing the PVC to stay in a Terminating state.

Workaround: Manually delete the DRPC on the hub cluster using the following command:
$ oc delete drpc <name> -n <namespace>

Result:
- DRPC deletes the VRG
- VRG deletes VR
- VR removes its finalizer from the PVC’s finalizer list
- VRG removes its finalizer from the PVC’s finalizer list
Both DRPCs protect all the persistent volume claims created in the same namespace
Namespaces that host multiple disaster recovery (DR) protected workloads protect all the persistent volume claims (PVCs) within the namespace for each DRPlacementControl resource in the same namespace on the hub cluster that does not specify and isolate PVCs based on the workload using its spec.pvcSelector field.

This results in PVCs that match the DRPlacementControl spec.pvcSelector across multiple workloads, or all PVCs if the selector is missing, being managed by replication management multiple times, potentially causing data corruption or invalid operations based on individual DRPlacementControl actions.

Workaround: Label the PVCs that belong to a workload uniquely, and use the selected label as the DRPlacementControl spec.pvcSelector to disambiguate which DRPlacementControl protects and manages which subset of PVCs within a namespace. It is not possible to specify the spec.pvcSelector field for the DRPlacementControl using the user interface, so the DRPlacementControl for such applications must be deleted and recreated using the command line.

Result: PVCs are no longer managed by multiple DRPlacementControl resources and do not cause any operation and data inconsistencies.
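A hedged sketch of the workaround follows: a DRPlacementControl whose pvcSelector matches only the label applied to one workload's PVCs. The apiVersion, resource names, and label are placeholders, and other required spec fields are omitted; verify the CRD on your cluster before applying anything like this.

```yaml
# Illustrative only: scope this DRPlacementControl to one workload's PVCs.
apiVersion: ramendr.openshift.io/v1alpha1   # verify against your cluster's CRD
kind: DRPlacementControl
metadata:
  name: app-a-drpc          # hypothetical
  namespace: app-namespace  # hypothetical; the shared workload namespace
spec:
  drPolicyRef:
    name: my-drpolicy       # hypothetical DRPolicy
  pvcSelector:
    matchLabels:
      appname: app-a        # label applied uniquely to this workload's PVCs
  # other required fields (placementRef, preferredCluster, and so on) omitted
```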
RBD mirror scheduling is getting stopped for some images
The Ceph manager daemon gets blocklisted for various reasons, which stops the scheduled RBD mirror snapshot from being triggered on the cluster where the images are primary. All RBD images that are mirror enabled (and hence DR protected) do not list a schedule when examined using rbd mirror snapshot schedule status -p ocs-storagecluster-cephblockpool, and hence are not actively mirrored to the peer site.

Workaround: Restart the Ceph manager deployment on the managed cluster where the images are primary to overcome the blocklist against the currently running instance. This can be done by scaling down and then scaling up the Ceph manager deployment as follows:
$ oc -n openshift-storage scale deployments/rook-ceph-mgr-a --replicas=0
$ oc -n openshift-storage scale deployments/rook-ceph-mgr-a --replicas=1

Result: Images that are DR enabled and denoted as primary on a managed cluster start reporting mirroring schedules when examined using
rbd mirror snapshot schedule status -p ocs-storagecluster-cephblockpool
Ceph does not recognize the global IP assigned by Globalnet
Ceph does not recognize the global IP assigned by Globalnet, so a disaster recovery solution cannot be configured between clusters with overlapping service CIDRs using Globalnet. Due to this, the disaster recovery solution does not work when service CIDRs overlap.
Volume replication group deletion is stuck on a fresh volume replication created during deletion, which is stuck as the persistent volume claim cannot be updated with a finalizer
Due to a bug in the disaster recovery (DR) reconciler, during deletion of the internal VolumeReplicationGroup resource on the managed cluster from which a workload was failed over or relocated, a persistent volume claim (PVC) is attempted to be protected. The resulting cleanup operation does not complete, and reports the PeerReady condition on the DRPlacementControl for the application.

This results in the application that was failed over or relocated not being able to relocate or fail over again, because the DRPlacementControl resource reports its PeerReady condition as false.

Workaround: Before applying the workaround, determine whether the cause is a PVC being protected during VolumeReplicationGroup deletion, as follows:

- Ensure the VolumeReplicationGroup (VRG) resource in the workload namespace on the managed cluster from which it was relocated or failed over has the following values:
  - VRG metadata.deletionTimestamp is non-zero
  - VRG spec.replicationState is Secondary
- List the VolumeReplication resources in the same workload namespace, and ensure the resources have the following values:
  - metadata.generation is set to 1
  - spec.replicationState is set to Secondary
  - The VolumeReplication resource reports no status
- For each VolumeReplication resource in the above state, the corresponding PVC resource (as seen in the VR spec.dataSource field) has metadata.deletionTimestamp as non-zero

To recover, remove the following finalizers:

- volumereplicationgroups.ramendr.openshift.io/vrg-protection from the VRG resource
- volumereplicationgroups.ramendr.openshift.io/pvc-vr-protection from the respective PVC resources

Result: DRPlacementControl at the hub cluster reports the PeerReady condition as true and enables further workload relocation or failover actions. (BZ#2116605)
MongoDB pod is in CrashLoopBackoff because of permission errors reading data in the cephrbd volume

The OpenShift projects across different managed clusters have different security context constraints (SCC), which specifically differ in the specified UID range and/or FSGroups. This leads to certain workload pods and containers failing to start after failover or relocate operations within these projects, due to filesystem access errors in their logs.

Workaround: Ensure workload projects are created on all managed clusters with the same project-level SCC labels, allowing them to use the same filesystem context when failed over or relocated. Pods then no longer fail after DR actions with filesystem-related access errors.
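As a hedged illustration of identical project-level SCC context, the namespace annotations that drive UID range and fsGroup assignment can be aligned across clusters. All names and values below are placeholders; copy the actual values from the workload namespace on the original cluster (for example, via oc get namespace <name> -o yaml) rather than reusing these.

```yaml
# Illustrative only: a workload namespace whose SCC-related annotations
# match the values assigned on the original cluster.
apiVersion: v1
kind: Namespace
metadata:
  name: my-app   # hypothetical workload namespace
  annotations:
    openshift.io/sa.scc.uid-range: "1000680000/10000"          # placeholder
    openshift.io/sa.scc.supplemental-groups: "1000680000/10000" # placeholder
    openshift.io/sa.scc.mcs: "s0:c26,c10"                       # placeholder
```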
During failover to the secondary cluster, some PVCs remain in the primary cluster

Before Kubernetes v1.23, the Kubernetes control plane never cleaned up the PVCs created for StatefulSets; that was left to the cluster administrator or a software operator managing the StatefulSets. Due to this, the PVCs of StatefulSets were left untouched when their pods were deleted. This prevents Ramen from failing back an application to its original cluster.

Workaround: If the workload uses StatefulSets, then do the following before failing back or relocating to another cluster:
- Run oc get drpc -n <namespace> -o wide. If the PeerReady column shows TRUE, you can proceed with the failback or relocation. Otherwise, do the following on the peer cluster:
  - Run oc get pvc -n <namespace>.
  - For each bound PVC for that namespace that belongs to the StatefulSet, run oc delete pvc <pvcname> -n <namespace>. Once all PVCs are deleted, the Volume Replication Group (VRG) transitions to secondary, and then gets deleted.
- Run oc get drpc -n <namespace> -o wide again. After a few seconds to a few minutes, the PeerReady column changes to TRUE. Then you can proceed with the failback or relocation.

Result: The peer cluster gets cleaned up and is ready for the new action. (BZ#2118270)
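The per-PVC deletion step above can be sketched as a loop. This is a dry-run illustration that only prints each command; the namespace and PVC names are hypothetical, and in practice the PVC list would come from oc get pvc -n <namespace>.

```shell
# Dry-run sketch: print the delete command for each StatefulSet PVC in the
# workload namespace on the peer cluster. Drop the echo (and run the real
# oc commands) only after confirming the PVC list is correct.
list_pvc_deletes() {
  ns="$1"
  shift
  for pvc in "$@"; do
    echo "oc delete pvc $pvc -n $ns"
  done
}

# Hypothetical namespace and PVC names, for illustration only:
list_pvc_deletes my-app data-mongo-0 data-mongo-1
```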
Application is stuck in Relocating state during failback
Multicloud Object Gateway allowed multiple persistent volume (PV) objects of the same name or namespace to be added to the S3 store on the same path. Due to this, Ramen does not restore the PV because it detects multiple versions pointing to the same claimRef.

Workaround: Use the S3 CLI or an equivalent tool to clean up the duplicate PV objects from the S3 store. Keep only the one that has a timestamp closer to the failover or relocate time.
Result: The restore operation will proceed to completion and the failover or relocate operation proceeds to the next step.
Application is stuck in a FailingOver state when a zone is down
At the time of a failover or relocation, if none of the S3 stores are reachable, the failover or relocate process hangs. If the OpenShift DR logs indicate that the S3 store is not reachable, troubleshooting and getting the S3 store operational allows OpenShift DR to proceed with the failover or relocate operation.
ceph df reports an invalid MAX AVAIL value when the cluster is in stretch mode

When a CRUSH rule for a Red Hat Ceph Storage cluster has multiple "take" steps, the ceph df report shows the wrong maximum available size for the map. The issue will be fixed in an upcoming release.
7.2. Multicloud Object Gateway
rook-ceph-operator-config ConfigMap is not updated when OpenShift Container Storage is upgraded from version 4.5 to other versions

ocs-operator uses the rook-ceph-operator-config ConfigMap to configure rook-ceph-operator behaviors; however, it only creates it once and then does not reconcile it. This raises the problem that it does not update the default values for the product as they evolve.

Workaround: Administrators can manually change the rook-ceph-operator-config values.
Storage cluster and storage system
ocs-storagecluster is in an error state for a few minutes when installing the storage system

During storage cluster creation, there is a small window of time where it appears in an error state before moving on to a successful or ready state. This is an intermittent state, so it usually resolves by itself and becomes successful or ready.

Workaround: Wait and watch status messages or logs for more information.
7.3. CephFS
Ceph OSD snap trimming is no longer blocked by a running scrub
Previously, OSD snap trimming, once blocked by a running scrub, was not restarted. As a result, no trimming was performed until an OSD reset. This release fixes the handling so that trimming restarts if it was blocked by a scrub, and snap trimming works as expected.
Poor performance of the stretch clusters on CephFS
Workloads with many small metadata operations might exhibit poor performance because of the arbitrary placement of metadata server (MDS) on multi-site OpenShift Data Foundation clusters.
Restoring snapshot fails with size constraint when the parent PVC is expanded
If you create a new restored persistent volume claim (PVC) from a volume snapshot with the same size as the volume snapshot, the restore fails if the parent PVC was resized after taking the volume snapshot and before creating the new restored PVC.
Workaround: You can use any one of the following workarounds:
- Do not resize the parent PVC if you have any volume snapshot created from it and you have a plan to restore the volume snapshot to a new PVC.
- Create a restored PVC of the same size as the parent PVC.
- If the restored PVC is already created and is in the pending state, delete the PVC and recreate it with the same size as the parent PVC.
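For the second and third workarounds, a restored PVC sized to match the parent can be sketched as follows. The names, storage class, and size are illustrative placeholders; set storage to the parent PVC's current (post-expansion) size.

```yaml
# Illustrative only: restore a PVC from a snapshot at the parent's current size.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-pvc        # hypothetical name
spec:
  storageClassName: ocs-storagecluster-cephfs
  dataSource:
    name: my-snapshot       # hypothetical VolumeSnapshot name
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi         # match the parent PVC's expanded size
```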
7.4. OpenShift Data Foundation operator
PodSecurityViolation alert starts to fire when the OpenShift Data Foundation operator is installed
OpenShift introduced Pod Security Admission to enforce security restrictions on pods when they are scheduled. In OpenShift 4.11, Pod Security Admission emits audit and warn events while enforcing the privileged profile (same as 4.10).

As a result, you will see warnings in events because the openshift-storage namespace does not have the required enforcement labels for Pod Security Admission. (BZ#2110628)