Chapter 6. Bug fixes
This section describes the notable bug fixes introduced in Red Hat OpenShift Data Foundation 4.14.
6.1. Disaster recovery
Blocklisting no longer leads to pods stuck in an error state
Previously, blocklisting occurred due to either network issues or a heavily overloaded or imbalanced cluster with huge tail latency spikes. Because of this, pods got stuck in CreateContainerError with the message Error: relabel failed /var/lib/kubelet/pods/cb27938e-f66f-401d-85f0-9eb5cf565ace/volumes/kubernetes.io~csi/pvc-86e7da91-29f9-4418-80a7-4ae7610bb613/mount: lsetxattr /var/lib/kubelet/pods/cb27938e-f66f-401d-85f0-9eb5cf565ace/volumes/kubernetes.io~csi/pvc-86e7da91-29f9-4418-80a7-4ae7610bb613/mount/#ib_16384_0.dblwr: read-only file system.
With this fix, blocklisting no longer leads to pods stuck in an error state.
Ceph now recognizes the global IP assigned by Globalnet
Previously, Ceph did not recognize the global IP assigned by Globalnet, so disaster recovery solutions could not be configured between clusters with overlapping service CIDR using Globalnet. This issue has been fixed, and the disaster recovery solution now works when the service CIDR overlaps.
PeerReady state is no longer set to true when a workload is failed over or relocated to the peer cluster until the cluster from which it was failed over or relocated is cleaned up
Previously, after a disaster recovery (DR) action was initiated, the PeerReady condition was initially set to true for the duration when the workload was failed over or relocated to the peer cluster, and was then set to false until the cluster from which it was failed over or relocated was cleaned up for future actions. A user looking at the DRPlacementControl status conditions for future actions might have read this intermediate PeerReady state as meaning that the peer was ready for action and initiated one. This would result in the operation pending or failing and may have required user intervention to recover.
With this fix, the PeerReady state is no longer set to true on failed over or relocated workloads until the cluster is cleaned up, so there is no longer any confusion for the user.
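The following is a minimal, illustrative sketch of how the PeerReady condition might appear under the DRPlacementControl status conditions. The resource name, namespace, and comment are assumptions for orientation only, not values taken from the product:

    apiVersion: ramendr.openshift.io/v1alpha1
    kind: DRPlacementControl
    metadata:
      name: busybox-drpc            # hypothetical name
      namespace: busybox-sample     # hypothetical namespace
    status:
      conditions:
      - type: PeerReady
        status: "False"             # stays False until the previous cluster is cleaned up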
The application no longer stays in Cleaningup state when the ACM hub is recovered after a disaster
Previously, when the ACM hub was lost during a disaster and was recovered using the backups, VRG ManifestWork and DRPC status were not restored. This caused the application to stay in Cleaningup state.
With this fix, Ramen now ensures that VRG ManifestWork is part of the ACM backup and rebuilds the DRPC status if it is empty after a hub recovery, and the application successfully migrates to the failover cluster.
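For background, ACM hub backups include hub resources that carry the ACM backup label. The sketch below shows a hypothetical ManifestWork carrying such a label; the names and label value are illustrative, and whether Ramen uses exactly this mechanism is an assumption:

    apiVersion: work.open-cluster-management.io/v1
    kind: ManifestWork
    metadata:
      name: appname-vrg-mw                                  # hypothetical name
      namespace: managed-cluster-1                          # hypothetical managed cluster namespace
      labels:
        cluster.open-cluster-management.io/backup: resource # label used by ACM to include hub resources in backups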
STS based applications can now be relocated as expected
Previously, relocating STS based applications would fail due to an underlying bug. This bug has been fixed, and relocating STS based applications now works as expected.
Ramen reconciles as expected after hub restore
Previously, while working with an active/passive hub Metro-DR setup, a rare scenario could occur where the Ramen reconciler stopped running for a workload after exceeding its allowed rate-limiting parameters. Because reconciliation is specific to each workload, only that workload was impacted. In such an event, all disaster recovery orchestration activities related to that workload stopped until the Ramen pod was restarted.
This issue has been fixed, and Ramen reconciles as expected after hub restore.
Managed resources are no longer deleted during hub recovery
Previously, during hub recovery, OpenShift Data Foundation encountered a known issue with Red Hat Advanced Cluster Management version 2.7.4 (or higher) where certain managed resources associated with the subscription-based workload might have been unintentionally deleted.
This issue has been fixed, and no managed resources are deleted during hub recovery.
6.1.1. DR upgrade
This section describes bug fixes related to upgrading Red Hat OpenShift Data Foundation from version 4.13 to 4.14 in a disaster recovery environment.
Failover or relocate is no longer blocked for workloads that existed prior to upgrade
Previously, a failover or a relocate was blocked for workloads that existed prior to the upgrade. This was because the OpenShift Data Foundation Disaster Recovery solution protects persistent volume claim (PVC) data in addition to the persistent volume (PV) data, and the workload did not have its PVC data backed up.
With this fix, failover or relocate is no longer blocked for workloads that existed prior to upgrade.
DRPC no longer has incorrect values cached
Previously, when OpenShift Data Foundation was upgraded, the disaster recovery placement control (DRPC) may have had an incorrect value cached in status.preferredDecision.ClusterNamespace. This issue has been fixed, and the incorrect value is no longer cached.
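For orientation, this cached value sits under the DRPC status. A minimal sketch is shown below, assuming camelCase serialized field names and hypothetical cluster values:

    status:
      preferredDecision:
        clusterName: primary-cluster        # hypothetical cluster name
        clusterNamespace: primary-cluster   # field that could hold a stale cached value before the fix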
6.2. Multicloud Object Gateway
Virtual-host style is now available on NooBaa buckets
Previously, Virtual-host style did not work on NooBaa buckets because the NooBaa endpoints and core were not aware of the port of the DNS configuration.
With this update, the NooBaa operator passes the port of the DNS to the core and endpoints, making Virtual-host style available.
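For context, virtual-host style addressing puts the bucket name into the DNS host name instead of the request path. With a hypothetical endpoint s3.example.com listening on port 8443 and a bucket named my-bucket, the two request styles look like this:

    Path style:         https://s3.example.com:8443/my-bucket/my-object
    Virtual-host style: https://my-bucket.s3.example.com:8443/my-object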
Dummy credentials are no longer printed to the logs
Previously, dummy credentials were printed to the logs which could lead to confusion. This bug has been fixed, and the credentials are no longer printed.
NooBaa now falls back to using a backing store of type pv-pool when credentials are not provided within the time limit
Previously, the Cloud Credential Operator could fail to provide a secret after the cloud credential request was created, for example, when the Cloud Credential Operator was set to manual mode before NooBaa was installed and the additional required actions were not taken. The provided secret includes the credentials needed to create the target bucket for the default backing store, so without it the default backing store was not created and NooBaa was stuck in the Configuring phase.
With this fix, if the cloud credential request is sent and the secret is not received within the defined time limit (10 minutes), NooBaa falls back to using a backing store of type pv-pool. As a result, the system reaches the Ready status and the default backing store is of type pv-pool.
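A minimal, illustrative sketch of a backing store of type pv-pool is shown below; the name, volume count, and size are hypothetical and the exact fields may vary by version:

    apiVersion: noobaa.io/v1alpha1
    kind: BackingStore
    metadata:
      name: noobaa-default-backing-store   # illustrative name
      namespace: openshift-storage
    spec:
      type: pv-pool
      pvPool:
        numVolumes: 1                      # hypothetical sizing
        resources:
          requests:
            storage: 50Gi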
PostgreSQL database password is no longer displayed in clear text in core and endpoint logs
Previously, the internal PostgreSQL client in noobaa-core printed a connection parameters object to the log, and this object contained the password used to connect to the PostgreSQL database.
With this fix, the password information is omitted from the connection object that is printed to the log, and the log messages contain only the nonsensitive connection details.
6.3. Ceph container storage interface (CSI)
CSI CephFS and RBD holder pods no longer use the old cephcsi image after upgrade
Previously, after an upgrade, the CSI CephFS and RBD holder pods were not updated because they were still using the old cephcsi image.
With this fix, the daemonset objects for the CSI CephFS and RBD holders are also upgraded, and the CSI holder pods use the latest cephcsi image.
More reliable and controlled resynchronization process
Previously, the resync command was not triggered effectively, leading to sync issues and an inability to disable image mirroring. This was because CephCSI depended on the image mirror state to issue resync commands, which was unreliable due to unpredictable changes in that state.
With this fix, when a volume is being demoted, CephCSI saves the timestamp of the image creation. When the resync command is issued, CephCSI compares the saved timestamp with the current creation timestamp, and the resync proceeds only if the timestamps match. CephCSI also examines the state of the images and the last snapshot timestamps to determine whether a resync is required or an error message needs to be displayed. This results in a more reliable and controlled resynchronization process.
6.4. OpenShift Data Foundation operator
There is no longer unnecessary network latency caused by S3 clients being unable to talk to RGW in the same zone
Previously, when the Ceph object store was used, requests were transferred to another zone, which caused unnecessary network latency because the S3 clients were unable to talk to the RGW in the same zone.
With this fix, the annotation "service.kubernetes.io/topology-mode" is added to the RGW service so that requests are routed to the RGW server in the same zone. As a result, pods are routed to the nearest RGW service.
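For reference, the annotation mentioned above is the standard Kubernetes topology aware routing annotation. A sketch of how it might appear on the RGW service follows; the service name is illustrative:

    apiVersion: v1
    kind: Service
    metadata:
      name: rook-ceph-rgw-ocs-storagecluster-cephobjectstore   # illustrative RGW service name
      namespace: openshift-storage
      annotations:
        service.kubernetes.io/topology-mode: Auto              # routes traffic to same-zone endpoints when possible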
6.5. OpenShift Data Foundation console
Volume type dropdown is removed from the user interface
Previously, for internal OpenShift Data Foundation installations, the user interface showed HDD, SSD, or both in the Volume type dropdown for existing clusters, even though internal installations should have assumed the disks to be SSD.
With this fix, the Volume type dropdown is removed from the user interface and the volume type is always assumed to be SSD.
OpenShift Data Foundation Topology rook-ceph-operator deployment now shows correct resources
Previously, the owner references for CSI pods and other pods were set to rook-ceph-operator, which caused the mapping to show these pods as part of the deployment as well.
With this fix, the pod mapping approach is changed to top down instead of bottom up, which ensures that only the pods related to the deployment are shown.
CSS properties are set to dynamically adjust the height of the resource list to changes in window size
Previously, the sidebar resource list of the topology view did not adjust its length based on the window size because the CSS properties were not applied properly to the sidebar.
With this fix, the CSS properties are set to dynamically adjust the height of the resource list to the changes in window size both in full screen and normal screen mode.
Add capacity operation no longer fails when moving from LSO to default storage classes
Previously, the add capacity operation used to fail when moving from LSO to default storage classes because the persistent volumes (PVs) for expansion were not created correctly.
With this fix, the add capacity operation using a non-LSO storage class is not allowed when a storage cluster is initially created using an LSO-based storage class.
Resource utilization of OpenShift Data Foundation topology now matches the metrics
Previously, the resource utilization shown in the OpenShift Data Foundation topology did not match the metrics because the metrics queries used in the sidebar resource lists for nodes and deployments were different.
With this fix, the metric queries are made the same, and as a result the values are the same in both places.
Topology view for external mode is now disabled
Previously, the topology view showed a blank screen for external mode because external mode is not supported in the topology view.
With this fix, external mode is disabled and a message appears instead of the blank screen.
Topology no longer shows rook-ceph-operator on every node
Previously, the topology view showed the Rook-Ceph operator deployment on all the nodes because the Rook-Ceph operator deployment is an owner of multiple pods that are not actually related to it.
With this fix, the deployment-to-node mapping mechanism in the topology view is changed, and as a result the Rook-Ceph operator deployment is shown on only one node.
The console user interface no longer shows SDN instead of OVN
Previously, the console user interface showed SDN instead of OVN even though OpenShift Container Platform had moved from SDN to OVN.
With this fix, the text has been changed from SDN to OVN, and as a result the text for managing the network shows OVN.
Resource names must follow the rule, "starts and ends with a lowercase letter or number", or regex returns an error
Previously, due to invalid regex validation of the input name for object bucket claims (OBC), backing stores, block pools, namespace stores, and bucket classes, the rule "starts and ends with a lowercase letter or number" could be violated when symbols or a capital letter were entered at the beginning of the name.
With this release, the issue is fixed, and if the resource name does not follow the rule "starts and ends with a lowercase letter or number", the regex validation returns an error.
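For illustration, a pattern that enforces this rule (lowercase alphanumeric start and end, with hyphens allowed in between) is the standard Kubernetes name pattern shown below; whether the console uses exactly this expression is an assumption:

    ^[a-z0-9]([-a-z0-9]*[a-z0-9])?$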
6.6. Rook
ODF monitoring is no longer missing any metric values
Previously, a port was missing from the service monitor of ceph-exporter. This meant that Ceph daemon performance metrics were missing.
With this fix, the port has been added to the ceph-exporter service monitor, and Ceph daemon performance metrics are visible in Prometheus.
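As a sketch, the service monitor for ceph-exporter needs an endpoint entry that names the metrics port. The example below is illustrative only; the monitor name, port name, and selector label are assumptions:

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: rook-ceph-exporter          # assumed monitor name
      namespace: openshift-storage
    spec:
      endpoints:
      - port: http-metrics              # assumed metrics port name; a missing port entry here caused the metrics gap
        interval: 30s
      selector:
        matchLabels:
          app: rook-ceph-exporter       # assumed selector label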
OSD pods no longer continue flapping if there is a network issue
Previously, if OSD pods started flapping because of a network issue, they would continue flapping. This would adversely impact the system.
With this fix, flapping OSD pods are marked as down after a certain amount of time, and no longer impact the system.
MDS are no longer unnecessarily restarted
Previously, MDS pods were unnecessarily restarted because the liveness probe restarted the MDS without checking ceph fs dump.
With this fix, the liveness probe monitors the MDS in ceph fs dump and restarts the MDS only if the MDS is missing from the dump output.