Chapter 6. Bug fixes
This section describes the notable bug fixes introduced in Red Hat OpenShift Data Foundation 4.16.
6.1. Disaster recovery
Failover of applications hung in the FailingOver state
Previously, applications were not DR protected successfully because of errors while protecting the required resources to the provided S3 stores. As a result, failing over such applications left them in the FailingOver state.
With this fix, a metric and a related alert are added to the application DR protection health so that protection issues can be rectified after applications are DR protected. As a result, only applications that are successfully protected are failed over.
Post hub recovery, applications that were in the FailedOver state consistently report FailingOver
Previously, the Ramen hub operator on a recovered hub cluster reported the status of a managed cluster, which survived the loss of both the hub and its peer managed cluster, as Ready for future failover actions, without verifying that the surviving cluster actually reported that status.
With this fix, the Ramen hub operator ensures that the target cluster is ready for a failover operation before initiating the action. As a result, any failover initiated either succeeds or, if stale resources still exist on the failover target cluster, the operator stalls the failover until the stale resources are cleaned up.
6.2. Multicloud Object Gateway
Multicloud Object Gateway (MCG) DB PVC consumption more than 400GB
Previously, the Multicloud Object Gateway (MCG) database (DB) showed an unnecessarily increased size because activity logs were being saved to the DB.
With this fix, the object activity logs are converted to regular debug logs. As a result, the NooBaa DB no longer shows an increased size.
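To observe the effect on the database volume, check the NooBaa DB PVC usage. The PVC and pod names below are the usual defaults in the openshift-storage namespace; the container name and mount path are assumptions and may differ in your deployment:
$ oc -n openshift-storage get pvc db-noobaa-db-pg-0
$ oc -n openshift-storage exec noobaa-db-pg-0 -c db -- df -h /var/lib/pgsql
# the second command assumes the DB container is named "db" and mounts its data at /var/lib/pgsql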
Log-based replication works even after removing the replication policy from the OBC
Previously, it was not possible to remove a log-based replication policy from object bucket claims (OBCs) because the replication policy evaluation resulted in an error when presented with an empty string.
With this fix, the replication policy evaluation method is modified so that the replication policy can be removed from OBCs.
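For example, the policy can now be cleared by emptying the replication field on the OBC. The field path (spec.additionalConfig.replicationPolicy), namespace, and OBC name below are assumptions, so verify them against your MCG version:
$ oc -n my-namespace patch obc my-bucket-claim --type merge \
    -p '{"spec":{"additionalConfig":{"replicationPolicy":""}}}'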
Multicloud Object Gateway (MCG) component security context fix
Previously, when the default security context constraint (SCC) for the Multicloud Object Gateway (MCG) pods was updated to avoid the defaults set by OpenShift, the security scanning process failed.
With this fix, when the SCC is updated to override the defaults, MCG's behavior does not change, so the security scan passes.
NooBaa operator logs expose AWS secret
Previously, the NooBaa operator logs exposed the AWS secret as plain text, which posed a potential risk that anyone with access to the logs could access the buckets.
With this fix, the noobaa-operator logs no longer expose the AWS secret.
AWS S3 list takes a long time
Previously, AWS S3 took a long time to list objects because two database queries were used instead of a single one.
With this fix, the queries are restructured into a single query, which reduces the number of database calls and the time needed to complete the list objects operation.
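A quick way to observe the improvement is to time a list operation against the MCG S3 endpoint. The bucket name and endpoint URL below are placeholders:
$ time aws s3api list-objects-v2 --bucket my-bucket \
    --endpoint-url https://s3-openshift-storage.apps.example.com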
After upgrading to OpenShift Data Foundation, the standalone MCG backing store gets rejected
Previously, when a persistent volume (PV) pool was used, xattr was used to save object metadata. However, updates to that metadata failed when the filesystem on the PV did not support xattr.
With this fix, there is a fallback when the filesystem does not support xattr, and the metadata is saved in a file instead.
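To check whether the filesystem backing a PV pool supports extended attributes, you can try setting and reading one from inside a pod that mounts the volume. The mount path below is a placeholder:
$ touch /mnt/pv-pool/xattr-test
$ setfattr -n user.noobaa.test -v ok /mnt/pv-pool/xattr-test
$ getfattr -n user.noobaa.test /mnt/pv-pool/xattr-test
# if setfattr fails with "Operation not supported", the filesystem lacks xattr support
# and MCG now falls back to saving the metadata in a file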
Multicloud Object Gateway database persistent volume claim (PVC) consumption rising continuously
Previously, deleting an object bucket claim (OBC), which deletes all of its objects, took a long time to free space in the database. This was because of the limited work done by MCG's database cleaner, which caused slow and limited deletion of entries from the database.
With this fix, the DB Cleaner configuration for MCG can be updated. The DB Cleaner is a process that removes old deleted entries from the MCG database. The exposed configurations are the frequency of runs and the age of entries to be deleted.
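A minimal sketch of tuning these settings, assuming they are exposed as CONFIG_JS_* overrides in the noobaa-config ConfigMap; the key names and millisecond values below are assumptions, so confirm the exact configuration keys in the MCG documentation for your release:
$ oc -n openshift-storage patch configmap noobaa-config --type merge \
    -p '{"data":{"CONFIG_JS_DB_CLEANER_CYCLE":"43200000","CONFIG_JS_DB_CLEANER_BACK_TIME":"2592000000"}}'
# hypothetical keys: run the cleaner every 12 hours and delete entries older than 30 days;
# a restart of the NooBaa core pods may be needed for the change to take effect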
Multicloud Object Gateway bucket lifecycle policy does not delete all objects
Previously, expired objects were deleted at a very low rate.
With this fix, the batch size and the number of runs per day for deleting expired objects are increased.
(BZ#2279964) (BZ#2283753)
HEAD request returns HTTP 200 for the prefix path instead of 404 from the API
Previously, when trying to read or head an object that is a directory on a NamespaceStore Filesystem bucket of Multicloud Object Gateway, the request returned HTTP 200 for the prefix path instead of 404 if the trailing / character was missing.
With this fix, ENOENT is returned when the object is a directory but the key is missing the trailing /.
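For example, the behavior can be verified with the AWS CLI against a NamespaceStore Filesystem bucket. The bucket, key, and endpoint below are placeholders:
$ aws s3api head-object --bucket fs-bucket --key demo-dir \
    --endpoint-url https://s3-openshift-storage.apps.example.com
# now fails with 404 (Not Found) because the key is missing the trailing slash
$ aws s3api head-object --bucket fs-bucket --key demo-dir/ \
    --endpoint-url https://s3-openshift-storage.apps.example.com
# addresses the directory object itself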
Multicloud Object Gateway Backingstore In Phase: "Connecting" with "Invalid URL"
Previously, the operator failed to get the system information in the reconciliation loop, which prevented successful completion of the reconciliation. This was due to a bug in the URL parsing that caused the parsing to fail when the address was IPv6.
With this fix, an IPv6 address is handled as the URL host. As a result, the operator successfully completes the system reconciliation.
6.3. Ceph container storage interface (CSI) driver
PVC cloning failed with an error "RBD image not found"
Previously, restoring a volume snapshot failed when the parent of the snapshot did not exist because a bug in the CephCSI driver caused an RBD image in trash to be falsely identified as existing.
With this fix, the CephCSI driver correctly identifies images in trash. As a result, the volume snapshot is restored successfully even when the parent of the snapshot does not exist.
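Images that were moved to trash can be listed from the toolbox pod to confirm how they are reported. The pool name below is the default ODF block pool and may differ in your cluster:
$ oc -n openshift-storage rsh deploy/rook-ceph-tools rbd trash ls --pool ocs-storagecluster-cephblockpool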
Incorrect warning logs from fuserecovery.go even when FUSE mount is not used
Previously, warning logs from the FUSE recovery functions in fuserecovery.go were logged even when the kernel mounter was chosen, which was misleading.
With this fix, the FUSE recovery functions are called only when the FUSE mounter is chosen. As a result, logs from fuserecovery.go are not logged when the kernel mounter is chosen.
6.4. OCS Operator
StorageClasses are not created if the RGW endpoint is not reachable
Previously, RADOS Block Device (RBD) and CephFS storage classes were not created if the RADOS gateway (RGW) endpoint was not reachable, because their creation depended on the RGW storage class creation.
With this fix, storage class creation is made independent. As a result, the RBD and CephFS storage classes no longer depend on the RGW storage class creation.
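After the fix, the RBD and CephFS storage classes are created even if the RGW one is missing. The names below are the defaults created by the operator; the grep only narrows the output:
$ oc get storageclass | grep ocs-storagecluster
# expect ocs-storagecluster-ceph-rbd and ocs-storagecluster-cephfs to be listed even when
# ocs-storagecluster-ceph-rgw is absent because the RGW endpoint is unreachable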
6.5. OpenShift Data Foundation console
Status card reflects the status of standalone MCG deployment
Previously, Multicloud Object Gateway (MCG) standalone mode did not show any health status in the OpenShift cluster Overview dashboard, and an unknown icon was seen for Storage.
With this fix, MCG health metrics are pushed when the cluster is deployed in standalone mode. As a result, the storage health is shown in the cluster Overview dashboard.
Create StorageSystem wizard overlaps Project dropdown
Previously, the Project dropdown at the top of the Create StorageSystem page was not used in any scenario and caused confusion.
With this fix, the Project dropdown is removed. As a result, the StorageSystem creation namespace is shown in the header of the page.
Capacity and Utilization cards do not include custom storage classes
Previously, the Requested capacity and Utilization cards displayed data only for the default storage classes created by the OCS operator as part of the storage system creation. The cards did not include any custom storage classes that were created later. This was due to the refactoring of the Prometheus queries to support multiple storage clusters.
With this fix, the queries are updated, and the cards now report capacity for both the default and custom-created storage classes.
6.6. Rook
Rook-Ceph operator deployment fails when storage class device sets are deployed with duplicate names
Previously, when StorageClassDeviceSets were added to the StorageCluster CR with duplicate names, the OSDs failed, leaving Rook confused about the OSD configuration.
With this fix, if duplicate device set names are found in the CR, Rook refuses to reconcile the OSDs until the duplication is fixed. An error about failing to reconcile the OSDs is logged in the Rook operator log.
Rook-ceph-mon pods listen on both ports 3300 and 6789
Previously, when a cluster was deployed with MSGRv2, the mon pods were unnecessarily listening on port 6789 for MSGR1 traffic.
With this fix, the mon daemons start with flags that suppress listening on the v1 port 6789 and listen exclusively on the v2 port 3300, thereby reducing the attack surface.
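Whether the monitors now advertise only the v2 port can be checked from the toolbox pod or by inspecting the mon Services. The resource names and label below are the usual Rook defaults:
$ oc -n openshift-storage rsh deploy/rook-ceph-tools ceph mon dump
# each mon should list only a v2 address on port 3300
$ oc -n openshift-storage get svc -l app=rook-ceph-mon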
Legacy LVM-based OSDs are in crashloop state
Previously, starting from OpenShift Data Foundation 4.14, legacy OSDs crashed in the init container that resized the OSD. This affected legacy OSDs that were created in OpenShift Container Storage 4.3 and upgraded through later versions.
With this fix, the crashing resize init container is removed from the OSD pod spec. As a result, the legacy OSDs start; however, it is recommended to replace the legacy OSDs soon.
(BZ#2273398) (BZ#2274757)
6.7. Ceph monitoring
Quota alerts overlapping
Previously, redundant alerts were fired when the object bucket claim (OBC) quota limit was reached. This was because when the OBC quota reached 100%, both the ObcQuotaObjectsAlert (fired when the OBC object quota crosses 80% of its limit) and ObcQuotaObjectsExhausedAlert (fired when the quota reaches 100%) alerts were fired.
With this fix, the alert queries were changed to ensure that only one alert is triggered at a time to indicate the issue. As a result, when the quota crosses 80%, ObcQuotaObjectsAlert is triggered, and when the quota is at 100%, ObcQuotaObjectsExhausedAlert is triggered.
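The intent of the change, sketched as PromQL expressions; the metric names below are illustrative placeholders, not the actual ODF rule definitions:
# warning range only: quota used is at least 80% but below 100%
(obc_objects_used / obc_objects_quota >= 0.80) and (obc_objects_used / obc_objects_quota < 1)
# exhausted only: quota used is at 100% or above
obc_objects_used / obc_objects_quota >= 1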
PrometheusRule evaluation failing for pool-quota rule
Previously, none of the Ceph pool quota alerts were displayed because, in a multi-cluster setup, the PrometheusRuleFailures alert was fired due to the pool-quota rules. The queries in the pool-quota section could not distinguish which cluster an alert was fired from in a multi-cluster setup.
With this fix, a managedBy label was added to all the queries in pool-quota to generate unique results from each cluster. As a result, the PrometheusRuleFailures alert is no longer seen and all the alerts in pool-quota work as expected.
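An illustrative sketch of how a per-cluster label keeps the results unambiguous; the metric name, label value, and cluster name below are placeholders:
# before: identical series reported by two clusters collide during rule evaluation
ceph_pool_quota_metric
# after: the managedBy label keeps each cluster's series unique
ceph_pool_quota_metric{managedBy="ocs-storagecluster"}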
Wrong help text shown in runbooks for some alerts
Previously, wrong help text was shown in the runbooks for some alerts because the runbook markdown files for those alerts contained incorrect text.
With this fix, the text in the runbook markdown files is corrected so that the alerts show the correct help text.
PrometheusRuleFailures alert after installation or upgrade
Previously, Ceph quorum related alerts were not seen because the Prometheus failure alert, PrometheusRuleFailures, was fired. This alert is usually fired when queries produce ambiguous results. In a multi-cluster scenario, the queries in the quorum-alert rules gave indistinguishable results because they could not identify from which cluster the quorum alerts were fired.
With this fix, a unique managedBy label was added to each query in the quorum rules so that the query results contain the name of the cluster from which the result was received. As a result, the Prometheus failure alert is not fired and the clusters are able to trigger all the Ceph mon quorum related alerts.
Low default interval duration for two ServiceMonitors, rook-ceph-exporter and rook-ceph-mgr
Previously, the exporter data collected by Prometheus added load to the system because the scrape interval for the rook-ceph-exporter and rook-ceph-mgr ServiceMonitors was only 5 seconds.
With this fix, the interval is increased to 30 seconds to balance the Prometheus scraping, thereby reducing the system load.
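The effective interval can be confirmed on the two ServiceMonitors; the openshift-storage namespace is the usual default:
$ oc -n openshift-storage get servicemonitor rook-ceph-exporter \
    -o jsonpath='{.spec.endpoints[*].interval}{"\n"}'
$ oc -n openshift-storage get servicemonitor rook-ceph-mgr \
    -o jsonpath='{.spec.endpoints[*].interval}{"\n"}'
# both should now report 30s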
Alert when there are LVM backed legacy OSDs during upgrade
Previously, when OpenShift Data Foundation with legacy OSDs was upgraded from version 4.12 to 4.14, all the OSDs became stuck in a crash loop and went down. This led to potential data unavailability and service disruption.
With this fix, a check is included to detect legacy OSDs backed by the logical volume manager (LVM) and to alert if such OSDs are present during the upgrade process. As a result, a warning is displayed during upgrade to indicate the presence of legacy OSDs so that appropriate actions can be taken.