Chapter 6. Bug fixes
This section describes the notable bug fixes introduced in Red Hat OpenShift Data Foundation 4.15.
6.1. Disaster recovery
Fencing takes more time than expected
Previously, fencing operations took longer than expected because the Ramen hub controller reconciled several times and requeued with a delay, as extra checks were added to ensure that the fencing operation was complete on the managed cluster.
With this fix, the hub controller registers for updates to the fencing state. As a result, changes in the fencing status are received immediately and the fencing operation finishes faster.
6.2. Multicloud Object Gateway
Multicloud Object Gateway failing to use the new internal certificate after rotation
Previously, the Multicloud Object Gateway (MCG) client was not able to connect to S3 using the new certificate unless the MCG endpoint pods were restarted. Although the MCG endpoint pods loaded the certificate for the S3 service at pod startup, changes to the certificate were not watched, so rotating a certificate did not take effect on the endpoint until the pods were restarted.
With this fix, a watch is added to detect changes in the certificate of the endpoint pods. As a result, the pods load the new certificate without the need for a restart.
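The effect of the fix can be illustrated with a minimal sketch (the actual MCG endpoint is not implemented in Python; the class name and the polling approach here are illustrative assumptions): a loader that rechecks the certificate file and reloads it when the file changes, so the serving process picks up a rotated certificate without a restart.

```python
import os


class CertReloader:
    """Illustrative sketch only, not the MCG implementation: reload a
    certificate file whenever its modification time changes."""

    def __init__(self, cert_path):
        self.cert_path = cert_path
        self._mtime = None
        self.cert_pem = None
        self.reload_if_changed()

    def reload_if_changed(self):
        """Return True and re-read the certificate if the file changed."""
        mtime = os.stat(self.cert_path).st_mtime_ns
        if mtime != self._mtime:
            with open(self.cert_path, "rb") as f:
                self.cert_pem = f.read()
            self._mtime = mtime
            return True
        return False
```

A production implementation would typically use a filesystem watch (for example, inotify) rather than polling, and would rebuild the TLS serving context from the reloaded certificate.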
Regenerating S3 credentials for OBC in all namespaces
Previously, the Multicloud Object Gateway obc regenerate command did not have the app-namespace flag, although this flag is available for the other object bucket claim (OBC) operations, such as OBC creation and deletion. With this fix, the app-namespace flag is added to the obc regenerate command. As a result, S3 credentials can be regenerated for OBCs in any namespace.
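For example, with this fix the credentials of an OBC in an application namespace can be regenerated directly (the command shape is a sketch; the OBC name and namespace are placeholders, and the exact flags can be confirmed in the MCG CLI help):

```
noobaa obc regenerate <obc-name> --app-namespace <application-namespace>
```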
Signature validation failure
Previously, in Multicloud Object Gateway, signature verification failed for some operations because the AWS C++ software development kit (SDK) does not encode the "=" sign in signature calculations when it appears as part of a key name.
With this fix, MCG’s decoding of the path in the HTTP request is corrected so that the signature is verified successfully.
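The underlying encoding rule can be demonstrated with Python’s standard library (a simplified sketch, not MCG’s actual code): in an AWS Signature Version 4 canonical URI, characters in the object key other than unreserved characters must be percent-encoded, so an "=" in a key name becomes %3D. If the signer leaves "=" unencoded while the verifier encodes it, or the reverse, the canonical requests differ and the signatures do not match.

```python
from urllib.parse import quote, unquote

# An object key that contains "=" in its name.
key = "report=2024.csv"

# A correct SigV4 signer percent-encodes "=" in the canonical URI path.
canonical = quote(key, safe="/")
print(canonical)  # report%3D2024.csv

# A signer that skips this encoding produces a different canonical path,
# so the two sides compute different signatures and verification fails.
assert canonical != key
assert unquote(canonical) == key
```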
6.3. Ceph
Metadata server runs out of memory and reports over-sized cache
Previously, the metadata server (MDS) would run out of memory because the standby-replay MDS daemons did not trim their caches.
With this fix, the MDS trims its cache when in standby-replay mode. As a result, the MDS no longer runs out of memory.
Ceph is inaccessible after crash or shutdown tests are run
Previously, in a stretch cluster, when a monitor was revived and was in the probing stage for other monitors to receive the latest information, such as the MonitorMap or OSDMap, it was unable to enter stretch_mode. This prevented it from correctly setting the elector’s disallowed_leaders list, which led to the monitors getting stuck in election, and Ceph eventually became unresponsive.
With this fix, the marked-down monitors are unconditionally added to the disallowed_leaders list. This fixes the problem of newly revived monitors having a different disallowed_leaders set and getting stuck in an election.
6.4. Ceph container storage interface (CSI)
Snapshot persistent volume claim in pending state
Previously, creation of a ReadOnlyMany (ROX) CephFS persistent volume claim (PVC) from a snapshot source failed when a pool parameter was present in the storage class, due to a bug.
With this fix, the check for the pool parameter is removed because it is not required. As a result, creation of a ROX CephFS PVC from a snapshot source succeeds.
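With the fix in place, a ROX clone from a snapshot can be requested with a PVC like the following (the resource names and the storage class are illustrative placeholders):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rox-from-snapshot            # placeholder name
spec:
  accessModes:
    - ReadOnlyMany
  storageClassName: ocs-storagecluster-cephfs   # example CephFS storage class
  resources:
    requests:
      storage: 10Gi
  dataSource:
    name: my-cephfs-snapshot         # placeholder VolumeSnapshot name
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
```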
6.5. OpenShift Data Foundation console
Incorrect tooltip message for the raw capacity card
Previously, the tooltip for the raw capacity card in the block pool page showed an incorrect message. With this fix, the tooltip content for the raw capacity card has been changed to display an appropriate message, "Raw capacity shows the total physical capacity from all the storage pools in the StorageSystem".
System raw capacity card not showing external mode StorageSystem
Previously, the System raw capacity card did not display Ceph external StorageSystem as the Multicloud Object Gateway (MCG) standalone and Ceph external StorageSystems were filtered out from the card.
With this fix, only the StorageSystems that do not report the total capacity, according to the odf_system_raw_capacity_total_bytes metric, are filtered out. As a result, any StorageSystem that reports its total raw capacity, including external mode StorageSystems, is displayed on the System raw capacity card.
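The metric that the card relies on can be inspected directly in the OpenShift console under Observe → Metrics; for example, querying it by name (query shown for illustration) lists the reported raw capacity per StorageSystem, and any system absent from the result is the one filtered out of the card:

```
odf_system_raw_capacity_total_bytes
```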
6.6. Rook
Provisioning object bucket claim with the same bucket name
Previously, for the greenfield use case, creation of two object bucket claims (OBCs) with the same bucket name succeeded from the user interface. Even though two OBCs were created, the second one pointed to invalid credentials.
With this fix, creation of a second OBC with the same bucket name is blocked, and it is no longer possible to create two OBCs with the same bucket name for greenfield use cases.
Change of the parameter name for the Python script used in external mode deployment
Previously, while deploying OpenShift Data Foundation using Ceph storage in external mode, the Python script used to extract Ceph cluster details had a parameter named --cluster-name, which could be misunderstood as the name of the Ceph cluster. However, it represented the name of the OpenShift cluster that the Ceph administrator provided.
With this fix, the --cluster-name flag is renamed to --k8s-cluster-name. The legacy --cluster-name flag is still supported to cater to upgraded clusters used in automation.
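For example, an external mode extraction run with the renamed flag looks like the following (the script name matches the exporter script shipped with OpenShift Data Foundation, but the other flags and all values are illustrative placeholders; run the script with --help for the full list):

```
python3 ceph-external-cluster-details-exporter.py \
  --rbd-data-pool-name <rbd-pool-name> \
  --k8s-cluster-name <openshift-cluster-name>
```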
Incorrect pod placement configurations while detecting Multus Network Attachment Definition CIDRs
Previously, some OpenShift Data Foundation clusters failed because the network "canary" pods were scheduled on nodes without Multus cluster networks, as OpenShift Data Foundation did not process pod placement configurations correctly while detecting Multus Network Attachment Definition CIDRs.
With this fix, OpenShift Data Foundation correctly processes pod placement for the Multus network "canary" pods. As a result, network "canary" scheduling errors no longer occur.
Deployment strategy to avoid rook-ceph-exporter pod restart
Previously, the rook-ceph-exporter pod restarted multiple times on a freshly installed HCI cluster, which resulted in the exporter pod crashing and the Ceph health showing the WARN status. This was because restarting the exporter using RollingRelease caused a race condition that crashed the exporter.
With this fix, the deployment strategy is changed to Recreate. As a result, the exporter pods no longer crash and the Ceph health WARN status no longer appears.
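In a Kubernetes Deployment, this corresponds to the .spec.strategy field; with Recreate, the old pod is stopped before the new one starts, so two exporter instances never run concurrently (fragment shown for illustration):

```yaml
spec:
  strategy:
    type: Recreate   # the old exporter pod is terminated before the new one starts
```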
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a pod stuck in CrashLoopBackOff state
Previously, the rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a pod was stuck in the CrashLoopBackOff state because the RADOS Gateway (RGW) multisite zonegroup was not created and fetched, and the error handling reported the wrong text.
With this release, the error handling bug in the multisite configuration is fixed, and fetching the zonegroup is improved by fetching it for the particular rgw-realm that was created earlier. As a result, the multisite configuration succeeds and the rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a pod is created successfully.
6.7. Ceph monitoring
TargetDown alert reported for ocs-metrics-exporter
Previously, the metrics endpoint of ocs-metrics-exporter became unresponsive because the persistent volume resync by ocs-metrics-exporter was blocked indefinitely.
With this fix, the blocking operations in the persistent volume resync of ocs-metrics-exporter are removed and the metrics endpoint remains responsive. As a result, the TargetDown alert for ocs-metrics-exporter no longer appears.
Label references of object bucket claim alerts
Previously, the label for the object bucket claim alerts was not displayed correctly because the format of the label-template was wrong. Also, a blank object bucket claim name was displayed and the description text was incomplete.
With this fix, the format is corrected. As a result, the description text is correct and complete, with the appropriate object bucket claim name.
Discrepancy in storage metrics
Previously, the capacity of a pool was reported incorrectly because a wrong metrics query was used in the Raw Capacity card of the Block Pool dashboard.
With this fix, the metrics query in the user interface is updated. As a result, the total capacity of a block pool is reported correctly.
Add managedBy label to rook-ceph-exporter metrics and alerts
Previously, the metrics generated by rook-ceph-exporter did not have the managedBy label, so the OpenShift console user interface could not identify which StorageSystem the metrics were generated from.
With this fix, the managedBy label, which has the name of the StorageSystem as its value, is added through the OCS operator to the storage cluster’s Monitoring spec. The Rook operator reads this spec and relabels the ceph-exporter’s ServiceMonitor endpoint labels. As a result, all the metrics generated by this exporter have the new managedBy label.
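Once the label is present, exporter metrics can be attributed to a StorageSystem in any PromQL query; for example, the following selector (the label value here is illustrative) returns every series that carries the label for a given StorageSystem:

```
{managedBy="ocs-storagecluster"}
```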
6.8. Must gather
Must gather logs not collected after upgrade
Previously, the must-gather tool failed to collect logs after an upgrade because Collection started <time> was seen twice.
With this fix, the must-gather tool is updated to run the pre-install script only once. As a result, the tool collects logs successfully after an upgrade.
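After the fix, log collection uses the standard command (the image path follows the usual OpenShift Data Foundation naming convention but should be treated as an example; use the image documented for your exact version):

```
oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.15
```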