Chapter 5. Bug fixes
This section describes the notable bug fixes introduced in Red Hat OpenShift Data Foundation 4.17.
5.1. Disaster recovery
Failover of applications no longer hangs in FailingOver state
Previously, applications were not successfully DR protected because of errors while protecting the required resources to the provided S3 stores. As a result, failing over such applications left them in the FailingOver state.
With this fix, a metric and a related alert are added to the application DR protection health, which alert the user to rectify protection issues after DR protects the applications. As a result, applications that are successfully protected are failed over.
Post hub recovery, applications which were in FailedOver state consistently report FailingOver
Previously, after recovering a DR setup from a hub and a ManagedCluster loss to a passive hub, applications that were in the FailedOver state to the lost ManagedCluster consistently reported a FailingOver status. Failing over such applications to the surviving cluster was allowed, but the required checks to ensure that the failover could be initiated were missing on the surviving cluster. With this fix, the Ramen hub operator verifies that the target cluster is ready for a failover operation before initiating the action. As a result, any failover initiated either succeeds or, if stale resources still exist on the failover target cluster, the operator stalls the failover until the stale resources are cleaned up.
Post hub recovery, subscription app pods now come up after Failover
Previously, post hub recovery, the subscription application pods did not come up after failover from the primary to the secondary managed clusters. This caused an RBAC error in the AppSub subscription resource on the managed cluster due to a timing issue in the backup and restore scenario.
This issue has been fixed, and subscription app pods now come up after failover from primary to secondary managed clusters.
Application namespaces are no longer left behind in managed clusters after deleting the application
Previously, if an application was deleted on the RHACM hub cluster and its corresponding namespace was deleted on the managed clusters, the namespace reappeared on the managed cluster.
With this fix, once the corresponding namespace is deleted, it no longer reappears on the managed cluster.
odf-client-info config map is now created
Previously, the controller inside MCO was not properly filtering the ManagedClusterView resource. This led to the key config map odf-client-info not being created. With this update, the filtering mechanism has been fixed, and the odf-client-info config map is created as expected.
5.2. Multicloud Object Gateway
Ability to change log level of backingstore pods
Previously, there was no way to change the log level of backingstore pods. With this update, changing the NOOBAA_LOG_LEVEL in the config map will now change the debug level of the pv-pools backingstore pods accordingly.
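As a sketch of this update, the log level could be raised by editing the config map. The config map name, namespace, and the value shown below are assumptions that may differ per deployment; verify them in your cluster first.

```shell
# Hypothetical example: patch the NooBaa config map to change the log level.
# Config map name, namespace, and the "all" value are assumptions; check
# your cluster and NooBaa version for the accepted values.
oc patch configmap noobaa-config -n openshift-storage \
  --type merge -p '{"data":{"NOOBAA_LOG_LEVEL":"all"}}'

# The pv-pool backingstore pods then apply the new debug level; deleting
# them forces an immediate restart with the new setting if needed.
```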
STS token expiration now works as expected
Previously, incorrect calculation and display of STS token expiration times caused STS tokens to remain valid long past their expiration time. Users would also see the wrong expiration time when trying to assume a role.
With this update, the STS code was revamped to fix these problems, and support was added for the CLI flag --duration-seconds. STS token expiration now works as expected and is displayed to the user correctly.
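For illustration, assuming a role against the MCG STS endpoint with an explicit token lifetime might look like the following. The endpoint URL, role ARN, and session name are placeholders, not values from this release.

```shell
# Hypothetical example: request a 15-minute STS token from the MCG STS
# endpoint. Endpoint, role ARN, and session name are placeholders.
aws sts assume-role \
  --endpoint-url https://<sts-endpoint> \
  --role-arn arn:aws:sts::<account>:role/<role-name> \
  --role-session-name demo-session \
  --duration-seconds 900
# With the fix, the Credentials.Expiration field in the response reflects
# the requested lifetime, and the token stops working once it elapses.
```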
Block deletion of OBC via regular S3 flow
S3 buckets can be created both via object bucket claim (OBC) and directly via the S3 operation. When a bucket is created with an OBC and deleted via S3, it leaves the OBC entity dangling and the state is inconsistent. With this update, deleting an OBC via regular S3 flow is blocked, avoiding an inconsistent state.
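For context, a bucket created through an OBC should also be deleted through the claim rather than through S3. A minimal sketch follows; the claim name, namespace, and storage class name are assumptions that may differ per cluster.

```shell
# Hypothetical OBC manifest; names and storage class are assumptions.
cat <<'EOF' | oc apply -f -
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: example-obc
  namespace: my-app
spec:
  generateBucketName: example-bucket
  storageClassName: openshift-storage.noobaa.io
EOF

# Delete the bucket by deleting the claim, not via an S3 DeleteBucket call,
# so the OBC entity and the bucket stay consistent:
oc delete obc example-obc -n my-app
```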
NooBaa Backingstore no longer stuck in Connecting post upgrade
Previously, the NooBaa backingstore blocked the upgrade as it remained in the Connecting phase, leaving the storagecluster.yaml in the Progressing phase. This issue has been fixed, and the upgrade progresses as expected.
NooBaa DB cleanup no longer fails
Previously, NooBaa DB cleanup would stop after DB_CLEANER_BACK_TIME elapsed from the start time of the noobaa-core pod, causing NooBaa DB PVC consumption to rise. This issue has been fixed, and NooBaa DB cleanup works as expected.
MCG standalone upgrade works as expected
Previously, a bug caused NooBaa pods to have incorrect affinity settings, leaving them stuck in the pending state.
This fix ensures that any previously incorrect affinity settings on the NooBaa pods are cleared. Affinity is now only applied when the proper conditions are met, preventing the issue from recurring after the upgrade.
After upgrading to the fixed version, the pending NooBaa pods are not restarted automatically. To finalize the upgrade, manually delete the old pending pods. The new pods then start with the correct affinity settings, allowing them to run successfully.
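The manual cleanup step could be sketched as follows. The namespace is assumed to be openshift-storage and the label selector is an assumption; confirm both against your cluster before deleting anything.

```shell
# List NooBaa pods stuck in Pending (namespace is an assumption).
oc get pods -n openshift-storage --field-selector=status.phase=Pending

# Delete the pending pods so replacements start with the corrected affinity.
# The app=noobaa label selector is an assumption; adjust it to match the
# pods shown by the command above.
oc delete pods -n openshift-storage -l app=noobaa \
  --field-selector=status.phase=Pending
```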
5.3. Ceph
New restored or cloned CephFS PVC creation no longer slows down due to parallel clone limit
Previously, upon reaching the limit of parallel CephFS clones, the rest of the clones would queue up, slowing down the cloning.
With this enhancement, upon reaching the parallel clone limit, new clone creation requests are rejected rather than queued. The default parallel clone creation limit is 4.
To increase the limit, contact customer support.
5.4. OpenShift Data Foundation console
Pods created in openshift-storage by end users no longer cause errors
Previously, when a pod was created in the openshift-storage namespace by an end user, the console topology page would break. This was because pods without any ownerReferences were not considered part of the design. With this fix, pods without owner references are filtered out, and only pods with correct ownerReferences are shown. This allows the topology page to work correctly even when pods are arbitrarily added to the openshift-storage namespace.
Applying an object bucket claim (OBC) no longer causes an error
Previously, when attaching an OBC to a deployment using the OpenShift Web Console, the error Address form errors to proceed was shown even when there were no errors in the form. With this fix, the form validations have been changed, and there is no longer an error.
Automatic mounting of service account tokens disabled to increase security
By default, OpenShift automatically mounts a service account token into every pod, regardless of whether the pod needs to interact with the OpenShift API. This behavior can expose the pod’s service account token to unintended use. If a pod is compromised, the attacker could gain access to this token, leading to possible privilege escalation within the cluster.
If the default service account token is unnecessarily mounted, and the pod becomes compromised, the attacker can use the service account credentials to interact with the OpenShift API. This access could lead to serious security breaches, such as unauthorized actions within the cluster, exposure of sensitive information, or privilege escalation across the cluster.
To mitigate this vulnerability, the automatic mounting of service account tokens is disabled unless explicitly needed by the application running in the pod. In the case of the ODF console pod, the fix involved disabling the automatic mounting of the default service account token by setting automountServiceAccountToken: false in the pod or service account definition. With this fix, pods no longer automatically mount the service account token unless it is explicitly needed. This reduces the risk of privilege escalation or misuse of the service account if a pod is compromised.
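The setting can be illustrated as follows; the service account, pod, and image names are placeholders, not the actual ODF console resources.

```shell
# Option 1: disable token mounting on a service account
# (name and namespace are placeholders).
oc patch serviceaccount my-sa -n my-namespace \
  --type merge -p '{"automountServiceAccountToken": false}'

# Option 2: set it per pod in the pod spec (all names are placeholders):
cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  automountServiceAccountToken: false
  containers:
  - name: app
    image: registry.example.com/app:latest
EOF
```

A pod-level setting overrides the service account setting, so either location is sufficient depending on how broadly the token should be withheld.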
Provider mode clusters no longer have the option to connect to external RHCS cluster
Previously, during provider mode deployment there was the option to deploy external RHCS. This resulted in an unsupported deployment.
With this fix, connecting to an external RHCS cluster is blocked so that users do not end up with an unsupported deployment.
5.5. Rook
Rook.io Operator no longer gets stuck when removing a mon from quorum
Previously, mon quorum could be lost when removing a mon due to a race condition: the removal could proceed even when there were not enough remaining mons to maintain quorum.
This issue has been fixed, and the Rook.io Operator no longer gets stuck when removing a mon from quorum.
Network Fence for non-graceful node shutdown taint no longer blocks volume mount on surviving zone
Previously, Rook was creating NetworkFence CR with an incorrect IP address when a node was tainted as out-of-service. Fencing the wrong IP address was blocking the application pods from moving to another node when a taint was added.
With this fix, auto NetworkFence has been disabled in Rook when the out-of-service taint is added on the node, and application pods are no longer blocked from moving to another node.
5.6. Ceph monitoring
Invalid KMIP configurations now treated as errors
Previously, Thales Enterprise Key Management (KMIP) was not added in the recognized KMS services. This meant that whenever an invalid KMIP configuration was provided, it was not treated as an error.
With this fix, the Thales KMIP service has been added as a valid KMS service. This enables KMS services to propagate KMIP configuration statuses correctly, so any misconfigurations are treated as errors.
5.7. CSI Driver
Pods no longer get stuck during upgrade
Previously, if there was a node with an empty label, the PVC mount would fail during upgrade.
With this fix, nodes labeled with an empty value are not considered for the crush_location mount, so they no longer block PVC mounting.