Chapter 4. Bug fixes
This section describes notable bug fixes introduced in Red Hat OpenShift Container Storage 4.7.
MGR pod restarts even if the MONs are down
Previously, when the nodes restarted, the MGR pod could get stuck in the pod initialization state, which made it impossible to create new persistent volumes (PVs). With this update, the MGR pod restarts even if the MONs are down.
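To verify the state of the MGR pod after a node restart, you can list it by its label. The commands below assume the default openshift-storage namespace and the app=rook-ceph-mgr label that Rook usually applies:

    oc -n openshift-storage get pods -l app=rook-ceph-mgr
    oc -n openshift-storage describe pod -l app=rook-ceph-mgr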
Multicloud Object Gateway is now available when hugepages are enabled on OpenShift Container Platform
Previously, the Multicloud Object Gateway (MCG) db pod crashed because Postgres failed to run on Kubernetes when hugepages were enabled. With the current update, hugepages are disabled for the MCG Postgres pods, so the MCG db pods no longer crash.
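To confirm whether hugepages are configured on a node and to check the state of the MCG db pod, commands such as the following can be used; the node name is a placeholder, and the exact noobaa-db pod name can differ between versions:

    oc describe node <node-name> | grep -i hugepages
    oc -n openshift-storage get pods | grep noobaa-db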
PodDisruptionBudget alert no longer continuously shown
Previously, the PodDisruptionBudget alert, which is an OpenShift Container Platform alert, was continuously shown for object storage devices (OSDs). The underlying issue has been fixed, and the alert no longer shows.
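The PodDisruptionBudgets that Rook maintains for the OSDs can be reviewed directly; this assumes the default openshift-storage namespace:

    oc -n openshift-storage get poddisruptionbudget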
must-gather log collection no longer fails
Earlier, the copy pod did not re-flush the data at regular intervals, causing the must-gather command to fail after the default 10 minute timeout. With this update, the copy pod keeps collecting the data generated by the must-gather command at regular intervals, and the must-gather command now runs to completion.
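For reference, OpenShift Container Storage data is collected by pointing oc adm must-gather at the OCS must-gather image; the image path shown below is the one typically used with version 4.7 and should be verified against your environment:

    oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.7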
PVC can now be created from a volume snapshot in the absence of volumesnapshotclass
Previously, a PVC could not be created from a volume snapshot in the absence of the volumesnapshotclass, because the status of the volume snapshot changed to a not ready state when the volumesnapshotclass was deleted. This issue has been fixed in OCP 4.7.0 and higher.
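As a reference, a PVC restored from a volume snapshot uses the snapshot as its dataSource. The following is a minimal illustrative manifest; the PVC and snapshot names are placeholders, and ocs-storagecluster-ceph-rbd is the RBD storage class that OpenShift Container Storage usually creates by default:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: restored-pvc                       # placeholder name
    spec:
      storageClassName: ocs-storagecluster-ceph-rbd
      dataSource:
        name: my-snapshot                      # an existing VolumeSnapshot (placeholder)
        kind: VolumeSnapshot
        apiGroup: snapshot.storage.k8s.io
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi                        # must be at least the size of the snapshot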
Core dumps are now propagated when a process crashes
Previously, core dumps were not propagated if a process crashed. With this release, a log-collector, a sidecar that runs next to the main Ceph daemon, has been introduced. The ShareProcessNamespace flag is enabled on this sidecar, which allows signals to be intercepted between the containers so that core dumps can be generated.
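The underlying Kubernetes mechanism is the shareProcessNamespace pod setting, which lets containers in the same pod see and signal each other's processes. The pod below is a minimal illustration of the setting, not the actual Rook-generated specification:

    apiVersion: v1
    kind: Pod
    metadata:
      name: shared-pid-example                 # illustrative pod, not part of OpenShift Container Storage
    spec:
      shareProcessNamespace: true              # all containers in the pod share one PID namespace
      containers:
      - name: main
        image: registry.access.redhat.com/ubi8/ubi-minimal
        command: ["sleep", "infinity"]
      - name: sidecar                          # can observe and signal processes of the "main" container
        image: registry.access.redhat.com/ubi8/ubi-minimal
        command: ["sleep", "infinity"]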
Multiple OSD removal job no longer fails
Previously, when the job for multiple OSD removal was triggered, the template included the comma-separated OSD IDs in the job name, which caused the job template to fail. With this update, the OSD IDs have been removed from the job name to maintain a valid format, and the job name has been changed from ocs-osd-removal-${FAILED_OSD_IDS} to ocs-osd-removal-job.
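For reference, the removal job is typically created by processing the ocs-osd-removal template and passing the failed OSD IDs; the IDs shown below (0,1) are placeholders:

    oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0,1 | oc create -f -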
Increased mon failover timeout
Previously, the mons would begin to fail over while they were still coming up. With this update, the mon failover timeout has been increased to 15 minutes on IBM Cloud.
Rook now refuses to deploy an OSD and reports an error on detecting unclean disks from a previous OpenShift Container Storage installation
Previously, if a disk that had not been cleaned from a previous installation of OpenShift Container Storage was reused, Rook failed abruptly. With this update, Rook detects that the disk belongs to a different cluster and rejects OSD deployment on that disk with an error message. (BZ#1922954)
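If a disk from a previous installation is to be reused, it must be cleaned first. A typical cleanup, run on the node that owns the device, looks like the following; replace /dev/sdX with the actual device and note that these commands are destructive:

    sgdisk --zap-all /dev/sdX
    wipefs --all /dev/sdX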
mon failover no longer makes Ceph inaccessible
Previously, if a mon went down while another mon was failing over, the mons could lose quorum, and when the mons lose quorum, Ceph becomes inaccessible. This update prevents voluntary mon drains while a mon is failing over so that Ceph does not become inaccessible.
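Mon quorum can be checked from the Ceph toolbox when it is enabled in the cluster; a minimal check, assuming the rook-ceph-tools deployment exists in the openshift-storage namespace, is:

    oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph status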
cephcsi node plugin pods no longer occupy ports for GRPC metrics
Previously, the cephcsi pods exposed GRPC metrics for debugging purposes, and the cephcsi node plugin pods therefore used port 9090 for RBD and port 9091 for CephFS. As a result, the cephcsi pods failed to come up when those ports were unavailable. With this release, GRPC metrics are disabled by default because they are only required for debugging, and cephcsi no longer uses ports 9090 and 9091 on the nodes where the node plugin pods are running.
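To confirm that the ports are no longer held on a node, you can check for listeners using the usual node debug pattern for Red Hat Enterprise Linux CoreOS; the node name below is a placeholder:

    oc debug node/<node-name> -- chroot /host ss -tlnp | grep -E ':9090|:9091'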
rook-ceph-mds did not register the pod IP on monitor servers
Earlier, rook-ceph-mds did not register the pod IP on the monitor servers, so every mount on the filesystem timed out, PVCs could not be provisioned, and CephFS volume provisioning failed. With this release, the argument --public-addr=podIP is added to the MDS pod when the host network is not enabled, so CephFS volume provisioning no longer fails.
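The added argument can be verified by inspecting the MDS pod specification; this assumes the default openshift-storage namespace and the app=rook-ceph-mds label usually applied by Rook:

    oc -n openshift-storage get pods -l app=rook-ceph-mds -o yaml | grep public-addr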
Errors in must-gather due to failed rule evaluation
Earlier, the recording rule cluster:ceph_disk_latency:join_ceph_node_disk_irate1m was not evaluated because many-to-many matching is not allowed in Prometheus. As a result, the failed rule evaluation caused errors in the must-gather output and in the deployment. With this release, the query for the recording rule has been updated to eliminate the many-to-many match scenarios, so the Prometheus rule evaluation no longer fails and no errors are seen in the deployment.
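The updated recording rule can be inspected in the PrometheusRule resources shipped in the openshift-storage namespace; the grep pattern below simply locates the rule by name:

    oc -n openshift-storage get prometheusrules -o yaml | grep -A 5 ceph_disk_latency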