1.5. Bug fixes
apiserver-auth
- Previously, `oc login` was performing an HTTP request to decide which CA bundle to use to connect to the remote login server. This generated a `remote error: tls: bad certificate` error in the OAuth server logs upon every login attempt, even though the login would succeed. The server certificate chain is now retrieved from an insecure TLS handshake, so the correct CA bundle is chosen and the OAuth server no longer logs bad certificate errors on login attempts. (BZ#1819688)
- Previously, the incomplete security context of the OAuth server pods might cause the pods to crashloop when they picked up a custom security context constraint (SCC) that reverted the default behavior. The security context of the OAuth server pods was modified, and a custom SCC no longer prevents the OAuth server pods from running. (BZ#1824800)
- Previously, the Cluster Authentication Operator always disabled challenge authentication flows for any OIDC identity provider, which meant that logging in with `oc login` was not successful. Now, when an OIDC identity provider is configured, the Cluster Authentication Operator checks whether it allows the Resource Owner Password Credentials grant and allows challenge-based login if it does. You can now log in using `oc login` for OIDC identity providers that allow the Resource Owner Password Credentials authorization grant; see the example configuration after this list. (BZ#1727983)
- Previously, the Cluster Authentication Operator did not properly close connections to the OAuth server, causing the rate of traffic to the OAuth server to grow as connections were being opened faster than they were being dropped. The connections are now properly closed, and the Cluster Authentication Operator does not degrade the service of its own payload. (BZ#1826341)
- Previously, the `oauth-proxy` container exited with an error if there was an error reaching the `kube-apiserver` during configuration. This caused multiple container restarts if the `kube-apiserver` and controllers were not stable or fast enough. Now, multiple attempts to perform checks against the `kube-apiserver` are allowed when the `oauth-proxy` container starts, so that it only fails when the underlying infrastructure is truly broken. (BZ#1779388)
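For reference, the following is a minimal sketch of an OpenID Connect identity provider entry in the cluster OAuth configuration. The provider name, client ID, secret name, and issuer URL are placeholders, not values taken from these release notes. If the configured provider supports the Resource Owner Password Credentials grant, the Cluster Authentication Operator now permits challenge-based `oc login` against it.

```yaml
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: example-oidc                 # placeholder identity provider name
    mappingMethod: claim
    type: OpenID
    openID:
      clientID: example-client-id      # placeholder client registered with the IdP
      clientSecret:
        name: example-oidc-secret      # placeholder Secret in the openshift-config namespace
      issuer: https://idp.example.com  # placeholder issuer URL
      claims:
        preferredUsername:
        - preferred_username
```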
Bare Metal Hardware Provisioning
- Because the UEFI boot process was using the `ipxe.efi` binary on IPv4 networks, the boot process reported that no network devices were found. As a result, Preboot eXecution Environment (PXE) booting of the machines failed with a "No network devices" error. The `dnsmasq.conf` file has been updated to use the `snponly.efi` binary for IPv4 networks. Machines booting with PXE now use the UEFI network drivers and are able to deploy because they have network connectivity. (BZ#1830161)
- If a cluster had networking issues during installation, for example a slow image download, the installation could fail. To address this problem, the PXE boot now includes retries, and the maximum number of networking retries has been increased for communication between the bare metal provisioner and the nodes being provisioned. The installer now handles slow network conditions. (BZ#1822763)
Build
- Before starting a build, the OpenShift Container Platform builder would parse the supplied `Dockerfile` and reconstruct a modified version of it to use for the build. This process included adding labels and handling substitutions of the images named in `FROM` instructions. The generated `Dockerfile` did not always correctly reconstruct `ENV` and `LABEL` instructions; sometimes the generated `Dockerfile` would include `=` characters, although the original `Dockerfile` did not include them. This caused the build to fail with a syntax error. When generating the modified `Dockerfile`, the original text of `ENV` and `LABEL` instructions is now used verbatim, fixing this issue. (BZ#1821858)
- Previously, the last few lines of error logs were not attached to a build if a failure occurred in a build pod init container. Consequently, build errors in init containers, such as malformed Git URLs, were hard to diagnose. The build controller has been updated so that error logs are attached to a build when failures occur in init containers. Build failures are now easier to diagnose. (BZ#1809862)
- Previously, build failures caused by failed image imports or invalid `Dockerfiles` were only categorized as generic build errors. Non-default build logging levels were required to diagnose such issues. New failure reasons have now been introduced for failed image imports and invalid `Dockerfiles`. Build failures relating to failed image imports or invalid `Dockerfiles` can now be identified within the build object status. (BZ#1809861)
- Previously, build label generation and validation did not include complete Kubernetes validation routines. Builds with certain valid build configuration names would fail due to an invalid build label value being created. The build controller and build API server now use complete Kubernetes validation routines to ensure that added build labels meet label criteria. Builds with any valid build configuration name now result in a valid build label value being created. (BZ#1777337)
- Previously, Buildah interpreted variables in `Dockerfiles` literally, rather than parsing the value contained within a variable. Consequently, builds would fail when `Dockerfiles` contained variables. Buildah has been updated to expand `Dockerfile` variables. Buildah now parses `Dockerfile` environment variable values when building container images. (BZ#1810174)
- With the `RunOnceDuration` admission plug-in being disabled in OpenShift 4, an `activeDeadlineSeconds` value was not automatically applied to build pods. Pods with `activeDeadlineSeconds` set to nil are matched to resource quotas that include the `NotTerminating` scope, so build pods failed to start because of quota limitations in namespaces that had resource quotas with the `NotTerminating` scope defined. The build controller now applies a suitable default `activeDeadlineSeconds` value to build pods. Build pods are now handled properly in namespaces that have resource quotas with the `NotTerminating` scope; see the quota sketch after this list. (BZ#1829447)
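To illustrate the quota interaction described above, here is a hedged sketch of a resource quota scoped to `NotTerminating` objects; the name, namespace, and limit are placeholders. Pods that do not set `activeDeadlineSeconds` match this scope, which is why the build controller now applies a default value to build pods.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: example-not-terminating-quota   # placeholder name
  namespace: example-project            # placeholder namespace
spec:
  hard:
    pods: "10"                          # placeholder limit
  scopes:
  - NotTerminating                      # matches only pods that do not set activeDeadlineSeconds
```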
Cloud compute
- The cluster autoscaler expects provider IDs across node and machine objects to be an exact match. Previously, if a machine configuration included a resource group name that had a mix of upper and lower case characters, the cluster autoscaler would terminate the machine after fifteen minutes, given that a match was not found. Resource group names are now sanitized so that all characters are set to lowercase. Now, matching provider IDs are correctly identified even when resource group names are entered using a mix of upper and lower case characters. (BZ#1837341)
- Previously, the `metadata` field within machine and machine set specifications was not validated when machine sets were created or updated. Invalid metadata caused unmarshalling errors, leaving controllers unable to process the objects. The `metadata` field is now validated when machine sets are created or updated, and invalid entries return an error. Invalid metadata is now identified before machine sets are created, so subsequent object processing errors are prevented. (BZ#1702089)
- Occasionally during scale-down operations, the last machine in a machine set contains deletion annotations. That machine is not removed by the autoscaler if the minimum machine set size is reached before its deletion. Previously, the last machine's deletion annotations were not removed after a scale down. A fix has been introduced that changes the way machine annotations are unmarked after a scale down. Now, the annotations no longer persist on the last machine in the machine set. (BZ#1820410)
- Previously, the AWS Identity and Access Management (IAM) role assigned to worker nodes did not have sufficient permissions to access the AWS Key Management Service (KMS) key to decrypt the Amazon Elastic Block Store (EBS) volume on mount. Subsequently, Amazon Elastic Compute Cloud (EC2) instances would be accepted, but they would fail to start because they could not read from their root drive. The required permissions have now been granted for EC2 instances to be able to decrypt KMS encrypted EBS volumes with Customer Managed Keys. When using a Customer Managed Key for encrypting EBS volumes, instances now have the required permissions to start successfully. (BZ#1815219)
- The `replicas` field in a machine set specification can be set to nil. Previously, if the autoscaler could not determine the number of replicas within a machine set, autoscaling operations were prevented. Now, if the `replicas` field is not set, the autoscaler makes a scaling decision based on the last number of observed replicas according to the machine set. Autoscaling operations can now proceed even if the `replicas` field in a machine set specification is set to nil, assuming that the machine set controller has recently synchronized the number of replicas to `MachineSet.Status.Replicas`; see the autoscaler sketch after this list. (BZ#1820654)
- Previously, the autoscaler would reduce the size of a node group by one on every call to `DeleteNodes`, even if an existing node deletion had not yet completed. This resulted in a cluster having less than the minimum required node count. Now, if a node's machine already has a deletion timestamp, the size of the node group is not reduced further. This prevents the autoscaler from reducing the node count to less than the required capacity when it calls `DeleteNodes`. (BZ#1804738)
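For context on the autoscaler fixes above, this is an illustrative `MachineAutoscaler` that targets a machine set; the names and replica bounds are placeholders. The autoscaler scales the referenced machine set between the configured bounds, and, per the note above, it can now do so even when the machine set's `replicas` field is unset, assuming the machine set controller has recently synchronized the observed replica count.

```yaml
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: example-worker-autoscaler            # placeholder name
  namespace: openshift-machine-api
spec:
  minReplicas: 1                             # placeholder lower bound
  maxReplicas: 6                             # placeholder upper bound
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: example-cluster-worker-us-east-1a  # placeholder machine set name
```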
Cloud Credential Operator
- Cloud Credential Operator (CCO) could crash loop when the original cluster was installed with OpenShift Container Platform 4.1, because CCO was unable to reconcile the permissions requests found in the `CredentialsRequest` objects. This bug fix updates CCO to no longer assume that parts of the `Infrastructure` fields are available. As a result, CCO can work with clusters that were originally installed with OpenShift Container Platform 4.1; see the illustrative `CredentialsRequest` after this list. (BZ#1813343)
- Cloud Credential Operator (CCO) no longer bypasses Security Context Constraints (SCCs). Previously, CCO would run with excess permissions that are not required for CCO to perform its tasks. With this enhancement, there is no unnecessary bypassing of SCCs for CCO. (BZ#1806892)
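For reference, the following is an illustrative `CredentialsRequest` of the kind that CCO reconciles; the component name, secret reference, and AWS permission entries are placeholders rather than the objects involved in these bugs.

```yaml
apiVersion: cloudcredential.openshift.io/v1
kind: CredentialsRequest
metadata:
  name: example-component                   # placeholder request name
  namespace: openshift-cloud-credential-operator
spec:
  secretRef:
    name: example-component-credentials     # placeholder Secret to be created by CCO
    namespace: example-namespace            # placeholder target namespace
  providerSpec:
    apiVersion: cloudcredential.openshift.io/v1
    kind: AWSProviderSpec
    statementEntries:
    - effect: Allow
      action:
      - s3:GetObject                        # placeholder permission
      resource: "*"
```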
Cluster Version Operator
- The Cluster Version Operator (CVO) had a race condition where it would consider a timed-out update reconciliation cycle a successful update. This only happened for restricted network clusters where the Operator timed out attempting to fetch release image signatures. This bug caused the CVO to enter its shuffled-manifest reconciliation mode, which could break the cluster if the manifests were applied in an order that the components could not handle. The CVO now treats timed-out updates as failures, so it no longer enters reconciling mode before the update succeeds. (BZ#1843526)
- Failures to roll out deployments during updates were logged only in the CVO logs, and only a general error message was reported to the `ClusterVersion` object. This general error message made it difficult for users and teams to debug the failure without looking at the CVO logs. This bug fix updates the CVO to expose the underlying roll-out errors to the `ClusterVersion` object. As a result, debugging deployment roll-outs during upgrades is now easier. (BZ#1768260)
Console Kubevirt plugin
- With this release, if a VM is configured to use a disk with an invalid or unrecommended bus type, the Disks tab on the created VM view displays a disk interface warning. (BZ#1803780)
- Previously, all `DataVolume` objects were categorized as VM disk imports. This incorrect categorization caused the Activity card to disappear for `DataVolume` objects that did not have an owner reference to a VM. With this release, only `DataVolume` objects with an owner reference to a VM are categorized as VM disk imports, and the Activity card does not disappear for `DataVolume` objects that do not have an owner reference to a VM. (BZ#1815138)
- Previously, `DataVolume` objects and their associated persistent volume claims (PVCs) were not deleted when the VM disk was removed. These objects were only deleted when the VM was deleted, and there was no option to preserve a `DataVolume` object during VM deletion. With this release, the user can choose to preserve or delete `DataVolume` objects and PVCs when deleting a VM disk or VM. This does not apply to disks that are deleted using the CD-ROM modal. (BZ#1820192)
- Previously, the number of disks in the inventory did not match the number of disks in the disk list. The inventory view is now updated to show CD-ROMs and disks separately. (BZ#1803677)
- Previously, it was not possible to create a VM with the default YAML used by the VM wizard because the default YAML VM template did not contain values required by the VM wizard. With this release, the default YAML VM template contains all required values. (BZ#1793962)
- The web console previously reported that failed VM migrations had succeeded. When migrating a VM, the web console now correctly reports when a VM migration fails. (BZ#1806974)
- Previously, the VM wizard did not generate the `cloud-init` configuration in the correct format, and as a result it was not applied to the VM. With this release, the format generated by the wizard has been corrected, and the `cloud-init` configuration that is provided in the VM wizard is applied to the VM. (BZ#1821024)
- Previously, VM template sockets were not reflected in the final VM created by the VM wizard, which caused the number of vCPUs to be doubled after the VM was created. With this release, the VM template sockets, cores, and threads are reflected when creating a VM, and the resulting number of vCPUs is correct. (BZ#1810372)
- A change in the URL for the VM templates list caused the user to be redirected to the wrong page after deleting a VM template. The URL has been fixed in this release. (BZ#1810379)
- Previously, when a running VM was removed, the associated VMI appeared in the Virtual Machine list with the status `VM error`. With this release, stale VMIs with deleted associated VMs are no longer listed. (BZ#1803666)
- Previously, the disk import process only expected VM import resources. As a result, the VM resource link for import activity from a VM template or VMI pointed to a nonexistent VM. With this release, the import process recognizes VM templates and VMIs that are import resources and links to the correct resource. (BZ#1840661)
- With this release, the VM disk import process no longer reports a progress value of `NaN%`. (BZ#1836801)
- Previously, the virtual machine wizard used virtIO as the default interface for the VM root disk instead of using the interface specified in the common template. However, the virtIO interface is not compatible with all operating systems. With this release, the correct default interface for the operating system is selected based on the common template used. (BZ#1803132)
Console Metal3 plugin
- Previously, there was no space between the Powering on/off message and the bare metal host link in the web console. A space has been added so that the message now reads properly. (BZ#1819614)
- Previously, for bare metal installations, the Bare Metal Host Details page would not load when some nodes were not available. Now the Bare Metal Host Details page shows zero pods instead. (BZ#1827490)
Web console (Developer perspective)
- Previously, it was difficult to see the list of pods or resources associated with a Knative service in the Topology view. With this bug fix, when you select the Knative service, the sidebar displays a list of pods along with a link to see the logs. (BZ#1801752)
- When you edited an existing query using the PromQL editor in the metrics tab of the Monitoring view, the cursor moved to the end of the line. With this bug fix, the PromQL editor works as expected. (BZ#1806114)
- For Knative images, in the Add → From Git option, the Advanced Options for Routing did not provide a prefetched container port option. Also, if you created the service without updating the default port value of 8080, the revisions would not show. With this bug fix, the user can select from the available port options using the drop-down list, or provide input to use another port, and the revisions are shown as expected. (BZ#1806552)
- Previously, a Knative service created using the CLI could not be edited using the console because the images could not be fetched. Now, if the associated image streams are not found while editing, the value provided by the user for the container image in the YAML file is used. This allows the user to edit the service using the console, even if the service was created using the CLI. (BZ#1806994)
- In the Topology view, editing the image name in the external image registry for a Knative service did not create a new revision. With this bug fix, a new revision of the service is created when the name of the service is changed. (BZ#1807868)
- When you used the Add → Container Image option, and then selected the Image stream tag from internal registry option, the ImageStreams drop-down list did not list the option to deploy images from the OpenShift namespace. However, you were able to access them through the CLI. With this bug fix, all users have access to images in the OpenShift namespace through the console and the CLI. (BZ#1822112)
- Previously, in the Pipeline Builder, when you edited a Pipeline that referenced a Task that did not exist, the entire screen would go white. This fix displays an icon to indicate that an action is required, and a drop-down list is displayed to easily update the Task reference. (BZ#1839883)
- In the Pipelines Details page, when you changed existing fields in the Parameters and the Resources tabs, the Save button was disabled even though the new changes were detected. The validation criteria has now been modified and the Save button is enabled to submit changes. (BZ#1804852)
- In the Add → From Git option, the Pipeline templates provided by the OpenShift Pipelines Operator would fail when the Deployment or Knative Services resource options were selected. This bug fix adds support for using the resource type as well as the runtime to determine the Pipeline template, thus providing resource-specific Pipeline templates. (BZ#1796185)
- When a Pipeline was created using the Pipeline Builder and a Task parameter of the type array was used, the Pipeline did not start. With this bug fix, both array and string type parameters are supported. (BZ#1813707)
- In the Topology view, filtering nodes by application returned an error when the namespace had Operator-backed services. This bug fix adds the logic to filter out the Operator-backed service nodes based on the selected application group. (BZ#1810532)
- The Developer Catalog showed no catalog results until you selected the Clear All Filters option. With this bug fix, all catalog items are seen by default and you do not need to clear all filters. (BZ#1835548)
- Previously, users were unable to add environment variables for Knative services. As a result, apps that needed environment variables might not have worked as expected. Now, support has been added for environment variables; see the sketch after this list. (BZ#1839114)
- The Developer Console Navigation menu is now available and is aligned with the latest UX designs. (BZ#1801278)
- Time Range and Refresh Interval drop menus have been added in the Monitoring dashboard tab in Developer Perspective. (BZ#1807210)
- Previously, when no Pipeline Resources existed in the namespace even though the Start Pipeline modal required one, the user would see a disabled and empty drop-down above the fields, losing some context of what the fields were for. With this bug fix, Create Pipeline Resource gives the user that context inline in the Start Pipeline modal. The user now has a better experience starting a Pipeline from the start modal when there are no Pipeline Resources created in the namespace. (BZ#1826526)
- Layout padding was missing, which allowed the title to flow over the Close button. If text was over the Close button, it made it difficult to click. The layout is now fixed to prevent the title from overlapping the Close button and the button is now always accessible via mouse click. (BZ#1796516)
- The Pipeline Builder incorrectly interpreted a default value of an empty string (`''`) as having no default. Some Operator-provided tasks needed this to be the default and, therefore, had issues working without it. The Pipeline Builder now checks for a default property and does not assume the validity of the value. Now, any values that the OpenShift Pipelines Operator deems a valid default value are respected. (BZ#1829567)
- The Pipeline Builder reads `Task` and `ClusterTask` definitions and incorrectly assumed that all parameters were of type `string`. When a `Task` parameter of type `array` was encountered, it would cast the array to a string and represent it that way, losing the type; it would then provide the value to the `Task` parameter as a `string`, breaking the contract with the `Task` object. The `array` type is now supported in the web console, and the type is properly retained. Managing both types allows the Pipeline Builder to work the way it was intended. (BZ#1813707)
- The Pipeline page was inconsistent with other pages. The Create Pipeline button was always enabled and did not take into consideration when no projects were available. The Create Pipeline button is now removed when the Getting Started guide is enabled. (BZ#1792693)
- Metrics queries for the Dashboard & Metrics tab were updated in the design document, and the code needed to be synced with respect to those queries. The queries are now updated, and the order of the metrics queries and their labels are synced with the design. (BZ#1806518)
- The tile description variable was incorrectly set to be the CRD description appended with the CSV description. This caused the tile descriptions to be wrong. The tile descriptions are now back to the original value and the appended value is now moved to its own variable. (BZ#1814639)
- The `eventSources` API Group is updated to the latest supported API Group, `sources.knative.dev`. This update allows sources generated by the new API Group to be recognized in the Topology view of the web console. (BZ#1836805)
- With the release of Red Hat OpenShift Serverless Operator 1.7.1, the Operator is generally available. The Tech Preview badge in the Developer perspective of the web console has been removed. (BZ#1827042)
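As a sketch of the environment variable support mentioned above, here is a minimal Knative service with an environment variable set on its container; the service name, image, and variable are placeholders. The Developer perspective now lets you add such variables when creating or editing a Knative service.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: example-app                               # placeholder service name
spec:
  template:
    spec:
      containers:
      - image: quay.io/example/example-app:latest # placeholder image
        env:
        - name: LOG_LEVEL                         # placeholder environment variable
          value: debug
```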
DNS
- Previously, CoreDNS metrics were being exposed over an insecure channel within a cluster. Now the proper TLS components and a `kube-rbac-proxy` sidecar have been added to secure the CoreDNS metrics endpoint and expose CoreDNS metrics over a secure channel. (BZ#1809197)
- Previously, adding arbitrary taints to nodes could cause problems related to the DNS Operator's operand. Now the DNS Operator's operand tolerates any taint added to a node. The operand runs on, and updates `/etc/hosts` on, all Linux node hosts. Missing CNI default network events may be observed when the operand starts on a node that is still initializing, but such errors are transient and can be ignored. (BZ#1813479)
- Previously, there was a dependency on having specific DNS names for master nodes. Now any legal hostname can be used for master nodes. (BZ#1807234)
- Previously, when the `dnses.operator.openshift.io/default` object existed but its corresponding daemon set was not available, `clusteroperators/dns` reported the `Available` condition with an incorrect `NoDNS` reason and `No DNS resource exists` message. Now, under these same conditions, the correct reason and message appear. (BZ#1835725)
etcd
- Previously, the etcd peer certificate did not include the IPv6 localhost address, and connection attempts to `https://[::1]:2379` failed. This bug fix includes `::1` as one of the hosts in the peer certificate. Now repeated failed attempts to connect using `https://[::1]:2379` are no longer shown. (BZ#1810997)
- Previously, the CVO was overwriting certificates in a config map every 10 minutes. This caused a lot of overhead and negatively impacted cluster performance and stability. Now, certificates are created only once in a config map for improved performance and stability. (BZ#1819472)
- Previously, the cluster etcd Operator health status reporting was hard to understand. This was caused by improper log messaging construction, which often resulted in uncertainty of the cluster’s status. This has been fixed by properly analyzing the Operator statuses in a separate function to construct a proper log message and event about the etcd status. Now the status of the etcd pods on all master nodes are more meaningful. (BZ#1821286)
- Previously, the TLS certificates were mistakenly signed for 10 years, even though the documentation said that they were signed for three years. Now, the certificates are signed for only three years. (BZ#1837594)
- gRPC-go 1.23.0 had a client-side load balancer bug that could cause a deadlock. gRPC-go has been upgraded to version 1.23.1, in which the bug was fixed. (BZ#1823993)
- After stopping all pods, the restore process only restarted `etcd`, `api-server`, `api-scheduler`, and `controller-manager`. It did not restart network pods. As a result, kubelets could not communicate, and bare metal clusters could not stand up. Now, the restore service no longer stops pods that it cannot restart. Clusters stand up after the restore process runs. (BZ#1835146)
Etcd Operator
- Previously, there were missing properties in the etcd spec, causing the `oc explain etcd` command to incorrectly list properties referenced from the spec. The applicable CRD has been updated to describe the missing properties. Now the `oc explain etcd` command fully describes the properties of etcd. (BZ#1809282)
- The Etcd Operator was performing improper health checks, leading to incorrect event reports and misleading log messages. Health statuses are now detected correctly with improved messaging, providing accurate health statuses. (BZ#1832986)
Image
- Previously, the node-ca daemon was created only when the registry was set to `Managed`. When the registry was removed, the node-ca daemon was not created. With this bug fix, node-ca daemons are always created, even if the registry is removed; see the registry configuration sketch below. (BZ#1807471)
Image registry
- Previously, if you deleted the registry configuration without a proper storage configuration, the resource was never finalized because of the missing storage configuration, and the Operator could not remove the storage because it did not know about it. This bug fix makes the storage configuration optional, which allows the resource to be completely finalized. (BZ#1798618)
- Previously, the Image Registry Operator was not setting the `nodeSelector` label on the resources it created. This could cause future issues because the nodes that the resources can run on were not specified, and the registry could end up running on unsupported platforms. This bug fix adds the missing label to the created resources. Now, it is possible to see the label on the created resources. (BZ#1809005)
- Previously, pushing an image to a namespace that does not exist caused the image registry to return a `500` error code. This bug fix changed the return code to indicate the lack of permissions. Now, when pushing images to a namespace that does not exist, a permission denied error is returned. (BZ#1804160)
- The Azure infrastructure name is used for generated Azure containers and storage accounts. Therefore, if the Azure infrastructure name contained uppercase letters, the container would successfully be created, but the storage account creation would fail. This bug fix adjusts the container name creation logic to discard invalid characters, allowing the image registry to deploy on an infrastructure that contains invalid characters in its name. (BZ#1827807)
- When deleting a non-empty image registry with GCP storage, the image registry hostname was not being removed from the image configuration file. This prevented you from creating a new image registry. The code has been changed to remove the image registry hostname from the image configuration file when you delete an image registry. As a result, you can delete and create image registries as expected. (BZ#1827075)
- Because the image registry was not removing objects from a bucket before it removes the bucket, you could not delete a bucket with images. The code has been changed to remove images before removing a bucket. You can delete non-empty buckets as expected. (BZ#1827075)
- Because images in the image registry were not clearing their yum cache, the image sizes could get large. The image registry `Dockerfile` was changed to include a `yum clean all` command. The images are now smaller. (BZ#1804493)
- The `keepYoungerThan` parameter in an image pruning custom resource uses nanoseconds and cannot be configured to use a larger period of time. Nanoseconds are not an appropriate unit to use in an image pruner. A new parameter, `keepYoungerThanDuration`, has been added to the image pruning custom resource that replaces and overrides the `keepYoungerThan` parameter; see the sketch after this list. (BZ#1835004)
- The Image Registry Operator did not properly clean up the storage status when the user changed the Operator to the `Removed` state. As a result, when the user changed the Operator back to `Managed`, the Operator could not create a new storage pod. The Operator was changed to properly clean up the storage status, and the Operator can now create a new storage pod. (BZ#1785534)
- Because the Image Registry Operator was not cleaning logs, you could see improper messages in the logs. The code has been changed to clean the logs and remove these improper messages. The logs now display proper information. (BZ#1797840)
- Because the default Image Registry Operator was configured with zero replicas, problems could result unless the value was manually changed. The Operator was updated to install with one replica. (BZ#1811846)
- Registry credentials used during the cluster installation were not available to specific namespaces, so users needed to create new credentials. The code was changed so that if the credentials for a registry were provided during the installation, users can import images using those credentials. (BZ#1816534)
- Because the Image Registry Operator was being installed with only one pod, it did not meet high availability requirements. The Operator is now installed with two pods for high availability. (BZ#1810317)
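To illustrate the new pruning parameter mentioned above, the following is a hedged sketch of the cluster `ImagePruner` resource; the schedule and retention values are placeholders.

```yaml
apiVersion: imageregistry.operator.openshift.io/v1
kind: ImagePruner
metadata:
  name: cluster
spec:
  schedule: "0 0 * * *"          # placeholder cron schedule
  keepTagRevisions: 3            # placeholder revision count
  keepYoungerThanDuration: 60m   # replaces and overrides the nanosecond-based keepYoungerThan parameter
```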
Installer
- On the Azure platform, the `cifs-utils` package is required to create volume mounts for pods. With this release, `cifs-utils` is included in the packages installed for RHEL 7 hosts when installing OpenShift Container Platform. (BZ#1827982)
- When recovering from an expired control plane certificate, the cluster was unable to connect to the recovery API server on port 7443. This was caused by the recovery API server's port conflicting with the HAProxy port used for OpenStack, oVirt, bare metal, and vSphere, which resulted in an `Unable to connect to the server: x509: certificate signed by unknown authority` error. HAProxy now listens on port 9443, allowing the recovery API server to use port 7443 to facilitate the recovery process for an expired control plane certificate. (BZ#1821720)
- Previously, the RHOSP installer created security groups using `remote_group_id` to allow traffic origins. Using the `remote_group_id` in the security rules was very inefficient, triggering a lot of computation by the OVS agent to generate the flows. This process sometimes exceeded the time allocated for flow generation. In such cases, especially in environments already under stress, master nodes would be unable to communicate with worker nodes, causing the deployment to fail. Now IP prefixes for whitelisting traffic origins are used instead of the `remote_group_id`. This lessens the load on Neutron resources, reducing the occurrence of timeouts. (BZ#1825286)
- Previously, the installation program required the user to manually create a virtual machine template before it could create an OpenShift Container Platform cluster on Red Hat Virtualization (RHV). This was because the installation program did not meet the following requirements in RHV version 4.3.9:
- The installation program must pass the ignition to the virtual machine.
- The template must specify its OS type as Red Hat CoreOS (RHCOS).
The installation program now creates a template that specifies RHCOS as the OS type, and it passes the ignition to the VM. The user no longer needs to create a virtual machine template. (BZ#1821151)
- Previously, the Keepalived process that provides failover for both API-VIP and INGRESS-VIP addresses in bare metal installer-provisioned infrastructure clusters used an IPv4 local address in a script that monitors local component status to decide which node should own the VIP, even if the deployment used IPv6 addresses. Because of this, in IPv6 deployments, Keepalived sometimes received incorrect component status. Now, the Keepalived script uses localhost, which resolves to 127.0.0.1 in IPv4 deployments and ::1 in IPv6 deployments, so it always uses the correct local IP address. (BZ#1800969)
- Previously, in bare metal clusters that use installer-provisioned infrastructure, the VIP did not always fail over to a control plane machine with a healthy load balancer. Because of this, a control plane machine continued to own the API-VIP IP address even though its local load balancer was unhealthy, and the OpenShift Container Platform API was unreachable for about 10 seconds. Now, the Keepalived check script for the API-VIP also monitors the health of the self-hosted load balancer, and the API-VIP fails over to a control plane node with a working load balancer, preventing service downtime for the OpenShift Container Platform API. (BZ#1835974)
- Previously, the installation program did not explicitly check for an overlap between the `machineCIDR` and `provisioningNetworkCIDR` ranges. As a result, the error message when the network ranges overlapped was unclear. Now, the installation program explicitly checks for overlap between these network ranges and presents a clear error message if they overlap. (BZ#1813422)
- Because Operators in the control plane can start before the bootstrap process completes, the bare metal provisioning infrastructure might be active on both the bootstrap and control plane at the same time. Previously, both sets of provisioning infrastructure could provision compute machines, and the machines did not all use the same infrastructure. Now, the bootstrap provisioning infrastructure provisions only control plane machines, so both provisioning infrastructures can be online at the same time. (BZ#1800746)
- Previously, the wrong port number was used when blocking DHCP traffic to the bootstrap node on IPv6. Because of this, a race condition was introduced where a control plane machine sometimes incorrectly obtained a DHCP lease from the bootstrap node. Now, the correct port is blocked for DHCPv6, and control plane machines are provisioned from only the bare metal infrastructure that runs in the cluster. (BZ#1809691)
- Previously, with a bare metal cluster that uses installer-provisioned infrastructure, using VRRP to manage the virtual IP addresses for OpenShift Container Platform clusters meant that if you ran several clusters, virtual router IDs might already be in use in the broadcast domain. Because of this, nodes might be assigned virtual IP addresses that are already in use. Now, you can use a tool to check which virtual router IDs will be used for the chosen cluster name before you deploy a cluster. (BZ#1821667)
- OpenShift Container Platform version 4.1 clusters did not use the `infrastructure.status.infraPlatform` parameter. Because of this, Operators had to check and use old fields for clusters that originally installed version 4.1, which caused errors during upgrades. Now, the migration controller sets the new fields for all clusters during upgrade by using information that is available in the cluster, so Operators can use all of the new parameters, reducing upgrade errors. (BZ#1814332)
- Because the AWS API that is used to fetch resources for clusters is extremely slow to react to previously deleted resources, trying to delete already deleted hosted zones caused failures if you tried to destroy a cluster multiple times. Because of this, the destroy command looped until the AWS APIs removed the hosted zone from their response. Now, the installation program skips the `notfound` error for hosted zones, and the destroy command completes more quickly. (BZ#1817201)
Previously, the bootstrap server endpoint used the
api
endpoint that goes through the external load-balancer. Because of this, you needed to open another port to add RHEL nodes to the cluster. Now the bootstrap server endpoint uses the internalapi-int
endpoint, and you no longer need to open another port on the external load balancer. (BZ#1792822) -
Previously, for bare metal clusters, in order to support nodes DNS resolution, the node’s
/etc/resolv.conf
file pointed to the local instance of the infrastructure CoreDNS by prepending the node’s control plane IP address to the node’s/etc/resolv.conf
file. Because of this, when a host already had three nameservers listed in its/etc/resolv.conf
file, pods generated a "nameserver limits were exceeded" alert. Now, only the first three nameservers are included in the generated/etc/resolv.conf
file, so the alert is no longer generated by pods. (BZ#1825909) -
Previously, the
ipxe.efi
file was not present in the running ironic container, so the booting UEFI failed in cases whereipxe.efi
was needed. Now, theipxe.efi
file is copied to the/shared
directory at runtime, so UEFI boot is no longer impacted. (BZ#1810071) - Previously, rate limiting from AWS sometimes caused a failure to create records for the cluster. Now, the installation program uses an exponential back-off to allow for a longer wait timeout, which creates fewer failures due to rate limiting. (BZ#1766691)
- Previously, rate limiting from AWS sometimes caused a failure to fetch zones for the cluster, which would prevent the cluster from installing. Now, the installation program uses an exponential back-off to allow for a longer wait timeout, which creates fewer failures due to rate limiting. (BZ#1779312)
- Previously, the installation program did not check for symlinks when determining the relative path to the configuration file, so the installation failed if the installation program ran from a symlink. Now, the installation program checks for symlinks, and you can run the installation program from a symlinked directory. (BZ#1767066)
- Previously, the AWS Terraform provider that the installation program used occasionally caused a race condition with the S3 bucket, and the cluster installation failed with the following error:
  `When applying changes to module.bootstrap.aws_s3_bucket.ignition, provider level=error msg="\"aws\" produced an unexpected new value for was present, but now absent.`
  Now, the installation program uses different AWS Terraform provider code, which robustly handles S3 eventual consistency, and the installer-provisioned AWS cluster installation no longer fails with that error. (BZ#1745196)
- Previously, the CoreDNS forward plugin used a random server selection policy by default. As a result, clusters failed to resolve the OpenStack API hostname if given multiple external DNS resolvers. The plugin now uses DNS servers in the order they are provided. (BZ#1809611)
- Due to performance variability among RHOSP clouds where OpenShift Container Platform can be installed, installation times vary. As a result, the installer can time out before the installation succeeds. As a workaround, check your cluster’s status after the installer indicates failure. The cluster might be healthy. (BZ#1819746)
- On RHOSP, control plane and compute nodes inject their IP addresses into their `/etc/resolv.conf` files as their preferred nameservers. As a result, hosts that already had three nameservers in the file generated nameserver limit warnings. Now, only the first three nameservers in `/etc/resolv.conf` are preserved, and pods no longer generate nameserver warnings in this situation. (BZ#1791008)
- Previously, RHOSP clouds without trunk ports could return an error that the installer misinterpreted as a failure. As a result, cluster destruction would loop before timing out. With this update, the installer now correctly interprets the error, allowing for successful cluster destruction on clouds that do not support trunk ports. (BZ#1814593)
- RHOSP resources that share names cannot be removed. Previously, if security groups that shared a name existed, cluster destruction using Ansible playbooks failed on RHOSP clouds. Now, the `down-security-groups.yaml` playbook uses group IDs instead of names when destroying clusters. All security groups are deleted if the playbook finishes successfully. (BZ#1841072)
- Some RHOSP environments might enforce a policy that disallows VMs from booting with ephemeral disks. As a result, cluster installations failed when bootstrap machines attempted to boot from ephemeral disks. Now, bootstrap machines follow the `rootVolume` settings from the control plane machine pool, allowing cluster installations to succeed in environments that disallow VMs from booting with ephemeral disks; see the sketch after this list. (BZ#1820434)
- Previously, a prerequisite Terraform step did not always happen before floating IP address (FIP) association on clusters that run on RHOSP. As a result, a race condition could occur that caused installations to fail. The Terraform step now always occurs before FIP association. (BZ#1846297)
- Because a RHOSP user-provisioned installation script was not compatible with some Ansible versions, installations could fail. The script was updated to ensure broad compatibility. Now, installations succeed regardless of your Ansible version. (BZ#1810916)
- Currently, the RHOSP user-provisioned infrastructure playbooks do not delete Cinder volumes that were created in the cluster's lifetime. As a result, destroyed clusters leak Cinder volumes. As a workaround, delete Cinder volumes manually after cluster destruction. (BZ#1814651)
- Previously, clusters on RHOSP did not process all certificates that were passed to it in certificate authority (CA) file bundles. As a result, clusters could not be installed with intermediate certificates that were signed by a non-default trusted authority. CA files are now split and processed separately, allowing installations that use intermediate certificates signed by non-default trusted authorities. (BZ#1809780)
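As a sketch of the root volume behavior noted in the RHOSP ephemeral disk fix above, the control plane machine pool in `install-config.yaml` can specify a root volume; the size and volume type shown here are placeholders, not values from these release notes. Bootstrap machines now follow these settings.

```yaml
controlPlane:
  name: master
  replicas: 3
  platform:
    openstack:
      rootVolume:
        size: 30            # placeholder size in GiB
        type: performance   # placeholder Cinder volume type
```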
kube-apiserver
- Previously, some users could not upgrade from 4.2 to 4.3 due to an upstream bug that prevented the running of clusters that used a mix of Kubernetes 1.14 and 1.16 components. This fix includes a merge from upstream so that OpenShift Container Platform 4.3 is now compatible with OpenShift Container Platform 4.2 when upgrading. (BZ#1816302)
- Previously, when creating a new version of an Operator, it could take several minutes before the lock was released and the new version of the Operator could continue because the leader election setup was not releasing the lock when the Operator received a UNIX signal to shut down. With this fix, the Operator rollout time has improved significantly because control plane Operators now respect the graceful termination period and do not have to wait for the lock to be released on startup. (BZ#1775224)
- Previously, during upgrades, the OpenShift Container Platform API server would sometimes be added back to the GCP load balancer, despite not yet being able to serve traffic because routes on the node were misconfigured. This was caused by a race condition between the node and GCP load balancer. This has been fixed by moving route configurations to iptables and differentiating between local and non-local traffic; non-local traffic is now always accepted. Now during API server upgrades, connections are gracefully terminated, and new connections are load-balanced only to running API servers. (BZ#1802534)
kube-scheduler
- Previously, pods that were evictable because they would fit on a certain node might not be evicted, because the Descheduler would return early in the node-checking loop that determines whether pods are evictable in the `NodeAffinity` strategy. Now, the break condition of the node-checking loop has been fixed so that all nodes are considered when checking evictability; see the policy sketch below. (BZ#1820253)
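For context, the node affinity check referenced above corresponds to the upstream descheduler's `RemovePodsViolatingNodeAffinity` strategy. The following is a hedged sketch in the upstream policy format, not necessarily how the OpenShift Descheduler is configured; the strategy name and parameter come from the upstream project.

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingNodeAffinity":
    enabled: true
    params:
      nodeAffinityType:
      - "requiredDuringSchedulingIgnoredDuringExecution"   # evict pods whose required node affinity is no longer satisfied
```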
Logging
- Previously, the Fluentd buffer queue was not limited, and a high volume of incoming logs could flood the file system of a node and cause it to crash. As a result, applications would be rescheduled. To prevent this type of crash, the Fluentd buffer queue is now limited to a fixed number of chunks per output (default: `32`). (BZ#1780698)
- In an IPv6 bare metal deployment, Elasticsearch was binding on the IPv4 loopback address instead of the cluster IPv6 address. As a result, the Elasticsearch cluster failed to start. The downward API was changed to set the binding and publish host for Elasticsearch. Elasticsearch is able to bind to the network interface and starts as expected. (BZ#1811867)
- Because the cluster logging cluster service version (CSV) was using incorrect paths to obtain the status of some cluster logging components, the status was not being reported. As a result, cluster logging was not functioning properly. The paths have been corrected and cluster logging is working as expected. (BZ#1840888)
- Because the Elasticsearch Operator creates a second deployment when more than three Elasticsearch nodes are configured, the Cluster Logging Operator was not reading the correct number of Elasticsearch nodes. As a result, the Cluster Logging custom resource always reported the number of nodes associated with one deployment. The Cluster Logging Operator was changed to correctly compute the number of Elasticsearch nodes. (BZ#1732698)
Machine Config Operator
- Multiple available networks on worker nodes make it difficult to pick an address on the control plane for CRI-O. This causes CRI-O to often bind to a non-control plane interface. This bug fix updates the CRI-O systemd service to depend on a service that chooses the correct interface and configures the CRI-O service. As a result, CRI-O binds to an address in the control plane as expected. (BZ#1808018)
- Previously, when configuring OperatorHub for restricted networks in an IPv6 bare metal deployment, multiple interfaces could come up on OpenShift Container Platform nodes without DHCP-provided names or reverse resolution. This caused the multicast DNS publishing service to start with the default `localhost` name. This bug fix ensures that the Machine Config Operator only configures non-default names and waits until those are available. As a result, the correct host names are published to multicast DNS. (BZ#1810333)
- The Ingress Virtual IP management configuration was using a fixed string for its password. If two VRRP keepalived instances in separate clusters had the same Virtual Router ID, they would have the same password and could potentially belong to a single virtual router. This bug fix makes the password change depending on cluster configuration. As a result, different cluster Ingress Virtual IPs now have different passwords. (BZ#1803232)
- Previously, the systemd service that performs control plane IP detection and configuration for the kubelet and CRI-O could run before any control plane IP was configured, resulting in the `nodeip-configuration` service failing with a "Failed to find suitable node ip" message. Now, the system retries until the interface has a control plane IP configured. (BZ#1819484)
Previously, when CoreDNS would forward DNS requests to the list of servers in the
/etc/resolv.conf
file, if the file was changed, the change would not be reflected in the CoreDNS Corefile. With this fix, the Coredns-monitor pod now verifies that the CoreDNS forward list is synced with/etc/resolv.conf
so that the list of servers appear in the file. (BZ#1790819) - Previously, when the interface that keepalived uses was bridged, it was possible for users to dynamically put interfaces in bonds or bridges, and doing so could prevent keepalived from resuming operation, disrupting Virtual IP management. With this fix, the monitor interface now changes and reloads keepalived so that it reads the new configuration and virtual IP management can operate with minimal disruption. (BZ#1751978)
- Because some routes contained the `expires` field, the IPv6 `non_virtual_ip` script could not process the route. As a result, services that need to be configured with a `non_virtual_ip` failed. The `non_virtual_ip` script has been updated. Routes are now parsed and services are configured correctly. (BZ#1817236)
Web console (Administrator perspective)
- Invalid monitoring flags were set when the console was started due to a missing Prometheus link on the monitoring metrics query page. Now, the proper flags have been set and Prometheus monitoring is available on the metrics query page. (BZ#1811481)
- When a user tried to install into an unsupported namespace, the form could not be submitted because it was not clear to the user which installation mode was supported by the Operator group. Now, an alert has been added for the Operator's supported install mode. The alert clarifies whether the selected namespace can be used by the Operator being installed. (BZ#1821407)
- Machine Health Checks and Machine Config were not visually separated, causing confusion to the user. Now, a divider has been added between the Machine Health Checks and the Machine Config for clarity. (BZ#1817879)
- An error message would appear in the browser’s console due to a missing property for the react component. Now, the property has been added for the react component and the error message does not occur. (BZ#1800769)
- Multiple alert receivers could be created with the same name. If one of the same-named alerts was deleted, all of them would be deleted. Now, in the Create Receiver form, users are prompted with an error message if the name already exists, and the Create button is disabled. Users cannot create two receivers with the same name. (BZ#1805133)
- PVCs were sorted alphabetically, and now they are sorted numerically. (BZ#1806875)
- Services were listed in alphabetical order, so that `oc` was not the first option. Now, the `oc` option is moved to the front of the list. (BZ#1802429)
- After alerts were changed to a silenced state, the Status card and notification drawer would continue to show the silenced alert. Now, the dashboard and the notification drawer do not show silenced alerts. (BZ#1802034)
- After alerts were changed to a silenced state, the Status card and notification drawer would continue to show the silenced alert. Now, the dashboard and the notification drawer do not show silenced alerts. (BZ#1808059)
- Sorting was not based on data in the column, causing erroneous sorting. Now, data is sorted by the correct operand status values. (BZ#1812076)
- Status descriptor paths can be longer than the space allotted for them inside the donut chart. Status descriptor paths that are very long could be clipped on the right and left sides, obscuring the value. Now, the status descriptor path is placed below the donut chart so it can wrap as needed, and more than one status descriptor can appear per row. Status descriptor paths with long values are fully visible, and less scrolling is required to view all status descriptors. (BZ#1823870)
- The web console would display an inaccurate update status of `Error Retrieving` when the version did not appear in the update channel. This suggested the version should be available, but it was not. Now, the web console displays `Version not found` when the version does not appear in the update channel. (BZ#1819892)
The installed Operators list was only sortable by the
Name
column, limiting sorting options for users. Now, users can sort the list by more than just the name column. (BZ#1797931) - The pods details page did not include conditions. Without the conditions, it was difficult to know the status of the pod. There is now a conditions section on the pod details page and it is easier to discern the status of the pod. (BZ#1804869)
- The query browser results were rendered with a hard-coded sort. The hard-coded sort could override the sort specified in the query, thus rendering a different result than requested. The hard-coded sort is now removed so the sort specified in the query is preserved. (BZ#1808394)
- Previously, the web console experienced runtime errors on certain pages because the ts-loader used the incorrect `tsconfig.json` in some cases. The ts-loader issue is resolved, allowing all web console pages to load properly. (BZ#1811886)
- When navigating to the Advanced → Project Details Inventory section from the Developer perspective of the web console, `DeploymentConfig` objects were not listed. The `DeploymentConfig` objects are now tracked and are included in the Inventory section of the dashboard. (BZ#1825228)
- Previously, the web console did not display user details when the user name contained special characters such as `#`. The web console now displays user details regardless of special characters in the user name. (BZ#1835460)
- Previously, when an object was edited in the YAML editor, the editor did not verify the presence of the required `metadata` field. If the field was missing when the object was saved, an error was logged in the browser's JavaScript console, but no visible feedback was provided. Now, if the required `metadata` field is missing, the web console presents an actionable error message. (BZ#1787503)
- Previously, when editing an object by using the form view, switching to the YAML editor for the object did not synchronize all existing data. Now, all data is correctly synchronized between the form view and the YAML editor. (BZ#1796539)
- Previously, when navigating with the tab key, the notification drawer might be triggered and expand. With this bug fix, the notification drawer is not triggered when tabbing through UI elements. (BZ#1810568)
- Previously, when listing existing instances of a custom resource definition (CRD), the wrong API was used to populate the list. Now the correct API is used to populate the list. (BZ#1819028)
- On the Operators → Installed Operators page, when viewing the available custom resource (CR) list for a selected Operator, the Version column displayed the value `Unknown`. Because no version information is available for a CR, this field is now removed from the UI. (BZ#1829052)
- Previously, when completing the Create Operator Subscription form, if the Update Channel field was changed, the target namespace for the subscription was erroneously reset and the form could not be submitted. Now, when adjusting the Update Channel, the target namespace value is preserved and the form can be submitted successfully. (BZ#1798851)
- Previously, the metric `openshift_console_operator_build_info` was not properly exposed. With this bug fix, the metric is available in Prometheus. (BZ#1806787)
- Previously, in the Administrator perspective, when viewing the Workloads tab with a side panel visible, the notification drawer, when expanded, was hidden beneath the side panel. This bug fix adjusts the CSS `z-index` so that the notification drawer is visible. (BZ#1813052)
Previously, the OperatorHub was visible in the web console to only cluster administrators. With this update, the web console now shows the OperatorHub to users who are assigned the
aggregate-olm-view
andaggregate-olm-edit
cluster role bindings. (BZ#1819938) -
Previously on the Home
Events page from the Administrator perspective of the web console, the node name did not show for several node events. With this update, all events now correctly link to the corresponding node. (BZ#1809813) -
Previously, the Home
Overview menu item from the Administrator perspective of the web console was hidden from users who could not list namespaces, but otherwise have permissions to see cluster metrics. With this update, the Overview navigation item is now visible for all users who have authority to view cluster metrics. (BZ#1811757) - Previously in the OperatorHub on the Installed Operators page, the link to view more APIs for an installed Operator did not open the correct tab. With this update, the View x more link under Provided APIs goes to the Details tab for the installed Operator. (BZ#1824254)
- Previously in the OperatorHub, overflow of a container background was not hidden in mobile view. This update fixes the gray background and hides the overflow. (BZ#1809812)
- Previously, the `fieldDependency` `specDescriptor` did not work as expected. As a result, the Control Field did not control the visibility of the Dependent Field. The visibility of the Dependent Field is now correctly enabled or disabled by the Control Field. (BZ#1826074)
- Previously, the default CA certificate was being used inside the console pod. This bug fix configures the console to use the `default-ingress-cert` config map if that config map exists; if it does not exist, the console configures the default CA certificate instead. This allows the default Ingress certificate to be used, if available, to verify access to the routes the Ingress controller creates. (BZ#1824934)
- Previously, when creating a new Alert Receiver, the web console did not indicate that routing labels were required. A red asterisk has been added as a visual indicator that the routing labels are required. (BZ#1803614)
- Previously, the Role Bindings tab in the web console ClusterRole details page could show bindings for a namespaced Role with the same name. The tab now correctly shows only bindings for the ClusterRole. (BZ#1624328)
- Previously, markdown tables for OLM Operators could render poorly when they had a lot of content. The web console has improved the display of these tables and added a horizontal scrollbar, when necessary. (BZ#1831315)
- Previously, when checking all PVCs in the web console, it was hard to distinguish which storage class the PVC belonged to. A PVC Storage Classes column has been added to the web console so it is easier to find storage class info for PVCs. (BZ#1800459)
- Previously, creating a new machine config pool using the console’s Compute → Machine Config Pools → Create Machine Config Pool button resulted in a machine config pool that did not match the node. This was caused by the template using the spec.machineSelector key for selecting the nodes to match. However, this key is not recognized by the API; the correct key for selecting a node is spec.nodeSelector. The key for selecting nodes has been updated, allowing the web console to display a Machine Selector that now matches the appropriate node; see the example after this list. (BZ#1813369)
- Previously, oc was not listed first on the CLI downloads page because the CLI downloads were listed alphabetically. Because oc is the primary CLI for OpenShift Container Platform, it is now listed at the top of the CLI downloads page. (BZ#1824934)
- Previously, the Explorer view presented Access Review tabs to users who lacked the required permissions to view these tabs. Users without this authorization saw an error message and instructions to try reloading the tab, but retrying would not change the result. With this release, the Access Review tabs are hidden from users who do not have permission to view the contents of the tabs. (BZ#1786251)
- Previously, memory consumption data in the Cluster Utilization card view and the top consumers popover view was inconsistent because these two views used different methods to calculate memory usage. With this release, the two views use the same method to calculate memory usage so that the data they provide is consistent. (BZ#1812096)
- Previously, users were able to create two routing labels for a single alert receiver. When two routing labels had the same key, the list page showed only the most recently created one. However, if exactly one of the routing labels used a regular expression, the details page displayed them as two distinct routing labels. With this release, users can no longer create two routing labels for a single alert receiver. (BZ#1804049)
- With this release, an update to a library that is used by the web console resolved performance and display issues on some views. (BZ#1796658)
- Previously, some links in the masthead had an href value of # with an OnClick handler containing the target destination. As a result, those links had the option to open in a new tab, but the # resolved to the dashboard instead of the intended target destination. Now, any links with an href of # are updated to a button element so the Open Link In New Tab option is not available. Links that still have the Open Link In New Tab option show the correct URL. (BZ#1703757)
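In case it helps illustrate BZ#1826074, the following is a minimal sketch of a fieldDependency specDescriptor as it might appear in a ClusterServiceVersion; the resource kind, field paths (enableAdvanced, advancedOptions), and display names are hypothetical and only show how a Control Field gates the visibility of a Dependent Field.

```yaml
# Hypothetical ClusterServiceVersion excerpt; field paths and names are placeholders.
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: example-operator.v1.0.0
spec:
  customresourcedefinitions:
    owned:
      - name: examples.example.com
        version: v1
        kind: Example
        specDescriptors:
          # Control Field: rendered as a boolean switch in the console form.
          - path: enableAdvanced
            displayName: Enable Advanced Options
            x-descriptors:
              - 'urn:alm:descriptor:com.tectonic.ui:booleanSwitch'
          # Dependent Field: shown only while enableAdvanced is true.
          - path: advancedOptions
            displayName: Advanced Options
            x-descriptors:
              - 'urn:alm:descriptor:com.tectonic.ui:fieldDependency:enableAdvanced:true'
```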
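For BZ#1813369, a minimal sketch of the kind of machine config pool the console now generates is shown below, using the API-recognized spec.nodeSelector key rather than spec.machineSelector; the pool name, role value, and node label are hypothetical.

```yaml
# Hypothetical MachineConfigPool; the name, role, and labels are placeholders.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: example
spec:
  machineConfigSelector:
    matchExpressions:
      - key: machineconfiguration.openshift.io/role
        operator: In
        values:
          - worker
          - example
  # The correct key for selecting nodes; spec.machineSelector is not recognized by the API.
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/example: ""
```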
Monitoring
- Previously, mishandling of metadata related to the Prometheus PVC name could cause upgrade failures to or from versions 4.4.0-4.4.8. Now data is copied from the old persistent volumes to the new ones in order to retain the metric data and allow the upgrades to complete. (BZ#1832124)
- Previously, Thanos Querier could be scheduled on both master and worker nodes, but it is only meant to be scheduled on worker nodes. Now the toleration allowing Thanos Querier to be scheduled on master nodes has been removed, so Thanos Querier is only deployed on worker nodes. (BZ#1812834)
- Previously, the evaluation of some Prometheus recording rules occasionally failed and caused metrics generated from the rules to go missing. Now the recording rules have been fixed. (BZ#1802941)
- Previously, the CPU usage rate was showing incorrect results due to statistical smoothing of the data. Now the method for calculating CPU usage has been updated and the results of oc adm top are similar to the Linux top utility. (BZ#1812004)
- Previously, custom configurations to cluster monitoring were being lost because the cluster-monitoring-config config map was invalid and the cluster monitoring Operator fell back to the default configuration. Now, when the cluster monitoring Operator cannot decode the cluster-monitoring-config config map, it does not use the default configuration and fires a warning alert instead; see the example config map after this list. (BZ#1807430)
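For BZ#1807430, a decodable cluster-monitoring-config config map looks roughly like the following sketch; the prometheusK8s retention value is only an example setting, and a syntax error inside the embedded config.yaml is the kind of problem that now fires a warning alert instead of silently reverting to defaults.

```yaml
# Sketch of a valid cluster-monitoring-config config map; the setting shown is an example only.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      retention: 24h
```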
Networking
- Changes to the kube-proxy metrics implementation made some metrics disappear during the Kubernetes 1.17 rebase. This bug fix changes how metrics are published in SDN, keeping them from disappearing. (BZ#1811739)
- Previously, when the installer introduced machineNetwork, the Cluster Network Operator was not modified to add it to proxy.status.noProxy. This bug fix sets proxy.status.noProxy to contain the expected fields, including machineNetwork. (BZ#1797894)
- Previously, the node detected its own IP incorrectly, preventing it from owning the egress IP it was assigned. This bug fix assigns the node IP from the Kubernetes API instead. (BZ#1802557)
- A code change inadvertently stopped setting the status for third-party plug-ins, which meant the Cluster Network Operator status never indicated that it was working. This bug fix added code to set the status when a third-party plug-in is in use. Now the Cluster Network Operator correctly reports status when a third-party plug-in is in use. (BZ#1807611)
- Previously, the Cluster Network Operator had no logic during Kuryr bootstrapping to remove deprecated security group rules when they were replaced by new ones. On OpenShift Container Platform upgrades, the old security group rules were left on the security groups, which meant the tightened rules that increase security were not applied to environments upgraded from 4.3 to 4.4. This bug fix ensures that the Cluster Network Operator removes old security group rules. As a result, the old security group rules are removed on a 4.3 to 4.4 upgrade and pods correctly get restricted access to host VMs. (BZ#1832305)
- Previously, to enforce a network policy that blocks all traffic, the service matched by that policy needed its corresponding load balancer to block the traffic, which Octavia provided by using ACLs and disabling the admin state on the load balancer listeners. As a consequence, a mismatch between the security groups on the Kuryr annotation for the OpenShift Container Platform endpoints and the actual security groups set for the pods caused some load balancers to be considered for a network policy update, and their traffic was blocked by disabling the admin state. With this bug fix, the security groups field on the Kuryr annotation for the endpoints matches the existing security groups of the selected pods. Now all load balancer listeners have the admin state enabled unless a network policy blocks it; see the deny-all network policy sketch after this list. (BZ#1824258)
- Previously, iptables experienced locking problems. In rare circumstances, a pod could fail to start, and the command oc describe pod would show an event including the text, "Failed create pod sandbox … could not set up pod iptables rules: Another app is currently holding the xtables lock." This bug fix passes -w to iptables in the relevant piece of code, and as a result iptables waits for the lock and does not fail spuriously. (BZ#1810505)
- Previously, on node deletion, the chassis record for the node would not get removed from the south-bound database. Stale chassis records resulted in stale logical flows for that chassis which were never removed. This bug fix added a node sync mechanism in ovnkube-master to purge chassis records of deleted nodes. Now there are no more stale chassis records or stale logical flows corresponding to deleted nodes in the south-bound database. (BZ#1809747)
- When etcd was running slowly, openshift-sdn could miss namespace creation events due to a race condition. This could lead to pods in that namespace having no connectivity. With this bug fix, the race condition was removed. As a result, pods eventually have connectivity. (BZ#1825355)
Node
- Previously, the kubepods.slice memory cgroup was not set to the maximum limit minus the reservations. This caused the nodes to become overloaded with out-of-memory errors and not evict workloads. The kubepods.slice memory reservation is now set correctly. (BZ#1800319)
- Previously, metrics were missing for device mapper devices, so no metrics were available if the system was using a device mapper for the root device. The cAdvisor was fixed and metrics are now available whether or not a device mapper is used for the root device. (BZ#1849269)
Node Tuning Operator
- The Node Tuning Operator did not ship with fixes to address tuned daemon behavior related to (BZ#1702724) and (BZ#1774645). As a result, when an invalid profile was specified by the user, a Denial of Service (DoS) of the operand’s functionality occurred. Also, correcting the profile did not restore the operand’s functionality. This was fixed by applying the aforementioned bug fixes, allowing the tuned daemon to process and set a new, corrected profile. (BZ#1823941)
- Previously, tuned pods did not mount /etc/sysctl.{conf,d/} from the host. This allowed tuned profiles to override the sysctl settings provided by the host. Now /etc/sysctl.{conf,d/} is mounted from the host in tuned pods, which prevents tuned profiles from overriding the host sysctl settings in /etc/sysctl.{conf,d/}; see the Tuned profile sketch after this list. (BZ#1825322)
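To illustrate how a Tuned profile applies sysctl settings (the values that no longer override the host's /etc/sysctl.{conf,d/} entries per BZ#1825322), a custom Tuned resource looks roughly like this sketch; the profile name, sysctl value, and match label are hypothetical.

```yaml
# Hypothetical Tuned custom resource; names and values are placeholders.
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: example-sysctl
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
    - name: example-sysctl
      data: |
        [main]
        summary=Example profile that sets a sysctl value
        include=openshift-node
        [sysctl]
        vm.dirty_ratio=20
  recommend:
    - match:
        - label: node-role.kubernetes.io/worker
      priority: 20
      profile: example-sysctl
```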
oc
- Previously, printer flags were not wired properly and the oc adm group sync command was missing output options. The flags are now wired properly and all of the output options are working correctly. (BZ#1828194)
- Previously, the format result function had a hard-coded size, so a panic occurred when the array was filled with fewer entries than the hard-coded limit. The number of LDAP entries is now limited based on the actual array capacity and the function correctly formats results. (BZ#1806876)
- Previously, the oc image mirror command would give an error if only the --from-dir option was specified, even though it should override the --dir option. Now, --from-dir properly overrides --dir, and the command succeeds. (BZ#1807807)
- Previously, the help examples for the oc adm release command were not displayed correctly. They have been updated so that they now display properly. (BZ#1810310)
OLM
- Custom resources installed by Operator Lifecycle Manager (OLM) were given OwnerReference objects pointing to the InstallPlan object they were applied from, so deleting an InstallPlan object deleted the custom resources that were applied from it. This bug fix updates OLM to point OwnerReference objects for custom resources to the CSV that they were installed for. As a result, deleting an InstallPlan object no longer deletes the custom resources that were applied from it. (BZ#1808113)
- Previously, the garbage collection resource event queue was not configured correctly. This caused cluster-scoped resources generated for Operators managed by Operator Lifecycle Manager (OLM) to never get cleaned up when the Operator was uninstalled. This bug fix updates OLM to reconfigure garbage collection queues to be hit for owner labels referencing any namespace. As a result, cluster-scoped resources generated for Operators managed by OLM are now properly cleaned up when the Operator is uninstalled. (BZ#1834136)
- If an Operator is being upgraded that provides a required API whose group, version, and kind (GVK) has not changed since the previous version of the Operator, and the Operator that depends on the API uses a skipRange instead of the spec.Replaces field, Operator Lifecycle Manager (OLM) fails to generate the "upgraded CSV" with the correct replaces field. Specifically, OLM would:
  - Add the new Operator to the generation and mark the APIs it provides as present.
  - Remove the old Operator from the generation and mark the APIs it provides as absent, despite being provided by the new version of the Operator.
  - Attempt to resolve the missing APIs, overwriting the new version of the Operator with a copy that does not have its spec.Replaces field set.

  This caused certain Operators to fail to upgrade to new versions. This bug fix updates OLM to remove the old Operator from the current generation before adding the new Operator to the generation. As a result, the upgrade succeeds as expected. (BZ#1818788)
- Invalid CatalogSource object configurations were causing a nil-pointer exception and a panic. The catalog-operator pod would crash every time an invalid CatalogSource object was reconciled. This bug fix adds runtime nil checks and CatalogSource object validation. As a result, invalid CatalogSource objects are given a representative condition, and the catalog-operator pod no longer crashes; see the CatalogSource sketch after this list. (BZ#1817833)
- Operator Lifecycle Manager (OLM) allows users to specify volumes and volume mounts using the subscriptionConfig field of a Subscription object; see the Subscription sketch after this list. Using this feature updates the Deployment resource defined in the ClusterServiceVersion resource (CSV). Occasionally, OLM would not have the Subscription object created for a CSV in its cache, and the CSV would be placed in the "installing phase" without creating the Deployment resource with the volumes or volume mounts defined in the Subscription object. OLM would then be unable to move the CSV into the "succeeded phase" because the calculated Deployment resource hash would not equal the actual hash on the Deployment resource. This error would not be resolved because OLM does not update or recreate the Deployment resource in the "installing phase", and the issue would persist until five minutes passed, when OLM would resync CSVs. As a result, OLM would occasionally be delayed while installing CSVs. This bug fix ensures that, if OLM encounters a Deployment resource hash error when installing a CSV, OLM now recreates the Deployment resource. As a result, OLM is no longer delayed by an incorrect Deployment resource hash. (BZ#1826443)
- Previously, Operator Lifecycle Manager (OLM) did not anticipate running multiple APIService resources on a single Deployment resource and only mounted the CA associated with the last APIService resource created by OLM. This caused OLM to be unable to run multiple APIService resources on a single Deployment resource. This bug fix updates OLM to use the same CA for all APIService resources on a single Deployment resource. As a result, OLM can now run multiple APIService resources on a single Deployment resource. (BZ#1805412)
- Previously, Operator Lifecycle Manager (OLM) did not deprecate the v1alpha2 version of the OperatorGroup custom resource definition (CRD) correctly when introducing a structural schema. This caused v1alpha2 OperatorGroup CRDs to no longer be supported and they could not be created. This bug fix reintroduces the v1alpha2 OperatorGroup CRD, and as a result, OLM again supports the v1alpha2 OperatorGroup CRDs. (BZ#1798051)
- The application of a new, non-deterministically resolved set of dependencies was triggered when previously resolved InstallPlan objects no longer contained an equivalent set of manifests. When more than one valid set of dependencies for an Operator existed, this caused an equivalent but distinct resolution to sometimes be applied over an existing one. This bug fix adds a generation field to the status of the InstallPlan object API and increments it upon every resolution, only applying the InstallPlan object with the newest status generation. As a result, only one set of dependencies for an Operator exists on the cluster at a given time. (BZ#1784024)
- The OperatorHub type definition was missing an additional +genclient marker comment required for Kubernetes client generation. This caused the generated client not to be available in the openshift/client-go config client. This bug fix adds the missing +genclient marker comment to the OperatorHub config type, and as a result, the generated client is now available as expected. (BZ#1816483)
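For BZ#1817833, a CatalogSource that passes the added validation looks roughly like this sketch; the name, image, display name, and publisher are placeholders.

```yaml
# Hypothetical CatalogSource; all values are placeholders.
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: example-catalog
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: registry.example.com/example/catalog-index:latest
  displayName: Example Catalog
  publisher: Example Publisher
```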
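For BZ#1826443, the volume and volume mount pass-through is specified on the Subscription object roughly as in the following sketch; the Operator, catalog, config map, and mount path names are hypothetical.

```yaml
# Hypothetical Subscription with a subscription config; names are placeholders.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: example-operator
  namespace: openshift-operators
spec:
  channel: stable
  name: example-operator
  source: example-catalog
  sourceNamespace: openshift-marketplace
  config:
    volumes:
      - name: example-config
        configMap:
          name: example-operator-config
    volumeMounts:
      - name: example-config
        mountPath: /etc/example-operator
```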
openshift-apiserver
- Previously, the OpenShift API server was not available to clients during upgrades, causing failures. Now the OpenShift API server remains available to clients during upgrades. (BZ#1791162)
openshift-controller-manager
- Previously, the client used to create pull secrets for the OpenShift Container Platform internal registry had a low rate limit. If a large number of namespaces were created in a short time window, it would take a long time for image registry pull secrets to be created. The client’s rate limit has been increased, so internal registry pull secrets are now created quickly, even with high traffic. (BZ#1785023)
- Previously, metrics such as workqueue_depth were unavailable in Prometheus metrics. With this bug fix, the missing metrics are now available. (BZ#1825324)
- If an openshift-controller-manager pod failed, no termination message was provided. Now if the pod terminates, a termination message is provided. (BZ#1804432)
- Previously, metrics were not properly registered for the OpenShift Container Platform control plane. With this bug fix, metrics for the control plane are now available. (BZ#1809699)
- Previously, a pull secret for the internal registry could be orphaned when the associated token was deleted. With this bug fix, a reference is created between a pull secret and its token so that a pull secret is no longer orphaned when the associated token is deleted. (BZ#1765294)
- Previously, if OpenShift Container Platform was configured with a global proxy, the proxy was not used when connecting to external image registries. Now when pulling images from an external registry, OpenShift Container Platform uses the cluster-wide proxy configuration. (BZ#1805168)
- Previously, during a deployment rolling update the controller might be unavailable for an excessive amount of time. This bug fix minimizes any delay by allowing the controller to proactively release its lease as pods in the deployment terminate. (BZ#1809719)
- Previously, the openshift-controller-manager-operator might potentially run with access to elevated SELinux privileges. With this bug fix, the correct security context constraints are now applied. (BZ#1806913)
- Previously, during an upgrade the openshift-controller-manager erroneously reported that the Operator had been upgraded and was available. Now, the Operator correctly reports when it is successfully updated. (BZ#1804434)
- During installation or upgrade, the openshift-controller-manager did not correctly report its progress condition. As a result, an installation or upgrade might fail. Now the Operator correctly reports its progress upon a successful installation or upgrade. (BZ#1814446)
- Previously, the image-resolve-plugin did not resolve images if the alpha.image.policy.openshift.io/resolve-names annotation was added after resource creation. The image-resolve-plugin was fixed to resolve images even if the alpha.image.policy.openshift.io/resolve-names annotation is added after resource creation; see the annotation sketch after this list. (BZ#1805155)
- Previously, the Controller Manager Operator did not expose its metrics over an IPv6 cluster. Subsequently, metrics were not being properly scraped, which left users with no way to graph or query performance data. The Controller Manager Operator now properly binds to IPv6 interfaces, so metrics are properly scraped and presented to users. (BZ#1813233)
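For BZ#1805155, the annotation is applied to a workload roughly as in the sketch below; the Deployment name, namespace, labels, and image stream tag are hypothetical. With the fix, adding the annotation after the Deployment already exists still causes the image reference to be resolved.

```yaml
# Hypothetical Deployment using image stream name resolution; names are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
  namespace: example
  annotations:
    alpha.image.policy.openshift.io/resolve-names: '*'
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
        - name: example
          image: example-imagestream:latest   # resolved against the local image stream
```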
Routing
- Previously, service load balancers could not include Azure master nodes, which broke ingress on compact clusters where worker nodes are also master nodes. Azure only allows a node’s network interface card (NIC) to be associated with a single load balancer at any point in time. With this update, the installer was changed to create a unified load balancer and network security group that are used for both the API and services of type LoadBalancer. Now, service load balancers can include master nodes on Azure and ingress works on compact clusters. (BZ#1794839)
- Previously, the openshift-router did not establish a watch of default certificate secret contents if the secret was invalid. Upon starting, the openshift-router failed to read the invalid secret, which must exist for the router pod to start. As a result, the user had to update the invalid secret and delete the current router pods. With this update, the router now watches for any changes in the default certificate secret without requiring deletion of the router pods. If the secret is invalid, the router uses and serves the default router certificate. If the secret is valid, the router serves the default certificate from that secret. (BZ#1820400)
- Previously, the Ingress Operator failed to configure DNS when running in AWS China regions. With this update, the Ingress Operator now detects when it is running in AWS China regions and can configure DNS through the appropriate Route 53 API endpoint. (BZ#1805224)
- Previously, the Ingress Operator continuously upserted DNS records that it managed on Azure and Google Cloud Provider (GCP). With this update, the Ingress Operator avoids upserting a DNS record if the record is already published and neither the record nor the DNS zone configuration has changed since the controller last upserted the record. The Ingress Operator now makes fewer calls to the cloud provider API, which might prevent cloud provider rate limited events in the openshift-ingress namespace. Additionally, the Ingress Operator logs now show fewer upserted DNS record log messages. (BZ#1809354)
- Previously, the ingress-to-route controller used the ingresses resource from the extensions/v1beta1 API group, which was deprecated in Kubernetes 1.18. With this update, the ingress-to-route controller now uses the ingresses resource from the networking.k8s.io/v1beta1 API group. (BZ#1801415)
- Previously, the router was not promoting inactive routes when a conflicting route was deleted. Now when a route is deleted, the router reprocesses all inactive routes and activates routes that no longer conflict with the deleted route. (BZ#1821095)
- Previously, when a service with type LoadBalancer or an Ingress Controller with the LoadBalancerService endpoint publishing strategy type was deleted, the service remained present and in a pending state. The service controller was changed in OpenShift Container Platform 3.10 to prevent unnecessary GetLoadBalancer cloud-provider API calls when non-LoadBalancer services were created or deleted. A subsequent change in Kubernetes 1.15 prevented these unnecessary API calls in a different way. As a result, interaction between these two changes broke the service controller’s clean-up logic for services with type LoadBalancer. With this update, the change added in OpenShift Container Platform 3.10 was removed. Deletion of services with a type LoadBalancer and Ingress Controllers with a LoadBalancerService type can now complete. (BZ#1798282)
- Previously, a confusing LoadBalancerManager status condition reason was set by the Ingress Operator when the endpoint publishing strategy did not include managing a load balancer. When an IngressController resource is configured to use an endpoint publishing strategy type other than LoadBalancerService, the Ingress Operator does not manage a load balancer for that Ingress Controller; see the IngressController sketch after this list. With this update, the LoadBalancerManager status condition more clearly states why the Operator is not managing a load balancer for the Ingress Controller. The message no longer uses phrases such as unsupported or does not support. (BZ#1826113)
- Previously, a Forwarded HTTP header with a non-standard proto-version parameter was added when the Ingress Controller forwarded an HTTP request to an application. As a result, the Forwarded header was not standards-compliant and might have caused problems when applications tried to parse the header value. With this update, the Forwarded header is now standards-compliant and the Ingress Controller does not specify a proto-version parameter in the Forwarded header. (BZ#1803001)
- Previously, Prometheus counters that show the number of active sessions were preserved across router restarts and increased indefinitely. With this update, haproxy_frontend_current_session and haproxy_server_current_session now accurately depict the number of active sessions. The values of these counters are now reset upon router restart. (BZ#1832539)
- If the backing pods of a service exposed via a route were unavailable (for example, crashlooping or deleted), the router responded with a 503 error. Previously, the haproxy_server_http_responses_total metric for that route was no longer available, so monitoring on the route was no longer possible. With this update, all backend metrics are now reported and users can track when no pods are up. (BZ#1835845)
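For BZ#1826113, an IngressController whose endpoint publishing strategy does not involve a managed load balancer can be sketched as follows; only the strategy type matters here, and HostNetwork is used purely as an example of a non-LoadBalancerService type.

```yaml
# Sketch of an IngressController with a non-LoadBalancerService publishing strategy.
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  endpointPublishingStrategy:
    type: HostNetwork   # the Ingress Operator does not manage a load balancer for this type
```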
Samples
- Previous versions of the Samples Operator did not bootstrap as removed on s390x and ppc64le architectures, although samples content had not been made available on those architectures yet. This caused cluster upgrades on s390x and ppc64le architectures to fail because samples content was expected, although it was not available. Now the Samples Operator is forced to upgrade, even if it does not contain the necessary samples content. This fixes the cluster upgrade failures caused by unavailable sample content for the s390x and ppc64le architectures. (BZ#1835112)
- If a sample image stream available in a prior OpenShift Container Platform release was removed in a subsequent release, then during upgrade to that subsequent release, the removed image stream could be incorrectly tracked as needing image stream imports to complete. Since no image stream imports were occurring, samples were not reporting their upgrade as complete. This caused the cluster upgrade to fail. The Samples Operator has been updated to ignore the tracking of image streams that existed in a prior release but not in the release for which the upgrade is intended. Now image streams removed between releases no longer cause the Samples Operator to fail during upgrade. (BZ#1811143)
- Previously, the Samples Operator would send alerts about an invalid configuration or missing image pull secrets when it was bootstrapped as removed. This caused misleading alerts to users, because alerts about invalid configurations and missing image pull secrets should not be sent by the Samples Operator if it is removed. The Samples Operator has been updated to not send alerts related to importing samples when it is bootstrapped as removed. (BZ#1813175)
- Previously, sample templates that were available in a prior release and then removed in a subsequent release might be marked as needing updates. Attempting to update the templates resulted in various errors and failure statuses. These templates were updated to not receive updates after their removal. As a result, removed sample templates do not generate errors or failures. (BZ#1828065)
Storage
- Previously, volumes might have failed to provision in certain Azure regions that were created without proper availability zone support. With this fix, availability zone support is now detected during provisioning to enable volume provisioning in all Azure regions. (BZ#1828174)
- Previously, namespaces would get stuck in Terminating and VolumeSnapshot objects would linger on the cluster when a VolumeSnapshotClass resource was removed before the associated VolumeSnapshot resources, because it was no longer possible to delete the associated resources. With this fix, VolumeSnapshot resource functionality now examines whether the associated VolumeSnapshotClass resource has already been deleted so that VolumeSnapshot resources can be successfully deleted even when no corresponding VolumeSnapshotClass resource exists; see the snapshot sketch after this list. (BZ#1808123)
- Previously, the CSI Snapshot Controller might crash when VolumeSnapshotContent resources were nil. The system now checks to see if the VolumeSnapshotContent resource is nil before it gets used. (BZ#1814280)
- Previously, when upgrading the Local Storage Operator, the associated diskmaker and provisioner pods might both be outdated unless the LocalVolume resource was also modified. With this fix, the daemon set’s hash is included in an annotation. If the hash does not match, the pods are redeployed so that the diskmaker and provisioner pods are now successfully updated when the Local Storage Operator is updated. (BZ#1822213)
- Previously, oc get volumesnapshot would only display the name and creation time of the resource, and not the status. With this fix, oc get volumesnapshot now includes additional details, such as the associated VolumeSnapshotContent resource, the VolumeSnapshot resource source, and other relevant information. (BZ#1800437)
- Previously, oc get volumesnapshotclass would only display the name and creation time of the resource, and not the deletion policy or driver information. With this fix, oc get volumesnapshotclass now includes additional details, such as the associated CSI driver and deletion policy. (BZ#1800470)
- Previously, oc get volumesnapshotcontent would only display the name and creation time of the resource, and not additional relevant information. With this fix, oc get volumesnapshotcontent now includes additional details, such as the associated VolumeSnapshot resource, VolumeSnapshotClass resource, and other relevant information. (BZ#1800477)