1.7. Known issues
- When powering on a virtual machine on vSphere with user-provisioned infrastructure, the process of scaling up a node might not work as expected. A known issue in the hypervisor configuration causes machines to be created within the hypervisor but not powered on. If a node appears to be stuck in the Provisioning state after scaling up a machine set, you can investigate the status of the virtual machine in the vSphere instance itself. Use the VMware commands govc tasks and govc events to determine the status of the virtual machine. Check for an error message similar to the following:

  [Invalid memory setting: memory reservation (sched.mem.min) should be equal to memsize(8192).]

  You can attempt to resolve the issue with the steps in this VMware KBase article. For more information, see the Red Hat Knowledgebase solution [UPI vSphere] Node scale-up doesn’t work as expected. (BZ#1918383)
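  For example, assuming the govc CLI is already configured for the vSphere instance (for example, through the GOVC_URL, GOVC_USERNAME, and GOVC_PASSWORD environment variables) and that you know the inventory path of the stuck virtual machine, a minimal check might look like the following; <datacenter> and <vm-name> are placeholders:

  # List recent tasks in the vSphere instance to see whether the clone or
  # power-on task for the new machine failed.
  $ govc tasks

  # Show recent events for the virtual machine that is stuck in Provisioning.
  # Replace <datacenter> and <vm-name> with values from your environment.
  $ govc events /<datacenter>/vm/<vm-name>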
- If your internal Elasticsearch instance uses persistent volume claims (PVCs), the PVCs must contain a logging-cluster:elasticsearch label. Without the label, the garbage collection process removes those PVCs during the upgrade and the Elasticsearch Operator creates new PVCs. If you are updating from an OpenShift Container Platform version prior to version 4.4.30, you must manually add the label to the Elasticsearch PVCs. Starting with OpenShift Container Platform 4.4.30, the Elasticsearch Operator adds the label to the PVCs automatically.
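  For example, assuming the internal Elasticsearch instance runs in the openshift-logging namespace (an assumption; adjust the namespace for your deployment), the label can be added manually before upgrading; <pvc-name> is a placeholder:

  # List the PVCs used by the internal Elasticsearch instance.
  $ oc -n openshift-logging get pvc

  # Add the required label to each Elasticsearch PVC.
  $ oc -n openshift-logging label pvc <pvc-name> logging-cluster=elasticsearch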
- When upgrading to a new OpenShift Container Platform z-stream release, connectivity to the API server might be interrupted as nodes are upgraded, causing API requests to fail. (BZ#1845411)
- When upgrading to a new OpenShift Container Platform z-stream release, connectivity to routers might be interrupted as router pods are updated. For the duration of the upgrade, some applications might not be consistently reachable. (BZ#1809665)
- When upgrading to a new OpenShift Container Platform release with the default CNI network provider set to OVN-Kubernetes, the upgrade fails and the cluster is left in an unusable state. (BZ#1854175)
- Because the ImageContentSourcePolicy for image registry pull-through is not yet supported, the deployment pod cannot mirror images by using a digest ID if the image stream has the pull-through policy enabled. In this case, an ImagePullBackOff error displays. (BZ#1787112)
- If you scale up with a RHEL worker while running a cluster on RHOSP that uses user-provisioned infrastructure, all routes are inaccessible if the Ingress port VIP is on the RHEL worker. As a workaround, you must reschedule the router pod to an RHCOS node so that the Ingress VIP migrates to the RHCOS node. To do this, add the node.openshift.io/os_id: rhcos label to the Ingress Controller before upgrading:

  $ oc -n openshift-ingress-operator edit ingresscontroller/default -o yaml

  spec:
    nodePlacement:
      nodeSelector:
        matchLabels:
          kubernetes.io/os: linux
          node-role.kubernetes.io/worker: ""
          node.openshift.io/os_id: rhcos
- The Che Workspace Operator was updated to use the DevWorkspace custom resource instead of the Workspace custom resource. However, the OpenShift web terminal continues to use the Workspace custom resource. Because of this, the OpenShift web terminal fails to work with the latest version of the Che Workspace Operator. (BZ#1846969)
- A basic-user is unable to view the Dashboard and Metrics tabs in the Monitoring view of the Developer perspective. (BZ#1846409)
- In the Topology view, when you right-click a Knative service, the Edit Application Grouping option is displayed twice in the context menu. (BZ#1849107)
- The Special Resources Operator (SRO) cannot be deployed successfully on OpenShift Container Platform 4.5. This prevents the deployment of NVIDIA drivers, which are required by the cluster to run workloads requiring GPU resources. Also, the Topology Manager feature could not be tested with GPU resources as a result of this known issue. (BZ#1847805)
- The web console includes the option to create VM vNICS with a SLIRP binding, but this is not supported. Attempting to use this option will cause the VM to fail to boot. Do not select this option. (BZ#1828744)
- There is an issue where pods that use the OpenShift SDN default CNI network provider on a node can lose network communication, causing the pods to crash. This can sometimes happen when upgrading a cluster. As a workaround, you can delete and re-create the affected pods. (BZ#1855118)
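  For example, if the affected pods are managed by a controller such as a deployment or daemon set, deleting them is enough, because replacements are created automatically; the names below are placeholders:

  # Delete the affected pod; its controller re-creates it.
  $ oc -n <namespace> delete pod <pod-name>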
- There is a known issue where custom pools are not supported on master nodes. The oc label node command applies the new custom role to the target master node, but the Machine Config Operator does not apply changes specific to the custom pool. This results in an error, which can be seen in the Machine Config Controller pod logs. As a workaround, to ensure that control plane nodes remain stable, do not apply multiple roles to master nodes. (BZ#1797687)
- The logging performance for clusters is degraded compared to past versions of OpenShift Container Platform. This is being actively investigated and will be updated in a future release of OpenShift Container Platform. (BZ#1833486)
- You might receive a message that the system is unable to mount a volume when the volume contains a large number of files. This can happen when a pod mounts a volume that is set with the FSGroup SecurityContext setting, because the GID ownership of the files must be recursively updated for all files on the volume. Users should expect that pods using volumes with a large number of files and the FSGroup SecurityContext setting can take considerable time to start. (BZ#1515907)
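  For illustration, the following minimal pod spec is the kind of configuration affected: it sets fsGroup in the pod security context, so the group ownership of every file on the mounted volume is updated before the pod starts. All names are placeholders:

  apiVersion: v1
  kind: Pod
  metadata:
    name: fsgroup-example
  spec:
    securityContext:
      fsGroup: 2000                    # group ownership of all files on the volume is set to GID 2000
    containers:
    - name: app
      image: registry.example.com/app:latest
      volumeMounts:
      - name: data
        mountPath: /data
    volumes:
    - name: data
      persistentVolumeClaim:
        claimName: large-data-pvc      # a volume with many files delays pod startup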
- Running pods with frequent probes can cause the number of conmon processes to grow quickly. A conmon process is a program that detaches from its parent, CRI-O, and is used to exec the container runtime. If the probes happen frequently enough, systemd has trouble reaping all of its new children, and some conmon processes can become zombies. (BZ#1852064)
- On Microsoft Azure when upgrading from 4.4 to 4.5, the Ingress Operator can fail to ensure a DNSRecord due to errors refreshing the token. Restarting the Ingress Operator fixes the issue. (BZ#1854383)
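  For example, one way to restart the Ingress Operator is to delete its pod so that the operator deployment re-creates it. The namespace is standard, but the label selector below is an assumption; if it does not match, find the pod with oc get pods and delete it by name:

  # Delete the Ingress Operator pod; its deployment re-creates it.
  $ oc -n openshift-ingress-operator delete pod -l name=ingress-operator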
- When running OpenShift Container Platform on Azure with installer-provisioned infrastructure, there is a known issue where oc commands fail intermittently with TLS handshake timeout errors. (BZ#1851549)
- For clusters on VMware vSphere instances using installer-provisioned infrastructure, bootstrap workers fail because the default resource pool resolves to multiple instances. (BZ#1852545)
- There is an issue where the Machine Config Operator (MCO) becomes degraded during the installation of an OpenShift Container Platform cluster. This is caused by a machine config ordering problem during the bootstrap process. As a workaround, you must prefix any custom machine config file with a priority of 98- instead of 99-. (BZ#1826150)
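  For example, a custom machine config that would ordinarily be named with a 99- prefix can be renamed to use the 98- prefix instead. The file name, object name, and role below are illustrative placeholders:

  # 98-worker-custom.yaml
  apiVersion: machineconfiguration.openshift.io/v1
  kind: MachineConfig
  metadata:
    name: 98-worker-custom             # the 98- prefix applies the workaround
    labels:
      machineconfiguration.openshift.io/role: worker
  spec:
    config:
      ignition:
        version: 2.2.0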
- Git clone operations that go through an HTTPS proxy fail. Non-TLS (HTTP) proxies can be used successfully. (BZ#1750650)
- Git clone operations fail in builds running behind a proxy if the source URIs use the git:// or ssh:// scheme. (BZ#1751738)
- An issue has been found for the s390x and ppc64le architectures that renders nodes unavailable for workloads after a forced reboot or power down. Do not force reboot or power down nodes.
If a forced reboot or power down is unavoidable and a node that comes back up is unavailable for workloads:
  - SSH into the node.
  - Stop the CRI-O and kubelet services.
  - Run the command rm -rf /var/lib/containers.
  - Restart the CRI-O and kubelet services.
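  A minimal sketch of these recovery steps, run as the root user on the affected node and assuming the standard crio and kubelet systemd service names:

  # Stop the kubelet and the CRI-O container runtime.
  $ systemctl stop kubelet crio

  # Remove the local container storage left in an inconsistent state by the
  # forced reboot or power down.
  $ rm -rf /var/lib/containers

  # Restart the services; the node pulls images again and rejoins the cluster.
  $ systemctl start crio kubelet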
- If an AWS account is configured to use AWS Organizations service control policies (SCPs) that use a global condition to deny all actions or require a specific permission, OpenShift Container Platform AWS installations fail, even if the provided credentials have the required permissions for installation. A workaround for this issue is introduced in OpenShift Container Platform 4.5.8. (BZ#1829101)
- In OpenShift Container Platform 4.1, anonymous users could access discovery endpoints. Later releases revoked this access to reduce the possible attack surface for security exploits because some discovery endpoints are forwarded to aggregated API servers. However, unauthenticated access is preserved in upgraded clusters so that existing use cases are not broken.
If you are a cluster administrator for a cluster that has been upgraded from OpenShift Container Platform 4.1 to 4.5, you can either revoke or continue to allow unauthenticated access. It is recommended to revoke unauthenticated access unless there is a specific need for it. If you do continue to allow unauthenticated access, be aware of the increased risks.
Warning: If you have applications that rely on unauthenticated access, they might receive HTTP 403 errors if you revoke unauthenticated access.

Use the following script to revoke unauthenticated access to discovery endpoints:
  ## Snippet to remove unauthenticated group from all the cluster role bindings
  $ for clusterrolebinding in cluster-status-binding discovery system:basic-user system:discovery system:openshift:discovery ;
  do
    ### Find the index of unauthenticated group in list of subjects
    index=$(oc get clusterrolebinding ${clusterrolebinding} -o json | jq 'select(.subjects!=null) | .subjects | map(.name=="system:unauthenticated") | index(true)');
    ### Remove the element at index from subjects array
    oc patch clusterrolebinding ${clusterrolebinding} --type=json --patch "[{'op': 'remove','path': '/subjects/$index'}]";
  done
This script removes unauthenticated subjects from the following cluster role bindings:
  - cluster-status-binding
  - discovery
  - system:basic-user
  - system:discovery
  - system:openshift:discovery
- The oc annotate command does not work for LDAP group names that contain an equal sign (=), because the command uses the equal sign as a delimiter between the annotation name and value. As a workaround, use oc patch or oc edit to add the annotation. (BZ#1917280)
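  For example, assuming a synced LDAP group whose name contains an equal sign, a merge patch avoids the key=value parsing that breaks oc annotate; the group name, annotation key, and value below are placeholders:

  # Add an annotation with oc patch instead of oc annotate.
  $ oc patch group '<group-name>' --type=merge \
      -p '{"metadata":{"annotations":{"example.com/description":"synced from LDAP"}}}'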