Chapter 13. Troubleshooting hosted control planes
If you encounter issues with hosted control planes, see the following information to guide you through troubleshooting.
13.1. Gathering information to troubleshoot hosted control planes
When you need to troubleshoot an issue with hosted clusters, you can gather information by running the must-gather command. The command generates output for the management cluster and the hosted cluster.
The output for the management cluster contains the following content:
- Cluster-scoped resources: These resources are node definitions of the management cluster.
- The hypershift-dump compressed file: This file is useful if you need to share the content with other people.
- Namespaced resources: These resources include all of the objects from the relevant namespaces, such as config maps, services, events, and logs.
- Network logs: These logs include the OVN northbound and southbound databases and the status for each one.
- Hosted clusters: This level of output involves all of the resources inside of the hosted cluster.
The output for the hosted cluster contains the following content:
- Cluster-scoped resources: These resources include all of the cluster-wide objects, such as nodes and CRDs.
- Namespaced resources: These resources include all of the objects from the relevant namespaces, such as config maps, services, events, and logs.
Although the output does not contain any secret objects from the cluster, it can contain references to the names of secrets.
Prerequisites
- You must have cluster-admin access to the management cluster.
- You need the name value for the HostedCluster resource and the namespace where the CR is deployed.
- You must have the hcp command-line interface installed. For more information, see "Installing the hosted control planes command-line interface".
- You must have the OpenShift CLI (oc) installed.
- You must ensure that the kubeconfig file is loaded and is pointing to the management cluster.
Procedure
To gather the output for troubleshooting, enter the following command:
$ oc adm must-gather \
  --image=registry.redhat.io/multicluster-engine/must-gather-rhel9:v<mce_version> \
  /usr/bin/gather hosted-cluster-namespace=HOSTEDCLUSTERNAMESPACE \
  hosted-cluster-name=HOSTEDCLUSTERNAME \
  --dest-dir=NAME ; tar -cvzf NAME.tgz NAME
where:
- You replace <mce_version> with the version of multicluster engine Operator that you are using; for example, 2.6.
- The hosted-cluster-namespace=HOSTEDCLUSTERNAMESPACE parameter is optional. If you do not include it, the command runs as though the hosted cluster is in the default namespace, which is clusters.
- If you want to save the results of the command to a compressed file, specify the --dest-dir=NAME parameter and replace NAME with the name of the directory where you want to save the results.
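The must-gather invocation and the compression step can be wrapped in a small shell function. The following is a minimal sketch rather than a supported tool: the function name gather_hosted_cluster and its argument order are assumptions made for illustration, while the oc arguments are the ones shown in the procedure. Unlike the one-line command, the sketch skips the tar step if must-gather fails.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run must-gather for one hosted cluster, then compress
# the results directory.
# Arguments: <mce_version> <hosted_cluster_namespace> <hosted_cluster_name> <dest_dir>
gather_hosted_cluster() {
  local mce_version="$1" hc_namespace="$2" hc_name="$3" dest_dir="$4"

  # Same image, gather script, and parameters as the command in the procedure.
  oc adm must-gather \
    --image="registry.redhat.io/multicluster-engine/must-gather-rhel9:v${mce_version}" \
    /usr/bin/gather \
    hosted-cluster-namespace="${hc_namespace}" \
    hosted-cluster-name="${hc_name}" \
    --dest-dir="${dest_dir}" || return 1

  # Create <dest_dir>.tgz next to the results directory.
  tar -czf "${dest_dir}.tgz" "${dest_dir}"
}
```

For example, gather_hosted_cluster 2.6 clusters my-hosted-cluster must-gather-output writes the results to the must-gather-output directory and to must-gather-output.tgz. The cluster name and directory name in this example are placeholders.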
13.2. Entering the must-gather command in a disconnected environment
Complete the following steps to run the must-gather command in a disconnected environment.
Procedure
- In a disconnected environment, mirror the Red Hat operator catalog images into your mirror registry. For more information, see Install on disconnected networks.
- Run the following command to extract logs. The command references the image from your mirror registry:
$ REGISTRY=registry.example.com:5000
$ IMAGE=$REGISTRY/multicluster-engine/must-gather-rhel8@sha256:ff9f37eb400dc1f7d07a9b6f2da9064992934b69847d17f59e385783c071b9d8
$ oc adm must-gather \
  --image=$IMAGE /usr/bin/gather \
  hosted-cluster-namespace=HOSTEDCLUSTERNAMESPACE \
  hosted-cluster-name=HOSTEDCLUSTERNAME \
  --dest-dir=./data
13.3. Troubleshooting hosted clusters on OpenShift Virtualization
When you troubleshoot a hosted cluster on OpenShift Virtualization, start with the top-level HostedCluster and NodePool resources and then work down the stack until you find the root cause. The following steps can help you discover the root cause of common issues.
13.3.1. HostedCluster resource is stuck in a partial state
If a hosted control plane is not coming fully online because a HostedCluster resource is pending, identify the problem by checking prerequisites, resource conditions, and node and Operator status.
Procedure
- Ensure that you meet all of the prerequisites for a hosted cluster on OpenShift Virtualization.
- View the conditions on the HostedCluster and NodePool resources for validation errors that prevent progress.
- By using the kubeconfig file of the hosted cluster, inspect the status of the hosted cluster:
  - View the output of the oc get clusteroperators command to see which cluster Operators are pending.
  - View the output of the oc get nodes command to ensure that worker nodes are ready.
13.3.2. No worker nodes are registered
If a hosted control plane is not coming fully online because the hosted control plane has no worker nodes registered, identify the problem by checking the status of various parts of the hosted control plane.
Procedure
- View the HostedCluster and NodePool conditions for failures that indicate what the problem might be.
- Enter the following command to view the KubeVirt worker node virtual machine (VM) status for the NodePool resource:
$ oc get vm -n <namespace>
- If the VMs are stuck in the provisioning state, enter the following command to view the CDI import pods within the VM namespace for clues about why the importer pods have not completed:
$ oc get pods -n <namespace> | grep "import"
- If the VMs are stuck in the starting state, enter the following command to view the status of the virt-launcher pods:
$ oc get pods -n <namespace> -l kubevirt.io=virt-launcher
If the virt-launcher pods are in a pending state, investigate why the pods are not being scheduled. For example, not enough resources might exist to run the virt-launcher pods.
- If the VMs are running but they are not registered as worker nodes, use the web console to gain VNC access to one of the affected VMs. The VNC output indicates whether the ignition configuration was applied. If a VM cannot access the hosted control plane ignition server on startup, the VM cannot be provisioned correctly.
- If the ignition configuration was applied but the VM is still not registering as a node, see Identifying the problem: Access the VM console logs to learn how to access the VM console logs during startup.
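The decision path in this procedure can be summarized as a lookup from VM status to the next check. The following sketch is illustrative only: the function name next_step_for_vm_status is hypothetical, and the status strings Provisioning, Starting, and Running are assumed to match the values in the STATUS column of the oc get vm output. The function only prints a suggestion and does not contact the cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the next troubleshooting step for a VM status.
# Arguments: <vm_status> <namespace>
next_step_for_vm_status() {
  local status="$1" namespace="$2"
  case "${status}" in
    Provisioning)
      # Stuck provisioning: look at the CDI importer pods.
      echo "oc get pods -n ${namespace} | grep import" ;;
    Starting)
      # Stuck starting: look at the virt-launcher pods.
      echo "oc get pods -n ${namespace} -l kubevirt.io=virt-launcher" ;;
    Running)
      # Running but not registered as a node: check ignition over VNC.
      echo "open the VNC console of the VM and check whether the ignition configuration was applied" ;;
    *)
      # Any other status: go back to the top-level resources.
      echo "check the HostedCluster and NodePool conditions" ;;
  esac
}
```

For example, next_step_for_vm_status Starting my-vm-namespace prints the virt-launcher command for the my-vm-namespace namespace, which is a placeholder name.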
13.3.3. Worker nodes are stuck in the NotReady state
During cluster creation, nodes enter the NotReady state temporarily while the networking stack is rolled out. This part of the process is normal. However, if this part of the process takes longer than 15 minutes, an issue might have occurred.
Procedure
Identify the problem by investigating the node object and pods:
Enter the following command to view the conditions on the node object and determine why the node is not ready:
$ oc get nodes -o yaml
Enter the following command to look for failing pods within the cluster:
$ oc get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
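The 15-minute threshold can be checked with oc wait instead of polling by hand. The following is a minimal sketch, assuming that the kubeconfig file of the hosted cluster is loaded; the function name check_nodes_ready is hypothetical. If the wait times out, the function lists the pods that are neither running nor succeeded.

```shell
#!/usr/bin/env bash
# Hypothetical helper: wait for all nodes to report the Ready condition.
# Argument: <timeout> (optional, defaults to 15m to mirror the text above)
check_nodes_ready() {
  local timeout="${1:-15m}"
  if oc wait --for=condition=Ready nodes --all --timeout="${timeout}"; then
    echo "all nodes are Ready"
  else
    echo "nodes are still not Ready after ${timeout}; pods that are not healthy:"
    # List pods in all namespaces that are neither Running nor Succeeded.
    oc get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
    return 1
  fi
}
```

Calling check_nodes_ready without an argument waits for up to 15 minutes. A shorter value such as check_nodes_ready 2m is useful when the cluster was created some time ago.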
13.3.4. Ingress and console cluster operators are not coming online
If a hosted control plane is not coming fully online because the Ingress and console cluster Operators are not online, check the wildcard DNS routes and load balancer.
Procedure
If the cluster uses the default Ingress behavior, enter the following command to ensure that wildcard DNS routes are enabled on the OpenShift Container Platform cluster that the virtual machines (VMs) are hosted on:
$ oc patch ingresscontroller -n openshift-ingress-operator \
  default --type=json -p \
  '[{ "op": "add", "path": "/spec/routeAdmission", "value": {"wildcardPolicy": "WildcardsAllowed"}}]'
If you use a custom base domain for the hosted control plane, complete the following steps:
- Ensure that the load balancer is targeting the VM pods correctly.
- Ensure that the wildcard DNS entry is targeting the load balancer IP address.
13.3.5. Load balancer services for the hosted cluster are not available
If a hosted control plane is not coming fully online because the load balancer services are not becoming available, check events, details, and the KubeVirt Cloud Controller Manager (KCCM) pod.
Procedure
- Look for events and details that are associated with the load balancer service within the hosted cluster.
By default, load balancers for the hosted cluster are handled by the kubevirt-cloud-controller-manager within the hosted control plane namespace. Ensure that the KCCM pod is online and view its logs for errors or warnings. To identify the KCCM pod in the hosted control plane namespace, enter the following command:
$ oc get pods -n <hosted_control_plane_namespace> -l app=cloud-controller-manager
13.3.6. Hosted cluster PVCs are not available
If a hosted control plane is not coming fully online because the persistent volume claims (PVCs) for a hosted cluster are not available, check the PVC events and details, and component logs.
Procedure
- Look for events and details that are associated with the PVC to understand which errors are occurring.
If a PVC is failing to attach to a pod, view the logs for the kubevirt-csi-node daemonset component within the hosted cluster to further investigate the problem. To identify the kubevirt-csi-node pods for each node, enter the following command:
$ oc get pods -n openshift-cluster-csi-drivers -o wide -l app=kubevirt-csi-driver
If a PVC cannot bind to a persistent volume (PV), view the logs of the kubevirt-csi-controller component within the hosted control plane namespace. To identify the kubevirt-csi-controller pod within the hosted control plane namespace, enter the following command:
$ oc get pods -n <hosted_control_plane_namespace> -l app=kubevirt-csi-driver
13.3.7. VM nodes are not correctly joining the cluster
If a hosted control plane is not coming fully online because the VM nodes are not correctly joining the cluster, access the VM console logs.
Procedure
- To access the VM console logs, complete the steps in How to get serial console logs for VMs part of OpenShift Virtualization Hosted Control Plane clusters.
13.3.8. RHCOS image mirroring fails
For hosted control planes on OpenShift Virtualization in a disconnected environment, oc-mirror fails to automatically mirror the Red Hat Enterprise Linux CoreOS (RHCOS) image to the internal registry. When you create your first hosted cluster, the KubeVirt virtual machine does not boot, because the boot image is not available in the internal registry.
To resolve this issue, manually mirror the RHCOS image to the internal registry.
Procedure
Get the internal registry name by running the following command:
$ oc get imagecontentsourcepolicy -o json | jq -r '.items[].spec.repositoryDigestMirrors[0].mirrors[0]'
Get a payload image by running the following command:
$ oc get clusterversion version -ojsonpath='{.status.desired.image}'
Extract the 0000_50_installer_coreos-bootimages.yaml file that contains boot images from your payload image on the hosted cluster. Replace <payload_image> with the name of your payload image. Run the following command:
$ oc image extract --file /release-manifests/0000_50_installer_coreos-bootimages.yaml <payload_image> --confirm
Get the RHCOS image by running the following command:
$ cat 0000_50_installer_coreos-bootimages.yaml | yq -r .data.stream | jq -r '.architectures.x86_64.images.kubevirt."digest-ref"'
Mirror the RHCOS image to your internal registry. Replace <rhcos_image> with your RHCOS image; for example, quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d9643ead36b1c026be664c9c65c11433c6cdf71bfd93ba229141d134a4a6dd94. Replace <internal_registry> with the name of your internal registry; for example, virthost.ostest.test.metalkube.org:5000/localimages/ocp-v4.0-art-dev. Run the following command:
$ oc image mirror <rhcos_image> <internal_registry>
Create a YAML file named rhcos-boot-kubevirt.yaml that defines the ImageDigestMirrorSet object. See the following example configuration:
apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  name: rhcos-boot-kubevirt
spec:
  repositoryDigestMirrors:
    - mirrors:
        - <rhcos_image_no_digest>
      source: virthost.ostest.test.metalkube.org:5000/localimages/ocp-v4.0-art-dev
Apply the rhcos-boot-kubevirt.yaml file to create the ImageDigestMirrorSet object by running the following command:
$ oc apply -f rhcos-boot-kubevirt.yaml
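The example configuration uses a <rhcos_image_no_digest> placeholder that the text does not define. Assuming that it means the RHCOS image reference with its @sha256:... digest removed, the value can be derived with shell parameter expansion. The function name strip_digest is hypothetical; the image reference is the example value from the mirroring step.

```shell
#!/usr/bin/env bash
# Hypothetical helper: remove the "@<digest>" suffix from an image reference.
# A reference without a digest is printed unchanged.
strip_digest() {
  printf '%s\n' "${1%@*}"
}

rhcos_image="quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d9643ead36b1c026be664c9c65c11433c6cdf71bfd93ba229141d134a4a6dd94"
strip_digest "${rhcos_image}"
# prints: quay.io/openshift-release-dev/ocp-v4.0-art-dev
```

The expansion removes the shortest suffix that starts with the @ character, so a registry port such as :5000 in the host name is left intact.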
13.3.9. Return non-bare-metal clusters to the late binding pool
If you are using late binding managed clusters without BareMetalHosts, you must complete additional manual steps to delete a late binding cluster and return the nodes back to the Discovery ISO.
For late binding managed clusters without BareMetalHosts, removing cluster information does not automatically return all nodes to the Discovery ISO.
Procedure
To unbind the non-bare-metal nodes with late binding, complete the following steps:
- Remove the cluster information. For more information, see Removing a cluster from management.
- Clean the root disks.
- Reboot manually with the Discovery ISO.
13.4. Restarting hosted control plane components
If you are an administrator for hosted control planes, you can use the hypershift.openshift.io/restart-date annotation to restart all control plane components for a particular HostedCluster resource. For example, you might need to restart control plane components for certificate rotation.
Procedure
To restart a control plane, annotate the HostedCluster resource by entering the following command:
$ oc annotate hostedcluster \
  -n <hosted_cluster_namespace> \
  <hosted_cluster_name> \
  hypershift.openshift.io/restart-date=$(date --iso-8601=seconds)
Verification
The control plane is restarted whenever the value of the annotation changes. The date command in the example serves as the source of a unique string. The annotation is treated as a string, not a timestamp.
The following components are restarted:
- catalog-operator
- certified-operators-catalog
- cluster-api
- cluster-autoscaler
- cluster-policy-controller
- cluster-version-operator
- community-operators-catalog
- control-plane-operator
- hosted-cluster-config-operator
- ignition-server
- ingress-operator
- konnectivity-agent
- konnectivity-server
- kube-apiserver
- kube-controller-manager
- kube-scheduler
- machine-approver
- oauth-openshift
- olm-operator
- openshift-apiserver
- openshift-controller-manager
- openshift-oauth-apiserver
- packageserver
- redhat-marketplace-catalog
- redhat-operators-catalog
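Because the annotation value only needs to change, a repeated restart has to replace the existing value. The following sketch wraps the annotate command for that case. The function name restart_control_plane is hypothetical. The --overwrite flag is a standard oc annotate option that permits replacing an annotation that is already set, and the date format string is used in place of --iso-8601, which is a GNU coreutils option.

```shell
#!/usr/bin/env bash
# Hypothetical helper: trigger a restart of the control plane components.
# Arguments: <hosted_cluster_namespace> <hosted_cluster_name>
restart_control_plane() {
  local hc_namespace="$1" hc_name="$2"
  local stamp

  # Any unique string works; a UTC time is convenient and readable.
  stamp="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

  # --overwrite lets the command succeed when the annotation already exists.
  oc annotate hostedcluster -n "${hc_namespace}" "${hc_name}" \
    "hypershift.openshift.io/restart-date=${stamp}" --overwrite
}
```

For example, restart_control_plane clusters my-hosted-cluster sets the annotation on a hosted cluster named my-hosted-cluster, which is a placeholder name.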
13.5. Pausing the reconciliation of a hosted cluster and hosted control plane
If you are a cluster instance administrator, you can pause the reconciliation of a hosted cluster and hosted control plane. You might want to pause reconciliation when you back up and restore an etcd database or when you need to debug problems with a hosted cluster or hosted control plane.
Procedure
To pause reconciliation for a hosted cluster and hosted control plane, populate the pausedUntil field of the HostedCluster resource.
- To pause the reconciliation until a specific time, enter the following command:
$ oc patch -n <hosted_cluster_namespace> \
  hostedclusters/<hosted_cluster_name> \
  -p '{"spec":{"pausedUntil":"<timestamp>"}}' \
  --type=merge
Replace <timestamp> with a timestamp in the RFC 3339 format, for example, 2024-03-03T03:28:48Z. The reconciliation is paused until the specified time has passed.
- To pause the reconciliation indefinitely, enter the following command:
$ oc patch -n <hosted_cluster_namespace> \
  hostedclusters/<hosted_cluster_name> \
  -p '{"spec":{"pausedUntil":"true"}}' \
  --type=merge
The reconciliation is paused until you remove the field from the HostedCluster resource.
When the pause reconciliation field is populated for the HostedCluster resource, the field is automatically added to the associated HostedControlPlane resource.
- To remove the pausedUntil field, enter the following patch command:
$ oc patch -n <hosted_cluster_namespace> \
  hostedclusters/<hosted_cluster_name> \
  -p '{"spec":{"pausedUntil":null}}' \
  --type=merge
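When you pause until a specific time, the timestamp can be generated instead of typed by hand. The following is a minimal sketch, assuming GNU date for the -d option; the function names rfc3339_in_hours and pause_reconciliation_for_hours are hypothetical. The patch body is the same merge patch that the procedure uses.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a UTC timestamp in RFC 3339 format that lies
# the given number of hours in the future. Assumes GNU date.
rfc3339_in_hours() {
  date -u -d "+${1} hours" +%Y-%m-%dT%H:%M:%SZ
}

# Hypothetical helper: pause reconciliation for a number of hours.
# Arguments: <hosted_cluster_namespace> <hosted_cluster_name> <hours>
pause_reconciliation_for_hours() {
  local hc_namespace="$1" hc_name="$2" hours="$3"
  local paused_until
  paused_until="$(rfc3339_in_hours "${hours}")"

  # Same merge patch as in the procedure, with the generated timestamp.
  oc patch -n "${hc_namespace}" "hostedclusters/${hc_name}" \
    -p "{\"spec\":{\"pausedUntil\":\"${paused_until}\"}}" \
    --type=merge
}
```

For example, pause_reconciliation_for_hours clusters my-hosted-cluster 2 pauses reconciliation for about two hours for a hosted cluster named my-hosted-cluster, which is a placeholder name. Reconciliation resumes after that time without a second patch.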
13.6. Scaling down the data plane to zero
If you are not using the hosted control plane, you can scale down the data plane to zero to save resources and cost.
Ensure that you are prepared to scale down the data plane to zero, because the workload from the worker nodes disappears after scaling down.
Procedure
- Set the kubeconfig file to access the hosted cluster by running the following command:
$ export KUBECONFIG=<install_directory>/auth/kubeconfig
- Get the name of the NodePool resource associated with your hosted cluster by running the following command:
$ oc get nodepool --namespace <hosted_cluster_namespace>
- Optional: To prevent the pods from draining, add the nodeDrainTimeout field in the NodePool resource by running the following command:
$ oc edit nodepool <nodepool_name> --namespace <hosted_cluster_namespace>
Example output
apiVersion: hypershift.openshift.io/v1alpha1
kind: NodePool
metadata:
# ...
  name: nodepool-1
  namespace: clusters
# ...
spec:
  arch: amd64
  clusterName: clustername
  management:
    autoRepair: false
    replace:
      rollingUpdate:
        maxSurge: 1
        maxUnavailable: 0
      strategy: RollingUpdate
    upgradeType: Replace
  nodeDrainTimeout: 0s
# ...
Note: To allow the node draining process to continue for a certain period of time, you can set the value of the nodeDrainTimeout field accordingly, for example, nodeDrainTimeout: 1m.
- Scale down the NodePool resource associated with your hosted cluster by running the following command:
$ oc scale nodepool/<nodepool_name> --namespace <hosted_cluster_namespace> --replicas=0
Note: After scaling down the data plane to zero, some pods in the control plane stay in the Pending status and the hosted control plane stays up and running. If necessary, you can scale up the NodePool resource.
- Optional: Scale up the NodePool resource associated with your hosted cluster by running the following command:
$ oc scale nodepool/<nodepool_name> --namespace <hosted_cluster_namespace> --replicas=1
After rescaling the NodePool resource, wait for a couple of minutes for the NodePool resource to become available in a Ready state.
Verification
Verify that the value for the nodeDrainTimeout field is greater than 0s by running the following command:
$ oc get nodepool -n <hosted_cluster_namespace> <nodepool_name> -ojsonpath='{.spec.nodeDrainTimeout}'