Ce contenu n'est pas disponible dans la langue sélectionnée.
Chapter 9. Troubleshooting hosted control planes
If you encounter issues with hosted control planes, see the following information to guide you through troubleshooting.
9.1. Gathering information to troubleshoot hosted control planes
When you need to troubleshoot an issue with hosted control plane clusters, you can gather information by running the hypershift dump cluster
command. The command generates output for the management cluster and the hosted cluster.
The output for the management cluster contains the following content:
- Cluster-scoped resources: These resources are node definitions of the management cluster.
-
The
hypershift-dump
compressed file: This file is useful if you need to share the content with other people. - Namespaced resources: These resources include all of the objects from the relevant namespaces, such as config maps, services, events, and logs.
- Network logs: These logs include the OVN northbound and southbound databases and the status for each one.
- Hosted clusters: This level of output involves all of the resources inside of the hosted cluster.
The output for the hosted cluster contains the following content:
- Cluster-scoped resources: These resources include all of the cluster-wide objects, such as nodes and CRDs.
- Namespaced resources: These resources include all of the objects from the relevant namespaces, such as config maps, services, events, and logs.
Although the output does not contain any secret objects from the cluster, it can contain references to the names of secrets.
Prerequisites
-
You must have
cluster-admin
access to the management cluster. -
You need the
name
value for theHostedCluster
resource and the namespace where the CR is deployed. -
You must have the
hcp
command line interface installed. For more information, see Installing the hosted control planes command line interface. -
You must have the OpenShift CLI (
oc
) installed. -
You must ensure that the
kubeconfig
file is loaded and is pointing to the management cluster.
Procedure
To gather the output for troubleshooting, enter the following command:
$ hypershift dump cluster \ --name <hosted_cluster_name> \1 --namespace <hosted_cluster_namespace> \ 2 --dump-guest-cluster \ --artifact-dir clusterDump-<hosted_cluster_namespace>-<hosted_cluster_name>
Example output
2023-06-06T12:18:20+02:00 INFO Archiving dump {"command": "tar", "args": ["-cvzf", "hypershift-dump.tar.gz", "cluster-scoped-resources", "event-filter.html", "namespaces", "network_logs", "timestamp"]} 2023-06-06T12:18:21+02:00 INFO Successfully archived dump {"duration": "1.519376292s"}
To configure the command-line interface so that it impersonates all of the queries against the management cluster by using a username or service account, enter the
hypershift dump cluster
command with the--as
flag.The service account must have enough permissions to query all of the objects from the namespaces, so the
cluster-admin
role is recommended to make sure you have enough permissions. The service account must be located in or have permissions to query the namespace of theHostedControlPlane
resource.If your username or service account does not have enough permissions, the output contains only the objects that you have permissions to access. During that process, you might see
forbidden
errors.To use impersonation by using a service account, enter the following command:
$ hypershift dump cluster \ --name <hosted_cluster_name> \1 --namespace <hosted_cluster_namespace> \2 --dump-guest-cluster \ --as "system:serviceaccount:<service_account_namespace>:<service_account_name>" \ 3 --artifact-dir clusterDump-<hosted_cluster_namespace>-<hosted_cluster_name>
To use impersonation by using a username, enter the following command:
$ hypershift dump cluster \ --name <hosted_cluster_name> \1 --namespace <hosted_cluster_namespace> \2 --dump-guest-cluster \ --as "<cluster_user_name>" \ 3 --artifact-dir clusterDump-<hosted_cluster_namespace>-<hosted_cluster_name>
9.2. Restarting hosted control plane components
If you are an administrator for hosted control planes, you can use the hypershift.openshift.io/restart-date
annotation to restart all control plane components for a particular HostedCluster
resource. For example, you might need to restart control plane components for certificate rotation.
Procedure
To restart a control plane, annotate the HostedCluster
resource by entering the following command:
$ oc annotate hostedcluster -n <hosted_cluster_namespace> <hosted_cluster_name> hypershift.openshift.io/restart-date=$(date --iso-8601=seconds)
Verification
The control plane is restarted whenever the value of the anonotation changes. The date
command in the example serves as the source of a unique string. The annotation is treated as a string, not a timestamp.
The following components are restarted:
- catalog-operator
- certified-operators-catalog
- cluster-api
- cluster-autoscaler
- cluster-policy-controller
- cluster-version-operator
- community-operators-catalog
- control-plane-operator
- hosted-cluster-config-operator
- ignition-server
- ingress-operator
- konnectivity-agent
- konnectivity-server
- kube-apiserver
- kube-controller-manager
- kube-scheduler
- machine-approver
- oauth-openshift
- olm-operator
- openshift-apiserver
- openshift-controller-manager
- openshift-oauth-apiserver
- packageserver
- redhat-marketplace-catalog
- redhat-operators-catalog
9.3. Pausing the reconciliation of a hosted cluster and hosted control plane
If you are a cluster instance administrator, you can pause the reconciliation of a hosted cluster and hosted control plane. You might want to pause reconciliation when you back up and restore an etcd database or when you need to debug problems with a hosted cluster or hosted control plane.
Procedure
To pause reconciliation for a hosted cluster and hosted control plane, populate the
pausedUntil
field of theHostedCluster
resource.To pause the reconciliation until a specific time, enter the following command:
$ oc patch -n <hosted_cluster_namespace> hostedclusters/<hosted_cluster_name> -p '{"spec":{"pausedUntil":"<timestamp>"}}' --type=merge 1
- 1
- Specify a timestamp in the RFC339 format, for example,
2024-03-03T03:28:48Z
. The reconciliation is paused until the specified time is passed.
To pause the reconciliation indefinitely, enter the following command:
$ oc patch -n <hosted_cluster_namespace> hostedclusters/<hosted_cluster_name> -p '{"spec":{"pausedUntil":"true"}}' --type=merge
The reconciliation is paused until you remove the field from the
HostedCluster
resource.When the pause reconciliation field is populated for the
HostedCluster
resource, the field is automatically added to the associatedHostedControlPlane
resource.
To remove the
pausedUntil
field, enter the following patch command:$ oc patch -n <hosted_cluster_namespace> hostedclusters/<hosted_cluster_name> -p '{"spec":{"pausedUntil":null}}' --type=merge
9.4. Scaling down the data plane to zero
If you are not using the hosted control plane, to save the resources and cost you can scale down a data plane to zero.
Ensure you are prepared to scale down the data plane to zero. Because the workload from the worker nodes disappears after scaling down.
Procedure
Set the
kubeconfig
file to access the hosted cluster by running the following command:$ export KUBECONFIG=<install_directory>/auth/kubeconfig
Get the name of the
NodePool
resource associated to your hosted cluster by running the following command:$ oc get nodepool --namespace <HOSTED_CLUSTER_NAMESPACE>
Optional: To prevent the pods from draining, add the
nodeDrainTimeout
field in theNodePool
resource by running the following command:$ oc edit NodePool <nodepool> -o yaml --namespace <HOSTED_CLUSTER_NAMESPACE>
Example output
apiVersion: hypershift.openshift.io/v1alpha1 kind: NodePool metadata: # ... name: nodepool-1 namespace: clusters # ... spec: arch: amd64 clusterName: clustername 1 management: autoRepair: false replace: rollingUpdate: maxSurge: 1 maxUnavailable: 0 strategy: RollingUpdate upgradeType: Replace nodeDrainTimeout: 0s 2 # ...
NoteTo allow the node draining process to continue for a certain period of time, you can set the value of the
nodeDrainTimeout
field accordingly, for example,nodeDrainTimeout: 1m
.Scale down the
NodePool
resource associated to your hosted cluster by running the following command:$ oc scale nodepool/<NODEPOOL_NAME> --namespace <HOSTED_CLUSTER_NAMESPACE> --replicas=0
NoteAfter scaling down the data plan to zero, some pods in the control plane stay in the
Pending
status and the hosted control plane stays up and running. If necessary, you can scale up theNodePool
resource.Optional: Scale up the
NodePool
resource associated to your hosted cluster by running the following command:$ oc scale nodepool/<NODEPOOL_NAME> --namespace <HOSTED_CLUSTER_NAMESPACE> --replicas=1
After rescaling the
NodePool
resource, wait for couple of minutes for theNodePool
resource to become available in aReady
state.
Verification
Verify that the value for the
nodeDrainTimeout
field is greater than0s
by running the following command:$ oc get nodepool -n <hosted_cluster_namespace> <nodepool_name> -ojsonpath='{.spec.nodeDrainTimeout}'
Additional resources