5.4. Troubleshooting Operator issues
Operators are a method of packaging, deploying, and managing an OpenShift Container Platform application. They act like an extension of the software vendor’s engineering team, watching over an OpenShift Container Platform environment and using its current state to make decisions in real time. Operators are designed to handle upgrades seamlessly, react to failures automatically, and not take shortcuts, such as skipping a software backup process to save time.
OpenShift Container Platform 4.5 includes a default set of Operators that are required for proper functioning of the cluster. These default Operators are managed by the Cluster Version Operator (CVO).
As a cluster administrator, you can install application Operators from the OperatorHub using the OpenShift Container Platform web console or the CLI. You can then subscribe the Operator to one or more namespaces to make it available for developers on your cluster. Application Operators are managed by Operator Lifecycle Manager (OLM).
If you experience Operator issues, verify Operator subscription status. Check Operator pod health across the cluster and gather Operator logs for diagnosis.
5.4.1. Operator subscription condition types
Subscriptions can report the following condition types:
Condition | Description |
---|---|
| Some or all of the catalog sources to be used in resolution are unhealthy. |
| An install plan for a subscription is missing. |
| An install plan for a subscription is pending installation. |
| An install plan for a subscription has failed. |
Default OpenShift Container Platform cluster Operators are managed by the Cluster Version Operator (CVO) and they do not have a Subscription
object. Application Operators are managed by Operator Lifecycle Manager (OLM) and they have a Subscription
object.
5.4.2. Viewing Operator subscription status using the CLI
You can view Operator subscription status using the CLI.
Prerequisites
-
You have access to the cluster as a user with the
cluster-admin
role. -
You have installed the OpenShift CLI (
oc
).
Procedure
List Operator subscriptions:
$ oc get subs -n <operator_namespace>
Use the
oc describe
command to inspect aSubscription
resource:$ oc describe sub <subscription_name> -n <operator_namespace>
In the command output, find the
Conditions
section for the status of Operator subscription condition types. In the following example, theCatalogSourcesUnhealthy
condition type has a status offalse
because all available catalog sources are healthy:Example output
Conditions: Last Transition Time: 2019-07-29T13:42:57Z Message: all available catalogsources are healthy Reason: AllCatalogSourcesHealthy Status: False Type: CatalogSourcesUnhealthy
Default OpenShift Container Platform cluster Operators are managed by the Cluster Version Operator (CVO) and they do not have a Subscription
object. Application Operators are managed by Operator Lifecycle Manager (OLM) and they have a Subscription
object.
5.4.3. Querying Operator pod status
You can list Operator pods within a cluster and their status. You can also collect a detailed Operator pod summary.
Prerequisites
-
You have access to the cluster as a user with the
cluster-admin
role. - Your API service is still functional.
-
You have installed the OpenShift CLI (
oc
).
Procedure
List Operators running in the cluster. The output includes Operator version, availability, and up-time information:
$ oc get clusteroperators
List Operator pods running in the Operator’s namespace, plus pod status, restarts, and age:
$ oc get pod -n <operator_namespace>
Output a detailed Operator pod summary:
$ oc describe pod <operator_pod_name> -n <operator_namespace>
If an Operator issue is node-specific, query Operator container status on that node.
Start a debug pod for the node:
$ oc debug node/my-node
Set
/host
as the root directory within the debug shell. The debug pod mounts the host’s root file system in/host
within the pod. By changing the root directory to/host
, you can run binaries contained in the host’s executable paths:# chroot /host
注意OpenShift Container Platform 4.5 cluster nodes running Red Hat Enterprise Linux CoreOS (RHCOS) are immutable and rely on Operators to apply cluster changes. Accessing cluster nodes using SSH is not recommended and nodes will be tainted as accessed. However, if the OpenShift Container Platform API is not available, or the kubelet is not properly functioning on the target node,
oc
operations will be impacted. In such situations, it is possible to access nodes usingssh core@<node>.<cluster_name>.<base_domain>
instead.List details about the node’s containers, including state and associated pod IDs:
# crictl ps
List information about a specific Operator container on the node. The following example lists information about the
network-operator
container:# crictl ps --name network-operator
- Exit from the debug shell.
5.4.4. Gathering Operator logs
If you experience Operator issues, you can gather detailed diagnostic information from Operator pod logs.
Prerequisites
-
You have access to the cluster as a user with the
cluster-admin
role. - Your API service is still functional.
-
You have installed the OpenShift CLI (
oc
). - You have the fully qualified domain names of the control plane, or master machines.
Procedure
List the Operator pods that are running in the Operator’s namespace, plus the pod status, restarts, and age:
$ oc get pods -n <operator_namespace>
Review logs for an Operator pod:
$ oc logs pod/<pod_name> -n <operator_namespace>
If an Operator pod has multiple containers, the preceding command will produce an error that includes the name of each container. Query logs from an individual container:
$ oc logs pod/<operator_pod_name> -c <container_name> -n <operator_namespace>
If the API is not functional, review Operator pod and container logs on each master node by using SSH instead. Replace
<master-node>.<cluster_name>.<base_domain>
with appropriate values.List pods on each master node:
$ ssh core@<master-node>.<cluster_name>.<base_domain> sudo crictl pods
For any Operator pods not showing a
Ready
status, inspect the pod’s status in detail. Replace<operator_pod_id>
with the Operator pod’s ID listed in the output of the preceding command:$ ssh core@<master-node>.<cluster_name>.<base_domain> sudo crictl inspectp <operator_pod_id>
List containers related to an Operator pod:
$ ssh core@<master-node>.<cluster_name>.<base_domain> sudo crictl ps --pod=<operator_pod_id>
For any Operator container not showing a
Ready
status, inspect the container’s status in detail. Replace<container_id>
with a container ID listed in the output of the preceding command:$ ssh core@<master-node>.<cluster_name>.<base_domain> sudo crictl inspect <container_id>
Review the logs for any Operator containers not showing a
Ready
status. Replace<container_id>
with a container ID listed in the output of the preceding command:$ ssh core@<master-node>.<cluster_name>.<base_domain> sudo crictl logs -f <container_id>
注意OpenShift Container Platform 4.5 cluster nodes running Red Hat Enterprise Linux CoreOS (RHCOS) are immutable and rely on Operators to apply cluster changes. Accessing cluster nodes using SSH is not recommended and nodes will be tainted as accessed. Before attempting to collect diagnostic data over SSH, review whether the data collected by running
oc adm must gather
and otheroc
commands is sufficient instead. However, if the OpenShift Container Platform API is not available, or the kubelet is not properly functioning on the target node,oc
operations will be impacted. In such situations, it is possible to access nodes usingssh core@<node>.<cluster_name>.<base_domain>
.
5.4.5. Disabling the Machine Config Operator from automatically rebooting
When configuration changes are made by the Machine Config Operator, Red Hat Enterprise Linux CoreOS (RHCOS) must reboot for the changes to take effect. Whether the configuration change is automatic, such as when a kube-apiserver-to-kubelet-signer
CA is rotated, or manual, such as when a registry or SSH key is updated, an RHCOS node reboots automatically unless it is paused.
To avoid unwanted disruptions, you can modify the machine config pool (MCP) to prevent automatic rebooting after the Operator makes changes to the machine config.
Pausing a machine config pool stops all system reboot processes and all configuration changes from being applied.
5.4.5.1. Disabling the Machine Config Operator from automatically rebooting by using the console
To avoid unwanted disruptions from changes made by the Machine Config Operator (MCO), you can use the OpenShift Container Platform web console to modify the machine config pool (MCP) to prevent the MCO from making any changes to nodes in that pool. This prevents any reboots that would normally be part of the MCO update process.
Pausing an MCP stops all updates to your RHCOS nodes, including updates to the operating system, security, certificate, and any other updates related to the machine config. Pausing should be done for short periods of time only.
Prerequisites
-
You have access to the cluster as a user with the
cluster-admin
role.
Procedure
To pause or unpause automatic MCO update rebooting:
Pause the autoreboot process:
-
Log in to the OpenShift Container Platform web console as a user with the
cluster-admin
role. -
Click Compute
Machine Config Pools. - On the Machine Config Pools page, click either master or worker, depending upon which nodes you want to pause rebooting for.
- On the master or worker page, click YAML.
In the YAML, update the
spec.paused
field totrue
.Sample MachineConfigPool object
apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfigPool ... spec: ... paused: true 1
- 1
- Update the
spec.paused
field totrue
to pause rebooting.
To verify that the MCP is paused, return to the Machine Config Pools page.
On the Machine Config Pools page, the Paused column reports True for the MCP you modified.
If the MCP has pending changes while paused, the Updated column is False and Updating is False. When Updated is True and Updating is False, there are no pending changes.
重要If there are pending changes (where both the Updated and Updating columns are False), it is recommended to schedule a maintenance window for a reboot as early as possible. Use the following steps for unpausing the autoreboot process to apply the changes that were queued since the last reboot.
-
Log in to the OpenShift Container Platform web console as a user with the
Unpause the autoreboot process:
-
Log in to the OpenShift Container Platform web console as a user with the
cluster-admin
role. -
Click Compute
Machine Config Pools. - On the Machine Config Pools page, click either master or worker, depending upon which nodes you want to pause rebooting for.
- On the master or worker page, click YAML.
In the YAML, update the
spec.paused
field tofalse
.Sample MachineConfigPool object
apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfigPool ... spec: ... paused: false 1
- 1
- Update the
spec.paused
field tofalse
to allow rebooting.
注意By unpausing an MCP, the MCO applies all paused changes reboots Red Hat Enterprise Linux CoreOS (RHCOS) as needed.
To verify that the MCP is paused, return to the Machine Config Pools page.
On the Machine Config Pools page, the Paused column reports False for the MCP you modified.
If the MCP is applying any pending changes, the Updated column is False and the Updating column is True. When Updated is True and Updating is False, there are no further changes being made.
-
Log in to the OpenShift Container Platform web console as a user with the
5.4.5.2. Disabling the Machine Config Operator from automatically rebooting by using the CLI
To avoid unwanted disruptions from changes made by the Machine Config Operator (MCO), you can modify the machine config pool (MCP) using the OpenShift CLI (oc) to prevent the MCO from making any changes to nodes in that pool. This prevents any reboots that would normally be part of the MCO update process.
Pausing an MCP stops all updates to your RHCOS nodes, including updates to the operating system, security, certificate, as well as any other updates related to the machine config. Pausing should be done for short periods of time only.
Prerequisites
-
You have access to the cluster as a user with the
cluster-admin
role. -
You have installed the OpenShift CLI (
oc
).
Procedure
To pause or unpause automatic MCO update rebooting:
Pause the autoreboot process:
Update the
MachineConfigPool
custom resource to set thespec.paused
field totrue
.Control plane (master) nodes
$ oc patch --type=merge --patch='{"spec":{"paused":true}}' machineconfigpool/master
Worker nodes
$ oc patch --type=merge --patch='{"spec":{"paused":true}}' machineconfigpool/worker
Verify that the MCP is paused:
Control plane (master) nodes
$ oc get machineconfigpool/master --template='{{.spec.paused}}'
Worker nodes
$ oc get machineconfigpool/worker --template='{{.spec.paused}}'
Example output
true
The
spec.paused
field istrue
and the MCP is paused.Determine if the MCP has pending changes:
# oc get machineconfigpool
Example output
NAME CONFIG UPDATED UPDATING master rendered-master-33cf0a1254318755d7b48002c597bf91 True False worker rendered-worker-e405a5bdb0db1295acea08bcca33fa60 False False
If the UPDATED column is False and UPDATING is False, there are pending changes. When UPDATED is True and UPDATING is False, there are no pending changes. In the previous example, the worker node has pending changes. The master node does not have any pending changes.
重要If there are pending changes (where both the Updated and Updating columns are False), it is recommended to schedule a maintenance window for a reboot as early as possible. Use the following steps for unpausing the autoreboot process to apply the changes that were queued since the last reboot.
Unpause the autoreboot process:
Update the
MachineConfigPool
custom resource to set thespec.paused
field tofalse
.Control plane (master) nodes
$ oc patch --type=merge --patch='{"spec":{"paused":false}}' machineconfigpool/master
Worker nodes
$ oc patch --type=merge --patch='{"spec":{"paused":false}}' machineconfigpool/worker
注意By unpausing an MCP, the MCO applies all paused changes and reboots Red Hat Enterprise Linux CoreOS (RHCOS) as needed.
Verify that the MCP is unpaused:
Control plane (master) nodes
$ oc get machineconfigpool/master --template='{{.spec.paused}}'
Worker nodes
$ oc get machineconfigpool/worker --template='{{.spec.paused}}'
Example output
false
The
spec.paused
field isfalse
and the MCP is unpaused.Determine if the MCP has pending changes:
$ oc get machineconfigpool
Example output
NAME CONFIG UPDATED UPDATING master rendered-master-546383f80705bd5aeaba93 True False worker rendered-worker-b4c51bb33ccaae6fc4a6a5 False True
If the MCP is applying any pending changes, the UPDATED column is False and the UPDATING column is True. When UPDATED is True and UPDATING is False, there are no further changes being made. In the previous example, the MCO is updating the worker node.