Chapter 40. Diagnostics Tool
40.1. Overview
The oc adm diagnostics command runs a series of checks for error conditions in the host or cluster. Specifically, it:
- Verifies that the default registry and router are running and correctly configured.
- Checks ClusterRoleBindings and ClusterRoles for consistency with base policy.
- Checks that all of the client configuration contexts are valid and can be connected to.
- Checks that SkyDNS is working properly and the pods have SDN connectivity.
- Validates master and node configuration on the host.
- Checks that nodes are running and available.
- Analyzes host logs for known errors.
- Checks that systemd units are configured as expected for the host.
40.2. Using the Diagnostics Tool
You can deploy OpenShift Container Platform in several ways. These include:
- Built from source
- Included within a VM image
- As a container image
- Using enterprise RPMs
Each method is suited for a different configuration and environment. To minimize environment assumptions, the diagnostics tool is included with the openshift binary to provide diagnostics within an OpenShift Container Platform server or client.
To use the diagnostics tool, preferably on a master host and as cluster administrator, run:
# oc adm diagnostics
This runs all available diagnostics and skips any that do not apply to the environment.
You can also run one or more specific diagnostics by name as you work to address issues. For example:
$ oc adm diagnostics <name1> <name2>
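For instance, to run only the checks that require no configuration files, you might invoke the log and systemd unit diagnostics together (the names used here are taken from the table below):
$ oc adm diagnostics AnalyzeLogs UnitStatus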
The options for the diagnostics tool require working configuration files. For example, the NodeConfigCheck does not run unless a node configuration is available.
The diagnostics tool uses the standard configuration file locations by default:
Client:
- As indicated by the $KUBECONFIG environment variable
- ~/.kube/config file
Master:
- /etc/origin/master/master-config.yaml
Node:
- /etc/origin/node/node-config.yaml
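Before running the tool, you can optionally confirm which client configuration is in effect and that the host configuration files exist. This is an ordinary sanity check, not part of the diagnostics tool itself:
$ oc config view --minify
$ ls /etc/origin/master/master-config.yaml /etc/origin/node/node-config.yaml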
You can specify non-standard locations with the --config, --master-config, and --node-config options. If a configuration file is not specified, related diagnostics are skipped.
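For example, to run only the master configuration check (named in the table below) against a configuration stored at a hypothetical non-standard path:
$ oc adm diagnostics MasterConfigCheck --master-config=/custom/path/master-config.yaml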
Available diagnostics include:
Diagnostic Name | Purpose |
---|---|
AggregatedLogging | Check the aggregated logging integration for proper configuration and operation. |
AnalyzeLogs | Check systemd service logs for problems. Does not require a configuration file to check against. |
ClusterRegistry | Check that the cluster has a working container image registry for builds and image streams. |
ClusterRoleBindings | Check that the default cluster role bindings are present and contain the expected subjects according to base policy. |
ClusterRoles | Check that cluster roles are present and contain the expected permissions according to base policy. |
ClusterRouter | Check for a working default router in the cluster. |
ConfigContexts | Check that each context in the client configuration is complete and has connectivity to its API server. |
DiagnosticPod | Creates a pod that runs diagnostics from an application standpoint, checking that DNS within the pod works as expected and that the credentials for the default service account authenticate correctly to the master API. |
EtcdWriteVolume | Check the volume of writes against etcd for a time period and classify them by operation and key. This diagnostic only runs if specifically requested, because it does not run as quickly as other diagnostics and can increase load on etcd. |
MasterConfigCheck | Check this host's master configuration file for problems. |
MasterNode | Check that the master running on this host is also running a node, to verify that it is a member of the cluster SDN. |
MetricsApiProxy | Check that the integrated Heapster metrics can be reached via the cluster API proxy. |
NetworkCheck | Create diagnostic pods on multiple nodes to diagnose common network issues from an application or pod standpoint. Run this diagnostic when the master can schedule pods on nodes, but the pods have connection issues. This check confirms that pods can connect to services, other pods, and the external network. If there are any errors, this diagnostic stores results and retrieved files in a local directory (/tmp/openshift/, by default) for further analysis. The directory can be specified with the --network-logdir option. |
NodeConfigCheck | Check this host's node configuration file for problems. |
NodeDefinitions | Check that the nodes defined in the master API are ready and can schedule pods. |
RouteCertificateValidation | Check all route certificates for those that might be rejected by extended validation. |
ServiceExternalIPs | Check for existing services that specify external IPs, which are disallowed according to master configuration. |
UnitStatus | Check systemd status for units on this host related to OpenShift Container Platform. Does not require a configuration file to check against. |
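As noted above, EtcdWriteVolume runs only when specifically requested, so gathering etcd write statistics requires invoking it by name:
$ oc adm diagnostics EtcdWriteVolume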
40.3. Running Diagnostics in a Server Environment
An Ansible-deployed cluster provides additional diagnostic benefits for nodes within an OpenShift Container Platform cluster. These include:
- Master and node configuration files are in standard locations.
- Systemd units are created and configured for managing the servers in the cluster.
- All components log to journald.
Keeping the configuration files in the default locations placed by an Ansible-deployed cluster ensures that running oc adm diagnostics works without any flags. If you are not using the default locations for the configuration files, you must use the --master-config and --node-config options:
# oc adm diagnostics --master-config=<file_path> --node-config=<file_path>
Systemd units and log entries in journald are necessary for the current log diagnostic logic. For other deployment types, logs can be stored in single files, in files that combine node and master logs, or printed to stdout. If log entries do not use journald, the log diagnostics cannot work and do not run.
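To confirm that a host's services do log to journald before relying on the log diagnostics, you can query the journal directly. The unit name below is only an example and varies by deployment type:
$ journalctl -u atomic-openshift-node --since "1 hour ago" | tail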
40.4. Running Diagnostics in a Client Environment
You can run the diagnostics tool as an ordinary user or as a cluster-admin, and it runs using the level of permissions granted to the account from which you run it.
A client with ordinary access can diagnose its connection to the master and run a diagnostic pod. If multiple users or masters are configured, connections are tested for all, but the diagnostic pod only runs against the current user, server, or project.
A client with cluster-admin access can diagnose the status of infrastructure such as nodes, the registry, and the router. In each case, running oc adm diagnostics searches for the standard client configuration file in its standard location and uses it if available.
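For example, an ordinary user can exercise the client-side and in-pod checks, which do not require cluster-admin access:
$ oc adm diagnostics ConfigContexts DiagnosticPod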
40.5. Ansible-based Health Checks
Additional diagnostic health checks are available through the Ansible-based tooling used to install and manage OpenShift Container Platform clusters. They can report common deployment problems for the current OpenShift Container Platform installation.
These checks can be run either using the ansible-playbook command (the same method used during cluster installations) or as a containerized version of openshift-ansible. For the ansible-playbook method, the checks are provided by the openshift-ansible RPM package. For the containerized method, the openshift3/ose-ansible container image is distributed via the Red Hat Container Registry. Example usage for each method is provided in the following sections.
The following health checks are a set of diagnostic tasks that are meant to be run against the Ansible inventory file for a deployed OpenShift Container Platform cluster using the provided health.yml playbook.
Because the health check playbooks can make changes to the environment, run them only against Ansible-deployed clusters, using the same inventory file that was used for deployment. The changes consist of installing dependencies so that the checks can gather the required information. In some circumstances, additional system components, such as docker or networking configurations, can change if their current state differs from the configuration in the inventory file. Run these health checks only if you do not expect the inventory file to make any changes to the existing cluster configuration.
Check Name | Purpose |
---|---|
etcd_imagedata_size | This check measures the total size of OpenShift Container Platform image data in an etcd cluster. The check fails if the calculated size exceeds a user-defined limit. If no limit is specified, this check fails if the size of image data amounts to 50% or more of the currently used space in the etcd cluster. A failure from this check indicates that a significant amount of space in etcd is being taken up by OpenShift Container Platform image data, which can eventually result in the etcd cluster crashing. A user-defined limit may be set by passing the etcd_max_image_data_size_bytes variable. |
etcd_traffic | This check detects higher-than-normal traffic on an etcd host. It fails if a journalctl log entry with an etcd sync duration warning is found. For further information on improving etcd performance, see Recommended Practices for OpenShift Container Platform etcd Hosts and the Red Hat Knowledgebase. |
etcd_volume | This check ensures that the volume usage for an etcd cluster is below a maximum user-specified threshold. If no maximum threshold value is specified, it defaults to 90% of the total volume size. A user-defined limit may be set by passing the etcd_device_usage_threshold_percent variable. |
docker_storage | Only runs on hosts that depend on the docker daemon (nodes and containerized installations). Checks that docker's total usage does not exceed a user-defined limit. If no user-defined limit is set, docker's maximum usage threshold defaults to 90% of the total size available. You can set the threshold limit for total percent usage with a variable in the inventory file, for example max_thinpool_data_usage_percent=90. This also checks that docker's storage is using a supported configuration. |
curator, elasticsearch, fluentd, kibana | This set of checks verifies that Curator, Kibana, Elasticsearch, and Fluentd pods have been deployed and are in a running state, and that a connection can be established between the control host and the exposed Kibana URL. These checks run only if cluster logging is enabled. |
logging_index_time | This check detects higher than normal time delays between log creation and log aggregation by Elasticsearch in a logging stack deployment. It fails if a new log entry cannot be queried through Elasticsearch within a timeout (by default, 30 seconds). The check only runs if logging is enabled. A user-defined timeout may be set by passing the openshift_check_logging_index_timeout_seconds variable. |
sdn | This check performs cluster-level diagnostics of the OpenShift Container Platform SDN. If you specify the openshift_checks_output_dir variable, the check also saves the state of the SDN to files in the specified directory for further debugging. This check can help you diagnose pod or infrastructure problems at the cluster level. |
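For example, to tune several of these checks from the inventory file rather than on the command line, you could add lines such as the following (the values shown are illustrative):
etcd_max_image_data_size_bytes=40000000000
etcd_device_usage_threshold_percent=90
openshift_check_logging_index_timeout_seconds=45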
A similar set of checks meant to run as part of the installation process can be found in Configuring Cluster Pre-install Checks. Another set of checks for checking certificate expiration can be found in Redeploying Certificates.
40.5.1. Running Health Checks via ansible-playbook
To run the openshift-ansible health checks using the ansible-playbook command, change to the playbook directory, specify your cluster's inventory file, and run the health.yml playbook:
$ cd /usr/share/ansible/openshift-ansible
$ ansible-playbook -i <inventory_file> \
    playbooks/openshift-checks/health.yml
To set variables in the command line, include the -e flag with any desired variables in key=value format. For example:
$ cd /usr/share/ansible/openshift-ansible
$ ansible-playbook -i <inventory_file> \
    playbooks/openshift-checks/health.yml \
    -e openshift_check_logging_index_timeout_seconds=45 \
    -e etcd_max_image_data_size_bytes=40000000000
To disable specific checks, include the variable openshift_disable_check with a comma-delimited list of check names in your inventory file before running the playbook. For example:
openshift_disable_check=etcd_traffic,etcd_volume
Alternatively, set any checks to disable as variables with -e openshift_disable_check=<check1>,<check2> when running the ansible-playbook command.
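Putting these together, a single run that skips the etcd traffic and volume checks while tightening the logging timeout might look like the following (check names and values are illustrative):
$ ansible-playbook -i <inventory_file> \
    playbooks/openshift-checks/health.yml \
    -e openshift_disable_check=etcd_traffic,etcd_volume \
    -e openshift_check_logging_index_timeout_seconds=45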
40.5.2. Running Health Checks via Docker CLI
You can run the openshift-ansible playbooks in a container, avoiding the need to install and configure Ansible, on any host that can run the ose-ansible image via the Docker CLI.
Run the following as a non-root user that has privileges to run containers:
$ docker run -u `id -u` \ 1
    -v $HOME/.ssh/id_rsa:/opt/app-root/src/.ssh/id_rsa:Z,ro \ 2
    -v /etc/ansible/hosts:/tmp/inventory:ro \ 3
    -e INVENTORY_FILE=/tmp/inventory \
    -e PLAYBOOK_FILE=playbooks/openshift-checks/health.yml \ 4
    -e OPTS="-v -e openshift_check_logging_index_timeout_seconds=45 -e etcd_max_image_data_size_bytes=40000000000" \ 5
    openshift3/ose-ansible
1. These options make the container run with the same UID as the current user, which is required so that the SSH key can be read inside the container (SSH private keys are expected to be readable only by their owner).
2. Mount SSH keys as a volume under /opt/app-root/src/.ssh under normal usage when running the container as a non-root user.
3. Change /etc/ansible/hosts to the location of the cluster's inventory file, if different. This file is bind-mounted to /tmp/inventory, which is used according to the INVENTORY_FILE environment variable in the container.
4. The PLAYBOOK_FILE environment variable is set to the location of the health.yml playbook relative to /usr/share/ansible/openshift-ansible inside the container.
5. Set any variables desired for a single run with the -e key=value format.
In the previous command, the SSH key is mounted with the :Z option so that the container can read the SSH key from its restricted SELinux context. Adding this option means that your original SSH key file is relabeled similarly to system_u:object_r:container_file_t:s0:c113,c247. For more details about :Z, see the docker-run(1) man page.
These volume mount specifications can have unexpected consequences. For example, if you mount, and therefore relabel, the $HOME/.ssh directory, sshd becomes unable to access the public keys to allow remote login. To avoid altering the original file labels, mount a copy of the SSH key or directory.
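For example, a minimal copy-then-mount sequence, assuming a scratch copy at the hypothetical path /tmp/ssh_copy_id_rsa, might look like:
$ cp $HOME/.ssh/id_rsa /tmp/ssh_copy_id_rsa
$ docker run -u `id -u` \
    -v /tmp/ssh_copy_id_rsa:/opt/app-root/src/.ssh/id_rsa:Z,ro \
    -v /etc/ansible/hosts:/tmp/inventory:ro \
    -e INVENTORY_FILE=/tmp/inventory \
    -e PLAYBOOK_FILE=playbooks/openshift-checks/health.yml \
    openshift3/ose-ansible
Only the copy is relabeled by :Z, so the original key keeps its SELinux labels.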
Mounting an entire .ssh directory can be helpful for:
- Allowing you to use an SSH configuration to match keys with hosts or modify other connection parameters.
- Allowing a user to provide a known_hosts file and have SSH validate host keys. This is disabled by the default configuration and can be re-enabled with an environment variable by adding -e ANSIBLE_HOST_KEY_CHECKING=True to the docker command line.
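A sketch of mounting a copied .ssh directory with host key checking re-enabled, assuming the copy lives at the hypothetical path /tmp/ssh_copy, might look like:
$ cp -r $HOME/.ssh /tmp/ssh_copy
$ docker run -u `id -u` \
    -v /tmp/ssh_copy:/opt/app-root/src/.ssh:Z \
    -v /etc/ansible/hosts:/tmp/inventory:ro \
    -e INVENTORY_FILE=/tmp/inventory \
    -e PLAYBOOK_FILE=playbooks/openshift-checks/health.yml \
    -e ANSIBLE_HOST_KEY_CHECKING=True \
    openshift3/ose-ansible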