Chapter 3. Environment health checks
This topic contains steps to verify the overall health of the OpenShift Container Platform cluster and its various components, and describes the intended behavior.
Knowing the verification process for the various components is the first step to troubleshooting issues. If you experience issues, you can use the checks provided in this section to diagnose them.
3.1. Checking complete environment health
To verify the end-to-end functionality of an OpenShift Container Platform cluster, build and deploy an example application.
Procedure
Create a new project named validate, as well as an example application from the cakephp-mysql-example template:
$ oc new-project validate
$ oc new-app cakephp-mysql-example
You can check the logs to follow the build:
$ oc logs -f bc/cakephp-mysql-example
Once the build is complete, two pods should be running: a database and an application:
$ oc get pods
NAME                            READY     STATUS      RESTARTS   AGE
cakephp-mysql-example-1-build   0/1       Completed   0          1m
cakephp-mysql-example-2-247xm   1/1       Running     0          39s
mysql-1-hbk46                   1/1       Running     0          1m
Visit the application URL. The Cake PHP framework welcome page should be visible. The URL should have the following format: cakephp-mysql-example-validate.<app_domain>.
Once the functionality has been verified, the validate project can be deleted:
$ oc delete project validate
All resources within the project will be deleted as well.
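The pod check in the procedure above could be scripted instead of read by eye. The following is a minimal sketch, not part of the product: pods_not_healthy is a hypothetical helper name, and it assumes the five-column `oc get pods` layout shown in the example output.

```shell
# Hypothetical helper (not an oc feature): reads `oc get pods` output on
# stdin and prints the name of every pod whose STATUS column is neither
# Running nor Completed. Assumes STATUS is the third column.
pods_not_healthy() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1 }'
}

# Usage, assuming a logged-in oc client and the validate project:
#   oc get pods -n validate | pods_not_healthy
```

An empty result suggests that the build pod has completed and that the application and database pods are running.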
3.2. Creating alerts using Prometheus
You can integrate OpenShift Container Platform with Prometheus to create visuals and alerts that help you diagnose environment issues before they escalate. Examples include a node going down or a pod consuming too much CPU or memory.
See the Prometheus on OpenShift Container Platform section in the Installation and configuration guide for more information.
Prometheus on OpenShift Container Platform is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs), might not be functionally complete, and Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information on Red Hat Technology Preview features support scope, see https://access.redhat.com/support/offerings/techpreview/.
3.3. Host health
To verify that the cluster is up and running, connect to a master instance, and run the following:
$ oc get nodes
NAME                  STATUS                     AGE       VERSION
ocp-infra-node-1clj   Ready                      1h        v1.6.1+5115d708d7
ocp-infra-node-86qr   Ready                      1h        v1.6.1+5115d708d7
ocp-infra-node-g8qw   Ready                      1h        v1.6.1+5115d708d7
ocp-master-94zd       Ready,SchedulingDisabled   1h        v1.6.1+5115d708d7
ocp-master-gjkm       Ready,SchedulingDisabled   1h        v1.6.1+5115d708d7
ocp-master-wc8w       Ready,SchedulingDisabled   1h        v1.6.1+5115d708d7
ocp-node-c5dg         Ready                      1h        v1.6.1+5115d708d7
ocp-node-ghxn         Ready                      1h        v1.6.1+5115d708d7
ocp-node-w135         Ready                      1h        v1.6.1+5115d708d7
The above cluster example consists of three master hosts, three infrastructure node hosts, and three node hosts. All of them are running. All hosts in the cluster should be visible in this output.
The Ready status means that master hosts can communicate with node hosts and that the nodes are ready to run pods (excluding the nodes in which scheduling is disabled).
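To spot problem hosts in a larger cluster, the STATUS column can be filtered. This is a minimal sketch under the assumption that the output has the four-column layout shown above; nodes_not_ready is a hypothetical helper name.

```shell
# Hypothetical helper: reads `oc get nodes` output on stdin and prints the
# name of every node whose STATUS does not begin with the Ready condition.
# "Ready" and "Ready,SchedulingDisabled" pass; "NotReady" is reported.
nodes_not_ready() {
  awk 'NR > 1 && $2 !~ /^Ready(,|$)/ { print $1 }'
}

# Usage, assuming a logged-in oc client on a master instance:
#   oc get nodes | nodes_not_ready
```

No output means every listed host reports Ready.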
Before you run etcd commands, source the etcd.conf file:
# source /etc/etcd/etcd.conf
You can check the basic etcd health status from any master instance with the etcdctl command:
# etcdctl --cert-file=$ETCD_PEER_CERT_FILE --key-file=$ETCD_PEER_KEY_FILE \
  --ca-file=/etc/etcd/ca.crt --endpoints=$ETCD_LISTEN_CLIENT_URLS cluster-health
member 59df5107484b84df is healthy: got healthy result from https://10.156.0.5:2379
member 6df7221a03f65299 is healthy: got healthy result from https://10.156.0.6:2379
member fea6dfedf3eecfa3 is healthy: got healthy result from https://10.156.0.9:2379
cluster is healthy
To get more information about etcd hosts, including the associated master host, list the members:
# etcdctl --cert-file=$ETCD_PEER_CERT_FILE --key-file=$ETCD_PEER_KEY_FILE \
  --ca-file=/etc/etcd/ca.crt --endpoints=$ETCD_LISTEN_CLIENT_URLS member list
295750b7103123e0: name=ocp-master-zh8d peerURLs=https://10.156.0.7:2380 clientURLs=https://10.156.0.7:2379 isLeader=true
b097a72f2610aea5: name=ocp-master-qcg3 peerURLs=https://10.156.0.11:2380 clientURLs=https://10.156.0.11:2379 isLeader=false
fea6dfedf3eecfa3: name=ocp-master-j338 peerURLs=https://10.156.0.9:2380 clientURLs=https://10.156.0.9:2379 isLeader=false
All etcd hosts should contain the master host name if the etcd cluster is co-located with master services, or all etcd instances should be visible if etcd is running separately.
etcdctl2 is an alias for the etcdctl tool that contains the proper flags to query the etcd cluster in the v2 data model. Likewise, etcdctl3 is the alias for the v3 data model.
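The cluster-health output shown earlier in this section can be checked from a script. The sketch below relies only on the two line shapes visible in that example output ("member <id> is healthy: ..." and "cluster is healthy"); both function names are hypothetical.

```shell
# Hypothetical helpers that read `etcdctl ... cluster-health` output on stdin.

# Succeeds only if the output contains the exact summary line
# "cluster is healthy".
etcd_cluster_healthy() {
  grep -qx 'cluster is healthy'
}

# Prints how many member lines report a healthy result.
etcd_healthy_member_count() {
  grep -c '^member [0-9a-f]* is healthy:'
}

# Usage, assuming /etc/etcd/etcd.conf has been sourced as described above:
#   etcdctl --cert-file=$ETCD_PEER_CERT_FILE --key-file=$ETCD_PEER_KEY_FILE \
#     --ca-file=/etc/etcd/ca.crt --endpoints=$ETCD_LISTEN_CLIENT_URLS \
#     cluster-health | etcd_cluster_healthy && echo "etcd OK"
```

Comparing the healthy member count against the expected number of etcd hosts gives a slightly stricter check than the summary line alone.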
3.4. Router and registry health
To check if a router service is running:
$ oc -n default get deploymentconfigs/router
NAME      REVISION   DESIRED   CURRENT   TRIGGERED BY
router    1          3         3         config
The values in the DESIRED and CURRENT columns should match the number of node hosts.
Use the same command to check the registry status:
$ oc -n default get deploymentconfigs/docker-registry
NAME              REVISION   DESIRED   CURRENT   TRIGGERED BY
docker-registry   1          3         3         config
Multiple running instances of the container registry require backend storage supporting writes by multiple processes. If the chosen infrastructure provider does not contain this ability, running a single instance of a container registry is acceptable.
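The comparison of the DESIRED and CURRENT columns for the router and registry can be automated. This is a minimal sketch, assuming the column order shown in the examples above; dc_replica_mismatch is a hypothetical helper name.

```shell
# Hypothetical helper: reads `oc get deploymentconfigs` output on stdin and
# reports every deployment configuration whose DESIRED (column 3) and
# CURRENT (column 4) replica counts differ.
dc_replica_mismatch() {
  awk 'NR > 1 && $3 != $4 { print $1 ": desired=" $3 " current=" $4 }'
}

# Usage, assuming a logged-in oc client:
#   oc -n default get deploymentconfigs | dc_replica_mismatch
```

No output means the counts match for every listed deployment configuration.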
To verify that all pods are running and on which hosts:
$ oc -n default get pods -o wide
NAME                       READY     STATUS    RESTARTS   AGE       IP            NODE
docker-registry-1-54nhl    1/1       Running   0          2d        172.16.2.3    ocp-infra-node-tl47
docker-registry-1-jsm2t    1/1       Running   0          2d        172.16.8.2    ocp-infra-node-62rc
docker-registry-1-qbt4g    1/1       Running   0          2d        172.16.14.3   ocp-infra-node-xrtz
registry-console-2-gbhcz   1/1       Running   0          2d        172.16.8.4    ocp-infra-node-62rc
router-1-6zhf8             1/1       Running   0          2d        10.156.0.4    ocp-infra-node-62rc
router-1-ffq4g             1/1       Running   0          2d        10.156.0.10   ocp-infra-node-tl47
router-1-zqxbl             1/1       Running   0          2d        10.156.0.8    ocp-infra-node-xrtz
If OpenShift Container Platform is using an external container registry, the internal registry service does not need to be running.
3.5. Network connectivity
Network connectivity has two main layers: the cluster network for node interaction, and the software-defined network (SDN) for pod interaction. OpenShift Container Platform supports multiple network configurations, often optimized for a specific infrastructure provider.
Due to the complexity of networking, not all verification scenarios are covered in this section.
3.5.1. Connectivity on master hosts
etcd and master hosts
Master services keep their state synchronized using the etcd key-value store. Communication between master and etcd services is important, whether those etcd services are collocated on master hosts, or running on hosts designated only for the etcd service. This communication happens on TCP ports 2379 and 2380. See the Host health section for methods to check this communication.
SkyDNS
SkyDNS provides name resolution of local services running in OpenShift Container Platform. This service uses TCP and UDP port 8053.
To verify the name resolution:
$ dig +short docker-registry.default.svc.cluster.local
172.30.150.7
If the answer matches the output of the following, the SkyDNS service is working correctly:
$ oc get svc/docker-registry -n default
NAME              CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
docker-registry   172.30.150.7   <none>        5000/TCP   3d
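The comparison between the DNS answer and the service IP can also be done in a script. The sketch below assumes the single-service `oc get svc` layout shown above, with CLUSTER-IP as the second column; dns_matches_cluster_ip is a hypothetical helper name.

```shell
# Hypothetical helper: takes the DNS answer as its first argument and reads
# `oc get svc/<name>` output on stdin. Succeeds only if the CLUSTER-IP of the
# first listed service equals the DNS answer. The body runs in a subshell so
# its variables do not leak into the caller.
dns_matches_cluster_ip() (
  dns_answer="$1"
  cluster_ip="$(awk 'NR == 2 { print $2 }')"
  [ -n "$cluster_ip" ] && [ "$dns_answer" = "$cluster_ip" ]
)

# Usage, assuming dig and a logged-in oc client are available:
#   oc get svc/docker-registry -n default |
#     dns_matches_cluster_ip "$(dig +short docker-registry.default.svc.cluster.local)" &&
#     echo "SkyDNS answer matches the service IP"
```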
API service and web console
Both the API service and web console share the same port, usually TCP 8443 or 443, depending on the setup. This port needs to be available within the cluster and to everyone who needs to work with the deployed environment. The URLs under which this port is reachable may differ for the internal cluster and for external clients.
In the following example, the https://internal-master.example.com:443 URL is used by the internal cluster, and the https://master.example.com:443 URL is used by external clients. On any node host:
$ curl https://internal-master.example.com:443/version
{
  "major": "1",
  "minor": "6",
  "gitVersion": "v1.6.1+5115d708d7",
  "gitCommit": "fff65cf",
  "gitTreeState": "clean",
  "buildDate": "2017-10-11T22:44:25Z",
  "goVersion": "go1.7.6",
  "compiler": "gc",
  "platform": "linux/amd64"
}
The external URL must be reachable from the client's network:
$ curl -k https://master.example.com:443/healthz
ok
3.5.2. Connectivity on node instances
The SDN connecting pod communication on nodes uses UDP port 4789 by default.
To verify node host functionality, create a new application. The following example verifies that the node can reach the docker registry, which is running on an infrastructure node:
Procedure
Create a new project:
$ oc new-project sdn-test
Deploy an httpd application:
$ oc new-app centos/httpd-24-centos7~https://github.com/sclorg/httpd-ex
Wait until the build is complete:
$ oc get pods
NAME               READY     STATUS      RESTARTS   AGE
httpd-ex-1-205hz   1/1       Running     0          34s
httpd-ex-1-build   0/1       Completed   0          1m
Connect to the running pod:
$ oc rsh po/<pod-name>
For example:
$ oc rsh po/httpd-ex-1-205hz
Check the healthz path of the internal registry service:
$ curl -kv https://docker-registry.default.svc.cluster.local:5000/healthz
* About to connect() to docker-registry.default.svc.cluster.local port 5000 (#0)
*   Trying 172.30.150.7...
* Connected to docker-registry.default.svc.cluster.local (172.30.150.7) port 5000 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* SSL connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate:
*   subject: CN=172.30.150.7
*   start date: Nov 30 17:21:51 2017 GMT
*   expire date: Nov 30 17:21:52 2019 GMT
*   common name: 172.30.150.7
*   issuer: CN=openshift-signer@1512059618
> GET /healthz HTTP/1.1
> User-Agent: curl/7.29.0
> Host: docker-registry.default.svc.cluster.local:5000
> Accept: */*
>
< HTTP/1.1 200 OK
< Cache-Control: no-cache
< Date: Mon, 04 Dec 2017 16:26:49 GMT
< Content-Length: 0
< Content-Type: text/plain; charset=utf-8
<
* Connection #0 to host docker-registry.default.svc.cluster.local left intact
sh-4.2$ exit
The HTTP/1.1 200 OK response means the node is correctly connecting.
Clean up the test project:
$ oc delete project sdn-test
project "sdn-test" deleted
The node host is listening on TCP port 10250. This port needs to be reachable by all master hosts on any node, and if monitoring is deployed in the cluster, the infrastructure nodes must have access to this port on all instances as well. Broken communication on this port can be detected with the following command:
$ oc get nodes
NAME                  STATUS                     AGE       VERSION
ocp-infra-node-1clj   Ready                      4d        v1.6.1+5115d708d7
ocp-infra-node-86qr   Ready                      4d        v1.6.1+5115d708d7
ocp-infra-node-g8qw   Ready                      4d        v1.6.1+5115d708d7
ocp-master-94zd       Ready,SchedulingDisabled   4d        v1.6.1+5115d708d7
ocp-master-gjkm       Ready,SchedulingDisabled   4d        v1.6.1+5115d708d7
ocp-master-wc8w       Ready,SchedulingDisabled   4d        v1.6.1+5115d708d7
ocp-node-c5dg         Ready                      4d        v1.6.1+5115d708d7
ocp-node-ghxn         Ready                      4d        v1.6.1+5115d708d7
ocp-node-w135         NotReady                   4d        v1.6.1+5115d708d7
In the output above, the node service on the ocp-node-w135 node is not reachable by the master services, which is represented by its NotReady status.
The last service is the router, which is responsible for routing connections to the correct services running in the OpenShift Container Platform cluster. Routers listen on TCP ports 80 and 443 on infrastructure nodes for ingress traffic. Before routers can start working, DNS must be configured:
$ dig *.apps.example.com
; <<>> DiG 9.11.1-P3-RedHat-9.11.1-8.P3.fc27 <<>> *.apps.example.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 45790
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;*.apps.example.com. IN A
;; ANSWER SECTION:
*.apps.example.com. 3571 IN CNAME apps.example.com.
apps.example.com. 3561 IN A 35.xx.xx.92
;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Tue Dec 05 16:03:52 CET 2017
;; MSG SIZE rcvd: 105
The IP address, in this case 35.xx.xx.92, should be pointing to the load balancer distributing ingress traffic to all infrastructure nodes. To verify the functionality of the routers, check the registry service once more, but this time from outside the cluster:
$ curl -kv https://docker-registry-default.apps.example.com/healthz
*   Trying 35.xx.xx.92...
* TCP_NODELAY set
* Connected to docker-registry-default.apps.example.com (35.xx.xx.92) port 443 (#0)
...
< HTTP/2 200
< cache-control: no-cache
< content-type: text/plain; charset=utf-8
< content-length: 0
< date: Tue, 05 Dec 2017 15:13:27 GMT
<
* Connection #0 to host docker-registry-default.apps.example.com left intact
3.6. Storage
Master instances need at least 40 GB of hard disk space for the /var directory. Check the disk usage of a master host using the df command:
$ df -hT
Filesystem     Type      Size  Used Avail Use% Mounted on
/dev/sda1      xfs        45G  2.8G   43G   7% /
devtmpfs       devtmpfs  3.6G     0  3.6G   0% /dev
tmpfs          tmpfs     3.6G     0  3.6G   0% /dev/shm
tmpfs          tmpfs     3.6G   63M  3.6G   2% /run
tmpfs          tmpfs     3.6G     0  3.6G   0% /sys/fs/cgroup
tmpfs          tmpfs     732M     0  732M   0% /run/user/1000
tmpfs          tmpfs     732M     0  732M   0% /run/user/0
Node instances need at least 15 GB of space for the /var directory, and at least another 15 GB for Docker storage (/var/lib/docker in this case). Depending on the size of the cluster and the amount of ephemeral storage desired for pods, a separate partition should be created for /var/lib/origin/openshift.local.volumes on the nodes.
$ df -hT
Filesystem     Type      Size  Used Avail Use% Mounted on
/dev/sda1      xfs        25G  2.4G   23G  10% /
devtmpfs       devtmpfs  3.6G     0  3.6G   0% /dev
tmpfs          tmpfs     3.6G     0  3.6G   0% /dev/shm
tmpfs          tmpfs     3.6G  147M  3.5G   4% /run
tmpfs          tmpfs     3.6G     0  3.6G   0% /sys/fs/cgroup
/dev/sdb       xfs        25G  2.7G   23G  11% /var/lib/docker
/dev/sdc       xfs        50G   33M   50G   1% /var/lib/origin/openshift.local.volumes
tmpfs          tmpfs     732M     0  732M   0% /run/user/1000
Persistent storage for pods should be handled outside of the instances running the OpenShift Container Platform cluster. Persistent volumes for pods can be provisioned by the infrastructure provider, or with the use of container native storage or container ready storage.
3.7. Docker storage
Docker storage can be backed by one of two options. The first is a thin pool logical volume with device mapper. The second, available since Red Hat Enterprise Linux version 7.4, is an overlay2 file system. The overlay2 file system is generally recommended due to the ease of setup and increased performance.
The Docker storage disk is mounted as /var/lib/docker and formatted with the xfs file system. Docker storage is configured to use the overlay2 file system:
$ cat /etc/sysconfig/docker-storage
DOCKER_STORAGE_OPTIONS='--storage-driver overlay2'
To verify this storage driver is used by Docker:
# docker info
Containers: 4
 Running: 4
 Paused: 0
 Stopped: 0
Images: 4
Server Version: 1.12.6
Storage Driver: overlay2
 Backing Filesystem: xfs
Logging Driver: journald
Cgroup Driver: systemd
Plugins:
 Volume: local
 Network: overlay host bridge null
 Authorization: rhel-push-plugin
Swarm: inactive
Runtimes: docker-runc runc
Default Runtime: docker-runc
Security Options: seccomp selinux
Kernel Version: 3.10.0-693.11.1.el7.x86_64
Operating System: Employee SKU
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 3
CPUs: 2
Total Memory: 7.147 GiB
Name: ocp-infra-node-1clj
ID: T7T6:IQTG:WTUX:7BRU:5FI4:XUL5:PAAM:4SLW:NWKL:WU2V:NQOW:JPHC
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://registry.access.redhat.com/v1/
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Insecure Registries:
 127.0.0.0/8
Registries: registry.access.redhat.com (secure), registry.access.redhat.com (secure), docker.io (secure)
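Because the docker info output is long, the storage driver line can be extracted directly. This is a minimal sketch that assumes the "Storage Driver: <name>" line format shown above; docker_storage_driver is a hypothetical helper name.

```shell
# Hypothetical helper: reads `docker info` output on stdin and prints only
# the value of the "Storage Driver:" line, ignoring leading indentation.
docker_storage_driver() {
  sed -n 's/^ *Storage Driver: *//p'
}

# Usage, run as root on the host being checked:
#   [ "$(docker info 2>/dev/null | docker_storage_driver)" = "overlay2" ] &&
#     echo "overlay2 is in use"
```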
3.8. API service status
The OpenShift API service, atomic-openshift-master-api.service, runs on all master instances. To see the status of the service:
$ systemctl status atomic-openshift-master-api.service
● atomic-openshift-master-api.service - Atomic OpenShift Master API
   Loaded: loaded (/usr/lib/systemd/system/atomic-openshift-master-api.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2017-11-30 11:40:19 EST; 5 days ago
     Docs: https://github.com/openshift/origin
 Main PID: 30047 (openshift)
   Memory: 65.0M
   CGroup: /system.slice/atomic-openshift-master-api.service
           └─30047 /usr/bin/openshift start master api --config=/etc/origin/master/ma...
Dec 06 09:15:49 ocp-master-94zd atomic-openshift-master-api[30047]: I1206 09:15:49.85...
Dec 06 09:15:50 ocp-master-94zd atomic-openshift-master-api[30047]: I1206 09:15:50.96...
Dec 06 09:15:52 ocp-master-94zd atomic-openshift-master-api[30047]: I1206 09:15:52.34...
The API service exposes a health check, which can be queried externally with:
$ curl -k https://master.example.com/healthz
ok
3.9. Controller role verification
The OpenShift Container Platform controller service, atomic-openshift-master-controllers.service, is available across all master hosts. The service runs in active/passive mode, meaning it should only be running on one master at any time.
The OpenShift Container Platform controllers execute a procedure to choose which host runs the service. The current running value is stored in an annotation in a special configmap stored in the kube-system project.
Verify the master host running the atomic-openshift-master-controllers service as a cluster-admin user:
$ oc get -n kube-system cm openshift-master-controllers -o yaml
apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"master-ose-master-0.example.com-10.19.115.212-dnwrtcl4","leaseDurationSeconds":15,"acquireTime":"2018-02-17T18:16:54Z","renewTime":"2018-02-19T13:50:33Z","leaderTransitions":16}'
  creationTimestamp: 2018-02-02T10:30:04Z
  name: openshift-master-controllers
  namespace: kube-system
  resourceVersion: "17349662"
  selfLink: /api/v1/namespaces/kube-system/configmaps/openshift-master-controllers
  uid: 08636843-0804-11e8-8580-fa163eb934f0
The command outputs the current master controller in the control-plane.alpha.kubernetes.io/leader annotation, within the holderIdentity property as:
master-<hostname>-<ip>-<8_random_characters>
Find the hostname of the master host by filtering the output using the following:
$ oc get -n kube-system cm openshift-master-controllers -o json | jq -r '.metadata.annotations[] | fromjson.holderIdentity | match("^master-(.*)-[0-9.]*-[0-9a-z]{8}$") | .captures[0].string'
ose-master-0.example.com
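When only the holderIdentity string is at hand, for example copied from the YAML output, the hostname can also be extracted with sed. This is a sketch of an alternative to the jq filter rather than a replacement for it; controller_hostname is a hypothetical helper name, and it assumes the master-<hostname>-<ip>-<8_random_characters> format described above.

```shell
# Hypothetical helper: reads holderIdentity strings on stdin, one per line,
# and prints the embedded hostname. Lines that do not follow the
# master-<hostname>-<ip>-<8_random_characters> format produce no output.
controller_hostname() {
  sed -E -n 's/^master-(.*)-[0-9.]+-[0-9a-z]{8}$/\1/p'
}

# Usage:
#   echo 'master-ose-master-0.example.com-10.19.115.212-dnwrtcl4' | controller_hostname
```

The hostname group is greedy, so hostnames that themselves contain hyphens, such as ose-master-0.example.com, are kept intact.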
3.10. Verifying correct Maximum Transmission Unit (MTU) size
Verifying the maximum transmission unit (MTU) prevents a possible networking misconfiguration that can masquerade as an SSL certificate issue.
When a packet that is larger than the MTU size is transmitted over HTTP, the physical network router is able to break the packet into multiple packets to transmit the data. However, when a packet that is larger than the MTU size is transmitted over HTTPS, the router is forced to drop the packet.
Installation produces certificates that provide secure connections to multiple components that include:
- master hosts
- node hosts
- infrastructure nodes
- registry
- router
These certificates can be found within the /etc/origin/master directory for the master nodes and the /etc/origin/node directory for the infra and app nodes.
After installation, you can verify connectivity to the REGISTRY_OPENSHIFT_SERVER_ADDR using the process outlined in the Network connectivity section.
Prerequisites
From a master host, get the HTTPS address:
$ oc get dc docker-registry -o jsonpath='{.spec.template.spec.containers[].env[?(@.name=="OPENSHIFT_DEFAULT_REGISTRY")].value}{"\n"}'
docker-registry.default.svc:5000
The above gives the output of docker-registry.default.svc:5000.
Append /healthz to the value given above, and use it to check on all hosts (master, infrastructure, node):
$ curl -v https://docker-registry.default.svc:5000/healthz
* About to connect() to docker-registry.default.svc port 5000 (#0)
*   Trying 172.30.11.171...
* Connected to docker-registry.default.svc (172.30.11.171) port 5000 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* SSL connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate:
*   subject: CN=172.30.11.171
*   start date: Oct 18 05:30:10 2017 GMT
*   expire date: Oct 18 05:30:11 2019 GMT
*   common name: 172.30.11.171
*   issuer: CN=openshift-signer@1508303629
> GET /healthz HTTP/1.1
> User-Agent: curl/7.29.0
> Host: docker-registry.default.svc:5000
> Accept: */*
>
< HTTP/1.1 200 OK
< Cache-Control: no-cache
< Date: Tue, 24 Oct 2017 19:42:35 GMT
< Content-Length: 0
< Content-Type: text/plain; charset=utf-8
<
* Connection #0 to host docker-registry.default.svc left intact
The above example output shows a working SSL connection, which indicates that the MTU size in use is correct. The connection attempt succeeds, NSS is initialized with the certpath, and the server certificate information for the docker-registry is returned.
An improper MTU size results in a timeout:
$ curl -v https://docker-registry.default.svc:5000/healthz
* About to connect() to docker-registry.default.svc port 5000 (#0)
*   Trying 172.30.11.171...
* Connected to docker-registry.default.svc (172.30.11.171) port 5000 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
The above example shows that the connection is established, but it cannot finish initializing NSS with certpath. The issue is caused by an improper MTU size set within the /etc/origin/node/node-config.yaml file.
To fix this issue, adjust the MTU size within the /etc/origin/node/node-config.yaml file to 50 bytes smaller than the MTU size being used by the OpenShift SDN Ethernet device.
View the MTU size of the desired Ethernet device (for example, eth0):
$ ip link show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT qlen 1000
    link/ether fa:16:3e:92:6a:86 brd ff:ff:ff:ff:ff:ff
The above shows MTU set to 1500.
To change the MTU size, modify the /etc/origin/node/node-config.yaml file and set a value that is 50 bytes smaller than the output provided by the ip command.
For example, if the MTU size is set to 1500, adjust the MTU size to 1450 within the /etc/origin/node/node-config.yaml file:
networkConfig:
   mtu: 1450
Save the changes and reboot the node.
Note: You must change the MTU size on all masters and nodes that are part of the OpenShift Container Platform SDN. Also, the MTU size of the tun0 interface must be the same across all nodes that are part of the cluster.
Once the node is back online, confirm the issue no longer exists by re-running the original curl command:
$ curl -v https://docker-registry.default.svc:5000/healthz
If the timeout persists, continue to adjust the MTU size in increments of 50 bytes and repeat the process.
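The 50-byte adjustment described in this section can be computed from the ip command output rather than by hand. The following is a minimal sketch, assuming the "mtu <value>" field appears in the first line of the output as in the example above; link_mtu and sdn_mtu are hypothetical helper names.

```shell
# Hypothetical helper: reads `ip link show <device>` output on stdin and
# prints the value that follows the first "mtu" keyword.
link_mtu() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "mtu") { print $(i + 1); exit } }'
}

# Hypothetical helper: prints the device MTU minus 50 bytes, which is the
# value to place under networkConfig.mtu in node-config.yaml. Prints nothing
# and fails if no MTU could be found. Runs in a subshell to keep variables local.
sdn_mtu() (
  mtu="$(link_mtu)"
  [ -n "$mtu" ] && echo $((mtu - 50))
)

# Usage, assuming eth0 is the Ethernet device used by the SDN:
#   ip link show eth0 | sdn_mtu
```

With the example output above, where the device reports mtu 1500, the helper prints 1450, matching the value used in the node-config.yaml example.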