Chapter 5. Investigating the pods of the RHOSO High Availability services
The Operator of each Red Hat OpenStack Services on OpenShift (RHOSO) High Availability service monitors the status of the pods that it manages. Each service Operator aims to keep at least one replica of the service in the Running status.
Procedure
You can use the following command to check the status and availability of all the pods of the Galera, RabbitMQ, and memcached shared control plane services:
$ oc get pods | egrep -e "galera|rabbit|memcache"
NAME                       READY   STATUS    RESTARTS   AGE
memcached-0                1/1     Running   0          3h11m
memcached-1                1/1     Running   0          3h11m
memcached-2                1/1     Running   0          3h11m
openstack-cell1-galera-0   1/1     Running   0          3h11m
openstack-cell1-galera-1   1/1     Running   0          3h11m
openstack-cell1-galera-2   1/1     Running   0          3h11m
openstack-galera-0         1/1     Running   0          3h11m
openstack-galera-1         1/1     Running   0          3h11m
openstack-galera-2         1/1     Running   0          3h11m
rabbitmq-cell1-server-0    1/1     Running   0          3h11m
rabbitmq-cell1-server-1    1/1     Running   0          3h11m
rabbitmq-cell1-server-2    1/1     Running   0          3h11m
rabbitmq-server-0          1/1     Running   0          3h11m
rabbitmq-server-1          1/1     Running   0          3h11m
rabbitmq-server-2          1/1     Running   0          3h11m
You can use the following command to investigate a specific pod from this list, typically to determine why a pod cannot be started:
$ oc describe pod/<pod-name>
Replace <pod-name> with the name of the pod from the list that you want more information about. In the following example, the rabbitmq-server-0 pod is investigated:

$ oc describe pod/rabbitmq-server-0
Name:             rabbitmq-server-0
Namespace:        openstack
Priority:         0
Service Account:  rabbitmq-server
Node:             master-2/192.168.111.22
Start Time:       Thu, 21 Mar 2024 08:39:57 -0400
Labels:           app.kubernetes.io/component=rabbitmq
                  app.kubernetes.io/name=rabbitmq
                  app.kubernetes.io/part-of=rabbitmq
                  controller-revision-hash=rabbitmq-server-5c886b79b4
                  statefulset.kubernetes.io/pod-name=rabbitmq-server-0
Annotations:      k8s.ovn.org/pod-networks: {"default":{"ip_addresses":["192.168.16.35/22"],"mac_address":"0a:58:c0:a8:10:23","gateway_ips":["192.168.16.1"],"routes":[{"dest":"192.16...
                  k8s.v1.cni.cncf.io/network-status: [{ "name": "ovn-kubernetes", "interface": "eth0", "ips": [ "192.168.16.35" ], "mac": "0a:58:c0:a8:10:23", "default": true, "dns": {} }]
                  openshift.io/scc: restricted-v2
                  seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:           Running
...
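If the describe output does not reveal the cause of a failure, the pod's logs and the recent events in the namespace often do. The following commands are standard oc subcommands; the pod name rabbitmq-server-0 is reused from the example above:

```console
$ oc logs pod/rabbitmq-server-0
$ oc get events --field-selector involvedObject.name=rabbitmq-server-0
```

For a pod that is restarting, adding the --previous flag to oc logs shows the logs of the last terminated container instance, which usually contains the error that caused the restart.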
5.1. Understanding the taint/toleration-based pod eviction process
Red Hat OpenShift Container Platform (RHOCP) implements a taint/toleration-based pod eviction process, which determines how individual pods are evicted from worker nodes. Evicted pods are rescheduled on different nodes.
RHOCP assigns taints to specific worker node conditions, such as not-ready and unreachable. When a worker node experiences one of these conditions, RHOCP automatically taints the worker node. After a worker node becomes tainted, each pod must determine whether it can tolerate the taint:
- Any pod that does not tolerate the taint is evicted immediately.
- Any pod that does tolerate the taint is never evicted, unless the pod has a limited toleration to the taint. The tolerationSeconds parameter specifies the limited toleration of a pod: how long the pod can tolerate a specific taint (node condition) and remain bound to the worker node. If the worker node condition still exists after the specified tolerationSeconds period, the taint remains on the worker node and the pod is evicted. If the worker node condition clears before the specified tolerationSeconds period, the pod is not evicted.
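You can check which taints, if any, RHOCP has applied to the nodes by inspecting the node objects directly. This is a standard oc/kubectl query; master-2 is the example node name used earlier in this chapter:

```console
$ oc get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
$ oc describe node master-2 | grep -i taints
```

A healthy node typically shows `<none>` in the TAINTS column; a failed node shows entries such as node.kubernetes.io/unreachable.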
RHOCP adds a default toleration of five minutes for the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable taints, by setting tolerationSeconds=300.
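In a pod specification, these default tolerations look like the following sketch. The exact rendering can vary by pod; you can verify what a given pod received with `oc get pod <pod-name> -o yaml`:

```yaml
# Default tolerations that RHOCP injects into pods that do not
# declare their own (tolerationSeconds: 300 = five minutes).
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
```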
Red Hat OpenStack Services on OpenShift (RHOSO) 18.0 Operators do not modify the default tolerations for taints, therefore pods that run on a tainted worker node take more than five minutes to be rescheduled.
Improving high availability and failover times
RHOCP provides the following Operators that you can install, which work together to provide increased resiliency to help reduce the failover times of workloads:
- The Node Health Check Operator. For information, see Remediating Nodes with Node Health Checks in the Workload Availability for Red Hat OpenShift Remediation, fencing, and maintenance guide.
- The Self Node Remediation Operator. For information, see Using Self Node Remediation in the Workload Availability for Red Hat OpenShift Remediation, fencing, and maintenance guide.
  Note: When a pod uses local storage and the worker node on which the pod is running fails, manual intervention might be required even when using the Self Node Remediation Operator.
- The Fence Agents Remediation Operator. For information, see Using Fence Agents Remediation in the Workload Availability for Red Hat OpenShift Remediation, fencing, and maintenance guide.
Installing and configuring the Node Health Check Operator in conjunction with the Self Node Remediation Operator, the Fence Agents Remediation Operator, or both allows these Operators to control the time it takes for a worker node to be declared not-ready or unreachable, and thereby reduce the time a pod remains bound to a failed worker node.
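As an illustration of how these Operators fit together, the following is a sketch of a NodeHealthCheck resource that hands unhealthy worker nodes to the Self Node Remediation Operator. The field names follow the Node Health Check Operator API, but the durations, the remediation template name, and the namespace shown here are assumptions for illustration; verify them against the Workload Availability guides referenced above before use:

```yaml
apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: nhc-worker-default          # hypothetical name
spec:
  minHealthy: 51%                   # pause remediation if fewer nodes are healthy
  selector:
    matchExpressions:
    - key: node-role.kubernetes.io/worker
      operator: Exists
  unhealthyConditions:              # declare a node unhealthy after these durations
  - type: Ready
    status: "False"
    duration: 48s
  - type: Ready
    status: Unknown
    duration: 48s
  remediationTemplate:              # delegate remediation to Self Node Remediation
    apiVersion: self-node-remediation.medik8s.io/v1alpha1
    kind: SelfNodeRemediationTemplate
    name: self-node-remediation-resource-deletion-template   # assumed default template
    namespace: openshift-workload-availability               # assumed install namespace
```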