Chapter 5. Investigating the pods of the RHOSO High Availability services


The Operator of each Red Hat OpenStack Services on OpenShift (RHOSO) High Availability service monitors the status of the pods that it manages. Each service Operator aims to keep at least one replica of its service in the Running state.

Procedure

  1. You can use the following command to check the status and availability of all the pods of the Galera, RabbitMQ, and memcached shared control plane services:

    $ oc get pods | grep -E "galera|rabbit|memcache"
    NAME                       READY   STATUS    RESTARTS   AGE
    memcached-0                1/1     Running   0          3h11m
    memcached-1                1/1     Running   0          3h11m
    memcached-2                1/1     Running   0          3h11m
    openstack-cell1-galera-0   1/1     Running   0          3h11m
    openstack-cell1-galera-1   1/1     Running   0          3h11m
    openstack-cell1-galera-2   1/1     Running   0          3h11m
    openstack-galera-0         1/1     Running   0          3h11m
    openstack-galera-1         1/1     Running   0          3h11m
    openstack-galera-2         1/1     Running   0          3h11m
    rabbitmq-cell1-server-0    1/1     Running   0          3h11m
    rabbitmq-cell1-server-1    1/1     Running   0          3h11m
    rabbitmq-cell1-server-2    1/1     Running   0          3h11m
    rabbitmq-server-0          1/1     Running   0          3h11m
    rabbitmq-server-1          1/1     Running   0          3h11m
    rabbitmq-server-2          1/1     Running   0          3h11m
  2. You can use the following command to investigate a specific pod from this list, typically to determine why a pod cannot be started:

    $ oc describe pod/<pod-name>
    • Replace <pod-name> with the name of the pod that you want more information about.

      In the following example, the rabbitmq-server-0 pod is investigated:

      $ oc describe pod/rabbitmq-server-0
      Name:             rabbitmq-server-0
      Namespace:        openstack
      Priority:         0
      Service Account:  rabbitmq-server
      Node:             master-2/192.168.111.22
      Start Time:       Thu, 21 Mar 2024 08:39:57 -0400
      Labels:           app.kubernetes.io/component=rabbitmq
                        app.kubernetes.io/name=rabbitmq
                        app.kubernetes.io/part-of=rabbitmq
                        controller-revision-hash=rabbitmq-server-5c886b79b4
                        statefulset.kubernetes.io/pod-name=rabbitmq-server-0
      Annotations:      k8s.ovn.org/pod-networks:
                          {"default":{"ip_addresses":["192.168.16.35/22"],"mac_address":"0a:58:c0:a8:10:23","gateway_ips":["192.168.16.1"],"routes":[{"dest":"192.16...
                        k8s.v1.cni.cncf.io/network-status:
                          [{
                              "name": "ovn-kubernetes",
                              "interface": "eth0",
                              "ips": [
                                  "192.168.16.35"
                              ],
                              "mac": "0a:58:c0:a8:10:23",
                              "default": true,
                              "dns": {}
                          }]
                        openshift.io/scc: restricted-v2
                        seccomp.security.alpha.kubernetes.io/pod: runtime/default
      Status:           Running
      ...
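
      The Events section at the end of the oc describe output usually explains why a pod cannot start, for example an image pull failure or an unsatisfied scheduling constraint. As an additional, optional check, you can list the events recorded for the pod and inspect its logs. The following commands are a general sketch that reuses the rabbitmq-server-0 pod in the openstack namespace from the example above; adjust the pod name and namespace for your environment:

      $ oc get events -n openstack --field-selector involvedObject.name=rabbitmq-server-0
      $ oc logs -n openstack rabbitmq-server-0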

5.1. Understanding the Taint/Toleration-based pod eviction process

Red Hat OpenShift Container Platform (RHOCP) implements a taint/toleration-based pod eviction process, which determines how individual pods are evicted from worker nodes. Evicted pods are rescheduled on different nodes.

RHOCP assigns taints to specific worker node conditions, such as not-ready and unreachable. When a worker node experiences one of these conditions, RHOCP automatically taints that node. Whether a pod is then evicted depends on whether the pod tolerates the taint:

  • Any pod that does not tolerate the taint is evicted immediately.
  • Any pod that does tolerate the taint is never evicted, unless the pod has a limited toleration for the taint:

    The tolerationSeconds parameter specifies the limited toleration of a pod: how long the pod can tolerate a specific taint (node condition) and remain bound to the worker node. If the node condition still exists when the tolerationSeconds period expires, the taint remains on the worker node and the pod is evicted. If the node condition clears before the tolerationSeconds period expires, the pod is not evicted.

RHOCP adds a default five-minute toleration for the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable taints by setting tolerationSeconds=300.
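
As an illustration, you can verify the tolerations that a pod carries. The following sketch reuses the rabbitmq-server-0 pod from the earlier example; the exact toleration list depends on the pod and on the Operator that manages it, and the output shown here reflects only the RHOCP defaults described above:

    $ oc get pod rabbitmq-server-0 -o yaml | grep -A 10 tolerations
      tolerations:
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 300
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 300

You can check which taints are currently applied to a worker node by running oc describe node <node-name> and reviewing the Taints field.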

Important

Red Hat OpenStack Services on OpenShift (RHOSO) 18.0 Operators do not modify the default tolerations for taints. Therefore, pods that run on a tainted worker node take more than five minutes to be rescheduled.

Improving high availability and failover times

RHOCP provides the following Operators, which you can install and use together to increase resiliency and reduce workload failover times:

  • The Node Health Check Operator. For information, see Remediating Nodes with Node Health Checks in the Workload Availability for Red Hat OpenShift Remediation, fencing, and maintenance guide.
  • The Self Node Remediation Operator. For information, see Using Self Node remediation in the Workload Availability for Red Hat OpenShift Remediation, fencing, and maintenance guide.

    Note

    When a pod uses local storage and the worker node on which the pod is running fails, manual intervention might be required even when you use the Self Node Remediation Operator.

  • The Fence Agents Remediation Operator. For information, see Using Fence Agents Remediation in the Workload Availability for Red Hat OpenShift Remediation, fencing, and maintenance guide.

Installing and configuring the Node Health Check Operator together with the Self Node Remediation Operator, the Fence Agents Remediation Operator, or both allows these Operators to control how quickly a worker node is declared not-ready or unreachable, and thereby reduce the time that a pod remains bound to a failed worker node.
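
The following NodeHealthCheck resource is a minimal sketch of pairing the Node Health Check Operator with the Self Node Remediation Operator. It assumes that both Operators are already installed. The resource name, the remediation template name and namespace, and the durations are illustrative values; confirm the correct values for your Operator versions in the linked Workload Availability documentation:

    apiVersion: remediation.medik8s.io/v1alpha1
    kind: NodeHealthCheck
    metadata:
      name: nodehealthcheck-workers                  # illustrative name
    spec:
      minHealthy: 51%                                # pause remediation if fewer than 51% of the selected nodes are healthy
      selector:                                      # apply the check to worker nodes only
        matchExpressions:
        - key: node-role.kubernetes.io/worker
          operator: Exists
      unhealthyConditions:                           # how long a condition must persist before the node is remediated
      - type: Ready
        status: "False"
        duration: 300s
      - type: Ready
        status: Unknown
        duration: 300s
      remediationTemplate:                           # template provided by the Self Node Remediation Operator
        apiVersion: self-node-remediation.medik8s.io/v1alpha1
        kind: SelfNodeRemediationTemplate
        namespace: openshift-workload-availability   # depends on where the Operator is installed
        name: self-node-remediation-automatic-strategy-template

Shortening the duration values reduces the time before an unhealthy worker node is remediated, and therefore the time that pods such as the Galera and RabbitMQ replicas remain bound to a failed node.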
