2.3. Determining the state of the unhealthy etcd member
The steps to replace an unhealthy etcd member depend on which of the following states your etcd member is in:
- The machine is not running or the node is not ready
- The etcd pod is crashlooping
This procedure determines which state your etcd member is in. This enables you to know which procedure to follow to replace the unhealthy etcd member.
If you are aware that the machine is not running or the node is not ready, but you expect it to return to a healthy state soon, then you do not need to perform a procedure to replace the etcd member. The etcd cluster Operator will automatically sync when the machine or node returns to a healthy state.
Prerequisites
-
You have access to the cluster as a user with the
cluster-admin
role. - You have identified an unhealthy etcd member.
Procedure
Determine if the machine is not running:
$ oc get machines -A -ojsonpath='{range .items[*]}{@.status.nodeRef.name}{"\t"}{@.status.providerStatus.instanceState}{"\n"}' | grep -v running
Example output
ip-10-0-131-183.ec2.internal stopped 1
- 1
- This output lists the node and the status of the node’s machine. If the status is anything other than
running
, then the machine is not running.
If the machine is not running, then follow the Replacing an unhealthy etcd member whose machine is not running or whose node is not ready procedure.
Determine if the node is not ready.
If either of the following scenarios are true, then the node is not ready.
If the machine is running, then check whether the node is unreachable:
$ oc get nodes -o jsonpath='{range .items[*]}{"\n"}{.metadata.name}{"\t"}{range .spec.taints[*]}{.key}{" "}' | grep unreachable
Example output
ip-10-0-131-183.ec2.internal node-role.kubernetes.io/master node.kubernetes.io/unreachable node.kubernetes.io/unreachable 1
- 1
- If the node is listed with an
unreachable
taint, then the node is not ready.
If the node is still reachable, then check whether the node is listed as
NotReady
:$ oc get nodes -l node-role.kubernetes.io/master | grep "NotReady"
Example output
ip-10-0-131-183.ec2.internal NotReady master 122m v1.18.3 1
- 1
- If the node is listed as
NotReady
, then the node is not ready.
If the node is not ready, then follow the Replacing an unhealthy etcd member whose machine is not running or whose node is not ready procedure.
Determine if the etcd pod is crashlooping.
If the machine is running and the node is ready, then check whether the etcd pod is crashlooping.
Verify that all master nodes are listed as
Ready
:$ oc get nodes -l node-role.kubernetes.io/master
Example output
NAME STATUS ROLES AGE VERSION ip-10-0-131-183.ec2.internal Ready master 6h13m v1.18.3 ip-10-0-164-97.ec2.internal Ready master 6h13m v1.18.3 ip-10-0-154-204.ec2.internal Ready master 6h13m v1.18.3
Check whether the status of an etcd pod is either
Error
orCrashloopBackoff
:$ oc get pods -n openshift-etcd | grep etcd
Example output
etcd-ip-10-0-131-183.ec2.internal 2/3 Error 7 6h9m 1 etcd-ip-10-0-164-97.ec2.internal 3/3 Running 0 6h6m etcd-ip-10-0-154-204.ec2.internal 3/3 Running 0 6h6m
- 1
- Since this status of this pod is
Error
, then the etcd pod is crashlooping.
If the etcd pod is crashlooping, then follow the Replacing an unhealthy etcd member whose etcd pod is crashlooping procedure.