2.3. Determining the state of the unhealthy etcd member


The steps to replace an unhealthy etcd member depend on which of the following states your etcd member is in:

  • The machine is not running or the node is not ready
  • The etcd pod is crashlooping

This procedure determines which state your etcd member is in, so that you know which procedure to follow to replace the unhealthy etcd member.
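The mapping from state to replacement procedure can be sketched as a small shell function. This is only an illustration: `choose_procedure` and its three yes/no arguments are hypothetical names, not part of `oc`, and the final fallback message is an assumption because this section does not describe a third outcome.

```shell
#!/bin/sh
# Hypothetical helper: takes three yes/no answers gathered from the checks
# in this section and prints the name of the procedure to follow.
#   $1 = is the machine running?   $2 = is the node ready?
#   $3 = is the etcd pod crashlooping?
choose_procedure() {
  machine_running=$1
  node_ready=$2
  pod_crashlooping=$3

  if [ "$machine_running" != yes ] || [ "$node_ready" != yes ]; then
    echo "Replacing an unhealthy etcd member whose machine is not running or whose node is not ready"
  elif [ "$pod_crashlooping" = yes ]; then
    echo "Replacing an unhealthy etcd member whose etcd pod is crashlooping"
  else
    # Assumption: neither state applies, so this section names no procedure.
    echo "No matching state"
  fi
}

# Example: machine running, node ready, etcd pod crashlooping.
choose_procedure yes yes yes
```

The machine and node checks come first because they share one replacement procedure; the pod check only matters when both of them pass.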

Note

If you are aware that the machine is not running or the node is not ready, but you expect it to return to a healthy state soon, then you do not need to perform a procedure to replace the etcd member. The etcd cluster Operator will automatically sync when the machine or node returns to a healthy state.

Prerequisites

  • You have access to the cluster as a user with the cluster-admin role.
  • You have identified an unhealthy etcd member.

Procedure

  1. Determine if the machine is not running:

    $ oc get machines -A -ojsonpath='{range .items[*]}{@.status.nodeRef.name}{"\t"}{@.status.providerStatus.instanceState}{"\n"}' | grep -v running

    Example output

    ip-10-0-131-183.ec2.internal  stopped 1

    1
    This output lists the node and the status of the node’s machine. If the status is anything other than running, then the machine is not running.

    If the machine is not running, then follow the Replacing an unhealthy etcd member whose machine is not running or whose node is not ready procedure.

  2. Determine if the node is not ready.

    If either of the following scenarios is true, then the node is not ready.

    • If the machine is running, then check whether the node is unreachable:

      $ oc get nodes -o jsonpath='{range .items[*]}{"\n"}{.metadata.name}{"\t"}{range .spec.taints[*]}{.key}{" "}' | grep unreachable

      Example output

      ip-10-0-131-183.ec2.internal	node-role.kubernetes.io/master node.kubernetes.io/unreachable node.kubernetes.io/unreachable 1

      1
      If the node is listed with an unreachable taint, then the node is not ready.
    • If the node is still reachable, then check whether the node is listed as NotReady:

      $ oc get nodes -l node-role.kubernetes.io/master | grep "NotReady"

      Example output

      ip-10-0-131-183.ec2.internal   NotReady   master   122m   v1.18.3 1

      1
      If the node is listed as NotReady, then the node is not ready.

    If the node is not ready, then follow the Replacing an unhealthy etcd member whose machine is not running or whose node is not ready procedure.

  3. Determine if the etcd pod is crashlooping.

    If the machine is running and the node is ready, then check whether the etcd pod is crashlooping.

    1. Verify that all master nodes are listed as Ready:

      $ oc get nodes -l node-role.kubernetes.io/master

      Example output

      NAME                           STATUS   ROLES    AGE     VERSION
      ip-10-0-131-183.ec2.internal   Ready    master   6h13m   v1.18.3
      ip-10-0-164-97.ec2.internal    Ready    master   6h13m   v1.18.3
      ip-10-0-154-204.ec2.internal   Ready    master   6h13m   v1.18.3

    2. Check whether the status of an etcd pod is either Error or CrashLoopBackOff:

      $ oc get pods -n openshift-etcd | grep etcd

      Example output

      etcd-ip-10-0-131-183.ec2.internal                2/3     Error       7          6h9m 1
      etcd-ip-10-0-164-97.ec2.internal                 3/3     Running     0          6h6m
      etcd-ip-10-0-154-204.ec2.internal                3/3     Running     0          6h6m

      1
      Because the status of this pod is Error, the etcd pod is crashlooping.

    If the etcd pod is crashlooping, then follow the Replacing an unhealthy etcd member whose etcd pod is crashlooping procedure.
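The `grep` filters in the steps above match anywhere on a line. As a rough sketch, the same checks can be written as `awk` filters that test specific columns instead. The function names below are hypothetical, the functions only read text on standard input (they do not call `oc` themselves), and the column positions assume the output layouts shown in the examples in this section.

```shell
#!/bin/sh
# Hypothetical helpers: each reads command output on stdin and prints
# only the entries that indicate a problem.

# Step 1: machines whose instance state (second tab-separated field)
# is anything other than "running".
machines_not_running() {
  awk -F '\t' 'NF >= 2 && $2 != "running" { print $1 "\t" $2 }'
}

# Step 2a: nodes that carry the node.kubernetes.io/unreachable taint.
nodes_unreachable() {
  awk '/node\.kubernetes\.io\/unreachable/ { print $1 }'
}

# Step 2b: nodes whose STATUS column starts with NotReady.
nodes_not_ready() {
  awk '$2 ~ /^NotReady/ { print $1 }'
}

# Step 3: etcd pods whose STATUS column is Error or CrashLoopBackOff.
etcd_pods_crashlooping() {
  awk '$1 ~ /^etcd-/ && ($3 == "Error" || $3 == "CrashLoopBackOff") { print $1 "\t" $3 }'
}

# Possible usage, piping the commands from this procedure into the filters:
#   oc get machines -A -ojsonpath='...' | machines_not_running
#   oc get nodes -o jsonpath='...'      | nodes_unreachable
#   oc get nodes -l node-role.kubernetes.io/master | nodes_not_ready
#   oc get pods -n openshift-etcd       | etcd_pods_crashlooping
```

Each filter prints nothing when no entry matches, so empty output from all four suggests that none of the states described in this section applies.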


© 2024 Red Hat, Inc.