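11.6. Replacing a control plane node in an unhealthy cluster

You can replace an unhealthy control plane (master) node in an OpenShift Container Platform cluster by removing the unhealthy control plane node and adding a new one.

For details on replacing a control plane node in a healthy cluster, see Replacing a control plane node in a healthy cluster.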

11.6.1. Removing an unhealthy control plane node

Remove the unhealthy control plane node from the cluster. In the following example, this is node-0.

Prerequisites

  • You have installed a cluster with at least three control plane nodes.
  • At least one of the control plane nodes is not ready.
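
Both prerequisites can be read off the output of oc get nodes. The following is a minimal sketch and not part of the documented procedure: the function name check_control_plane is hypothetical, and it assumes the default column layout shown in this section (NAME, STATUS, ROLES, ...) with control plane nodes carrying master in the ROLES column.

```shell
#!/bin/bash
# Hypothetical helper (not part of the product): reads `oc get nodes`
# output on stdin and reports how many control plane nodes exist and
# how many of them are not Ready.
check_control_plane() {
    awk '
        NR == 1 { next }                         # skip the header row
        $3 ~ /master/ {                          # ROLES column contains "master"
            total++
            if ($2 !~ /^Ready/) notready++       # STATUS column, e.g. NotReady
        }
        END {
            printf "control-plane=%d not-ready=%d\n", total, notready
            # exit 0 only when both prerequisites hold
            exit !(total >= 3 && notready >= 1)
        }
    '
}

# Usage (assumes a logged-in oc session):
#   oc get nodes | check_control_plane
```

The function only inspects text, so it can be tried against saved command output before running it against a live cluster.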

Procedure

  1. Check the node status to confirm that a control plane node is not ready:

    $ oc get nodes

    Example output

    NAME      STATUS      ROLES    AGE   VERSION
    node-0    NotReady    master   20h   v1.24.0+3882f8f
    node-1    Ready       master   20h   v1.24.0+3882f8f
    node-2    Ready       master   20h   v1.24.0+3882f8f
    node-3    Ready       worker   20h   v1.24.0+3882f8f
    node-4    Ready       worker   20h   v1.24.0+3882f8f

  2. Confirm in the etcd-operator logs that the cluster is unhealthy:

    $ oc logs -n openshift-etcd-operator deployment/etcd-operator

    Example output

    E0927 08:24:23.983733       1 base_controller.go:272] DefragController reconciliation failed: cluster is unhealthy: 2 of 3 members are available, node-0 is unhealthy

  3. Confirm the etcd members by running the following commands:

    1. Open a remote shell session to the control plane node:

      $ oc rsh -n openshift-etcd node-1
    2. List the etcdctl members:

      # etcdctl member list -w table

      Example output

      +--------+---------+--------+--------------+--------------+---------+
      |   ID   | STATUS  |  NAME  |  PEER ADDRS  | CLIENT ADDRS | LEARNER |
      +--------+---------+--------+--------------+--------------+---------+
      |61e2a860| started | node-0 |192.168.111.25|192.168.111.25|  false  |
      |2c18942f| started | node-1 |192.168.111.26|192.168.111.26|  false  |
      |ead4f280| started | node-2 |192.168.111.28|192.168.111.28|  false  |
      +--------+---------+--------+--------------+--------------+---------+

  4. Confirm that etcdctl endpoint health reports an unhealthy member of the cluster:

    # etcdctl endpoint health

    Example output

    {"level":"warn","ts":"2022-09-27T08:25:35.953Z","logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000680380/192.168.111.25","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 192.168.111.25: connect: no route to host\""}
    192.168.111.28 is healthy: committed proposal: took = 12.465641ms
    192.168.111.26 is healthy: committed proposal: took = 12.297059ms
    192.168.111.25 is unhealthy: failed to commit proposal: context deadline exceeded
    Error: unhealthy cluster

  5. Remove the unhealthy control plane by deleting the Machine custom resource (CR):

    $ oc delete machine -n openshift-machine-api node-0

    Note

    The Machine and Node CRs might not be deleted because they are protected by finalizers. If this occurs, you must delete the Machine CR manually by removing all finalizers.

  6. Verify in the etcd-operator logs whether the unhealthy machine has been removed:

    $ oc logs -n openshift-etcd-operator deployment/etcd-operator

    Example output

    I0927 08:58:41.249222       1 machinedeletionhooks.go:135] skip removing the deletion hook from machine node-0 since its member is still present with any of: [{InternalIP } {InternalIP 192.168.111.25}]

  7. If you see that the removal has been skipped, as in the example above, manually remove the unhealthy etcdctl member:

    1. Open a remote shell session to the control plane node:

      $ oc rsh -n openshift-etcd node-1
    2. List the etcdctl members:

      # etcdctl member list -w table

      Example output

      +--------+---------+--------+--------------+--------------+---------+
      |   ID   |  STATUS |  NAME  |  PEER ADDRS  | CLIENT ADDRS | LEARNER |
      +--------+---------+--------+--------------+--------------+---------+
      |61e2a860| started | node-0 |192.168.111.25|192.168.111.25|  false  |
      |2c18942f| started | node-1 |192.168.111.26|192.168.111.26|  false  |
      |ead4f280| started | node-2 |192.168.111.28|192.168.111.28|  false  |
      +--------+---------+--------+--------------+--------------+---------+

    3. Confirm that etcdctl endpoint health reports an unhealthy member of the cluster:

      # etcdctl endpoint health

      Example output

      {"level":"warn","ts":"2022-09-27T10:31:07.227Z","logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0000d6e00/192.168.111.25","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 192.168.111.25: connect: no route to host\""}
      192.168.111.28 is healthy: committed proposal: took = 13.038278ms
      192.168.111.26 is healthy: committed proposal: took = 12.950355ms
      192.168.111.25 is unhealthy: failed to commit proposal: context deadline exceeded
      Error: unhealthy cluster

    4. Remove the unhealthy etcdctl member from the cluster:

      # etcdctl member remove 61e2a86084aafa62

      Example output

      Member 61e2a86084aafa62 removed from cluster 6881c977b97990d7

    5. Verify that the unhealthy etcdctl member was removed by running the following command:

      # etcdctl member list -w table

      Example output

      +----------+---------+--------+--------------+--------------+-------+
      |    ID    | STATUS  |  NAME  |  PEER ADDRS  | CLIENT ADDRS |LEARNER|
      +----------+---------+--------+--------------+--------------+-------+
      | 2c18942f | started | node-1 |192.168.111.26|192.168.111.26| false |
      | ead4f280 | started | node-2 |192.168.111.28|192.168.111.28| false |
      +----------+---------+--------+--------------+--------------+-------+
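
The example tables in this section show shortened member IDs, while the etcdctl member remove example uses the full ID 61e2a86084aafa62. The following is a minimal sketch, not part of the documented procedure, for looking up the full ID by member name. It assumes the default comma-separated output of etcdctl member list (ID, status, name, peer URLs, client URLs, learner flag) rather than the table format used above; member_id_by_name is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: reads `etcdctl member list` output in the default
# comma-separated format on stdin and prints the ID of the named member.
# Exits non-zero when no member with that name is listed.
member_id_by_name() {
    local name="$1"
    awk -F', ' -v name="$name" '
        $3 == name { print $1; found = 1 }   # third field is the member name
        END { exit !found }
    '
}

# Usage inside the etcd container (illustrative only):
#   id=$(etcdctl member list | member_id_by_name node-0) && etcdctl member remove "$id"
```

Because the lookup fails with a non-zero status when the name is absent, chaining it with && avoids calling member remove with an empty ID.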

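11.6.2. Adding a new control plane node

Add a new control plane node to replace the unhealthy node that you removed. In the following example, the new node is node-5.

Prerequisites

Procedure

  1. Retrieve pending certificate signing requests (CSRs) for the new day 2 control plane node: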

    $ oc get csr | grep Pending

    Example output

    csr-5sd59   8m19s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
    csr-xzqts   10s     kubernetes.io/kubelet-serving                 system:node:node-5                                                   <none>              Pending

  2. Approve all pending CSRs for the new node (node-5 in this example):

    $ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve

    Note

    You must approve the CSRs to complete the installation.

  3. Confirm that the control plane node is in the Ready state:

    $ oc get nodes

    Example output

    NAME      STATUS    ROLES     AGE     VERSION
    node-1    Ready     master    20h     v1.24.0+3882f8f
    node-2    Ready     master    20h     v1.24.0+3882f8f
    node-3    Ready     worker    20h     v1.24.0+3882f8f
    node-4    Ready     worker    20h     v1.24.0+3882f8f
    node-5    Ready     master    2m52s   v1.24.0+3882f8f

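    The etcd operator requires a Machine CR that references the new node when the cluster runs with the Machine API. The Machine API is automatically activated when the cluster has three control plane nodes.

  4. Create the BareMetalHost and Machine CRs and link them to the Node CR of the new control plane node.

    Important

    Boot-it-yourself does not create the BareMetalHost and Machine CRs, so you must create them. Failure to create the BareMetalHost and Machine CRs generates errors in the etcd operator.

    1. Create the BareMetalHost CR, using a unique .metadata.name value: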

      apiVersion: metal3.io/v1alpha1
      kind: BareMetalHost
      metadata:
        name: node-5
        namespace: openshift-machine-api
      spec:
        automatedCleaningMode: metadata
        bootMACAddress: 00:00:00:00:00:02
        bootMode: UEFI
        customDeploy:
          method: install_coreos
        externallyProvisioned: true
        online: true
        userData:
          name: master-user-data-managed
          namespace: openshift-machine-api
    2. Apply the BareMetalHost CR:

      $ oc apply -f <filename>

      Replace <filename> with the name of the file that contains the BareMetalHost CR.
    3. Create the Machine CR, using a unique .metadata.name value:

      apiVersion: machine.openshift.io/v1beta1
      kind: Machine
      metadata:
        annotations:
          machine.openshift.io/instance-state: externally provisioned
          metal3.io/BareMetalHost: openshift-machine-api/node-5
        finalizers:
        - machine.machine.openshift.io
        labels:
          machine.openshift.io/cluster-api-cluster: test-day2-1-6qv96
          machine.openshift.io/cluster-api-machine-role: master
          machine.openshift.io/cluster-api-machine-type: master
        name: node-5
        namespace: openshift-machine-api
      spec:
        metadata: {}
        providerSpec:
          value:
            apiVersion: baremetal.cluster.k8s.io/v1alpha1
            customDeploy:
              method: install_coreos
            hostSelector: {}
            image:
              checksum: ""
              url: ""
            kind: BareMetalMachineProviderSpec
            metadata:
              creationTimestamp: null
            userData:
              name: master-user-data-managed
    4. Apply the Machine CR:

      $ oc apply -f <filename>

      Replace <filename> with the name of the file that contains the Machine CR.
    5. Link the BareMetalHost, Machine, and Node by running the link-machine-and-node.sh script:

      1. Copy the following link-machine-and-node.sh script to a local machine:

        #!/bin/bash
        
        # Credit goes to
        # https://bugzilla.redhat.com/show_bug.cgi?id=1801238.
        # This script will link Machine object
        # and Node object. This is needed
        # in order to have IP address of
        # the Node present in the status of the Machine.
        
        set -e
        
        machine="$1"
        node="$2"
        
        if [ -z "$machine" ] || [ -z "$node" ]; then
            echo "Usage: $0 MACHINE NODE"
            exit 1
        fi
        
        node_name=$(echo "${node}" | cut -f2 -d':')
        
        oc proxy &
        proxy_pid=$!
        function kill_proxy {
            kill $proxy_pid
        }
        trap kill_proxy EXIT SIGINT
        
        HOST_PROXY_API_PATH="http://localhost:8001/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts"
        
        function print_nics() {
            local ips
            local eob
            declare -a ips
        
            readarray -t ips < <(echo "${1}" \
                                 | jq '.[] | select(. | .type == "InternalIP") | .address' \
                                 | sed 's/"//g')
        
            eob=','
            for (( i=0; i<${#ips[@]}; i++ )); do
                if [ $((i+1)) -eq ${#ips[@]} ]; then
                    eob=""
                fi
                cat <<- EOF
                  {
                    "ip": "${ips[$i]}",
                    "mac": "00:00:00:00:00:00",
                    "model": "unknown",
                    "speedGbps": 10,
                    "vlanId": 0,
                    "pxe": true,
                    "name": "eth1"
                  }${eob}
        EOF
            done
        }
        
        function wait_for_json() {
            local name
            local url
            local curl_opts
            local timeout
        
            local start_time
            local curr_time
            local time_diff
        
            name="$1"
            url="$2"
            timeout="$3"
            shift 3
            curl_opts=("$@")  # keep the remaining arguments as an array so each curl option stays a separate word
            echo -n "Waiting for $name to respond"
            start_time=$(date +%s)
            until curl -g -X GET "$url" "${curl_opts[@]}" 2> /dev/null | jq '.' 2> /dev/null > /dev/null; do
                echo -n "."
                curr_time=$(date +%s)
                time_diff=$((curr_time - start_time))
                if [[ $time_diff -gt $timeout ]]; then
                    printf '\nTimed out waiting for %s' "${name}"
                    return 1
                fi
                sleep 5
            done
            echo " Success!"
            return 0
        }
        wait_for_json oc_proxy "${HOST_PROXY_API_PATH}" 10 -H "Accept: application/json" -H "Content-Type: application/json"
        
        addresses=$(oc get node -n openshift-machine-api "${node_name}" -o json | jq -c '.status.addresses')
        
        machine_data=$(oc get machines.machine.openshift.io -n openshift-machine-api -o json "${machine}")
        host=$(echo "$machine_data" | jq '.metadata.annotations["metal3.io/BareMetalHost"]' | cut -f2 -d/ | sed 's/"//g')
        
        if [ -z "$host" ]; then
            echo "Machine $machine is not linked to a host yet." 1>&2
            exit 1
        fi
        
        # The address structure on the host doesn't match the node, so extract
        # the values we want into separate variables so we can build the patch
        # we need.
        hostname=$(echo "${addresses}" | jq '.[] | select(. | .type == "Hostname") | .address' | sed 's/"//g')
        
        set +e
        read -r -d '' host_patch << EOF
        {
          "status": {
            "hardware": {
              "hostname": "${hostname}",
              "nics": [
        $(print_nics "${addresses}")
              ],
              "systemVendor": {
                "manufacturer": "Red Hat",
                "productName": "product name",
                "serialNumber": ""
              },
              "firmware": {
                "bios": {
                  "date": "04/01/2014",
                  "vendor": "SeaBIOS",
                  "version": "1.11.0-2.el7"
                }
              },
              "ramMebibytes": 0,
              "storage": [],
              "cpu": {
                "arch": "x86_64",
                "model": "Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz",
                "clockMegahertz": 2199.998,
                "count": 4,
                "flags": []
              }
            }
          }
        }
        EOF
        set -e
        
        echo "PATCHING HOST"
        echo "${host_patch}" | jq .
        
        curl -s \
             -X PATCH \
             "${HOST_PROXY_API_PATH}/${host}/status" \
             -H "Content-type: application/merge-patch+json" \
             -d "${host_patch}"
        
        oc get baremetalhost -n openshift-machine-api -o yaml "${host}"
      2. Make the script executable:

        $ chmod +x link-machine-and-node.sh
      3. Run the script:

        $ bash link-machine-and-node.sh node-5 node-5

        Note

        The first node-5 instance represents the machine, and the second represents the node.

  5. Confirm the etcd members by running the following commands:

    1. Open a remote shell session to the control plane node:

      $ oc rsh -n openshift-etcd node-1
    2. List the etcdctl members:

      # etcdctl member list -w table

      Example output

      +---------+-------+--------+--------------+--------------+-------+
      |   ID    | STATUS|  NAME  |   PEER ADDRS | CLIENT ADDRS |LEARNER|
      +---------+-------+--------+--------------+--------------+-------+
      | 2c18942f|started| node-1 |192.168.111.26|192.168.111.26| false |
      | ead4f280|started| node-2 |192.168.111.28|192.168.111.28| false |
      | 79153c5a|started| node-5 |192.168.111.29|192.168.111.29| false |
      +---------+-------+--------+--------------+--------------+-------+

  6. Monitor the etcd operator configuration process until completion:

    $ oc get clusteroperator etcd

    Example output (upon completion)

    NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
    etcd   4.11.5    True        False         False      22h

  7. Confirm etcd health by running the following commands:

    1. Open a remote shell session to the control plane node:

      $ oc rsh -n openshift-etcd node-1
    2. Check the endpoint health:

      # etcdctl endpoint health

      Example output

      192.168.111.26 is healthy: committed proposal: took = 9.105375ms
      192.168.111.28 is healthy: committed proposal: took = 9.15205ms
      192.168.111.29 is healthy: committed proposal: took = 10.277577ms

  8. Confirm the health of the nodes:

    $ oc get Nodes

    Example output

    NAME     STATUS   ROLES    AGE   VERSION
    node-1   Ready    master   20h   v1.24.0+3882f8f
    node-2   Ready    master   20h   v1.24.0+3882f8f
    node-3   Ready    worker   20h   v1.24.0+3882f8f
    node-4   Ready    worker   20h   v1.24.0+3882f8f
    node-5   Ready    master   40m   v1.24.0+3882f8f

  9. Verify that the cluster Operators are all available:

    $ oc get ClusterOperators

    Example output

    NAME                               VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
    authentication                     4.11.5    True        False         False      150m
    baremetal                          4.11.5    True        False         False      22h
    cloud-controller-manager           4.11.5    True        False         False      22h
    cloud-credential                   4.11.5    True        False         False      22h
    cluster-autoscaler                 4.11.5    True        False         False      22h
    config-operator                    4.11.5    True        False         False      22h
    console                            4.11.5    True        False         False      145m
    csi-snapshot-controller            4.11.5    True        False         False      22h
    dns                                4.11.5    True        False         False      22h
    etcd                               4.11.5    True        False         False      22h
    image-registry                     4.11.5    True        False         False      22h
    ingress                            4.11.5    True        False         False      22h
    insights                           4.11.5    True        False         False      22h
    kube-apiserver                     4.11.5    True        False         False      22h
    kube-controller-manager            4.11.5    True        False         False      22h
    kube-scheduler                     4.11.5    True        False         False      22h
    kube-storage-version-migrator      4.11.5    True        False         False      148m
    machine-api                        4.11.5    True        False         False      22h
    machine-approver                   4.11.5    True        False         False      22h
    machine-config                     4.11.5    True        False         False      110m
    marketplace                        4.11.5    True        False         False      22h
    monitoring                         4.11.5    True        False         False      22h
    network                            4.11.5    True        False         False      22h
    node-tuning                        4.11.5    True        False         False      22h
    openshift-apiserver                4.11.5    True        False         False      163m
    openshift-controller-manager       4.11.5    True        False         False      22h
    openshift-samples                  4.11.5    True        False         False      22h
    operator-lifecycle-manager         4.11.5    True        False         False      22h
    operator-lifecycle-manager-catalog 4.11.5    True        False         False      22h
    operator-lifecycle-manager-pkgsvr  4.11.5    True        False         False      22h
    service-ca                         4.11.5    True        False         False      22h
    storage                            4.11.5    True        False         False      22h

  10. Verify that the cluster version is correct:

    $ oc get ClusterVersion

    Example output

    NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.11.5    True        False         22h     Cluster version is 4.11.5
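
The Operator check in step 9 means scanning more than thirty rows by eye. The following is a minimal sketch, not part of the documented procedure, that lists any Operator whose row is not Available=True, Progressing=False, Degraded=False. The function name unsettled_operators is hypothetical, and the sketch assumes the column layout shown above with a non-empty VERSION value in every row.

```shell
#!/bin/bash
# Hypothetical helper: reads `oc get clusteroperators` output on stdin and
# prints the name of every Operator that is not Available=True,
# Progressing=False, Degraded=False. Exits non-zero if any are found.
unsettled_operators() {
    awk '
        NR == 1 { next }    # skip the header row
        !($3 == "True" && $4 == "False" && $5 == "False") {
            print $1
            bad = 1
        }
        END { exit bad + 0 }
    '
}

# Usage (assumes a logged-in oc session):
#   oc get clusteroperators | unsettled_operators && echo "all Operators settled"
```

An empty result with a zero exit status corresponds to the all-True, all-False columns in the example output for step 9.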