搜索

12.6. 在不健康集群中安装主 control plane 节点

download PDF

此流程描述了如何在不健康的 OpenShift Container Platform 集群上安装主 control plane 节点。

先决条件

  • 您已安装了一个具有至少三个节点的健康集群。
  • 您已创建了第 2 天 控制平面。
  • 您已创建了单个 control plane 节点。

流程

  1. 确认集群的初始状态:

    $ oc get nodes

    输出示例

    NAME       STATUS     ROLES    AGE   VERSION
    worker-1   Ready      worker   20h   v1.24.0+3882f8f
    master-2   NotReady   master   20h   v1.24.0+3882f8f
    master-3   Ready      master   20h   v1.24.0+3882f8f
    worker-4   Ready      worker   20h   v1.24.0+3882f8f
    master-5   Ready      master   15h   v1.24.0+3882f8f

  2. 确认 etcd-operator 检测到集群不健康:

    $ oc logs -n openshift-etcd-operator etcd-operator-8668df65d-lvpjf

    输出示例

    E0927 08:24:23.983733       1 base_controller.go:272] DefragController reconciliation failed: cluster is unhealthy: 2 of 3 members are available, worker-2 is unhealthy

  3. 确认 etcdctl 成员:

    $ oc rsh -n openshift-etcd etcd-worker-3
    etcdctl member list -w table

    输出示例

    +--------+---------+--------+--------------+--------------+---------+
    |   ID   | STATUS  |  NAME  |  PEER ADDRS  | CLIENT ADDRS | LEARNER |
    +--------+---------+--------+--------------+--------------+---------+
    |2c18942f| started |worker-3|192.168.111.26|192.168.111.26|  false  |
    |61e2a860| started |worker-2|192.168.111.25|192.168.111.25|  false  |
    |ead4f280| started |worker-5|192.168.111.28|192.168.111.28|  false  |
    +--------+---------+--------+--------------+--------------+---------+

  4. 确认 etcdctl 报告集群的不健康成员:

    $ etcdctl endpoint health

    输出示例

    {"level":"warn","ts":"2022-09-27T08:25:35.953Z","logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000680380/192.168.111.25","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 192.168.111.25: connect: no route to host\""}
    192.168.111.28 is healthy: committed proposal: took = 12.465641ms
    192.168.111.26 is healthy: committed proposal: took = 12.297059ms
    192.168.111.25 is unhealthy: failed to commit proposal: context deadline exceeded
    Error: unhealthy cluster

  5. 通过删除 Machine 自定义资源来删除不健康的 control plane:

    $ oc delete machine -n openshift-machine-api test-day2-1-6qv96-master-2
    注意

    如果不健康的集群无法成功运行,则不会删除 MachineNode 自定义资源 (CR)。

  6. 确认 etcd-operator 没有删除不健康的机器:

    $ oc logs -n openshift-etcd-operator etcd-operator-8668df65d-lvpjf -f

    输出示例

    I0927 08:58:41.249222       1 machinedeletionhooks.go:135] skip removing the deletion hook from machine test-day2-1-6qv96-master-2 since its member is still present with any of: [{InternalIP } {InternalIP 192.168.111.26}]

  7. 手动删除不健康的 etcdctl 成员:

    $ oc rsh -n openshift-etcd etcd-worker-3\
    etcdctl member list -w table

    输出示例

    +--------+---------+--------+--------------+--------------+---------+
    |   ID   |  STATUS |  NAME  |  PEER ADDRS  | CLIENT ADDRS | LEARNER |
    +--------+---------+--------+--------------+--------------+---------+
    |2c18942f| started |worker-3|192.168.111.26|192.168.111.26|  false  |
    |61e2a860| started |worker-2|192.168.111.25|192.168.111.25|  false  |
    |ead4f280| started |worker-5|192.168.111.28|192.168.111.28|  false  |
    +--------+---------+--------+--------------+--------------+---------+

  8. 确认 etcdctl 报告集群的不健康成员:

    $ etcdctl endpoint health

    输出示例

    {"level":"warn","ts":"2022-09-27T10:31:07.227Z","logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0000d6e00/192.168.111.25","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 192.168.111.25: connect: no route to host\""}
    192.168.111.28 is healthy: committed proposal: took = 13.038278ms
    192.168.111.26 is healthy: committed proposal: took = 12.950355ms
    192.168.111.25 is unhealthy: failed to commit proposal: context deadline exceeded
    Error: unhealthy cluster

  9. 通过删除 etcdctl 成员自定义资源来删除不健康的集群:

    $ etcdctl member remove 61e2a86084aafa62

    输出示例

    Member 61e2a86084aafa62 removed from cluster 6881c977b97990d7

  10. 运行以下命令确认 etcdctl 的成员:

    $ etcdctl member list -w table

    输出示例

    +----------+---------+--------+--------------+--------------+-------+
    |    ID    | STATUS  |  NAME  |  PEER ADDRS  | CLIENT ADDRS |LEARNER|
    +----------+---------+--------+--------------+--------------+-------+
    | 2c18942f | started |worker-3|192.168.111.26|192.168.111.26| false |
    | ead4f280 | started |worker-5|192.168.111.28|192.168.111.28| false |
    +----------+---------+--------+--------------+--------------+-------+

  11. 检查并批准证书签名请求

    1. 查看证书签名请求 (CSR):

      $ oc get csr | grep Pending

      输出示例

      csr-5sd59   8m19s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
      csr-xzqts   10s     kubernetes.io/kubelet-serving                 system:node:worker-6                                                   <none>              Pending

    2. 批准所有待处理的 CSR:

      $ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve
      注意

      您必须批准 CSR 才能完成安装。

  12. 确认 control plane 节点就绪状态:

    $ oc get nodes

    输出示例

    NAME       STATUS   ROLES    AGE     VERSION
    worker-1   Ready    worker   22h     v1.24.0+3882f8f
    master-3   Ready    master   22h     v1.24.0+3882f8f
    worker-4   Ready    worker   22h     v1.24.0+3882f8f
    master-5   Ready    master   17h     v1.24.0+3882f8f
    master-6   Ready    master   2m52s   v1.24.0+3882f8f

  13. 验证 Machine, NodeBareMetalHost 自定义资源。

    如果集群使用功能 Machine API 运行,etcd-operator 需要 Machine CR。存在时,Machine CR 会在 Running 阶段显示。

  14. 创建与 BareMetalHostNode 链接的 Machine 自定义资源。

    确保有 Machine CR 引用新添加的节点。

    重要

    boot-it-yourself 将不会创建 BareMetalHostMachine CR,因此您必须创建它们。如果无法创建 BareMetalHostMachine CR,在运行 etcd-operator 时会生成错误。

  15. 添加 BareMetalHost 自定义资源:

    $ oc create bmh -n openshift-machine-api custom-master3
  16. 添加 Machine 自定义资源:

    $ oc create machine -n openshift-machine-api custom-master3
  17. 运行 link-machine-and-node.sh 脚本链接 BareMetalHost, Machine, 和 Node

    #!/bin/bash
    
    # Credit goes to https://bugzilla.redhat.com/show_bug.cgi?id=1801238.
    # This script will link Machine object and Node object. This is needed
    # in order to have IP address of the Node present in the status of the Machine.
    
    # set -x
    set -e
    
    machine="$1"
    node="$2"
    
    if [ -z "$machine" ] || [ -z "$node" ]; then
        echo "Usage: $0 MACHINE NODE"
        exit 1
    fi
    
    # uid=$(echo "${node}" | cut -f1 -d':')
    node_name=$(echo "${node}" | cut -f2 -d':')
    
    oc proxy &
    proxy_pid=$!
    function kill_proxy {
        kill $proxy_pid
    }
    trap kill_proxy EXIT SIGINT
    
    HOST_PROXY_API_PATH="http://localhost:8001/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts"
    
    function print_nics() {
        local ips
        local eob
        declare -a ips
    
        readarray -t ips < <(echo "${1}" \
                             | jq '.[] | select(. | .type == "InternalIP") | .address' \
                             | sed 's/"//g')
    
        eob=','
        for (( i=0; i<${#ips[@]}; i++ )); do
            if [ $((i+1)) -eq ${#ips[@]} ]; then
                eob=""
            fi
            cat <<- EOF
              {
                "ip": "${ips[$i]}",
                "mac": "00:00:00:00:00:00",
                "model": "unknown",
                "speedGbps": 10,
                "vlanId": 0,
                "pxe": true,
                "name": "eth1"
              }${eob}
    EOF
        done
    }
    
    function wait_for_json() {
        local name
        local url
        local curl_opts
        local timeout
    
        local start_time
        local curr_time
        local time_diff
    
        name="$1"
        url="$2"
        timeout="$3"
        shift 3
        curl_opts="$@"
        echo -n "Waiting for $name to respond"
        start_time=$(date +%s)
        until curl -g -X GET "$url" "${curl_opts[@]}" 2> /dev/null | jq '.' 2> /dev/null > /dev/null; do
            echo -n "."
            curr_time=$(date +%s)
            time_diff=$((curr_time - start_time))
            if [[ $time_diff -gt $timeout ]]; then
                printf '\nTimed out waiting for %s' "${name}"
                return 1
            fi
            sleep 5
        done
        echo " Success!"
        return 0
    }
    wait_for_json oc_proxy "${HOST_PROXY_API_PATH}" 10 -H "Accept: application/json" -H "Content-Type: application/json"
    
    addresses=$(oc get node -n openshift-machine-api "${node_name}" -o json | jq -c '.status.addresses')
    
    machine_data=$(oc get machines.machine.openshift.io -n openshift-machine-api -o json "${machine}")
    host=$(echo "$machine_data" | jq '.metadata.annotations["metal3.io/BareMetalHost"]' | cut -f2 -d/ | sed 's/"//g')
    
    if [ -z "$host" ]; then
        echo "Machine $machine is not linked to a host yet." 1>&2
        exit 1
    fi
    
    # The address structure on the host doesn't match the node, so extract
    # the values we want into separate variables so we can build the patch
    # we need.
    hostname=$(echo "${addresses}" | jq '.[] | select(. | .type == "Hostname") | .address' | sed 's/"//g')
    
    set +e
    read -r -d '' host_patch << EOF
    {
      "status": {
        "hardware": {
          "hostname": "${hostname}",
          "nics": [
    $(print_nics "${addresses}")
          ],
          "systemVendor": {
            "manufacturer": "Red Hat",
            "productName": "product name",
            "serialNumber": ""
          },
          "firmware": {
            "bios": {
              "date": "04/01/2014",
              "vendor": "SeaBIOS",
              "version": "1.11.0-2.el7"
            }
          },
          "ramMebibytes": 0,
          "storage": [],
          "cpu": {
            "arch": "x86_64",
            "model": "Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz",
            "clockMegahertz": 2199.998,
            "count": 4,
            "flags": []
          }
        }
      }
    }
    EOF
    set -e
    
    echo "PATCHING HOST"
    echo "${host_patch}" | jq .
    
    curl -s \
         -X PATCH \
         "${HOST_PROXY_API_PATH}/${host}/status" \
         -H "Content-type: application/merge-patch+json" \
         -d "${host_patch}"
    
    oc get baremetalhost -n openshift-machine-api -o yaml "${host}"
    $ bash link-machine-and-node.sh custom-master3 worker-3
  18. 运行以下命令确认 etcdctl 的成员:

    $ oc rsh -n openshift-etcd etcd-worker-3
    etcdctl member list -w table

    输出示例

    +---------+-------+--------+--------------+--------------+-------+
    |   ID    | STATUS|  NAME  |   PEER ADDRS | CLIENT ADDRS |LEARNER|
    +---------+-------+--------+--------------+--------------+-------+
    | 2c18942f|started|worker-3|192.168.111.26|192.168.111.26| false |
    | ead4f280|started|worker-5|192.168.111.28|192.168.111.28| false |
    | 79153c5a|started|worker-6|192.168.111.29|192.168.111.29| false |
    +---------+-------+--------+--------------+--------------+-------+

  19. 确认 etcd Operator 已配置了所有节点:

    $ oc get clusteroperator etcd

    输出示例

    NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
    etcd   4.11.5    True        False         False      22h

  20. 确认 etcdctl 的健康状况:

    $ oc rsh -n openshift-etcd etcd-worker-3
    etcdctl endpoint health

    输出示例

    192.168.111.26 is healthy: committed proposal: took = 9.105375ms
    192.168.111.28 is healthy: committed proposal: took = 9.15205ms
    192.168.111.29 is healthy: committed proposal: took = 10.277577ms

  21. 确认节点的健康状况:

    $ oc get Nodes

    输出示例

    NAME       STATUS   ROLES    AGE   VERSION
    worker-1   Ready    worker   22h   v1.24.0+3882f8f
    master-3   Ready    master   22h   v1.24.0+3882f8f
    worker-4   Ready    worker   22h   v1.24.0+3882f8f
    master-5   Ready    master   18h   v1.24.0+3882f8f
    master-6   Ready    master   40m   v1.24.0+3882f8f

  22. 确认 ClusterOperators 的健康状况:

    $ oc get ClusterOperators

    输出示例

    NAME                               VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
    authentication                     4.11.5    True        False         False      150m
    baremetal                          4.11.5    True        False         False      22h
    cloud-controller-manager           4.11.5    True        False         False      22h
    cloud-credential                   4.11.5    True        False         False      22h
    cluster-autoscaler                 4.11.5    True        False         False      22h
    config-operator                    4.11.5    True        False         False      22h
    console                            4.11.5    True        False         False      145m
    csi-snapshot-controller            4.11.5    True        False         False      22h
    dns                                4.11.5    True        False         False      22h
    etcd                               4.11.5    True        False         False      22h
    image-registry                     4.11.5    True        False         False      22h
    ingress                            4.11.5    True        False         False      22h
    insights                           4.11.5    True        False         False      22h
    kube-apiserver                     4.11.5    True        False         False      22h
    kube-controller-manager            4.11.5    True        False         False      22h
    kube-scheduler                     4.11.5    True        False         False      22h
    kube-storage-version-migrator      4.11.5    True        False         False      148m
    machine-api                        4.11.5    True        False         False      22h
    machine-approver                   4.11.5    True        False         False      22h
    machine-config                     4.11.5    True        False         False      110m
    marketplace                        4.11.5    True        False         False      22h
    monitoring                         4.11.5    True        False         False      22h
    network                            4.11.5    True        False         False      22h
    node-tuning                        4.11.5    True        False         False      22h
    openshift-apiserver                4.11.5    True        False         False      163m
    openshift-controller-manager       4.11.5    True        False         False      22h
    openshift-samples                  4.11.5    True        False         False      22h
    operator-lifecycle-manager         4.11.5    True        False         False      22h
    operator-lifecycle-manager-catalog 4.11.5    True        False         False      22h
    operator-lifecycle-manager-pkgsvr  4.11.5    True        False         False      22h
    service-ca                         4.11.5    True        False         False      22h
    storage                            4.11.5    True        False         False      22h

  23. 确认 ClusterVersion

    $ oc get ClusterVersion

    输出示例

    NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.11.5    True        False         22h     Cluster version is 4.11.5

Red Hat logoGithubRedditYoutubeTwitter

学习

尝试、购买和销售

社区

关于红帽文档

通过我们的产品和服务,以及可以信赖的内容,帮助红帽用户创新并实现他们的目标。

让开源更具包容性

红帽致力于替换我们的代码、文档和 Web 属性中存在问题的语言。欲了解更多详情,请参阅红帽博客.

關於紅帽

我们提供强化的解决方案,使企业能够更轻松地跨平台和环境(从核心数据中心到网络边缘)工作。

© 2024 Red Hat, Inc.