11.5. 替换健康集群中的 control plane 节点


您可以通过添加新的 control plane 节点并删除现有的 control plane 节点,来替换健康的 OpenShift Container Platform 集群中的 control plane (master)节点,它有三个到五个 control plane 节点。

如果集群不健康,您必须在管理 control plane 节点前执行额外的操作。如需更多信息,请参阅在不健康的集群中替换 control plane 节点

11.5.1. 添加新 control plane 节点

添加新的 control plane 节点,并验证其状态是否健康。在以下示例中,新节点为 node-5

先决条件

  • 您使用 OpenShift Container Platform 4.11 或更高版本。
  • 已安装一个带有至少三个 control plane 节点的健康集群。
  • 您已创建了单个 control plane 节点,用于第 2 天。

流程

  1. 为新的第 2 天 control plane 节点检索待处理的证书签名请求(CSR):

    $ oc get csr | grep Pending
    Copy to Clipboard Toggle word wrap

    输出示例

    csr-5sd59   8m19s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
    csr-xzqts   10s     kubernetes.io/kubelet-serving                 system:node:node-5                                                   <none>              Pending
    Copy to Clipboard Toggle word wrap

  2. 批准新节点的所有待处理的 CSR (本例中为node-5 ):

    $ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve
    Copy to Clipboard Toggle word wrap
    重要

    您必须批准 CSR 才能完成安装。

  3. 确认新的 control plane 节点处于 Ready 状态:

    $ oc get nodes
    Copy to Clipboard Toggle word wrap

    输出示例

    NAME       STATUS   ROLES    AGE     VERSION
    node-0   Ready    master   4h42m   v1.24.0+3882f8f
    node-1   Ready    master   4h27m   v1.24.0+3882f8f
    node-2   Ready    master   4h43m   v1.24.0+3882f8f
    node-3   Ready    worker   4h29m   v1.24.0+3882f8f
    node-4   Ready    worker   4h30m   v1.24.0+3882f8f
    node-5   Ready    master   105s    v1.24.0+3882f8f
    Copy to Clipboard Toggle word wrap

    注意

    etcd Operator 需要 Machine 自定义资源(CR),在集群使用 Machine API 运行时引用新节点。当集群有三个或更多 control plane 节点时,机器 API 会被自动激活。

  4. 创建 BareMetalHostMachine CR,并将其链接到新的 control plane 节点 CR。

    1. 使用具有唯一 .metadata.name 值的 BareMetalHost CR:

      apiVersion: metal3.io/v1alpha1
      kind: BareMetalHost
      metadata:
        name: node-5
        namespace: openshift-machine-api
      spec:
        automatedCleaningMode: metadata
        bootMACAddress: 00:00:00:00:00:02
        bootMode: UEFI
        customDeploy:
          method: install_coreos
        externallyProvisioned: true
        online: true
        userData:
          name: master-user-data-managed
          namespace: openshift-machine-api
      Copy to Clipboard Toggle word wrap
    2. 应用 BareMetalHost CR:

      $ oc apply -f <filename> 
      1
      Copy to Clipboard Toggle word wrap
      1
      将 <filename> 替换为 BareMetalHost CR 的名称。
    3. 使用唯一的 .metadata.name 值创建 Machine CR:

      apiVersion: machine.openshift.io/v1beta1
      kind: Machine
      metadata:
        annotations:
          machine.openshift.io/instance-state: externally provisioned
          metal3.io/BareMetalHost: openshift-machine-api/node-5
        finalizers:
        - machine.machine.openshift.io
        labels:
          machine.openshift.io/cluster-api-cluster: <cluster_name> 
      1
      
          machine.openshift.io/cluster-api-machine-role: master
          machine.openshift.io/cluster-api-machine-type: master
        name: node-5
        namespace: openshift-machine-api
      spec:
        metadata: {}
        providerSpec:
          value:
            apiVersion: baremetal.cluster.k8s.io/v1alpha1
            customDeploy:
              method: install_coreos
            hostSelector: {}
            image:
              checksum: ""
              url: ""
            kind: BareMetalMachineProviderSpec
            metadata:
              creationTimestamp: null
            userData:
              name: master-user-data-managed
      Copy to Clipboard Toggle word wrap
      1
      <cluster_name> 替换为特定集群的名称,如 test-day2-1-6qv96

      要获取集群名称,请运行以下命令:

      $ oc get infrastructure cluster -o=jsonpath='{.status.infrastructureName}{"\n"}'
      Copy to Clipboard Toggle word wrap
    4. 应用 Machine CR:

      $ oc apply -f <filename> 
      1
      Copy to Clipboard Toggle word wrap
      1
      <filename> 替换为 Machine CR 的名称。
    5. 运行 link-machine-and-node.sh 脚本链接 BareMetalHost, Machine, 和 Node

      1. 将以下 link-machine-and-node.sh 脚本复制到本地机器中:

        #!/bin/bash
        
        # Credit goes to
        # https://bugzilla.redhat.com/show_bug.cgi?id=1801238.
        # This script will link Machine object
        # and Node object. This is needed
        # in order to have IP address of
        # the Node present in the status of the Machine.
        
        set -e
        
        machine="$1"
        node="$2"
        
        if [ -z "$machine" ] || [ -z "$node" ]; then
            echo "Usage: $0 MACHINE NODE"
            exit 1
        fi
        
        node_name=$(echo "${node}" | cut -f2 -d':')
        
        oc proxy &
        proxy_pid=$!
        function kill_proxy {
            kill $proxy_pid
        }
        trap kill_proxy EXIT SIGINT
        
        HOST_PROXY_API_PATH="http://localhost:8001/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts"
        
        function print_nics() {
            local ips
            local eob
            declare -a ips
        
            readarray -t ips < <(echo "${1}" \
                                 | jq '.[] | select(. | .type == "InternalIP") | .address' \
                                 | sed 's/"//g')
        
            eob=','
            for (( i=0; i<${#ips[@]}; i++ )); do
                if [ $((i+1)) -eq ${#ips[@]} ]; then
                    eob=""
                fi
                cat <<- EOF
                  {
                    "ip": "${ips[$i]}",
                    "mac": "00:00:00:00:00:00",
                    "model": "unknown",
                    "speedGbps": 10,
                    "vlanId": 0,
                    "pxe": true,
                    "name": "eth1"
                  }${eob}
        EOF
            done
        }
        
        function wait_for_json() {
            local name
            local url
            local curl_opts
            local timeout
        
            local start_time
            local curr_time
            local time_diff
        
            name="$1"
            url="$2"
            timeout="$3"
            shift 3
            curl_opts="$@"
            echo -n "Waiting for $name to respond"
            start_time=$(date +%s)
            until curl -g -X GET "$url" "${curl_opts[@]}" 2> /dev/null | jq '.' 2> /dev/null > /dev/null; do
                echo -n "."
                curr_time=$(date +%s)
                time_diff=$((curr_time - start_time))
                if [[ $time_diff -gt $timeout ]]; then
                    printf '\nTimed out waiting for %s' "${name}"
                    return 1
                fi
                sleep 5
            done
            echo " Success!"
            return 0
        }
        wait_for_json oc_proxy "${HOST_PROXY_API_PATH}" 10 -H "Accept: application/json" -H "Content-Type: application/json"
        
        addresses=$(oc get node -n openshift-machine-api "${node_name}" -o json | jq -c '.status.addresses')
        
        machine_data=$(oc get machines.machine.openshift.io -n openshift-machine-api -o json "${machine}")
        host=$(echo "$machine_data" | jq '.metadata.annotations["metal3.io/BareMetalHost"]' | cut -f2 -d/ | sed 's/"//g')
        
        if [ -z "$host" ]; then
            echo "Machine $machine is not linked to a host yet." 1>&2
            exit 1
        fi
        
        # The address structure on the host doesn't match the node, so extract
        # the values we want into separate variables so we can build the patch
        # we need.
        hostname=$(echo "${addresses}" | jq '.[] | select(. | .type == "Hostname") | .address' | sed 's/"//g')
        
        set +e
        read -r -d '' host_patch << EOF
        {
          "status": {
            "hardware": {
              "hostname": "${hostname}",
              "nics": [
        $(print_nics "${addresses}")
              ],
              "systemVendor": {
                "manufacturer": "Red Hat",
                "productName": "product name",
                "serialNumber": ""
              },
              "firmware": {
                "bios": {
                  "date": "04/01/2014",
                  "vendor": "SeaBIOS",
                  "version": "1.11.0-2.el7"
                }
              },
              "ramMebibytes": 0,
              "storage": [],
              "cpu": {
                "arch": "x86_64",
                "model": "Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz",
                "clockMegahertz": 2199.998,
                "count": 4,
                "flags": []
              }
            }
          }
        }
        EOF
        set -e
        
        echo "PATCHING HOST"
        echo "${host_patch}" | jq .
        
        curl -s \
             -X PATCH \
             "${HOST_PROXY_API_PATH}/${host}/status" \
             -H "Content-type: application/merge-patch+json" \
             -d "${host_patch}"
        
        oc get baremetalhost -n openshift-machine-api -o yaml "${host}"
        Copy to Clipboard Toggle word wrap
      2. 使脚本可执行:

        $ chmod +x link-machine-and-node.sh
        Copy to Clipboard Toggle word wrap
      3. 运行脚本:

        $ bash link-machine-and-node.sh node-5 node-5
        Copy to Clipboard Toggle word wrap
        注意

        第一个 node-5 实例代表计算机,第二个代表该节点。

  5. 通过执行预先存在的 control plane 节点之一来确认 etcd 成员:

    1. 打开到 control plane 节点的远程 shell 会话:

      $ oc rsh -n openshift-etcd etcd-node-0
      Copy to Clipboard Toggle word wrap
    2. 列出 etcd 成员:

      # etcdctl member list -w table
      Copy to Clipboard Toggle word wrap

      输出示例

      +--------+---------+--------+--------------+--------------+---------+
      |   ID   |  STATUS |  NAME  |  PEER ADDRS  | CLIENT ADDRS | LEARNER |
      +--------+---------+--------+--------------+--------------+---------+
      |76ae1d00| started |node-0  |192.168.111.24|192.168.111.24|  false  |
      |2c18942f| started |node-1  |192.168.111.26|192.168.111.26|  false  |
      |61e2a860| started |node-2  |192.168.111.25|192.168.111.25|  false  |
      |ead5f280| started |node-5  |192.168.111.28|192.168.111.28|  false  |
      +--------+---------+--------+--------------+--------------+---------+
      Copy to Clipboard Toggle word wrap

  6. 监控 etcd Operator 配置过程,直到完成:

    $ oc get clusteroperator etcd
    Copy to Clipboard Toggle word wrap

    输出示例(在完成中)

    NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
    etcd   4.11.5    True        False         False      5h54m
    Copy to Clipboard Toggle word wrap

  7. 运行以下命令确认 etcd 健康状况:

    1. 打开到 control plane 节点的远程 shell 会话:

      $ oc rsh -n openshift-etcd etcd-node-0
      Copy to Clipboard Toggle word wrap
    2. 检查端点健康状况:

      # etcdctl endpoint health
      Copy to Clipboard Toggle word wrap

      输出示例

      192.168.111.24 is healthy: committed proposal: took = 10.383651ms
      192.168.111.26 is healthy: committed proposal: took = 11.297561ms
      192.168.111.25 is healthy: committed proposal: took = 13.892416ms
      192.168.111.28 is healthy: committed proposal: took = 11.870755ms
      Copy to Clipboard Toggle word wrap

  8. 验证所有节点是否已就绪:

    $ oc get nodes
    Copy to Clipboard Toggle word wrap

    输出示例

    NAME      STATUS   ROLES    AGE     VERSION
    node-0    Ready    master   6h20m   v1.24.0+3882f8f
    node-1    Ready    master   6h20m   v1.24.0+3882f8f
    node-2    Ready    master   6h4m    v1.24.0+3882f8f
    node-3    Ready    worker   6h7m    v1.24.0+3882f8f
    node-4    Ready    worker   6h7m    v1.24.0+3882f8f
    node-5    Ready    master   99m     v1.24.0+3882f8f
    Copy to Clipboard Toggle word wrap

  9. 验证集群 Operator 是否可用:

    $ oc get ClusterOperators
    Copy to Clipboard Toggle word wrap

    输出示例

    NAME                                      VERSION AVAILABLE PROGRESSING DEGRADED SINCE MSG
    authentication                            4.11.5  True      False       False    5h57m
    baremetal                                 4.11.5  True      False       False    6h19m
    cloud-controller-manager                  4.11.5  True      False       False    6h20m
    cloud-credential                          4.11.5  True      False       False    6h23m
    cluster-autoscaler                        4.11.5  True      False       False    6h18m
    config-operator                           4.11.5  True      False       False    6h19m
    console                                   4.11.5  True      False       False    6h4m
    csi-snapshot-controller                   4.11.5  True      False       False    6h19m
    dns                                       4.11.5  True      False       False    6h18m
    etcd                                      4.11.5  True      False       False    6h17m
    image-registry                            4.11.5  True      False       False    6h7m
    ingress                                   4.11.5  True      False       False    6h6m
    insights                                  4.11.5  True      False       False    6h12m
    kube-apiserver                            4.11.5  True      False       False    6h16m
    kube-controller-manager                   4.11.5  True      False       False    6h16m
    kube-scheduler                            4.11.5  True      False       False    6h16m
    kube-storage-version-migrator             4.11.5  True      False       False    6h19m
    machine-api                               4.11.5  True      False       False    6h15m
    machine-approver                          4.11.5  True      False       False    6h19m
    machine-config                            4.11.5  True      False       False    6h18m
    marketplace                               4.11.5  True      False       False    6h18m
    monitoring                                4.11.5  True      False       False    6h4m
    network                                   4.11.5  True      False       False    6h20m
    node-tuning                               4.11.5  True      False       False    6h18m
    openshift-apiserver                       4.11.5  True      False       False    6h8m
    openshift-controller-manager              4.11.5  True      False       False    6h7m
    openshift-samples                         4.11.5  True      False       False    6h12m
    operator-lifecycle-manager                4.11.5  True      False       False    6h18m
    operator-lifecycle-manager-catalog        4.11.5  True      False       False    6h19m
    operator-lifecycle-manager-pkgsvr         4.11.5  True      False       False    6h12m
    service-ca                                4.11.5  True      False       False    6h19m
    storage                                   4.11.5  True      False       False    6h19m
    Copy to Clipboard Toggle word wrap

  10. 验证集群版本是否正确:

    $ oc get ClusterVersion
    Copy to Clipboard Toggle word wrap

    输出示例

    NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.11.5    True        False         5h57m   Cluster version is 4.11.5
    Copy to Clipboard Toggle word wrap

11.5.2. 删除现有的 control plane 节点

删除您要替换的 control plane 节点。以下示例中为 node-0

先决条件

  • 您已添加了新的健康 control plane 节点。

流程

  1. 删除预先存在的 control plane 节点的 BareMetalHost CR:

    $ oc delete bmh -n openshift-machine-api node-0
    Copy to Clipboard Toggle word wrap
  2. 确认机器不健康:

    $ oc get machine -A
    Copy to Clipboard Toggle word wrap

    输出示例

    NAMESPACE               NAME     PHASE     AGE
    openshift-machine-api   node-0   Failed    20h
    openshift-machine-api   node-1   Running   20h
    openshift-machine-api   node-2   Running   20h
    openshift-machine-api   node-3   Running   19h
    openshift-machine-api   node-4   Running   19h
    openshift-machine-api   node-5   Running   14h
    Copy to Clipboard Toggle word wrap

  3. 删除 Machine CR:

    $ oc delete machine -n openshift-machine-api node-0
    machine.machine.openshift.io "node-0" deleted
    Copy to Clipboard Toggle word wrap
  4. 确认删除 Node CR:

    $ oc get nodes
    Copy to Clipboard Toggle word wrap

    输出示例

    NAME      STATUS   ROLES    AGE   VERSION
    node-1    Ready    master   20h   v1.24.0+3882f8f
    node-2    Ready    master   19h   v1.24.0+3882f8f
    node-3    Ready    worker   19h   v1.24.0+3882f8f
    node-4    Ready    worker   19h   v1.24.0+3882f8f
    node-5    Ready    master   15h   v1.24.0+3882f8f
    Copy to Clipboard Toggle word wrap

  5. 检查 etcd-operator 日志以确认 etcd 集群的状态:

    $ oc logs -n openshift-etcd-operator etcd-operator-8668df65d-lvpjf
    Copy to Clipboard Toggle word wrap

    输出示例

    E0927 07:53:10.597523       1 base_controller.go:272] ClusterMemberRemovalController reconciliation failed: cannot remove member: 192.168.111.23 because it is reported as healthy but it doesn't have a machine nor a node resource
    Copy to Clipboard Toggle word wrap

  6. 删除物理计算机,以允许 etcd Operator 协调群集成员:

    1. 打开到 control plane 节点的远程 shell 会话:

      $ oc rsh -n openshift-etcd etcd-node-1
      Copy to Clipboard Toggle word wrap
    2. 通过检查成员和端点健康状况来监控 etcd operator 协调的进度:

      # etcdctl member list -w table; etcdctl endpoint health
      Copy to Clipboard Toggle word wrap

      输出示例

      +--------+---------+--------+--------------+--------------+---------+
      |   ID   |  STATUS |  NAME  |  PEER ADDRS  | CLIENT ADDRS | LEARNER |
      +--------+---------+--------+--------------+--------------+---------+
      |2c18942f| started | node-1 |192.168.111.26|192.168.111.26|  false  |
      |61e2a860| started | node-2 |192.168.111.25|192.168.111.25|  false  |
      |ead4f280| started | node-5 |192.168.111.28|192.168.111.28|  false  |
      +--------+---------+--------+--------------+--------------+---------+
      192.168.111.26 is healthy: committed proposal: took = 10.458132ms
      192.168.111.25 is healthy: committed proposal: took = 11.047349ms
      192.168.111.28 is healthy: committed proposal: took = 11.414402ms
      Copy to Clipboard Toggle word wrap

返回顶部
Red Hat logoGithubredditYoutubeTwitter

学习

尝试、购买和销售

社区

关于红帽文档

通过我们的产品和服务,以及可以信赖的内容,帮助红帽用户创新并实现他们的目标。 了解我们当前的更新.

让开源更具包容性

红帽致力于替换我们的代码、文档和 Web 属性中存在问题的语言。欲了解更多详情,请参阅红帽博客.

關於紅帽

我们提供强化的解决方案,使企业能够更轻松地跨平台和环境(从核心数据中心到网络边缘)工作。

Theme

© 2025 Red Hat