11.6. Replacing an unhealthy control plane node in a cluster

You can replace an unhealthy control plane (master) node in an OpenShift Container Platform cluster by removing the unhealthy node and adding a new control plane node.

For details on replacing a control plane node in a healthy cluster, see "Replacing a control plane node in a healthy cluster".
11.6.1. Removing an unhealthy control plane node

Remove the unhealthy control plane node from the cluster. In the following example, the unhealthy node is node-0.

Prerequisites

- You have installed a cluster with at least three control plane nodes.
- At least one control plane node is not ready.

Procedure
1. Check the node status to confirm that a control plane node is not ready:

   $ oc get nodes

   Example output

   NAME     STATUS     ROLES    AGE   VERSION
   node-0   NotReady   master   20h   v1.24.0+3882f8f
   node-1   Ready      master   20h   v1.24.0+3882f8f
   node-2   Ready      master   20h   v1.24.0+3882f8f
   node-3   Ready      worker   20h   v1.24.0+3882f8f
   node-4   Ready      worker   20h   v1.24.0+3882f8f

2. Confirm that the cluster is unhealthy in the etcd-operator logs:

   $ oc logs -n openshift-etcd-operator deployment/etcd-operator

   Example output

   E0927 08:24:23.983733 1 base_controller.go:272] DefragController reconciliation failed: cluster is unhealthy: 2 of 3 members are available, node-0 is unhealthy

3. Confirm the etcd members by running the following commands:

   a. Open a remote shell session to a control plane node:

      $ oc rsh -n openshift-etcd node-1

   b. List the etcd members:

      # etcdctl member list -w table

      Example output

      +----------+---------+--------+----------------+----------------+---------+
      |    ID    | STATUS  |  NAME  |   PEER ADDRS   |  CLIENT ADDRS  | LEARNER |
      +----------+---------+--------+----------------+----------------+---------+
      | 61e2a860 | started | node-0 | 192.168.111.25 | 192.168.111.25 | false   |
      | 2c18942f | started | node-1 | 192.168.111.26 | 192.168.111.26 | false   |
      | ead4f280 | started | node-2 | 192.168.111.28 | 192.168.111.28 | false   |
      +----------+---------+--------+----------------+----------------+---------+

4. Confirm that etcdctl endpoint health reports the unhealthy member of the cluster:

   # etcdctl endpoint health

   Example output

   {"level":"warn","ts":"2022-09-27T08:25:35.953Z","logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000680380/192.168.111.25","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 192.168.111.25: connect: no route to host\""}
   192.168.111.28 is healthy: committed proposal: took = 12.465641ms
   192.168.111.26 is healthy: committed proposal: took = 12.297059ms
   192.168.111.25 is unhealthy: failed to commit proposal: context deadline exceeded
   Error: unhealthy cluster

5. Remove the unhealthy control plane node by deleting its Machine custom resource (CR):

   $ oc delete machine -n openshift-machine-api node-0

   Note: The Machine and Node CRs might not be deleted because they are protected by finalizers. If this occurs, you must delete the Machine CR manually by removing all of its finalizers.
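The note above mentions removing finalizers manually when they block deletion of the Machine CR. A minimal sketch of one way to do this, using a standard Kubernetes merge patch that empties metadata.finalizers (the machine name node-0 matches this example; confirm what set the finalizer before clearing it):

```shell
# Build the merge patch that empties metadata.finalizers on the stuck
# Machine CR. It would be applied with, for example:
#   oc patch machine node-0 -n openshift-machine-api --type=merge -p "$patch"
# (node-0 matches this example; verify the owner of the finalizer first.)
patch='{"metadata":{"finalizers":null}}'
echo "$patch"
```

This is a sketch, not the documented procedure: clearing finalizers bypasses whatever cleanup the finalizer's controller intended to perform.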
6. Verify in the etcd-operator logs that the unhealthy machine has been removed:

   $ oc logs -n openshift-etcd-operator deployment/etcd-operator

   Example output

   I0927 08:58:41.249222 1 machinedeletionhooks.go:135] skip removing the deletion hook from machine node-0 since its member is still present with any of: [{InternalIP } {InternalIP 192.168.111.25}]

7. If you see that removal was skipped, as in the preceding log example, manually remove the unhealthy etcd member:

   a. Open a remote shell session to a control plane node:

      $ oc rsh -n openshift-etcd node-1

   b. List the etcd members:

      # etcdctl member list -w table

      Example output

      +----------+---------+--------+----------------+----------------+---------+
      |    ID    | STATUS  |  NAME  |   PEER ADDRS   |  CLIENT ADDRS  | LEARNER |
      +----------+---------+--------+----------------+----------------+---------+
      | 61e2a860 | started | node-0 | 192.168.111.25 | 192.168.111.25 | false   |
      | 2c18942f | started | node-1 | 192.168.111.26 | 192.168.111.26 | false   |
      | ead4f280 | started | node-2 | 192.168.111.28 | 192.168.111.28 | false   |
      +----------+---------+--------+----------------+----------------+---------+

   c. Confirm that etcdctl endpoint health reports the unhealthy member of the cluster:

      # etcdctl endpoint health

      Example output

      {"level":"warn","ts":"2022-09-27T10:31:07.227Z","logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0000d6e00/192.168.111.25","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 192.168.111.25: connect: no route to host\""}
      192.168.111.28 is healthy: committed proposal: took = 13.038278ms
      192.168.111.26 is healthy: committed proposal: took = 12.950355ms
      192.168.111.25 is unhealthy: failed to commit proposal: context deadline exceeded
      Error: unhealthy cluster

   d. Remove the unhealthy etcd member from the cluster:

      # etcdctl member remove 61e2a86084aafa62

      Example output

      Member 61e2a86084aafa62 removed from cluster 6881c977b97990d7

   e. Verify that the unhealthy etcd member has been removed:

      # etcdctl member list -w table

      Example output

      +----------+---------+--------+----------------+----------------+---------+
      |    ID    | STATUS  |  NAME  |   PEER ADDRS   |  CLIENT ADDRS  | LEARNER |
      +----------+---------+--------+----------------+----------------+---------+
      | 2c18942f | started | node-1 | 192.168.111.26 | 192.168.111.26 | false   |
      | ead4f280 | started | node-2 | 192.168.111.28 | 192.168.111.28 | false   |
      +----------+---------+--------+----------------+----------------+---------+
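Removing the member requires reading its ID off the member list table by eye. As a convenience, the ID for a known-unhealthy IP can be pulled out of the table text with awk; this sketch runs against sample rows mirroring the example output above (where the IDs are abbreviated — in practice copy the full ID printed by etcdctl member list):

```shell
# Extract the member ID for the unhealthy endpoint from the
# 'etcdctl member list -w table' text. The IP comes from the line that
# 'etcdctl endpoint health' reported as unhealthy.
member_table='| 61e2a860 | started | node-0 | 192.168.111.25 | 192.168.111.25 | false |
| 2c18942f | started | node-1 | 192.168.111.26 | 192.168.111.26 | false |
| ead4f280 | started | node-2 | 192.168.111.28 | 192.168.111.28 | false |'
unhealthy_ip="192.168.111.25"

# Field 2 (between the first two pipes) is the ID column; strip spaces.
member_id=$(printf '%s\n' "$member_table" \
  | awk -F'|' -v ip="$unhealthy_ip" '$0 ~ ip { gsub(/ /, "", $2); print $2; exit }')
echo "$member_id"
```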
11.6.2. Adding a new control plane node

Add a new control plane node to replace the unhealthy node that you removed. In the following example, the new node is node-5.

Prerequisites

- You have installed a new control plane node for day 2. For more information, see "Adding hosts using the web console" or "Adding hosts using the API".

Procedure
1. Retrieve the pending certificate signing requests (CSRs) for the new day-2 control plane node:

   $ oc get csr | grep Pending

   Example output

   csr-5sd59   8m19s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>   Pending
   csr-xzqts   10s     kubernetes.io/kubelet-serving                 system:node:node-5                                                          <none>   Pending

2. Approve all pending CSRs for the new node (node-5 in this example):

   $ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve

   Note: You must approve the CSRs to complete the installation.
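The go-template in step 2 selects CSRs that have no .status — that is, the pending ones. The same selection can be illustrated locally over the plain `oc get csr` listing, where CONDITION is the last column (sample lines taken from the example output above; this is a sketch of the filter logic, not a replacement for the command):

```shell
# Select the names of pending CSRs from 'oc get csr' text, mirroring
# what the go-template does against the API. The CONDITION column is
# the last field on each line.
csr_output='csr-5sd59   8m19s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>   Pending
csr-xzqts   10s   kubernetes.io/kubelet-serving   system:node:node-5   <none>   Pending'
pending=$(printf '%s\n' "$csr_output" | awk '$NF == "Pending" { print $1 }')
echo "$pending"
```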
3. Confirm that the new control plane node is in the Ready status:

   $ oc get nodes

   Example output

   NAME     STATUS   ROLES    AGE     VERSION
   node-1   Ready    master   20h     v1.24.0+3882f8f
   node-2   Ready    master   20h     v1.24.0+3882f8f
   node-3   Ready    worker   20h     v1.24.0+3882f8f
   node-4   Ready    worker   20h     v1.24.0+3882f8f
   node-5   Ready    master   2m52s   v1.24.0+3882f8f

4. When the cluster runs with the Machine API, the etcd operator requires a Machine CR that references the new node. The Machine API is automatically activated when the cluster has three control plane nodes.

   Create the BareMetalHost and Machine CRs and link them to the new control plane node's Node CR.

   Important: Boot-it-yourself does not create the BareMetalHost and Machine CRs, so you must create them. Failure to create the BareMetalHost and Machine CRs generates errors in the etcd operator.

   a. Create the BareMetalHost CR with a unique .metadata.name value:

      apiVersion: metal3.io/v1alpha1
      kind: BareMetalHost
      metadata:
        name: node-5
        namespace: openshift-machine-api
      spec:
        automatedCleaningMode: metadata
        bootMACAddress: 00:00:00:00:00:02
        bootMode: UEFI
        customDeploy:
          method: install_coreos
        externallyProvisioned: true
        online: true
        userData:
          name: master-user-data-managed
          namespace: openshift-machine-api

   b. Apply the BareMetalHost CR:

      $ oc apply -f <filename>

      Replace <filename> with the name of the BareMetalHost CR file.

   c. Create the Machine CR with a unique .metadata.name value:

      apiVersion: machine.openshift.io/v1beta1
      kind: Machine
      metadata:
        annotations:
          machine.openshift.io/instance-state: externally provisioned
          metal3.io/BareMetalHost: openshift-machine-api/node-5
        finalizers:
        - machine.machine.openshift.io
        labels:
          machine.openshift.io/cluster-api-cluster: test-day2-1-6qv96
          machine.openshift.io/cluster-api-machine-role: master
          machine.openshift.io/cluster-api-machine-type: master
        name: node-5
        namespace: openshift-machine-api
      spec:
        metadata: {}
        providerSpec:
          value:
            apiVersion: baremetal.cluster.k8s.io/v1alpha1
            customDeploy:
              method: install_coreos
            hostSelector: {}
            image:
              checksum: ""
              url: ""
            kind: BareMetalMachineProviderSpec
            metadata:
              creationTimestamp: null
            userData:
              name: master-user-data-managed

   d. Apply the Machine CR:

      $ oc apply -f <filename>

      Replace <filename> with the name of the Machine CR file.

   e. Link the BareMetalHost, Machine, and Node by running the link-machine-and-node.sh script:

      i. Copy the following link-machine-and-node.sh script to a local machine:
#!/bin/bash
# Credit goes to
# https://bugzilla.redhat.com/show_bug.cgi?id=1801238.
# This script will link Machine object
# and Node object. This is needed
# in order to have IP address of
# the Node present in the status of the Machine.

set -e

machine="$1"
node="$2"

if [ -z "$machine" ] || [ -z "$node" ]; then
    echo "Usage: $0 MACHINE NODE"
    exit 1
fi

node_name=$(echo "${node}" | cut -f2 -d':')

oc proxy &
proxy_pid=$!
function kill_proxy {
    kill $proxy_pid
}
trap kill_proxy EXIT SIGINT

HOST_PROXY_API_PATH="http://localhost:8001/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts"

function print_nics() {
    local ips
    local eob
    declare -a ips
    readarray -t ips < <(echo "${1}" \
        | jq '.[] | select(. | .type == "InternalIP") | .address' \
        | sed 's/"//g')
    eob=','
    for (( i=0; i<${#ips[@]}; i++ )); do
        if [ $((i+1)) -eq ${#ips[@]} ]; then
            eob=""
        fi
        cat <<- EOF
          {
            "ip": "${ips[$i]}",
            "mac": "00:00:00:00:00:00",
            "model": "unknown",
            "speedGbps": 10,
            "vlanId": 0,
            "pxe": true,
            "name": "eth1"
          }${eob}
EOF
    done
}

function wait_for_json() {
    local name
    local url
    local curl_opts
    local timeout
    local start_time
    local curr_time
    local time_diff

    name="$1"
    url="$2"
    timeout="$3"
    shift 3
    curl_opts="$@"
    echo -n "Waiting for $name to respond"
    start_time=$(date +%s)
    until curl -g -X GET "$url" "${curl_opts[@]}" 2> /dev/null | jq '.' 2> /dev/null > /dev/null; do
        echo -n "."
        curr_time=$(date +%s)
        time_diff=$((curr_time - start_time))
        if [[ $time_diff -gt $timeout ]]; then
            printf '\nTimed out waiting for %s' "${name}"
            return 1
        fi
        sleep 5
    done
    echo " Success!"
    return 0
}

wait_for_json oc_proxy "${HOST_PROXY_API_PATH}" 10 -H "Accept: application/json" -H "Content-Type: application/json"

addresses=$(oc get node -n openshift-machine-api "${node_name}" -o json | jq -c '.status.addresses')

machine_data=$(oc get machines.machine.openshift.io -n openshift-machine-api -o json "${machine}")
host=$(echo "$machine_data" | jq '.metadata.annotations["metal3.io/BareMetalHost"]' | cut -f2 -d/ | sed 's/"//g')

if [ -z "$host" ]; then
    echo "Machine $machine is not linked to a host yet." 1>&2
    exit 1
fi

# The address structure on the host doesn't match the node, so extract
# the values we want into separate variables so we can build the patch
# we need.

hostname=$(echo "${addresses}" | jq '.[] | select(. | .type == "Hostname") | .address' | sed 's/"//g')

set +e
read -r -d '' host_patch << EOF
{
  "status": {
    "hardware": {
      "hostname": "${hostname}",
      "nics": [
$(print_nics "${addresses}")
      ],
      "systemVendor": {
        "manufacturer": "Red Hat",
        "productName": "product name",
        "serialNumber": ""
      },
      "firmware": {
        "bios": {
          "date": "04/01/2014",
          "vendor": "SeaBIOS",
          "version": "1.11.0-2.el7"
        }
      },
      "ramMebibytes": 0,
      "storage": [],
      "cpu": {
        "arch": "x86_64",
        "model": "Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz",
        "clockMegahertz": 2199.998,
        "count": 4,
        "flags": []
      }
    }
  }
}
EOF
set -e

echo "PATCHING HOST"
echo "${host_patch}" | jq .

curl -s \
    -X PATCH \
    "${HOST_PROXY_API_PATH}/${host}/status" \
    -H "Content-type: application/merge-patch+json" \
    -d "${host_patch}"

oc get baremetalhost -n openshift-machine-api -o yaml "${host}"

      ii. Make the script executable:
         $ chmod +x link-machine-and-node.sh

      iii. Run the script:

         $ bash link-machine-and-node.sh node-5 node-5

         Note: The first node-5 instance represents the machine, and the second represents the node.
5. Confirm the etcd members by running the following commands:

   a. Open a remote shell session to a control plane node:

      $ oc rsh -n openshift-etcd node-1

   b. List the etcd members:

      # etcdctl member list -w table

      Example output

      +----------+---------+--------+----------------+----------------+---------+
      |    ID    | STATUS  |  NAME  |   PEER ADDRS   |  CLIENT ADDRS  | LEARNER |
      +----------+---------+--------+----------------+----------------+---------+
      | 2c18942f | started | node-1 | 192.168.111.26 | 192.168.111.26 | false   |
      | ead4f280 | started | node-2 | 192.168.111.28 | 192.168.111.28 | false   |
      | 79153c5a | started | node-5 | 192.168.111.29 | 192.168.111.29 | false   |
      +----------+---------+--------+----------------+----------------+---------+
6. Monitor the etcd operator configuration process until it completes:

   $ oc get clusteroperator etcd

   Example output (upon completion)

   NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
   etcd   4.11.5    True        False         False      22h

7. Confirm the etcd health by running the following commands:

   a. Open a remote shell session to a control plane node:

      $ oc rsh -n openshift-etcd node-1

   b. Check the endpoint health:

      # etcdctl endpoint health

      Example output

      192.168.111.26 is healthy: committed proposal: took = 9.105375ms
      192.168.111.28 is healthy: committed proposal: took = 9.15205ms
      192.168.111.29 is healthy: committed proposal: took = 10.277577ms
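The cluster is healthy only when every endpoint reports healthy. A quick textual check over the health output makes that explicit; a sketch run against sample text mirroring the example output above, assuming the same "is healthy"/"is unhealthy" line format:

```shell
# Flag any line of 'etcdctl endpoint health' output that reports an
# unhealthy endpoint. Sample text mirrors the example output above.
health='192.168.111.26 is healthy: committed proposal: took = 9.105375ms
192.168.111.28 is healthy: committed proposal: took = 9.15205ms
192.168.111.29 is healthy: committed proposal: took = 10.277577ms'

if printf '%s\n' "$health" | grep -q 'is unhealthy'; then
  result="cluster unhealthy"
else
  result="all members healthy"
fi
echo "$result"
```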
8. Confirm the health of the nodes:

   $ oc get nodes

   Example output

   NAME     STATUS   ROLES    AGE   VERSION
   node-1   Ready    master   20h   v1.24.0+3882f8f
   node-2   Ready    master   20h   v1.24.0+3882f8f
   node-3   Ready    worker   20h   v1.24.0+3882f8f
   node-4   Ready    worker   20h   v1.24.0+3882f8f
   node-5   Ready    master   40m   v1.24.0+3882f8f

9. Verify that the cluster Operators are all available:

   $ oc get ClusterOperators

   Example output

   NAME                                 VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
   authentication                       4.11.5    True        False         False      150m
   baremetal                            4.11.5    True        False         False      22h
   cloud-controller-manager             4.11.5    True        False         False      22h
   cloud-credential                     4.11.5    True        False         False      22h
   cluster-autoscaler                   4.11.5    True        False         False      22h
   config-operator                      4.11.5    True        False         False      22h
   console                              4.11.5    True        False         False      145m
   csi-snapshot-controller              4.11.5    True        False         False      22h
   dns                                  4.11.5    True        False         False      22h
   etcd                                 4.11.5    True        False         False      22h
   image-registry                       4.11.5    True        False         False      22h
   ingress                              4.11.5    True        False         False      22h
   insights                             4.11.5    True        False         False      22h
   kube-apiserver                       4.11.5    True        False         False      22h
   kube-controller-manager              4.11.5    True        False         False      22h
   kube-scheduler                       4.11.5    True        False         False      22h
   kube-storage-version-migrator        4.11.5    True        False         False      148m
   machine-api                          4.11.5    True        False         False      22h
   machine-approver                     4.11.5    True        False         False      22h
   machine-config                       4.11.5    True        False         False      110m
   marketplace                          4.11.5    True        False         False      22h
   monitoring                           4.11.5    True        False         False      22h
   network                              4.11.5    True        False         False      22h
   node-tuning                          4.11.5    True        False         False      22h
   openshift-apiserver                  4.11.5    True        False         False      163m
   openshift-controller-manager         4.11.5    True        False         False      22h
   openshift-samples                    4.11.5    True        False         False      22h
   operator-lifecycle-manager           4.11.5    True        False         False      22h
   operator-lifecycle-manager-catalog   4.11.5    True        False         False      22h
   operator-lifecycle-manager-pkgsvr    4.11.5    True        False         False      22h
   service-ca                           4.11.5    True        False         False      22h
   storage                              4.11.5    True        False         False      22h

10. Verify that the cluster version is correct:

    $ oc get ClusterVersion

    Example output

    NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.11.5    True        False         22h     Cluster version is 4.11.5