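11.6. Replacing a control plane node in an unhealthy cluster
You can replace an unhealthy control plane (master) node in an OpenShift Container Platform cluster by removing the unhealthy control plane node and adding a new one.
For details about replacing a control plane node in a healthy cluster, see Replacing a control plane node in a healthy cluster.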
11.6.1. Removing an unhealthy control plane node
Remove the unhealthy control plane node from the cluster. In the following example, the unhealthy node is node-0.
Prerequisites
- You have installed a cluster with at least three control plane nodes.
- At least one of the control plane nodes is not ready.
Procedure
Check the node status to confirm that a control plane node is not ready:
$ oc get nodes
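To pick out the nodes that are not ready from a longer listing, a small filter can help. The following is a minimal sketch, not part of the documented procedure. The function name `not_ready_nodes` is hypothetical, and the sample rows are made up for illustration because no example output is available for this step.

```shell
# Print the names of nodes whose STATUS column does not start with "Ready".
# Expects `oc get nodes --no-headers` output on stdin.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Hypothetical input for illustration; ages and versions are placeholders.
not_ready_nodes <<'EOF'
node-0 NotReady master 20h v0.0.0
node-1 Ready master 20h v0.0.0
node-2 Ready master 20h v0.0.0
EOF
```

Against a live cluster, the same filter could be fed with `oc get nodes --no-headers | not_ready_nodes`.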
Confirm in the etcd-operator logs that the cluster is unhealthy:
$ oc logs -n openshift-etcd-operator deployment/etcd-operator
Example output
E0927 08:24:23.983733 1 base_controller.go:272] DefragController reconciliation failed: cluster is unhealthy: 2 of 3 members are available, node-0 is unhealthy
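The log message reports 2 of 3 members available. etcd needs a majority of its members to accept writes, so a three-member cluster tolerates the loss of one member. The following sketch works through that arithmetic. The function names `quorum_size` and `has_quorum` are hypothetical helpers for illustration only.

```shell
# Majority (quorum) size for an etcd cluster with the given member count.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Succeeds when the available members still form a majority.
has_quorum() {
  available=$1
  total=$2
  [ "$available" -ge "$(quorum_size "$total")" ]
}

# The log line reports 2 of 3 members available.
if has_quorum 2 3; then
  echo "quorum kept: 2 of 3 (need $(quorum_size 3))"
else
  echo "quorum lost"
fi
```

With only one of three members available, `has_quorum 1 3` fails, which is why the unhealthy member should be replaced before a second one is lost.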
Confirm the etcd members by running the following commands:
Open a remote shell session to a control plane node:
$ oc rsh -n openshift-etcd node-1
List the etcd members:
# etcdctl member list -w table
Confirm that etcdctl endpoint health reports the unhealthy member of the cluster:
# etcdctl endpoint health
Example output
{"level":"warn","ts":"2022-09-27T08:25:35.953Z","logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000680380/192.168.111.25","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 192.168.111.25: connect: no route to host\""}
192.168.111.28 is healthy: committed proposal: took = 12.465641ms
192.168.111.26 is healthy: committed proposal: took = 12.297059ms
192.168.111.25 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster
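The health report mixes a JSON warning with one status line per endpoint. As a minimal sketch, the unhealthy endpoints can be extracted with a short awk filter. The function name `unhealthy_endpoints` is hypothetical, and the sample input is taken from the example output for this step.

```shell
# Print the endpoints that `etcdctl endpoint health` reports as unhealthy.
# Reads the command output on stdin. JSON warning lines and the final
# "Error:" line are skipped because their second and third fields are not
# "is unhealthy:".
unhealthy_endpoints() {
  awk '$2 == "is" && $3 == "unhealthy:" { print $1 }'
}

# Demo on the status lines from the example output.
unhealthy_endpoints <<'EOF'
192.168.111.28 is healthy: committed proposal: took = 12.465641ms
192.168.111.26 is healthy: committed proposal: took = 12.297059ms
192.168.111.25 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster
EOF
```

Inside the remote shell, this could be used as `etcdctl endpoint health 2>&1 | unhealthy_endpoints`. The `2>&1` redirection is a precaution, on the assumption that some of the report may be written to standard error.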
Remove the unhealthy control plane node by deleting its Machine custom resource (CR):
$ oc delete machine -n openshift-machine-api node-0
Note: The Machine and Node CRs might not be deleted because they are protected by finalizers. If this occurs, you must delete the Machine CR manually by removing all of its finalizers.
Verify in the etcd-operator logs whether the unhealthy machine has been removed:
$ oc logs -n openshift-etcd-operator deployment/etcd-operator
Example output
I0927 08:58:41.249222 1 machinedeletionhooks.go:135] skip removing the deletion hook from machine node-0 since its member is still present with any of: [{InternalIP } {InternalIP 192.168.111.25}]
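If the Machine CR stays in place because of its finalizers, one common way to clear them is a merge patch that sets `metadata.finalizers` to null. The sketch below only prints such a command so that it can be reviewed before use. The function name `print_clear_finalizers_cmd` is hypothetical, and the approach is an assumption about how to remove the finalizers, not a step taken from this procedure.

```shell
# Build the merge patch command that clears metadata.finalizers on a
# Machine CR. Nothing is executed here: the function only prints the
# command so that you can review it before running it yourself.
print_clear_finalizers_cmd() {
  machine=$1
  namespace=${2:-openshift-machine-api}
  printf "oc patch machine %s -n %s --type=merge -p '%s'\n" \
    "$machine" "$namespace" '{"metadata":{"finalizers":null}}'
}

# Print the command for the machine used in this example.
print_clear_finalizers_cmd node-0
```

Removing finalizers bypasses the cleanup that they guard, so this is best kept for the case where the deletion is stuck.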
If the log shows that the removal was skipped, as in the preceding example, remove the unhealthy etcd member manually:
Open a remote shell session to a control plane node:
$ oc rsh -n openshift-etcd node-1
List the etcd members:
# etcdctl member list -w table
Confirm that etcdctl endpoint health reports the unhealthy member of the cluster:
# etcdctl endpoint health
Example output
{"level":"warn","ts":"2022-09-27T10:31:07.227Z","logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0000d6e00/192.168.111.25","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 192.168.111.25: connect: no route to host\""}
192.168.111.28 is healthy: committed proposal: took = 13.038278ms
192.168.111.26 is healthy: committed proposal: took = 12.950355ms
192.168.111.25 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster
Remove the unhealthy etcd member from the cluster:
# etcdctl member remove 61e2a86084aafa62
Example output
Member 61e2a86084aafa62 removed from cluster 6881c977b97990d7
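The member ID passed to `etcdctl member remove` comes from the member list. As a sketch, the ID can be looked up by member name from the default comma-separated output of `etcdctl member list`, which is the output without `-w table`. The function name `member_id_by_name` is hypothetical. The sample rows are illustrative: the ID and IP address for node-0 come from the examples in this section, the member names are assumed to match the node names, and the second row is made up.

```shell
# Look up an etcd member ID by member name.
# Expects the default (simple) output of `etcdctl member list` on stdin:
#   ID, STATUS, NAME, PEER ADDRS, CLIENT ADDRS, IS LEARNER
member_id_by_name() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Illustrative input only; see the notes above about which values are real.
member_id_by_name node-0 <<'EOF'
61e2a86084aafa62, started, node-0, https://192.168.111.25:2380, https://192.168.111.25:2379, false
0000000000000001, started, node-1, https://192.168.111.26:2380, https://192.168.111.26:2379, false
EOF
```

Checking the printed ID against the table output before removing the member avoids removing a healthy member by mistake.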
Verify that the unhealthy etcd member has been removed by running the following command:
# etcdctl member list -w table
11.6.2. Adding a new control plane node
Add a new control plane node to replace the unhealthy node that you removed. In the following example, the new node is node-5.
Prerequisites
- You have installed a control plane node on day 2. For more information, see Adding hosts with the web console or Adding hosts with the API.
Procedure
Retrieve the pending certificate signing requests (CSRs) for the new day 2 control plane node:
$ oc get csr | grep Pending
Example output
csr-5sd59 8m19s kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending
csr-xzqts 10s kubernetes.io/kubelet-serving system:node:node-5 <none> Pending
Approve all pending CSRs for the new node (node-5 in this example):
$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve
Note: You must approve the CSRs to complete the installation.
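Before approving, it can be useful to list exactly which CSR names are pending. The following sketch filters `oc get csr` output on its last column. The function name `pending_csr_names` is hypothetical, and the sample rows are taken from the example output of the previous step.

```shell
# Print the names of CSRs whose CONDITION (last column) is "Pending".
# Expects `oc get csr` output on stdin; the header row is skipped
# because its last field is "CONDITION".
pending_csr_names() {
  awk '$NF == "Pending" { print $1 }'
}

# Demo on the example output from the CSR retrieval step.
pending_csr_names <<'EOF'
csr-5sd59 8m19s kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending
csr-xzqts 10s kubernetes.io/kubelet-serving system:node:node-5 <none> Pending
EOF
```

This is a review aid only. The go-template command in the procedure remains the way to approve the requests.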
Confirm that the control plane node is in Ready status:
$ oc get nodes
When the cluster runs with the Machine API, the etcd operator requires a Machine CR that references the new node. The Machine API is activated automatically when the cluster has three control plane nodes. Create the BareMetalHost and Machine CRs and link them to the Node CR of the new control plane node.
Important: Boot-it-yourself does not create the BareMetalHost and Machine CRs, so you must create them. Failure to create the BareMetalHost and Machine CRs generates errors in the etcd operator.
Create the BareMetalHost CR with a unique .metadata.name value.
Apply the BareMetalHost CR:
$ oc apply -f <filename>
- Replace <filename> with the name of the BareMetalHost CR.
Create the Machine CR with a unique .metadata.name value.
Apply the Machine CR:
$ oc apply -f <filename>
- Replace <filename> with the name of the Machine CR.
Link the BareMetalHost, Machine, and Node by running the link-machine-and-node.sh script:
Copy the link-machine-and-node.sh script to a local machine.
Make the script executable:
$ chmod +x link-machine-and-node.sh
Run the script:
$ bash link-machine-and-node.sh node-5 node-5
Note: The first node-5 instance represents the machine, and the second represents the node.
Confirm the etcd members by running the following commands:
Open a remote shell session to a control plane node:
$ oc rsh -n openshift-etcd node-1
List the etcd members:
# etcdctl member list -w table
Monitor the etcd operator configuration process until it completes:
$ oc get clusteroperator etcd
Example output (upon completion)
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
etcd   4.11.5    True        False         False      22h
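Completion here means that the operator row shows AVAILABLE as True with PROGRESSING and DEGRADED both False. The following sketch checks one row of `oc get clusteroperator` output for that state. The function name `operator_settled` is hypothetical, and the sample row is taken from the example output.

```shell
# Succeeds when a row of `oc get clusteroperator` output shows
# AVAILABLE=True, PROGRESSING=False, and DEGRADED=False.
# Columns: NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
operator_settled() {
  awk '$3 == "True" && $4 == "False" && $5 == "False" { ok = 1 } END { exit !ok }'
}

# Demo on the row from the example output.
if echo "etcd 4.11.5 True False False 22h" | operator_settled; then
  echo "etcd operator has settled"
fi
```

A polling loop could wrap this check, for example `until oc get clusteroperator etcd --no-headers | operator_settled; do sleep 30; done`, with the 30-second interval chosen arbitrarily.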
Confirm etcd health by running the following commands:
Open a remote shell session to a control plane node:
$ oc rsh -n openshift-etcd node-1
Check the endpoint health:
# etcdctl endpoint health
Example output
192.168.111.26 is healthy: committed proposal: took = 9.105375ms
192.168.111.28 is healthy: committed proposal: took = 9.15205ms
192.168.111.29 is healthy: committed proposal: took = 10.277577ms
Confirm the health of the nodes:
$ oc get Nodes
Verify that the cluster Operators are all available:
$ oc get ClusterOperators
Verify that the cluster version is correct:
$ oc get ClusterVersion
Example output
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.5    True        False         22h     Cluster version is 4.11.5