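11.5. Replacing a control plane node in a healthy cluster
You can replace a control plane (master) node in a healthy OpenShift Container Platform cluster that has three to five control plane nodes by adding a new control plane node and then removing an existing one.
If the cluster is unhealthy, you must perform additional operations before you can manage the control plane nodes. For more information, see "Replacing a control plane node in an unhealthy cluster".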
11.5.1. Adding a new control plane node
Add a new control plane node and verify that it is healthy. In the following example, the new node is node-5.
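Prerequisites
- You are using OpenShift Container Platform 4.11 or later.
- You have installed a healthy cluster with at least three control plane nodes.
- You have created a single control plane node to be added on day 2.
Procedure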
Retrieve the pending certificate signing requests (CSRs) for the new day 2 control plane node:
$ oc get csr | grep Pending
Example output
csr-5sd59 8m19s kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending
csr-xzqts 10s kubernetes.io/kubelet-serving system:node:node-5 <none> Pending
Approve all pending CSRs for the new node (node-5 in this example):
$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve
Important: You must approve the CSRs to complete the installation.
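Because the approval command above approves every CSR that has no status, it can help to review the pending names first. The following is a minimal sketch, not part of the documented procedure: the function name pending_csr_names is hypothetical, and it assumes the CONDITION value is the last column of the `oc get csr` output, as in the example output above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the names of CSRs whose last column is "Pending".
# Reads `oc get csr` output on stdin. A header row is skipped automatically
# because its last field is "CONDITION", not "Pending".
pending_csr_names() {
  awk '$NF == "Pending" { print $1 }'
}

# Possible usage against a live cluster (assumes you are logged in with oc):
#   oc get csr | pending_csr_names
```

With the example output shown above, the helper would print csr-5sd59 and csr-xzqts, one per line.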
Confirm that the new control plane node is in the Ready status:
$ oc get nodes
Note: The etcd Operator requires a Machine custom resource (CR) that references the new node when the cluster runs with the Machine API. The Machine API is automatically activated when the cluster has three or more control plane nodes.
Create the BareMetalHost and Machine CRs and link them to the Node CR of the new control plane node.
Create the BareMetalHost CR with a unique .metadata.name value.
Apply the BareMetalHost CR:
$ oc apply -f <filename>
Replace <filename> with the name of the file that contains the BareMetalHost CR.
Create the Machine CR with a unique .metadata.name value. In the CR, replace <cluster_name> with the name of your cluster, for example test-day2-1-6qv96.
To get the cluster name, run the following command:
$ oc get infrastructure cluster -o=jsonpath='{.status.infrastructureName}{"\n"}'
Apply the Machine CR:
$ oc apply -f <filename>
Replace <filename> with the name of the file that contains the Machine CR.
Link the BareMetalHost, Machine, and Node objects by running the link-machine-and-node.sh script.
Copy the link-machine-and-node.sh script to a local machine.
Make the script executable:
$ chmod +x link-machine-and-node.sh
Run the script:
$ bash link-machine-and-node.sh node-5 node-5
Note: The first node-5 instance represents the machine, and the second represents the node.
Confirm the etcd members from one of the pre-existing control plane nodes.
Open a remote shell session to the control plane node:
$ oc rsh -n openshift-etcd etcd-node-0
List the etcd members:
# etcdctl member list -w table
Monitor the etcd Operator configuration process until it completes:
$ oc get clusteroperator etcd
Example output (upon completion)
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
etcd 4.11.5 True False False 5h54m
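The completed state in the example output is AVAILABLE=True, PROGRESSING=False, and DEGRADED=False. The following is a minimal sketch of checking for that state in a script. The function name operator_settled is hypothetical, and the sketch assumes the column order shown in the example output above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only when a ClusterOperator row reports
# AVAILABLE=True, PROGRESSING=False, and DEGRADED=False.
# Reads `oc get clusteroperator <name>` output on stdin; if several rows
# are passed (for example a header plus one row), the last row decides.
operator_settled() {
  awk '{ ok = ($3 == "True" && $4 == "False" && $5 == "False") }
       END { exit (ok ? 0 : 1) }'
}

# Possible polling loop (the 30-second interval is an arbitrary choice):
#   until oc get clusteroperator etcd --no-headers | operator_settled; do
#     sleep 30
#   done
```

The helper returns a non-zero exit status on empty input, so a failed `oc` call is treated as "not yet complete" rather than as success.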
Confirm the etcd health by running the following commands.
Open a remote shell session to the control plane node:
$ oc rsh -n openshift-etcd etcd-node-0
Check the endpoint health:
# etcdctl endpoint health
Example output
192.168.111.24 is healthy: committed proposal: took = 10.383651ms
192.168.111.26 is healthy: committed proposal: took = 11.297561ms
192.168.111.25 is healthy: committed proposal: took = 13.892416ms
192.168.111.28 is healthy: committed proposal: took = 11.870755ms
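When the endpoint health output is captured, for example in a file, a small filter can flag any endpoint that is not reported as healthy. The following is a sketch under assumptions: the function name unhealthy_endpoints is hypothetical, and it treats every non-empty line that lacks the text "is healthy" as a problem, based only on the line format in the example output above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each non-empty line that does not contain
# " is healthy" and return a non-zero exit status if any such line exists.
# Reads captured `etcdctl endpoint health` output on stdin.
unhealthy_endpoints() {
  awk 'NF && !/ is healthy/ { print; bad = 1 }
       END { exit (bad ? 1 : 0) }'
}

# Possible usage with output saved to a file (the file name is an example):
#   unhealthy_endpoints < endpoint-health.txt && echo "all endpoints healthy"
```

Depending on the etcdctl version, the health lines might be written to standard error, so you might need to redirect it when capturing the output.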
Verify that all nodes are ready:
$ oc get nodes
Verify that the cluster Operators are all available:
$ oc get ClusterOperators
Verify that the cluster version is correct:
$ oc get ClusterVersion
Example output
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.5 True False 5h57m Cluster version is 4.11.5
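For the node readiness check in the verification steps above, scanning a long node list by eye is error prone. The following is a minimal sketch that lists nodes that are not plainly Ready. The function name not_ready_nodes is hypothetical, and the sketch assumes that STATUS is the second column of the `oc get nodes` output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the names of nodes whose STATUS column is not
# exactly "Ready". Combined values such as "Ready,SchedulingDisabled" are
# also reported, because the node is not fully available for scheduling.
# Expects `oc get nodes --no-headers` output on stdin.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Possible usage against a live cluster:
#   oc get nodes --no-headers | not_ready_nodes
```

Empty output means that every listed node reported exactly Ready.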
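11.5.2. Removing the existing control plane node
Remove the control plane node that you are replacing. In the following example, this node is node-0.
Prerequisites
- You have added a new healthy control plane node.
Procedure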
Delete the BareMetalHost CR of the pre-existing control plane node:
$ oc delete bmh -n openshift-machine-api node-0
Confirm that the machine is unhealthy:
$ oc get machine -A
Delete the Machine CR:
$ oc delete machine -n openshift-machine-api node-0
machine.machine.openshift.io "node-0" deleted
Confirm that the Node CR is deleted:
$ oc get nodes
Check the etcd-operator logs to confirm the status of the etcd cluster:
$ oc logs -n openshift-etcd-operator etcd-operator-8668df65d-lvpjf
Example output
E0927 07:53:10.597523 1 base_controller.go:272] ClusterMemberRemovalController reconciliation failed: cannot remove member: 192.168.111.23 because it is reported as healthy but it doesn't have a machine nor a node resource
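The Operator log can be long, and the pod name suffix differs in every cluster. The following is a minimal sketch that keeps only the member-removal messages. The function name member_removal_events is hypothetical, and the usage comment assumes that the Operator runs as a deployment named etcd-operator, which is inferred from the pod name in the example above and not stated in this procedure.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print only the log lines written by the
# ClusterMemberRemovalController. Reads etcd-operator logs on stdin.
# `|| true` keeps the exit status zero when no line matches.
member_removal_events() {
  grep 'ClusterMemberRemovalController' || true
}

# Possible usage (the deployment name is an assumption; list the actual pods
# with `oc get pods -n openshift-etcd-operator` if it does not match):
#   oc logs -n openshift-etcd-operator deployment/etcd-operator | member_removal_events
```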
Remove the physical machine to allow the etcd Operator to reconcile the cluster members.
Open a remote shell session to a control plane node:
$ oc rsh -n openshift-etcd etcd-node-1
Monitor the progress of the etcd Operator reconciliation by checking the members and the endpoint health:
# etcdctl member list -w table; etcdctl endpoint health