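11.5. Replacing a control plane node in a healthy cluster
You can replace a control plane (master) node in a healthy OpenShift Container Platform cluster that has three to five control plane nodes by adding a new control plane node and then removing an existing one.
If the cluster is unhealthy, you must perform additional operations before you can manage the control plane nodes. For more information, see Replacing a control plane node in an unhealthy cluster.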
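11.5.1. Adding a new control plane node
Add a new control plane node and verify that it is healthy. In the following example, the new node is node-5.
Prerequisites
- You are using OpenShift Container Platform 4.11 or later.
- You have installed a healthy cluster with at least three control plane nodes.
- You have created a single control plane node to be added on day 2.
Procedure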
Retrieve the pending certificate signing requests (CSRs) for the new day 2 control plane node:
$ oc get csr | grep Pending
Example output
csr-5sd59 8m19s kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending
csr-xzqts 10s kubernetes.io/kubelet-serving system:node:node-5 <none> Pending
Approve all pending CSRs for the new node (node-5 in this example):
$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve
Important: You must approve the CSRs to complete the installation.
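
As a rough alternative to the `grep Pending` filter, the following sketch extracts only the names of pending CSRs from the tabular `oc get csr` output, so that they can be reviewed before approval. The helper name `pending_csr_names` is hypothetical and not part of the `oc` CLI; the sample rows are the ones from the example output above.

```shell
#!/usr/bin/env bash
# Print the NAME column of every row whose last column (CONDITION) is exactly "Pending".
# pending_csr_names is a hypothetical helper, not an oc subcommand.
pending_csr_names() {
  awk '$NF == "Pending" { print $1 }'
}

# Sample rows copied from the example output in this procedure.
sample_csrs='csr-5sd59 8m19s kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending
csr-xzqts 10s kubernetes.io/kubelet-serving system:node:node-5 <none> Pending'

printf '%s\n' "$sample_csrs" | pending_csr_names

# Against a live cluster (not run here), the same filter could feed the approval command:
#   oc get csr | pending_csr_names | xargs --no-run-if-empty oc adm certificate approve
```

Matching on the last column rather than on the whole line avoids picking up rows where the word "Pending" happens to appear in another field.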
Confirm that the new control plane node is in the Ready state:
$ oc get nodes
Note: The etcd Operator requires a Machine custom resource (CR) that references the new node when the cluster runs with the Machine API. The Machine API is activated automatically when the cluster has three or more control plane nodes.
Create the BareMetalHost and Machine CRs and link them to the Node CR of the new control plane node.
Create a BareMetalHost CR with a unique .metadata.name value.
Apply the BareMetalHost CR:
$ oc apply -f <filename>
Replace <filename> with the name of the BareMetalHost CR.
Create a Machine CR with a unique .metadata.name value. In the CR, replace <cluster_name> with the name of your specific cluster, for example, test-day2-1-6qv96.
To get the cluster name, run the following command:
$ oc get infrastructure cluster -o=jsonpath='{.status.infrastructureName}{"\n"}'
Apply the Machine CR:
$ oc apply -f <filename>
Replace <filename> with the name of the Machine CR.
Run the link-machine-and-node.sh script to link the BareMetalHost, Machine, and Node CRs:
Copy the link-machine-and-node.sh script to your local machine.
Make the script executable:
$ chmod +x link-machine-and-node.sh
Run the script:
$ bash link-machine-and-node.sh node-5 node-5
Note: The first node-5 instance represents the machine, and the second represents the node.
Confirm the etcd members by opening a shell in one of the pre-existing control plane nodes:
Open a remote shell session to the control plane node:
$ oc rsh -n openshift-etcd etcd-node-0
List the etcd members:
# etcdctl member list -w table
Monitor the etcd Operator configuration process until it completes:
$ oc get clusteroperator etcd
Example output (upon completion)
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
etcd 4.11.5 True False False 5h54m
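
One way to turn the "upon completion" condition into a scriptable check is sketched below. It reads `oc get clusteroperator` output and succeeds only when every row shows AVAILABLE=True, PROGRESSING=False, and DEGRADED=False. The helper name `operators_settled` is hypothetical, and the sketch assumes that the VERSION column is populated so that the three status fields are columns 3 to 5.

```shell
#!/usr/bin/env bash
# Exit 0 only if every operator row is Available, not Progressing, and not Degraded.
# operators_settled is a hypothetical helper; it assumes VERSION is non-empty,
# which places AVAILABLE, PROGRESSING, and DEGRADED in columns 3, 4, and 5.
operators_settled() {
  awk '
    NF == 0      { next }   # ignore blank lines
    $1 == "NAME" { next }   # ignore the header row
    { rows++ }
    $3 != "True" || $4 != "False" || $5 != "False" { bad = 1 }
    END { if (bad || rows == 0) exit 1 }
  '
}

# Sample taken from the example output in this procedure.
sample_status='NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
etcd 4.11.5 True False False 5h54m'

if printf '%s\n' "$sample_status" | operators_settled; then
  echo "etcd operator has settled"
else
  echo "etcd operator is still reconciling"
fi

# Against a live cluster (not run here):
#   oc get clusteroperator etcd | operators_settled
```

The same filter could be applied to the full `oc get ClusterOperators` listing, subject to the same column assumption.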
Confirm the health of etcd by running the following commands:
Open a remote shell session to the control plane node:
$ oc rsh -n openshift-etcd etcd-node-0
Check the endpoint health:
# etcdctl endpoint health
Example output
192.168.111.24 is healthy: committed proposal: took = 10.383651ms
192.168.111.26 is healthy: committed proposal: took = 11.297561ms
192.168.111.25 is healthy: committed proposal: took = 13.892416ms
192.168.111.28 is healthy: committed proposal: took = 11.870755ms
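
The endpoint health output can also be checked mechanically. The sketch below succeeds only if every non-empty line reports "is healthy". The helper name `all_endpoints_healthy` is hypothetical, and the line format is assumed to match the example output above.

```shell
#!/usr/bin/env bash
# Exit 0 only if at least one endpoint is listed and every listed endpoint is healthy.
# all_endpoints_healthy is a hypothetical helper that reads etcdctl output on stdin.
all_endpoints_healthy() {
  awk '
    NF == 0         { next }       # ignore blank lines
    { total++ }
    / is healthy:/  { healthy++ }  # does not match "is unhealthy"
    END { if (total == 0 || healthy != total) exit 1 }
  '
}

# Sample taken from the example output in this procedure.
sample_health='192.168.111.24 is healthy: committed proposal: took = 10.383651ms
192.168.111.26 is healthy: committed proposal: took = 11.297561ms
192.168.111.25 is healthy: committed proposal: took = 13.892416ms
192.168.111.28 is healthy: committed proposal: took = 11.870755ms'

if printf '%s\n' "$sample_health" | all_endpoints_healthy; then
  echo "all endpoints healthy"
else
  echo "at least one endpoint is not healthy"
fi

# Inside the etcd pod shell (not run here):
#   etcdctl endpoint health 2>&1 | all_endpoints_healthy
```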
Verify that all nodes are ready:
$ oc get nodes
Verify that the cluster Operators are available:
$ oc get ClusterOperators
Verify that the cluster version is correct:
$ oc get ClusterVersion
Example output
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.5 True False 5h57m Cluster version is 4.11.5
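
For the node readiness check in the verification steps above, a minimal sketch such as the following lists any node whose STATUS column does not start with Ready. The helper name `not_ready_nodes` is hypothetical, and the sample listing is invented for illustration only; it is not output from a real cluster.

```shell
#!/usr/bin/env bash
# Print the names of nodes whose STATUS does not begin with "Ready".
# A status such as "Ready,SchedulingDisabled" is treated as ready by this sketch.
# not_ready_nodes is a hypothetical helper that reads `oc get nodes` output on stdin.
not_ready_nodes() {
  awk '
    NF == 0      { next }   # ignore blank lines
    $1 == "NAME" { next }   # ignore the header row
    $2 !~ /^Ready/ { print $1 }
  '
}

# Invented sample listing, for illustration only.
sample_nodes='NAME STATUS ROLES AGE VERSION
node-0 Ready master 5h v0.0.0
node-1 Ready master 5h v0.0.0
node-5 NotReady master 2m v0.0.0'

printf '%s\n' "$sample_nodes" | not_ready_nodes

# Against a live cluster (not run here); empty output means every node is ready:
#   oc get nodes | not_ready_nodes
```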
11.5.2. Removing the existing control plane node
Remove the control plane node that you are replacing. In the following example, this is node-0.
Prerequisites
- You have added a new, healthy control plane node.
Procedure
Delete the BareMetalHost CR of the pre-existing control plane node:
$ oc delete bmh -n openshift-machine-api node-0
Confirm that the machine is unhealthy:
$ oc get machine -A
Delete the Machine CR:
$ oc delete machine -n openshift-machine-api node-0
machine.machine.openshift.io "node-0" deleted
Confirm the removal of the Node CR:
$ oc get nodes
Check the etcd-operator logs to confirm the status of the etcd cluster:
$ oc logs -n openshift-etcd-operator etcd-operator-8668df65d-lvpjf
Example output
E0927 07:53:10.597523 1 base_controller.go:272] ClusterMemberRemovalController reconciliation failed: cannot remove member: 192.168.111.23 because it is reported as healthy but it doesn't have a machine nor a node resource
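
The log line above names the member that the Operator cannot yet remove. As a rough sketch, the addresses of such members could be pulled out of the log stream as follows. The helper name `blocked_members` is hypothetical, and the pattern is based only on the message wording shown in the example output.

```shell
#!/usr/bin/env bash
# Print each distinct member address that appears in a "cannot remove member" message.
# blocked_members is a hypothetical helper that reads etcd-operator logs on stdin.
blocked_members() {
  grep -oE 'cannot remove member: [0-9.]+' | awk '{ print $NF }' | sort -u
}

# Sample taken from the example output in this procedure.
sample_log="E0927 07:53:10.597523 1 base_controller.go:272] ClusterMemberRemovalController reconciliation failed: cannot remove member: 192.168.111.23 because it is reported as healthy but it doesn't have a machine nor a node resource"

printf '%s\n' "$sample_log" | blocked_members

# Against a live cluster (not run here), with <pod_name> as a placeholder:
#   oc logs -n openshift-etcd-operator <pod_name> | blocked_members
```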
Remove the physical machine to allow the etcd Operator to reconcile the cluster members:
Open a remote shell session to a control plane node:
$ oc rsh -n openshift-etcd etcd-node-1
Monitor the progress of the etcd Operator reconciliation by checking the members and the endpoint health:
# etcdctl member list -w table; etcdctl endpoint health
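
To judge when reconciliation has finished, it can help to reduce the member table to a plain list of names and check that the removed node no longer appears. The sketch below does this under the assumption that `etcdctl member list -w table` prints a `|`-delimited table with NAME as its third column. The helper name `etcd_member_names`, the member IDs, and the addresses in the sample table are all invented for illustration.

```shell
#!/usr/bin/env bash
# Print the NAME column of a "|"-delimited etcdctl member table read from stdin.
# etcd_member_names is a hypothetical helper; border lines ("+---+") have no "|"
# separators and are skipped, as is the header row.
etcd_member_names() {
  awk -F'|' '
    NF > 2 {
      name = $4
      gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim surrounding whitespace
      if (name != "NAME") print name
    }
  '
}

# Invented sample table, for illustration only.
sample_members='+----------+---------+--------+------------------------+------------------------+------------+
|    ID    | STATUS  |  NAME  |       PEER ADDRS       |      CLIENT ADDRS      | IS LEARNER |
+----------+---------+--------+------------------------+------------------------+------------+
| aaaaaaaa | started | node-1 | https://192.0.2.1:2380 | https://192.0.2.1:2379 |      false |
| bbbbbbbb | started | node-2 | https://192.0.2.2:2380 | https://192.0.2.2:2379 |      false |
| cccccccc | started | node-5 | https://192.0.2.5:2380 | https://192.0.2.5:2379 |      false |
+----------+---------+--------+------------------------+------------------------+------------+'

if printf '%s\n' "$sample_members" | etcd_member_names | grep -qx 'node-0'; then
  echo "node-0 is still an etcd member"
else
  echo "node-0 is no longer an etcd member"
fi

# Inside the etcd pod shell (not run here):
#   etcdctl member list -w table | etcd_member_names
```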