This documentation is for a release that is no longer maintained
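See documentation for the latest supported version 3 or the latest supported version 4.
12.7. Installing a primary control plane node on an unhealthy cluster
This procedure describes how to install a primary control plane node on an unhealthy OpenShift Container Platform cluster.
Prerequisites
Procedure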
Confirm the initial state of the cluster:
$ oc get nodes
Confirm that the etcd-operator detects the cluster as unhealthy:
$ oc logs -n openshift-etcd-operator etcd-operator-8668df65d-lvpjf
Example output
E0927 08:24:23.983733 1 base_controller.go:272] DefragController reconciliation failed: cluster is unhealthy: 2 of 3 members are available, worker-2 is unhealthy
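When the operator log is long, the availability count in that message can be pulled out with a small filter. The following is a minimal sketch and not part of the original procedure: the function name `etcd_member_availability` is hypothetical, and it assumes the operator keeps the `N of M members are available` wording shown in the example output.

```shell
# Hypothetical helper (not an oc or etcd command): reads etcd-operator log
# lines on stdin and prints "<available> <total>" for every line that
# contains the "cluster is unhealthy" message. Other lines are ignored.
etcd_member_availability() {
  sed -n 's/.*cluster is unhealthy: \([0-9][0-9]*\) of \([0-9][0-9]*\) members are available.*/\1 \2/p'
}

# Usage, with the pod name from the example above:
#   oc logs -n openshift-etcd-operator etcd-operator-8668df65d-lvpjf | etcd_member_availability
```

The filter prints nothing when the message is absent, so empty output is not by itself proof that the cluster is healthy.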
Confirm the etcd members by using etcdctl:
$ oc rsh -n openshift-etcd etcd-worker-3 etcdctl member list -w table
Confirm that etcdctl reports an unhealthy member of the cluster:
$ etcdctl endpoint health
Example output
{"level":"warn","ts":"2022-09-27T08:25:35.953Z","logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000680380/192.168.111.25","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 192.168.111.25: connect: no route to host\""}
192.168.111.28 is healthy: committed proposal: took = 12.465641ms
192.168.111.26 is healthy: committed proposal: took = 12.297059ms
192.168.111.25 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster
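To pick the failing endpoints out of that output without reading every line, a short filter is enough. This is a sketch under the assumption that each result line has the form `<endpoint> is healthy: ...` or `<endpoint> is unhealthy: ...`, as in the example output; the function name `unhealthy_endpoints` is hypothetical.

```shell
# Hypothetical helper: reads `etcdctl endpoint health` output on stdin and
# prints the address of every endpoint reported as unhealthy, one per line.
# Warning lines and the final "Error: unhealthy cluster" line do not match.
unhealthy_endpoints() {
  awk '$2 == "is" && $3 == "unhealthy:" { print $1 }'
}

# Usage, from inside the etcd pod shell:
#   etcdctl endpoint health 2>/dev/null | unhealthy_endpoints
```

With the example output above, the filter would print only the 192.168.111.25 endpoint.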
Remove the unhealthy control plane node by deleting the Machine custom resource:
$ oc delete machine -n openshift-machine-api test-day2-1-6qv96-master-2
Note: The Machine and Node custom resources (CRs) are not deleted if the unhealthy cluster cannot run successfully.
Confirm that etcd-operator has not removed the unhealthy machine:
$ oc logs -n openshift-etcd-operator etcd-operator-8668df65d-lvpjf -f
Example output
I0927 08:58:41.249222 1 machinedeletionhooks.go:135] skip removing the deletion hook from machine test-day2-1-6qv96-master-2 since its member is still present with any of: [{InternalIP } {InternalIP 192.168.111.26}]
Remove the unhealthy etcd member manually. Start by listing the members:
$ oc rsh -n openshift-etcd etcd-worker-3 etcdctl member list -w table
Confirm that etcdctl reports an unhealthy member of the cluster:
$ etcdctl endpoint health
Example output
{"level":"warn","ts":"2022-09-27T10:31:07.227Z","logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0000d6e00/192.168.111.25","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 192.168.111.25: connect: no route to host\""}
192.168.111.28 is healthy: committed proposal: took = 13.038278ms
192.168.111.26 is healthy: committed proposal: took = 12.950355ms
192.168.111.25 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster
Remove the unhealthy member from the etcd cluster:
$ etcdctl member remove 61e2a86084aafa62
Example output
Member 61e2a86084aafa62 removed from cluster 6881c977b97990d7
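The member ID passed to `etcdctl member remove` has to be read from the member list. As a minimal sketch, the lookup can be scripted by IP address. Note the assumptions: the helper below parses the default comma-separated output of `etcdctl member list` rather than the `-w table` form used in this procedure, and the function name `etcd_member_id_for_ip` is hypothetical.

```shell
# Hypothetical helper: reads `etcdctl member list` output (default format:
# "ID, status, name, peer URLs, client URLs, is-learner") on stdin and prints
# the ID of each member whose peer URL points at the given IP address.
# The "//<ip>:" match avoids treating 192.168.111.2 as a prefix of 192.168.111.25.
etcd_member_id_for_ip() {
  awk -F', ' -v ip="$1" 'index($4, "//" ip ":") > 0 { print $1 }'
}

# Usage, from inside the etcd pod shell:
#   etcdctl member list | etcd_member_id_for_ip 192.168.111.25
```

Check the printed ID against the member list before passing it to `etcdctl member remove`, because removing the wrong member from a degraded cluster can cost quorum.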
Confirm the etcd members by running the following command:
$ etcdctl member list -w table
Review and approve the certificate signing requests
Review the certificate signing requests (CSRs):
$ oc get csr | grep Pending
Example output
csr-5sd59   8m19s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>   Pending
csr-xzqts   10s     kubernetes.io/kubelet-serving                 system:node:worker-6                                                        <none>   Pending
Approve all pending CSRs:
$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve
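Before approving in bulk, it can help to list only the names of the pending requests for review. The sketch below assumes the tabular `oc get csr` output shown earlier, where the condition is the last column; the function name `pending_csr_names` is hypothetical.

```shell
# Hypothetical helper: reads tabular `oc get csr` output on stdin and prints
# the name (first column) of every request whose last column is "Pending".
# A header row, if present, is skipped because its last column is not "Pending".
pending_csr_names() {
  awk '$NF == "Pending" { print $1 }'
}

# Usage:
#   oc get csr | pending_csr_names
```

The printed names can then be approved one at a time with `oc adm certificate approve <name>` if bulk approval is not wanted.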
Note: You must approve the CSRs to complete the installation.
Confirm the ready status of the control plane node:
$ oc get nodes
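The readiness check can also be reduced to a list of nodes that still need attention. This is a sketch that assumes the default `oc get nodes` table, with a header row and the node status in the second column; the function name `not_ready_nodes` is hypothetical.

```shell
# Hypothetical helper: reads default `oc get nodes` output on stdin and prints
# the name of every node whose STATUS column does not start with "Ready".
# "Ready,SchedulingDisabled" counts as ready; "NotReady" does not.
not_ready_nodes() {
  awk 'NR > 1 && $2 !~ /^Ready/ { print $1 }'
}

# Usage:
#   oc get nodes | not_ready_nodes
```

Empty output means every listed node reports a status that starts with `Ready`.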
Validate the Machine, Node, and BareMetalHost custom resources.
If the cluster is running with a functional Machine API, the etcd-operator requires Machine CRs to be present. When present, Machine CRs are displayed in the Running phase.
Create a Machine custom resource that is linked with the BareMetalHost and Node. Make sure that a Machine CR references the newly added node.
Important: Boot-it-yourself does not create BareMetalHost and Machine CRs, so you must create them. If the BareMetalHost and Machine CRs are not created, etcd-operator generates errors when it runs.
Add the BareMetalHost custom resource:
$ oc create bmh -n openshift-machine-api custom-master3
Add the Machine custom resource:
$ oc create machine -n openshift-machine-api custom-master3
Link the BareMetalHost, Machine, and Node by running the link-machine-and-node.sh script:
$ bash link-machine-and-node.sh custom-master3 worker-3
Confirm the etcd members by running the following command:
$ oc rsh -n openshift-etcd etcd-worker-3 etcdctl member list -w table
Confirm that the etcd Operator has configured all nodes:
$ oc get clusteroperator etcd
Example output
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
etcd   4.11.5    True        False         False      22h
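The three status columns can be checked mechanically instead of by eye. The sketch below assumes the column order shown in the example output (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED) with a header row; the function name `operator_is_healthy` is hypothetical.

```shell
# Hypothetical helper: reads `oc get clusteroperator` table output on stdin.
# Returns 0 only if at least one operator row is present and every row shows
# AVAILABLE=True, PROGRESSING=False, DEGRADED=False; returns 1 otherwise.
operator_is_healthy() {
  awk '
    NR > 1 {
      seen = 1
      if ($3 != "True" || $4 != "False" || $5 != "False") bad = 1
    }
    END {
      if (seen && !bad) exit 0
      exit 1
    }'
}

# Usage:
#   oc get clusteroperator etcd | operator_is_healthy && echo "etcd operator is healthy"
```

Because the helper checks every data row, the same function can be fed the output for all cluster operators, on the assumption that the column order is the same.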
Confirm the health of the etcd endpoints:
$ oc rsh -n openshift-etcd etcd-worker-3 etcdctl endpoint health
Example output
192.168.111.26 is healthy: committed proposal: took = 9.105375ms
192.168.111.28 is healthy: committed proposal: took = 9.15205ms
192.168.111.29 is healthy: committed proposal: took = 10.277577ms
Confirm the health of the nodes:
$ oc get Nodes
Confirm the health of the ClusterOperators:
$ oc get ClusterOperators
Confirm the ClusterVersion:
$ oc get ClusterVersion
Example output
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.5    True        False         22h     Cluster version is 4.11.5