12.5. 在一个健康的集群中安装主 control plane 节点
此流程描述了如何在健康的 OpenShift Container Platform 集群上安装主 control plane 节点。
如果集群不健康,则在管理前需要额外的操作。如需更多信息,请参阅 其它资源。
先决条件
-
您使用带有正确的
etcd-operator
版本的 OpenShift Container Platform 4.11 或更高版本。 - 您已安装了一个具有至少三个节点的健康集群。
- 您已创建了单个 control plane 节点。
流程
检索待处理的
CertificateSigningRequests
(CSR):$ oc get csr | grep Pending
输出示例
csr-5sd59 8m19s kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending csr-xzqts 10s kubernetes.io/kubelet-serving system:node:worker-6 <none> Pending
批准待处理的 CSR:
$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve
重要您必须批准 CSR 才能完成安装。
确认主节点处于
Ready
状态:$ oc get nodes
输出示例
NAME STATUS ROLES AGE VERSION master-0 Ready master 4h42m v1.24.0+3882f8f worker-1 Ready worker 4h29m v1.24.0+3882f8f master-2 Ready master 4h43m v1.24.0+3882f8f master-3 Ready master 4h27m v1.24.0+3882f8f worker-4 Ready worker 4h30m v1.24.0+3882f8f master-5 Ready master 105s v1.24.0+3882f8f
注意当集群运行功能 Machine API 时,
etcd-operator
需要引用新节点的Machine
Custom Resource (CR)。将
Machine
CR 与BareMetalHost
和Node
链接:使用具有唯一
.metadata.name
值的BareMetalHost
CR:apiVersion: metal3.io/v1alpha1 kind: BareMetalHost metadata: name: custom-master3 namespace: openshift-machine-api annotations: spec: automatedCleaningMode: metadata bootMACAddress: 00:00:00:00:00:02 bootMode: UEFI customDeploy: method: install_coreos externallyProvisioned: true online: true userData: name: master-user-data-managed namespace: openshift-machine-api
$ oc create -f <filename>
应用
BareMetalHost
CR:$ oc apply -f <filename>
使用唯一的
.machine.name
值创建Machine
CR:apiVersion: machine.openshift.io/v1beta1 kind: Machine metadata: annotations: machine.openshift.io/instance-state: externally provisioned metal3.io/BareMetalHost: openshift-machine-api/custom-master3 finalizers: - machine.machine.openshift.io generation: 3 labels: machine.openshift.io/cluster-api-cluster: test-day2-1-6qv96 machine.openshift.io/cluster-api-machine-role: master machine.openshift.io/cluster-api-machine-type: master name: custom-master3 namespace: openshift-machine-api spec: metadata: {} providerSpec: value: apiVersion: baremetal.cluster.k8s.io/v1alpha1 customDeploy: method: install_coreos hostSelector: {} image: checksum: "" url: "" kind: BareMetalMachineProviderSpec metadata: creationTimestamp: null userData: name: master-user-data-managed
$ oc create -f <filename>
应用
Machine
CR:$ oc apply -f <filename>
使用
link-machine-and-node.sh
脚本链接BareMetalHost
,Machine
, 和Node
:#!/bin/bash # Credit goes to https://bugzilla.redhat.com/show_bug.cgi?id=1801238. # This script will link Machine object and Node object. This is needed # in order to have IP address of the Node present in the status of the Machine. # set -x set -e machine="$1" node="$2" if [ -z "$machine" ] || [ -z "$node" ]; then echo "Usage: $0 MACHINE NODE" exit 1 fi # uid=$(echo "${node}" | cut -f1 -d':') node_name=$(echo "${node}" | cut -f2 -d':') oc proxy & proxy_pid=$! function kill_proxy { kill $proxy_pid } trap kill_proxy EXIT SIGINT HOST_PROXY_API_PATH="http://localhost:8001/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts" function print_nics() { local ips local eob declare -a ips readarray -t ips < <(echo "${1}" \ | jq '.[] | select(. | .type == "InternalIP") | .address' \ | sed 's/"//g') eob=',' for (( i=0; i<${#ips[@]}; i++ )); do if [ $((i+1)) -eq ${#ips[@]} ]; then eob="" fi cat <<- EOF { "ip": "${ips[$i]}", "mac": "00:00:00:00:00:00", "model": "unknown", "speedGbps": 10, "vlanId": 0, "pxe": true, "name": "eth1" }${eob} EOF done } function wait_for_json() { local name local url local curl_opts local timeout local start_time local curr_time local time_diff name="$1" url="$2" timeout="$3" shift 3 curl_opts="$@" echo -n "Waiting for $name to respond" start_time=$(date +%s) until curl -g -X GET "$url" "${curl_opts[@]}" 2> /dev/null | jq '.' 2> /dev/null > /dev/null; do echo -n "." curr_time=$(date +%s) time_diff=$((curr_time - start_time)) if [[ $time_diff -gt $timeout ]]; then printf '\nTimed out waiting for %s' "${name}" return 1 fi sleep 5 done echo " Success!" return 0 } wait_for_json oc_proxy "${HOST_PROXY_API_PATH}" 10 -H "Accept: application/json" -H "Content-Type: application/json" addresses=$(oc get node -n openshift-machine-api "${node_name}" -o json | jq -c '.status.addresses') machine_data=$(oc get machines.machine.openshift.io -n openshift-machine-api -o json "${machine}") host=$(echo "$machine_data" | jq '.metadata.annotations["metal3.io/BareMetalHost"]' | cut -f2 -d/ | sed 's/"//g') if [ -z "$host" ]; then echo "Machine $machine is not linked to a host yet." 1>&2 exit 1 fi # The address structure on the host doesn't match the node, so extract # the values we want into separate variables so we can build the patch # we need. hostname=$(echo "${addresses}" | jq '.[] | select(. | .type == "Hostname") | .address' | sed 's/"//g') set +e read -r -d '' host_patch << EOF { "status": { "hardware": { "hostname": "${hostname}", "nics": [ $(print_nics "${addresses}") ], "systemVendor": { "manufacturer": "Red Hat", "productName": "product name", "serialNumber": "" }, "firmware": { "bios": { "date": "04/01/2014", "vendor": "SeaBIOS", "version": "1.11.0-2.el7" } }, "ramMebibytes": 0, "storage": [], "cpu": { "arch": "x86_64", "model": "Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz", "clockMegahertz": 2199.998, "count": 4, "flags": [] } } } } EOF set -e echo "PATCHING HOST" echo "${host_patch}" | jq . curl -s \ -X PATCH \ "${HOST_PROXY_API_PATH}/${host}/status" \ -H "Content-type: application/merge-patch+json" \ -d "${host_patch}" oc get baremetalhost -n openshift-machine-api -o yaml "${host}"
$ bash link-machine-and-node.sh custom-master3 worker-5
确认
etcd
成员:$ oc rsh -n openshift-etcd etcd-worker-2 etcdctl member list -w table
输出示例
+--------+---------+--------+--------------+--------------+---------+ | ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | LEARNER | +--------+---------+--------+--------------+--------------+---------+ |2c18942f| started |worker-3|192.168.111.26|192.168.111.26| false | |61e2a860| started |worker-2|192.168.111.25|192.168.111.25| false | |ead4f280| started |worker-5|192.168.111.28|192.168.111.28| false | +--------+---------+--------+--------------+--------------+---------+
确认
etcd-operator
配置适用于所有节点:$ oc get clusteroperator etcd
输出示例
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE etcd 4.11.5 True False False 5h54m
确认
etcd-operator
健康状况:$ oc rsh -n openshift-etcd etcd-worker-0 etcdctl endpoint health
输出示例
192.168.111.26 is healthy: committed proposal: took = 11.297561ms 192.168.111.25 is healthy: committed proposal: took = 13.892416ms 192.168.111.28 is healthy: committed proposal: took = 11.870755ms
确认节点健康状况:
$ oc get nodes
输出示例
NAME STATUS ROLES AGE VERSION master-0 Ready master 6h20m v1.24.0+3882f8f worker-1 Ready worker 6h7m v1.24.0+3882f8f master-2 Ready master 6h20m v1.24.0+3882f8f master-3 Ready master 6h4m v1.24.0+3882f8f worker-4 Ready worker 6h7m v1.24.0+3882f8f master-5 Ready master 99m v1.24.0+3882f8f
确认
ClusterOperators
健康状况:$ oc get ClusterOperators
输出示例
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MSG authentication 4.11.5 True False False 5h57m baremetal 4.11.5 True False False 6h19m cloud-controller-manager 4.11.5 True False False 6h20m cloud-credential 4.11.5 True False False 6h23m cluster-autoscaler 4.11.5 True False False 6h18m config-operator 4.11.5 True False False 6h19m console 4.11.5 True False False 6h4m csi-snapshot-controller 4.11.5 True False False 6h19m dns 4.11.5 True False False 6h18m etcd 4.11.5 True False False 6h17m image-registry 4.11.5 True False False 6h7m ingress 4.11.5 True False False 6h6m insights 4.11.5 True False False 6h12m kube-apiserver 4.11.5 True False False 6h16m kube-controller-manager 4.11.5 True False False 6h16m kube-scheduler 4.11.5 True False False 6h16m kube-storage-version-migrator 4.11.5 True False False 6h19m machine-api 4.11.5 True False False 6h15m machine-approver 4.11.5 True False False 6h19m machine-config 4.11.5 True False False 6h18m marketplace 4.11.5 True False False 6h18m monitoring 4.11.5 True False False 6h4m network 4.11.5 True False False 6h20m node-tuning 4.11.5 True False False 6h18m openshift-apiserver 4.11.5 True False False 6h8m openshift-controller-manager 4.11.5 True False False 6h7m openshift-samples 4.11.5 True False False 6h12m operator-lifecycle-manager 4.11.5 True False False 6h18m operator-lifecycle-manager-catalog 4.11.5 True False False 6h19m operator-lifecycle-manager-pkgsvr 4.11.5 True False False 6h12m service-ca 4.11.5 True False False 6h19m storage 4.11.5 True False False 6h19m
确认
ClusterVersion
:$ oc get ClusterVersion
输出示例
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.11.5 True False 5h57m Cluster version is 4.11.5
删除旧的 control plane 节点:
删除
BareMetalHost
CR:$ oc delete bmh -n openshift-machine-api custom-master3
确认
Machine
不健康:$ oc get machine -A
输出示例
NAMESPACE NAME PHASE AGE openshift-machine-api custom-master3 Running 14h openshift-machine-api test-day2-1-6qv96-master-0 Failed 20h openshift-machine-api test-day2-1-6qv96-master-1 Running 20h openshift-machine-api test-day2-1-6qv96-master-2 Running 20h openshift-machine-api test-day2-1-6qv96-worker-0-8w7vr Running 19h openshift-machine-api test-day2-1-6qv96-worker-0-rxddj Running 19h
删除
Machine
CR:$ oc delete machine -n openshift-machine-api test-day2-1-6qv96-master-0 machine.machine.openshift.io "test-day2-1-6qv96-master-0" deleted
确认删除
Node
CR:$ oc get nodes
输出示例
NAME STATUS ROLES AGE VERSION worker-1 Ready worker 19h v1.24.0+3882f8f master-2 Ready master 20h v1.24.0+3882f8f master-3 Ready master 19h v1.24.0+3882f8f worker-4 Ready worker 19h v1.24.0+3882f8f master-5 Ready master 15h v1.24.0+3882f8f
检查
etcd-operator
日志以确认etcd
集群的状态:$ oc logs -n openshift-etcd-operator etcd-operator-8668df65d-lvpjf
输出示例
E0927 07:53:10.597523 1 base_controller.go:272] ClusterMemberRemovalController reconciliation failed: cannot remove member: 192.168.111.23 because it is reported as healthy but it doesn't have a machine nor a node resource
删除物理机器,以允许
etcd-operator
协调集群成员:$ oc rsh -n openshift-etcd etcd-worker-2 etcdctl member list -w table; etcdctl endpoint health
输出示例
+--------+---------+--------+--------------+--------------+---------+ | ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | LEARNER | +--------+---------+--------+--------------+--------------+---------+ |2c18942f| started |worker-3|192.168.111.26|192.168.111.26| false | |61e2a860| started |worker-2|192.168.111.25|192.168.111.25| false | |ead4f280| started |worker-5|192.168.111.28|192.168.111.28| false | +--------+---------+--------+--------------+--------------+---------+ 192.168.111.26 is healthy: committed proposal: took = 10.458132ms 192.168.111.25 is healthy: committed proposal: took = 11.047349ms 192.168.111.28 is healthy: committed proposal: took = 11.414402ms