12.5. 在一个健康的集群中安装主 control plane 节点

此流程描述了如何在健康的 OpenShift Container Platform 集群上安装主 control plane 节点。

如果集群不健康，则在管理前需要额外的操作。如需更多信息，请参阅 其它资源。

先决条件

您使用带有正确的 etcd-operator 版本的 OpenShift Container Platform 4.11 或更高版本。
您已安装了一个具有至少三个节点的健康集群。
您已创建了单个 control plane 节点。

流程

检索待处理的 CertificateSigningRequests (CSR)：

$ oc get csr | grep Pending

输出示例

csr-5sd59   8m19s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
csr-xzqts   10s     kubernetes.io/kubelet-serving                 system:node:worker-6                                                   <none>              Pending

批准待处理的 CSR：

$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve

重要

您必须批准 CSR 才能完成安装。

确认主节点处于 Ready 状态：

$ oc get nodes

输出示例

NAME       STATUS   ROLES    AGE     VERSION
master-0   Ready    master   4h42m   v1.24.0+3882f8f
worker-1   Ready    worker   4h29m   v1.24.0+3882f8f
master-2   Ready    master   4h43m   v1.24.0+3882f8f
master-3   Ready    master   4h27m   v1.24.0+3882f8f
worker-4   Ready    worker   4h30m   v1.24.0+3882f8f
master-5   Ready    master   105s    v1.24.0+3882f8f

注意

当集群运行功能 Machine API 时，etcd-operator 需要引用新节点的 Machine Custom Resource (CR)。

将 Machine CR 与 BareMetalHost 和 Node 链接：

使用具有唯一 .metadata.name 值的 BareMetalHost CR：

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: custom-master3
  namespace: openshift-machine-api
  annotations:
spec:
  automatedCleaningMode: metadata
  bootMACAddress: 00:00:00:00:00:02
  bootMode: UEFI
  customDeploy:
    method: install_coreos
  externallyProvisioned: true
  online: true
  userData:
    name: master-user-data-managed
    namespace: openshift-machine-api

$ oc create -f <filename>

应用 BareMetalHost CR：
```
$ oc apply -f <filename>
```

使用唯一的 .machine.name 值创建 Machine CR：

apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
    machine.openshift.io/instance-state: externally provisioned
    metal3.io/BareMetalHost: openshift-machine-api/custom-master3
  finalizers:
  - machine.machine.openshift.io
  generation: 3
  labels:
    machine.openshift.io/cluster-api-cluster: test-day2-1-6qv96
    machine.openshift.io/cluster-api-machine-role: master
    machine.openshift.io/cluster-api-machine-type: master
  name: custom-master3
  namespace: openshift-machine-api
spec:
  metadata: {}
  providerSpec:
    value:
      apiVersion: baremetal.cluster.k8s.io/v1alpha1
      customDeploy:
        method: install_coreos
      hostSelector: {}
      image:
        checksum: ""
        url: ""
      kind: BareMetalMachineProviderSpec
      metadata:
        creationTimestamp: null
      userData:
        name: master-user-data-managed

$ oc create -f <filename>

应用 Machine CR：
```
$ oc apply -f <filename>
```

使用 link-machine-and-node.sh 脚本链接 BareMetalHost, Machine, 和 Node：

#!/bin/bash

# Credit goes to https://bugzilla.redhat.com/show_bug.cgi?id=1801238.
# This script will link Machine object and Node object. This is needed
# in order to have IP address of the Node present in the status of the Machine.

# set -x
set -e

machine="$1"
node="$2"

if [ -z "$machine" ] || [ -z "$node" ]; then
    echo "Usage: $0 MACHINE NODE"
    exit 1
fi

# uid=$(echo "${node}" | cut -f1 -d':')
node_name=$(echo "${node}" | cut -f2 -d':')

oc proxy &
proxy_pid=$!
function kill_proxy {
    kill $proxy_pid
}
trap kill_proxy EXIT SIGINT

HOST_PROXY_API_PATH="http://localhost:8001/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts"

function print_nics() {
    local ips
    local eob
    declare -a ips

    readarray -t ips < <(echo "${1}" \
                         | jq '.[] | select(. | .type == "InternalIP") | .address' \
                         | sed 's/"//g')

    eob=','
    for (( i=0; i<${#ips[@]}; i++ )); do
        if [ $((i+1)) -eq ${#ips[@]} ]; then
            eob=""
        fi
        cat <<- EOF
          {
            "ip": "${ips[$i]}",
            "mac": "00:00:00:00:00:00",
            "model": "unknown",
            "speedGbps": 10,
            "vlanId": 0,
            "pxe": true,
            "name": "eth1"
          }${eob}
EOF
    done
}

function wait_for_json() {
    local name
    local url
    local curl_opts
    local timeout

    local start_time
    local curr_time
    local time_diff

    name="$1"
    url="$2"
    timeout="$3"
    shift 3
    curl_opts="$@"
    echo -n "Waiting for $name to respond"
    start_time=$(date +%s)
    until curl -g -X GET "$url" "${curl_opts[@]}" 2> /dev/null | jq '.' 2> /dev/null > /dev/null; do
        echo -n "."
        curr_time=$(date +%s)
        time_diff=$((curr_time - start_time))
        if [[ $time_diff -gt $timeout ]]; then
            printf '\nTimed out waiting for %s' "${name}"
            return 1
        fi
        sleep 5
    done
    echo " Success!"
    return 0
}
wait_for_json oc_proxy "${HOST_PROXY_API_PATH}" 10 -H "Accept: application/json" -H "Content-Type: application/json"

addresses=$(oc get node -n openshift-machine-api "${node_name}" -o json | jq -c '.status.addresses')

machine_data=$(oc get machines.machine.openshift.io -n openshift-machine-api -o json "${machine}")
host=$(echo "$machine_data" | jq '.metadata.annotations["metal3.io/BareMetalHost"]' | cut -f2 -d/ | sed 's/"//g')

if [ -z "$host" ]; then
    echo "Machine $machine is not linked to a host yet." 1>&2
    exit 1
fi

# The address structure on the host doesn't match the node, so extract
# the values we want into separate variables so we can build the patch
# we need.
hostname=$(echo "${addresses}" | jq '.[] | select(. | .type == "Hostname") | .address' | sed 's/"//g')

set +e
read -r -d '' host_patch << EOF
{
  "status": {
    "hardware": {
      "hostname": "${hostname}",
      "nics": [
$(print_nics "${addresses}")
      ],
      "systemVendor": {
        "manufacturer": "Red Hat",
        "productName": "product name",
        "serialNumber": ""
      },
      "firmware": {
        "bios": {
          "date": "04/01/2014",
          "vendor": "SeaBIOS",
          "version": "1.11.0-2.el7"
        }
      },
      "ramMebibytes": 0,
      "storage": [],
      "cpu": {
        "arch": "x86_64",
        "model": "Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz",
        "clockMegahertz": 2199.998,
        "count": 4,
        "flags": []
      }
    }
  }
}
EOF
set -e

echo "PATCHING HOST"
echo "${host_patch}" | jq .

curl -s \
     -X PATCH \
     "${HOST_PROXY_API_PATH}/${host}/status" \
     -H "Content-type: application/merge-patch+json" \
     -d "${host_patch}"

oc get baremetalhost -n openshift-machine-api -o yaml "${host}"

$ bash link-machine-and-node.sh custom-master3 worker-5

确认 etcd 成员：

$ oc rsh -n openshift-etcd etcd-worker-2
etcdctl member list -w table

输出示例

+--------+---------+--------+--------------+--------------+---------+
|   ID   |  STATUS |  NAME  |  PEER ADDRS  | CLIENT ADDRS | LEARNER |
+--------+---------+--------+--------------+--------------+---------+
|2c18942f| started |worker-3|192.168.111.26|192.168.111.26|  false  |
|61e2a860| started |worker-2|192.168.111.25|192.168.111.25|  false  |
|ead4f280| started |worker-5|192.168.111.28|192.168.111.28|  false  |
+--------+---------+--------+--------------+--------------+---------+

确认 etcd-operator 配置适用于所有节点：

$ oc get clusteroperator etcd

输出示例

NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd   4.11.5    True        False         False      5h54m

确认 etcd-operator 健康状况：

$ oc rsh -n openshift-etcd etcd-worker-0
etcdctl endpoint health

输出示例

192.168.111.26 is healthy: committed proposal: took = 11.297561ms
192.168.111.25 is healthy: committed proposal: took = 13.892416ms
192.168.111.28 is healthy: committed proposal: took = 11.870755ms

确认节点健康状况：

$ oc get nodes

输出示例

NAME       STATUS   ROLES    AGE     VERSION
master-0   Ready    master   6h20m   v1.24.0+3882f8f
worker-1   Ready    worker   6h7m    v1.24.0+3882f8f
master-2   Ready    master   6h20m   v1.24.0+3882f8f
master-3   Ready    master   6h4m    v1.24.0+3882f8f
worker-4   Ready    worker   6h7m    v1.24.0+3882f8f
master-5   Ready    master   99m     v1.24.0+3882f8f

确认 ClusterOperators 健康状况：

$ oc get ClusterOperators

输出示例

NAME                                      VERSION AVAILABLE PROGRESSING DEGRADED SINCE MSG
authentication                            4.11.5  True      False       False    5h57m
baremetal                                 4.11.5  True      False       False    6h19m
cloud-controller-manager                  4.11.5  True      False       False    6h20m
cloud-credential                          4.11.5  True      False       False    6h23m
cluster-autoscaler                        4.11.5  True      False       False    6h18m
config-operator                           4.11.5  True      False       False    6h19m
console                                   4.11.5  True      False       False    6h4m
csi-snapshot-controller                   4.11.5  True      False       False    6h19m
dns                                       4.11.5  True      False       False    6h18m
etcd                                      4.11.5  True      False       False    6h17m
image-registry                            4.11.5  True      False       False    6h7m
ingress                                   4.11.5  True      False       False    6h6m
insights                                  4.11.5  True      False       False    6h12m
kube-apiserver                            4.11.5  True      False       False    6h16m
kube-controller-manager                   4.11.5  True      False       False    6h16m
kube-scheduler                            4.11.5  True      False       False    6h16m
kube-storage-version-migrator             4.11.5  True      False       False    6h19m
machine-api                               4.11.5  True      False       False    6h15m
machine-approver                          4.11.5  True      False       False    6h19m
machine-config                            4.11.5  True      False       False    6h18m
marketplace                               4.11.5  True      False       False    6h18m
monitoring                                4.11.5  True      False       False    6h4m
network                                   4.11.5  True      False       False    6h20m
node-tuning                               4.11.5  True      False       False    6h18m
openshift-apiserver                       4.11.5  True      False       False    6h8m
openshift-controller-manager              4.11.5  True      False       False    6h7m
openshift-samples                         4.11.5  True      False       False    6h12m
operator-lifecycle-manager                4.11.5  True      False       False    6h18m
operator-lifecycle-manager-catalog        4.11.5  True      False       False    6h19m
operator-lifecycle-manager-pkgsvr         4.11.5  True      False       False    6h12m
service-ca                                4.11.5  True      False       False    6h19m
storage                                   4.11.5  True      False       False    6h19m

确认 ClusterVersion ：

$ oc get ClusterVersion

输出示例

NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.5    True        False         5h57m   Cluster version is 4.11.5

删除旧的 control plane 节点：

删除 BareMetalHost CR：

$ oc delete bmh -n openshift-machine-api custom-master3

确认 Machine 不健康：

$ oc get machine -A

输出示例

NAMESPACE              NAME                              PHASE    AGE
openshift-machine-api  custom-master3                    Running  14h
openshift-machine-api  test-day2-1-6qv96-master-0        Failed   20h
openshift-machine-api  test-day2-1-6qv96-master-1        Running  20h
openshift-machine-api  test-day2-1-6qv96-master-2        Running  20h
openshift-machine-api  test-day2-1-6qv96-worker-0-8w7vr  Running  19h
openshift-machine-api  test-day2-1-6qv96-worker-0-rxddj  Running  19h

删除 Machine CR：

$ oc delete machine -n openshift-machine-api   test-day2-1-6qv96-master-0
machine.machine.openshift.io "test-day2-1-6qv96-master-0" deleted

确认删除 Node CR：

$ oc get nodes

输出示例

NAME       STATUS   ROLES    AGE   VERSION
worker-1   Ready    worker   19h   v1.24.0+3882f8f
master-2   Ready    master   20h   v1.24.0+3882f8f
master-3   Ready    master   19h   v1.24.0+3882f8f
worker-4   Ready    worker   19h   v1.24.0+3882f8f
master-5   Ready    master   15h   v1.24.0+3882f8f

检查 etcd-operator 日志以确认 etcd 集群的状态：

$ oc logs -n openshift-etcd-operator etcd-operator-8668df65d-lvpjf

输出示例

E0927 07:53:10.597523       1 base_controller.go:272] ClusterMemberRemovalController reconciliation failed: cannot remove member: 192.168.111.23 because it is reported as healthy but it doesn't have a machine nor a node resource

删除物理机器，以允许 etcd-operator 协调集群成员：

$ oc rsh -n openshift-etcd etcd-worker-2
etcdctl member list -w table; etcdctl endpoint health

输出示例

+--------+---------+--------+--------------+--------------+---------+
|   ID   |  STATUS |  NAME  |  PEER ADDRS  | CLIENT ADDRS | LEARNER |
+--------+---------+--------+--------------+--------------+---------+
|2c18942f| started |worker-3|192.168.111.26|192.168.111.26|  false  |
|61e2a860| started |worker-2|192.168.111.25|192.168.111.25|  false  |
|ead4f280| started |worker-5|192.168.111.28|192.168.111.28|  false  |
+--------+---------+--------+--------------+--------------+---------+
192.168.111.26 is healthy: committed proposal: took = 10.458132ms
192.168.111.25 is healthy: committed proposal: took = 11.047349ms
192.168.111.28 is healthy: committed proposal: took = 11.414402ms

其他资源

在不健康集群中安装主 control plane 节点

12.5. 在一个健康的集群中安装主 control plane 节点

学习

尝试、购买和销售

社区

关于红帽文档

让开源更具包容性

關於紅帽

Red Hat legal and privacy links

Red Hat legal and privacy links