
12.5. Installing a primary control plane node on a healthy cluster


This procedure describes how to install a primary control plane node on a healthy OpenShift Container Platform cluster.

If the cluster is unhealthy, additional operations are required before you can manage the control plane nodes. See Additional resources for more information.

Prerequisites

  • You are using OpenShift Container Platform 4.11 or later with the correct etcd-operator version.
  • You have installed a healthy cluster with a minimum of three nodes.
  • You have created a single control plane node.

Procedure

  1. Retrieve the pending CertificateSigningRequests (CSRs):

    $ oc get csr | grep Pending

    Example output

    csr-5sd59   8m19s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
    csr-xzqts   10s     kubernetes.io/kubelet-serving                 system:node:worker-6                                                   <none>              Pending
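
    CSRs for a newly added node can take a few minutes to appear. If nothing is listed as Pending yet, you can poll until the requests arrive; this polling loop is a generic sketch, not part of the original procedure:

    $ watch -n 5 "oc get csr | grep Pending"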

  2. Approve the pending CSRs:

    $ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve
    Important

    You must approve the CSRs to complete the installation.
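
    The go-template command above approves every pending CSR at once. If you prefer to inspect each request before approving it, you can review and approve CSRs individually by name; csr-5sd59 below is taken from the example output in step 1:

    $ oc describe csr csr-5sd59
    $ oc adm certificate approve csr-5sd59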

  3. Confirm that the primary node is in the Ready status:

    $ oc get nodes

    Example output

    NAME       STATUS   ROLES    AGE     VERSION
    master-0   Ready    master   4h42m   v1.24.0+3882f8f
    worker-1   Ready    worker   4h29m   v1.24.0+3882f8f
    master-2   Ready    master   4h43m   v1.24.0+3882f8f
    master-3   Ready    master   4h27m   v1.24.0+3882f8f
    worker-4   Ready    worker   4h30m   v1.24.0+3882f8f
    master-5   Ready    master   105s    v1.24.0+3882f8f

    Note

    When the cluster is running with a functional Machine API, the etcd-operator requires a Machine Custom Resource (CR) that references the new node.
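
    As an optional sanity check, you can confirm whether the new node is already linked to a Machine CR. This sketch assumes the machine.openshift.io/machine annotation that the Machine API sets on linked nodes; an empty result means the Machine CR created in the next step is still missing. The node name master-5 comes from the example output above:

    $ oc get node master-5 -o jsonpath='{.metadata.annotations.machine\.openshift\.io/machine}'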

  4. Link the Machine CR with the BareMetalHost CR and the Node:

    1. Create the BareMetalHost CR with a unique .metadata.name value:

      apiVersion: metal3.io/v1alpha1
      kind: BareMetalHost
      metadata:
        name: custom-master3
        namespace: openshift-machine-api
        annotations:
      spec:
        automatedCleaningMode: metadata
        bootMACAddress: 00:00:00:00:00:02
        bootMode: UEFI
        customDeploy:
          method: install_coreos
        externallyProvisioned: true
        online: true
        userData:
          name: master-user-data-managed
          namespace: openshift-machine-api
      $ oc create -f <filename>
    2. Apply the BareMetalHost CR:

      $ oc apply -f <filename>
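
      Optionally, confirm that the BareMetalHost CR was created; the bmh short name used here is the same one used to delete the host later in this procedure:

      $ oc get bmh -n openshift-machine-api custom-master3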
    3. Create the Machine CR with a unique .machine.name value:

      apiVersion: machine.openshift.io/v1beta1
      kind: Machine
      metadata:
        annotations:
          machine.openshift.io/instance-state: externally provisioned
          metal3.io/BareMetalHost: openshift-machine-api/custom-master3
        finalizers:
        - machine.machine.openshift.io
        generation: 3
        labels:
          machine.openshift.io/cluster-api-cluster: test-day2-1-6qv96
          machine.openshift.io/cluster-api-machine-role: master
          machine.openshift.io/cluster-api-machine-type: master
        name: custom-master3
        namespace: openshift-machine-api
      spec:
        metadata: {}
        providerSpec:
          value:
            apiVersion: baremetal.cluster.k8s.io/v1alpha1
            customDeploy:
              method: install_coreos
            hostSelector: {}
            image:
              checksum: ""
              url: ""
            kind: BareMetalMachineProviderSpec
            metadata:
              creationTimestamp: null
            userData:
              name: master-user-data-managed
      $ oc create -f <filename>
    4. Apply the Machine CR:

      $ oc apply -f <filename>
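
      Optionally, confirm that the Machine CR exists. Because the host is externally provisioned, the machine is expected to reach the Running phase once it is linked to the node in the next sub-step:

      $ oc get machine -n openshift-machine-api custom-master3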
    5. Link the BareMetalHost, Machine, and Node using the link-machine-and-node.sh script:

      #!/bin/bash
      
      # Credit goes to https://bugzilla.redhat.com/show_bug.cgi?id=1801238.
      # This script will link Machine object and Node object. This is needed
      # in order to have IP address of the Node present in the status of the Machine.
      
      # set -x
      set -e
      
      machine="$1"
      node="$2"
      
      if [ -z "$machine" ] || [ -z "$node" ]; then
          echo "Usage: $0 MACHINE NODE"
          exit 1
      fi
      
      # uid=$(echo "${node}" | cut -f1 -d':')
      node_name=$(echo "${node}" | cut -f2 -d':')
      
      oc proxy &
      proxy_pid=$!
      function kill_proxy {
          kill $proxy_pid
      }
      trap kill_proxy EXIT SIGINT
      
      HOST_PROXY_API_PATH="http://localhost:8001/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts"
      
      function print_nics() {
          local ips
          local eob
          declare -a ips
      
          readarray -t ips < <(echo "${1}" \
                               | jq '.[] | select(. | .type == "InternalIP") | .address' \
                               | sed 's/"//g')
      
          eob=','
          for (( i=0; i<${#ips[@]}; i++ )); do
              if [ $((i+1)) -eq ${#ips[@]} ]; then
                  eob=""
              fi
              cat <<- EOF
                {
                  "ip": "${ips[$i]}",
                  "mac": "00:00:00:00:00:00",
                  "model": "unknown",
                  "speedGbps": 10,
                  "vlanId": 0,
                  "pxe": true,
                  "name": "eth1"
                }${eob}
      EOF
          done
      }
      
      function wait_for_json() {
          local name
          local url
          # curl_opts must be an array so each option is passed to curl separately
          local -a curl_opts
          local timeout
      
          local start_time
          local curr_time
          local time_diff
      
          name="$1"
          url="$2"
          timeout="$3"
          shift 3
          curl_opts=("$@")
          echo -n "Waiting for $name to respond"
          start_time=$(date +%s)
          until curl -g -X GET "$url" "${curl_opts[@]}" 2> /dev/null | jq '.' 2> /dev/null > /dev/null; do
              echo -n "."
              curr_time=$(date +%s)
              time_diff=$((curr_time - start_time))
              if [[ $time_diff -gt $timeout ]]; then
                  printf '\nTimed out waiting for %s' "${name}"
                  return 1
              fi
              sleep 5
          done
          echo " Success!"
          return 0
      }
      wait_for_json oc_proxy "${HOST_PROXY_API_PATH}" 10 -H "Accept: application/json" -H "Content-Type: application/json"
      
      addresses=$(oc get node -n openshift-machine-api "${node_name}" -o json | jq -c '.status.addresses')
      
      machine_data=$(oc get machines.machine.openshift.io -n openshift-machine-api -o json "${machine}")
      host=$(echo "$machine_data" | jq '.metadata.annotations["metal3.io/BareMetalHost"]' | cut -f2 -d/ | sed 's/"//g')
      
      if [ -z "$host" ]; then
          echo "Machine $machine is not linked to a host yet." 1>&2
          exit 1
      fi
      
      # The address structure on the host doesn't match the node, so extract
      # the values we want into separate variables so we can build the patch
      # we need.
      hostname=$(echo "${addresses}" | jq '.[] | select(. | .type == "Hostname") | .address' | sed 's/"//g')
      
      set +e
      read -r -d '' host_patch << EOF
      {
        "status": {
          "hardware": {
            "hostname": "${hostname}",
            "nics": [
      $(print_nics "${addresses}")
            ],
            "systemVendor": {
              "manufacturer": "Red Hat",
              "productName": "product name",
              "serialNumber": ""
            },
            "firmware": {
              "bios": {
                "date": "04/01/2014",
                "vendor": "SeaBIOS",
                "version": "1.11.0-2.el7"
              }
            },
            "ramMebibytes": 0,
            "storage": [],
            "cpu": {
              "arch": "x86_64",
              "model": "Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz",
              "clockMegahertz": 2199.998,
              "count": 4,
              "flags": []
            }
          }
        }
      }
      EOF
      set -e
      
      echo "PATCHING HOST"
      echo "${host_patch}" | jq .
      
      curl -s \
           -X PATCH \
           "${HOST_PROXY_API_PATH}/${host}/status" \
           -H "Content-type: application/merge-patch+json" \
           -d "${host_patch}"
      
      oc get baremetalhost -n openshift-machine-api -o yaml "${host}"
      $ bash link-machine-and-node.sh custom-master3 worker-5
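
      The script patches the BareMetalHost status so that the node's IP address appears in the status of the Machine, as noted in the script header. As a quick check, confirm that the Machine now reports addresses; the exact list depends on your environment:

      $ oc get machine -n openshift-machine-api custom-master3 -o jsonpath='{.status.addresses}'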
  5. Confirm the etcd members:

    $ oc rsh -n openshift-etcd etcd-worker-2
    etcdctl member list -w table

    Example output

    +--------+---------+--------+--------------+--------------+---------+
    |   ID   |  STATUS |  NAME  |  PEER ADDRS  | CLIENT ADDRS | LEARNER |
    +--------+---------+--------+--------------+--------------+---------+
    |2c18942f| started |worker-3|192.168.111.26|192.168.111.26|  false  |
    |61e2a860| started |worker-2|192.168.111.25|192.168.111.25|  false  |
    |ead4f280| started |worker-5|192.168.111.28|192.168.111.28|  false  |
    +--------+---------+--------+--------------+--------------+---------+
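
    If you prefer a non-interactive check, the same query can be run with oc exec. This sketch assumes the etcd pod exposes an etcdctl container, which is typical on OpenShift Container Platform 4:

    $ oc exec -n openshift-etcd etcd-worker-2 -c etcdctl -- etcdctl member list -w table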

  6. Confirm that the etcd-operator configuration applies to all nodes:

    $ oc get clusteroperator etcd

    Example output

    NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
    etcd   4.11.5    True        False         False      5h54m

  7. Confirm the health of the etcd-operator:

    $ oc rsh -n openshift-etcd etcd-worker-0
    etcdctl endpoint health

    Example output

    192.168.111.26 is healthy: committed proposal: took = 11.297561ms
    192.168.111.25 is healthy: committed proposal: took = 13.892416ms
    192.168.111.28 is healthy: committed proposal: took = 11.870755ms

  8. Confirm the health of the nodes:

    $ oc get nodes

    Example output

    NAME       STATUS   ROLES    AGE     VERSION
    master-0   Ready    master   6h20m   v1.24.0+3882f8f
    worker-1   Ready    worker   6h7m    v1.24.0+3882f8f
    master-2   Ready    master   6h20m   v1.24.0+3882f8f
    master-3   Ready    master   6h4m    v1.24.0+3882f8f
    worker-4   Ready    worker   6h7m    v1.24.0+3882f8f
    master-5   Ready    master   99m     v1.24.0+3882f8f

  9. Confirm the health of the ClusterOperators:

    $ oc get ClusterOperators

    Example output

    NAME                                      VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
    authentication                            4.11.5  True      False       False    5h57m
    baremetal                                 4.11.5  True      False       False    6h19m
    cloud-controller-manager                  4.11.5  True      False       False    6h20m
    cloud-credential                          4.11.5  True      False       False    6h23m
    cluster-autoscaler                        4.11.5  True      False       False    6h18m
    config-operator                           4.11.5  True      False       False    6h19m
    console                                   4.11.5  True      False       False    6h4m
    csi-snapshot-controller                   4.11.5  True      False       False    6h19m
    dns                                       4.11.5  True      False       False    6h18m
    etcd                                      4.11.5  True      False       False    6h17m
    image-registry                            4.11.5  True      False       False    6h7m
    ingress                                   4.11.5  True      False       False    6h6m
    insights                                  4.11.5  True      False       False    6h12m
    kube-apiserver                            4.11.5  True      False       False    6h16m
    kube-controller-manager                   4.11.5  True      False       False    6h16m
    kube-scheduler                            4.11.5  True      False       False    6h16m
    kube-storage-version-migrator             4.11.5  True      False       False    6h19m
    machine-api                               4.11.5  True      False       False    6h15m
    machine-approver                          4.11.5  True      False       False    6h19m
    machine-config                            4.11.5  True      False       False    6h18m
    marketplace                               4.11.5  True      False       False    6h18m
    monitoring                                4.11.5  True      False       False    6h4m
    network                                   4.11.5  True      False       False    6h20m
    node-tuning                               4.11.5  True      False       False    6h18m
    openshift-apiserver                       4.11.5  True      False       False    6h8m
    openshift-controller-manager              4.11.5  True      False       False    6h7m
    openshift-samples                         4.11.5  True      False       False    6h12m
    operator-lifecycle-manager                4.11.5  True      False       False    6h18m
    operator-lifecycle-manager-catalog        4.11.5  True      False       False    6h19m
    operator-lifecycle-manager-pkgsvr         4.11.5  True      False       False    6h12m
    service-ca                                4.11.5  True      False       False    6h19m
    storage                                   4.11.5  True      False       False    6h19m

  10. Confirm the ClusterVersion:

    $ oc get ClusterVersion

    Example output

    NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.11.5    True        False         5h57m   Cluster version is 4.11.5

  11. Remove the old control plane node:

    1. Delete the BareMetalHost CR:

      $ oc delete bmh -n openshift-machine-api custom-master3
    2. Confirm that the Machine is unhealthy:

      $ oc get machine -A

      Example output

      NAMESPACE              NAME                              PHASE    AGE
      openshift-machine-api  custom-master3                    Running  14h
      openshift-machine-api  test-day2-1-6qv96-master-0        Failed   20h
      openshift-machine-api  test-day2-1-6qv96-master-1        Running  20h
      openshift-machine-api  test-day2-1-6qv96-master-2        Running  20h
      openshift-machine-api  test-day2-1-6qv96-worker-0-8w7vr  Running  19h
      openshift-machine-api  test-day2-1-6qv96-worker-0-rxddj  Running  19h

    3. Delete the Machine CR:

      $ oc delete machine -n openshift-machine-api test-day2-1-6qv96-master-0
      machine.machine.openshift.io "test-day2-1-6qv96-master-0" deleted
    4. Verify that the Node CR was deleted:

      $ oc get nodes

      Example output

      NAME       STATUS   ROLES    AGE   VERSION
      worker-1   Ready    worker   19h   v1.24.0+3882f8f
      master-2   Ready    master   20h   v1.24.0+3882f8f
      master-3   Ready    master   19h   v1.24.0+3882f8f
      worker-4   Ready    worker   19h   v1.24.0+3882f8f
      master-5   Ready    master   15h   v1.24.0+3882f8f

  12. Check the etcd-operator logs to confirm the status of the etcd cluster:

    $ oc logs -n openshift-etcd-operator etcd-operator-8668df65d-lvpjf

    Example output

    E0927 07:53:10.597523       1 base_controller.go:272] ClusterMemberRemovalController reconciliation failed: cannot remove member: 192.168.111.23 because it is reported as healthy but it doesn't have a machine nor a node resource
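
    The pod name etcd-operator-8668df65d-lvpjf is specific to this example cluster. To avoid looking up the current pod name, you can address the logs through the deployment and follow them while the operator reconciles:

    $ oc logs -n openshift-etcd-operator deployment/etcd-operator -f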

  13. Delete the physical machine to allow the etcd-operator to reconcile the cluster members:

    $ oc rsh -n openshift-etcd etcd-worker-2
    etcdctl member list -w table; etcdctl endpoint health

    Example output

    +--------+---------+--------+--------------+--------------+---------+
    |   ID   |  STATUS |  NAME  |  PEER ADDRS  | CLIENT ADDRS | LEARNER |
    +--------+---------+--------+--------------+--------------+---------+
    |2c18942f| started |worker-3|192.168.111.26|192.168.111.26|  false  |
    |61e2a860| started |worker-2|192.168.111.25|192.168.111.25|  false  |
    |ead4f280| started |worker-5|192.168.111.28|192.168.111.28|  false  |
    +--------+---------+--------+--------------+--------------+---------+
    192.168.111.26 is healthy: committed proposal: took = 10.458132ms
    192.168.111.25 is healthy: committed proposal: took = 11.047349ms
    192.168.111.28 is healthy: committed proposal: took = 11.414402ms
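
    After the physical machine is removed, the etcd-operator removes the stale member automatically. As a final check, you can confirm that the etcd ClusterOperator remains available and not degraded, using the same command as in step 6:

    $ oc get clusteroperator etcd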
