Chapter 12. Expanding the cluster
You can expand a cluster installed with the Assisted Installer by adding hosts using the user interface or the API.
12.1. Checking for multi-architecture support
You must check that your cluster can support multiple architectures before you add a node with a different architecture.
Procedure
- Log in to the cluster using the CLI.
- Check that your cluster uses the multi-architecture payload by running the following command:
$ oc adm release info -o json | jq .metadata.metadata
Verification
If you see the following output, your cluster supports multiple architectures:
{ "release.openshift.io/architecture": "multi" }
12.2. Installing multi-architecture compute clusters
A cluster with an x86_64 or arm64 control plane can support worker nodes that have two different CPU architectures. Multi-architecture clusters combine the strengths of each architecture and support a variety of workloads.
For example, you can add arm64, IBM Power® (ppc64le), or IBM Z® (s390x) worker nodes to an existing OpenShift Container Platform cluster with an x86_64 control plane.
The main steps of the installation are as follows:
- Create and register a multi-architecture compute cluster.
- Create an x86_64 or arm64 infrastructure environment, download the ISO discovery image for the environment, and add the control plane. An arm64 infrastructure environment is available for Amazon Web Services (AWS) and Google Cloud (GC) only.
- Create an arm64, ppc64le, or s390x infrastructure environment, download the ISO discovery images for arm64, ppc64le, or s390x, and add the worker nodes.
Supported platforms
For the supported platforms for each OpenShift Container Platform version, see About clusters with multi-architecture compute machines. Use the appropriate platforms for the version you are installing.
Main steps
- Start the procedure for installing OpenShift Container Platform using the API. For details, see Installing with the Assisted Installer API in the Additional Resources section.
When you reach the "Registering a new cluster" step of the installation, register the cluster as a multi-architecture compute cluster:
$ curl -s -X POST https://api.openshift.com/api/assisted-install/v2/clusters \
  -H "Authorization: Bearer ${API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d "$(jq --null-input \
    --slurpfile pull_secret ~/Downloads/pull-secret.txt '
    {
      "name": "testcluster",
      "openshift_version": "<version-number>-multi",
      "cpu_architecture" : "multi",
      "control_plane_count": "<number>",
      "base_dns_domain": "example.com",
      "pull_secret": $pull_secret[0] | tojson
    }
  ')" | jq '.id'
Note
- Use the -multi suffix for the OpenShift Container Platform version number; for example, "4.18-multi".
- Set the CPU architecture to "multi".
- Set the number of control plane nodes to "3", "4", or "5". The option of 4 or 5 control plane nodes is available from OpenShift Container Platform 4.18 and later. Single-node OpenShift is not supported for a multi-architecture compute cluster. The control_plane_count field replaces high_availability_mode, which is deprecated.
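The infrastructure environment and host steps that follow reference the cluster ID that this request returns. One way to keep it available, assuming you copied the ID that jq printed:
$ export CLUSTER_ID=<cluster_id>
Replace <cluster_id> with the ID returned when you registered the cluster.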
When you reach the "Registering a new infrastructure environment" step of the installation, set
cpu_architecture
tox86_64
:Copy to Clipboard Copied! Toggle word wrap Toggle overflow curl https://api.openshift.com/api/assisted-install/v2/infra-envs \ -H "Authorization: Bearer ${API_TOKEN}" \ -H "Content-Type: application/json" \ -d "$(jq --null-input \ --slurpfile pull_secret ~/Downloads/pull-secret.txt \ --arg cluster_id ${CLUSTER_ID} '
$ curl https://api.openshift.com/api/assisted-install/v2/infra-envs \ -H "Authorization: Bearer ${API_TOKEN}" \ -H "Content-Type: application/json" \ -d "$(jq --null-input \ --slurpfile pull_secret ~/Downloads/pull-secret.txt \ --arg cluster_id ${CLUSTER_ID} ' { "name": "testcluster-infra-env", "image_type":"full-iso", "cluster_id": $cluster_id, "cpu_architecture" : "x86_64" "pull_secret": $pull_secret[0] | tojson } ')" | jq '.id'
When you reach the "Adding hosts" step of the installation, set
host_role
tomaster
:NoteFor more information, see Assigning Roles to Hosts in Additional Resources.
Copy to Clipboard Copied! Toggle word wrap Toggle overflow curl https://api.openshift.com/api/assisted-install/v2/infra-envs/${INFRA_ENV_ID}/hosts/<host_id> \ -X PATCH \ -H "Authorization: Bearer ${API_TOKEN}" \ -H "Content-Type: application/json" \ -d '
$ curl https://api.openshift.com/api/assisted-install/v2/infra-envs/${INFRA_ENV_ID}/hosts/<host_id> \ -X PATCH \ -H "Authorization: Bearer ${API_TOKEN}" \ -H "Content-Type: application/json" \ -d ' { "host_role":"master" } ' | jq
- Download the discovery image for the x86_64 architecture.
- Boot the x86_64 architecture hosts using the generated discovery image.
- Start the installation and wait for the cluster to be fully installed. If you are driving the installation through the API, see the sketch after this list.
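A sketch of starting the installation through the Assisted Installer API, assuming $CLUSTER_ID and $API_TOKEN are already exported from the earlier steps:
$ curl -X POST "https://api.openshift.com/api/assisted-install/v2/clusters/$CLUSTER_ID/actions/install" \
  -H "Authorization: Bearer ${API_TOKEN}"
You can then monitor progress by querying the same cluster endpoint that you used when registering the cluster.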
Repeat the "Registering a new infrastructure environment" step of the installation. This time, set
cpu_architecture
to one of the following:ppc64le
(for IBM Power®),s390x
(for IBM Z®), orarm64
. For example:Copy to Clipboard Copied! Toggle word wrap Toggle overflow curl -s -X POST https://api.openshift.com/api/assisted-install/v2/clusters \ -H "Authorization: Bearer ${API_TOKEN}" \ -H "Content-Type: application/json" \ -d "$(jq --null-input \ --slurpfile pull_secret ~/Downloads/pull-secret.txt '
$ curl -s -X POST https://api.openshift.com/api/assisted-install/v2/clusters \ -H "Authorization: Bearer ${API_TOKEN}" \ -H "Content-Type: application/json" \ -d "$(jq --null-input \ --slurpfile pull_secret ~/Downloads/pull-secret.txt ' { "name": "testcluster", "openshift_version": "4.12", "cpu_architecture" : "arm64" "control_plane_count": "3" "base_dns_domain": "example.com", "pull_secret": $pull_secret[0] | tojson } ')" | jq '.id'
Repeat the "Adding hosts" step of the installation. This time, set
host_role
toworker
:NoteFor more details, see Assigning Roles to Hosts in Additional Resources.
Copy to Clipboard Copied! Toggle word wrap Toggle overflow curl https://api.openshift.com/api/assisted-install/v2/infra-envs/${INFRA_ENV_ID}/hosts/<host_id> \ -X PATCH \ -H "Authorization: Bearer ${API_TOKEN}" \ -H "Content-Type: application/json" \ -d '
$ curl https://api.openshift.com/api/assisted-install/v2/infra-envs/${INFRA_ENV_ID}/hosts/<host_id> \ -X PATCH \ -H "Authorization: Bearer ${API_TOKEN}" \ -H "Content-Type: application/json" \ -d ' { "host_role":"worker" } ' | jq
- Download the discovery image for the arm64, ppc64le, or s390x architecture.
- Boot the arm64, ppc64le, or s390x hosts using the generated discovery image.
- Start the installation and wait for the cluster to be fully installed.
Verification
View the arm64, ppc64le, or s390x worker nodes in the cluster by running the following command:
$ oc get nodes -o wide
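To confirm the architecture of specific workers, you can also filter on the standard kubernetes.io/arch node label. For example, assuming you added arm64 workers:
$ oc get nodes -l kubernetes.io/arch=arm64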
12.3. Adding hosts with the web console
You can add hosts to clusters that were created using the Assisted Installer.
- Adding hosts to Assisted Installer clusters is only supported for clusters running OpenShift Container Platform version 4.11 and later.
- When adding a control plane node during Day 2 operations, ensure that the new node shares the same subnet as the Day 1 network. The subnet is specified in the machineNetwork field of the install-config.yaml file. This requirement applies to cluster-managed networks such as bare metal or vSphere, and not to user-managed networks. One way to check the configured subnet is shown after this list.
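A minimal way to review the Day 1 machineNetwork value on a running cluster, assuming the installation stored the install-config in the cluster-config-v1 config map:
$ oc get cm cluster-config-v1 -n kube-system -o yaml | grep -A 3 machineNetwork
Compare the CIDR listed there with the subnet of the host that you plan to add.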
Procedure
- Log in to OpenShift Cluster Manager and click the cluster that you want to expand.
- Click Add hosts and download the discovery ISO for the new host, adding an SSH public key and configuring cluster-wide proxy settings as needed.
- Optional: Modify ignition files as needed.
- Boot the target host using the discovery ISO, and wait for the host to be discovered in the console.
- Select the host role. It can be either a worker or a control plane host.
- Start the installation.
As the installation proceeds, the installation generates pending certificate signing requests (CSRs) for the host. When prompted, approve the pending CSRs to complete the installation.
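For example, after logging in to the cluster with the oc CLI, you can review and approve the pending requests. This is a minimal sketch; the CSR name is a placeholder:
$ oc get csr | grep Pending
$ oc adm certificate approve <csr_name>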
When the host is successfully installed, it is listed as a host in the cluster web console.
New hosts will be encrypted using the same method as the original cluster.
12.4. Adding hosts with the API
You can add hosts to clusters using the Assisted Installer REST API.
Prerequisites
- Install the Red Hat OpenShift Cluster Manager CLI (ocm).
- Log in to Red Hat OpenShift Cluster Manager as a user with cluster creation privileges.
- Install jq.
- Ensure that all the required DNS records exist for the cluster that you want to expand.
- When adding a control plane node during Day 2 operations, ensure that the new node shares the same subnet as the Day 1 network. The subnet is specified in the machineNetwork field of the install-config.yaml file. This requirement applies to cluster-managed networks such as bare metal or vSphere, and not to user-managed networks.
Procedure
- Authenticate against the Assisted Installer REST API and generate an API token for your session. The generated token is valid for 15 minutes only.
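One way to obtain and export a token, assuming you have already logged in with the ocm CLI, is:
$ export API_TOKEN=$(ocm token)
Because the token expires, re-run this command whenever a request returns an authentication error.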
Set the $API_URL variable by running the following command:
$ export API_URL=<api_url>
Replace <api_url> with the Assisted Installer API URL, for example, https://api.openshift.com
Import the cluster by running the following commands:
Set the $CLUSTER_ID variable. Log in to the cluster and run the following command:
$ export CLUSTER_ID=$(oc get clusterversion -o jsonpath='{.items[].spec.clusterID}')
Display the $CLUSTER_ID variable output:
$ echo ${CLUSTER_ID}
Set the $CLUSTER_REQUEST variable that is used to import the cluster:
$ export CLUSTER_REQUEST=$(jq --null-input --arg openshift_cluster_id "$CLUSTER_ID" '{
    "api_vip_dnsname": "<api_vip>",
    "openshift_cluster_id": "<cluster_id>",
    "name": "<openshift_cluster_name>"
  }')
- Replace <api_vip> with the hostname for the cluster's API server. This can be the DNS domain for the API server or the IP address of the single node which the host can reach. For example, api.compute-1.example.com.
- Replace <cluster_id> with the $CLUSTER_ID output from the previous substep.
- Replace <openshift_cluster_name> with the plain text name for the cluster. The cluster name should match the cluster name that was set during the Day 1 cluster installation.
Import the cluster and set the $CLUSTER_ID variable. Run the following command:
$ CLUSTER_ID=$(curl "$API_URL/api/assisted-install/v2/clusters/import" -H "Authorization: Bearer ${API_TOKEN}" -H 'accept: application/json' -H 'Content-Type: application/json' \ -d "$CLUSTER_REQUEST" | tee /dev/stderr | jq -r '.id')
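To confirm that the import succeeded, you can read the cluster back from the API. A quick check, assuming the variables above are set:
$ curl -s "$API_URL/api/assisted-install/v2/clusters/$CLUSTER_ID" -H "Authorization: Bearer ${API_TOKEN}" | jq -r '.name, .status'
An imported Day 2 cluster typically reports the adding-hosts status.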
Generate the InfraEnv resource for the cluster and set the $INFRA_ENV_ID variable by running the following commands:
- Download the pull secret file from Red Hat OpenShift Cluster Manager at console.redhat.com.
Set the $INFRA_ENV_REQUEST variable:
$ export INFRA_ENV_REQUEST=$(jq --null-input \
    --slurpfile pull_secret <path_to_pull_secret_file> \
    --arg ssh_pub_key "$(cat <path_to_ssh_pub_key>)" \
    --arg cluster_id "$CLUSTER_ID" '{
      "name": "<infraenv_name>",
      "pull_secret": $pull_secret[0] | tojson,
      "cluster_id": $cluster_id,
      "ssh_authorized_key": $ssh_pub_key,
      "image_type": "<iso_image_type>"
    }')
- Replace <path_to_pull_secret_file> with the path to the local file containing the downloaded pull secret from Red Hat OpenShift Cluster Manager at console.redhat.com.
- Replace <path_to_ssh_pub_key> with the path to the public SSH key required to access the host. If you do not set this value, you cannot access the host while in discovery mode.
- Replace <infraenv_name> with the plain text name for the InfraEnv resource.
- Replace <iso_image_type> with the ISO image type, either full-iso or minimal-iso.
Post the $INFRA_ENV_REQUEST to the /v2/infra-envs API and set the $INFRA_ENV_ID variable:
$ INFRA_ENV_ID=$(curl "$API_URL/api/assisted-install/v2/infra-envs" -H "Authorization: Bearer ${API_TOKEN}" -H 'accept: application/json' -H 'Content-Type: application/json' -d "$INFRA_ENV_REQUEST" | tee /dev/stderr | jq -r '.id')
Get the URL of the discovery ISO for the cluster host by running the following command:
$ curl -s "$API_URL/api/assisted-install/v2/infra-envs/$INFRA_ENV_ID" -H "Authorization: Bearer ${API_TOKEN}" | jq -r '.download_url'
Example output
https://api.openshift.com/api/assisted-images/images/41b91e72-c33e-42ee-b80f-b5c5bbf6431a?arch=x86_64&image_token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJleHAiOjE2NTYwMjYzNzEsInN1YiI6IjQxYjkxZTcyLWMzM2UtNDJlZS1iODBmLWI1YzViYmY2NDMxYSJ9.1EX_VGaMNejMhrAvVRBS7PDPIQtbOOc8LtG8OukE1a4&type=minimal-iso&version=4.12
Download the ISO:
$ curl -L -s '<iso_url>' --output rhcos-live-minimal.iso
Replace <iso_url> with the URL for the ISO from the previous step.
- Boot the new worker host from the downloaded rhcos-live-minimal.iso.
Get the list of hosts in the cluster that are not installed. Keep running the following command until the new host shows up:
$ curl -s "$API_URL/api/assisted-install/v2/clusters/$CLUSTER_ID" -H "Authorization: Bearer ${API_TOKEN}" | jq -r '.hosts[] | select(.status != "installed").id'
Example output
2294ba03-c264-4f11-ac08-2f1bb2f8c296
Set the $HOST_ID variable for the new host, for example:
$ HOST_ID=<host_id>
Replace <host_id> with the host ID from the previous step.
Check that the host is ready to install by running the following command:
Note: Ensure that you copy the entire command, including the complete jq expression.
$ curl -s $API_URL/api/assisted-install/v2/clusters/$CLUSTER_ID -H "Authorization: Bearer ${API_TOKEN}" | jq '
def host_name($host):
  if (.suggested_hostname // "") == "" then
    if (.inventory // "") == "" then
      "Unknown hostname, please wait"
    else
      .inventory | fromjson | .hostname
    end
  else
    .suggested_hostname
  end;

def is_notable($validation):
  ["failure", "pending", "error"] | any(. == $validation.status);

def notable_validations($validations_info):
  [ $validations_info // "{}" | fromjson | to_entries[].value[] | select(is_notable(.)) ];

{
  "Hosts validations": {
    "Hosts": [
      .hosts[] | select(.status != "installed") | {
        "id": .id,
        "name": host_name(.),
        "status": .status,
        "notable_validations": notable_validations(.validations_info)
      }
    ]
  },
  "Cluster validations info": {
    "notable_validations": notable_validations(.validations_info)
  }
}
' -r
Example output
{ "Hosts validations": { "Hosts": [ { "id": "97ec378c-3568-460c-bc22-df54534ff08f", "name": "localhost.localdomain", "status": "insufficient", "notable_validations": [ { "id": "ntp-synced", "status": "failure", "message": "Host couldn't synchronize with any NTP server" }, { "id": "api-domain-name-resolved-correctly", "status": "error", "message": "Parse error for domain name resolutions result" }, { "id": "api-int-domain-name-resolved-correctly", "status": "error", "message": "Parse error for domain name resolutions result" }, { "id": "apps-domain-name-resolved-correctly", "status": "error", "message": "Parse error for domain name resolutions result" } ] } ] }, "Cluster validations info": { "notable_validations": [] } }
When the previous command shows that the host is ready, start the installation using the /v2/infra-envs/{infra_env_id}/hosts/{host_id}/actions/install API by running the following command:
$ curl -X POST -s "$API_URL/api/assisted-install/v2/infra-envs/$INFRA_ENV_ID/hosts/$HOST_ID/actions/install" -H "Authorization: Bearer ${API_TOKEN}"
As the installation proceeds, the installation generates pending certificate signing requests (CSRs) for the host.
Important: You must approve the CSRs to complete the installation.
Keep running the following API call to monitor the cluster installation:
$ curl -s "$API_URL/api/assisted-install/v2/clusters/$CLUSTER_ID" -H "Authorization: Bearer ${API_TOKEN}" | jq '{ "Cluster day-2 hosts": [ .hosts[] | select(.status != "installed") | {id, requested_hostname, status, status_info, progress, status_updated_at, updated_at, infra_env_id, cluster_id, created_at} ] }'
Example output
{ "Cluster day-2 hosts": [ { "id": "a1c52dde-3432-4f59-b2ae-0a530c851480", "requested_hostname": "control-plane-1", "status": "added-to-existing-cluster", "status_info": "Host has rebooted and no further updates will be posted. Please check console for progress and to possibly approve pending CSRs", "progress": { "current_stage": "Done", "installation_percentage": 100, "stage_started_at": "2022-07-08T10:56:20.476Z", "stage_updated_at": "2022-07-08T10:56:20.476Z" }, "status_updated_at": "2022-07-08T10:56:20.476Z", "updated_at": "2022-07-08T10:57:15.306369Z", "infra_env_id": "b74ec0c3-d5b5-4717-a866-5b6854791bd3", "cluster_id": "8f721322-419d-4eed-aa5b-61b50ea586ae", "created_at": "2022-07-06T22:54:57.161614Z" } ] }
Optional: Run the following command to see all the events for the cluster:
$ curl -s "$API_URL/api/assisted-install/v2/events?cluster_id=$CLUSTER_ID" -H "Authorization: Bearer ${API_TOKEN}" | jq -c '.[] | {severity, message, event_time, host_id}'
Example output
{"severity":"info","message":"Host compute-0: updated status from insufficient to known (Host is ready to be installed)","event_time":"2022-07-08T11:21:46.346Z","host_id":"9d7b3b44-1125-4ad0-9b14-76550087b445"} {"severity":"info","message":"Host compute-0: updated status from known to installing (Installation is in progress)","event_time":"2022-07-08T11:28:28.647Z","host_id":"9d7b3b44-1125-4ad0-9b14-76550087b445"} {"severity":"info","message":"Host compute-0: updated status from installing to installing-in-progress (Starting installation)","event_time":"2022-07-08T11:28:52.068Z","host_id":"9d7b3b44-1125-4ad0-9b14-76550087b445"} {"severity":"info","message":"Uploaded logs for host compute-0 cluster 8f721322-419d-4eed-aa5b-61b50ea586ae","event_time":"2022-07-08T11:29:47.802Z","host_id":"9d7b3b44-1125-4ad0-9b14-76550087b445"} {"severity":"info","message":"Host compute-0: updated status from installing-in-progress to added-to-existing-cluster (Host has rebooted and no further updates will be posted. Please check console for progress and to possibly approve pending CSRs)","event_time":"2022-07-08T11:29:48.259Z","host_id":"9d7b3b44-1125-4ad0-9b14-76550087b445"} {"severity":"info","message":"Host: compute-0, reached installation stage Rebooting","event_time":"2022-07-08T11:29:48.261Z","host_id":"9d7b3b44-1125-4ad0-9b14-76550087b445"}
- Log in to the cluster and approve the pending CSRs to complete the installation.
Verification
Check that the new host was successfully added to the cluster with a status of Ready:
$ oc get nodes
Example output
NAME                          STATUS   ROLES           AGE   VERSION
control-plane-1.example.com   Ready    master,worker   56m   v1.25.0
compute-1.example.com         Ready    worker          11m   v1.25.0
12.5. Replacing a control plane node in a healthy cluster
You can replace a control plane (master) node in a healthy OpenShift Container Platform cluster that has three to five control plane nodes, by adding a new control plane node and removing an existing control plane node.
If the cluster is unhealthy, you must perform additional operations before you can manage the control plane nodes. See Replacing a control plane node in an unhealthy cluster for more information.
12.5.1. Adding a new control plane node
Add the new control plane node, and verify that it is healthy. In the example below, the new node is node-5.
Prerequisites
- You are using OpenShift Container Platform 4.11 or later.
- You have installed a healthy cluster with at least three control plane nodes.
- You have created a single control plane node to be added to the cluster for Day 2.
Procedure
Retrieve pending Certificate Signing Requests (CSRs) for the new Day 2 control plane node:
$ oc get csr | grep Pending
Example output
csr-5sd59   8m19s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>   Pending
csr-xzqts   10s     kubernetes.io/kubelet-serving                 system:node:node-5                                                           <none>   Pending
Approve all pending CSRs for the new node (node-5 in this example):
$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve
Important: You must approve the CSRs to complete the installation.
Confirm that the new control plane node is in Ready status:
$ oc get nodes
Example output
NAME     STATUS   ROLES    AGE     VERSION
node-0   Ready    master   4h42m   v1.24.0+3882f8f
node-1   Ready    master   4h27m   v1.24.0+3882f8f
node-2   Ready    master   4h43m   v1.24.0+3882f8f
node-3   Ready    worker   4h29m   v1.24.0+3882f8f
node-4   Ready    worker   4h30m   v1.24.0+3882f8f
node-5   Ready    master   105s    v1.24.0+3882f8f
Note: The etcd operator requires a Machine custom resource (CR) that references the new node when the cluster runs with a Machine API. The Machine API is automatically activated when the cluster has three or more control plane nodes.
Create the BareMetalHost and Machine CRs and link them to the new control plane's Node CR.
Create the BareMetalHost CR with a unique .metadata.name value:
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: node-5
  namespace: openshift-machine-api
spec:
  automatedCleaningMode: metadata
  bootMACAddress: 00:00:00:00:00:02
  bootMode: UEFI
  customDeploy:
    method: install_coreos
  externallyProvisioned: true
  online: true
  userData:
    name: master-user-data-managed
    namespace: openshift-machine-api
Apply the BareMetalHost CR:
$ oc apply -f <filename>
Replace <filename> with the name of the BareMetalHost CR.
Create the Machine CR using the unique .metadata.name value:
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
    machine.openshift.io/instance-state: externally provisioned
    metal3.io/BareMetalHost: openshift-machine-api/node-5
  finalizers:
  - machine.machine.openshift.io
  labels:
    machine.openshift.io/cluster-api-cluster: <cluster_name>
    machine.openshift.io/cluster-api-machine-role: master
    machine.openshift.io/cluster-api-machine-type: master
  name: node-5
  namespace: openshift-machine-api
spec:
  metadata: {}
  providerSpec:
    value:
      apiVersion: baremetal.cluster.k8s.io/v1alpha1
      customDeploy:
        method: install_coreos
      hostSelector: {}
      image:
        checksum: ""
        url: ""
      kind: BareMetalMachineProviderSpec
      metadata:
        creationTimestamp: null
      userData:
        name: master-user-data-managed
Replace <cluster_name> with the name of the specific cluster, for example, test-day2-1-6qv96.
To get the cluster name, run the following command:
$ oc get infrastructure cluster -o=jsonpath='{.status.infrastructureName}{"\n"}'
Apply the Machine CR:
$ oc apply -f <filename>
Replace <filename> with the name of the Machine CR.
Link BareMetalHost, Machine, and Node by running the link-machine-and-node.sh script:
Copy the link-machine-and-node.sh script below to a local machine:
#!/bin/bash
# Credit goes to
# https://bugzilla.redhat.com/show_bug.cgi?id=1801238.
# This script will link Machine object
# and Node object. This is needed
# in order to have IP address of
# the Node present in the status of the Machine.

set -e

machine="$1"
node="$2"

if [ -z "$machine" ] || [ -z "$node" ]; then
    echo "Usage: $0 MACHINE NODE"
    exit 1
fi

node_name=$(echo "${node}" | cut -f2 -d':')

oc proxy &
proxy_pid=$!
function kill_proxy {
    kill $proxy_pid
}
trap kill_proxy EXIT SIGINT

HOST_PROXY_API_PATH="http://localhost:8001/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts"

function print_nics() {
    local ips
    local eob
    declare -a ips
    readarray -t ips < <(echo "${1}" \
        | jq '.[] | select(. | .type == "InternalIP") | .address' \
        | sed 's/"//g')
    eob=','
    for (( i=0; i<${#ips[@]}; i++ )); do
        if [ $((i+1)) -eq ${#ips[@]} ]; then
            eob=""
        fi
        cat <<- EOF
          {
            "ip": "${ips[$i]}",
            "mac": "00:00:00:00:00:00",
            "model": "unknown",
            "speedGbps": 10,
            "vlanId": 0,
            "pxe": true,
            "name": "eth1"
          }${eob}
EOF
    done
}

function wait_for_json() {
    local name
    local url
    local curl_opts
    local timeout

    local start_time
    local curr_time
    local time_diff

    name="$1"
    url="$2"
    timeout="$3"
    shift 3
    curl_opts="$@"
    echo -n "Waiting for $name to respond"
    start_time=$(date +%s)
    until curl -g -X GET "$url" "${curl_opts[@]}" 2> /dev/null | jq '.' 2> /dev/null > /dev/null; do
        echo -n "."
        curr_time=$(date +%s)
        time_diff=$((curr_time - start_time))
        if [[ $time_diff -gt $timeout ]]; then
            printf '\nTimed out waiting for %s' "${name}"
            return 1
        fi
        sleep 5
    done
    echo " Success!"
    return 0
}

wait_for_json oc_proxy "${HOST_PROXY_API_PATH}" 10 -H "Accept: application/json" -H "Content-Type: application/json"

addresses=$(oc get node -n openshift-machine-api "${node_name}" -o json | jq -c '.status.addresses')

machine_data=$(oc get machines.machine.openshift.io -n openshift-machine-api -o json "${machine}")
host=$(echo "$machine_data" | jq '.metadata.annotations["metal3.io/BareMetalHost"]' | cut -f2 -d/ | sed 's/"//g')

if [ -z "$host" ]; then
    echo "Machine $machine is not linked to a host yet." 1>&2
    exit 1
fi

# The address structure on the host doesn't match the node, so extract
# the values we want into separate variables so we can build the patch
# we need.
hostname=$(echo "${addresses}" | jq '.[] | select(. | .type == "Hostname") | .address' | sed 's/"//g')

set +e
read -r -d '' host_patch << EOF
{
  "status": {
    "hardware": {
      "hostname": "${hostname}",
      "nics": [
$(print_nics "${addresses}")
      ],
      "systemVendor": {
        "manufacturer": "Red Hat",
        "productName": "product name",
        "serialNumber": ""
      },
      "firmware": {
        "bios": {
          "date": "04/01/2014",
          "vendor": "SeaBIOS",
          "version": "1.11.0-2.el7"
        }
      },
      "ramMebibytes": 0,
      "storage": [],
      "cpu": {
        "arch": "x86_64",
        "model": "Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz",
        "clockMegahertz": 2199.998,
        "count": 4,
        "flags": []
      }
    }
  }
}
EOF
set -e

echo "PATCHING HOST"
echo "${host_patch}" | jq .

curl -s \
    -X PATCH \
    "${HOST_PROXY_API_PATH}/${host}/status" \
    -H "Content-type: application/merge-patch+json" \
    -d "${host_patch}"

oc get baremetalhost -n openshift-machine-api -o yaml "${host}"
Make the script executable:
$ chmod +x link-machine-and-node.sh
Run the script:
$ bash link-machine-and-node.sh node-5 node-5
Note: The first node-5 instance represents the machine, and the second represents the node.
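You can verify that the script linked the objects by checking that the Machine now reports the node's addresses. A sketch, assuming the Machine is named node-5:
$ oc get machine node-5 -n openshift-machine-api -o jsonpath='{.status.addresses}'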
Confirm members of etcd by executing into one of the pre-existing control plane nodes:
Open a remote shell session to the control plane node:
$ oc rsh -n openshift-etcd etcd-node-0
List the etcd members:
# etcdctl member list -w table
Example output
+--------+---------+--------+--------------+--------------+---------+
|   ID   | STATUS  |  NAME  |  PEER ADDRS  | CLIENT ADDRS | LEARNER |
+--------+---------+--------+--------------+--------------+---------+
|76ae1d00| started | node-0 |192.168.111.24|192.168.111.24| false   |
|2c18942f| started | node-1 |192.168.111.26|192.168.111.26| false   |
|61e2a860| started | node-2 |192.168.111.25|192.168.111.25| false   |
|ead5f280| started | node-5 |192.168.111.28|192.168.111.28| false   |
+--------+---------+--------+--------------+--------------+---------+
Monitor the etcd operator configuration process until completion:
$ oc get clusteroperator etcd
Example output (upon completion)
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd   4.11.5    True        False         False      5h54m
Confirm etcd health by running the following commands:
Open a remote shell session to the control plane node:
$ oc rsh -n openshift-etcd etcd-node-0
Check endpoint health:
# etcdctl endpoint health
Example output
192.168.111.24 is healthy: committed proposal: took = 10.383651ms
192.168.111.26 is healthy: committed proposal: took = 11.297561ms
192.168.111.25 is healthy: committed proposal: took = 13.892416ms
192.168.111.28 is healthy: committed proposal: took = 11.870755ms
Verify that all nodes are ready:
$ oc get nodes
Example output
NAME     STATUS   ROLES    AGE     VERSION
node-0   Ready    master   6h20m   v1.24.0+3882f8f
node-1   Ready    master   6h20m   v1.24.0+3882f8f
node-2   Ready    master   6h4m    v1.24.0+3882f8f
node-3   Ready    worker   6h7m    v1.24.0+3882f8f
node-4   Ready    worker   6h7m    v1.24.0+3882f8f
node-5   Ready    master   99m     v1.24.0+3882f8f
Verify that the cluster Operators are all available:
$ oc get ClusterOperators
Example output
NAME                                 VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MSG
authentication                       4.11.5    True        False         False      5h57m
baremetal                            4.11.5    True        False         False      6h19m
cloud-controller-manager             4.11.5    True        False         False      6h20m
cloud-credential                     4.11.5    True        False         False      6h23m
cluster-autoscaler                   4.11.5    True        False         False      6h18m
config-operator                      4.11.5    True        False         False      6h19m
console                              4.11.5    True        False         False      6h4m
csi-snapshot-controller              4.11.5    True        False         False      6h19m
dns                                  4.11.5    True        False         False      6h18m
etcd                                 4.11.5    True        False         False      6h17m
image-registry                       4.11.5    True        False         False      6h7m
ingress                              4.11.5    True        False         False      6h6m
insights                             4.11.5    True        False         False      6h12m
kube-apiserver                       4.11.5    True        False         False      6h16m
kube-controller-manager              4.11.5    True        False         False      6h16m
kube-scheduler                       4.11.5    True        False         False      6h16m
kube-storage-version-migrator        4.11.5    True        False         False      6h19m
machine-api                          4.11.5    True        False         False      6h15m
machine-approver                     4.11.5    True        False         False      6h19m
machine-config                       4.11.5    True        False         False      6h18m
marketplace                          4.11.5    True        False         False      6h18m
monitoring                           4.11.5    True        False         False      6h4m
network                              4.11.5    True        False         False      6h20m
node-tuning                          4.11.5    True        False         False      6h18m
openshift-apiserver                  4.11.5    True        False         False      6h8m
openshift-controller-manager         4.11.5    True        False         False      6h7m
openshift-samples                    4.11.5    True        False         False      6h12m
operator-lifecycle-manager           4.11.5    True        False         False      6h18m
operator-lifecycle-manager-catalog   4.11.5    True        False         False      6h19m
operator-lifecycle-manager-pkgsvr    4.11.5    True        False         False      6h12m
service-ca                           4.11.5    True        False         False      6h19m
storage                              4.11.5    True        False         False      6h19m
Verify that the cluster version is correct:
$ oc get ClusterVersion
Example output
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.5    True        False         5h57m   Cluster version is 4.11.5
12.5.2. Removing the existing control plane node
Remove the control plane node that you are replacing. This is node-0 in the example below.
Prerequisites
- You have added a new healthy control plane node.
Procedure
Delete the BareMetalHost CR of the pre-existing control plane node:
$ oc delete bmh -n openshift-machine-api node-0
Confirm that the machine is unhealthy:
$ oc get machine -A
Example output
NAMESPACE               NAME     PHASE     AGE
openshift-machine-api   node-0   Failed    20h
openshift-machine-api   node-1   Running   20h
openshift-machine-api   node-2   Running   20h
openshift-machine-api   node-3   Running   19h
openshift-machine-api   node-4   Running   19h
openshift-machine-api   node-5   Running   14h
Delete the Machine CR:
$ oc delete machine -n openshift-machine-api node-0
machine.machine.openshift.io "node-0" deleted
Confirm removal of the Node CR:
$ oc get nodes
Example output
NAME     STATUS   ROLES    AGE   VERSION
node-1   Ready    master   20h   v1.24.0+3882f8f
node-2   Ready    master   19h   v1.24.0+3882f8f
node-3   Ready    worker   19h   v1.24.0+3882f8f
node-4   Ready    worker   19h   v1.24.0+3882f8f
node-5   Ready    master   15h   v1.24.0+3882f8f
Check the etcd-operator logs to confirm the status of the etcd cluster:
$ oc logs -n openshift-etcd-operator etcd-operator-8668df65d-lvpjf
Example output
E0927 07:53:10.597523 1 base_controller.go:272] ClusterMemberRemovalController reconciliation failed: cannot remove member: 192.168.111.23 because it is reported as healthy but it doesn't have a machine nor a node resource
Remove the physical machine to allow the etcd operator to reconcile the cluster members:
Open a remote shell session to the control plane node:
$ oc rsh -n openshift-etcd etcd-node-1
Monitor the progress of etcd operator reconciliation by checking members and endpoint health:
# etcdctl member list -w table; etcdctl endpoint health
Example output
+--------+---------+--------+--------------+--------------+---------+
|   ID   | STATUS  |  NAME  |  PEER ADDRS  | CLIENT ADDRS | LEARNER |
+--------+---------+--------+--------------+--------------+---------+
|2c18942f| started | node-1 |192.168.111.26|192.168.111.26| false   |
|61e2a860| started | node-2 |192.168.111.25|192.168.111.25| false   |
|ead4f280| started | node-5 |192.168.111.28|192.168.111.28| false   |
+--------+---------+--------+--------------+--------------+---------+
192.168.111.26 is healthy: committed proposal: took = 10.458132ms
192.168.111.25 is healthy: committed proposal: took = 11.047349ms
192.168.111.28 is healthy: committed proposal: took = 11.414402ms
12.6. Replacing a control plane node in an unhealthy cluster
You can replace an unhealthy control plane (master) node in an OpenShift Container Platform cluster that has three to five control plane nodes, by removing the unhealthy control plane node and adding a new one.
For details on replacing a control plane node in a healthy cluster, see Replacing a control plane node in a healthy cluster.
12.6.1. Removing an unhealthy control plane node
Remove the unhealthy control plane node from the cluster. This is node-0 in the example below.
Prerequisites
- You have installed a cluster with at least three control plane nodes.
- At least one of the control plane nodes is not ready.
Procedure
Check the node status to confirm that a control plane node is not ready:
$ oc get nodes
Example output
NAME     STATUS     ROLES    AGE   VERSION
node-0   NotReady   master   20h   v1.24.0+3882f8f
node-1   Ready      master   20h   v1.24.0+3882f8f
node-2   Ready      master   20h   v1.24.0+3882f8f
node-3   Ready      worker   20h   v1.24.0+3882f8f
node-4   Ready      worker   20h   v1.24.0+3882f8f
Confirm in the etcd-operator logs that the cluster is unhealthy:
$ oc logs -n openshift-etcd-operator deployment/etcd-operator
Example output
E0927 08:24:23.983733 1 base_controller.go:272] DefragController reconciliation failed: cluster is unhealthy: 2 of 3 members are available, node-0 is unhealthy
Confirm the etcd members by running the following commands:
Open a remote shell session to the control plane node:
$ oc rsh -n openshift-etcd etcd-node-1
List the etcd members:
# etcdctl member list -w table
Example output
+--------+---------+--------+--------------+--------------+---------+
|   ID   | STATUS  |  NAME  |  PEER ADDRS  | CLIENT ADDRS | LEARNER |
+--------+---------+--------+--------------+--------------+---------+
|61e2a860| started | node-0 |192.168.111.25|192.168.111.25| false   |
|2c18942f| started | node-1 |192.168.111.26|192.168.111.26| false   |
|ead4f280| started | node-2 |192.168.111.28|192.168.111.28| false   |
+--------+---------+--------+--------------+--------------+---------+
Confirm that etcdctl endpoint health reports an unhealthy member of the cluster:
# etcdctl endpoint health
Example output
{"level":"warn","ts":"2022-09-27T08:25:35.953Z","logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000680380/192.168.111.25","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 192.168.111.25: connect: no route to host\""} 192.168.111.28 is healthy: committed proposal: took = 12.465641ms 192.168.111.26 is healthy: committed proposal: took = 12.297059ms 192.168.111.25 is unhealthy: failed to commit proposal: context deadline exceeded Error: unhealthy cluster
Remove the unhealthy control plane node by deleting the Machine custom resource (CR):
$ oc delete machine -n openshift-machine-api node-0
Note: The Machine and Node CRs might not be deleted because they are protected by finalizers. If this occurs, you must delete the Machine CR manually by removing all finalizers, for example as shown below.
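A minimal way to clear the finalizers, assuming the stuck Machine CR is node-0, is to patch them away:
$ oc patch machine node-0 -n openshift-machine-api --type merge -p '{"metadata":{"finalizers":null}}'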
Verify in the etcd-operator logs whether the unhealthy machine has been removed:
$ oc logs -n openshift-etcd-operator deployment/etcd-operator
Example output
I0927 08:58:41.249222 1 machinedeletionhooks.go:135] skip removing the deletion hook from machine node-0 since its member is still present with any of: [{InternalIP } {InternalIP 192.168.111.25}]
If you see that removal has been skipped, as in the above log example, manually remove the unhealthy etcd member:
Open a remote shell session to the control plane node:
$ oc rsh -n openshift-etcd etcd-node-1
List the etcd members:
# etcdctl member list -w table
Example output
+--------+---------+--------+--------------+--------------+---------+
|   ID   | STATUS  |  NAME  |  PEER ADDRS  | CLIENT ADDRS | LEARNER |
+--------+---------+--------+--------------+--------------+---------+
|61e2a860| started | node-0 |192.168.111.25|192.168.111.25| false   |
|2c18942f| started | node-1 |192.168.111.26|192.168.111.26| false   |
|ead4f280| started | node-2 |192.168.111.28|192.168.111.28| false   |
+--------+---------+--------+--------------+--------------+---------+
Confirm that etcdctl endpoint health reports an unhealthy member of the cluster:
# etcdctl endpoint health
Example output
{"level":"warn","ts":"2022-09-27T10:31:07.227Z","logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0000d6e00/192.168.111.25","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 192.168.111.25: connect: no route to host\""} 192.168.111.28 is healthy: committed proposal: took = 13.038278ms 192.168.111.26 is healthy: committed proposal: took = 12.950355ms 192.168.111.25 is unhealthy: failed to commit proposal: context deadline exceeded Error: unhealthy cluster
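As an optional variant, you can check every member at once and print the result as a table. This assumes an `etcdctl` v3 client that is recent enough to support the `--cluster` and `-w table` options:

# etcdctl endpoint health --cluster -w table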
Remove the unhealthy etcd member from the cluster:
# etcdctl member remove 61e2a86084aafa62
Example output
Member 61e2a86084aafa62 removed from cluster 6881c977b97990d7
Verify that the unhealthy etcd member was removed by running the following command:
# etcdctl member list -w table
Example output
+----------+---------+--------+----------------+----------------+---------+
|    ID    | STATUS  |  NAME  |   PEER ADDRS   |  CLIENT ADDRS  | LEARNER |
+----------+---------+--------+----------------+----------------+---------+
| 2c18942f | started | node-1 | 192.168.111.26 | 192.168.111.26 | false   |
| ead4f280 | started | node-2 | 192.168.111.28 | 192.168.111.28 | false   |
+----------+---------+--------+----------------+----------------+---------+
12.6.2. Adding a new control plane node
Add a new control plane node to replace the unhealthy node that you removed. In the example below, the new node is `node-5`.
Prerequisites
- You have installed a Day 2 control plane node. For more information, see Adding hosts with the web console or Adding hosts with the API.
Procedure
Retrieve pending Certificate Signing Requests (CSRs) for the new Day 2 control plane node:
$ oc get csr | grep Pending
Example output
csr-5sd59   8m19s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>   Pending
csr-xzqts   10s     kubernetes.io/kubelet-serving                 system:node:node-5                                                           <none>   Pending
Approve all pending CSRs for the new node (`node-5` in this example):
$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve
Note: You must approve the CSRs to complete the installation.
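If you prefer to approve CSRs individually rather than in bulk, you can approve each pending request by name. The CSR name in this example is taken from the sample output above:

$ oc adm certificate approve csr-5sd59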
Confirm that the control plane node is in `Ready` status:
$ oc get nodes
Example output
NAME     STATUS   ROLES    AGE     VERSION
node-1   Ready    master   20h     v1.24.0+3882f8f
node-2   Ready    master   20h     v1.24.0+3882f8f
node-3   Ready    worker   20h     v1.24.0+3882f8f
node-4   Ready    worker   20h     v1.24.0+3882f8f
node-5   Ready    master   2m52s   v1.24.0+3882f8f
The `etcd` operator requires a `Machine` CR that references the new node when the cluster runs with the Machine API. The Machine API is automatically activated when the cluster has three control plane nodes.

Create the `BareMetalHost` and `Machine` CRs and link them to the new control plane node's `Node` CR.

Important: Boot-it-yourself does not create the `BareMetalHost` and `Machine` CRs, so you must create them yourself. Failure to create the `BareMetalHost` and `Machine` CRs generates errors in the `etcd` operator.

Create the `BareMetalHost` CR with a unique `.metadata.name` value:
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: node-5
  namespace: openshift-machine-api
spec:
  automatedCleaningMode: metadata
  bootMACAddress: 00:00:00:00:00:02
  bootMode: UEFI
  customDeploy:
    method: install_coreos
  externallyProvisioned: true
  online: true
  userData:
    name: master-user-data-managed
    namespace: openshift-machine-api
Apply the `BareMetalHost` CR:
$ oc apply -f <filename>
Replace `<filename>` with the name of the file that contains the `BareMetalHost` CR.
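Optionally, confirm that the `BareMetalHost` CR was created. This check assumes the example name `node-5`:

$ oc get baremetalhost -n openshift-machine-api node-5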
Create the `Machine` CR using the unique `.metadata.name` value:
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
    machine.openshift.io/instance-state: externally provisioned
    metal3.io/BareMetalHost: openshift-machine-api/node-5
  finalizers:
  - machine.machine.openshift.io
  labels:
    machine.openshift.io/cluster-api-cluster: test-day2-1-6qv96
    machine.openshift.io/cluster-api-machine-role: master
    machine.openshift.io/cluster-api-machine-type: master
  name: node-5
  namespace: openshift-machine-api
spec:
  metadata: {}
  providerSpec:
    value:
      apiVersion: baremetal.cluster.k8s.io/v1alpha1
      customDeploy:
        method: install_coreos
      hostSelector: {}
      image:
        checksum: ""
        url: ""
      kind: BareMetalMachineProviderSpec
      metadata:
        creationTimestamp: null
      userData:
        name: master-user-data-managed
Apply the `Machine` CR:
$ oc apply -f <filename>
Replace `<filename>` with the name of the file that contains the `Machine` CR.
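Optionally, confirm that the `Machine` CR was created before linking it to the node. This check assumes the example name `node-5`:

$ oc get machines.machine.openshift.io -n openshift-machine-api node-5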
Link the `BareMetalHost`, `Machine`, and `Node` CRs by running the `link-machine-and-node.sh` script:

Copy the `link-machine-and-node.sh` script below to a local machine:
#!/bin/bash
# Credit goes to
# https://bugzilla.redhat.com/show_bug.cgi?id=1801238.
# This script will link Machine object
# and Node object. This is needed
# in order to have IP address of
# the Node present in the status of the Machine.

set -e

machine="$1"
node="$2"

if [ -z "$machine" ] || [ -z "$node" ]; then
    echo "Usage: $0 MACHINE NODE"
    exit 1
fi

node_name=$(echo "${node}" | cut -f2 -d':')

oc proxy &
proxy_pid=$!
function kill_proxy {
    kill $proxy_pid
}
trap kill_proxy EXIT SIGINT

HOST_PROXY_API_PATH="http://localhost:8001/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts"

function print_nics() {
    local ips
    local eob
    declare -a ips

    readarray -t ips < <(echo "${1}" \
        | jq '.[] | select(. | .type == "InternalIP") | .address' \
        | sed 's/"//g')

    eob=','
    for (( i=0; i<${#ips[@]}; i++ )); do
        if [ $((i+1)) -eq ${#ips[@]} ]; then
            eob=""
        fi
        cat <<- EOF
          {
            "ip": "${ips[$i]}",
            "mac": "00:00:00:00:00:00",
            "model": "unknown",
            "speedGbps": 10,
            "vlanId": 0,
            "pxe": true,
            "name": "eth1"
          }${eob}
EOF
    done
}

function wait_for_json() {
    local name
    local url
    local curl_opts
    local timeout

    local start_time
    local curr_time
    local time_diff

    name="$1"
    url="$2"
    timeout="$3"
    shift 3
    curl_opts="$@"
    echo -n "Waiting for $name to respond"
    start_time=$(date +%s)
    until curl -g -X GET "$url" "${curl_opts[@]}" 2> /dev/null | jq '.' 2> /dev/null > /dev/null; do
        echo -n "."
        curr_time=$(date +%s)
        time_diff=$((curr_time - start_time))
        if [[ $time_diff -gt $timeout ]]; then
            printf '\nTimed out waiting for %s' "${name}"
            return 1
        fi
        sleep 5
    done
    echo " Success!"
    return 0
}
wait_for_json oc_proxy "${HOST_PROXY_API_PATH}" 10 -H "Accept: application/json" -H "Content-Type: application/json"

addresses=$(oc get node -n openshift-machine-api "${node_name}" -o json | jq -c '.status.addresses')

machine_data=$(oc get machines.machine.openshift.io -n openshift-machine-api -o json "${machine}")
host=$(echo "$machine_data" | jq '.metadata.annotations["metal3.io/BareMetalHost"]' | cut -f2 -d/ | sed 's/"//g')

if [ -z "$host" ]; then
    echo "Machine $machine is not linked to a host yet." 1>&2
    exit 1
fi

# The address structure on the host doesn't match the node, so extract
# the values we want into separate variables so we can build the patch
# we need.
hostname=$(echo "${addresses}" | jq '.[] | select(. | .type == "Hostname") | .address' | sed 's/"//g')

set +e
read -r -d '' host_patch << EOF
{
  "status": {
    "hardware": {
      "hostname": "${hostname}",
      "nics": [
        $(print_nics "${addresses}")
      ],
      "systemVendor": {
        "manufacturer": "Red Hat",
        "productName": "product name",
        "serialNumber": ""
      },
      "firmware": {
        "bios": {
          "date": "04/01/2014",
          "vendor": "SeaBIOS",
          "version": "1.11.0-2.el7"
        }
      },
      "ramMebibytes": 0,
      "storage": [],
      "cpu": {
        "arch": "x86_64",
        "model": "Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz",
        "clockMegahertz": 2199.998,
        "count": 4,
        "flags": []
      }
    }
  }
}
EOF
set -e

echo "PATCHING HOST"
echo "${host_patch}" | jq .

curl -s \
    -X PATCH \
    "${HOST_PROXY_API_PATH}/${host}/status" \
    -H "Content-type: application/merge-patch+json" \
    -d "${host_patch}"

oc get baremetalhost -n openshift-machine-api -o yaml "${host}"
Make the script executable:
$ chmod +x link-machine-and-node.sh
Run the script:
$ bash link-machine-and-node.sh node-5 node-5
Note: The first `node-5` instance represents the machine, and the second represents the node.
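Optionally, verify that the script patched the host by checking the hardware hostname that it wrote to the `BareMetalHost` status. This is a hedged check that assumes the example name `node-5`:

$ oc get baremetalhost -n openshift-machine-api node-5 -o jsonpath='{.status.hardware.hostname}'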
Confirm the members of `etcd` by running the following commands:

Open a remote shell session to the control plane node:
$ oc rsh -n openshift-etcd node-1
List the etcd members:
# etcdctl member list -w table
Example output
+----------+---------+--------+----------------+----------------+---------+
|    ID    | STATUS  |  NAME  |   PEER ADDRS   |  CLIENT ADDRS  | LEARNER |
+----------+---------+--------+----------------+----------------+---------+
| 2c18942f | started | node-1 | 192.168.111.26 | 192.168.111.26 | false   |
| ead4f280 | started | node-2 | 192.168.111.28 | 192.168.111.28 | false   |
| 79153c5a | started | node-5 | 192.168.111.29 | 192.168.111.29 | false   |
+----------+---------+--------+----------------+----------------+---------+
Monitor the `etcd` operator configuration process until completion:
$ oc get clusteroperator etcd
Example output (upon completion)
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
etcd   4.11.5    True        False         False      22h
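Instead of polling manually, you can block until the `etcd` Operator reports that it is no longer progressing. This is a sketch; adjust the timeout to suit your environment:

$ oc wait clusteroperator/etcd --for=condition=Progressing=False --timeout=30m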
Confirm `etcd` health by running the following commands:

Open a remote shell session to the control plane node:
$ oc rsh -n openshift-etcd node-1
Check endpoint health:
# etcdctl endpoint health
Example output
192.168.111.26 is healthy: committed proposal: took = 9.105375ms
192.168.111.28 is healthy: committed proposal: took = 9.15205ms
192.168.111.29 is healthy: committed proposal: took = 10.277577ms
Confirm the health of the nodes:
$ oc get Nodes
Example output
NAME     STATUS   ROLES    AGE   VERSION
node-1   Ready    master   20h   v1.24.0+3882f8f
node-2   Ready    master   20h   v1.24.0+3882f8f
node-3   Ready    worker   20h   v1.24.0+3882f8f
node-4   Ready    worker   20h   v1.24.0+3882f8f
node-5   Ready    master   40m   v1.24.0+3882f8f
Verify that the cluster Operators are all available:
$ oc get ClusterOperators
Example output
NAME                                 VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                       4.11.5    True        False         False      150m
baremetal                            4.11.5    True        False         False      22h
cloud-controller-manager             4.11.5    True        False         False      22h
cloud-credential                     4.11.5    True        False         False      22h
cluster-autoscaler                   4.11.5    True        False         False      22h
config-operator                      4.11.5    True        False         False      22h
console                              4.11.5    True        False         False      145m
csi-snapshot-controller              4.11.5    True        False         False      22h
dns                                  4.11.5    True        False         False      22h
etcd                                 4.11.5    True        False         False      22h
image-registry                       4.11.5    True        False         False      22h
ingress                              4.11.5    True        False         False      22h
insights                             4.11.5    True        False         False      22h
kube-apiserver                       4.11.5    True        False         False      22h
kube-controller-manager              4.11.5    True        False         False      22h
kube-scheduler                       4.11.5    True        False         False      22h
kube-storage-version-migrator        4.11.5    True        False         False      148m
machine-api                          4.11.5    True        False         False      22h
machine-approver                     4.11.5    True        False         False      22h
machine-config                       4.11.5    True        False         False      110m
marketplace                          4.11.5    True        False         False      22h
monitoring                           4.11.5    True        False         False      22h
network                              4.11.5    True        False         False      22h
node-tuning                          4.11.5    True        False         False      22h
openshift-apiserver                  4.11.5    True        False         False      163m
openshift-controller-manager         4.11.5    True        False         False      22h
openshift-samples                    4.11.5    True        False         False      22h
operator-lifecycle-manager           4.11.5    True        False         False      22h
operator-lifecycle-manager-catalog   4.11.5    True        False         False      22h
operator-lifecycle-manager-pkgsvr    4.11.5    True        False         False      22h
service-ca                           4.11.5    True        False         False      22h
storage                              4.11.5    True        False         False      22h
Verify that the cluster version is correct:
$ oc get ClusterVersion
Example output
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.5    True        False         22h     Cluster version is 4.11.5