Replacing nodes
Abstract
Instructions for how to safely replace a node in an OpenShift Data Foundation cluster.
Making open source more inclusive
Red Hat is committed to replacing problematic language in our code, documentation, and web properties. We are beginning with these four terms: master, slave, blacklist, and whitelist. Because of the enormity of this endeavor, these changes will be implemented gradually over several upcoming releases. For more details, see our CTO Chris Wright’s message.
Providing feedback on Red Hat documentation
We appreciate your input on our documentation. Let us know how we can improve it.
To give feedback, create a Jira ticket:
- Log in to Jira.
- Click Create in the top navigation bar.
- Enter a descriptive title in the Summary field.
- Enter your suggestion for improvement in the Description field. Include links to the relevant parts of the documentation.
- Select Documentation in the Components field.
- Click Create at the bottom of the dialog.
Preface
For OpenShift Data Foundation, node replacement can be performed proactively for an operational node and reactively for a failed node for the following deployments:
For Amazon Web Services (AWS)
- User-provisioned infrastructure
- Installer-provisioned infrastructure
For VMware
- User-provisioned infrastructure
- Installer-provisioned infrastructure
For Microsoft Azure
- Installer-provisioned infrastructure
For Google Cloud
- Installer-provisioned infrastructure
For local storage devices
- Bare metal
- VMware
- IBM Power
- For replacing your storage nodes in external mode, see Red Hat Ceph Storage documentation.
Chapter 1. OpenShift Data Foundation deployed using dynamic devices
1.1. OpenShift Data Foundation deployed on AWS
To replace an operational node, see:
- Section 1.1.1, "Replacing an operational AWS node on user-provisioned infrastructure"
- Section 1.1.2, "Replacing an operational AWS node on installer-provisioned infrastructure"
To replace a failed node, see:
- Section 1.1.3, "Replacing a failed AWS node on user-provisioned infrastructure"
- Section 1.1.4, "Replacing a failed AWS node on installer-provisioned infrastructure"
1.1.1. Replacing an operational AWS node on user-provisioned infrastructure
Prerequisites
- Ensure that the replacement nodes are configured with similar infrastructure and resources to the node that you replace.
- You must be logged into the OpenShift Container Platform cluster.
When you replace an AWS node on user-provisioned infrastructure, you must create the new node in the same AWS zone as the original node.
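For example, to confirm the availability zone of the node that you are replacing before you create its replacement, you can check the zone label (a quick sketch; older clusters might expose the zone only through the deprecated failure-domain.beta.kubernetes.io/zone label):
$ oc get node <node_name> -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}'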
Procedure
- Identify the node that you need to replace.
Mark the node as unschedulable:
$ oc adm cordon <node_name>
<node_name>
- Specify the name of the node that you need to replace.
Drain the node:
$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets
Important: This activity might take 5 to 10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when you label the new node and it becomes functional.
Delete the node:
$ oc delete nodes <node_name>
- Create a new Amazon Web Services (AWS) machine instance with the required infrastructure. See Platform requirements.
- Create a new OpenShift Container Platform node using the new AWS machine instance.
Check for the Certificate Signing Requests (CSRs) related to OpenShift Container Platform that are in Pending state:
$ oc get csr
Approve all the required OpenShift Container Platform CSRs for the new node:
$ oc adm certificate approve <certificate_name>
<certificate_name>
- Specify the name of the CSR.
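If several CSRs are pending, you can approve them in one pass. This is a convenience sketch rather than part of the documented procedure; it approves every pending CSR, so review the output of oc get csr first:
$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve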
- Click Compute → Nodes. Confirm that the new node is in Ready state.
Apply the OpenShift Data Foundation label to the new node using one of the following:
- From the user interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage, and click Save.
- From the command-line interface
- Apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
<new_node_name>
- Specify the name of the new node.
Verification steps
Verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all the other required OpenShift Data Foundation pods are in Running state.
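The same check can be made from the CLI by listing the openshift-storage pods scheduled on the new node, for example (a sketch; substitute the node name):
$ oc get pods -n openshift-storage -o wide --field-selector spec.nodeName=<new_node_name>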
Verify that the new Object Storage Device (OSD) pods are running on the replacement node:
$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host:
$ oc debug node/<node_name>
$ chroot /host
Display the list of available block devices:
$ lsblk
Check for the crypt keyword beside the one or more ocs-deviceset names.
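On an encrypted cluster, the encrypted OSD device shows up as a crypt entry in the lsblk output, similar to the following illustrative example (device names, mapper suffixes, and sizes vary):
sdb                                            8:16   0   512G  0 disk
└─ocs-deviceset-0-data-0-xxxxx-block-dmcrypt 253:0    0   512G  0 crypt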
- If the verification steps fail, contact Red Hat Support.
1.1.2. Replacing an operational AWS node on installer-provisioned infrastructure
Procedure
- Log in to the OpenShift Web Console, and click Compute → Nodes.
- Identify the node that you need to replace. Take a note of its Machine Name.
Mark the node as unschedulable:
$ oc adm cordon <node_name>
<node_name>
- Specify the name of the node that you need to replace.
Drain the node:
$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets
Important: This activity might take 5 to 10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when you label the new node and it becomes functional.
- Click Compute → Machines. Search for the required machine.
- Beside the required machine, click Action menu (⋮) → Delete Machine.
- Click Delete to confirm that the machine is deleted. A new machine is automatically created.
Wait for the new machine to start and transition into Running state.
Important: This activity might take 5 to 10 minutes or more.
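To follow the replacement from the CLI instead of the console, you can watch the machine phase, for example (a sketch; cluster machines live in the openshift-machine-api namespace):
$ oc get machines -n openshift-machine-api -w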
- Click Compute → Nodes. Confirm that the new node is in Ready state.
Apply the OpenShift Data Foundation label to the new node:
- From the user interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage, and click Save.
- From the command-line interface
- Apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
<new_node_name>
- Specify the name of the new node.
Verification steps
Verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all the other required OpenShift Data Foundation pods are in Running state.
Verify that the new Object Storage Device (OSD) pods are running on the replacement node:
$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host:
$ oc debug node/<node_name>
$ chroot /host
Display the list of available block devices:
$ lsblk
Check for the crypt keyword beside the one or more ocs-deviceset names.
- If the verification steps fail, contact Red Hat Support.
1.1.3. Replacing a failed AWS node on user-provisioned infrastructure
Prerequisites
- Ensure that the replacement nodes are configured with similar infrastructure and resources to the node that you replace.
- You must be logged into the OpenShift Container Platform cluster.
Procedure
- Identify the Amazon Web Services (AWS) machine instance of the node that you need to replace.
- Log in to AWS, and terminate the AWS machine instance that you identified.
- Create a new AWS machine instance with the required infrastructure. See Platform requirements.
- Create a new OpenShift Container Platform node using the new AWS machine instance.
Check for the Certificate Signing Requests (CSRs) related to OpenShift Container Platform that are in Pending state:
$ oc get csr
Approve all the required OpenShift Container Platform CSRs for the new node:
$ oc adm certificate approve <certificate_name>
<certificate_name>
- Specify the name of the CSR.
- Click Compute → Nodes. Confirm that the new node is in Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
- From the user interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage, and click Save.
- From the command-line interface
- Execute the following command to apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
<new_node_name>
- Specify the name of the new node.
Verification steps
Verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all the other required OpenShift Data Foundation pods are in Running state.
Verify that the new Object Storage Device (OSD) pods are running on the replacement node:
$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host:
$ oc debug node/<node_name>
$ chroot /host
Display the list of available block devices:
$ lsblk
Check for the crypt keyword beside the one or more ocs-deviceset names.
- If the verification steps fail, contact Red Hat Support.
1.1.4. Replacing a failed AWS node on installer-provisioned infrastructure
Procedure
- Log in to the OpenShift Web Console, and click Compute → Nodes.
- Identify the faulty node, and click on its Machine Name.
- Click Actions → Edit Annotations, and click Add More.
- Add machine.openshift.io/exclude-node-draining, and click Save.
- Click Actions → Delete Machine, and click Delete.
A new machine is automatically created. Wait for the new machine to start.
Important: This activity might take 5 to 10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when you label the new node and it becomes functional.
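As an alternative to the console steps above, the draining-exclusion annotation can also be applied and the machine deleted from the CLI (a sketch; assumes the Machine name noted earlier and the default openshift-machine-api namespace):
$ oc annotate machine <machine_name> -n openshift-machine-api machine.openshift.io/exclude-node-draining=""
$ oc delete machine <machine_name> -n openshift-machine-api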
- Click Compute → Nodes. Confirm that the new node is in Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
- From the user interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage, and click Save.
- From the command-line interface
- Apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
<new_node_name>
- Specify the name of the new node.
- Optional: If the failed Amazon Web Services (AWS) instance is not removed automatically, terminate the instance from the AWS console.
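If you prefer the AWS CLI over the console for this cleanup, terminating the instance looks like the following (a sketch; assumes the AWS CLI is configured and that you know the instance ID):
$ aws ec2 terminate-instances --instance-ids <instance_id>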
Verification steps
Verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all the other required OpenShift Data Foundation pods are in Running state.
Verify that the new Object Storage Device (OSD) pods are running on the replacement node:
$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host:
$ oc debug node/<node_name>
$ chroot /host
Display the list of available block devices:
$ lsblk
Check for the crypt keyword beside the one or more ocs-deviceset names.
- If the verification steps fail, contact Red Hat Support.
1.2. OpenShift Data Foundation deployed on VMware
To replace an operational node, see:
- Section 1.2.1, "Replacing an operational VMware node on user-provisioned infrastructure"
- Section 1.2.2, "Replacing an operational VMware node on installer-provisioned infrastructure"
To replace a failed node, see:
- Section 1.2.3, "Replacing a failed VMware node on user-provisioned infrastructure"
- Section 1.2.4, "Replacing a failed VMware node on installer-provisioned infrastructure"
1.2.1. Replacing an operational VMware node on user-provisioned infrastructure
Prerequisites
- Ensure that the replacement nodes are configured with similar infrastructure and resources to the node that you replace.
- You must be logged into the OpenShift Container Platform cluster.
Procedure
- Identify the node and its Virtual Machine (VM) that you need to replace.
Mark the node as unschedulable:
$ oc adm cordon <node_name>
<node_name>
- Specify the name of the node that you need to replace.
Drain the node:
$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets
Important: This activity might take 5 to 10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when you label the new node and it becomes functional.
Delete the node:
$ oc delete nodes <node_name>
Log in to VMware vSphere, and terminate the VM that you identified.
Important: Delete the VM only from the inventory and not from the disk.
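If you manage vSphere from the command line, removing the VM from the inventory only, without touching its disks, can be done with the govc CLI, for example (a sketch; assumes govc is installed and configured with your vCenter credentials):
$ govc vm.power -off -force <vm_name>
$ govc vm.unregister <vm_name>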
- Create a new VM on VMware vSphere with the required infrastructure. See Platform requirements.
- Create a new OpenShift Container Platform worker node using the new VM.
Check for the Certificate Signing Requests (CSRs) related to OpenShift Container Platform that are in Pending state:
$ oc get csr
Approve all the required OpenShift Container Platform CSRs for the new node:
$ oc adm certificate approve <certificate_name>
<certificate_name>
- Specify the name of the CSR.
- Click Compute → Nodes. Confirm that the new node is in Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
- From the user interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage, and click Save.
- From the command-line interface
- Apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
<new_node_name>
- Specify the name of the new node.
Verification steps
Verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all the other required OpenShift Data Foundation pods are in Running state.
Verify that the new Object Storage Device (OSD) pods are running on the replacement node:
$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host:
$ oc debug node/<node_name>
$ chroot /host
Display the list of available block devices:
$ lsblk
Check for the crypt keyword beside the one or more ocs-deviceset names.
- If the verification steps fail, contact Red Hat Support.
1.2.2. Replacing an operational VMware node on installer-provisioned infrastructure
Procedure
- Log in to the OpenShift Web Console, and click Compute → Nodes.
- Identify the node that you need to replace. Take a note of its Machine Name.
Mark the node as unschedulable:
$ oc adm cordon <node_name>
<node_name>
- Specify the name of the node that you need to replace.
Drain the node:
$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets
Important: This activity might take 5 to 10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when you label the new node and it becomes functional.
- Click Compute → Machines. Search for the required machine.
- Beside the required machine, click Action menu (⋮) → Delete Machine.
- Click Delete to confirm the machine is deleted. A new machine is automatically created.
Wait for the new machine to start and transition into Running state.
Important: This activity might take 5 to 10 minutes or more.
- Click Compute → Nodes. Confirm that the new node is in Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
- From the user interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage, and click Save.
- From the command-line interface
- Apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
<new_node_name>
- Specify the name of the new node.
Verification steps
Verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all the other required OpenShift Data Foundation pods are in Running state.
Verify that the new Object Storage Device (OSD) pods are running on the replacement node:
$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host:
$ oc debug node/<node_name>
$ chroot /host
Display the list of available block devices:
$ lsblk
Check for the crypt keyword beside the one or more ocs-deviceset names.
- If the verification steps fail, contact Red Hat Support.
1.2.3. Replacing a failed VMware node on user-provisioned infrastructure
Prerequisites
- Ensure that the replacement nodes are configured with similar infrastructure and resources to the node that you replace.
- You must be logged into the OpenShift Container Platform cluster.
Procedure
- Identify the node and its Virtual Machine (VM) that you need to replace.
Delete the node:
$ oc delete nodes <node_name>
<node_name>
- Specify the name of the node that you need to replace.
Log in to VMware vSphere and terminate the VM that you identified.
Important: Delete the VM only from the inventory and not from the disk.
- Create a new VM on VMware vSphere with the required infrastructure. See Platform requirements.
- Create a new OpenShift Container Platform worker node using the new VM.
Check for the Certificate Signing Requests (CSRs) related to OpenShift Container Platform that are in Pending state:
$ oc get csr
Approve all the required OpenShift Container Platform CSRs for the new node:
$ oc adm certificate approve <certificate_name>
<certificate_name>
- Specify the name of the CSR.
- Click Compute → Nodes. Confirm that the new node is in Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
- From the user interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage, and click Save.
- From the command-line interface
- Apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
<new_node_name>
- Specify the name of the new node.
Verification steps
Verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all the other required OpenShift Data Foundation pods are in Running state.
Verify that the new Object Storage Device (OSD) pods are running on the replacement node:
$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host:
$ oc debug node/<node_name>
$ chroot /host
Display the list of available block devices:
$ lsblk
Check for the crypt keyword beside the one or more ocs-deviceset names.
- If the verification steps fail, contact Red Hat Support.
1.2.4. Replacing a failed VMware node on installer-provisioned infrastructure
Procedure
- Log in to the OpenShift Web Console, and click Compute → Nodes.
- Identify the faulty node, and click on its Machine Name.
- Click Actions → Edit Annotations, and click Add More.
- Add machine.openshift.io/exclude-node-draining, and click Save.
- Click Actions → Delete Machine, and click Delete.
A new machine is automatically created. Wait for the new machine to start.
Important: This activity might take 5 to 10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when you label the new node and it becomes functional.
- Click Compute → Nodes. Confirm that the new node is in Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
- From the user interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage, and click Save.
- From the command-line interface
- Apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
<new_node_name>
- Specify the name of the new node.
- Optional: If the failed Virtual Machine (VM) is not removed automatically, terminate the VM from VMware vSphere.
Verification steps
Verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all the other required OpenShift Data Foundation pods are in Running state.
Verify that the new Object Storage Device (OSD) pods are running on the replacement node:
$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host:
$ oc debug node/<node_name>
$ chroot /host
Display the list of available block devices:
$ lsblk
Check for the crypt keyword beside the one or more ocs-deviceset names.
- If the verification steps fail, contact Red Hat Support.
1.3. OpenShift Data Foundation deployed on Microsoft Azure
1.3.1. Replacing operational nodes on Azure installer-provisioned infrastructure
Procedure
- Log in to the OpenShift Web Console, and click Compute → Nodes.
- Identify the node that you need to replace. Take a note of its Machine Name.
Mark the node as unschedulable:
$ oc adm cordon <node_name>
<node_name>
- Specify the name of the node that you need to replace.
Drain the node:
$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets
Important: This activity might take 5 to 10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when you label the new node and it becomes functional.
- Click Compute → Machines. Search for the required machine.
- Beside the required machine, click the Action menu (⋮) → Delete Machine.
- Click Delete to confirm the machine is deleted. A new machine is automatically created.
Wait for the new machine to start and transition into Running state.
Important: This activity might take 5 to 10 minutes or more.
- Click Compute → Nodes. Confirm that the new node is in Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
- From the user interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage, and click Save.
- From the command-line interface
- Execute the following command to apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
<new_node_name>
- Specify the name of the new node.
Verification steps
Verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all the other required OpenShift Data Foundation pods are in Running state.
Verify that the new Object Storage Device (OSD) pods are running on the replacement node:
$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host:
$ oc debug node/<node_name>
$ chroot /host
Display the list of available block devices:
$ lsblk
Check for the crypt keyword beside the one or more ocs-deviceset names.
- If the verification steps fail, contact Red Hat Support.
1.3.2. Replacing failed nodes on Azure installer-provisioned infrastructure
Procedure
- Log in to the OpenShift Web Console, and click Compute → Nodes.
- Identify the faulty node, and click on its Machine Name.
- Click Actions → Edit Annotations, and click Add More.
- Add machine.openshift.io/exclude-node-draining, and click Save.
- Click Actions → Delete Machine, and click Delete.
A new machine is automatically created. Wait for the new machine to start.
Important: This activity might take 5 to 10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when you label the new node and it becomes functional.
- Click Compute → Nodes. Confirm that the new node is in Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
- From the user interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage, and click Save.
- From the command-line interface
- Apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
<new_node_name>
- Specify the name of the new node.
- Optional: If the failed Azure instance is not removed automatically, terminate the instance from the Azure console.
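If you prefer the Azure CLI over the portal for this cleanup, deleting the failed virtual machine looks like the following (a sketch; assumes the az CLI is logged in and that you know the resource group and VM name):
$ az vm delete --resource-group <resource_group> --name <vm_name> --yes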
Verification steps
Verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all the other required OpenShift Data Foundation pods are in Running state.
Verify that the new Object Storage Device (OSD) pods are running on the replacement node:
$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host:
$ oc debug node/<node_name>
$ chroot /host
Display the list of available block devices:
$ lsblk
Check for the crypt keyword beside the one or more ocs-deviceset names.
- If the verification steps fail, contact Red Hat Support.
1.4. OpenShift Data Foundation deployed on Google Cloud
1.4.1. Replacing operational nodes on Google Cloud installer-provisioned infrastructure
Procedure
- Log in to OpenShift Web Console and click Compute → Nodes.
- Identify the node that needs to be replaced. Take a note of its Machine Name.
Mark the node as unschedulable using the following command:
$ oc adm cordon <node_name>
Drain the node using the following command:
$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets
Important: This activity might take 5 to 10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
- Click Compute → Machines. Search for the required machine.
- Beside the required machine, click the Action menu (⋮) → Delete Machine.
- Click Delete to confirm the machine deletion. A new machine is automatically created.
Wait for the new machine to start and transition into Running state.
Important: This activity might take 5 to 10 minutes or more.
- Click Compute → Nodes. Confirm that the new node is in Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
- From the user interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage, and click Save.
- From the command-line interface
Execute the following command to apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Verification steps
Verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all the other required OpenShift Data Foundation pods are in Running state.
Verify that the new Object Storage Device (OSD) pods are running on the replacement node:
$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host:
$ oc debug node/<node_name>
$ chroot /host
Display the list of available block devices:
$ lsblk
Check for the crypt keyword beside the one or more ocs-deviceset names.
- If the verification steps fail, contact Red Hat Support.
1.4.2. Replacing failed nodes on Google Cloud installer-provisioned infrastructure
Procedure
- Log in to OpenShift Web Console and click Compute → Nodes.
- Identify the faulty node and click on its Machine Name.
- Click Actions → Edit Annotations, and click Add More.
- Add machine.openshift.io/exclude-node-draining, and click Save.
- Click Actions → Delete Machine, and click Delete.
A new machine is automatically created. Wait for the new machine to start.
Important: This activity might take 5 to 10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
- Click Compute → Nodes. Confirm that the new node is in Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
- From the web user interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage, and click Save.
- From the command line interface
- Apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
<new_node_name>
- Specify the name of the new node.
- Optional: If the failed Google Cloud instance is not removed automatically, terminate the instance from Google Cloud console.
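If you prefer the gcloud CLI over the console for this cleanup, deleting the failed instance looks like the following (a sketch; assumes the gcloud CLI is authenticated and that you know the instance name and zone):
$ gcloud compute instances delete <instance_name> --zone=<zone>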
Verification steps
Verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all the other required OpenShift Data Foundation pods are in Running state.
Verify that the new Object Storage Device (OSD) pods are running on the replacement node:
$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host:
$ oc debug node/<node_name>
$ chroot /host
Display the list of available block devices:
$ lsblk
Check for the crypt keyword beside the one or more ocs-deviceset names.
- If the verification steps fail, contact Red Hat Support.
Chapter 2. OpenShift Data Foundation deployed using local storage devices
2.1. Replacing storage nodes on bare metal infrastructure
- To replace an operational node, see Section 2.1.1, “Replacing an operational node on bare metal user-provisioned infrastructure”.
- To replace a failed node, see Section 2.1.2, “Replacing a failed node on bare metal user-provisioned infrastructure”.
2.1.1. Replacing an operational node on bare metal user-provisioned infrastructure
Prerequisites
- Ensure that the replacement nodes are configured with similar infrastructure, resources, and disks to the node that you replace.
- You must be logged into the OpenShift Container Platform cluster.
Procedure
Identify the node, and get the labels on the node that you need to replace:
$ oc get nodes --show-labels | grep <node_name>
<node_name>
- Specify the name of the node that you need to replace.
Identify the monitor pod (if any), and OSDs that are running in the node that you need to replace:
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
Scale down the deployments of the pods identified in the previous step:
For example:
$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
Mark the node as unschedulable:
$ oc adm cordon <node_name>
Drain the node:
$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets
Delete the node:
$ oc delete node <node_name>
Get a new bare-metal machine with the required infrastructure. See Installing on bare metal.
Important: For information about how to replace a master node when you have installed OpenShift Data Foundation on a three-node OpenShift compact bare-metal cluster, see the Backup and Restore guide in the OpenShift Container Platform documentation.
- Create a new OpenShift Container Platform node using the new bare-metal machine.
Check for the Certificate Signing Requests (CSRs) related to OpenShift Container Platform that are in Pending state:
$ oc get csr
Approve all the required OpenShift Container Platform CSRs for the new node:
$ oc adm certificate approve <certificate_name>
<certificate_name>
- Specify the name of the CSR.
- Click Compute → Nodes in the OpenShift Web Console. Confirm that the new node is in Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
- From the user interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage, and click Save.
- From the command-line interface
- Apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
<new_node_name>
- Specify the name of the new node.
Identify the namespace where the OpenShift local storage operator is installed, and assign it to the local_storage_project variable:
$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
$ echo $local_storage_project
Example output:
openshift-local-storage
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Add a new worker node to the
localVolumeDiscovery
andlocalVolumeSet
.Update the
localVolumeDiscovery
definition to include the new node, and remove the failed node:oc edit -n $local_storage_project localvolumediscovery auto-discover-devices
# oc edit -n $local_storage_project localvolumediscovery auto-discover-devices
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Remember to save before exiting the editor.
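An illustrative excerpt of the edited nodeSelector, with newnode.example.com added and server3.example.com removed (a sketch; the hostnames depend on your cluster, and the localVolumeSet edited in the next step uses the same nodeSelector structure):
spec:
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: kubernetes.io/hostname
        operator: In
        values:
        - server1.example.com
        - server2.example.com
        - newnode.example.com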
In this example, server3.example.com is removed, and newnode.example.com is the new node.
Determine the localVolumeSet to edit:
# oc get -n $local_storage_project localvolumeset
Example output:
NAME         AGE
localblock   25h
Update the localVolumeSet definition to include the new node, and remove the failed node:
# oc edit -n $local_storage_project localvolumeset localblock
Remember to save before exiting the editor.
In this example, server3.example.com is removed, and newnode.example.com is the new node.
Verify that the new localblock Persistent Volume (PV) is available:
$ oc get pv | grep localblock | grep Available
Example output:
local-pv-551d950   512Gi   RWO   Delete   Available   localblock   26s
Navigate to the openshift-storage project:
$ oc project openshift-storage
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required:
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=<failed_osd_id> | oc create -f -
<failed_osd_id>
Is the integer in the pod name immediately after the rook-ceph-osd prefix.
You can add comma-separated OSD IDs in the command to remove more than one OSD, for example, FAILED_OSD_IDS=0,1,2.
The FORCE_OSD_REMOVAL value must be changed to true in clusters that only have three OSDs, or in clusters with insufficient space to restore all three replicas of the data after the OSD is removed.
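For example, removing three failed OSDs while forcing removal on a three-OSD cluster might look like the following (a sketch based on the template parameters described above; the IDs are placeholders):
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0,1,2 -p FORCE_OSD_REMOVAL=true | oc create -f -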
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.
A status of Completed confirms that the OSD removal job succeeded.
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
Ensure that the OSD removal is completed:
$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'
Example output:
2022-05-10 06:50:04.501511 I | cephosd: completed removal of OSD 0
Important: If the ocs-osd-removal-job fails, and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1
Identify the Persistent Volume (PV) associated with the Persistent Volume Claim (PVC):
# oc get pv -L kubernetes.io/hostname | grep localblock | grep Released
Example output:
local-pv-d6bf175b   1490Gi   RWO   Delete   Released   openshift-storage/ocs-deviceset-0-data-0-6c5pw   localblock   2d22h   compute-1
If there is a PV in Released state, delete it:
# oc delete pv <persistent_volume>
For example:
# oc delete pv local-pv-d6bf175b
Example output:
persistentvolume "local-pv-d6bf175b" deleted
Identify the crashcollector pod deployment:
$ oc get deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage
If there is an existing crashcollector pod deployment, delete it:
$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage
Delete the ocs-osd-removal-job:
# oc delete -n openshift-storage job ocs-osd-removal-job
Example output:
job.batch "ocs-osd-removal-job" deleted
Verification steps
Verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all other required OpenShift Data Foundation pods are in Running state.
Ensure that the new incremental mon is created and is in the Running state:
$ oc get pod -n openshift-storage | grep mon
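Illustrative example output (pod names, suffixes, and ages vary by cluster):
rook-ceph-mon-a-64556f7659-c2ngc    2/2    Running    0    5h1m
rook-ceph-mon-b-7c8b74dc4d-tt6hd    2/2    Running    0    5h1m
rook-ceph-mon-d-57fb8c657-wg5f2     2/2    Running    0    27m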
OSD and monitor pods might take several minutes to reach the Running state.
Verify that new OSD pods are running on the replacement node:
$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host:
$ oc debug node/<node_name>
$ chroot /host
Display the list of available block devices:
$ lsblk
Check for the crypt keyword beside the one or more ocs-deviceset names.
- If the verification steps fail, contact Red Hat Support.
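As a shortcut for the optional encryption check above, the same inspection can be run non-interactively from your workstation. This is a minimal sketch that assumes oc debug is permitted on the node; it prints only the device-mapper entries of type crypt:

$ oc debug node/<node_name> -- chroot /host lsblk -o NAME,TYPE,MOUNTPOINT | grep crypt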
2.1.2. Replacing a failed node on bare metal user-provisioned infrastructure
Prerequisites
- Ensure that the replacement nodes are configured with similar infrastructure, resources, and disks to the node that you replace.
- You must be logged into the OpenShift Container Platform cluster.
Procedure
Identify the node, and get the labels on the node that you need to replace:
$ oc get nodes --show-labels | grep <node_name>

<node_name>
- Specify the name of the node that you need to replace.

Identify the monitor pod (if any), and OSDs that are running in the node that you need to replace:

$ oc get pods -n openshift-storage -o wide | grep -i <node_name>

Scale down the deployments of the pods identified in the previous step:

For example:

$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage

Mark the node as unschedulable:

$ oc adm cordon <node_name>

Remove the pods which are in Terminating state:

$ oc get pods -A -o wide | grep -i <node_name> | awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2 " --grace-period=0 " " --force ")}'

Drain the node:

$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets

Delete the node:

$ oc delete node <node_name>

Get a new bare-metal machine with the required infrastructure. See Installing on bare metal.

Important: For information about how to replace a master node when you have installed OpenShift Data Foundation on a three-node OpenShift compact bare-metal cluster, see the Backup and Restore guide in the OpenShift Container Platform documentation.

- Create a new OpenShift Container Platform node using the new bare-metal machine.

Check for the Certificate Signing Requests (CSRs) related to OpenShift Container Platform that are in Pending state:

$ oc get csr

Approve all the required OpenShift Container Platform CSRs for the new node:

$ oc adm certificate approve <certificate_name>

<certificate_name>
- Specify the name of the CSR.
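A new node typically generates more than one CSR (a client request followed by a serving request). If you want to approve every currently pending CSR in one step, the following standard one-liner can be used, assuming all pending requests on the cluster belong to the new node:

$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve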
- Click Compute → Nodes in the OpenShift Web Console. Confirm that the new node is in Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
- From the user interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage, and click Save.
- From the command-line interface
- Apply the OpenShift Data Foundation label to the new node:

$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""

<new_node_name>
- Specify the name of the new node.
Identify the namespace where the OpenShift local storage operator is installed, and assign it to the local_storage_project variable:

$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
$ echo $local_storage_project

Example output:

openshift-local-storage

Add a new worker node to the localVolumeDiscovery and localVolumeSet.

Update the localVolumeDiscovery definition to include the new node, and remove the failed node:

# oc edit -n $local_storage_project localvolumediscovery auto-discover-devices

Remember to save before exiting the editor. In this example, server3.example.com is removed, and newnode.example.com is the new node.

Determine the localVolumeSet to edit:

# oc get -n $local_storage_project localvolumeset

Example output:

NAME         AGE
localblock   25h

Update the localVolumeSet definition to include the new node, and remove the failed node:

# oc edit -n $local_storage_project localvolumeset localblock

Remember to save before exiting the editor. In this example, server3.example.com is removed, and newnode.example.com is the new node.
Verify that the new localblock Persistent Volume (PV) is available:

$ oc get pv | grep localblock | grep Available

Example output:

local-pv-551d950   512Gi   RWO   Delete   Available   localblock   26s

Navigate to the openshift-storage project:

$ oc project openshift-storage
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required:

$ oc process -n openshift-storage ocs-osd-removal \
    -p FAILED_OSD_IDS=<failed_osd_id> | oc create -f -

<failed_osd_id>
- Is the integer in the pod name immediately after the rook-ceph-osd prefix. You can add comma separated OSD IDs in the command to remove more than one OSD, for example, FAILED_OSD_IDS=0,1,2.

The FORCE_OSD_REMOVAL value must be changed to true in clusters that only have three OSDs, or clusters with insufficient space to restore all three replicas of the data after the OSD is removed.
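For illustration, a hypothetical invocation that removes OSDs 0 and 1 from a three-OSD cluster and sets the flag described above might look like the following; it assumes your OpenShift Data Foundation version exposes FORCE_OSD_REMOVAL as a parameter of the ocs-osd-removal template, and the OSD IDs are placeholders:

$ oc process -n openshift-storage ocs-osd-removal \
    -p FAILED_OSD_IDS=0,1 -p FORCE_OSD_REMOVAL=true | oc create -f -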
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod. A status of Completed confirms that the OSD removal job succeeded.

# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
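If you prefer to block until the job finishes rather than polling its pod status, a minimal sketch using oc wait (assuming the default job name shown above; the timeout value is arbitrary):

$ oc wait --for=condition=complete job/ocs-osd-removal-job -n openshift-storage --timeout=600s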
Ensure that the OSD removal is completed:

$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'

Example output:

2022-05-10 06:50:04.501511 I | cephosd: completed removal of OSD 0

Important: If the ocs-osd-removal-job fails, and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:

# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1
Identify the Persistent Volume (PV) associated with the Persistent Volume Claim (PVC):

# oc get pv -L kubernetes.io/hostname | grep localblock | grep Released

Example output:

local-pv-d6bf175b 1490Gi RWO Delete Released openshift-storage/ocs-deviceset-0-data-0-6c5pw localblock 2d22h compute-1

If there is a PV in Released state, delete it:

# oc delete pv <persistent_volume>

For example:

# oc delete pv local-pv-d6bf175b

Example output:

persistentvolume "local-pv-d6bf175b" deleted

Identify the crashcollector pod deployment:

$ oc get deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage

If there is an existing crashcollector pod deployment, delete it:

$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage

Delete the ocs-osd-removal-job:

# oc delete -n openshift-storage job ocs-osd-removal-job

Example output:

job.batch "ocs-osd-removal-job" deleted
Verification steps
Verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1

Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:

- csi-cephfsplugin-*
- csi-rbdplugin-*

Verify that all other required OpenShift Data Foundation pods are in Running state.

Ensure that the new incremental mon is created, and is in the Running state:

$ oc get pod -n openshift-storage | grep mon

The OSD and monitor pod might take several minutes to get to the Running state.

Verify that new OSD pods are running on the replacement node:

$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.

For each of the new nodes identified in the previous step, do the following:

Create a debug pod and open a chroot environment for the one or more selected hosts:

$ oc debug node/<node_name>
$ chroot /host

Display the list of available block devices:

$ lsblk

Check for the crypt keyword beside the one or more ocs-deviceset names.
- If the verification steps fail, contact Red Hat Support.
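As an additional optional check after the node replacement, you can confirm overall Ceph health from the command line. This is a minimal sketch and assumes the rook-ceph-tools toolbox deployment is enabled in the openshift-storage namespace; a HEALTH_OK (or a temporarily recovering HEALTH_WARN) status indicates that the cluster has absorbed the new node:

$ oc exec -n openshift-storage deploy/rook-ceph-tools -- ceph status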
2.2. Replacing storage nodes on IBM Z or IBM® LinuxONE infrastructure
You can choose one of the following procedures to replace storage nodes:
2.2.1. Replacing operational nodes on IBM Z or IBM® LinuxONE infrastructure
Use this procedure to replace an operational node on IBM Z or IBM® LinuxONE infrastructure.
Procedure
Identify the node and get labels on the node to be replaced. Make a note of the rack label.
$ oc get nodes --show-labels | grep <node_name>

Identify the mon (if any) and object storage device (OSD) pods that are running in the node to be replaced.

$ oc get pods -n openshift-storage -o wide | grep -i <node_name>

Scale down the deployments of the pods identified in the previous step.

For example:

$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage

Mark the node as unschedulable.

$ oc adm cordon <node_name>

Remove the pods which are in the Terminating state.

$ oc get pods -A -o wide | grep -i <node_name> | awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2 " --grace-period=0 " " --force ")}'

Drain the node.

$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets

Delete the node.

$ oc delete node <node_name>

- Get a new IBM Z storage node as a replacement.

Check for certificate signing requests (CSRs) related to OpenShift Data Foundation that are in Pending state:

$ oc get csr

Approve all required OpenShift Data Foundation CSRs for the new node:

$ oc adm certificate approve <Certificate_Name>

- Click Compute → Nodes in the OpenShift Web Console, and confirm that the new node is in Ready state.

Apply the openshift-storage label to the new node using any one of the following:

- From the user interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage, and click Save.
- From the command-line interface
- Execute the following command to apply the OpenShift Data Foundation label to the new node:

$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""

Add a new worker node to localVolumeDiscovery and localVolumeSet.

Update the localVolumeDiscovery definition to include the new node and remove the failed node. Remember to save before exiting the editor. In this example, server3.example.com is removed and newnode.example.com is the new node.

Determine which localVolumeSet to edit. Replace local-storage-project in the following commands with the name of your local storage project. The default project name is openshift-local-storage in OpenShift Data Foundation 4.6 and later. Previous versions use local-storage by default.

# oc get -n local-storage-project localvolumeset

Example output:

NAME         AGE
localblock   25h
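Optionally, instead of substituting the project name by hand, you can reuse the discovery command from the other procedures in this guide and use the variable in the commands that follow:

$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
$ oc get -n $local_storage_project localvolumeset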
Update the localVolumeSet definition to include the new node and remove the failed node. Remember to save before exiting the editor. In this example, server3.example.com is removed and newnode.example.com is the new node.

Verify that the new localblock PV is available.

Change to the openshift-storage project.

$ oc project openshift-storage

Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.

Identify the PVC, as we need to delete the PV associated with that specific PVC afterwards.

$ osd_id_to_remove=1
$ oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc

where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-1.

Example output:

ceph.rook.io/pvc: ocs-deviceset-localblock-0-data-0-g2mmc
ceph.rook.io/pvc: ocs-deviceset-localblock-0-data-0-g2mmc

In this example, the PVC name is ocs-deviceset-localblock-0-data-0-g2mmc.
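If you want to capture the PVC name in a shell variable for the PV lookup later in this procedure, a minimal sketch (assuming the grep above returns the annotation shown in the example output):

$ pvc_name=$(oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | awk '/ceph.rook.io\/pvc:/ {print $2; exit}')
$ echo ${pvc_name}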
Remove the failed OSD from the cluster.

$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} | oc create -f -

You can remove more than one OSD by adding comma separated OSD IDs in the command. (For example: FAILED_OSD_IDS=0,1,2)

Warning: This step results in the OSD being completely removed from the cluster. Ensure that the correct value of osd_id_to_remove is provided.

Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal pod. A status of Completed confirms that the OSD removal job succeeded.

# oc get pod -l job-name=ocs-osd-removal-osd_id_to_remove -n openshift-storage

Note: If ocs-osd-removal fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:

# oc logs -l job-name=ocs-osd-removal-osd_id_to_remove -n openshift-storage --tail=-1

It may be necessary to manually clean up the removed OSD as follows:

ceph osd crush remove osd.osd_id_to_remove
ceph osd rm osd_id_to_remove
ceph auth del osd.osd_id_to_remove
ceph osd crush rm osd_id_to_remove
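These ceph commands are not available on the cluster nodes themselves; they need to be run from a pod that has Ceph admin access, for example the rook-ceph-tools toolbox if it is deployed in the openshift-storage namespace. A hypothetical invocation for OSD ID 1 might look like the following:

$ oc exec -n openshift-storage deploy/rook-ceph-tools -- ceph osd crush remove osd.1
$ oc exec -n openshift-storage deploy/rook-ceph-tools -- ceph osd rm 1
$ oc exec -n openshift-storage deploy/rook-ceph-tools -- ceph auth del osd.1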
Delete the PV associated with the failed node.

Identify the PV associated with the PVC. The PVC name must be identical to the name that is obtained while removing the failed OSD from the cluster.

# oc get pv -L kubernetes.io/hostname | grep localblock | grep Released

Example output:

local-pv-5c9b8982   500Gi   RWO   Delete   Released   openshift-storage/ocs-deviceset-localblock-0-data-0-g2mmc   localblock   24h   worker-0

If there is a PV in Released state, delete it.

# oc delete pv <persistent-volume>

For example:

# oc delete pv local-pv-5c9b8982

Example output:

persistentvolume "local-pv-5c9b8982" deleted

Identify the crashcollector pod deployment.

$ oc get deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage

If there is an existing crashcollector pod deployment, delete it.

$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage

Delete the ocs-osd-removal job.

# oc delete job ocs-osd-removal-${osd_id_to_remove}

Example output:

job.batch "ocs-osd-removal-0" deleted
Verification steps
Verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1

Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:

- csi-cephfsplugin-*
- csi-rbdplugin-*

- Verify that all the other required OpenShift Data Foundation pods are in Running state.

Verify that new Object Storage Device (OSD) pods are running on the replacement node:

$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd

Optional: If data encryption is enabled on the cluster, verify that the new OSD devices are encrypted.

For each of the new nodes identified in the previous step, do the following:

Create a debug pod and open a chroot environment for the one or more selected hosts:

$ oc debug node/<node_name>
$ chroot /host

Display the list of available block devices:

$ lsblk

Check for the crypt keyword beside the one or more ocs-deviceset names.
- If the verification steps fail, contact Red Hat Support.
2.2.2. Replacing failed nodes on IBM Z or IBM® LinuxONE infrastructure
Procedure
- Log in to the OpenShift Web Console, and click Compute → Nodes.
- Identify the faulty node, and click on its Machine Name.
- Click Actions → Edit Annotations, and click Add More.
- Add machine.openshift.io/exclude-node-draining, and click Save.
- Click Actions → Delete Machine, and click Delete.
A new machine is automatically created. Wait for the new machine to start.
Important: This activity might take at least 5 - 10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when you label the new node, and it is functional.
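To follow the replacement from the command line instead of the console, you can watch the machine objects come up; this is a minimal sketch and assumes the default openshift-machine-api namespace used by the Machine API:

$ oc get machines -n openshift-machine-api -w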
- Click Compute → Nodes. Confirm that the new node is in Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
- From the user interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage, and click Save.
- From the command-line interface
- Apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""

<new_node_name>
Verification steps
Verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1

Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:

- csi-cephfsplugin-*
- csi-rbdplugin-*

- Verify that all the other required OpenShift Data Foundation pods are in Running state.

Verify that new Object Storage Device (OSD) pods are running on the replacement node:

$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd

Optional: If data encryption is enabled on the cluster, verify that the new OSD devices are encrypted.

For each of the new nodes identified in the previous step, do the following:

Create a debug pod and open a chroot environment for the one or more selected hosts:

$ oc debug node/<node_name>
$ chroot /host

Display the list of available block devices:

$ lsblk

Check for the crypt keyword beside the one or more ocs-deviceset names.
- If the verification steps fail, contact Red Hat Support.
2.3. Replacing storage nodes on IBM Power infrastructure
For OpenShift Data Foundation, you can perform node replacement proactively for an operational node, and reactively for a failed node, for the deployments related to IBM Power.
2.3.1. Replacing an operational or failed storage node on IBM Power
Prerequisites
- Ensure that the replacement nodes are configured with similar infrastructure and resources to the node that you replace.
- You must be logged into the OpenShift Container Platform cluster.
Procedure
Identify the node, and get the labels on the node that you need to replace:
$ oc get nodes --show-labels | grep <node_name>

<node_name>
- Specify the name of the node that you need to replace.

Identify the mon (if any), and Object Storage Device (OSD) pods that are running in the node that you need to replace:

$ oc get pods -n openshift-storage -o wide | grep -i <node_name>

Scale down the deployments of the pods identified in the previous step:

For example:

$ oc scale deployment rook-ceph-mon-a --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-1 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage

Mark the node as unschedulable:

$ oc adm cordon <node_name>

Remove the pods which are in Terminating state:

$ oc get pods -A -o wide | grep -i <node_name> | awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2 " --grace-period=0 " " --force ")}'

Drain the node:

$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets

Delete the node:

$ oc delete node <node_name>

- Get a new IBM Power machine with the required infrastructure. See Installing a cluster on IBM Power.
- Create a new OpenShift Container Platform node using the new IBM Power machine.

Check for the Certificate Signing Requests (CSRs) related to OpenShift Container Platform that are in Pending state:

$ oc get csr

Approve all the required OpenShift Container Platform CSRs for the new node:

$ oc adm certificate approve <certificate_name>

<certificate_name>
- Specify the name of the CSR.
- Click Compute → Nodes in the OpenShift Web Console. Confirm that the new node is in Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
- From the user interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage, and click Save.
- From the command-line interface
- Apply the OpenShift Data Foundation label to the new node:

$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=''

<new_node_name>
- Specify the name of the new node.
Identify the namespace where the OpenShift local storage operator is installed, and assign it to the local_storage_project variable:

$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
$ echo $local_storage_project

Example output:

openshift-local-storage

Add the newly added worker node to the localVolume.

Determine the localVolume you need to edit:

# oc get -n $local_storage_project localvolume

Example output:

NAME         AGE
localblock   25h

Update the localVolume definition to include the new node, and remove the failed node:

# oc edit -n $local_storage_project localvolume localblock

Remember to save before exiting the editor. In this example, worker-0 is removed and worker-3 is the new node.
Verify that the new localblock Persistent Volume (PV) is available:

$ oc get pv | grep localblock

Example output:

NAME               CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS      CLAIM                                     STORAGECLASS   AGE
local-pv-3e8964d3  500Gi      RWO           Delete          Bound       ocs-deviceset-localblock-2-data-0-mdbg9   localblock     25h
local-pv-414755e0  500Gi      RWO           Delete          Bound       ocs-deviceset-localblock-1-data-0-4cslf   localblock     25h
local-pv-b481410   500Gi      RWO           Delete          Available                                             localblock     3m24s
local-pv-5c9b8982  500Gi      RWO           Delete          Bound       ocs-deviceset-localblock-0-data-0-g2mmc   localblock     25h

Navigate to the openshift-storage project:

$ oc project openshift-storage
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.

Identify the Persistent Volume Claim (PVC):

$ osd_id_to_remove=1
$ oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc

where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-1.

Example output:

ceph.rook.io/pvc: ocs-deviceset-localblock-0-data-0-g2mmc
ceph.rook.io/pvc: ocs-deviceset-localblock-0-data-0-g2mmc

Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required:

$ oc process -n openshift-storage ocs-osd-removal \
    -p FAILED_OSD_IDS=<failed_osd_id> | oc create -f -

<failed_osd_id>
- Is the integer in the pod name immediately after the rook-ceph-osd prefix. You can add comma separated OSD IDs in the command to remove more than one OSD, for example, FAILED_OSD_IDS=0,1,2.

The FORCE_OSD_REMOVAL value must be changed to true in clusters that only have three OSDs, or clusters with insufficient space to restore all three replicas of the data after the OSD is removed.

Warning: This step results in the OSD being completely removed from the cluster. Ensure that the correct value of osd_id_to_remove is provided.
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod. A status of Completed confirms that the OSD removal job has succeeded.

# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage

Ensure that the OSD removal is completed:

$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'

Example output:

2022-05-10 06:50:04.501511 I | cephosd: completed removal of OSD 0

Important: If the ocs-osd-removal-job fails, and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:

# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1

Delete the PV associated with the failed node.

Identify the PV associated with the PVC:

# oc get pv -L kubernetes.io/hostname | grep localblock | grep Released

Example output:

local-pv-5c9b8982   500Gi   RWO   Delete   Released   openshift-storage/ocs-deviceset-localblock-0-data-0-g2mmc   localblock   24h   worker-0

The PVC name must be identical to the name that is obtained while removing the failed OSD from the cluster.
If there is a PV in Released state, delete it:

# oc delete pv <persistent_volume>

For example:

# oc delete pv local-pv-5c9b8982

Example output:

persistentvolume "local-pv-5c9b8982" deleted

Identify the crashcollector pod deployment:

$ oc get deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage

If there is an existing crashcollector pod deployment, delete it:

$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage

Delete the ocs-osd-removal-job:

# oc delete -n openshift-storage job ocs-osd-removal-job

Example output:

job.batch "ocs-osd-removal-job" deleted
Verification steps
Verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1

Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:

- csi-cephfsplugin-*
- csi-rbdplugin-*

Verify that all other required OpenShift Data Foundation pods are in Running state.

Ensure that the new incremental mon is created and is in the Running state:

$ oc get pod -n openshift-storage | grep mon

Example output:

rook-ceph-mon-b-74f6dc9dd6-4llzq   1/1   Running   0   6h14m
rook-ceph-mon-c-74948755c-h7wtx    1/1   Running   0   4h24m
rook-ceph-mon-d-598f69869b-4bv49   1/1   Running   0   162m

The OSD and monitor pod might take several minutes to get to the Running state.

Verify that the new OSD pods are running on the replacement node:

$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.

For each of the new nodes identified in the previous step, do the following:

Create a debug pod and open a chroot environment for the one or more selected hosts:

$ oc debug node/<node_name>
$ chroot /host

Display the list of available block devices:

$ lsblk

Check for the crypt keyword beside the one or more ocs-deviceset names.
- If the verification steps fail, contact Red Hat Support.
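As an additional optional check, you can list every OSD pod together with the node it is scheduled on, which makes it easy to spot whether the replacement node carries its share of OSDs. This is a minimal sketch and assumes the standard app=rook-ceph-osd pod label applied by Rook:

$ oc get pods -n openshift-storage -l app=rook-ceph-osd -o wide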
2.4. Replacing storage nodes on VMware infrastructure
To replace an operational node, see:
To replace a failed node, see:
2.4.1. Replacing an operational node on VMware user-provisioned infrastructure
Prerequisites
- Ensure that the replacement nodes are configured with similar infrastructure, resources, and disks to the node that you replace.
- You must be logged into the OpenShift Container Platform cluster.
Procedure
Identify the node, and get the labels on the node that you need to replace:
$ oc get nodes --show-labels | grep <node_name>

<node_name>
- Specify the name of the node that you need to replace.

Identify the monitor pod (if any), and OSDs that are running in the node that you need to replace:

$ oc get pods -n openshift-storage -o wide | grep -i <node_name>

Scale down the deployments of the pods identified in the previous step:

For example:

$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage

Mark the node as unschedulable:

$ oc adm cordon <node_name>

Drain the node:

$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets

Delete the node:

$ oc delete node <node_name>

- Log in to VMware vSphere and terminate the Virtual Machine (VM) that you have identified.
- Create a new VM on VMware vSphere with the required infrastructure. See Infrastructure requirements.
- Create a new OpenShift Container Platform worker node using the new VM.

Check for the Certificate Signing Requests (CSRs) related to OpenShift Container Platform that are in Pending state:

$ oc get csr

Approve all the required OpenShift Container Platform CSRs for the new node:

$ oc adm certificate approve <certificate_name>

<certificate_name>
- Specify the name of the CSR.
- Click Compute → Nodes in the OpenShift Web Console. Confirm that the new node is in Ready state.
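If you prefer to wait from the command line rather than refreshing the console, a minimal sketch using oc wait (the timeout value is arbitrary and can be adjusted):

$ oc wait --for=condition=Ready node/<new_node_name> --timeout=15m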
Apply the OpenShift Data Foundation label to the new node using any one of the following:
- From the user interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage, and click Save.
- From the command-line interface
- Apply the OpenShift Data Foundation label to the new node:

$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""

<new_node_name>
- Specify the name of the new node.
Identify the namespace where the OpenShift local storage operator is installed, and assign it to the local_storage_project variable:

$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
$ echo $local_storage_project

Example output:

openshift-local-storage

Add a new worker node to the localVolumeDiscovery and localVolumeSet.

Update the localVolumeDiscovery definition to include the new node, and remove the failed node:

# oc edit -n $local_storage_project localvolumediscovery auto-discover-devices

Remember to save before exiting the editor. In this example, server3.example.com is removed, and newnode.example.com is the new node.

Determine the localVolumeSet to edit:

# oc get -n $local_storage_project localvolumeset

Example output:

NAME         AGE
localblock   25h

Update the localVolumeSet definition to include the new node, and remove the failed node:

# oc edit -n $local_storage_project localvolumeset localblock

Remember to save before exiting the editor. In this example, server3.example.com is removed and newnode.example.com is the new node.
Verify that the new localblock Persistent Volume (PV) is available:

$ oc get pv | grep localblock | grep Available

Example output:

local-pv-551d950   512Gi   RWO   Delete   Available   localblock   26s

Navigate to the openshift-storage project:

$ oc project openshift-storage
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required:

$ oc process -n openshift-storage ocs-osd-removal \
    -p FAILED_OSD_IDS=<failed_osd_id> | oc create -f -

<failed_osd_id>
- Is the integer in the pod name immediately after the rook-ceph-osd prefix. You can add comma separated OSD IDs in the command to remove more than one OSD, for example, FAILED_OSD_IDS=0,1,2.

The FORCE_OSD_REMOVAL value must be changed to true in clusters that only have three OSDs, or clusters with insufficient space to restore all three replicas of the data after the OSD is removed.
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod. A status of Completed confirms that the OSD removal job succeeded.

# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage

Ensure that the OSD removal is completed:

$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'

Example output:

2022-05-10 06:50:04.501511 I | cephosd: completed removal of OSD 0

Important: If the ocs-osd-removal-job fails, and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:

# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1
Identify the Persistent Volume (PV) associated with the Persistent Volume Claim (PVC):

# oc get pv -L kubernetes.io/hostname | grep localblock | grep Released

Example output:

local-pv-d6bf175b 1490Gi RWO Delete Released openshift-storage/ocs-deviceset-0-data-0-6c5pw localblock 2d22h compute-1

If there is a PV in Released state, delete it:

# oc delete pv <persistent_volume>

For example:

# oc delete pv local-pv-d6bf175b

Example output:

persistentvolume "local-pv-d6bf175b" deleted

Identify the crashcollector pod deployment:

$ oc get deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage

If there is an existing crashcollector pod deployment, delete it:

$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage

Delete the ocs-osd-removal-job:

# oc delete -n openshift-storage job ocs-osd-removal-job

Example output:

job.batch "ocs-osd-removal-job" deleted
Verification steps
Verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1

Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:

- csi-cephfsplugin-*
- csi-rbdplugin-*

Verify that all other required OpenShift Data Foundation pods are in Running state.

Ensure that the new incremental mon is created, and is in the Running state:

$ oc get pod -n openshift-storage | grep mon

The OSD and monitor pod might take several minutes to get to the Running state.

Verify that new OSD pods are running on the replacement node:

$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.

For each of the new nodes identified in the previous step, do the following:

Create a debug pod and open a chroot environment for the one or more selected hosts:

$ oc debug node/<node_name>
$ chroot /host

Display the list of available block devices:

$ lsblk

Check for the crypt keyword beside the one or more ocs-deviceset names.
- If the verification steps fail, contact Red Hat Support.
2.4.2. Replacing an operational node on VMware installer-provisioned infrastructure
Prerequisites
- Ensure that the replacement nodes are configured with similar infrastructure, resources, and disks to the node that you replace.
- You must be logged into the OpenShift Container Platform cluster.
Procedure
- Log in to the OpenShift Web Console, and click Compute → Nodes.
- Identify the node that you need to replace. Take a note of its Machine Name.
Get labels on the node:
$ oc get nodes --show-labels | grep <node_name>

<node_name>
- Specify the name of the node that you need to replace.

Identify the mon (if any), and Object Storage Devices (OSDs) that are running in the node:

$ oc get pods -n openshift-storage -o wide | grep -i <node_name>

Scale down the deployments of the pods that you identified in the previous step:

For example:

$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage

Mark the node as unschedulable:

$ oc adm cordon <node_name>

Drain the node:

$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - Click Compute → Machines. Search for the required machine.
- Beside the required machine, click Action menu (⋮) → Delete Machine.
- Click Delete to confirm the machine deletion. A new machine is automatically created.
Wait for the new machine to start and transition into Running state.
Important: This activity might take 5 - 10 minutes or more.
- Click Compute → Nodes in the OpenShift Web Console. Confirm that the new node is in Ready state.
- Physically add a new device to the node.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
- From the user interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage, and click Save.
- From the command-line interface
- Apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
<new_node_name>
- Specify the name of the new node.
Identify the namespace where the OpenShift local storage operator is installed, and assign it to the local_storage_project variable:
$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
Verify the value of the variable:
$ echo $local_storage_project
Example output:
openshift-local-storage
Add a new worker node to the localVolumeDiscovery and localVolumeSet.
Update the localVolumeDiscovery definition to include the new node and remove the failed node:
# oc edit -n $local_storage_project localvolumediscovery auto-discover-devices
Remember to save before exiting the editor.
In this example, server3.example.com is removed, and newnode.example.com is the new node.
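A minimal sketch of the part of the definition that you typically change is shown below, assuming the standard spec.nodeSelector layout of the LocalVolumeDiscovery resource; server1.example.com and server2.example.com are hypothetical placeholders for the existing storage nodes.
spec:
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: kubernetes.io/hostname
        operator: In
        values:
        - server1.example.com
        - server2.example.com
        - newnode.example.com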
Determine the localVolumeSet you need to edit:
# oc get -n $local_storage_project localvolumeset
Example output:
NAME         AGE
localblock   25h
Update the localVolumeSet definition to include the new node and remove the failed node:
# oc edit -n $local_storage_project localvolumeset localblock
Remember to save before exiting the editor.
In this example, server3.example.com is removed, and newnode.example.com is the new node.
Verify that the new localblock Persistent Volume (PV) is available:
$ oc get pv | grep localblock | grep Available
Example output:
local-pv-551d950   512Gi   RWO   Delete   Available   localblock   26s
Navigate to the openshift-storage project:
$ oc project openshift-storage
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required:
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=<failed_osd_id> | oc create -f -
<failed_osd_id>
- Is the integer in the pod name immediately after the rook-ceph-osd prefix.
You can add comma-separated OSD IDs in the command to remove more than one OSD, for example, FAILED_OSD_IDS=0,1,2.
The FORCE_OSD_REMOVAL value must be changed to true in clusters that have only three OSDs, or in clusters with insufficient space to restore all three replicas of the data after the OSD is removed.
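For instance, a sketch of the same removal command with forced removal enabled might look like the following, assuming your version of the ocs-osd-removal template accepts FORCE_OSD_REMOVAL as a parameter:
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=<failed_osd_id> -p FORCE_OSD_REMOVAL=true | oc create -f -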
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.
A status of Completed confirms that the OSD removal job succeeded.
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
Ensure that the OSD removal is completed:
$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'
Example output:
2022-05-10 06:50:04.501511 I | cephosd: completed removal of OSD 0
Important: If the ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging.
For example:
# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1
Identify the PV associated with the Persistent Volume Claim (PVC):
# oc get pv -L kubernetes.io/hostname | grep localblock | grep Released
Example output:
local-pv-d6bf175b   1490Gi   RWO   Delete   Released   openshift-storage/ocs-deviceset-0-data-0-6c5pw   localblock   2d22h   compute-1
If there is a PV in Released state, delete it:
# oc delete pv <persistent_volume>
For example:
# oc delete pv local-pv-d6bf175b
Example output:
persistentvolume "local-pv-d6bf175b" deleted
Identify the crashcollector pod deployment:
$ oc get deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage
If there is an existing crashcollector pod deployment, delete it:
$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage
Delete the ocs-osd-removal-job:
# oc delete -n openshift-storage job ocs-osd-removal-job
Example output:
job.batch "ocs-osd-removal-job" deleted
Verification steps
Run the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all other required OpenShift Data Foundation pods are in Running state.
Ensure that the new incremental mon is created and is in the Running state:
$ oc get pod -n openshift-storage | grep mon
OSD and monitor pods might take several minutes to get to the Running state.
Verify that new OSD pods are running on the replacement node:
$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected hosts:
$ oc debug node/<node_name>
$ chroot /host
Display the list of available block devices:
$ lsblk
Check for the crypt keyword beside the ocs-deviceset names.
- If the verification steps fail, contact Red Hat Support.
2.4.3. Replacing a failed node on VMware user-provisioned infrastructure
Prerequisites
- Ensure that the replacement nodes are configured with similar infrastructure, resources, and disks to the node that you replace.
- You must be logged into the OpenShift Container Platform cluster.
Procedure
Identify the node, and get the labels on the node that you need to replace:
$ oc get nodes --show-labels | grep <node_name>
<node_name>
- Specify the name of the node that you need to replace.
Identify the monitor pod (if any), and OSDs that are running in the node that you need to replace:
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
Scale down the deployments of the pods identified in the previous step:
For example:
$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
Mark the node as unschedulable:
$ oc adm cordon <node_name>
Remove the pods which are in Terminating state:
$ oc get pods -A -o wide | grep -i <node_name> | awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2 " --grace-period=0 " " --force ")}'
Drain the node:
$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets
Delete the node:
$ oc delete node <node_name>
- Log in to VMware vSphere and terminate the Virtual Machine (VM) that you have identified.
- Create a new VM on VMware vSphere with the required infrastructure. See Infrastructure requirements.
- Create a new OpenShift Container Platform worker node using the new VM.
Check for the Certificate Signing Requests (CSRs) related to OpenShift Container Platform that are in Pending state:
$ oc get csr
Approve all the required OpenShift Container Platform CSRs for the new node:
$ oc adm certificate approve <certificate_name>
<certificate_name>
- Specify the name of the CSR.
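If several CSRs are pending, approving them one at a time can be tedious. The following is one possible convenience one-liner that approves every CSR that does not yet have a status, that is, every pending CSR; review the list of pending CSRs before you run it:
$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve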
- Click Compute → Nodes in the OpenShift Web Console. Confirm that the new node is in Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
- From the user interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage, and click Save.
- From the command-line interface
- Apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
<new_node_name>
- Specify the name of the new node.
Identify the namespace where the OpenShift local storage operator is installed, and assign it to the local_storage_project variable:
$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
Verify the value of the variable:
$ echo $local_storage_project
Example output:
openshift-local-storage
Add a new worker node to the localVolumeDiscovery and localVolumeSet.
Update the localVolumeDiscovery definition to include the new node, and remove the failed node:
# oc edit -n $local_storage_project localvolumediscovery auto-discover-devices
Remember to save before exiting the editor.
In this example, server3.example.com is removed, and newnode.example.com is the new node.
Determine the localVolumeSet to edit:
# oc get -n $local_storage_project localvolumeset
Example output:
NAME         AGE
localblock   25h
Update the localVolumeSet definition to include the new node, and remove the failed node:
# oc edit -n $local_storage_project localvolumeset localblock
Remember to save before exiting the editor.
In this example, server3.example.com is removed, and newnode.example.com is the new node.
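A minimal sketch of the part of the localVolumeSet definition that typically changes is shown below, assuming the standard spec.nodeSelector layout; server1.example.com and server2.example.com are hypothetical placeholders for the existing storage nodes.
spec:
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: kubernetes.io/hostname
        operator: In
        values:
        - server1.example.com
        - server2.example.com
        - newnode.example.com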
Verify that the new localblock Persistent Volume (PV) is available:
$ oc get pv | grep localblock | grep Available
Example output:
local-pv-551d950   512Gi   RWO   Delete   Available   localblock   26s
Navigate to the openshift-storage project:
$ oc project openshift-storage
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required:
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=<failed_osd_id> | oc create -f -
<failed_osd_id>
- Is the integer in the pod name immediately after the rook-ceph-osd prefix.
You can add comma-separated OSD IDs in the command to remove more than one OSD, for example, FAILED_OSD_IDS=0,1,2.
The FORCE_OSD_REMOVAL value must be changed to true in clusters that have only three OSDs, or in clusters with insufficient space to restore all three replicas of the data after the OSD is removed.
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.
A status of Completed confirms that the OSD removal job succeeded.
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
Ensure that the OSD removal is completed:
$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'
Example output:
2022-05-10 06:50:04.501511 I | cephosd: completed removal of OSD 0
Important: If the ocs-osd-removal-job fails, and the pod is not in the expected Completed state, check the pod logs for further debugging.
For example:
# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1
Identify the Persistent Volume (PV) associated with the Persistent Volume Claim (PVC):
# oc get pv -L kubernetes.io/hostname | grep localblock | grep Released
Example output:
local-pv-d6bf175b   1490Gi   RWO   Delete   Released   openshift-storage/ocs-deviceset-0-data-0-6c5pw   localblock   2d22h   compute-1
If there is a PV in Released state, delete it:
# oc delete pv <persistent_volume>
For example:
# oc delete pv local-pv-d6bf175b
Example output:
persistentvolume "local-pv-d6bf175b" deleted
Identify the crashcollector pod deployment:
$ oc get deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage
If there is an existing crashcollector pod deployment, delete it:
$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage
Delete the ocs-osd-removal-job:
# oc delete -n openshift-storage job ocs-osd-removal-job
Example output:
job.batch "ocs-osd-removal-job" deleted
Verification steps
Run the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all other required OpenShift Data Foundation pods are in Running state.
Ensure that the new incremental mon is created, and is in the Running state:
$ oc get pod -n openshift-storage | grep mon
OSD and monitor pods might take several minutes to get to the Running state.
Verify that new OSD pods are running on the replacement node:
$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected hosts:
$ oc debug node/<node_name>
$ chroot /host
Display the list of available block devices:
$ lsblk
Check for the crypt keyword beside the ocs-deviceset names.
- If the verification steps fail, contact Red Hat Support.
2.4.4. Replacing a failed node on VMware installer-provisioned infrastructure
Prerequisites
- Ensure that the replacement nodes are configured with similar infrastructure, resources, and disks to the node that you replace.
- You must be logged into the OpenShift Container Platform cluster.
Procedure
- Log in to the OpenShift Web Console, and click Compute → Nodes.
- Identify the node that you need to replace. Take a note of its Machine Name.
Get the labels on the node:
$ oc get nodes --show-labels | grep <node_name>
<node_name>
- Specify the name of the node that you need to replace.
Identify the mon (if any) and Object Storage Devices (OSDs) that are running in the node:
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
Scale down the deployments of the pods identified in the previous step:
For example:
$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
Mark the node as unschedulable:
$ oc adm cordon <node_name>
Remove the pods which are in Terminating state:
$ oc get pods -A -o wide | grep -i <node_name> | awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2 " --grace-period=0 " " --force ")}'
Drain the node:
$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets
- Click Compute → Machines. Search for the required machine.
- Beside the required machine, click Action menu (⋮) → Delete Machine.
- Click Delete to confirm the machine deletion. A new machine is automatically created.
Wait for the new machine to start and transition into Running state.
Important: This activity might take 5 - 10 minutes or more.
- Click Compute → Nodes in the OpenShift Web Console. Confirm that the new node is in Ready state.
- Physically add a new device to the node.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
- From the user interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage, and click Save.
- From the command-line interface
- Apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
<new_node_name>
- Specify the name of the new node.
Identify the namespace where the OpenShift local storage operator is installed, and assign it to the local_storage_project variable:
$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
Verify the value of the variable:
$ echo $local_storage_project
Example output:
openshift-local-storage
Add a new worker node to the localVolumeDiscovery and localVolumeSet.
Update the localVolumeDiscovery definition to include the new node and remove the failed node:
# oc edit -n $local_storage_project localvolumediscovery auto-discover-devices
Remember to save before exiting the editor.
In this example, server3.example.com is removed and newnode.example.com is the new node.
Determine the localVolumeSet you need to edit:
# oc get -n $local_storage_project localvolumeset
Example output:
NAME         AGE
localblock   25h
Update the localVolumeSet definition to include the new node and remove the failed node:
# oc edit -n $local_storage_project localvolumeset localblock
Remember to save before exiting the editor.
In this example, server3.example.com is removed and newnode.example.com is the new node.
Verify that the new localblock PV is available:
$ oc get pv | grep localblock | grep Available
Example output:
local-pv-551d950   512Gi   RWO   Delete   Available   localblock   26s
Navigate to the openshift-storage project:
$ oc project openshift-storage
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required:
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=<failed_osd_id> | oc create -f -
<failed_osd_id>
- Is the integer in the pod name immediately after the rook-ceph-osd prefix.
You can add comma-separated OSD IDs in the command to remove more than one OSD, for example, FAILED_OSD_IDS=0,1,2.
The FORCE_OSD_REMOVAL value must be changed to true in clusters that have only three OSDs, or in clusters with insufficient space to restore all three replicas of the data after the OSD is removed.
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.
A status of Completed confirms that the OSD removal job succeeded.
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
Ensure that the OSD removal is completed:
$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'
Example output:
2022-05-10 06:50:04.501511 I | cephosd: completed removal of OSD 0
Important: If the ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging.
For example:
# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1
Identify the PV associated with the Persistent Volume Claim (PVC):
# oc get pv -L kubernetes.io/hostname | grep localblock | grep Released
Example output:
local-pv-d6bf175b   1490Gi   RWO   Delete   Released   openshift-storage/ocs-deviceset-0-data-0-6c5pw   localblock   2d22h   compute-1
If there is a PV in Released state, delete it:
# oc delete pv <persistent_volume>
For example:
# oc delete pv local-pv-d6bf175b
Example output:
persistentvolume "local-pv-d6bf175b" deleted
Identify the crashcollector pod deployment:
$ oc get deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage
If there is an existing crashcollector pod deployment, delete it:
$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage
Delete the ocs-osd-removal-job:
# oc delete -n openshift-storage job ocs-osd-removal-job
Example output:
job.batch "ocs-osd-removal-job" deleted
Verification steps
Run the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all other required OpenShift Data Foundation pods are in Running state.
Ensure that the new incremental mon is created, and is in the Running state:
$ oc get pod -n openshift-storage | grep mon
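A representative listing is sketched below; the pod name suffixes, ready counts, and ages are hypothetical and differ in every cluster. The newly created mon is usually the one with the shortest age.
rook-ceph-mon-a-64556f7659-c2ngc     2/2     Running     0     5h1m
rook-ceph-mon-b-7c8b74dc4d-tt6hd     2/2     Running     0     5h1m
rook-ceph-mon-d-57fb8c657-wg5f2      2/2     Running     0     27m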
OSD and monitor pods might take several minutes to get to the Running state.
Verify that new OSD pods are running on the replacement node:
$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected hosts:
$ oc debug node/<node_name>
$ chroot /host
Display the list of available block devices:
$ lsblk
Check for the crypt keyword beside the ocs-deviceset names.
- If the verification steps fail, contact Red Hat Support.