OpenShift Container Storage is now OpenShift Data Foundation starting with version 4.9.
Replacing nodes
How to prepare replacement nodes and replace failed nodes
Abstract
Making open source more inclusive
Red Hat is committed to replacing problematic language in our code, documentation, and web properties. We are beginning with these four terms: master, slave, blacklist, and whitelist. Because of the enormity of this endeavor, these changes will be implemented gradually over several upcoming releases. For more details, see our CTO Chris Wright’s message.
Providing feedback on Red Hat documentation
We appreciate your input on our documentation. Do let us know how we can make it better. To give feedback:
For simple comments on specific passages:
- Make sure you are viewing the documentation in the Multi-page HTML format. In addition, ensure you see the Feedback button in the upper right corner of the document.
- Use your mouse cursor to highlight the part of text that you want to comment on.
- Click the Add Feedback pop-up that appears below the highlighted text.
- Follow the displayed instructions.
For submitting more complex feedback, create a Bugzilla ticket:
- Go to the Bugzilla website.
- In the Component section, choose documentation.
- Fill in the Description field with your suggestion for improvement. Include a link to the relevant part(s) of documentation.
- Click Submit Bug.
Preface
For OpenShift Container Storage, node replacement can be performed proactively for an operational node and reactively for a failed node for the following deployments:
For Amazon Web Services (AWS)
- User-provisioned infrastructure
- Installer-provisioned infrastructure
For VMware
- User-provisioned infrastructure
- Installer-provisioned infrastructure
For Red Hat Virtualization
- Installer-provisioned infrastructure
For Microsoft Azure
- Installer-provisioned infrastructure
For local storage devices
- Bare metal
- VMware
- Red Hat Virtualization
- IBM Power Systems
- For replacing your storage nodes in external mode, see Red Hat Ceph Storage documentation.
Chapter 1. OpenShift Container Storage deployed using dynamic devices
1.1. OpenShift Container Storage deployed on AWS
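To replace an operational node, see:
- Section 1.1.1, “Replacing an operational AWS node on user-provisioned infrastructure”
- Section 1.1.2, “Replacing an operational AWS node on installer-provisioned infrastructure”
To replace a failed node, see:
- Section 1.1.3, “Replacing a failed AWS node on user-provisioned infrastructure”
- Section 1.1.4, “Replacing a failed AWS node on installer-provisioned infrastructure”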
1.1.1. Replacing an operational AWS node on user-provisioned infrastructure
Perform this procedure to replace an operational node on AWS user-provisioned infrastructure.
Prerequisites
- Red Hat recommends that replacement nodes are configured with similar infrastructure and resources to the node being replaced.
- You must be logged into the OpenShift Container Platform (RHOCP) cluster.
Procedure
- Identify the node that needs to be replaced.
- Mark the node as unschedulable using the following command:
  $ oc adm cordon <node_name>
- Drain the node using the following command:
  $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
  Important: This activity may take 5-10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
- Delete the node using the following command:
  $ oc delete nodes <node_name>
- Create a new AWS machine instance with the required infrastructure. See Platform requirements.
- Create a new OpenShift Container Platform node using the new AWS machine instance.
- Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
  $ oc get csr
- Approve all required OpenShift Container Platform CSRs for the new node:
  $ oc adm certificate approve <Certificate_Name>
- Click Compute → Nodes and confirm that the new node is in Ready state.
- Apply the OpenShift Container Storage label to the new node.
  - From the web user interface
    - For the new node, click Action Menu (⋮) → Edit Labels.
    - Add cluster.ocs.openshift.io/openshift-storage and click Save.
  - From the command line interface
    - Execute the following command to apply the OpenShift Container Storage label to the new node:
      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Verification steps
- Execute the following command and verify that the new node is present in the output:
  $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
- Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:
  - csi-cephfsplugin-*
  - csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
- Verify that new OSD pods are running on the replacement node:
  $ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
- (Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted. For each of the new nodes identified in the previous step, do the following:
  - Create a debug pod and open a chroot environment for the selected host(s):
    $ oc debug node/<node_name>
    $ chroot /host
  - Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s):
    $ lsblk
- If verification steps fail, contact Red Hat Support.
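The cordon, drain, and delete steps of this procedure can be wrapped in a small shell function so that they always run in the same order. The following is a minimal sketch, not an official tool: retire_node is a hypothetical helper name, and the flags are copied unchanged from the procedure. Newer oc releases may expect --delete-emptydir-data in place of --delete-local-data.

```shell
# Sketch only: retire_node is a hypothetical helper, not part of the oc CLI.
# It runs the three commands from the procedure in order and stops at the
# first one that fails.
retire_node() {
  local node="$1"
  # Mark the node unschedulable.
  oc adm cordon "$node" &&
    # Evict workloads; flags are taken as-is from the procedure.
    oc adm drain "$node" --force --delete-local-data --ignore-daemonsets &&
    # Remove the node object from the cluster.
    oc delete nodes "$node"
}
```

For example, retire_node <node_name> would be run from a session that is already logged in to the cluster.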
1.1.2. Replacing an operational AWS node on installer-provisioned infrastructure
Use this procedure to replace an operational node on AWS installer-provisioned infrastructure (IPI).
Procedure
- Log in to OpenShift Web Console and click Compute → Nodes.
- Identify the node that needs to be replaced. Take a note of its Machine Name.
- Mark the node as unschedulable using the following command:
  $ oc adm cordon <node_name>
- Drain the node using the following command:
  $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
  Important: This activity may take 5-10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
- Click Compute → Machines. Search for the required machine.
- Beside the required machine, click the Action menu (⋮) → Delete Machine.
- Click Delete to confirm the machine deletion. A new machine is automatically created.
- Wait for the new machine to start and transition into Running state.
  Important: This activity may take 5-10 minutes or more.
- Click Compute → Nodes and confirm that the new node is in Ready state.
- Apply the OpenShift Container Storage label to the new node using any one of the following:
  - From the user interface
    - For the new node, click Action Menu (⋮) → Edit Labels.
    - Add cluster.ocs.openshift.io/openshift-storage and click Save.
  - From the command line interface
    - Execute the following command to apply the OpenShift Container Storage label to the new node:
      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Verification steps
- Execute the following command and verify that the new node is present in the output:
  $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
- Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:
  - csi-cephfsplugin-*
  - csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
- Verify that new OSD pods are running on the replacement node:
  $ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
- (Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted. For each of the new nodes identified in the previous step, do the following:
  - Create a debug pod and open a chroot environment for the selected host(s):
    $ oc debug node/<node_name>
    $ chroot /host
  - Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s):
    $ lsblk
- If verification steps fail, contact Red Hat Support.
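Applying the storage label and confirming that it took effect can be combined into one step. The sketch below assumes a logged-in oc session. label_storage_node is a hypothetical helper name, and it uses a label selector (-l) rather than the grep and cut pipeline from the verification steps, which is an alternative way to get the same list.

```shell
# Sketch only: label_storage_node is a hypothetical helper name.
label_storage_node() {
  local node="$1"
  local label="cluster.ocs.openshift.io/openshift-storage"
  # Apply the label with an empty value, as in the procedure.
  oc label node "$node" "${label}=" --overwrite || return 1
  # List nodes carrying the label and require an exact, literal name match.
  oc get nodes -l "$label" -o name | grep -qxF "node/${node}"
}
```

The function returns a non-zero status if the node does not appear among the labeled nodes, so it can be used directly in an if statement.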
1.1.3. Replacing a failed AWS node on user-provisioned infrastructure
Perform this procedure to replace a failed node which is not operational on AWS user-provisioned infrastructure (UPI) for OpenShift Container Storage.
Prerequisites
- Red Hat recommends that replacement nodes are configured with similar infrastructure and resources to the node being replaced.
- You must be logged into the OpenShift Container Platform (RHOCP) cluster.
Procedure
- Identify the AWS machine instance of the node that needs to be replaced.
- Log in to AWS and terminate the identified AWS machine instance.
- Create a new AWS machine instance with the required infrastructure. See platform requirements.
- Create a new OpenShift Container Platform node using the new AWS machine instance.
- Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
  $ oc get csr
- Approve all required OpenShift Container Platform CSRs for the new node:
  $ oc adm certificate approve <Certificate_Name>
- Click Compute → Nodes and confirm that the new node is in Ready state.
- Apply the OpenShift Container Storage label to the new node using any one of the following:
  - From the user interface
    - For the new node, click Action Menu (⋮) → Edit Labels.
    - Add cluster.ocs.openshift.io/openshift-storage and click Save.
  - From the command line interface
    - Execute the following command to apply the OpenShift Container Storage label to the new node:
      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Verification steps
- Execute the following command and verify that the new node is present in the output:
  $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
- Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:
  - csi-cephfsplugin-*
  - csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
- Verify that new OSD pods are running on the replacement node:
  $ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
- (Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted. For each of the new nodes identified in the previous step, do the following:
  - Create a debug pod and open a chroot environment for the selected host(s):
    $ oc debug node/<node_name>
    $ chroot /host
  - Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s):
    $ lsblk
- If verification steps fail, contact Red Hat Support.
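When several CSRs are listed, picking out the Pending ones by eye is error-prone. The following sketch filters the table printed by oc get csr. It assumes that CONDITION is the last column of that table, which may differ between oc versions. Both function names are hypothetical, and the list should be reviewed before anything is approved.

```shell
# Sketch only: both helpers are hypothetical names.
# pending_csrs reads `oc get csr` table output on stdin and prints the names
# of requests whose last column (assumed to be CONDITION) is exactly "Pending".
pending_csrs() {
  awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# approve_pending_csrs approves every request that pending_csrs reports.
# Review the output of `oc get csr | pending_csrs` before running this.
approve_pending_csrs() {
  local name
  oc get csr | pending_csrs | while read -r name; do
    oc adm certificate approve "$name"
  done
}
```

Running oc get csr | pending_csrs on its own only lists names and changes nothing on the cluster.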
1.1.4. Replacing a failed AWS node on installer-provisioned infrastructure
Perform this procedure to replace a failed node which is not operational on AWS installer-provisioned infrastructure (IPI) for OpenShift Container Storage.
Procedure
- Log in to OpenShift Web Console and click Compute → Nodes.
- Identify the faulty node and click on its Machine Name.
- Click Actions → Edit Annotations, and click Add More.
- Add machine.openshift.io/exclude-node-draining and click Save.
- Click Actions → Delete Machine, and click Delete.
- A new machine is automatically created. Wait for the new machine to start.
  Important: This activity may take 5-10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
- Click Compute → Nodes and confirm that the new node is in Ready state.
- Apply the OpenShift Container Storage label to the new node using any one of the following:
  - From the user interface
    - For the new node, click Action Menu (⋮) → Edit Labels.
    - Add cluster.ocs.openshift.io/openshift-storage and click Save.
  - From the command line interface
    - Execute the following command to apply the OpenShift Container Storage label to the new node:
      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
- [Optional]: If the failed AWS instance is not removed automatically, terminate the instance from AWS console.
Verification steps
- Execute the following command and verify that the new node is present in the output:
  $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
- Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:
  - csi-cephfsplugin-*
  - csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
- Verify that new OSD pods are running on the replacement node:
  $ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
- (Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted. For each of the new nodes identified in the previous step, do the following:
  - Create a debug pod and open a chroot environment for the selected host(s):
    $ oc debug node/<node_name>
    $ chroot /host
  - Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s):
    $ lsblk
- If verification steps fail, contact Red Hat Support.
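The annotation and machine deletion in this procedure are performed in the web console, but the same two actions can be approximated from the command line. The sketch below is an assumption-laden equivalent, not a documented replacement for the console steps: replace_failed_machine is a hypothetical helper name, and the openshift-machine-api namespace is assumed to be where the Machine object lives.

```shell
# Sketch only: replace_failed_machine is a hypothetical helper name.
# The namespace is an assumption; adjust it if your Machine objects live elsewhere.
replace_failed_machine() {
  local machine="$1"
  local ns="openshift-machine-api"
  # Tell the machine controller to skip draining the unreachable node.
  oc annotate machine "$machine" -n "$ns" machine.openshift.io/exclude-node-draining="" &&
    # Delete the machine; the machine set is expected to create a replacement.
    oc delete machine "$machine" -n "$ns"
}
```

The annotation is set first because a failed node cannot be drained, and deleting the machine without it may block while the drain is retried.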
1.2. OpenShift Container Storage deployed on VMware
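To replace an operational node, see:
- Section 1.2.1, “Replacing an operational VMware node on user-provisioned infrastructure”
- Section 1.2.2, “Replacing an operational VMware node on installer-provisioned infrastructure”
To replace a failed node, see:
- Section 1.2.3, “Replacing a failed VMware node on user-provisioned infrastructure”
- Section 1.2.4, “Replacing a failed VMware node on installer-provisioned infrastructure”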
1.2.1. Replacing an operational VMware node on user-provisioned infrastructure
Perform this procedure to replace an operational node on VMware user-provisioned infrastructure (UPI).
Prerequisites
- Red Hat recommends that replacement nodes are configured with similar infrastructure, resources, and disks to the node being replaced.
- You must be logged into the OpenShift Container Platform (RHOCP) cluster.
Procedure
- Identify the node and its VM that need to be replaced.
- Mark the node as unschedulable using the following command:
  $ oc adm cordon <node_name>
- Drain the node using the following command:
  $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
  Important: This activity may take 5-10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
- Delete the node using the following command:
  $ oc delete nodes <node_name>
- Log in to vSphere and terminate the identified VM.
  Important: The VM should be deleted only from the inventory and not from the disk.
- Create a new VM on vSphere with the required infrastructure. See Platform requirements.
- Create a new OpenShift Container Platform worker node using the new VM.
- Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
  $ oc get csr
- Approve all required OpenShift Container Platform CSRs for the new node:
  $ oc adm certificate approve <Certificate_Name>
- Click Compute → Nodes and confirm that the new node is in Ready state.
- Apply the OpenShift Container Storage label to the new node using any one of the following:
  - From the user interface
    - For the new node, click Action Menu (⋮) → Edit Labels.
    - Add cluster.ocs.openshift.io/openshift-storage and click Save.
  - From the command line interface
    - Execute the following command to apply the OpenShift Container Storage label to the new node:
      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Verification steps
- Execute the following command and verify that the new node is present in the output:
  $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
- Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:
  - csi-cephfsplugin-*
  - csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
- Verify that new OSD pods are running on the replacement node:
  $ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
- (Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted. For each of the new nodes identified in the previous step, do the following:
  - Create a debug pod and open a chroot environment for the selected host(s):
    $ oc debug node/<node_name>
    $ chroot /host
  - Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s):
    $ lsblk
- If verification steps fail, contact Red Hat Support.
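The OSD check in the verification steps matches the node name as free text, which can also match unrelated pods whose names contain the same string. A field selector on spec.nodeName is a stricter alternative. The following is a sketch under that assumption, with osd_pods_on_node as a hypothetical helper name.

```shell
# Sketch only: osd_pods_on_node is a hypothetical helper name.
# Lists pods in openshift-storage that are scheduled on the given node,
# then keeps the ones whose name contains "osd".
osd_pods_on_node() {
  local node="$1"
  oc get pods -n openshift-storage -o name \
    --field-selector "spec.nodeName=${node}" | grep osd
}
```

An empty result, with a non-zero exit status from grep, suggests that no OSD pod has been scheduled on the replacement node yet.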
1.2.2. Replacing an operational VMware node on installer-provisioned infrastructure
Use this procedure to replace an operational node on VMware installer-provisioned infrastructure (IPI).
Procedure
- Log in to OpenShift Web Console and click Compute → Nodes.
- Identify the node that needs to be replaced. Take a note of its Machine Name.
- Mark the node as unschedulable using the following command:
  $ oc adm cordon <node_name>
- Drain the node using the following command:
  $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
  Important: This activity may take 5-10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
- Click Compute → Machines. Search for the required machine.
- Beside the required machine, click the Action menu (⋮) → Delete Machine.
- Click Delete to confirm the machine deletion. A new machine is automatically created.
- Wait for the new machine to start and transition into Running state.
  Important: This activity may take 5-10 minutes or more.
- Click Compute → Nodes and confirm that the new node is in Ready state.
- Apply the OpenShift Container Storage label to the new node using any one of the following:
  - From the user interface
    - For the new node, click Action Menu (⋮) → Edit Labels.
    - Add cluster.ocs.openshift.io/openshift-storage and click Save.
  - From the command line interface
    - Execute the following command to apply the OpenShift Container Storage label to the new node:
      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Verification steps
- Execute the following command and verify that the new node is present in the output:
  $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
- Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:
  - csi-cephfsplugin-*
  - csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
- Verify that new OSD pods are running on the replacement node:
  $ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
- (Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted. For each of the new nodes identified in the previous step, do the following:
  - Create a debug pod and open a chroot environment for the selected host(s):
    $ oc debug node/<node_name>
    $ chroot /host
  - Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s):
    $ lsblk
- If verification steps fail, contact Red Hat Support.
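The encryption check asks you to read lsblk output and look for “crypt” beside each ocs-deviceset entry. The sketch below automates that reading. It assumes the input comes from lsblk -l -o NAME,TYPE, so that TYPE is the last column. unencrypted_devicesets is a hypothetical helper name, and the device names in any sample data are illustrative only.

```shell
# Sketch only: unencrypted_devicesets is a hypothetical helper name.
# Reads `lsblk -l -o NAME,TYPE` output on stdin and prints every
# ocs-deviceset entry whose TYPE column is not "crypt".
# No output means every listed ocs-deviceset device is of type crypt.
unencrypted_devicesets() {
  awk '/ocs-deviceset/ && $NF != "crypt" { print $1 }'
}
```

Inside the chroot environment on the node, this would be used as lsblk -l -o NAME,TYPE | unencrypted_devicesets.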
1.2.3. Replacing a failed VMware node on user-provisioned infrastructure
Perform this procedure to replace a failed node on VMware user-provisioned infrastructure (UPI).
Prerequisites
- Red Hat recommends that replacement nodes are configured with similar infrastructure, resources, and disks to the node being replaced.
- You must be logged into the OpenShift Container Platform (RHOCP) cluster.
Procedure
- Identify the node and its VM that need to be replaced.
- Delete the node using the following command:
  $ oc delete nodes <node_name>
- Log in to vSphere and terminate the identified VM.
  Important: The VM should be deleted only from the inventory and not from the disk.
- Create a new VM on vSphere with the required infrastructure. See Platform requirements.
- Create a new OpenShift Container Platform worker node using the new VM.
- Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
  $ oc get csr
- Approve all required OpenShift Container Platform CSRs for the new node:
  $ oc adm certificate approve <Certificate_Name>
- Click Compute → Nodes and confirm that the new node is in Ready state.
- Apply the OpenShift Container Storage label to the new node using any one of the following:
  - From the user interface
    - For the new node, click Action Menu (⋮) → Edit Labels.
    - Add cluster.ocs.openshift.io/openshift-storage and click Save.
  - From the command line interface
    - Execute the following command to apply the OpenShift Container Storage label to the new node:
      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Verification steps
- Execute the following command and verify that the new node is present in the output:
  $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
- Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:
  - csi-cephfsplugin-*
  - csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
- Verify that new OSD pods are running on the replacement node:
  $ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
- (Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted. For each of the new nodes identified in the previous step, do the following:
  - Create a debug pod and open a chroot environment for the selected host(s):
    $ oc debug node/<node_name>
    $ chroot /host
  - Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s):
    $ lsblk
- If verification steps fail, contact Red Hat Support.
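The procedure confirms the Ready state of the new node in the web console. From the command line, oc wait can block until the node reports the Ready condition. The following is a sketch, with wait_for_node_ready as a hypothetical helper name and the 600s default timeout chosen here as an assumption, not a documented value.

```shell
# Sketch only: wait_for_node_ready is a hypothetical helper name.
# The default timeout of 600s is an assumption; pass a second argument to override it.
wait_for_node_ready() {
  local node="$1" timeout="${2:-600s}"
  # Blocks until the node reports condition Ready, or fails on timeout.
  oc wait --for=condition=Ready "node/${node}" --timeout="$timeout"
}
```

A non-zero exit status means the node did not become Ready within the timeout, in which case the pending CSRs are worth rechecking.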
1.2.4. Replacing a failed VMware node on installer-provisioned infrastructure
Perform this procedure to replace a failed node which is not operational on VMware installer-provisioned infrastructure (IPI) for OpenShift Container Storage.
Procedure
- Log in to OpenShift Web Console and click Compute → Nodes.
- Identify the faulty node and click on its Machine Name.
- Click Actions → Edit Annotations, and click Add More.
- Add machine.openshift.io/exclude-node-draining and click Save.
- Click Actions → Delete Machine, and click Delete.
- A new machine is automatically created. Wait for the new machine to start.
  Important: This activity may take 5-10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
- Click Compute → Nodes and confirm that the new node is in Ready state.
- Apply the OpenShift Container Storage label to the new node using any one of the following:
  - From the user interface
    - For the new node, click Action Menu (⋮) → Edit Labels.
    - Add cluster.ocs.openshift.io/openshift-storage and click Save.
  - From the command line interface
    - Execute the following command to apply the OpenShift Container Storage label to the new node:
      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
- [Optional]: If the failed VM is not removed automatically, terminate the VM from vSphere.
Verification steps
- Execute the following command and verify that the new node is present in the output:
  $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
- Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:
  - csi-cephfsplugin-*
  - csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
- Verify that new OSD pods are running on the replacement node:
  $ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
- (Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted. For each of the new nodes identified in the previous step, do the following:
  - Create a debug pod and open a chroot environment for the selected host(s):
    $ oc debug node/<node_name>
    $ chroot /host
  - Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s):
    $ lsblk
- If verification steps fail, contact Red Hat Support.
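The check that the csi-cephfsplugin-* and csi-rbdplugin-* pods are Running on the new node is done in the web console, but it can also be scripted. The sketch below assumes the default oc get pods column order (NAME, READY, STATUS), and csi_pods_not_running is a hypothetical helper name.

```shell
# Sketch only: csi_pods_not_running is a hypothetical helper name.
# Prints CSI plugin pods on the given node whose STATUS column is not
# "Running". Assumes STATUS is the third column of the default table output.
csi_pods_not_running() {
  local node="$1"
  oc get pods -n openshift-storage --no-headers \
    --field-selector "spec.nodeName=${node}" |
    awk '$1 ~ /^csi-(cephfsplugin|rbdplugin)-/ && $3 != "Running" { print $1 }'
}
```

No output means that every CSI plugin pod found on the node is Running. It does not prove that the pods exist, so the pod list should still be checked when the node is new.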
1.3. OpenShift Container Storage deployed on Red Hat Virtualization
- To replace an operational node, see Section 1.3.1, “Replacing an operational Red Hat Virtualization node on installer-provisioned infrastructure”
- To replace a failed node, see Section 2.4.2, “Replacing a failed node on Red Hat Virtualization installer-provisioned infrastructure”
1.3.1. Replacing an operational Red Hat Virtualization node on installer-provisioned infrastructure
Use this procedure to replace an operational node on Red Hat Virtualization installer-provisioned infrastructure (IPI).
Procedure
- Log in to OpenShift Web Console and click Compute → Nodes.
- Identify the node that needs to be replaced. Take a note of its Machine Name.
Mark the node as unschedulable using the following command:
oc adm cordon <node_name>
$ oc adm cordon <node_name>Copy to Clipboard Copied! Toggle word wrap Toggle overflow Drain the node using the following command:
oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsetsCopy to Clipboard Copied! Toggle word wrap Toggle overflow ImportantThis activity may take at least 5-10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
- Click Compute → Machines. Search for the required machine.
- Beside the required machine, click the Action menu (⋮) → Delete Machine.
Click Delete to confirm the machine deletion. A new machine is automatically created. Wait for the new machine to start and transition into the Running state.
Important: This activity may take 5-10 minutes or more.
- Click Compute → Nodes and confirm that the new node is in the Ready state.
Apply the OpenShift Container Storage label to the new node using any one of the following:
- From User interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage and click Save.
- From Command line interface
- Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
Verify that new OSD pods are running on the replacement node.
$ oc get pods -o wide -n openshift-storage| egrep -i new-node-name | egrep osd
(Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host(s).
$ oc debug node/<node name>
$ chroot /host
Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).
$ lsblk
- If verification steps fail, contact Red Hat Support.
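The first verification step can also be scripted. The following is a minimal sketch rather than a supported tool: the function names storage_nodes and node_is_labeled are hypothetical helpers, and the sketch assumes that the node name is the first space-delimited field of the oc get nodes --show-labels output, as in the command above.

```shell
#!/bin/sh
# Print the names of nodes that carry the OpenShift Container Storage label.
# Reads `oc get nodes --show-labels` output on stdin.
storage_nodes() {
    grep -F 'cluster.ocs.openshift.io/openshift-storage=' | cut -d' ' -f1
}

# Succeed only if the node name given as $1 is among the labeled nodes.
# Reads the same `oc get nodes --show-labels` output on stdin.
node_is_labeled() {
    storage_nodes | grep -qxF "$1"
}

# Usage against a live cluster (<new_node_name> is a placeholder):
#   oc get nodes --show-labels | node_is_labeled <new_node_name> && echo "labeled"
```

Both functions only read text from standard input, so they can be tried against saved command output before being pointed at a cluster.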
1.3.2. Replacing a failed Red Hat Virtualization node on installer-provisioned infrastructure
Perform this procedure to replace a failed node which is not operational on Red Hat Virtualization installer-provisioned infrastructure (IPI) for OpenShift Container Storage.
Procedure
- Log in to OpenShift Web Console and click Compute → Nodes.
- Identify the faulty node. Take a note of its Machine Name.
Log in to Red Hat Virtualization Administration Portal and remove the virtual disks associated with mon and OSDs from the failed Virtual Machine.
This step is required so that the disks are not deleted when the VM instance is deleted as part of the Delete machine step.
Important: Do not select the Remove Permanently option when removing the disk(s).
- In the OpenShift Web Console, click Compute → Machines. Search for the required machine.
- Click Actions → Edit Annotations, and click Add More.
- Add machine.openshift.io/exclude-node-draining and click Save.
Click Actions → Delete Machine, and click Delete.
A new machine is automatically created. Wait for the new machine to start.
Important: This activity may take 5-10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
- Click Compute → Nodes and confirm that the new node is in the Ready state.
Apply the OpenShift Container Storage label to the new node using any one of the following:
- From User interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage and click Save.
- From Command line interface
Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
- (Optional) If the failed VM is not removed automatically, remove the VM from Red Hat Virtualization Administration Portal.
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
Verify that new OSD pods are running on the replacement node.
$ oc get pods -o wide -n openshift-storage| egrep -i new-node-name | egrep osd
(Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host(s).
$ oc debug node/<node name>
$ chroot /host
Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).
$ lsblk
- If verification steps fail, contact Red Hat Support.
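The encryption check can be made less error-prone by filtering the lsblk output instead of reading it by eye. The following is a minimal sketch under stated assumptions: the function name check_deviceset_encryption is hypothetical, and the sketch assumes the default lsblk column order (NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT), so that TYPE is the sixth field.

```shell
#!/bin/sh
# Report ocs-deviceset entries whose TYPE column is not "crypt".
# Reads default `lsblk` output on stdin.
# Exit status is 0 only when at least one ocs-deviceset entry exists
# and every such entry has TYPE "crypt".
check_deviceset_encryption() {
    awk '
        /ocs-deviceset/ {
            total++
            # Assumption: TYPE is the sixth column of default lsblk output.
            if ($6 != "crypt") {
                bad++
                print "not encrypted: " $1
            }
        }
        END { exit ((total > 0 && bad == 0) ? 0 : 1) }
    '
}

# Usage inside the chroot environment of the debug pod:
#   lsblk | check_deviceset_encryption && echo "all OSD devices are encrypted"
```

A non-zero exit status with no output means that no ocs-deviceset entry was found at all, which is also worth investigating.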
1.4. OpenShift Container Storage deployed on Microsoft Azure
- To replace an operational node, see Section 1.4.1, “Replacing operational nodes on Azure installer-provisioned infrastructure”
- To replace a failed node, see Section 1.4.2, “Replacing failed nodes on Azure installer-provisioned infrastructure”
1.4.1. Replacing operational nodes on Azure installer-provisioned infrastructure
Use this procedure to replace an operational node on Azure installer-provisioned infrastructure (IPI).
Procedure
- Log in to OpenShift Web Console and click Compute → Nodes.
- Identify the node that needs to be replaced. Take a note of its Machine Name.
Mark the node as unschedulable using the following command:
$ oc adm cordon <node_name>
Drain the node using the following command:
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
Important: This activity may take 5-10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
- Click Compute → Machines. Search for the required machine.
- Beside the required machine, click the Action menu (⋮) → Delete Machine.
- Click Delete to confirm the machine deletion. A new machine is automatically created.
Wait for the new machine to start and transition into the Running state.
Important: This activity may take 5-10 minutes or more.
- Click Compute → Nodes and confirm that the new node is in the Ready state.
Apply the OpenShift Container Storage label to the new node using any one of the following:
- From User interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage and click Save.
- From Command line interface
Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
Verify that new OSD pods are running on the replacement node.
$ oc get pods -o wide -n openshift-storage| egrep -i new-node-name | egrep osd
(Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host(s).
$ oc debug node/<node name>
$ chroot /host
Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).
$ lsblk
- If verification steps fail, contact Red Hat Support.
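The OSD pod check above pipes the pod list through two egrep filters, which match anywhere on the line. The following is a minimal sketch of a slightly stricter variant: the function name osd_pods_on_node is hypothetical, the node name is matched as a literal substring, "osd" must appear in the pod name itself, and STATUS is assumed to be the third column of the oc get pods -o wide output.

```shell
#!/bin/sh
# List OSD pods scheduled on a node together with their status.
# Reads `oc get pods -o wide -n openshift-storage` output on stdin.
# $1 is the node name; matching is case-insensitive, like `egrep -i`.
osd_pods_on_node() {
    awk -v node="$1" '
        BEGIN { node = tolower(node) }
        # index() is a literal substring match, so dots in the node
        # name are not treated as wildcards.
        index(tolower($0), node) && $1 ~ /osd/ { print $1, $3 }
    '
}

# Usage against a live cluster (<new_node_name> is a placeholder):
#   oc get pods -o wide -n openshift-storage | osd_pods_on_node <new_node_name>
```

Each output line holds a pod name and its status, so a pod that is stuck in a state other than Running is visible at a glance.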
1.4.2. Replacing failed nodes on Azure installer-provisioned infrastructure
Perform this procedure to replace a failed node which is not operational on Azure installer-provisioned infrastructure (IPI) for OpenShift Container Storage.
Procedure
- Log in to OpenShift Web Console and click Compute → Nodes.
- Identify the faulty node and click on its Machine Name.
- Click Actions → Edit Annotations, and click Add More.
- Add machine.openshift.io/exclude-node-draining and click Save.
- Click Actions → Delete Machine, and click Delete.
A new machine is automatically created. Wait for the new machine to start.
Important: This activity may take 5-10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
- Click Compute → Nodes and confirm that the new node is in the Ready state.
Apply the OpenShift Container Storage label to the new node using any one of the following:
- From User interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage and click Save.
- From Command line interface
Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
- (Optional) If the failed Azure instance is not removed automatically, terminate the instance from the Azure console.
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
Verify that new OSD pods are running on the replacement node.
$ oc get pods -o wide -n openshift-storage| egrep -i new-node-name | egrep osd
(Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host(s).
$ oc debug node/<node name>
$ chroot /host
Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).
$ lsblk
- If verification steps fail, contact Red Hat Support.
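Because a replacement machine can take several minutes to start, the wait for the Ready state can be automated instead of refreshing the console. The following is a minimal sketch, not a supported tool: wait_for and node_is_ready are hypothetical helper names, and node_is_ready assumes that STATUS is the second column of oc get nodes --no-headers output and reads exactly "Ready" for a schedulable, healthy node.

```shell
#!/bin/sh
# Poll a command until it succeeds or a timeout expires.
# $1: timeout in seconds, $2: polling interval in seconds,
# remaining arguments: the command to run.
# Returns 0 as soon as the command succeeds, 1 if the timeout is reached first.
wait_for() {
    _wf_timeout=$1
    _wf_interval=$2
    shift 2
    _wf_elapsed=0
    while ! "$@"; do
        if [ "$_wf_elapsed" -ge "$_wf_timeout" ]; then
            return 1
        fi
        sleep "$_wf_interval"
        _wf_elapsed=$((_wf_elapsed + _wf_interval))
    done
    return 0
}

# Succeed only when the node named in $1 reports a STATUS of exactly "Ready".
# Assumption: STATUS is the second column of `oc get nodes --no-headers`.
node_is_ready() {
    oc get nodes --no-headers 2>/dev/null |
        awk -v n="$1" '$1 == n && $2 == "Ready" { ok = 1 } END { exit !ok }'
}

# Usage (<new_node_name> is a placeholder): wait up to 15 minutes, polling every 30 seconds.
#   wait_for 900 30 node_is_ready <new_node_name> || echo "node did not become Ready"
```

The elapsed time only counts the sleep intervals, so the real wait is slightly longer than the timeout when the polled command itself is slow.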
Chapter 2. OpenShift Container Storage deployed using local storage devices
2.1. Replacing storage nodes on bare metal infrastructure
- To replace an operational node, see Section 2.1.1, “Replacing an operational node on bare metal user-provisioned infrastructure”
- To replace a failed node, see Section 2.1.2, “Replacing a failed node on bare metal user-provisioned infrastructure”
2.1.1. Replacing an operational node on bare metal user-provisioned infrastructure
Prerequisites
- Red Hat recommends that replacement nodes are configured with similar infrastructure, resources, and disks to the node being replaced.
- You must be logged into the OpenShift Container Platform (RHOCP) cluster.
- If you upgraded to OpenShift Container Storage version 4.8 from a previous version, and have not already created the LocalVolumeDiscovery and LocalVolumeSet objects, do so now by following the procedure described in Post-update configuration changes for clusters backed by local storage.
Procedure
Identify the node and get labels on the node to be replaced.
$ oc get nodes --show-labels | grep <node_name>
Identify the mon (if any) and OSDs that are running in the node to be replaced.
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
Scale down the deployments of the pods identified in the previous step.
For example:
$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
Mark the node as unschedulable.
$ oc adm cordon <node_name>
Drain the node.
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
Delete the node.
$ oc delete node <node_name>
Get a new bare metal machine with required infrastructure. See Installing a cluster on bare metal.
Important: For information about how to replace a master node when you have installed OpenShift Container Storage on a three-node OpenShift compact bare-metal cluster, see the Backup and Restore guide in the OpenShift Container Platform documentation.
- Create a new OpenShift Container Platform node using the new bare metal machine.
Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
$ oc get csr
Approve all required OpenShift Container Platform CSRs for the new node:
$ oc adm certificate approve <Certificate_Name>
- Click Compute → Nodes in OpenShift Web Console and confirm that the new node is in the Ready state.
Apply the OpenShift Container Storage label to the new node using any one of the following:
- From User interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage and click Save.
- From Command line interface
Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Identify the namespace where the OpenShift local storage operator is installed and assign it to the local_storage_project variable:
$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
For example:
$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
$ echo $local_storage_project
openshift-local-storage
Add a new worker node to localVolumeDiscovery and localVolumeSet.
Update the localVolumeDiscovery definition to include the new node and remove the failed node.
Remember to save before exiting the editor.
In the above example, server3.example.com was removed and newnode.example.com is the new node.
Determine which localVolumeSet to edit.
# oc get -n $local_storage_project localvolumeset
NAME         AGE
localblock   25h
Update the localVolumeSet definition to include the new node and remove the failed node.
Remember to save before exiting the editor.
In the above example, server3.example.com was removed and newnode.example.com is the new node.
Verify that the new localblock PV is available.
$ oc get pv | grep localblock | grep Available
local-pv-551d950 512Gi RWO Delete Available localblock 26s
Change to the openshift-storage project.
$ oc project openshift-storage
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.
$ oc process -n openshift-storage ocs-osd-removal \
    -p FAILED_OSD_IDS=failed-osd-id1,failed-osd-id2 | oc create -f -
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.
A status of Completed confirms that the OSD removal job succeeded.
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
Note: If ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage
Delete the ocs-osd-removal-job.
# oc delete -n openshift-storage job ocs-osd-removal-job
Example output:
job.batch "ocs-osd-removal-job" deleted
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all other required OpenShift Container Storage pods are in Running state.
Ensure that the new incremental mon is created and is in the Running state.
$ oc get pod -n openshift-storage | grep mon
OSD and Mon might take several minutes to get to the Running state.
Verify that new OSD pods are running on the replacement node.
$ oc get pods -o wide -n openshift-storage| egrep -i new-node-name | egrep osd
(Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host(s).
$ oc debug node/<node name>
$ chroot /host
Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).
$ lsblk
- If verification steps fail, contact Red Hat Support.
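When a new node joins the cluster, the CSR list can be long, and only the entries in Pending state need attention. The following is a minimal sketch: the function name pending_csrs is hypothetical, and the sketch assumes that CONDITION is the last column of the oc get csr output and that the first line is a header.

```shell
#!/bin/sh
# Print the names of CSRs whose CONDITION column is exactly "Pending".
# Reads `oc get csr` output on stdin.
# Assumption: CONDITION is the last column and line 1 is the header.
pending_csrs() {
    awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Usage against a live cluster:
#   oc get csr | pending_csrs
# Review the printed names, then approve each required one with:
#   oc adm certificate approve <Certificate_Name>
```

The sketch only lists names and leaves approval as a separate manual step, so that each request can be reviewed before it is approved.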
2.1.2. Replacing a failed node on bare metal user-provisioned infrastructure
Prerequisites
- Red Hat recommends that replacement nodes are configured with similar infrastructure, resources, and disks to the node being replaced.
- You must be logged into the OpenShift Container Platform (RHOCP) cluster.
- If you upgraded to OpenShift Container Storage version 4.8 from a previous version, and have not already created the LocalVolumeDiscovery and LocalVolumeSet objects, do so now by following the procedure described in Post-update configuration changes for clusters backed by local storage.
Procedure
Identify the node and get labels on the node to be replaced.
$ oc get nodes --show-labels | grep <node_name>
Identify the mon (if any) and OSDs that are running in the node to be replaced.
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
Scale down the deployments of the pods identified in the previous step.
For example:
$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
Mark the node as unschedulable.
$ oc adm cordon <node_name>
Remove the pods which are in Terminating state.
$ oc get pods -A -o wide | grep -i <node_name> | awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2 " --grace-period=0 " " --force ")}'
Drain the node.
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
Delete the node.
$ oc delete node <node_name>
Get a new bare metal machine with required infrastructure. See Installing a cluster on bare metal.
Important: For information about how to replace a master node when you have installed OpenShift Container Storage on a three-node OpenShift compact bare-metal cluster, see the Backup and Restore guide in the OpenShift Container Platform documentation.
- Create a new OpenShift Container Platform node using the new bare metal machine.
Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
$ oc get csr
Approve all required OpenShift Container Platform CSRs for the new node:
$ oc adm certificate approve <Certificate_Name>
- Click Compute → Nodes in OpenShift Web Console and confirm that the new node is in the Ready state.
Apply the OpenShift Container Storage label to the new node using any one of the following:
- From User interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage and click Save.
- From Command line interface
Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Identify the namespace where the OpenShift local storage operator is installed and assign it to the local_storage_project variable:
$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
For example:
$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
$ echo $local_storage_project
openshift-local-storage
Add a new worker node to localVolumeDiscovery and localVolumeSet.
Update the localVolumeDiscovery definition to include the new node and remove the failed node.
Remember to save before exiting the editor.
In the above example, server3.example.com was removed and newnode.example.com is the new node.
Determine which localVolumeSet to edit.
# oc get -n $local_storage_project localvolumeset
NAME         AGE
localblock   25h
Update the localVolumeSet definition to include the new node and remove the failed node.
Remember to save before exiting the editor.
In the above example, server3.example.com was removed and newnode.example.com is the new node.
Verify that the new localblock PV is available.
$ oc get pv | grep localblock | grep Available
local-pv-551d950 512Gi RWO Delete Available localblock 26s
Change to the openshift-storage project.
$ oc project openshift-storage
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.
$ oc process -n openshift-storage ocs-osd-removal \
    -p FAILED_OSD_IDS=failed-osd-id1,failed-osd-id2 | oc create -f -
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.
A status of Completed confirms that the OSD removal job succeeded.
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
Note: If ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage
Delete the ocs-osd-removal-job.
# oc delete -n openshift-storage job ocs-osd-removal-job
Example output:
job.batch "ocs-osd-removal-job" deleted
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all other required OpenShift Container Storage pods are in Running state.
Ensure that the new incremental mon is created and is in the Running state.
$ oc get pod -n openshift-storage | grep mon
OSD and Mon might take several minutes to get to the Running state.
Verify that new OSD pods are running on the replacement node.
$ oc get pods -o wide -n openshift-storage| egrep -i new-node-name | egrep osd
(Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host(s).
$ oc debug node/<node name>
$ chroot /host
Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).
$ lsblk
- If verification steps fail, contact Red Hat Support.
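The step in this procedure that removes pods in Terminating state runs a forced delete for every match as soon as the pipeline is executed. The following is a minimal dry-run sketch that prints the same delete commands instead of running them, so that they can be reviewed first. The function name terminating_pod_deletes is hypothetical, and the sketch assumes the column order of oc get pods -A -o wide, where NAMESPACE, NAME, and STATUS are the first, second, and fourth fields.

```shell
#!/bin/sh
# Print (do not run) a force-delete command for each Terminating pod on a node.
# Reads `oc get pods -A -o wide` output on stdin; $1 is the node name.
# Assumption: NAMESPACE, NAME and STATUS are fields 1, 2 and 4.
terminating_pod_deletes() {
    awk -v node="$1" '
        BEGIN { node = tolower(node) }
        # Case-insensitive literal match on the node name, like `grep -i`.
        index(tolower($0), node) && $4 == "Terminating" {
            print "oc -n " $1 " delete pods " $2 " --grace-period=0 --force"
        }
    '
}

# Usage against a live cluster (<node_name> is a placeholder):
#   oc get pods -A -o wide | terminating_pod_deletes <node_name>
```

After reviewing the printed commands, they can be run individually or piped to a shell.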
2.2. Replacing storage nodes on IBM Z or LinuxONE infrastructure
You can choose one of the following procedures to replace storage nodes:
2.2.1. Replacing operational nodes on IBM Z or LinuxONE infrastructure
Use this procedure to replace an operational node on IBM Z or LinuxONE infrastructure.
Procedure
- Log in to OpenShift Web Console.
- Click Compute → Nodes.
- Identify the node that needs to be replaced. Take a note of its Machine Name.
- Mark the node as unschedulable using the following command:
  $ oc adm cordon <node_name>
- Drain the node using the following command:
  $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
  Important: This activity may take 5-10 minutes or longer. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
- Click Compute → Machines. Search for the required machine.
- Beside the required machine, click the Action menu (⋮) → Delete Machine.
- Click Delete to confirm the machine deletion. A new machine is automatically created.
- Wait for the new machine to start and transition into Running state.
  Important: This activity may take 5-10 minutes or longer.
- Click Compute → Nodes, confirm that the new node is in Ready state.
- Apply the OpenShift Container Storage label to the new node using any one of the following:
  - From the user interface
    - For the new node, click Action Menu (⋮) → Edit Labels.
    - Add cluster.ocs.openshift.io/openshift-storage and click Save.
  - From the command line interface
    - Execute the following command to apply the OpenShift Container Storage label to the new node:
      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Verification steps
- Execute the following command and verify that the new node is present in the output:
  $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
- Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
  - csi-cephfsplugin-*
  - csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
- Verify that new OSD pods are running on the replacement node.
  $ oc get pods -o wide -n openshift-storage | egrep -i new-node-name | egrep osd
- (Optional) If data encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
  For each of the new nodes identified in the previous step, do the following:
  - Create a debug pod and open a chroot environment for the selected host(s).
    $ oc debug node/<node name>
    $ chroot /host
  - Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).
    $ lsblk
- If verification steps fail, contact Red Hat Support.
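The pod checks in the verification steps can also be done from the command line. The following is a minimal sketch, assuming the default column layout of `oc get pods -o wide` for a single namespace (NAME, READY, STATUS, ...); `pods_not_running_on_node` is a hypothetical helper name. It treats Completed as an acceptable status, which is an assumption made here for one-shot job pods.

```shell
# Hypothetical helper (not part of OpenShift Container Storage).
# Reads `oc get pods -n openshift-storage -o wide` output on stdin and prints
# the name and status of each pod on the given node that is neither Running
# nor Completed.
pods_not_running_on_node() {
  awk -v node="$1" '
    NR == 1 { next }   # skip the header row
    index($0, node) && $3 != "Running" && $3 != "Completed" {
      print $1, $3
    }
  '
}
```

For example, `oc get pods -n openshift-storage -o wide | pods_not_running_on_node <new_node_name>` prints nothing when every pod on the new node has settled.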
2.2.2. Replacing failed nodes on IBM Z or LinuxONE infrastructure
Perform this procedure to replace a failed node which is not operational on IBM Z or LinuxONE infrastructure for OpenShift Container Storage.
Procedure
- Log in to OpenShift Web Console and click Compute → Nodes.
- Identify the faulty node and click on its Machine Name.
- Click Actions → Edit Annotations, and click Add More.
- Add machine.openshift.io/exclude-node-draining and click Save.
- Click Actions → Delete Machine, and click Delete.
- A new machine is automatically created. Wait for the new machine to start.
  Important: This activity may take 5-10 minutes or longer. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
- Click Compute → Nodes, confirm that the new node is in Ready state.
- Apply the OpenShift Container Storage label to the new node using any one of the following:
  - From the web user interface
    - For the new node, click Action Menu (⋮) → Edit Labels.
    - Add cluster.ocs.openshift.io/openshift-storage and click Save.
  - From the command line interface
    - Execute the following command to apply the OpenShift Container Storage label to the new node:
      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Verification steps
- Execute the following command and verify that the new node is present in the output:
  $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
- Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
  - csi-cephfsplugin-*
  - csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
- Verify that new OSD pods are running on the replacement node.
  $ oc get pods -o wide -n openshift-storage | egrep -i new-node-name | egrep osd
- (Optional) If data encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
  For each of the new nodes identified in the previous step, do the following:
  - Create a debug pod and open a chroot environment for the selected host(s).
    $ oc debug node/<node name>
    $ chroot /host
  - Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).
    $ lsblk
- If verification steps fail, contact Red Hat Support.
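The annotation step for a failed node is described for the web console only. A command-line equivalent might look like the following sketch. It assumes the Machine objects live in the `openshift-machine-api` namespace, which is the usual location but is not stated in this procedure, and `annotate_skip_drain` is a hypothetical helper name.

```shell
# Hypothetical helper (not part of OpenShift Container Storage).
# Adds the exclude-node-draining annotation to a Machine so that deleting the
# machine does not wait for a drain of the failed node.
# Assumption: Machine objects are in the openshift-machine-api namespace.
annotate_skip_drain() {
  machine_name="$1"
  oc annotate machine "$machine_name" \
    machine.openshift.io/exclude-node-draining="" \
    -n openshift-machine-api
}
```

After running `annotate_skip_drain <machine_name>`, the machine can be deleted from the web console as described in the procedure.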
2.3. Replacing storage nodes on VMware infrastructure
To replace an operational node, see:
To replace a failed node, see:
2.3.1. Replacing an operational node on VMware user-provisioned infrastructure
Prerequisites
- Red Hat recommends that replacement nodes are configured with similar infrastructure, resources, and disks to the node being replaced.
- You must be logged into the OpenShift Container Platform (RHOCP) cluster.
- If you upgraded to OpenShift Container Storage version 4.8 from a previous version, and have not already created the LocalVolumeDiscovery and LocalVolumeSet objects, do so now by following the procedure described in Post-update configuration changes for clusters backed by local storage.
Procedure
- Identify the node and get labels on the node to be replaced.
  $ oc get nodes --show-labels | grep <node_name>
- Identify the mon (if any) and OSDs that are running in the node to be replaced.
  $ oc get pods -n openshift-storage -o wide | grep -i <node_name>
- Scale down the deployments of the pods identified in the previous step.
  For example:
  $ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
  $ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
  $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
- Mark the node as unschedulable.
  $ oc adm cordon <node_name>
- Drain the node.
  $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
- Delete the node.
  $ oc delete node <node_name>
- Log in to vSphere and terminate the identified VM.
- Create a new VM on VMware with the required infrastructure. See Supported Infrastructure and Platforms.
- Create a new OpenShift Container Platform worker node using the new VM.
- Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
  $ oc get csr
- Approve all required OpenShift Container Platform CSRs for the new node:
  $ oc adm certificate approve <Certificate_Name>
- Click Compute → Nodes in OpenShift Web Console, confirm that the new node is in Ready state.
- Apply the OpenShift Container Storage label to the new node using any one of the following:
  - From the user interface
    - For the new node, click Action Menu (⋮) → Edit Labels.
    - Add cluster.ocs.openshift.io/openshift-storage and click Save.
  - From the command line interface
    - Execute the following command to apply the OpenShift Container Storage label to the new node:
      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
- Identify the namespace where OpenShift local storage operator is installed and assign it to the local_storage_project variable:
  $ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
  For example:
  $ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
  $ echo $local_storage_project
  openshift-local-storage
- Add a new worker node to localVolumeDiscovery and localVolumeSet.
  - Update the localVolumeDiscovery definition to include the new node and remove the failed node.
    Remember to save before exiting the editor.
    In the above example, server3.example.com was removed and newnode.example.com is the new node.
  - Determine which localVolumeSet to edit.
    $ oc get -n $local_storage_project localvolumeset
    NAME         AGE
    localblock   25h
  - Update the localVolumeSet definition to include the new node and remove the failed node.
    Remember to save before exiting the editor.
    In the above example, server3.example.com was removed and newnode.example.com is the new node.
- Verify that the new localblock PV is available.
  $ oc get pv | grep localblock | grep Available
  local-pv-551d950 512Gi RWO Delete Available localblock 26s
- Change to the openshift-storage project.
  $ oc project openshift-storage
- Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.
  $ oc process -n openshift-storage ocs-osd-removal \
      -p FAILED_OSD_IDS=failed-osd-id1,failed-osd-id2 | oc create -f -
- Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.
  A status of Completed confirms that the OSD removal job succeeded.
  $ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
  Note: If ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
  $ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage
- Delete the ocs-osd-removal-job.
  $ oc delete -n openshift-storage job ocs-osd-removal-job
  Example output:
  job.batch "ocs-osd-removal-job" deleted
Verification steps
- Execute the following command and verify that the new node is present in the output:
  $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
- Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
  - csi-cephfsplugin-*
  - csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
- Ensure that the new incremental mon is created and is in the Running state.
  $ oc get pod -n openshift-storage | grep mon
  OSD and Mon might take several minutes to get to the Running state.
- Verify that new OSD pods are running on the replacement node.
  $ oc get pods -o wide -n openshift-storage | egrep -i new-node-name | egrep osd
- (Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
  For each of the new nodes identified in the previous step, do the following:
  - Create a debug pod and open a chroot environment for the selected host(s).
    $ oc debug node/<node name>
    $ chroot /host
  - Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).
    $ lsblk
- If verification steps fail, contact Red Hat Support.
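When many certificate signing requests are listed, it can help to pick out only the pending ones before approving them. The following is a minimal sketch, assuming that the CONDITION column is the last column of `oc get csr` output; `pending_csr_names` is a hypothetical helper name.

```shell
# Hypothetical helper (not part of OpenShift Container Platform).
# Reads `oc get csr` output on stdin and prints the names of requests whose
# last column (assumed to be CONDITION) is exactly "Pending".
pending_csr_names() {
  awk 'NR > 1 && $NF == "Pending" { print $1 }'
}
```

Running `oc get csr | pending_csr_names` lists the candidates. Review the list first, then pass each name that belongs to the new node to `oc adm certificate approve`.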
2.3.2. Replacing an operational node on VMware installer-provisioned infrastructure
Prerequisites
- Red Hat recommends that replacement nodes are configured with similar infrastructure, resources, and disks to the node being replaced.
- You must be logged into the OpenShift Container Platform (RHOCP) cluster.
- If you upgraded to OpenShift Container Storage version 4.8 from a previous version, and have not already created the LocalVolumeDiscovery and LocalVolumeSet objects, do so now by following the procedure described in Post-update configuration changes for clusters backed by local storage.
Procedure
- Log in to OpenShift Web Console and click Compute → Nodes.
- Identify the node that needs to be replaced. Take a note of its Machine Name.
- Get labels on the node to be replaced.
  $ oc get nodes --show-labels | grep <node_name>
- Identify the mon (if any) and OSDs that are running in the node to be replaced.
  $ oc get pods -n openshift-storage -o wide | grep -i <node_name>
- Scale down the deployments of the pods identified in the previous step.
  For example:
  $ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
  $ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
  $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
- Mark the node as unschedulable.
  $ oc adm cordon <node_name>
- Drain the node.
  $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
- Click Compute → Machines. Search for the required machine.
- Beside the required machine, click the Action menu (⋮) → Delete Machine.
- Click Delete to confirm the machine deletion. A new machine is automatically created.
- Wait for the new machine to start and transition into Running state.
  Important: This activity may take 5-10 minutes or longer.
- Click Compute → Nodes in OpenShift Web Console, confirm that the new node is in Ready state.
- Physically add a new device to the node.
- Apply the OpenShift Container Storage label to the new node using any one of the following:
  - From the user interface
    - For the new node, click Action Menu (⋮) → Edit Labels.
    - Add cluster.ocs.openshift.io/openshift-storage and click Save.
  - From the command line interface
    - Execute the following command to apply the OpenShift Container Storage label to the new node:
      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
- Identify the namespace where OpenShift local storage operator is installed and assign it to the local_storage_project variable:
  $ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
  For example:
  $ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
  $ echo $local_storage_project
  openshift-local-storage
- Add a new worker node to localVolumeDiscovery and localVolumeSet.
  - Update the localVolumeDiscovery definition to include the new node and remove the failed node.
    Remember to save before exiting the editor.
    In the above example, server3.example.com was removed and newnode.example.com is the new node.
  - Determine which localVolumeSet to edit.
    $ oc get -n $local_storage_project localvolumeset
    NAME         AGE
    localblock   25h
  - Update the localVolumeSet definition to include the new node and remove the failed node.
    Remember to save before exiting the editor.
    In the above example, server3.example.com was removed and newnode.example.com is the new node.
- Verify that the new localblock PV is available.
  $ oc get pv | grep localblock | grep Available
  local-pv-551d950 512Gi RWO Delete Available localblock 26s
- Change to the openshift-storage project.
  $ oc project openshift-storage
- Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.
  $ oc process -n openshift-storage ocs-osd-removal \
      -p FAILED_OSD_IDS=failed-osd-id1,failed-osd-id2 | oc create -f -
- Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.
  A status of Completed confirms that the OSD removal job succeeded.
  $ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
  Note: If ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
  $ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage
- Identify the PV associated with the PVC.
  $ oc get pv -L kubernetes.io/hostname | grep localblock | grep Released
  local-pv-d6bf175b 1490Gi RWO Delete Released openshift-storage/ocs-deviceset-0-data-0-6c5pw localblock 2d22h compute-1
  If there is a PV in Released state, delete it.
  $ oc delete pv <persistent-volume>
  For example:
  $ oc delete pv local-pv-d6bf175b
  persistentvolume "local-pv-d6bf175b" deleted
- Identify the crashcollector pod deployment.
  $ oc get deployment --selector=app=rook-ceph-crashcollector,node_name=failed-node-name -n openshift-storage
  If there is an existing crashcollector pod deployment, delete it.
  $ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=failed-node-name -n openshift-storage
- Delete the ocs-osd-removal-job.
  $ oc delete -n openshift-storage job ocs-osd-removal-job
  Example output:
  job.batch "ocs-osd-removal-job" deleted
Verification steps
- Execute the following command and verify that the new node is present in the output:
  $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
- Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
  - csi-cephfsplugin-*
  - csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
- Ensure that the new incremental mon is created and is in the Running state.
  $ oc get pod -n openshift-storage | grep mon
  OSD and Mon might take several minutes to get to the Running state.
- Verify that new OSD pods are running on the replacement node.
  $ oc get pods -o wide -n openshift-storage | egrep -i new-node-name | egrep osd
- (Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
  For each of the new nodes identified in the previous step, do the following:
  - Create a debug pod and open a chroot environment for the selected host(s).
    $ oc debug node/<node name>
    $ chroot /host
  - Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).
    $ lsblk
- If verification steps fail, contact Red Hat Support.
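The Released PV lookup in the procedure can be narrowed to just the PV names, which is convenient when more than one volume was left behind. The following is a minimal sketch, assuming the column order shown in the example output of this procedure (name, capacity, access mode, reclaim policy, status, claim, storage class); `released_local_pvs` is a hypothetical helper name.

```shell
# Hypothetical helper (not part of OpenShift Container Storage).
# Reads `oc get pv` output on stdin and prints the names of PVs that are in
# Released state and belong to the given storage class (default: localblock).
released_local_pvs() {
  storage_class="${1:-localblock}"
  awk -v sc="$storage_class" '$5 == "Released" && $7 == sc { print $1 }'
}
```

For example, `oc get pv | released_local_pvs` prints one PV name per line. Each name can then be reviewed and passed to `oc delete pv`.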
2.3.3. Replacing a failed node on VMware user-provisioned infrastructure
Prerequisites
- Red Hat recommends that replacement nodes are configured with similar infrastructure, resources, and disks to the node being replaced.
- You must be logged into the OpenShift Container Platform (RHOCP) cluster.
- If you upgraded to OpenShift Container Storage version 4.8 from a previous version, and have not already created the LocalVolumeDiscovery and LocalVolumeSet objects, do so now by following the procedure described in Post-update configuration changes for clusters backed by local storage.
Procedure
- Identify the node and get labels on the node to be replaced.
  $ oc get nodes --show-labels | grep <node_name>
- Identify the mon (if any) and OSDs that are running in the node to be replaced.
  $ oc get pods -n openshift-storage -o wide | grep -i <node_name>
- Scale down the deployments of the pods identified in the previous step.
  For example:
  $ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
  $ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
  $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
- Mark the node as unschedulable.
  $ oc adm cordon <node_name>
- Remove the pods which are in Terminating state.
  $ oc get pods -A -o wide | grep -i <node_name> | awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2 " --grace-period=0 " " --force ")}'
- Drain the node.
  $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
- Delete the node.
  $ oc delete node <node_name>
- Log in to vSphere and terminate the identified VM.
- Create a new VM on VMware with the required infrastructure. See Supported Infrastructure and Platforms.
- Create a new OpenShift Container Platform worker node using the new VM.
- Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
  $ oc get csr
- Approve all required OpenShift Container Platform CSRs for the new node:
  $ oc adm certificate approve <Certificate_Name>
- Click Compute → Nodes in OpenShift Web Console, confirm that the new node is in Ready state.
- Apply the OpenShift Container Storage label to the new node using any one of the following:
  - From the user interface
    - For the new node, click Action Menu (⋮) → Edit Labels.
    - Add cluster.ocs.openshift.io/openshift-storage and click Save.
  - From the command line interface
    - Execute the following command to apply the OpenShift Container Storage label to the new node:
      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Identify the namespace where OpenShift local storage operator is installed and assign it to
local_storage_projectvariable:local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local) echo $local_storage_project openshift-local-storage$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local) echo $local_storage_project openshift-local-storageCopy to Clipboard Copied! Toggle word wrap Toggle overflow Add a new worker node to
localVolumeDiscoveryandlocalVolumeSet.Update the
localVolumeDiscoverydefinition to include the new node and remove the failed node.Copy to Clipboard Copied! Toggle word wrap Toggle overflow Remember to save before exiting the editor.
In the above example,
server3.example.comwas removed andnewnode.example.comis the new node.Determine which
localVolumeSetto edit.oc get -n $local_storage_project localvolumeset NAME AGE localblock 25h
# oc get -n $local_storage_project localvolumeset NAME AGE localblock 25hCopy to Clipboard Copied! Toggle word wrap Toggle overflow Update the
localVolumeSetdefinition to include the new node and remove the failed node.Copy to Clipboard Copied! Toggle word wrap Toggle overflow Remember to save before exiting the editor.
In the above example,
server3.example.comwas removed andnewnode.example.comis the new node.
- Verify that the new localblock PV is available.
    $ oc get pv | grep localblock | grep Available
    local-pv-551d950   512Gi   RWO   Delete   Available   localblock   26s
- Change to the openshift-storage project.
    $ oc project openshift-storage
- Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.
    $ oc process -n openshift-storage ocs-osd-removal \
        -p FAILED_OSD_IDS=failed-osd-id1,failed-osd-id2 | oc create -f -
- Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod. A status of Completed confirms that the OSD removal job succeeded.
    # oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
  Note: If ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
    # oc logs -l job-name=ocs-osd-removal-job -n openshift-storage
- Delete the ocs-osd-removal-job.
    # oc delete -n openshift-storage job ocs-osd-removal-job
  Example output:
    job.batch "ocs-osd-removal-job" deleted
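The status check on the ocs-osd-removal-job pod in the procedure above can be scripted. The following is a minimal sketch, not part of the documented procedure: the function name osd_removal_completed is hypothetical, and it assumes the default `oc get pod` column layout (NAME, READY, STATUS, ...).

```shell
# Hypothetical helper (not shipped with the product): reads `oc get pod` output
# on stdin and succeeds only if at least one pod is listed and every listed pod
# has STATUS "Completed". Assumes STATUS is the third column.
osd_removal_completed() {
  awk '$1 != "NAME" && NF >= 3 { seen = 1; if ($3 != "Completed") bad = 1 }
       END { exit (seen && !bad) ? 0 : 1 }'
}

# Usage against a live cluster (assumes you are logged in with oc):
#   oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage \
#     | osd_removal_completed && echo "OSD removal job completed"
```

If the helper fails, the pod logs command from the note above is the next place to look.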
Verification steps
- Execute the following command and verify that the new node is present in the output:
    $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
- Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
  - csi-cephfsplugin-*
  - csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
- Ensure that the new incremental mon is created and is in the Running state.
    $ oc get pod -n openshift-storage | grep mon
  OSD and Mon might take several minutes to get to the Running state.
- Verify that new OSD pods are running on the replacement node.
    $ oc get pods -o wide -n openshift-storage | egrep -i new-node-name | egrep osd
- (Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
  For each of the new nodes identified in previous step, do the following:
  - Create a debug pod and open a chroot environment for the selected host(s).
      $ oc debug node/<node name>
      $ chroot /host
  - Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).
      $ lsblk
- If verification steps fail, contact Red Hat Support.
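The lsblk check in the optional encryption step can be done by eye, but a small filter makes it harder to miss a device. This is a sketch under assumptions: the function name unencrypted_ocs_devices is hypothetical, and it assumes that each lsblk line naming an ocs-deviceset device shows its type as a separate whitespace-delimited field.

```shell
# Hypothetical helper: reads lsblk output on stdin and prints every line that
# names an ocs-deviceset device but does not carry the "crypt" type.
# Exits 0 only if at least one ocs-deviceset line was seen and all were crypt.
unencrypted_ocs_devices() {
  awk '/ocs-deviceset/ {
         seen = 1
         is_crypt = 0
         for (i = 2; i <= NF; i++) if ($i == "crypt") is_crypt = 1
         if (!is_crypt) { print; bad = 1 }
       }
       END { exit (seen && !bad) ? 0 : 1 }'
}

# Usage inside the debug pod, after chroot /host:
#   lsblk | unencrypted_ocs_devices && echo "all ocs-deviceset devices show crypt"
```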
2.3.4. Replacing a failed node on VMware installer-provisioned infrastructure
Prerequisites
- Red Hat recommends that replacement nodes are configured with similar infrastructure, resources, and disks to the node being replaced.
- You must be logged into the OpenShift Container Platform (RHOCP) cluster.
- If you upgraded to OpenShift Container Storage version 4.8 from a previous version, and have not already created the LocalVolumeDiscovery and LocalVolumeSet objects, do so now by following the procedure described in Post-update configuration changes for clusters backed by local storage.
Procedure
- Log in to OpenShift Web Console and click Compute → Nodes.
- Identify the node that needs to be replaced. Take a note of its Machine Name.
- Get labels on the node to be replaced.
    $ oc get nodes --show-labels | grep <node_name>
- Identify the mon (if any) and OSDs that are running in the node to be replaced.
    $ oc get pods -n openshift-storage -o wide | grep -i <node_name>
- Scale down the deployments of the pods identified in the previous step.
  For example:
    $ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
    $ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
    $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
- Mark the node as unschedulable.
    $ oc adm cordon <node_name>
- Remove the pods which are in Terminating state.
    $ oc get pods -A -o wide | grep -i <node_name> | awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2 " --grace-period=0 " " --force ")}'
- Drain the node.
    $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
- Click Compute → Machines. Search for the required machine.
- Beside the required machine, click the Action menu (⋮) → Delete Machine.
- Click Delete to confirm the machine deletion. A new machine is automatically created.
- Wait for the new machine to start and transition into Running state.
  Important: This activity may take 5-10 minutes or more.
- Click Compute → Nodes in OpenShift Web Console, confirm if the new node is in Ready state.
- Physically add a new device to the node.
- Apply the OpenShift Container Storage label to the new node using any one of the following:
  - From User interface
    - For the new node, click Action Menu (⋮) → Edit Labels.
    - Add cluster.ocs.openshift.io/openshift-storage and click Save.
  - From Command line interface
    - Execute the following command to apply the OpenShift Container Storage label to the new node:
        $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
- Identify the namespace where OpenShift local storage operator is installed and assign it to local_storage_project variable:
    $ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
  For example:
    $ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
    $ echo $local_storage_project
    openshift-local-storage
- Add a new worker node to localVolumeDiscovery and localVolumeSet.
  - Update the localVolumeDiscovery definition to include the new node and remove the failed node. Remember to save before exiting the editor.
    In the above example, server3.example.com was removed and newnode.example.com is the new node.
  - Determine which localVolumeSet to edit.
      # oc get -n $local_storage_project localvolumeset
      NAME         AGE
      localblock   25h
  - Update the localVolumeSet definition to include the new node and remove the failed node. Remember to save before exiting the editor.
    In the above example, server3.example.com was removed and newnode.example.com is the new node.
- Verify that the new localblock PV is available.
    $ oc get pv | grep localblock | grep Available
    local-pv-551d950   512Gi   RWO   Delete   Available   localblock   26s
- Change to the openshift-storage project.
    $ oc project openshift-storage
- Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.
    $ oc process -n openshift-storage ocs-osd-removal \
        -p FAILED_OSD_IDS=failed-osd-id1,failed-osd-id2 | oc create -f -
- Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod. A status of Completed confirms that the OSD removal job succeeded.
    # oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
  Note: If ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
    # oc logs -l job-name=ocs-osd-removal-job -n openshift-storage
- Identify the PV associated with the PVC.
    # oc get pv -L kubernetes.io/hostname | grep localblock | grep Released
    local-pv-d6bf175b   1490Gi   RWO   Delete   Released   openshift-storage/ocs-deviceset-0-data-0-6c5pw   localblock   2d22h   compute-1
- If there is a PV in Released state, delete it.
    # oc delete pv <persistent-volume>
  For example:
    # oc delete pv local-pv-d6bf175b
    persistentvolume "local-pv-d6bf175b" deleted
- Identify the crashcollector pod deployment.
    $ oc get deployment --selector=app=rook-ceph-crashcollector,node_name=failed-node-name -n openshift-storage
- If there is an existing crashcollector pod deployment, delete it.
    $ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=failed-node-name -n openshift-storage
- Delete the ocs-osd-removal-job.
    # oc delete -n openshift-storage job ocs-osd-removal-job
  Example output:
    job.batch "ocs-osd-removal-job" deleted
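The Released PV step in the procedure above relies on two chained greps. As an illustration only, the same selection can be written as a single filter that also reports the host each PV belonged to. The function name released_localblock_pvs is hypothetical, and the sketch assumes the default `oc get pv` columns, with STATUS in the fifth column and the hostname label in the last.

```shell
# Hypothetical helper: reads `oc get pv -L kubernetes.io/hostname` output on
# stdin and prints "<pv-name> <hostname>" for each localblock PV in Released
# state. Assumes STATUS is column 5 and the hostname label is the last column;
# the storage class is matched by scanning the remaining fields.
released_localblock_pvs() {
  awk '$5 == "Released" {
         for (i = 6; i <= NF; i++)
           if ($i == "localblock") { print $1, $NF; break }
       }'
}

# Usage (review the list before deleting anything):
#   oc get pv -L kubernetes.io/hostname | released_localblock_pvs
```

The helper only lists candidates. Deleting a PV remains a deliberate manual step with `oc delete pv <persistent-volume>`.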
Verification steps
- Execute the following command and verify that the new node is present in the output:
    $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
- Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
  - csi-cephfsplugin-*
  - csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
- Ensure that the new incremental mon is created and is in the Running state.
    $ oc get pod -n openshift-storage | grep mon
  OSD and Mon might take several minutes to get to the Running state.
- Verify that new OSD pods are running on the replacement node.
    $ oc get pods -o wide -n openshift-storage | egrep -i new-node-name | egrep osd
- (Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
  For each of the new nodes identified in previous step, do the following:
  - Create a debug pod and open a chroot environment for the selected host(s).
      $ oc debug node/<node name>
      $ chroot /host
  - Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).
      $ lsblk
- If verification steps fail, contact Red Hat Support.
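The first verification step asks you to look for the new node in a list of labeled nodes. A minimal sketch of the same check as a yes/no test follows. The function name node_has_ocs_label is hypothetical, and the sketch assumes that `oc get nodes --show-labels` prints the node name in the first column and the comma-separated labels in the last.

```shell
# Hypothetical helper: reads `oc get nodes --show-labels` output on stdin and
# succeeds if the node named in $1 carries the OpenShift Container Storage
# label. Assumes NAME is the first column and LABELS is the last.
node_has_ocs_label() {
  awk -v node="$1" '
    $1 == node && index($NF, "cluster.ocs.openshift.io/openshift-storage=") { found = 1 }
    END { exit found ? 0 : 1 }'
}

# Usage:
#   oc get nodes --show-labels | node_has_ocs_label <new_node_name> \
#     && echo "label present"
```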
2.4. Replacing storage nodes on Red Hat Virtualization infrastructure
- To replace an operational node, see Section 2.4.1, “Replacing an operational node on Red Hat Virtualization installer-provisioned infrastructure”
- To replace a failed node, see Section 2.4.2, “Replacing a failed node on Red Hat Virtualization installer-provisioned infrastructure”
2.4.1. Replacing an operational node on Red Hat Virtualization installer-provisioned infrastructure
Use this procedure to replace an operational node on Red Hat Virtualization installer-provisioned infrastructure (IPI).
Prerequisites
- Red Hat recommends that replacement nodes are configured with similar infrastructure, resources and disks to the node being replaced.
- You must be logged into the OpenShift Container Platform (RHOCP) cluster.
- If you upgraded to OpenShift Container Storage version 4.8 from a previous version, and have not already created the LocalVolumeDiscovery and LocalVolumeSet objects, do so now by following the procedure described in Post-update configuration changes for clusters backed by local storage.
Procedure
- Log in to OpenShift Web Console and click Compute → Nodes.
- Identify the node that needs to be replaced. Take a note of its Machine Name.
- Get labels on the node to be replaced.
    $ oc get nodes --show-labels | grep <node_name>
- Identify the mon (if any) and OSDs that are running in the node to be replaced.
    $ oc get pods -n openshift-storage -o wide | grep -i <node_name>
- Scale down the deployments of the pods identified in the previous step.
  For example:
    $ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
    $ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
    $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
- Mark the node as unschedulable.
    $ oc adm cordon <node_name>
- Drain the node.
    $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
- Click Compute → Machines. Search for the required machine.
- Beside the required machine, click the Action menu (⋮) → Delete Machine.
- Click Delete to confirm the machine deletion. A new machine is automatically created. Wait for the new machine to start and transition into Running state.
  Important: This activity may take 5-10 minutes or more.
- Click Compute → Nodes in the OpenShift web console. Confirm if the new node is in Ready state.
- Physically add the new device(s) to the node.
- Apply the OpenShift Container Storage label to the new node using any one of the following:
  - From User interface
    - For the new node, click Action Menu (⋮) → Edit Labels.
    - Add cluster.ocs.openshift.io/openshift-storage and click Save.
  - From Command line interface
    - Execute the following command to apply the OpenShift Container Storage label to the new node:
        $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
- Identify the namespace where OpenShift local storage operator is installed and assign it to local_storage_project variable:
    $ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
  For example:
    $ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
    $ echo $local_storage_project
    openshift-local-storage
- Add a new worker node to localVolumeDiscovery and localVolumeSet.
  - Update the localVolumeDiscovery definition to include the new node and remove the failed node. Remember to save before exiting the editor.
    In the above example, server3.example.com was removed and newnode.example.com is the new node.
  - Determine which localVolumeSet to edit.
      # oc get -n $local_storage_project localvolumeset
      NAME         AGE
      localblock   25h
  - Update the localVolumeSet definition to include the new node and remove the failed node. Remember to save before exiting the editor.
    In the above example, server3.example.com was removed and newnode.example.com is the new node.
- Verify that the new localblock PV is available.
    $ oc get pv | grep localblock | grep Available
    local-pv-551d950   512Gi   RWO   Delete   Available   localblock   26s
- Change to the openshift-storage project.
    $ oc project openshift-storage
- Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.
    $ oc process -n openshift-storage ocs-osd-removal \
        -p FAILED_OSD_IDS=failed-osd-id1,failed-osd-id2 | oc create -f -
- Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod. A status of Completed confirms that the OSD removal job succeeded.
    # oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
  Note: If ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
    # oc logs -l job-name=ocs-osd-removal-job -n openshift-storage
- Identify the PV associated with the PVC.
    # oc get pv -L kubernetes.io/hostname | grep localblock | grep Released
    local-pv-d6bf175b   512Gi   RWO   Delete   Released   openshift-storage/ocs-deviceset-0-data-0-6c5pw   localblock   2d22h   server3.example.com
- If there is a PV in Released state, delete it.
    # oc delete pv <persistent-volume>
  For example:
    # oc delete pv local-pv-d6bf175b
    persistentvolume "local-pv-d6bf175b" deleted
- Identify the crashcollector pod deployment.
    $ oc get deployment --selector=app=rook-ceph-crashcollector,node_name=failed-node-name -n openshift-storage
- If there is an existing crashcollector pod, delete it.
    $ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=failed-node-name -n openshift-storage
- Delete the ocs-osd-removal job.
    # oc delete -n openshift-storage job ocs-osd-removal-job
  Example output:
    job.batch "ocs-osd-removal-job" deleted
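The scale-down step earlier in this procedure takes deployment names, while `oc get pods` lists pod names. The sketch below derives one from the other. It rests on an assumption about naming that the procedure does not state: a pod managed by a Deployment is named <deployment>-<replicaset hash>-<pod suffix>. The function name mon_osd_deployments is hypothetical.

```shell
# Hypothetical helper: reads pod names on stdin (one per line) and prints the
# deployment names of the mon and OSD pods among them.
# Assumption: the last two dash-separated parts of a pod name are the
# ReplicaSet hash and the pod suffix, so they are stripped.
# OSD prepare pods belong to jobs, not deployments, and are skipped.
mon_osd_deployments() {
  grep -E '^rook-ceph-(mon|osd)-' \
    | grep -v -- '-osd-prepare-' \
    | sed -E 's/-[^-]+-[^-]+$//' \
    | sort -u
}

# Usage (prints names only; scaling stays a separate, explicit step):
#   oc get pods -n openshift-storage -o wide | grep -i <node_name> \
#     | awk '{print $1}' | mon_osd_deployments
```

Each printed name can then be passed to `oc scale deployment <name> --replicas=0 -n openshift-storage` as in the example commands of that step.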
Verification steps
- Execute the following command and verify that the new node is present in the output:
    $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
- Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
  - csi-cephfsplugin-*
  - csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
- Ensure that the new incremental mon is created and is in the Running state.
    $ oc get pod -n openshift-storage | grep mon
  Example output:
    rook-ceph-mon-a-cd575c89b-b6k66    2/2   Running   0   38m
    rook-ceph-mon-b-6776bc469b-tzzt8   2/2   Running   0   38m
    rook-ceph-mon-d-5ff5d488b5-7v8xh   2/2   Running   0   4m8s
  OSD and Mon might take several minutes to get to the Running state.
- Verify that new OSD pods are running on the replacement node.
    $ oc get pods -o wide -n openshift-storage | egrep -i new-node-name | egrep osd
- (Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
  For each of the new nodes identified in previous step, do the following:
  - Create a debug pod and open a chroot environment for the selected host(s).
      $ oc debug node/<node name>
      $ chroot /host
  - Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).
      $ lsblk
- If verification steps fail, contact Red Hat Support.
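Because mon pods can take several minutes to reach Running, the mon check in the verification steps is often repeated. A small counter, sketched here with the hypothetical name running_mon_count, makes the result easy to compare between runs. It assumes the default `oc get pod` columns with STATUS in the third position.

```shell
# Hypothetical helper: reads `oc get pod -n openshift-storage` output on stdin
# and prints how many rook-ceph-mon pods have STATUS "Running".
# Assumes NAME is column 1 and STATUS is column 3.
running_mon_count() {
  awk '$1 ~ /^rook-ceph-mon-/ && $3 == "Running" { n++ } END { print n + 0 }'
}

# Usage:
#   oc get pod -n openshift-storage | running_mon_count
```

For the example output shown in the verification steps, the helper prints 3.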
2.4.2. Replacing a failed node on Red Hat Virtualization installer-provisioned infrastructure
Perform this procedure to replace a failed node which is not operational on Red Hat Virtualization installer-provisioned infrastructure (IPI) for OpenShift Container Storage.
Prerequisites
- Red Hat recommends that replacement nodes are configured with similar infrastructure, resources and disks to the node being replaced.
- You must be logged into the OpenShift Container Platform (RHOCP) cluster.
- If you upgraded to OpenShift Container Storage version 4.8 from a previous version, and have not already created the LocalVolumeDiscovery and LocalVolumeSet objects, do so now by following the procedure described in Post-update configuration changes for clusters backed by local storage.
Procedure
- Log in to OpenShift Web Console and click Compute → Nodes.
- Identify the node that needs to be replaced. Take a note of its Machine Name.
- Get the labels on the node to be replaced.
    $ oc get nodes --show-labels | grep <node_name>
- Identify the mon (if any) and OSDs that are running in the node to be replaced.
    $ oc get pods -n openshift-storage -o wide | grep -i <node_name>
- Scale down the deployments of the pods identified in the previous step.
  For example:
    $ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
    $ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
    $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
- Mark the node as unschedulable.
    $ oc adm cordon <node_name>
- Remove the pods which are in the Terminating state.
    $ oc get pods -A -o wide | grep -i <node_name> | awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2 " --grace-period=0 " " --force ")}'
- Drain the node.
    $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
- Click Compute → Machines. Search for the required machine.
- Beside the required machine, click the Action menu (⋮) → Delete Machine.
- Click Delete to confirm the machine deletion. A new machine is automatically created. Wait for the new machine to start and transition into Running state.
  Important: This activity may take 5-10 minutes or more.
- Click Compute → Nodes in the OpenShift web console. Confirm if the new node is in Ready state.
- Physically add the new device(s) to the node.
- Apply the OpenShift Container Storage label to the new node using any one of the following:
  - From User interface
    - For the new node, click Action Menu (⋮) → Edit Labels.
    - Add cluster.ocs.openshift.io/openshift-storage and click Save.
  - From Command line interface
    - Execute the following command to apply the OpenShift Container Storage label to the new node:
        $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
- Identify the namespace where OpenShift local storage operator is installed and assign it to local_storage_project variable:
    $ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
  For example:
    $ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
    $ echo $local_storage_project
    openshift-local-storage
- Add a new worker node to localVolumeDiscovery and localVolumeSet.
  - Update the localVolumeDiscovery definition to include the new node and remove the failed node. Remember to save before exiting the editor.
    In the above example, server3.example.com was removed and newnode.example.com is the new node.
  - Determine which localVolumeSet to edit.
      # oc get -n $local_storage_project localvolumeset
      NAME         AGE
      localblock   25h
  - Update the localVolumeSet definition to include the new node and remove the failed node. Remember to save before exiting the editor.
    In the above example, server3.example.com was removed and newnode.example.com is the new node.
Verify that the new
localblockPV is available.$oc get pv | grep localblock | grep Available local-pv-551d950 512Gi RWO Delete Available localblock 26s
$oc get pv | grep localblock | grep Available local-pv-551d950 512Gi RWO Delete Available localblock 26sCopy to Clipboard Copied! Toggle word wrap Toggle overflow Change to the
openshift-storageproject.oc project openshift-storage
$ oc project openshift-storageCopy to Clipboard Copied! Toggle word wrap Toggle overflow Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.
oc process -n openshift-storage ocs-osd-removal \ -p FAILED_OSD_IDS=failed-osd-id1,failed-osd-id2 | oc create -f -
$ oc process -n openshift-storage ocs-osd-removal \ -p FAILED_OSD_IDS=failed-osd-id1,failed-osd-id2 | oc create -f -Copy to Clipboard Copied! Toggle word wrap Toggle overflow Verify that the OSD was removed successfully by checking the status of the
ocs-osd-removal-jobpod.A status of
Completedconfirms that the OSD removal job succeeded.oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storageCopy to Clipboard Copied! Toggle word wrap Toggle overflow NoteIf
ocs-osd-removal-jobfails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:oc logs -l job-name=ocs-osd-removal-job -n openshift-storage
# oc logs -l job-name=ocs-osd-removal-job -n openshift-storageCopy to Clipboard Copied! Toggle word wrap Toggle overflow Identify the PV associated with the PVC.
oc get pv -L kubernetes.io/hostname | grep localblock | grep Released local-pv-d6bf175b 512Gi RWO Delete Released openshift-storage/ocs-deviceset-0-data-0-6c5pw localblock 2d22h server3.example.com
# oc get pv -L kubernetes.io/hostname | grep localblock | grep Released local-pv-d6bf175b 512Gi RWO Delete Released openshift-storage/ocs-deviceset-0-data-0-6c5pw localblock 2d22h server3.example.comCopy to Clipboard Copied! Toggle word wrap Toggle overflow If there is a PV in Released state, delete it.
oc delete pv <persistent-volume>
# oc delete pv <persistent-volume>Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc delete pv local-pv-d6bf175b persistentvolume "local-pv-d6bf175b" deleted
# oc delete pv local-pv-d6bf175b persistentvolume "local-pv-d6bf175b" deletedCopy to Clipboard Copied! Toggle word wrap Toggle overflow Identify the
crashcollectorpod deployment.oc get deployment --selector=app=rook-ceph-crashcollector,node_name=failed-node-name -n openshift-storage
$ oc get deployment --selector=app=rook-ceph-crashcollector,node_name=failed-node-name -n openshift-storageCopy to Clipboard Copied! Toggle word wrap Toggle overflow If there is an existing crashcollector pod deployment, delete it.
$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=failed-node-name -n openshift-storage
Delete the ocs-osd-removal job.
# oc delete -n openshift-storage job ocs-osd-removal-job
Example output:
job.batch "ocs-osd-removal-job" deleted
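The job status check in this procedure (run before the job is deleted) can be scripted. The following is a minimal sketch, not part of the product tooling: removal_job_status is a hypothetical helper name, and it assumes the default oc get pod column order (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Hypothetical helper: reads `oc get pod` output on stdin and prints the
# STATUS column of every pod whose name starts with ocs-osd-removal-job.
# Assumes the default column order: NAME READY STATUS RESTARTS AGE.
removal_job_status() {
  awk '$1 ~ /^ocs-osd-removal-job/ { print $3 }'
}
```

For example, piping the output of oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage into removal_job_status should print Completed once the removal job has succeeded.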
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all other required OpenShift Container Storage pods are in Running state.
Ensure that the new incremental mon is created and is in the Running state.
$ oc get pod -n openshift-storage | grep mon
Example output:
rook-ceph-mon-a-cd575c89b-b6k66    2/2   Running   0   38m
rook-ceph-mon-b-6776bc469b-tzzt8   2/2   Running   0   38m
rook-ceph-mon-d-5ff5d488b5-7v8xh   2/2   Running   0   4m8s
OSD and Mon might take several minutes to get to the Running state.
Verify that new OSD pods are running on the replacement node.
$ oc get pods -o wide -n openshift-storage| egrep -i new-node-name | egrep osd
(Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host(s).
$ oc debug node/<node name>
$ chroot /host
Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).
$ lsblk
- If verification steps fail, contact Red Hat Support.
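The pod-state checks above can be narrowed to the pods that still need attention. This is a minimal sketch under stated assumptions: pods_not_running is a hypothetical helper name, and the input is assumed to be plain oc get pod output with STATUS in the third column.

```shell
# Hypothetical helper: reads `oc get pod` output on stdin and prints the
# names of pods whose STATUS column is anything other than Running.
# The header row (first field "NAME") is skipped if present.
pods_not_running() {
  awk '$1 != "NAME" && $3 != "Running" { print $1 }'
}
```

For example, oc get pod -n openshift-storage | grep mon | pods_not_running prints nothing when every mon pod is Running. Because OSD and Mon pods can take several minutes to start, an initially non-empty result is not necessarily a failure.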
2.5. Replacing storage nodes on IBM Power Systems infrastructure
For OpenShift Container Storage, node replacement can be performed proactively for an operational node and reactively for a failed node for the IBM Power Systems related deployments.
2.5.1. Replacing an operational or failed storage node on IBM Power Systems
Prerequisites
- Red Hat recommends that replacement nodes are configured with similar infrastructure and resources to the node being replaced.
- You must be logged in to the OpenShift Container Platform (RHOCP) cluster.
- If you upgraded to OpenShift Container Storage 4.8 from a previous version and have not already created the LocalVolumeDiscovery object, do so now following the procedure described in Post-update configuration changes for clusters backed by local storage.
Procedure
Identify the node and get labels on the node to be replaced.
$ oc get nodes --show-labels | grep <node_name>
Identify the mon (if any) and object storage device (OSD) pods that are running in the node to be replaced.
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
Scale down the deployments of the pods identified in the previous step.
For example:
$ oc scale deployment rook-ceph-mon-a --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-1 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
Mark the node as unschedulable.
$ oc adm cordon <node_name>
Remove the pods which are in Terminating state.
$ oc get pods -A -o wide | grep -i <node_name> | awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2 " --grace-period=0 " " --force ")}'
Drain the node.
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
Delete the node.
$ oc delete node <node_name>
- Get a new IBM Power machine with the required infrastructure. See Installing a cluster on IBM Power Systems.
- Create a new OpenShift Container Platform node using the new IBM Power Systems machine.
Check for certificate signing requests (CSRs) related to OpenShift Container Storage that are in Pending state:
$ oc get csr
Approve all required OpenShift Container Storage CSRs for the new node:
$ oc adm certificate approve <Certificate_Name>
- In the OpenShift Web Console, click Compute → Nodes and confirm that the new node is in Ready state.
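When several CSRs are listed, it can help to extract only the pending ones before approving them. The following is a minimal sketch: pending_csrs is a hypothetical helper name, and it assumes that CONDITION is the last column of the oc get csr output.

```shell
# Hypothetical helper: reads `oc get csr` output on stdin and prints the
# names of requests whose last column (CONDITION) is exactly "Pending".
# Rows such as "Approved,Issued" and the header row are ignored.
pending_csrs() {
  awk '$NF == "Pending" { print $1 }'
}
```

For example, oc get csr | pending_csrs lists the candidate names. Review the list and confirm that each request belongs to the new node before passing a name to oc adm certificate approve.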
Apply the OpenShift Container Storage label to the new node using your preferred interface:
- From User interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage and click Save.
- From Command line interface
- Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=''
Identify the namespace where OpenShift local storage operator is installed and assign it to the local_storage_project variable:
$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
For example:
$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
$ echo $local_storage_project
openshift-local-storage
Add a new worker node to localVolumeDiscovery.
Update the localVolumeDiscovery definition to include the new node and remove the failed node.
Remember to save before exiting the editor.
In the above example, worker-0 was removed and worker-3 is the new node.
Add the newly added worker node to localVolume.
Determine which localVolume to edit.
# oc get -n $local_storage_project localvolume
NAME AGE
localblock 25h
Update the localVolume definition to include the new node and remove the failed node.
Remember to save before exiting the editor.
In the above example, worker-0 was removed and worker-3 is the new node.
Verify that the new localblock PV is available.
Change to the openshift-storage project.
$ oc project openshift-storage
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.
Identify the PVC, because the PV associated with that specific PVC must be deleted afterwards.
$ osd_id_to_remove=1
$ oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc
where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-1.
Example output:
ceph.rook.io/pvc: ocs-deviceset-localblock-0-data-0-g2mmc
ceph.rook.io/pvc: ocs-deviceset-localblock-0-data-0-g2mmc
In this example, the PVC name is ocs-deviceset-localblock-0-data-0-g2mmc.
Remove the failed OSD from the cluster.
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} |oc create -f -
You can remove more than one OSD by adding comma-separated OSD IDs in the command (for example, FAILED_OSD_IDS=0,1,2).
Warning: This step results in the OSD being completely removed from the cluster. Ensure that the correct value of osd_id_to_remove is provided.
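Because a wrong ID removes the wrong OSD, it may be worth validating the IDs before building the FAILED_OSD_IDS value. The following is a minimal sketch: build_failed_osd_ids is a hypothetical helper name that is not part of the product, and it only checks that each argument is a non-negative integer, not that the OSD exists or has failed.

```shell
# Hypothetical helper: joins its arguments into a comma-separated list
# suitable for FAILED_OSD_IDS, rejecting anything that is not a
# non-negative integer. The body runs in a subshell, so the IFS change
# and the early exits do not affect the calling shell.
build_failed_osd_ids() (
  if [ "$#" -eq 0 ]; then
    echo "no OSD IDs given" >&2
    exit 1
  fi
  for id in "$@"; do
    case "$id" in
      ''|*[!0-9]*) echo "not an OSD ID: $id" >&2; exit 1 ;;
    esac
  done
  IFS=,
  printf '%s\n' "$*"
)
```

For example, build_failed_osd_ids 0 1 2 prints 0,1,2, which can then be supplied as -p FAILED_OSD_IDS=0,1,2 to the oc process command shown above.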
Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal-job pod. A status of Completed confirms that the OSD removal job succeeded.
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
Note: If ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage
Delete the PV associated with the failed node.
Identify the PV associated with the PVC. The PVC name must be identical to the one obtained in Step 16(a).
# oc get pv -L kubernetes.io/hostname | grep localblock | grep Released
local-pv-5c9b8982 500Gi RWO Delete Released openshift-storage/ocs-deviceset-localblock-0-data-0-g2mmc localblock 24h worker-0
Delete the PV.
# oc delete pv <persistent-volume>
For example:
# oc delete pv local-pv-5c9b8982
persistentvolume "local-pv-5c9b8982" deleted
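Matching the Released PV to the PVC by eye is error-prone when several PVs are listed. The following is a minimal sketch: released_pv_for_claim is a hypothetical helper name, and it assumes the oc get pv column order shown in the example output above (STATUS in the fifth column, CLAIM in the sixth) and the openshift-storage namespace.

```shell
# Hypothetical helper: reads `oc get pv` output on stdin and prints the
# name of each PV that is in Released state and whose CLAIM column equals
# openshift-storage/<pvc-name>, where <pvc-name> is the first argument.
released_pv_for_claim() {
  awk -v claim="openshift-storage/$1" \
    '$5 == "Released" && $6 == claim { print $1 }'
}
```

With the example values from this procedure, oc get pv -L kubernetes.io/hostname | released_pv_for_claim ocs-deviceset-localblock-0-data-0-g2mmc would print local-pv-5c9b8982, the PV to delete.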
Delete the crashcollector pod deployment.
$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> -n openshift-storage
Delete the ocs-osd-removal-job.
# oc delete -n openshift-storage job ocs-osd-removal-job
Example output:
job.batch "ocs-osd-removal-job" deleted
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all other required OpenShift Container Storage pods are in Running state.
Ensure that the new incremental mon is created and is in the Running state.
$ oc get pod -n openshift-storage | grep mon
Example output:
rook-ceph-mon-b-74f6dc9dd6-4llzq   1/1   Running   0   6h14m
rook-ceph-mon-c-74948755c-h7wtx    1/1   Running   0   4h24m
rook-ceph-mon-d-598f69869b-4bv49   1/1   Running   0   162m
OSD and Mon might take several minutes to get to the Running state.
Verify that new OSD pods are running on the replacement node.
$ oc get pods -o wide -n openshift-storage| egrep -i new-node-name | egrep osd
(Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host(s).
$ oc debug node/<node name>
$ chroot /host
Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).
$ lsblk
- If verification steps fail, contact Red Hat Support.
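The encryption check above relies on reading the lsblk TYPE column. Device names can themselves contain the string "crypt", so a plain text search may give false positives. The following is a minimal sketch: crypt_devicesets is a hypothetical helper name, and it assumes two-column input as produced by lsblk -o NAME,TYPE.

```shell
# Hypothetical helper: reads `lsblk -o NAME,TYPE` output on stdin and
# prints each ocs-deviceset device whose TYPE column is exactly "crypt".
# Leading tree-drawing characters are stripped from the device name.
crypt_devicesets() {
  awk '$1 ~ /ocs-deviceset/ && $2 == "crypt" {
    name = $1
    sub(/^[^A-Za-z0-9]+/, "", name)
    print name
  }'
}
```

Feed it the output of lsblk -o NAME,TYPE captured from the chroot environment on the node. An empty result suggests that no ocs-deviceset device on that node is of type crypt.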