OpenShift Container Storage is now OpenShift Data Foundation starting with version 4.9.
Chapter 9. Replacing storage nodes for OpenShift Container Storage
For OpenShift Container Storage, node replacement can be performed proactively for an operational node and reactively for a failed node for the following deployments:
For Amazon Web Services (AWS)
- User-provisioned infrastructure
- Installer-provisioned infrastructure
For VMware
- User-provisioned infrastructure
For local storage devices
- Bare metal
- Amazon EC2 I3
- VMware
- For replacing your storage nodes in external mode, see Red Hat Ceph Storage documentation.
9.1. OpenShift Container Storage deployed on AWS
9.1.1. Replacing an operational AWS node on user-provisioned infrastructure
Perform this procedure to replace an operational node on AWS user-provisioned infrastructure.
Procedure
- Identify the node that needs to be replaced.
Mark the node as unschedulable using the following command:
$ oc adm cordon <node_name>
Drain the node using the following command:
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
Important: This activity can take 5-10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
Delete the node using the following command:
$ oc delete nodes <node_name>
- Create a new AWS machine instance with the required infrastructure. See Platform requirements.
- Create a new OpenShift Container Platform node using the new AWS machine instance.
Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
$ oc get csr
Approve all required OpenShift Container Platform CSRs for the new node:
$ oc adm certificate approve <Certificate_Name>
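If there are many pending CSRs, you can approve them in a single step. The following one-liner is a sketch, assuming the standard oc and xargs tools are available; it approves every CSR that does not yet have a status, so review the pending list first:
$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve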
- Click Compute → Nodes, and confirm if the new node is in Ready state. Apply the OpenShift Container Storage label to the new node using any one of the following:
- From User interface
  - For the new node, click Action Menu (⋮) → Edit Labels.
  - Add cluster.ocs.openshift.io/openshift-storage and click Save.
- From Command line interface
Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
- If verification steps fail, contact Red Hat Support.
9.1.2. Replacing an operational AWS node on installer-provisioned infrastructure
Use this procedure to replace an operational node on AWS installer-provisioned infrastructure (IPI).
Procedure
- Log in to OpenShift Web Console and click Compute → Nodes.
- Identify the node that needs to be replaced. Take a note of its Machine Name.
Mark the node as unschedulable using the following command:
$ oc adm cordon <node_name>
Drain the node using the following command:
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
Important: This activity can take 5-10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
- Click Compute → Machines. Search for the required machine.
- Beside the required machine, click the Action menu (⋮) → Delete Machine.
- Click Delete to confirm the machine deletion. A new machine is automatically created.
Wait for the new machine to start and transition into the Running state.
Important: This activity can take 5-10 minutes or more.
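To follow the replacement machine as it comes up, you can watch the machine objects; this is a sketch, assuming the cluster uses the default openshift-machine-api namespace for its machines:
$ oc get machines -n openshift-machine-api -w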
- Click Compute → Nodes, and confirm if the new node is in Ready state. Apply the OpenShift Container Storage label to the new node using any one of the following:
- From User interface
  - For the new node, click Action Menu (⋮) → Edit Labels.
  - Add cluster.ocs.openshift.io/openshift-storage and click Save.
- From Command line interface
Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
- If verification steps fail, contact Red Hat Support.
9.1.3. Replacing a failed AWS node on user-provisioned infrastructure
Perform this procedure to replace a failed node which is not operational on AWS user-provisioned infrastructure (UPI) for OpenShift Container Storage.
Procedure
- Identify the AWS machine instance of the node that needs to be replaced.
- Log in to AWS and terminate the identified AWS machine instance.
- Create a new AWS machine instance with the required infrastructure. See platform requirements.
- Create a new OpenShift Container Platform node using the new AWS machine instance.
Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
$ oc get csr
Approve all required OpenShift Container Platform CSRs for the new node:
$ oc adm certificate approve <Certificate_Name>
- Click Compute → Nodes, and confirm if the new node is in Ready state. Apply the OpenShift Container Storage label to the new node using any one of the following:
- From User interface
  - For the new node, click Action Menu (⋮) → Edit Labels.
  - Add cluster.ocs.openshift.io/openshift-storage and click Save.
- From Command line interface
Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
- If verification steps fail, contact Red Hat Support.
9.1.4. Replacing a failed AWS node on installer-provisioned infrastructure
Perform this procedure to replace a failed node which is not operational on AWS installer-provisioned infrastructure (IPI) for OpenShift Container Storage.
Procedure
- Log in to OpenShift Web Console and click Compute → Nodes.
- Identify the faulty node and click on its Machine Name.
- Click Actions → Edit Annotations, and click Add More.
- Add machine.openshift.io/exclude-node-draining and click Save.
- Click Actions → Delete Machine, and click Delete. A new machine is automatically created. Wait for the new machine to start.
Important: This activity can take 5-10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
- Click Compute → Nodes, and confirm if the new node is in Ready state. Apply the OpenShift Container Storage label to the new node using any one of the following:
- From User interface
  - For the new node, click Action Menu (⋮) → Edit Labels.
  - Add cluster.ocs.openshift.io/openshift-storage and click Save.
- From Command line interface
Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
- If verification steps fail, contact Red Hat Support.
9.2. OpenShift Container Storage deployed on VMware
9.2.1. Replacing an operational VMware node on user-provisioned infrastructure
Perform this procedure to replace an operational node on VMware user-provisioned infrastructure (UPI).
Procedure
- Identify the node and its VM that needs to be replaced.
Mark the node as unschedulable using the following command:
$ oc adm cordon <node_name>
Drain the node using the following command:
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
Important: This activity can take 5-10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
Delete the node using the following command:
$ oc delete nodes <node_name>
Log in to vSphere and terminate the identified VM.
Important: The VM must be deleted only from the inventory and not from the disk.
- Create a new VM on vSphere with the required infrastructure. See Platform requirements.
- Create a new OpenShift Container Platform worker node using the new VM.
Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
$ oc get csr
Approve all required OpenShift Container Platform CSRs for the new node:
$ oc adm certificate approve <Certificate_Name>
- Click Compute → Nodes, and confirm if the new node is in Ready state. Apply the OpenShift Container Storage label to the new node using any one of the following:
- From User interface
  - For the new node, click Action Menu (⋮) → Edit Labels.
  - Add cluster.ocs.openshift.io/openshift-storage and click Save.
- From Command line interface
Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
- If verification steps fail, contact Red Hat Support.
9.2.2. Replacing a failed VMware node on user-provisioned infrastructure
Perform this procedure to replace a failed node on VMware user-provisioned infrastructure (UPI).
Procedure
- Identify the node and its VM that needs to be replaced.
Delete the node using the following command:
$ oc delete nodes <node_name>
Log in to vSphere and terminate the identified VM.
Important: The VM must be deleted only from the inventory and not from the disk.
- Create a new VM on vSphere with the required infrastructure. See Platform requirements.
- Create a new OpenShift Container Platform worker node using the new VM.
Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
$ oc get csr
Approve all required OpenShift Container Platform CSRs for the new node:
$ oc adm certificate approve <Certificate_Name>
- Click Compute → Nodes, and confirm if the new node is in Ready state. Apply the OpenShift Container Storage label to the new node using any one of the following:
- From User interface
  - For the new node, click Action Menu (⋮) → Edit Labels.
  - Add cluster.ocs.openshift.io/openshift-storage and click Save.
- From Command line interface
Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
- If verification steps fail, contact Red Hat Support.
9.3. OpenShift Container Storage deployed using local storage devices
9.3.1. Replacing storage nodes on bare metal infrastructure
- To replace an operational node, see Section 9.3.1.1, “Replacing an operational node on bare metal user-provisioned infrastructure”
- To replace a failed node, see Section 9.3.1.2, “Replacing a failed node on bare metal user-provisioned infrastructure”
9.3.1.1. Replacing an operational node on bare metal user-provisioned infrastructure
Prerequisites
- You must be logged into the OpenShift Container Platform (OCP) cluster.
Procedure
Identify the node and get labels on the node to be replaced. Make a note of the rack label.
$ oc get nodes --show-labels | grep <node_name>
Identify the mon (if any) and object storage device (OSD) pods that are running in the node to be replaced.
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
Scale down the deployments of the pods identified in the previous step.
For example:
$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
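Before cordoning the node, you can confirm that the scaled-down deployments report zero ready replicas; this check is a sketch that reuses the example deployment names from the previous step:
$ oc get deployment rook-ceph-mon-c rook-ceph-osd-0 -n openshift-storage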
Mark the node as unschedulable.
$ oc adm cordon <node_name>
Drain the node.
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
Delete the node.
$ oc delete node <node_name>
- Get a new bare metal machine with required infrastructure. See Installing a cluster on bare metal.
- Create a new OpenShift Container Platform node using the new bare metal machine.
Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
$ oc get csr
Approve all required OpenShift Container Platform CSRs for the new node:
$ oc adm certificate approve <Certificate_Name>
- Click Compute → Nodes in OpenShift Web Console, and confirm if the new node is in Ready state. Apply the OpenShift Container Storage label to the new node using any one of the following:
- From User interface
  - For the new node, click Action Menu (⋮) → Edit Labels.
  - Add cluster.ocs.openshift.io/openshift-storage and click Save.
- From Command line interface
- Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Add the local storage devices available in these worker nodes to the OpenShift Container Storage StorageCluster.
Add a new disk entry to the LocalVolume CR.
Edit the LocalVolume CR and remove or comment out the failed device /dev/disk/by-id/{id} and add the new /dev/disk/by-id/{id}. In this example, the new device is /dev/disk/by-id/nvme-INTEL_SSDPEKKA128G7_BTPYB89THF49128A.
# oc get -n local-storage localvolume
Example output:
NAME          AGE
local-block   25h
# oc edit -n local-storage localvolume local-block
Example output:
[...]
  storageClassDevices:
  - devicePaths:
    - /dev/disk/by-id/nvme-INTEL_SSDPEKKA128G7_BTPY81260978128A
    - /dev/disk/by-id/nvme-INTEL_SSDPEKKA128G7_BTPY80440W5U128A
    - /dev/disk/by-id/nvme-INTEL_SSDPEKKA128G7_BTPYB85AABDE128A
    - /dev/disk/by-id/nvme-INTEL_SSDPEKKA128G7_BTPYB89THF49128A
    storageClassName: localblock
    volumeMode: Block
[...]
Make sure to save the changes after editing the CR.
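If you are unsure which by-id path corresponds to the new disk, you can list the device identifiers directly on the new node. This is a sketch that assumes oc debug access to the host; <new_node_name> is a placeholder:
$ oc debug node/<new_node_name> -- chroot /host ls -l /dev/disk/by-id/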
Display PVs with localblock.
$ oc get pv | grep localblock
Example output:
local-pv-3e8964d3   931Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-2-0-79j94   localblock   25h
local-pv-414755e0   931Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-1-0-959rp   localblock   25h
local-pv-b481410    931Gi   RWO   Delete   Available                                               localblock   3m24s
local-pv-d9c5cbd6   931Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-0-0-nvs68   localblock
Delete the PV associated with the failed node.
Identify the DeviceSet associated with the OSD to be replaced.
# osd_id_to_remove=0
# oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc
where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.
Example output:
ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
In this example, the PVC name is ocs-deviceset-0-0-nvs68.
Identify the PV associated with the PVC.
# oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>
where x, y, and pvc-suffix are the values in the DeviceSet identified in the previous step.
Example output:
NAME                      STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ocs-deviceset-0-0-nvs68   Bound    local-pv-d9c5cbd6   931Gi      RWO            localblock     24h
In this example, the associated PV is local-pv-d9c5cbd6.
Delete the PVC.
# oc delete pvc <pvc-name> -n openshift-storage
Delete the PV.
# oc delete pv local-pv-d9c5cbd6
Example output:
persistentvolume "local-pv-d9c5cbd6" deleted
Remove the failed OSD from the cluster.
# oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=${osd_id_to_remove} | oc create -f -
Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal pod. A status of Completed confirms that the OSD removal job succeeded.
# oc get pod -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage
Note: If ocs-osd-removal fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
# oc logs -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage --tail=-1
Delete OSD pod deployment and crashcollector pod deployment.
$ oc delete deployment rook-ceph-osd-${osd_id_to_remove} -n openshift-storage
$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<old_node_name> -n openshift-storage
Deploy the new OSD by restarting the rook-ceph-operator to force operator reconciliation.
# oc get -n openshift-storage pod -l app=rook-ceph-operator
Example output:
NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-6f74fb5bff-2d982   1/1     Running   0          1d20h
Delete the rook-ceph-operator.
# oc delete -n openshift-storage pod rook-ceph-operator-6f74fb5bff-2d982
Example output:
pod "rook-ceph-operator-6f74fb5bff-2d982" deleted
Verify that the rook-ceph-operator pod is restarted.
# oc get -n openshift-storage pod -l app=rook-ceph-operator
Example output:
NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-6f74fb5bff-7mvrq   1/1     Running   0          66s
Creation of the new OSD and mon might take several minutes after the operator restarts.
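To see the new OSD come up without repeatedly re-running the command, you can watch the OSD pods; this is a sketch, assuming the Rook OSD pods carry the usual app=rook-ceph-osd label:
$ oc get pods -n openshift-storage -l app=rook-ceph-osd -w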
Delete the ocs-osd-removal job.
# oc delete job ocs-osd-removal-${osd_id_to_remove}
Example output:
job.batch "ocs-osd-removal-0" deleted
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all other required OpenShift Container Storage pods are in Running state.
Make sure that the new incremental mon is created and is in the Running state.
$ oc get pod -n openshift-storage | grep mon
Example output:
rook-ceph-mon-c-64556f7659-c2ngc   1/1   Running   0   6h14m
rook-ceph-mon-d-7c8b74dc4d-tt6hd   1/1   Running   0   4h24m
rook-ceph-mon-e-57fb8c657-wg5f2    1/1   Running   0   162m
OSD and Mon might take several minutes to get to the Running state.
- If verification steps fail, contact Red Hat Support.
9.3.1.2. Replacing a failed node on bare metal user-provisioned infrastructure
Prerequisites
- You must be logged into the OpenShift Container Platform (OCP) cluster.
Procedure
Identify the node and get labels on the node to be replaced. Make a note of the rack label.
$ oc get nodes --show-labels | grep <node_name>
Identify the mon (if any) and object storage device (OSD) pods that are running in the node to be replaced.
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
Scale down the deployments of the pods identified in the previous step.
For example:
$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
Mark the node as unschedulable.
$ oc adm cordon <node_name>
Remove the pods which are in Terminating state:
$ oc get pods -A -o wide | grep -i <node_name> | awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2 " --grace-period=0 " " --force ")}'
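If you want to review what the one-liner above will remove, you can first list the pods still assigned to the failed node; this check is a sketch using a standard field selector, with <node_name> as a placeholder:
$ oc get pods -A -o wide --field-selector spec.nodeName=<node_name>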
Drain the node.
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
Delete the node.
$ oc delete node <node_name>
- Get a new bare metal machine with required infrastructure. See Installing a cluster on bare metal.
- Create a new OpenShift Container Platform node using the new bare metal machine.
Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
$ oc get csr
Approve all required OpenShift Container Platform CSRs for the new node:
$ oc adm certificate approve <Certificate_Name>
- Click Compute → Nodes in OpenShift Web Console, and confirm if the new node is in Ready state. Apply the OpenShift Container Storage label to the new node using any one of the following:
- From User interface
  - For the new node, click Action Menu (⋮) → Edit Labels.
  - Add cluster.ocs.openshift.io/openshift-storage and click Save.
- From Command line interface
- Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Add the local storage devices available in these worker nodes to the OpenShift Container Storage StorageCluster.
Add a new disk entry to the LocalVolume CR.
Edit the LocalVolume CR and remove or comment out the failed device /dev/disk/by-id/{id} and add the new /dev/disk/by-id/{id}. In this example, the new device is /dev/disk/by-id/nvme-INTEL_SSDPEKKA128G7_BTPYB89THF49128A.
# oc get -n local-storage localvolume
Example output:
NAME          AGE
local-block   25h
# oc edit -n local-storage localvolume local-block
Example output:
[...]
  storageClassDevices:
  - devicePaths:
    - /dev/disk/by-id/nvme-INTEL_SSDPEKKA128G7_BTPY81260978128A
    - /dev/disk/by-id/nvme-INTEL_SSDPEKKA128G7_BTPY80440W5U128A
    - /dev/disk/by-id/nvme-INTEL_SSDPEKKA128G7_BTPYB85AABDE128A
    - /dev/disk/by-id/nvme-INTEL_SSDPEKKA128G7_BTPYB89THF49128A
    storageClassName: localblock
    volumeMode: Block
[...]
Make sure to save the changes after editing the CR.
Display PVs with localblock.
$ oc get pv | grep localblock
Example output:
local-pv-3e8964d3   931Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-2-0-79j94   localblock   25h
local-pv-414755e0   931Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-1-0-959rp   localblock   25h
local-pv-b481410    931Gi   RWO   Delete   Available                                               localblock   3m24s
local-pv-d9c5cbd6   931Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-0-0-nvs68   localblock
Delete the PV associated with the failed node.
Identify the DeviceSet associated with the OSD to be replaced.
# osd_id_to_remove=0
# oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc
where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.
Example output:
ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
In this example, the PVC name is ocs-deviceset-0-0-nvs68.
Identify the PV associated with the PVC.
# oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>
where x, y, and pvc-suffix are the values in the DeviceSet identified in the previous step.
Example output:
NAME                      STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ocs-deviceset-0-0-nvs68   Bound    local-pv-d9c5cbd6   931Gi      RWO            localblock     24h
In this example, the associated PV is local-pv-d9c5cbd6.
Delete the PVC.
# oc delete pvc <pvc-name> -n openshift-storage
Delete the PV.
# oc delete pv local-pv-d9c5cbd6
Example output:
persistentvolume "local-pv-d9c5cbd6" deleted
Remove the failed OSD from the cluster.
# oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=${osd_id_to_remove} | oc create -f -
Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal pod. A status of Completed confirms that the OSD removal job succeeded.
# oc get pod -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage
Note: If ocs-osd-removal fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
# oc logs -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage --tail=-1
Delete OSD pod deployment and crashcollector pod deployment.
$ oc delete deployment rook-ceph-osd-${osd_id_to_remove} -n openshift-storage
$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<old_node_name> -n openshift-storage
Deploy the new OSD by restarting the rook-ceph-operator to force operator reconciliation.
# oc get -n openshift-storage pod -l app=rook-ceph-operator
Example output:
NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-6f74fb5bff-2d982   1/1     Running   0          1d20h
Delete the rook-ceph-operator.
# oc delete -n openshift-storage pod rook-ceph-operator-6f74fb5bff-2d982
Example output:
pod "rook-ceph-operator-6f74fb5bff-2d982" deleted
Verify that the rook-ceph-operator pod is restarted.
# oc get -n openshift-storage pod -l app=rook-ceph-operator
Example output:
NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-6f74fb5bff-7mvrq   1/1     Running   0          66s
Creation of the new OSD and mon might take several minutes after the operator restarts.
Delete the ocs-osd-removal job.
# oc delete job ocs-osd-removal-${osd_id_to_remove}
Example output:
job.batch "ocs-osd-removal-0" deleted
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all other required OpenShift Container Storage pods are in Running state.
Make sure that the new incremental mon is created and is in the Running state.
$ oc get pod -n openshift-storage | grep mon
Example output:
rook-ceph-mon-c-64556f7659-c2ngc   1/1   Running   0   6h14m
rook-ceph-mon-d-7c8b74dc4d-tt6hd   1/1   Running   0   4h24m
rook-ceph-mon-e-57fb8c657-wg5f2    1/1   Running   0   162m
OSD and Mon might take several minutes to get to the Running state.
- If verification steps fail, contact Red Hat Support.
9.3.2. Replacing storage nodes on Amazon EC2 infrastructure
To replace an operational Amazon EC2 node on user-provisioned and installer-provisioned infrastructure, see:
- Section 9.3.2.1, “Replacing an operational Amazon EC2 node on user-provisioned infrastructure”
- Section 9.3.2.2, “Replacing an operational Amazon EC2 node on installer-provisioned infrastructure”
To replace a failed Amazon EC2 node on user-provisioned and installer-provisioned infrastructure, see:
9.3.2.1. Replacing an operational Amazon EC2 node on user-provisioned infrastructure
Perform this procedure to replace an operational node on Amazon EC2 I3 user-provisioned infrastructure (UPI).
Replacing storage nodes in Amazon EC2 I3 infrastructure is a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
Prerequisites
- You must be logged into the OpenShift Container Platform (OCP) cluster.
Procedure
Identify the node and get labels on the node to be replaced.
$ oc get nodes --show-labels | grep <node_name>
Identify the mon (if any) and OSDs that are running in the node to be replaced.
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
Scale down the deployments of the pods identified in the previous step.
For example:
$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
Mark the node as unschedulable.
$ oc adm cordon <node_name>
Drain the node.
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
Delete the node.
$ oc delete node <node_name>
- Create a new Amazon EC2 I3 machine instance with the required infrastructure. See Supported Infrastructure and Platforms.
- Create a new OpenShift Container Platform node using the new Amazon EC2 I3 machine instance.
Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
$ oc get csr
Approve all required OpenShift Container Platform CSRs for the new node:
$ oc adm certificate approve <Certificate_Name>
- Click Compute → Nodes in the OpenShift web console. Confirm if the new node is in Ready state. Apply the OpenShift Container Storage label to the new node using any one of the following:
- From User interface
  - For the new node, click Action Menu (⋮) → Edit Labels.
  - Add cluster.ocs.openshift.io/openshift-storage and click Save.
- From Command line interface
- Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Add the local storage devices available in the new worker node to the OpenShift Container Storage StorageCluster.
Add the new disk entries to LocalVolume CR.
Edit the LocalVolume CR. You can either remove or comment out the failed device /dev/disk/by-id/{id} and add the new /dev/disk/by-id/{id}.
$ oc get -n local-storage localvolume
Example output:
NAME          AGE
local-block   25h
$ oc edit -n local-storage localvolume local-block
Example output:
[...]
  storageClassDevices:
  - devicePaths:
    - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS10382E5D7441494EC
    - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS1F45C01D7E84FE3E9
    - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS136BC945B4ECB9AE4
    - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS10382E5D7441464EP
#   - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS1F45C01D7E84F43E7
#   - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS136BC945B4ECB9AE8
    - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS6F45C01D7E84FE3E9
    - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS636BC945B4ECB9AE4
    storageClassName: localblock
    volumeMode: Block
[...]
Make sure to save the changes after editing the CR.
You can see that in this CR the following two new devices have been added using their by-id paths:
- nvme-Amazon_EC2_NVMe_Instance_Storage_AWS6F45C01D7E84FE3E9
- nvme-Amazon_EC2_NVMe_Instance_Storage_AWS636BC945B4ECB9AE4
- Display PVs with localblock.
$ oc get pv | grep localblock
Example output:
local-pv-3646185e   2328Gi   RWO   Delete   Available                                               localblock   9s
local-pv-3933e86    2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-2-1-v9jp4   localblock   5h1m
local-pv-8176b2bf   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-0-0-nvs68   localblock   5h1m
local-pv-ab7cabb3   2328Gi   RWO   Delete   Available                                               localblock   9s
local-pv-ac52e8a    2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-1-0-knrgr   localblock   5h1m
local-pv-b7e6fd37   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-2-0-rdm7m   localblock   5h1m
local-pv-cb454338   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-0-1-h9hfm   localblock   5h1m
local-pv-da5e3175   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-1-1-g97lq   localblock   5h
...
Delete each PV and OSD associated with the failed node using the following steps.
Identify the DeviceSet associated with the OSD to be replaced.
$ osd_id_to_remove=0
$ oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc
where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.
Example output:
ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
Identify the PV associated with the PVC.
$ oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>
where x, y, and pvc-suffix are the values in the DeviceSet identified in an earlier step.
Example output:
NAME                      STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ocs-deviceset-0-0-nvs68   Bound    local-pv-8176b2bf   2328Gi     RWO            localblock     4h49m
In this example, the associated PV is local-pv-8176b2bf.
Delete the PVC which was identified in earlier steps. In this example, the PVC name is ocs-deviceset-0-0-nvs68.
$ oc delete pvc ocs-deviceset-0-0-nvs68 -n openshift-storage
Example output:
persistentvolumeclaim "ocs-deviceset-0-0-nvs68" deleted
Delete the PV which was identified in earlier steps. In this example, the PV name is local-pv-8176b2bf.
$ oc delete pv local-pv-8176b2bf
Example output:
persistentvolume "local-pv-8176b2bf" deleted
Remove the failed OSD from the cluster.
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=${osd_id_to_remove} | oc create -f -
Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal pod. A status of Completed confirms that the OSD removal job succeeded.
# oc get pod -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage
Note: If ocs-osd-removal fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
# oc logs -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage --tail=-1
Delete the OSD pod deployment.
$ oc delete deployment rook-ceph-osd-${osd_id_to_remove} -n openshift-storage
Delete the crashcollector pod deployment identified in an earlier step.
$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<old_node_name> -n openshift-storage
Deploy the new OSD by restarting the rook-ceph-operator to force operator reconciliation.
$ oc get -n openshift-storage pod -l app=rook-ceph-operator
Example output:
NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-6f74fb5bff-2d982   1/1     Running   0          5h3m
Delete the rook-ceph-operator.
$ oc delete -n openshift-storage pod rook-ceph-operator-6f74fb5bff-2d982
Example output:
pod "rook-ceph-operator-6f74fb5bff-2d982" deleted
Verify that the rook-ceph-operator pod is restarted.
$ oc get -n openshift-storage pod -l app=rook-ceph-operator
Example output:
NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-6f74fb5bff-7mvrq   1/1     Running   0          66s
Creation of the new OSD may take several minutes after the operator starts.
Delete the ocs-osd-removal job(s).
$ oc delete job ocs-osd-removal-${osd_id_to_remove}
Example output:
job.batch "ocs-osd-removal-0" deleted
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all other required OpenShift Container Storage pods are in Running state.
Also, ensure that the new incremental mon is created and is in the Running state.
$ oc get pod -n openshift-storage | grep mon
Example output:
rook-ceph-mon-a-64556f7659-c2ngc   1/1   Running   0   5h1m
rook-ceph-mon-b-7c8b74dc4d-tt6hd   1/1   Running   0   5h1m
rook-ceph-mon-d-57fb8c657-wg5f2    1/1   Running   0   27m
OSDs and mons might take several minutes to get to the Running state.
- If verification steps fail, contact Red Hat Support.
9.3.2.2. Replacing an operational Amazon EC2 node on installer-provisioned infrastructure
Use this procedure to replace an operational node on Amazon EC2 I3 installer-provisioned infrastructure (IPI).
Replacing storage nodes in Amazon EC2 I3 infrastructure is a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
Prerequisites
- You must be logged into the OpenShift Container Platform (OCP) cluster.
Procedure
- Log in to OpenShift Web Console and click Compute → Nodes.
- Identify the node that needs to be replaced. Take a note of its Machine Name.
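If you prefer to find the machine name from the CLI, it is recorded in a node annotation. This is a minimal sketch, assuming the node was created by the Machine API; the annotation value has the form <namespace>/<machine_name>:
$ oc get node <node_name> -o jsonpath='{.metadata.annotations.machine\.openshift\.io/machine}'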
Get labels on the node to be replaced.
$ oc get nodes --show-labels | grep <node_name>
Identify the mon (if any) and OSDs that are running in the node to be replaced.
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
Scale down the deployments of the pods identified in the previous step.
For example:
$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
Mark the node as unschedulable.
$ oc adm cordon <node_name>
Drain the node.
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
- Click Compute → Machines. Search for the required machine.
- Beside the required machine, click the Action menu (⋮) → Delete Machine.
- Click Delete to confirm the machine deletion. A new machine is automatically created.
Wait for the new machine to start and transition into Running state.
Important: This activity may take 5-10 minutes or more.
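You can watch the machine phase from the CLI while you wait. This is a minimal sketch, assuming the cluster uses the Machine API and the machines live in the openshift-machine-api namespace:
$ oc get machines -n openshift-machine-api -w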
- Click Compute → Nodes in the OpenShift web console. Confirm that the new node is in Ready state. Apply the OpenShift Container Storage label to the new node using any one of the following:
- From User interface
  - For the new node, click Action Menu (⋮) → Edit Labels.
  - Add cluster.ocs.openshift.io/openshift-storage and click Save.
- From Command line interface
  - Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Add the local storage devices available in the new worker node to the OpenShift Container Storage StorageCluster.
Add the new disk entries to LocalVolume CR.
Edit the LocalVolume CR. You can either remove or comment out the failed device /dev/disk/by-id/{id} and add the new /dev/disk/by-id/{id}.
$ oc get -n local-storage localvolume
Example output:
NAME          AGE
local-block   25h
$ oc edit -n local-storage localvolume local-block
Example output:
[...]
storageClassDevices:
  - devicePaths:
      - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS10382E5D7441494EC
      - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS1F45C01D7E84FE3E9
      - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS136BC945B4ECB9AE4
      - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS10382E5D7441464EP
      # - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS1F45C01D7E84F43E7
      # - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS136BC945B4ECB9AE8
      - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS6F45C01D7E84FE3E9
      - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS636BC945B4ECB9AE4
    storageClassName: localblock
    volumeMode: Block
[...]
Make sure to save the changes after editing the CR.
In this CR, the following two new devices have been added using their by-id paths. You can confirm the device IDs on the new node as shown in the sketch after this list.
- nvme-Amazon_EC2_NVMe_Instance_Storage_AWS6F45C01D7E84FE3E9
- nvme-Amazon_EC2_NVMe_Instance_Storage_AWS636BC945B4ECB9AE4
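To list the by-id paths of the instance store devices on the new node before editing the CR, you can use a debug pod. This is a minimal sketch, assuming <new_node_name> is the replacement node and that the NVMe instance store devices are exposed under /dev/disk/by-id/:
$ oc debug node/<new_node_name> -- chroot /host ls -l /dev/disk/by-id/ | grep Amazon_EC2_NVMe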
- Display the PVs that use the localblock storage class.
$ oc get pv | grep localblock
Example output:
local-pv-3646185e   2328Gi   RWO   Delete   Available                                               localblock   9s
local-pv-3933e86    2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-2-1-v9jp4   localblock   5h1m
local-pv-8176b2bf   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-0-0-nvs68   localblock   5h1m
local-pv-ab7cabb3   2328Gi   RWO   Delete   Available                                               localblock   9s
local-pv-ac52e8a    2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-1-0-knrgr   localblock   5h1m
local-pv-b7e6fd37   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-2-0-rdm7m   localblock   5h1m
local-pv-cb454338   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-0-1-h9hfm   localblock   5h1m
local-pv-da5e3175   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-1-1-g97lq   localblock   5h
...
Delete each PV and OSD associated with the failed node using the following steps.
Identify the DeviceSet associated with the OSD to be replaced.
$ osd_id_to_remove=0
$ oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc
where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.
Example output:
ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
Identify the PV associated with the PVC.
$ oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>
where x, y, and pvc-suffix are the values in the DeviceSet identified in an earlier step.
Example output:
NAME                      STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ocs-deviceset-0-0-nvs68   Bound    local-pv-8176b2bf   2328Gi     RWO            localblock     4h49m
In this example, the associated PV is local-pv-8176b2bf.
Delete the PVC which was identified in earlier steps. In this example, the PVC name is ocs-deviceset-0-0-nvs68.
$ oc delete pvc ocs-deviceset-0-0-nvs68 -n openshift-storage
Example output:
persistentvolumeclaim "ocs-deviceset-0-0-nvs68" deleted
Delete the PV which was identified in earlier steps. In this example, the PV name is local-pv-8176b2bf.
$ oc delete pv local-pv-8176b2bf
Example output:
persistentvolume "local-pv-8176b2bf" deleted
Remove the failed OSD from the cluster.
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=${osd_id_to_remove} | oc create -f -
Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal pod. A status of Completed confirms that the OSD removal job succeeded.
# oc get pod -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage
Note: If ocs-osd-removal fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
# oc logs -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage --tail=-1
Delete the OSD pod deployment.
$ oc delete deployment rook-ceph-osd-${osd_id_to_remove} -n openshift-storage
Delete the crashcollector pod deployment identified in an earlier step.
$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<old_node_name> -n openshift-storage
Deploy the new OSD by restarting the rook-ceph-operator to force operator reconciliation.
$ oc get -n openshift-storage pod -l app=rook-ceph-operator
Example output:
NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-6f74fb5bff-2d982   1/1     Running   0          5h3m
Delete the rook-ceph-operator.
$ oc delete -n openshift-storage pod rook-ceph-operator-6f74fb5bff-2d982
Example output:
pod "rook-ceph-operator-6f74fb5bff-2d982" deleted
Verify that the rook-ceph-operator pod is restarted.
$ oc get -n openshift-storage pod -l app=rook-ceph-operator
Example output:
NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-6f74fb5bff-7mvrq   1/1     Running   0          66s
Creation of the new OSD may take several minutes after the operator starts.
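To follow the OSD coming up, you can watch the OSD pods until the replacement appears and reaches Running. This is a minimal sketch, assuming the default app=rook-ceph-osd pod label is in use:
$ oc get -n openshift-storage pods -l app=rook-ceph-osd -w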
Delete the ocs-osd-removal job(s).
$ oc delete job ocs-osd-removal-${osd_id_to_remove}
Example output:
job.batch "ocs-osd-removal-0" deleted
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
Also, ensure that the new incremental mon is created and is in the Running state.
$ oc get pod -n openshift-storage | grep mon
Example output:
rook-ceph-mon-a-64556f7659-c2ngc   1/1   Running   0   5h1m
rook-ceph-mon-b-7c8b74dc4d-tt6hd   1/1   Running   0   5h1m
rook-ceph-mon-d-57fb8c657-wg5f2    1/1   Running   0   27m
OSDs and mons might take several minutes to get to the Running state.
- If verification steps fail, contact Red Hat Support.
9.3.2.3. Replacing a failed Amazon EC2 node on user-provisioned infrastructure
The ephemeral storage of Amazon EC2 I3 for OpenShift Container Storage might cause data loss when an instance is powered off. Use this procedure to recover from such an instance power off on Amazon EC2 infrastructure.
Replacing storage nodes in Amazon EC2 I3 infrastructure is a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
Prerequisites
- You must be logged into the OpenShift Container Platform (OCP) cluster.
Procedure
Identify the node and get labels on the node to be replaced.
$ oc get nodes --show-labels | grep <node_name>
Identify the mon (if any) and OSDs that are running in the node to be replaced.
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
Scale down the deployments of the pods identified in the previous step.
For example:
$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
Mark the node as unschedulable.
$ oc adm cordon <node_name>
Remove the pods which are in Terminating state.
$ oc get pods -A -o wide | grep -i <node_name> | awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2 " --grace-period=0 " " --force ")}'
Drain the node.
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
Delete the node.
$ oc delete node <node_name>
- Create a new Amazon EC2 I3 machine instance with the required infrastructure. See Supported Infrastructure and Platforms.
- Create a new OpenShift Container Platform node using the new Amazon EC2 I3 machine instance.
Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
$ oc get csr
Approve all required OpenShift Container Platform CSRs for the new node:
$ oc adm certificate approve <Certificate_Name>
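If several CSRs are pending, approving them one by one can be tedious. As an optional shortcut, the following sketch approves every currently pending CSR in one pass; review the output of oc get csr first, since this assumes all pending requests belong to the new node:
$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve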
- Click Compute → Nodes in the OpenShift web console. Confirm that the new node is in Ready state. Apply the OpenShift Container Storage label to the new node using any one of the following:
- From User interface
  - For the new node, click Action Menu (⋮) → Edit Labels.
  - Add cluster.ocs.openshift.io/openshift-storage and click Save.
- From Command line interface
  - Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Add the local storage devices available in the new worker node to the OpenShift Container Storage StorageCluster.
Add the new disk entries to LocalVolume CR.
Edit the LocalVolume CR. You can either remove or comment out the failed device /dev/disk/by-id/{id} and add the new /dev/disk/by-id/{id}.
$ oc get -n local-storage localvolume
Example output:
NAME          AGE
local-block   25h
$ oc edit -n local-storage localvolume local-block
Example output:
[...]
storageClassDevices:
  - devicePaths:
      - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS10382E5D7441494EC
      - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS1F45C01D7E84FE3E9
      - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS136BC945B4ECB9AE4
      - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS10382E5D7441464EP
      # - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS1F45C01D7E84F43E7
      # - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS136BC945B4ECB9AE8
      - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS6F45C01D7E84FE3E9
      - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS636BC945B4ECB9AE4
    storageClassName: localblock
    volumeMode: Block
[...]
Make sure to save the changes after editing the CR.
In this CR, the following two new devices have been added using their by-id paths.
- nvme-Amazon_EC2_NVMe_Instance_Storage_AWS6F45C01D7E84FE3E9
- nvme-Amazon_EC2_NVMe_Instance_Storage_AWS636BC945B4ECB9AE4
- Display the PVs that use the localblock storage class.
$ oc get pv | grep localblock
Example output:
local-pv-3646185e   2328Gi   RWO   Delete   Available                                               localblock   9s
local-pv-3933e86    2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-2-1-v9jp4   localblock   5h1m
local-pv-8176b2bf   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-0-0-nvs68   localblock   5h1m
local-pv-ab7cabb3   2328Gi   RWO   Delete   Available                                               localblock   9s
local-pv-ac52e8a    2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-1-0-knrgr   localblock   5h1m
local-pv-b7e6fd37   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-2-0-rdm7m   localblock   5h1m
local-pv-cb454338   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-0-1-h9hfm   localblock   5h1m
local-pv-da5e3175   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-1-1-g97lq   localblock   5h
...
Delete each PV and OSD associated with the failed node using the following steps.
Identify the DeviceSet associated with the OSD to be replaced.
$ osd_id_to_remove=0
$ oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc
where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.
Example output:
ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
Identify the PV associated with the PVC.
$ oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>
where x, y, and pvc-suffix are the values in the DeviceSet identified in an earlier step.
Example output:
NAME                      STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ocs-deviceset-0-0-nvs68   Bound    local-pv-8176b2bf   2328Gi     RWO            localblock     4h49m
In this example, the associated PV is local-pv-8176b2bf.
Delete the PVC which was identified in earlier steps. In this example, the PVC name is ocs-deviceset-0-0-nvs68.
$ oc delete pvc ocs-deviceset-0-0-nvs68 -n openshift-storage
Example output:
persistentvolumeclaim "ocs-deviceset-0-0-nvs68" deleted
Delete the PV which was identified in earlier steps. In this example, the PV name is local-pv-8176b2bf.
$ oc delete pv local-pv-8176b2bf
Example output:
persistentvolume "local-pv-8176b2bf" deleted
Remove the failed OSD from the cluster.
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=${osd_id_to_remove} | oc create -f -
Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal pod. A status of Completed confirms that the OSD removal job succeeded.
# oc get pod -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage
Note: If ocs-osd-removal fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
# oc logs -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage --tail=-1
Delete the OSD pod deployment.
$ oc delete deployment rook-ceph-osd-${osd_id_to_remove} -n openshift-storage
Delete the crashcollector pod deployment identified in an earlier step.
$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<old_node_name> -n openshift-storage
Deploy the new OSD by restarting the rook-ceph-operator to force operator reconciliation.
$ oc get -n openshift-storage pod -l app=rook-ceph-operator
Example output:
NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-6f74fb5bff-2d982   1/1     Running   0          5h3m
Delete the rook-ceph-operator.
$ oc delete -n openshift-storage pod rook-ceph-operator-6f74fb5bff-2d982
Example output:
pod "rook-ceph-operator-6f74fb5bff-2d982" deleted
Verify that the rook-ceph-operator pod is restarted.
$ oc get -n openshift-storage pod -l app=rook-ceph-operator
Example output:
NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-6f74fb5bff-7mvrq   1/1     Running   0          66s
Creation of the new OSD may take several minutes after the operator starts.
Delete the ocs-osd-removal job(s).
$ oc delete job ocs-osd-removal-${osd_id_to_remove}
Example output:
job.batch "ocs-osd-removal-0" deleted
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
Also, ensure that the new incremental mon is created and is in the Running state.
$ oc get pod -n openshift-storage | grep mon
Example output:
rook-ceph-mon-a-64556f7659-c2ngc   1/1   Running   0   5h1m
rook-ceph-mon-b-7c8b74dc4d-tt6hd   1/1   Running   0   5h1m
rook-ceph-mon-d-57fb8c657-wg5f2    1/1   Running   0   27m
OSDs and mons might take several minutes to get to the Running state.
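If the rook-ceph toolbox pod is deployed in the cluster, you can also check overall Ceph health once the new OSD and mon are up. This is a minimal sketch, assuming the tools pod carries the app=rook-ceph-tools label:
$ oc -n openshift-storage rsh $(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name) ceph status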
- If verification steps fail, contact Red Hat Support.
9.3.2.4. Replacing a failed Amazon EC2 node on installer-provisioned infrastructure
The ephemeral storage of Amazon EC2 I3 for OpenShift Container Storage might cause data loss when an instance is powered off. Use this procedure to recover from such an instance power off on Amazon EC2 infrastructure.
Replacing storage nodes in Amazon EC2 I3 infrastructure is a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
Prerequisites
- You must be logged into the OpenShift Container Platform (OCP) cluster.
Procedure
- Log in to OpenShift Web Console and click Compute → Nodes.
- Identify the node that needs to be replaced. Take a note of its Machine Name.
Get the labels on the node to be replaced.
$ oc get nodes --show-labels | grep <node_name>
Identify the mon (if any) and OSDs that are running in the node to be replaced.
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
Scale down the deployments of the pods identified in the previous step.
For example:
$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
Mark the node as unschedulable.
$ oc adm cordon <node_name>
Remove the pods which are in Terminating state.
$ oc get pods -A -o wide | grep -i <node_name> | awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2 " --grace-period=0 " " --force ")}'
Drain the node.
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
- Click Compute → Machines. Search for the required machine.
- Beside the required machine, click the Action menu (⋮) → Delete Machine.
- Click Delete to confirm the machine deletion. A new machine is automatically created.
Wait for the new machine to start and transition into Running state.
Important: This activity may take 5-10 minutes or more.
- Click Compute → Nodes in the OpenShift web console. Confirm that the new node is in Ready state. Apply the OpenShift Container Storage label to the new node using any one of the following:
- From User interface
  - For the new node, click Action Menu (⋮) → Edit Labels.
  - Add cluster.ocs.openshift.io/openshift-storage and click Save.
- From Command line interface
  - Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Add the local storage devices available in the new worker node to the OpenShift Container Storage StorageCluster.
Add the new disk entries to LocalVolume CR.
Edit the LocalVolume CR. You can either remove or comment out the failed device /dev/disk/by-id/{id} and add the new /dev/disk/by-id/{id}.
$ oc get -n local-storage localvolume
Example output:
NAME          AGE
local-block   25h
$ oc edit -n local-storage localvolume local-block
Example output:
[...]
storageClassDevices:
  - devicePaths:
      - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS10382E5D7441494EC
      - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS1F45C01D7E84FE3E9
      - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS136BC945B4ECB9AE4
      - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS10382E5D7441464EP
      # - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS1F45C01D7E84F43E7
      # - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS136BC945B4ECB9AE8
      - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS6F45C01D7E84FE3E9
      - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS636BC945B4ECB9AE4
    storageClassName: localblock
    volumeMode: Block
[...]
Make sure to save the changes after editing the CR.
In this CR, the following two new devices have been added using their by-id paths.
- nvme-Amazon_EC2_NVMe_Instance_Storage_AWS6F45C01D7E84FE3E9
- nvme-Amazon_EC2_NVMe_Instance_Storage_AWS636BC945B4ECB9AE4
- Display the PVs that use the localblock storage class.
$ oc get pv | grep localblock
Example output:
local-pv-3646185e   2328Gi   RWO   Delete   Available                                               localblock   9s
local-pv-3933e86    2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-2-1-v9jp4   localblock   5h1m
local-pv-8176b2bf   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-0-0-nvs68   localblock   5h1m
local-pv-ab7cabb3   2328Gi   RWO   Delete   Available                                               localblock   9s
local-pv-ac52e8a    2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-1-0-knrgr   localblock   5h1m
local-pv-b7e6fd37   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-2-0-rdm7m   localblock   5h1m
local-pv-cb454338   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-0-1-h9hfm   localblock   5h1m
local-pv-da5e3175   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-1-1-g97lq   localblock   5h
...
Delete each PV and OSD associated with the failed node using the following steps.
Identify the DeviceSet associated with the OSD to be replaced.
$ osd_id_to_remove=0
$ oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc
where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.
Example output:
ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
Identify the PV associated with the PVC.
$ oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>
where x, y, and pvc-suffix are the values in the DeviceSet identified in an earlier step.
Example output:
NAME                      STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ocs-deviceset-0-0-nvs68   Bound    local-pv-8176b2bf   2328Gi     RWO            localblock     4h49m
In this example, the associated PV is local-pv-8176b2bf.
Delete the PVC which was identified in earlier steps. In this example, the PVC name is ocs-deviceset-0-0-nvs68.
$ oc delete pvc ocs-deviceset-0-0-nvs68 -n openshift-storage
Example output:
persistentvolumeclaim "ocs-deviceset-0-0-nvs68" deleted
Delete the PV which was identified in earlier steps. In this example, the PV name is local-pv-8176b2bf.
$ oc delete pv local-pv-8176b2bf
Example output:
persistentvolume "local-pv-8176b2bf" deleted
Remove the failed OSD from the cluster.
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=${osd_id_to_remove} | oc create -f -
Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal pod. A status of Completed confirms that the OSD removal job succeeded.
# oc get pod -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage
Note: If ocs-osd-removal fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
# oc logs -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage --tail=-1
Delete the OSD pod deployment.
$ oc delete deployment rook-ceph-osd-${osd_id_to_remove} -n openshift-storage
Delete the crashcollector pod deployment identified in an earlier step.
$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<old_node_name> -n openshift-storage
Deploy the new OSD by restarting the rook-ceph-operator to force operator reconciliation.
$ oc get -n openshift-storage pod -l app=rook-ceph-operator
Example output:
NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-6f74fb5bff-2d982   1/1     Running   0          5h3m
Delete the rook-ceph-operator.
$ oc delete -n openshift-storage pod rook-ceph-operator-6f74fb5bff-2d982
Example output:
pod "rook-ceph-operator-6f74fb5bff-2d982" deleted
Verify that the rook-ceph-operator pod is restarted.
$ oc get -n openshift-storage pod -l app=rook-ceph-operator
Example output:
NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-6f74fb5bff-7mvrq   1/1     Running   0          66s
Creation of the new OSD may take several minutes after the operator starts.
Delete the ocs-osd-removal job(s).
$ oc delete job ocs-osd-removal-${osd_id_to_remove}
Example output:
job.batch "ocs-osd-removal-0" deleted
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
Also, ensure that the new incremental mon is created and is in the Running state.
$ oc get pod -n openshift-storage | grep mon
Example output:
rook-ceph-mon-a-64556f7659-c2ngc   1/1   Running   0   5h1m
rook-ceph-mon-b-7c8b74dc4d-tt6hd   1/1   Running   0   5h1m
rook-ceph-mon-d-57fb8c657-wg5f2    1/1   Running   0   27m
OSDs and mons might take several minutes to get to the Running state.
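As a final check, you can confirm that the StorageCluster resource reports a healthy phase once the replacement OSD and mon are running. This is a minimal sketch, assuming the default StorageCluster created by OpenShift Container Storage; the expected phase is Ready:
$ oc get storagecluster -n openshift-storage -o jsonpath='{.items[0].status.phase}{"\n"}'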
- If verification steps fail, contact Red Hat Support.
9.3.3. Replacing storage nodes on VMware infrastructure
- To replace an operational node, see Section 9.3.3.1, “Replacing an operational node on VMware user-provisioned infrastructure”
- To replace a failed node, see Section 9.3.3.2, “Replacing a failed node on VMware user-provisioned infrastructure”
9.3.3.1. Replacing an operational node on VMware user-provisioned infrastructure
Prerequisites
- You must be logged into the OpenShift Container Platform (OCP) cluster.
Procedure
Identify the node and get labels on the node to be replaced.
$ oc get nodes --show-labels | grep <node_name>
Identify the mon (if any) and OSDs that are running in the node to be replaced.
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
Scale down the deployments of the pods identified in the previous step.
For example:
$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
Mark the node as unschedulable.
$ oc adm cordon <node_name>
Drain the node.
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
Delete the node.
$ oc delete node <node_name>
- Log in to vSphere and terminate the identified VM (an optional govc sketch for this step follows this list).
- Create a new VM on VMware with the required infrastructure. See Supported Infrastructure and Platforms.
- Create a new OpenShift Container Platform worker node using the new VM.
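The document leaves the vSphere termination step to your preferred tool; if you use the govc command-line tool instead of the vSphere web client, a rough sketch looks like the following. This assumes govc is installed and the GOVC_URL, GOVC_USERNAME, GOVC_PASSWORD, and GOVC_INSECURE environment variables are already set for your vCenter, and that <vm_name> is the VM backing the node being replaced:
$ govc vm.power -off -force <vm_name>
$ govc vm.destroy <vm_name>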
Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
$ oc get csr
Approve all required OpenShift Container Platform CSRs for the new node:
$ oc adm certificate approve <Certificate_Name>
- Click Compute → Nodes in OpenShift Web Console, confirm that the new node is in Ready state. Apply the OpenShift Container Storage label to the new node using any one of the following:
- From User interface
  - For the new node, click Action Menu (⋮) → Edit Labels.
  - Add cluster.ocs.openshift.io/openshift-storage and click Save.
- From Command line interface
  - Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Add the local storage devices available in these worker nodes to the OpenShift Container Storage StorageCluster.
Add a new disk entry to the LocalVolume CR.
Edit the LocalVolume CR and remove or comment out the failed device /dev/disk/by-id/{id} and add the new /dev/disk/by-id/{id}. In this example, the new device is /dev/disk/by-id/nvme-eui.01000000010000005cd2e490020e5251.
# oc get -n local-storage localvolume
Example output:
NAME          AGE
local-block   25h
# oc edit -n local-storage localvolume local-block
Example output:
[...]
storageClassDevices:
  - devicePaths:
      - /dev/disk/by-id/nvme-eui.01000000010000005cd2e4895e0e5251
      - /dev/disk/by-id/nvme-eui.01000000010000005cd2e4ea2f0f5251
      # - /dev/disk/by-id/nvme-eui.01000000010000005cd2e4de2f0f5251
      - /dev/disk/by-id/nvme-eui.01000000010000005cd2e490020e5251
    storageClassName: localblock
    volumeMode: Block
[...]
Make sure to save the changes after editing the CR.
Display the PVs that use the localblock storage class.
$ oc get pv | grep localblock
Example output:
local-pv-3e8964d3   1490Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-2-0-79j94   localblock   25h
local-pv-414755e0   1490Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-1-0-959rp   localblock   25h
local-pv-b481410    1490Gi   RWO   Delete   Available                                               localblock   3m24s
local-pv-d9c5cbd6   1490Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-0-0-nvs68   localblock
Delete the PV associated with the failed node.
Identify the DeviceSet associated with the OSD to be replaced.
# osd_id_to_remove=0
# oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc
where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.
Example output:
ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
In this example, the PVC name is ocs-deviceset-0-0-nvs68.
Identify the PV associated with the PVC.
# oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>
where x, y, and pvc-suffix are the values in the DeviceSet identified in the previous step.
Example output:
NAME                      STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ocs-deviceset-0-0-nvs68   Bound    local-pv-d9c5cbd6   1490Gi     RWO            localblock     24h
In this example, the associated PV is local-pv-d9c5cbd6.
Delete the PVC.
oc delete pvc <pvc-name> -n openshift-storage
Delete the PV.
# oc delete pv local-pv-d9c5cbd6
Example output:
persistentvolume "local-pv-d9c5cbd6" deleted
Remove the failed OSD from the cluster.
# oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=${osd_id_to_remove} | oc create -f -
Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal pod. A status of Completed confirms that the OSD removal job succeeded.
# oc get pod -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage
Note: If ocs-osd-removal fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
# oc logs -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage --tail=-1
Delete the OSD pod deployment and the crashcollector pod deployment.
$ oc delete deployment rook-ceph-osd-${osd_id_to_remove} -n openshift-storage
$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<old_node_name> -n openshift-storage
Deploy the new OSD by restarting the rook-ceph-operator to force operator reconciliation.
# oc get -n openshift-storage pod -l app=rook-ceph-operator
Example output:
NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-6f74fb5bff-2d982   1/1     Running   0          1d20h
Delete the rook-ceph-operator.
# oc delete -n openshift-storage pod rook-ceph-operator-6f74fb5bff-2d982
Example output:
pod "rook-ceph-operator-6f74fb5bff-2d982" deleted
Verify that the rook-ceph-operator pod is restarted.
# oc get -n openshift-storage pod -l app=rook-ceph-operator
Example output:
NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-6f74fb5bff-7mvrq   1/1     Running   0          66s
Creation of the new OSD and mon might take several minutes after the operator restarts.
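If you prefer not to look up and delete the operator pod by name, a rollout restart of the operator deployment triggers the same reconciliation. This is a sketch, assuming the default rook-ceph-operator deployment name used by OpenShift Container Storage:
$ oc rollout restart deployment rook-ceph-operator -n openshift-storage
Either approach results in a fresh operator pod that re-evaluates the cluster and creates the new OSD.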
Delete the ocs-osd-removal job.
# oc delete job ocs-osd-removal-${osd_id_to_remove} -n openshift-storage
Example output:
job.batch "ocs-osd-removal-0" deleted
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
Ensure that the new incremental mon is created and is in the Running state.
$ oc get pod -n openshift-storage | grep mon
Example output:
rook-ceph-mon-c-64556f7659-c2ngc   1/1   Running   0   6h14m
rook-ceph-mon-d-7c8b74dc4d-tt6hd   1/1   Running   0   4h24m
rook-ceph-mon-e-57fb8c657-wg5f2    1/1   Running   0   162m
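Beyond checking individual pods, you can also confirm overall Ceph health from the CLI if the optional rook-ceph-tools toolbox pod is deployed in your cluster. This is a sketch, assuming the toolbox carries the app=rook-ceph-tools label:
$ TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name | head -n 1)
$ oc rsh -n openshift-storage ${TOOLS_POD} ceph status
A health status of HEALTH_OK, with all OSDs up and in, indicates that the replacement device has been fully integrated.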
OSD and Mon might take several minutes to get to the Running state.
- If verification steps fail, contact Red Hat Support.
9.3.3.2. Replacing a failed node on VMware user-provisioned infrastructure
Prerequisites
- You must be logged into the OpenShift Container Platform (OCP) cluster.
Procedure
Identify the node to be replaced and get the labels on it.
$ oc get nodes --show-labels | grep <node_name>
Identify the mon (if any) and OSDs that are running on the node to be replaced.
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
Scale down the deployments of the pods identified in the previous step.
For example:
$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
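If the grep on the wide output is noisy, the same pods can be listed with a field selector instead. This is a minimal sketch, using the same <node_name> placeholder:
$ oc get pods -n openshift-storage -o wide --field-selector spec.nodeName=<node_name>
The deployments to scale down are the ones that own the mon, OSD, and crashcollector pods returned by this query.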
Mark the node as unschedulable.
$ oc adm cordon <node_name>
Remove the pods which are in Terminating state.
$ oc get pods -A -o wide | grep -i <node_name> | awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2 " --grace-period=0 " " --force ")}'
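The awk one-liner matches the STATUS column (field 4 of oc get pods -A -o wide) and force-deletes each Terminating pod in its own namespace. If you prefer something easier to read, the same cleanup can be written as a small shell loop; this is a sketch using the same <node_name> placeholder:
$ oc get pods -A -o wide | grep -i <node_name> | while read ns pod ready status rest; do
    # Force-delete only the pods that are stuck in Terminating state
    if [ "$status" = "Terminating" ]; then
      oc -n "$ns" delete pod "$pod" --grace-period=0 --force
    fi
  done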
Drain the node.
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
Delete the node.
$ oc delete node <node_name>
- Log in to vSphere and terminate the identified VM.
- Create a new VM on VMware with the required infrastructure. See Supported Infrastructure and Platforms.
- Create a new OpenShift Container Platform worker node using the new VM.
Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
$ oc get csr
Approve all required OpenShift Container Platform CSRs for the new node:
$ oc adm certificate approve <Certificate_Name>
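If several CSRs are pending, approving them one at a time is tedious. The following sketch approves every CSR that does not yet have a status; it assumes xargs is available on your workstation and that all such pending CSRs belong to the node you just added:
$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{" "}}{{end}}{{end}}' | xargs oc adm certificate approve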
- Click Compute → Nodes in the OpenShift Web Console and confirm that the new node is in Ready state.
Apply the OpenShift Container Storage label to the new node using any one of the following:
- From User interface
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage and click Save.
- From Command line interface
- Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Add the local storage devices available in these worker nodes to the OpenShift Container Storage StorageCluster.
Add a new disk entry to the LocalVolume CR.
Edit the LocalVolume CR, remove or comment out the failed device /dev/disk/by-id/{id}, and add the new /dev/disk/by-id/{id}. In this example, the new device is /dev/disk/by-id/nvme-eui.01000000010000005cd2e490020e5251.
# oc get -n local-storage localvolume
Example output:
NAME          AGE
local-block   25h
# oc edit -n local-storage localvolume local-block
Example output:
[...]
  storageClassDevices:
    - devicePaths:
        - /dev/disk/by-id/nvme-eui.01000000010000005cd2e4895e0e5251
        - /dev/disk/by-id/nvme-eui.01000000010000005cd2e4ea2f0f5251
        # - /dev/disk/by-id/nvme-eui.01000000010000005cd2e4de2f0f5251
        - /dev/disk/by-id/nvme-eui.01000000010000005cd2e490020e5251
      storageClassName: localblock
      volumeMode: Block
[...]
Make sure to save the changes after editing the CR.
Display the PVs with the localblock storage class.
$ oc get pv | grep localblock
Example output:
local-pv-3e8964d3   1490Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-2-0-79j94   localblock   25h
local-pv-414755e0   1490Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-1-0-959rp   localblock   25h
local-pv-b481410    1490Gi   RWO   Delete   Available                                               localblock   3m24s
local-pv-d9c5cbd6   1490Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-0-0-nvs68   localblock
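The PV in Available state (local-pv-b481410 in this example, created a few minutes earlier) corresponds to the newly added device. To pick out Available localblock PVs programmatically, a jsonpath query can be used; this is an optional sketch, not part of the documented procedure:
$ oc get pv -o jsonpath='{range .items[?(@.spec.storageClassName=="localblock")]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}' | grep Available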
Delete the PV associated with the failed node.
Identify the DeviceSet associated with the OSD to be replaced.
# osd_id_to_remove=0
# oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc
where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.
Example output:
ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
In this example, the PVC name is ocs-deviceset-0-0-nvs68.
Identify the PV associated with the PVC.
# oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>
where x, y, and pvc-suffix are the values in the DeviceSet identified in the previous step.
Example output:
NAME                      STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ocs-deviceset-0-0-nvs68   Bound    local-pv-d9c5cbd6   1490Gi     RWO            localblock     24h
In this example, the associated PV is local-pv-d9c5cbd6.
Delete the PVC.
$ oc delete pvc <pvc-name> -n openshift-storage
Delete the PV.
# oc delete pv local-pv-d9c5cbd6
Example output:
persistentvolume "local-pv-d9c5cbd6" deleted
Remove the failed OSD from the cluster.
# oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=${osd_id_to_remove} | oc create -f -
Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal pod. A status of Completed confirms that the OSD removal job succeeded.
# oc get pod -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage
Note: If ocs-osd-removal fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
# oc logs -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage --tail=-1
Delete the OSD pod deployment and the crashcollector pod deployment.
$ oc delete deployment rook-ceph-osd-${osd_id_to_remove} -n openshift-storage
$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<old_node_name> -n openshift-storage
Deploy the new OSD by restarting the rook-ceph-operator to force operator reconciliation.
Identify the name of the rook-ceph-operator pod.
# oc get -n openshift-storage pod -l app=rook-ceph-operator
Example output:
NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-6f74fb5bff-2d982   1/1     Running   0          1d20h
Delete the rook-ceph-operator pod.
# oc delete -n openshift-storage pod rook-ceph-operator-6f74fb5bff-2d982
Example output:
pod "rook-ceph-operator-6f74fb5bff-2d982" deleted
Verify that the rook-ceph-operator pod is restarted.
# oc get -n openshift-storage pod -l app=rook-ceph-operator
Example output:
NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-6f74fb5bff-7mvrq   1/1     Running   0          66s
Creation of the new OSD and mon might take several minutes after the operator restarts.
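To follow progress without repeatedly re-running oc get, you can watch the OSD pods as they are created. This is a sketch, assuming the app=rook-ceph-osd label that Rook applies to OSD pods:
$ oc get pods -n openshift-storage -l app=rook-ceph-osd -w
Press Ctrl+C to stop watching once the new OSD pod reaches the Running state.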
Delete the ocs-osd-removal job.
# oc delete job ocs-osd-removal-${osd_id_to_remove} -n openshift-storage
Example output:
job.batch "ocs-osd-removal-0" deleted
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
Ensure that the new incremental mon is created and is in the Running state.
$ oc get pod -n openshift-storage | grep mon
Example output:
rook-ceph-mon-c-64556f7659-c2ngc   1/1   Running   0   6h14m
rook-ceph-mon-d-7c8b74dc4d-tt6hd   1/1   Running   0   4h24m
rook-ceph-mon-e-57fb8c657-wg5f2    1/1   Running   0   162m
OSD and Mon might take several minutes to get to the Running state.
- If verification steps fail, contact Red Hat Support.