OpenShift Container Storage is now OpenShift Data Foundation starting with version 4.9.
Chapter 6. Replacing nodes
For OpenShift Container Storage, node replacement can be performed proactively for an operational node and reactively for a failed node in IBM Power Systems deployments.
6.1. Replacing an operational or failed storage node on IBM Power Systems
Prerequisites
- Red Hat recommends that replacement nodes are configured with similar infrastructure and resources to the node being replaced.
- You must be logged in to the OpenShift Container Platform (RHOCP) cluster.
Procedure
Check the labels on the failed node and make note of the rack label.
$ oc get nodes --show-labels | grep failed-node-name

Identify the mon (if any) and object storage device (OSD) pods that are running on the failed node.
$ oc get pods -n openshift-storage -o wide | grep -i failed-node-name

Scale down the deployments of the pods identified in the previous step.
For example:
$ oc scale deployment rook-ceph-mon-a --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-1 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=failed-node-name --replicas=0 -n openshift-storage

Mark the failed node so that it cannot be scheduled for work.
$ oc adm cordon failed-node-name

Drain the failed node of existing work.
$ oc adm drain failed-node-name --force --delete-local-data --ignore-daemonsets

Note: If the failed node is not connected to the network, remove the pods running on it by using the following commands:
$ oc get pods -A -o wide | grep -i failed-node-name | awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2 " --grace-period=0 " " --force ")}'
$ oc adm drain failed-node-name --force --delete-local-data --ignore-daemonsets

Delete the failed node.
$ oc delete node failed-node-name

Get a new IBM Power machine with the required infrastructure. See Installing a cluster on IBM Power Systems.
Create a new OpenShift Container Platform node using the new IBM Power Systems machine.
Check for certificate signing requests (CSRs) related to OpenShift Container Storage that are in Pending state:

$ oc get csr
Approve all required OpenShift Container Storage CSRs for the new node:

$ oc adm certificate approve certificate-name
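If several CSRs are pending, you can approve each by name as shown above, or approve all pending CSRs in one pass. One way to do this is the following pipeline, which selects CSRs that do not yet have a status and approves them:

$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve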
Click Compute → Nodes in the OpenShift Web Console and confirm that the new node is in Ready state.
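You can also check the node status from the command line; with new-node-name as the placeholder for the new node used throughout this procedure, the STATUS column should report Ready:

$ oc get node new-node-name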
Apply the OpenShift Container Storage label to the new node using your preferred interface:

From the OpenShift web console:
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage and click Save.
From the command line interface:

Execute the following command to apply the OpenShift Container Storage label to the new node:

$ oc label node new-node-name cluster.ocs.openshift.io/openshift-storage=""
Add the local storage devices available on these worker nodes to the OpenShift Container Storage StorageCluster.
Determine which localVolumeSet to edit. Replace local-storage-project in the following commands with the name of your local storage project. The default project name is openshift-local-storage in OpenShift Container Storage 4.6 and later. Previous versions use local-storage by default.

# oc get -n local-storage-project localvolumeset
NAME         AGE
localblock   25h
Update the localVolumeSet definition to include the new node and remove the failed node, as sketched below. Remember to save before exiting the editor.
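A minimal sketch of this edit, assuming the localVolumeSet is named localblock (as shown above) and selects nodes by hostname; the worker names are placeholders and your CR's node selector may differ:

# oc edit -n local-storage-project localvolumeset localblock

[...]
spec:
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: kubernetes.io/hostname
        operator: In
        values:
        - worker-0            # existing nodes (placeholder names)
        - worker-1
        - new-node-name       # add the new node and remove failed-node-name
[...]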
Verify that the new localblock PV is available.
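For example, list the localblock PVs and confirm that a new PV backed by the new node's device appears (Available, or Bound once it is claimed):

$ oc get pv | grep localblock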
Change to the openshift-storage project.

$ oc project openshift-storage
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.

Identify the PVC, because the PV associated with that PVC must be deleted later.
# osd_id_to_remove=1
# oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc

where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-1.

Example output:
ceph.rook.io/pvc: ocs-deviceset-localblock-0-data-0-g2mmc
ceph.rook.io/pvc: ocs-deviceset-localblock-0-data-0-g2mmc

In this example, the PVC name is ocs-deviceset-localblock-0-data-0-g2mmc.

Remove the failed OSD from the cluster.
# oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove},${osd_id_to_remove2} | oc create -f -
Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal pod. A status of Completed confirms that the OSD removal job succeeded.

# oc get pod -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage
Note: If ocs-osd-removal fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:

# oc logs -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage --tail=-1

Delete the PV associated with the failed node.
Identify the PV associated with the PVC.
# oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>

where x, y, and pvc-suffix are the values in the DeviceSet identified in the previous step.

For example:
# oc get -n openshift-storage pvc ocs-deviceset-localblock-0-data-0-g2mmc
NAME                                      STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ocs-deviceset-localblock-0-data-0-g2mmc   Bound    local-pv-5c9b8982   500Gi      RWO            localblock     24h

In this example, the associated PV is local-pv-5c9b8982.

Delete the PV.
# oc delete pv <persistent-volume>

For example:
# oc delete pv local-pv-5c9b8982
persistentvolume "local-pv-5c9b8982" deleted
Delete the crashcollector pod deployment.

$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=failed-node-name -n openshift-storage
Deploy the new OSD by restarting the rook-ceph-operator to force operator reconciliation.

# oc get -n openshift-storage pod -l app=rook-ceph-operator

Example output:

NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-77758ddc74-dlwn2   1/1     Running   0          1d20h
Delete the rook-ceph-operator.

# oc delete -n openshift-storage pod rook-ceph-operator-77758ddc74-dlwn2

Example output:

pod "rook-ceph-operator-77758ddc74-dlwn2" deleted
Verify that the rook-ceph-operator pod is restarted.

# oc get -n openshift-storage pod -l app=rook-ceph-operator

Example output:

NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-77758ddc74-wqf25   1/1     Running   0          66s
Creation of the new OSD and mon might take several minutes after the operator restarts.
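One way to watch for the new OSD pod to appear, assuming the app=rook-ceph-osd label that Rook applies to OSD pods:

$ oc get -n openshift-storage pods -l app=rook-ceph-osd -w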
Delete the ocs-osd-removal job.

# oc delete job ocs-osd-removal-${osd_id_to_remove}
For example:
# oc delete job ocs-osd-removal-1
job.batch "ocs-osd-removal-1" deleted
Verification steps
- Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Click Workloads
Pods, confirm that at least the following pods on the new node are in Running state: -
csi-cephfsplugin-*
-
csi-rbdplugin-*
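You can also list the storage pods scheduled on the new node from the command line, using the same new-node-name placeholder:

$ oc get pods -n openshift-storage -o wide | grep new-node-name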
- Verify that all other required OpenShift Container Storage pods are in Running state.
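For example, review the STATUS column for the whole namespace:

$ oc get pods -n openshift-storage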
- Make sure that the new incremental mon is created and is in the Running state.

$ oc get pod -n openshift-storage | grep mon
Example output:

rook-ceph-mon-b-74f6dc9dd6-4llzq   1/1   Running   0   6h14m
rook-ceph-mon-c-74948755c-h7wtx    1/1   Running   0   4h24m
rook-ceph-mon-d-598f69869b-4bv49   1/1   Running   0   162m

OSD and Mon might take several minutes to get to the Running state.
- If verification steps fail, contact Red Hat Support.