OpenShift Container Storage is now OpenShift Data Foundation starting with version 4.9.
Chapter 6. Replacing nodes
For OpenShift Container Storage, node replacement can be performed proactively for an operational node and reactively for a failed node in IBM Power Systems deployments.
6.1. Replacing an operational or failed storage node on IBM Power Systems
Prerequisites
- Red Hat recommends that replacement nodes are configured with similar infrastructure and resources to the node being replaced.
- You must be logged in to the OpenShift Container Platform (RHOCP) cluster.
Procedure
Check the labels on the failed node and make note of the rack label.
$ oc get nodes --show-labels | grep failed-node-name

Identify the mon (if any) and object storage device (OSD) pods that are running on the failed node.
$ oc get pods -n openshift-storage -o wide | grep -i failed-node-name

Scale down the deployments of the pods identified in the previous step.
For example:
$ oc scale deployment rook-ceph-mon-a --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-1 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=failed-node-name --replicas=0 -n openshift-storage

Mark the failed node so that it cannot be scheduled for work.
$ oc adm cordon failed-node-name

Drain the failed node of existing work.
$ oc adm drain failed-node-name --force --delete-local-data --ignore-daemonsets

Note: If the failed node is not connected to the network, remove the pods running on it by using the following commands:
$ oc get pods -A -o wide | grep -i failed-node-name | awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2 " --grace-period=0 " " --force ")}'
$ oc adm drain failed-node-name --force --delete-local-data --ignore-daemonsets

Delete the failed node.
$ oc delete node failed-node-name

Get a new IBM Power machine with the required infrastructure. See Installing a cluster on IBM Power Systems.
Create a new OpenShift Container Platform node using the new IBM Power Systems machine.
Check for certificate signing requests (CSRs) related to OpenShift Container Storage that are in Pending state:

$ oc get csr
Approve all required OpenShift Container Storage CSRs for the new node:

$ oc adm certificate approve certificate-name
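If several CSRs are pending, you can approve each by name as shown above, or approve all pending CSRs in one pass. One way to do this is the following pipeline, which selects CSRs that do not yet have a status and approves them:

$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve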
Click Compute → Nodes in the OpenShift Web Console and confirm that the new node is in Ready state.
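You can also check the node status from the command line; with new-node-name as the placeholder for the new node used throughout this procedure, the STATUS column should report Ready:

$ oc get node new-node-name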
Apply the OpenShift Container Storage label to the new node using your preferred interface:

From the OpenShift web console:
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage and click Save.
From the command line interface:

Execute the following command to apply the OpenShift Container Storage label to the new node:

$ oc label node new-node-name cluster.ocs.openshift.io/openshift-storage=""
Add the local storage devices available on these worker nodes to the OpenShift Container Storage StorageCluster.
Determine which localVolumeSet to edit. Replace local-storage-project in the following commands with the name of your local storage project. The default project name is openshift-local-storage in OpenShift Container Storage 4.6 and later. Previous versions use local-storage by default.

# oc get -n local-storage-project localvolumeset
NAME         AGE
localblock   25h
Update the localVolumeSet definition to include the new node and remove the failed node, as sketched below. Remember to save before exiting the editor.
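A minimal sketch of this edit, assuming the localVolumeSet is named localblock (as shown above) and selects nodes by hostname; the worker names are placeholders and your CR's node selector may differ:

# oc edit -n local-storage-project localvolumeset localblock

[...]
spec:
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: kubernetes.io/hostname
        operator: In
        values:
        - worker-0            # existing nodes (placeholder names)
        - worker-1
        - new-node-name       # add the new node and remove failed-node-name
[...]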
Verify that the new localblock PV is available.
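For example, list the localblock PVs and confirm that a new PV backed by the new node's device appears (Available, or Bound once it is claimed):

$ oc get pv | grep localblock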
Change to the openshift-storage project.

$ oc project openshift-storage
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.

Identify the PVC, because the PV associated with that PVC must be deleted later.
# osd_id_to_remove=1
# oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc

where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-1.

Example output:
ceph.rook.io/pvc: ocs-deviceset-localblock-0-data-0-g2mmc
ceph.rook.io/pvc: ocs-deviceset-localblock-0-data-0-g2mmc

In this example, the PVC name is ocs-deviceset-localblock-0-data-0-g2mmc.

Remove the failed OSD from the cluster.
# oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove},${osd_id_to_remove2} | oc create -f -
Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal pod. A status of Completed confirms that the OSD removal job succeeded.

# oc get pod -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage
Note: If ocs-osd-removal fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:

# oc logs -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage --tail=-1

Delete the PV associated with the failed node.
Identify the PV associated with the PVC.
# oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>

where x, y, and pvc-suffix are the values in the DeviceSet identified in the previous step.

For example:
# oc get -n openshift-storage pvc ocs-deviceset-localblock-0-data-0-g2mmc
NAME                                      STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ocs-deviceset-localblock-0-data-0-g2mmc   Bound    local-pv-5c9b8982   500Gi      RWO            localblock     24h

In this example, the associated PV is local-pv-5c9b8982.

Delete the PV.
# oc delete pv <persistent-volume>

For example:
# oc delete pv local-pv-5c9b8982
persistentvolume "local-pv-5c9b8982" deleted
Delete the crashcollector pod deployment.

$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=failed-node-name -n openshift-storage
Deploy the new OSD by restarting the rook-ceph-operator to force operator reconciliation.

# oc get -n openshift-storage pod -l app=rook-ceph-operator

Example output:

NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-77758ddc74-dlwn2   1/1     Running   0          1d20h
Delete the rook-ceph-operator.

# oc delete -n openshift-storage pod rook-ceph-operator-77758ddc74-dlwn2

Example output:

pod "rook-ceph-operator-77758ddc74-dlwn2" deleted
Verify that the rook-ceph-operator pod is restarted.

# oc get -n openshift-storage pod -l app=rook-ceph-operator

Example output:

NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-77758ddc74-wqf25   1/1     Running   0          66s
Creation of the new OSD and mon might take several minutes after the operator restarts.
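One way to watch for the new OSD pod to appear, assuming the app=rook-ceph-osd label that Rook applies to OSD pods:

$ oc get -n openshift-storage pods -l app=rook-ceph-osd -w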
Delete the ocs-osd-removal job.

# oc delete job ocs-osd-removal-${osd_id_to_remove}
For example:
# oc delete job ocs-osd-removal-1
job.batch "ocs-osd-removal-1" deleted
Verification steps
- Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Click Workloads
Pods, confirm that at least the following pods on the new node are in Running state: -
csi-cephfsplugin-*
-
csi-rbdplugin-*
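You can also list the storage pods scheduled on the new node from the command line, using the same new-node-name placeholder:

$ oc get pods -n openshift-storage -o wide | grep new-node-name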
- Verify that all other required OpenShift Container Storage pods are in Running state.
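For example, review the STATUS column for the whole namespace:

$ oc get pods -n openshift-storage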
- Make sure that the new incremental mon is created and is in the Running state.

$ oc get pod -n openshift-storage | grep mon
Example output:

rook-ceph-mon-b-74f6dc9dd6-4llzq   1/1   Running   0   6h14m
rook-ceph-mon-c-74948755c-h7wtx    1/1   Running   0   4h24m
rook-ceph-mon-d-598f69869b-4bv49   1/1   Running   0   162m

OSD and Mon might take several minutes to get to the Running state.
- If verification steps fail, contact Red Hat Support.