Chapter 9. Replacing DistributedComputeHCI nodes
During hardware maintenance you may need to scale down, scale up, or replace a DistributedComputeHCI node at an edge site. To replace a DistributedComputeHCI node, remove services from the node you are replacing, scale the number of nodes down, and then follow the procedures for scaling those nodes back up.
9.1. Removing Red Hat Ceph Storage services
Before you remove an HCI (hyperconverged infrastructure) node from a cluster, you must remove its Red Hat Ceph Storage services. To remove the Red Hat Ceph Storage services, disable and remove the ceph-osd service from the cluster services on the node that you are removing, and then stop and disable the mon, mgr, and osd services.
Procedure
On the undercloud, use SSH to connect to the DistributedComputeHCI node that you want to remove:
$ ssh tripleo-admin@<dcn-computehci-node>
Start a cephadm shell. Use the configuration file and keyring file for the site that the host being removed is in:
$ sudo cephadm shell --config /etc/ceph/dcn2.conf \
  --keyring /etc/ceph/dcn2.client.admin.keyring
Record the OSDs (object storage devices) associated with the DistributedComputeHCI node that you are removing, for reference in a later step:
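For example, a minimal check from the cephadm shell uses ceph osd tree. The output below is an illustrative sketch: the OSD IDs for host dcn2-computehci2-1 (1, 7, 11, 15, and 18) match the IDs that are removed later in this procedure, but IDs and weights vary by deployment:
[ceph: root@dcn2-computehci2-1 /]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME                    STATUS  REWEIGHT  PRI-AFF
-1         5.81998  root default
...
-7         1.45500      host dcn2-computehci2-1
 1    hdd  0.29099          osd.1                    up   1.00000  1.00000
 7    hdd  0.29099          osd.7                    up   1.00000  1.00000
11    hdd  0.29099          osd.11                   up   1.00000  1.00000
15    hdd  0.29099          osd.15                   up   1.00000  1.00000
18    hdd  0.29099          osd.18                   up   1.00000  1.00000
...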
Use SSH to connect to another node in the same cluster and remove the monitor from the cluster:
$ sudo cephadm shell --config /etc/ceph/dcn2.conf \
  --keyring /etc/ceph/dcn2.client.admin.keyring
[ceph: root@dcn-computehci2-0]# ceph mon remove dcn2-computehci2-1 -c /etc/ceph/dcn2.conf
removing mon.dcn2-computehci2-1 at [v2:172.23.3.153:3300/0,v1:172.23.3.153:6789/0], there will be 2 monitors
Use SSH to log in again to the node that you are removing from the cluster.
Stop and disable the mgr service:
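A minimal sketch of this step, run as the tripleo-admin user on the node that you are removing. The ceph-mgr@dcn2-computehci2-1 unit name is an assumption for illustration; list the Ceph units first and use the exact mgr unit name that systemctl reports on your deployment:
[tripleo-admin@dcn2-computehci2-1 ~]$ sudo systemctl --type=service | grep ceph
[tripleo-admin@dcn2-computehci2-1 ~]$ sudo systemctl disable --now ceph-mgr@dcn2-computehci2-1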
Start the cephadm shell:
$ sudo cephadm shell --config /etc/ceph/dcn2.conf \
  --keyring /etc/ceph/dcn2.client.admin.keyring
Verify that the mgr service for the node is removed from the cluster:
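For example, inspect the services section of ceph -s from the cephadm shell. The output below is illustrative and truncated; the node that you are removing must not appear on the mgr line:
[ceph: root@dcn2-computehci2-1 /]# ceph -s
...
  services:
    mon: 2 daemons, quorum dcn2-computehci2-2,dcn2-computehci2-0 (age 2h)
    mgr: dcn2-computehci2-2(active, since 20h), standbys: dcn2-computehci2-0
...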
- The node from which the mgr service was removed is no longer listed when the removal is successful.
Export the Red Hat Ceph Storage specification:
[ceph: root@dcn2-computehci2-1 ~]# ceph orch ls --export > spec.yml
Edit the specifications in the spec.yml file (see the example after this list):
- Remove all instances of the host <dcn-computehci-node> from spec.yml.
Remove all instances of the <dcn-computehci-node> entry from the following:
- service_type: osd
- service_type: mon
- service_type: host
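An illustrative fragment of an exported specification, assuming the node being removed is dcn2-computehci2-1. Host names, the address, and the OSD spec details are examples and vary by deployment. Delete the marked placement entries and the entire service_type: host block for the node:
service_type: host
hostname: dcn2-computehci2-1    # remove this whole block
addr: 172.23.3.153
---
service_type: mon
service_name: mon
placement:
  hosts:
  - dcn2-computehci2-0
  - dcn2-computehci2-1          # remove this line
  - dcn2-computehci2-2
---
service_type: osd
service_id: default_drive_group
placement:
  hosts:
  - dcn2-computehci2-0
  - dcn2-computehci2-1          # remove this line
  - dcn2-computehci2-2
spec:
  data_devices:
    all: true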
Reapply the Red Hat Ceph Storage specification:
[ceph: root@dcn2-computehci2-1 /]# ceph orch apply -i spec.yml
Remove the OSDs that you identified using ceph osd tree:
[ceph: root@dcn2-computehci2-1 /]# ceph orch osd rm --zap 1 7 11 15 18
Scheduled OSD(s) for removal
Verify the status of the OSDs being removed. Do not continue until the following command returns no output:
[ceph: root@dcn2-computehci2-1 /]# ceph orch osd rm status
OSD_ID  HOST                STATE     PG_COUNT  REPLACE  FORCE  DRAIN_STARTED_AT
1       dcn2-computehci2-1  draining  27        False    False  2021-04-23 21:35:51.215361
7       dcn2-computehci2-1  draining  8         False    False  2021-04-23 21:35:49.111500
11      dcn2-computehci2-1  draining  14        False    False  2021-04-23 21:35:50.243762
Verify that no daemons remain on the host you are removing:
[ceph: root@dcn2-computehci2-1 /]# ceph orch ps dcn2-computehci2-1
If daemons are still present, you can remove them with the following command:
[ceph: root@dcn2-computehci2-1 /]# ceph orch host drain dcn2-computehci2-1
Remove the <dcn-computehci-node> host from the Red Hat Ceph Storage cluster:
[ceph: root@dcn2-computehci2-1 /]# ceph orch host rm dcn2-computehci2-1
Removed host 'dcn2-computehci2-1'
9.2. Removing the Image service (glance) services
Remove image services from a node when you remove it from service.
Procedure
Disable the Image service services by using systemctl on the node that you are removing:
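A minimal sketch, assuming the Image service runs under the tripleo_glance_api and tripleo_glance_api_tls_proxy units shown in the verification output later in this chapter; confirm the unit names on your node with systemctl --type service | grep glance:
[root@dcn2-computehci2-1 ~]# systemctl stop tripleo_glance_api.service
[root@dcn2-computehci2-1 ~]# systemctl stop tripleo_glance_api_tls_proxy.service
[root@dcn2-computehci2-1 ~]# systemctl disable tripleo_glance_api.service
[root@dcn2-computehci2-1 ~]# systemctl disable tripleo_glance_api_tls_proxy.service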
9.3. Removing the Block Storage (cinder) services
You must remove the cinder-volume and etcd services from the DistributedComputeHCI node when you remove it from service.
Procedure
Identify and disable the cinder-volume service on the node that you are removing:
(central) [stack@site-undercloud-0 ~]$ openstack volume service list --service cinder-volume
| cinder-volume | dcn2-computehci2-1@tripleo_ceph | az-dcn2 | enabled | up | 2022-03-23T17:41:43.000000 |
(central) [stack@site-undercloud-0 ~]$ openstack volume service set --disable dcn2-computehci2-1@tripleo_ceph cinder-volume
Log on to a different DistributedComputeHCI node in the stack:
$ ssh tripleo-admin@dcn2-computehci2-0
Remove the cinder-volume service associated with the node that you are removing:
[root@dcn2-computehci2-0 ~]# podman exec -it cinder_volume cinder-manage service remove cinder-volume dcn2-computehci2-1@tripleo_ceph
Service cinder-volume on host dcn2-computehci2-1@tripleo_ceph removed.
Stop and disable the tripleo_cinder_volume service on the node that you are removing:
[root@dcn2-computehci2-1 ~]# systemctl stop tripleo_cinder_volume.service
[root@dcn2-computehci2-1 ~]# systemctl disable tripleo_cinder_volume.service
Removed /etc/systemd/system/multi-user.target.wants/tripleo_cinder_volume.service
9.4. Delete the DistributedComputeHCI node
Set the provisioned parameter to a value of false and remove the node from the stack. Disable the nova-compute service and delete the relevant network agent.
Procedure
Copy the overcloud-baremetal-deploy.yaml file:
cp /home/stack/dcn2/overcloud-baremetal-deploy.yaml \
  /home/stack/dcn2/baremetal-deployment-scaledown.yaml
Edit the baremetal-deployment-scaledown.yaml file. Identify the host that you want to remove and set the provisioned parameter to a value of false:
instances:
...
- hostname: dcn2-computehci2-1
  provisioned: false
Remove the node from the stack:
openstack overcloud node delete --stack dcn2 --baremetal-deployment /home/stack/dcn2/baremetal-deployment-scaledown.yaml
Optional: If you are going to reuse the node, use ironic to clean the disk. This is required if the node will host Ceph OSDs:
openstack baremetal node manage $UUID
openstack baremetal node clean $UUID --clean-steps '[{"interface":"deploy", "step": "erase_devices_metadata"}]'
openstack baremetal provide $UUID
Redeploy the central site. Include all templates that you used for the initial configuration:
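For example, if the central location is deployed with a wrapper script, as in the examples later in this chapter, rerun that script with the same role, network, and environment files that you used for the initial deployment:
(undercloud) [stack@site-undercloud-0 ~]$ ./overcloud_deploy.sh
This procedure also calls for disabling the nova-compute service and deleting the relevant network agent for the removed node. A minimal sketch using standard OpenStack client commands; the placeholder values are assumptions to adapt to your environment:
(central) [stack@site-undercloud-0 ~]$ openstack compute service set --disable <dcn-computehci-node> nova-compute
(central) [stack@site-undercloud-0 ~]$ openstack network agent list --host <dcn-computehci-node>
(central) [stack@site-undercloud-0 ~]$ openstack network agent delete <agent-id>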
9.5. Replacing a removed DistributedComputeHCI node
To add new HCI nodes to your DCN deployment, you must redeploy the edge stack with the additional node, perform a ceph export of that stack, and then perform a stack update for the central location. The stack update of the central location adds configurations specific to edge sites.
Prerequisites
- The node counts are correct in the nodes_data.yaml file of the stack that you want to replace the node in or add a new node to.
Procedure
You must set the EtcdInitialClusterState parameter to existing in one of the templates called by your deploy script:
parameter_defaults:
  EtcdInitialClusterState: existing
Redeploy using the deployment script specific to the stack:
(undercloud) [stack@site-undercloud-0 ~]$ ./overcloud_deploy_dcn2.sh
…
Overcloud Deployed without error
Export the Red Hat Ceph Storage data from the stack:
(undercloud) [stack@site-undercloud-0 ~]$ sudo -E openstack overcloud export ceph --stack dcn1,dcn2 --config-download-dir /var/lib/mistral --output-file ~/central/dcn2_scale_up_ceph_external.yaml
Replace dcn_ceph_external.yaml with the newly generated dcn2_scale_up_ceph_external.yaml in the deploy script for the central location.
Perform a stack update at central:
(undercloud) [stack@site-undercloud-0 ~]$ ./overcloud_deploy.sh
...
Overcloud Deployed without error
9.6. Verify the functionality of a replaced DistributedComputeHCI node
Ensure that the value of the Status field is enabled, and that the value of the State field is up:
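For example, using the overcloud credentials for the central location. The row below is an illustrative sketch for a replacement node named dcn2-computehci2-5; host names, zones, and timestamps vary by deployment:
(central) [stack@site-undercloud-0 ~]$ openstack compute service list --service nova-compute
| nova-compute | dcn2-computehci2-5 | az-dcn2 | enabled | up | ... |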
Ensure that all network agents are in the up state:
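For example, list the agents for the new node and check the Alive and State columns. The row below is illustrative; the agent type and binary depend on your networking back end:
(central) [stack@site-undercloud-0 ~]$ openstack network agent list --host dcn2-computehci2-5
| <agent-id> | <agent-type> | dcn2-computehci2-5 | ... | :-) | UP | <binary> |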
Verify the status of the Ceph cluster:
Use SSH to connect to the new DistributedComputeHCI node and check the status of the Ceph cluster:
[root@dcn2-computehci2-5 ~]# podman exec -it ceph-mon-dcn2-computehci2-5 \
  ceph -s -c /etc/ceph/dcn2.conf
Verify that both the ceph mon and ceph mgr services exist for the new node:
services:
  mon: 3 daemons, quorum dcn2-computehci2-2,dcn2-computehci2-0,dcn2-computehci2-5 (age 3d)
  mgr: dcn2-computehci2-2(active, since 3d), standbys: dcn2-computehci2-0, dcn2-computehci2-5
  osd: 20 osds: 20 up (since 3d), 20 in (since 3d)
Verify the status of the Ceph OSDs with ceph osd tree. Ensure that all OSDs for the new node are in STATUS up:
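For example, reusing the podman exec pattern from the previous step. The output below is an illustrative, truncated sketch; the OSD IDs and weights for dcn2-computehci2-5 are examples:
[root@dcn2-computehci2-5 ~]# podman exec -it ceph-mon-dcn2-computehci2-5 \
  ceph osd tree -c /etc/ceph/dcn2.conf
ID  CLASS  WEIGHT   TYPE NAME                    STATUS  REWEIGHT  PRI-AFF
-1         5.81998  root default
...
-9         1.45500      host dcn2-computehci2-5
 2    hdd  0.29099          osd.2                    up   1.00000  1.00000
 5    hdd  0.29099          osd.5                    up   1.00000  1.00000
...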
Verify that the cinder-volume service for the new DistributedComputeHCI node has a Status of enabled and a State of up:
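For example, following the pattern shown earlier in this chapter. The row below is illustrative; the @tripleo_ceph back end suffix is an assumption based on the earlier output:
(central) [stack@site-undercloud-0 ~]$ openstack volume service list --service cinder-volume
| cinder-volume | dcn2-computehci2-5@tripleo_ceph | az-dcn2 | enabled | up | ... |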
Note: If the State of the cinder-volume service is down, the service has not been started on the node.
Use SSH to connect to the new DistributedComputeHCI node and check the status of the Glance services with systemctl:
[root@dcn2-computehci2-5 ~]# systemctl --type service | grep glance
tripleo_glance_api.service              loaded active     running  glance_api container
tripleo_glance_api_healthcheck.service  loaded activating start    start glance_api healthcheck
tripleo_glance_api_tls_proxy.service    loaded active     running  glance_api_tls_proxy container
9.7. Troubleshooting DistributedComputeHCI state down
If the replacement node was deployed without the EtcdInitialClusterState parameter value set to existing, then the cinder-volume service of the replaced node shows down when you run openstack volume service list.
Procedure
Log on to the replacement node and check the logs for the etcd service. Check that the /var/log/containers/stdouts/etcd.log log file shows the etcd service reporting a cluster ID mismatch:
2022-04-06T18:00:11.834104130+00:00 stderr F 2022-04-06 18:00:11.834045 E | rafthttp: request cluster ID mismatch (got 654f4cf0e2cfb9fd want 918b459b36fe2c0c)
Set the EtcdInitialClusterState parameter to the value of existing in your deployment templates and rerun the deployment script.
Use SSH to connect to the replacement node and run the following commands as root:
[root@dcn2-computehci2-4 ~]# systemctl stop tripleo_etcd
[root@dcn2-computehci2-4 ~]# rm -rf /var/lib/etcd/*
[root@dcn2-computehci2-4 ~]# systemctl start tripleo_etcd
Recheck the /var/log/containers/stdouts/etcd.log log file to verify that the node successfully joined the cluster:
2022-04-06T18:24:22.130059875+00:00 stderr F 2022-04-06 18:24:22.129395 I | etcdserver/membership: added member 96f61470cd1839e5 [https://dcn2-computehci2-4.internalapi.redhat.local:2380] to cluster 654f4cf0e2cfb9fd
Check the state of the cinder-volume service, and confirm that it reads up on the replacement node when you run openstack volume service list.