Chapter 8. Replacing DistributedComputeHCI nodes
During hardware maintenance you may need to scale down, scale up, or replace a DistributedComputeHCI node at an edge site. To replace a DistributedComputeHCI node, remove services from the node you are replacing, scale the number of nodes down, and then follow the procedures for scaling those nodes back up.
8.1. Removing Red Hat Ceph Storage services
Before removing an HCI (hyperconverged) node from a cluster, you must remove the Red Hat Ceph Storage services. To remove the Red Hat Ceph Storage services, disable and remove the ceph-osd service from the cluster services on the node that you are removing, then stop and disable the mon, mgr, and osd services.
Procedure
On the undercloud, use SSH to connect to the DistributedComputeHCI node that you want to remove:
$ ssh tripleo-admin@<dcn-computehci-node>
Start a cephadm shell. Use the configuration file and keyring file for the site that contains the host you are removing:
$ sudo cephadm shell --config /etc/ceph/dcn2.conf \
  --keyring /etc/ceph/dcn2.client.admin.keyring
Record the OSDs (object storage devices) associated with the DistributedComputeHCI node that you are removing, for reference in a later step:
[ceph: root@dcn2-computehci2-1 ~]# ceph osd tree -c /etc/ceph/dcn2.conf
…
-3       0.24399     host dcn2-computehci2-1
 1   hdd 0.04880         osd.1                   up  1.00000 1.00000
 7   hdd 0.04880         osd.7                   up  1.00000 1.00000
11   hdd 0.04880         osd.11                  up  1.00000 1.00000
15   hdd 0.04880         osd.15                  up  1.00000 1.00000
18   hdd 0.04880         osd.18                  up  1.00000 1.00000
…
Use SSH to connect to another node in the same cluster and remove the monitor from the cluster:
$ sudo cephadm shell --config /etc/ceph/dcn2.conf \
  --keyring /etc/ceph/dcn2.client.admin.keyring
[ceph: root@dcn-computehci2-0]# ceph mon remove dcn2-computehci2-1 -c /etc/ceph/dcn2.conf
removing mon.dcn2-computehci2-1 at [v2:172.23.3.153:3300/0,v1:172.23.3.153:6789/0], there will be 2 monitors
Use SSH to log in again to the node that you are removing from the cluster.
Stop and disable the mgr service:
[tripleo-admin@dcn2-computehci2-1 ~]$ sudo systemctl --type=service | grep ceph
ceph-crash@dcn2-computehci2-1.service loaded active running Ceph crash dump collector
ceph-mgr@dcn2-computehci2-1.service   loaded active running Ceph Manager
[tripleo-admin@dcn2-computehci2-1 ~]$ sudo systemctl stop ceph-mgr@dcn2-computehci2-1
[tripleo-admin@dcn2-computehci2-1 ~]$ sudo systemctl --type=service | grep ceph
ceph-crash@dcn2-computehci2-1.service loaded active running Ceph crash dump collector
[tripleo-admin@dcn2-computehci2-1 ~]$ sudo systemctl disable ceph-mgr@dcn2-computehci2-1
Removed /etc/systemd/system/multi-user.target.wants/ceph-mgr@dcn2-computehci2-1.service.
Start the cephadm shell:
$ sudo cephadm shell --config /etc/ceph/dcn2.conf \
  --keyring /etc/ceph/dcn2.client.admin.keyring
Verify that the mgr service for the node is removed from the cluster:
[ceph: root@dcn2-computehci2-1 ~]# ceph -s
  cluster:
    id:     b9b53581-d590-41ac-8463-2f50aa985001
    health: HEALTH_WARN
            3 pools have too many placement groups
            mons are allowing insecure global_id reclaim

  services:
    mon: 2 daemons, quorum dcn2-computehci2-2,dcn2-computehci2-0 (age 2h)
    mgr: dcn2-computehci2-2(active, since 20h), standbys: dcn2-computehci2-0   1
    osd: 15 osds: 15 up (since 3h), 15 in (since 3h)

  data:
    pools:   3 pools, 384 pgs
    objects: 32 objects, 88 MiB
    usage:   16 GiB used, 734 GiB / 750 GiB avail
    pgs:     384 active+clean
1  The node that the mgr service is removed from is no longer listed when the mgr service is successfully removed.
Export the Red Hat Ceph Storage specification:
[ceph: root@dcn2-computehci2-1 ~]# ceph orch ls --export > spec.yml
Edit the specifications in the spec.yml file:
- Remove all instances of the <dcn-computehci-node> host from spec.yml.
- Remove all instances of the <dcn-computehci-node> entry from the following, as shown in the example after this list:
  - service_type: osd
  - service_type: mon
  - service_type: host
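For reference, the exported specification contains entries similar to the following minimal sketch. The host names, address, and device settings shown here are illustrative placeholders, not values taken from your cluster; delete the lines that reference the host you are removing:
service_type: host
hostname: dcn2-computehci2-1      # remove this whole host entry
addr: 172.23.3.153
---
service_type: mon
placement:
  hosts:
    - dcn2-computehci2-0
    - dcn2-computehci2-1          # remove this line
    - dcn2-computehci2-2
---
service_type: osd
service_id: default_drive_group   # illustrative service_id
placement:
  hosts:
    - dcn2-computehci2-0
    - dcn2-computehci2-1          # remove this line
    - dcn2-computehci2-2
spec:
  data_devices:
    all: true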
Reapply the Red Hat Ceph Storage specification:
[ceph: root@dcn2-computehci2-1 /]# ceph orch apply -i spec.yml
Remove the OSDs that you identified using ceph osd tree:
[ceph: root@dcn2-computehci2-1 /]# ceph orch osd rm --zap 1 7 11 15 18
Scheduled OSD(s) for removal
Verify the status of the OSDs being removed. Do not continue until the following command returns no output:
[ceph: root@dcn2-computehci2-1 /]# ceph orch osd rm status
OSD_ID  HOST                STATE     PG_COUNT  REPLACE  FORCE  DRAIN_STARTED_AT
1       dcn2-computehci2-1  draining  27        False    False  2021-04-23 21:35:51.215361
7       dcn2-computehci2-1  draining  8         False    False  2021-04-23 21:35:49.111500
11      dcn2-computehci2-1  draining  14        False    False  2021-04-23 21:35:50.243762
Verify that no daemons remain on the host you are removing:
[ceph: root@dcn2-computehci2-1 /]# ceph orch ps dcn2-computehci2-1
If daemons are still present, you can remove them with the following command:
[ceph: root@dcn2-computehci2-1 /]# ceph orch host drain dcn2-computehci2-1
Remove the <dcn-computehci-node> host from the Red Hat Ceph Storage cluster:
[ceph: root@dcn2-computehci2-1 /]# ceph orch host rm dcn2-computehci2-1
Removed host 'dcn2-computehci2-1'
8.2. Removing the Image service (glance) services
Remove the Image service (glance) services from a node when you remove the node from service.
Procedure
Disable the Image service services by using systemctl on the node that you are removing:
[root@dcn2-computehci2-1 ~]# systemctl stop tripleo_glance_api.service
[root@dcn2-computehci2-1 ~]# systemctl stop tripleo_glance_api_tls_proxy.service
[root@dcn2-computehci2-1 ~]# systemctl disable tripleo_glance_api.service
Removed /etc/systemd/system/multi-user.target.wants/tripleo_glance_api.service.
[root@dcn2-computehci2-1 ~]# systemctl disable tripleo_glance_api_tls_proxy.service
Removed /etc/systemd/system/multi-user.target.wants/tripleo_glance_api_tls_proxy.service.
8.3. Removing the Block Storage (cinder) services
You must remove the cinder-volume and etcd services from the DistributedComputeHCI node when you remove it from service. The following procedure covers the cinder-volume service; see the note after the procedure for the etcd service.
Procedure
Identify and disable the cinder-volume service on the node that you are removing:
(central) [stack@site-undercloud-0 ~]$ openstack volume service list --service cinder-volume
| cinder-volume | dcn2-computehci2-1@tripleo_ceph | az-dcn2 | enabled | up | 2022-03-23T17:41:43.000000 |
(central) [stack@site-undercloud-0 ~]$ openstack volume service set --disable dcn2-computehci2-1@tripleo_ceph cinder-volume
Log on to a different DistributedComputeHCI node in the stack:
$ ssh tripleo-admin@dcn2-computehci2-0
Remove the cinder-volume service associated with the node that you are removing:
[root@dcn2-computehci2-0 ~]# podman exec -it cinder_volume cinder-manage service remove cinder-volume dcn2-computehci2-1@tripleo_ceph
Service cinder-volume on host dcn2-computehci2-1@tripleo_ceph removed.
Stop and disable the tripleo_cinder_volume service on the node that you are removing:
[root@dcn2-computehci2-1 ~]# systemctl stop tripleo_cinder_volume.service
[root@dcn2-computehci2-1 ~]# systemctl disable tripleo_cinder_volume.service
Removed /etc/systemd/system/multi-user.target.wants/tripleo_cinder_volume.service
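Note: The procedure above does not show commands for the etcd service that the section introduction mentions. The following is a minimal sketch, not taken from this chapter's procedure, that stops and disables the tripleo_etcd unit (the unit name used in Section 8.7) on the node that you are removing:
[root@dcn2-computehci2-1 ~]# systemctl stop tripleo_etcd
[root@dcn2-computehci2-1 ~]# systemctl disable tripleo_etcd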
8.4. Delete the DistributedComputeHCI node
Set the provisioned parameter to a value of false and remove the node from the stack. Disable the nova-compute service and delete the relevant network agent (see the sketch at the end of this section).
Procedure
Copy the overcloud-baremetal-deploy.yaml file:
cp /home/stack/dcn2/overcloud-baremetal-deploy.yaml \
   /home/stack/dcn2/baremetal-deployment-scaledown.yaml
Edit the baremetal-deployment-scaledown.yaml file. Identify the host that you want to remove and set the provisioned parameter to a value of false:
instances:
  ...
  - hostname: dcn2-computehci2-1
    provisioned: false
Remove the node from the stack:
openstack overcloud node delete --stack dcn2 --baremetal-deployment /home/stack/dcn2/baremetal-deployment-scaledown.yaml
Optional: If you are going to reuse the node, use ironic to clean the disk. This is required if the node will host Ceph OSDs:
openstack baremetal node manage $UUID
openstack baremetal node clean $UUID --clean-steps '[{"interface":"deploy", "step": "erase_devices_metadata"}]'
openstack baremetal provide $UUID
Redeploy the central site. Include all templates that you used for the initial configuration:
openstack overcloud deploy \
  --deployed-server \
  --stack central \
  --templates /usr/share/openstack-tripleo-heat-templates/ \
  -r ~/control-plane/central_roles.yaml \
  -n ~/network-data.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-environment.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/podman.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/dcn-storage.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/cephadm/cephadm.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/nova-az-config.yaml \
  -e /home/stack/central/overcloud-networks-deployed.yaml \
  -e /home/stack/central/overcloud-vip-deployed.yaml \
  -e /home/stack/central/deployed_metal.yaml \
  -e /home/stack/central/deployed_ceph.yaml \
  -e /home/stack/central/dcn_ceph.yaml \
  -e /home/stack/central/glance_update.yaml
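The section introduction also calls for disabling the nova-compute service and deleting the network agent of the removed node, but the procedure above does not show those commands. The following is a minimal sketch, assuming the removed host is dcn2-computehci2-1.redhat.local and that you have sourced the credentials for the central overcloud; confirm the IDs returned by the list commands before you delete anything:
(central) [stack@site-undercloud-0 ~]$ openstack compute service list --host dcn2-computehci2-1.redhat.local
(central) [stack@site-undercloud-0 ~]$ openstack compute service set --disable dcn2-computehci2-1.redhat.local nova-compute
(central) [stack@site-undercloud-0 ~]$ openstack compute service delete <service-id>
(central) [stack@site-undercloud-0 ~]$ openstack network agent list --host dcn2-computehci2-1.redhat.local
(central) [stack@site-undercloud-0 ~]$ openstack network agent delete <agent-id>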
8.5. Replacing a removed DistributedComputeHCI node
To add new HCI nodes to your DCN deployment, you must redeploy the edge stack with the additional node, perform a ceph export of that stack, and then perform a stack update for the central location. A stack update of the central location adds configuration specific to edge sites.
Prerequisites
The node counts are correct in the nodes_data.yaml file of the stack where you want to replace a node or add a new node.
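For reference, a minimal sketch of such a count entry in nodes_data.yaml, assuming the DistributedComputeHCI role name and the usual <Role>Count parameter convention; the parameter name and value are illustrative, not taken from this chapter:
parameter_defaults:
  # Include the replacement node in the role count for the edge stack
  DistributedComputeHCICount: 3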
Procedure
You must set the EtcdInitialClusterState parameter to existing in one of the templates called by your deploy script:
parameter_defaults:
  EtcdInitialClusterState: existing
Redeploy using the deployment script specific to the stack:
(undercloud) [stack@site-undercloud-0 ~]$ ./overcloud_deploy_dcn2.sh
…
Overcloud Deployed without error
Export the Red Hat Ceph Storage data from the stack:
(undercloud) [stack@site-undercloud-0 ~]$ sudo -E openstack overcloud export ceph --stack dcn1,dcn2 --config-download-dir /var/lib/mistral --output-file ~/central/dcn2_scale_up_ceph_external.yaml
Replace dcn_ceph_external.yaml with the newly generated dcn2_scale_up_ceph_external.yaml in the deploy script for the central location.
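A minimal sketch of that edit, assuming your central deploy script passes the existing file with a -e option; the exact path of the old file depends on your environment:
  -e ~/central/dcn_ceph_external.yaml \
becomes:
  -e ~/central/dcn2_scale_up_ceph_external.yaml \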
Perform a stack update at central:
(undercloud) [stack@site-undercloud-0 ~]$ ./overcloud_deploy.sh
...
Overcloud Deployed without error
8.6. Verify the functionality of a replaced DistributedComputeHCI node
Ensure that the value of the Status field is enabled, and that the value of the State field is up:
(central) [stack@site-undercloud-0 ~]$ openstack compute service list -c Binary -c Host -c Zone -c Status -c State
+----------------+-----------------------------------------+------------+---------+-------+
| Binary         | Host                                    | Zone       | Status  | State |
+----------------+-----------------------------------------+------------+---------+-------+
...
| nova-compute   | dcn1-compute1-0.redhat.local            | az-dcn1    | enabled | up    |
| nova-compute   | dcn1-compute1-1.redhat.local            | az-dcn1    | enabled | up    |
| nova-compute   | dcn2-computehciscaleout2-0.redhat.local | az-dcn2    | enabled | up    |
| nova-compute   | dcn2-computehci2-0.redhat.local         | az-dcn2    | enabled | up    |
| nova-compute   | dcn2-computescaleout2-0.redhat.local    | az-dcn2    | enabled | up    |
| nova-compute   | dcn2-computehci2-2.redhat.local         | az-dcn2    | enabled | up    |
...
Ensure that all network agents are in the up state:
(central) [stack@site-undercloud-0 ~]$ openstack network agent list -c "Agent Type" -c Host -c Alive -c State
+--------------------+-----------------------------------------+-------+-------+
| Agent Type         | Host                                    | Alive | State |
+--------------------+-----------------------------------------+-------+-------+
| DHCP agent         | dcn3-compute3-1.redhat.local            | :-)   | UP    |
| Open vSwitch agent | central-computehci0-1.redhat.local      | :-)   | UP    |
| DHCP agent         | dcn3-compute3-0.redhat.local            | :-)   | UP    |
| DHCP agent         | central-controller0-2.redhat.local      | :-)   | UP    |
| Open vSwitch agent | dcn3-compute3-1.redhat.local            | :-)   | UP    |
| Open vSwitch agent | dcn1-compute1-1.redhat.local            | :-)   | UP    |
| Open vSwitch agent | central-computehci0-0.redhat.local      | :-)   | UP    |
| DHCP agent         | central-controller0-1.redhat.local      | :-)   | UP    |
| L3 agent           | central-controller0-2.redhat.local      | :-)   | UP    |
| Metadata agent     | central-controller0-1.redhat.local      | :-)   | UP    |
| Open vSwitch agent | dcn2-computescaleout2-0.redhat.local    | :-)   | UP    |
| Open vSwitch agent | dcn2-computehci2-5.redhat.local         | :-)   | UP    |
| Open vSwitch agent | central-computehci0-2.redhat.local      | :-)   | UP    |
| DHCP agent         | central-controller0-0.redhat.local      | :-)   | UP    |
| Open vSwitch agent | central-controller0-1.redhat.local      | :-)   | UP    |
| Open vSwitch agent | dcn2-computehci2-0.redhat.local         | :-)   | UP    |
| Open vSwitch agent | dcn1-compute1-0.redhat.local            | :-)   | UP    |
...
Verify the status of the Ceph cluster:
Use SSH to connect to the new DistributedComputeHCI node and check the status of the Ceph cluster:
[root@dcn2-computehci2-5 ~]# podman exec -it ceph-mon-dcn2-computehci2-5 \
ceph -s -c /etc/ceph/dcn2.conf
Verify that both the ceph mon and ceph mgr services exist for the new node:
  services:
    mon: 3 daemons, quorum dcn2-computehci2-2,dcn2-computehci2-0,dcn2-computehci2-5 (age 3d)
    mgr: dcn2-computehci2-2(active, since 3d), standbys: dcn2-computehci2-0, dcn2-computehci2-5
    osd: 20 osds: 20 up (since 3d), 20 in (since 3d)
Verify the status of the Ceph OSDs with ceph osd tree. Ensure that all OSDs for the new node have a STATUS of up:
[root@dcn2-computehci2-5 ~]# podman exec -it ceph-mon-dcn2-computehci2-5 ceph osd tree -c /etc/ceph/dcn2.conf
ID  CLASS  WEIGHT   TYPE NAME                            STATUS  REWEIGHT  PRI-AFF
-1         0.97595  root default
-5         0.24399      host dcn2-computehci2-0
 0    hdd  0.04880          osd.0                            up   1.00000  1.00000
 4    hdd  0.04880          osd.4                            up   1.00000  1.00000
 8    hdd  0.04880          osd.8                            up   1.00000  1.00000
13    hdd  0.04880          osd.13                           up   1.00000  1.00000
17    hdd  0.04880          osd.17                           up   1.00000  1.00000
-9         0.24399      host dcn2-computehci2-2
 3    hdd  0.04880          osd.3                            up   1.00000  1.00000
 5    hdd  0.04880          osd.5                            up   1.00000  1.00000
10    hdd  0.04880          osd.10                           up   1.00000  1.00000
14    hdd  0.04880          osd.14                           up   1.00000  1.00000
19    hdd  0.04880          osd.19                           up   1.00000  1.00000
-3         0.24399      host dcn2-computehci2-5
 1    hdd  0.04880          osd.1                            up   1.00000  1.00000
 7    hdd  0.04880          osd.7                            up   1.00000  1.00000
11    hdd  0.04880          osd.11                           up   1.00000  1.00000
15    hdd  0.04880          osd.15                           up   1.00000  1.00000
18    hdd  0.04880          osd.18                           up   1.00000  1.00000
-7         0.24399      host dcn2-computehciscaleout2-0
 2    hdd  0.04880          osd.2                            up   1.00000  1.00000
 6    hdd  0.04880          osd.6                            up   1.00000  1.00000
 9    hdd  0.04880          osd.9                            up   1.00000  1.00000
12    hdd  0.04880          osd.12                           up   1.00000  1.00000
16    hdd  0.04880          osd.16                           up   1.00000  1.00000
Verify that the cinder-volume service for the new DistributedComputeHCI node has a Status of enabled and a State of up:
(central) [stack@site-undercloud-0 ~]$ openstack volume service list --service cinder-volume -c Binary -c Host -c Zone -c Status -c State
+---------------+---------------------------------+------------+---------+-------+
| Binary        | Host                            | Zone       | Status  | State |
+---------------+---------------------------------+------------+---------+-------+
| cinder-volume | hostgroup@tripleo_ceph          | az-central | enabled | up    |
| cinder-volume | dcn1-compute1-1@tripleo_ceph    | az-dcn1    | enabled | up    |
| cinder-volume | dcn1-compute1-0@tripleo_ceph    | az-dcn1    | enabled | up    |
| cinder-volume | dcn2-computehci2-0@tripleo_ceph | az-dcn2    | enabled | up    |
| cinder-volume | dcn2-computehci2-2@tripleo_ceph | az-dcn2    | enabled | up    |
| cinder-volume | dcn2-computehci2-5@tripleo_ceph | az-dcn2    | enabled | up    |
+---------------+---------------------------------+------------+---------+-------+
Note: If the State of the cinder-volume service is down, the service has not been started on the node.
Use SSH to connect to the new DistributedComputeHCI node and check the status of the Glance services with systemctl:
[root@dcn2-computehci2-5 ~]# systemctl --type service | grep glance
tripleo_glance_api.service              loaded active     running      glance_api container
tripleo_glance_api_healthcheck.service  loaded activating start start  glance_api healthcheck
tripleo_glance_api_tls_proxy.service    loaded active     running      glance_api_tls_proxy container
8.7. Troubleshooting DistributedComputeHCI state down
If the replacement node was deployed without the EtcdInitialClusterState parameter value set to existing, then the cinder-volume service of the replaced node shows down when you run openstack volume service list.
Procedure
Log on to the replacement node and check the logs for the etcd service. Check that the logs show the etcd service reporting a cluster ID mismatch in the /var/log/containers/stdouts/etcd.log log file:
2022-04-06T18:00:11.834104130+00:00 stderr F 2022-04-06 18:00:11.834045 E | rafthttp: request cluster ID mismatch (got 654f4cf0e2cfb9fd want 918b459b36fe2c0c)
Set the EtcdInitialClusterState parameter to the value of existing in your deployment templates and rerun the deployment script.
Use SSH to connect to the replacement node and run the following commands as root:
[root@dcn2-computehci2-4 ~]# systemctl stop tripleo_etcd
[root@dcn2-computehci2-4 ~]# rm -rf /var/lib/etcd/*
[root@dcn2-computehci2-4 ~]# systemctl start tripleo_etcd
Recheck the /var/log/containers/stdouts/etcd.log log file to verify that the node successfully joined the cluster:
2022-04-06T18:24:22.130059875+00:00 stderr F 2022-04-06 18:24:22.129395 I | etcdserver/membership: added member 96f61470cd1839e5 [https://dcn2-computehci2-4.internalapi.redhat.local:2380] to cluster 654f4cf0e2cfb9fd
Check the state of the cinder-volume service, and confirm it reads up on the replacement node when you run openstack volume service list.
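For example, reusing the column filter shown in Section 8.6 (a sketch; your host names will differ):
(central) [stack@site-undercloud-0 ~]$ openstack volume service list --service cinder-volume -c Binary -c Host -c Zone -c Status -c State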