Chapter 8. Replacing DistributedComputeHCI nodes


During hardware maintenance you may need to scale down, scale up, or replace a DistributedComputeHCI node at an edge site. To replace a DistributedComputeHCI node, remove services from the node you are replacing, scale the number of nodes down, and then follow the procedures for scaling those nodes back up.

8.1. Removing Red Hat Ceph Storage services

Before you remove a hyperconverged (HCI) node from a cluster, you must remove its Red Hat Ceph Storage services. To remove the Red Hat Ceph Storage services, remove the Ceph Monitor (mon) and Manager (mgr) services for the node, remove its OSDs from the cluster, and then remove the host from the Red Hat Ceph Storage cluster.

Procedure

  1. On the undercloud, use SSH to connect to the DistributedComputeHCI node that you want to remove:

    $ ssh tripleo-admin@<dcn-computehci-node>
  2. Start a cephadm shell. Use the configuration file and keyring file for the site that contains the host you are removing:

    $ sudo cephadm shell --config /etc/ceph/dcn2.conf \
    --keyring /etc/ceph/dcn2.client.admin.keyring
  3. Record the OSDs (object storage devices) associated with the DistributedComputeHCI node you are removing, for reference in a later step:

    [ceph: root@dcn2-computehci2-1 ~]# ceph osd tree -c /etc/ceph/dcn2.conf
    …
    -3       0.24399     host dcn2-computehci2-1
     1   hdd 0.04880         osd.1                           up  1.00000 1.00000
     7   hdd 0.04880         osd.7                           up  1.00000 1.00000
    11   hdd 0.04880         osd.11                          up  1.00000 1.00000
    15   hdd 0.04880         osd.15                          up  1.00000 1.00000
    18   hdd 0.04880         osd.18                          up  1.00000 1.00000
    …
  4. Use SSH to connect to another node in the same cluster and remove the monitor for the node that you are removing:

    $ sudo cephadm shell --config /etc/ceph/dcn2.conf \
    --keyring /etc/ceph/dcn2.client.admin.keyring
    
    [ceph: root@dcn2-computehci2-0]# ceph mon remove dcn2-computehci2-1 -c /etc/ceph/dcn2.conf
    removing mon.dcn2-computehci2-1 at [v2:172.23.3.153:3300/0,v1:172.23.3.153:6789/0], there will be 2 monitors
  5. Use SSH to log in again to the node that you are removing from the cluster.
  6. Stop and disable the mgr service:

    [tripleo-admin@dcn2-computehci2-1 ~]$ sudo systemctl --type=service | grep ceph
    ceph-crash@dcn2-computehci2-1.service    loaded active     running       Ceph crash dump collector
    ceph-mgr@dcn2-computehci2-1.service      loaded active     running       Ceph Manager
    
    [tripleo-admin@dcn2-computehci2-1 ~]$ sudo systemctl stop ceph-mgr@dcn2-computehci2-1
    
    [tripleo-admin@dcn2-computehci2-1 ~]$ sudo systemctl --type=service | grep ceph
    ceph-crash@dcn2-computehci2-1.service  loaded active running Ceph crash dump collector
    
    [tripleo-admin@dcn2-computehci2-1 ~]$ sudo systemctl disable ceph-mgr@dcn2-computehci2-1
    Removed /etc/systemd/system/multi-user.target.wants/ceph-mgr@dcn2-computehci2-1.service.
  7. Start the cephadm shell:

    $ sudo cephadm shell --config /etc/ceph/dcn2.conf \
    --keyring /etc/ceph/dcn2.client.admin.keyring
  8. Verify that the mgr service for the node is removed from the cluster:

    [ceph: root@dcn2-computehci2-1 ~]# ceph -s
    
    cluster:
        id:     b9b53581-d590-41ac-8463-2f50aa985001
        health: HEALTH_WARN
                3 pools have too many placement groups
                mons are allowing insecure global_id reclaim
    
      services:
        mon: 2 daemons, quorum dcn2-computehci2-2,dcn2-computehci2-0 (age 2h)
        mgr: dcn2-computehci2-2(active, since 20h), standbys: dcn2-computehci2-0 
    
        osd: 15 osds: 15 up (since 3h), 15 in (since 3h)
    
      data:
        pools:   3 pools, 384 pgs
        objects: 32 objects, 88 MiB
        usage:   16 GiB used, 734 GiB / 750 GiB avail
        pgs:     384 active+clean
    Note: When the mgr service is successfully removed, the node that it was removed from is no longer listed in the mgr line of the output.
  9. Export the Red Hat Ceph Storage specification:

    [ceph: root@dcn2-computehci2-1 ~]# ceph orch ls --export > spec.yml
  10. Edit the specification in the spec.yml file (see the sketch after this procedure):

    • Remove all instances of the <dcn-computehci-node> host from spec.yml.
    • Remove all instances of the <dcn-computehci-node> entry from the following:

      • service_type: osd
      • service_type: mon
      • service_type: host
  11. Reapply the Red Hat Ceph Storage specification:

    [ceph: root@dcn2-computehci2-1 /]# ceph orch apply -i spec.yml
  12. Remove the OSDs that you identified using ceph osd tree:

    [ceph: root@dcn2-computehci2-1 /]# ceph orch osd rm --zap 1 7 11 15 18
    Scheduled OSD(s) for removal
  13. Verify the status of the OSDs that are being removed. While the OSDs drain, the command returns output similar to the following. Do not continue until the command returns no output:

    [ceph: root@dcn2-computehci2-1 /]# ceph orch osd rm status
    OSD_ID  HOST                    STATE     PG_COUNT  REPLACE  FORCE  DRAIN_STARTED_AT
    1       dcn2-computehci2-1      draining  27        False    False  2021-04-23 21:35:51.215361
    7       dcn2-computehci2-1      draining  8         False    False  2021-04-23 21:35:49.111500
    11      dcn2-computehci2-1      draining  14        False    False  2021-04-23 21:35:50.243762
  14. Verify that no daemons remain on the host you are removing:

    [ceph: root@dcn2-computehci2-1 /]# ceph orch ps dcn2-computehci2-1

    If daemons are still present, you can remove them with the following command:

    [ceph: root@dcn2-computehci2-1 /]# ceph orch host drain dcn2-computehci2-1
  15. Remove the <dcn-computehci-node> host from the Red Hat Ceph Storage cluster:

    [ceph: root@dcn2-computehci2-1 /]# ceph orch host rm dcn2-computehci2-1
    Removed host 'dcn2-computehci2-1'
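
The edit in step 10 removes the host from the placement lists of the exported specification. The following fragment is a minimal sketch of what the relevant spec.yml entries might look like; the service_id, host names, and overall layout of your exported file will differ:

    ---
    service_type: mon
    service_name: mon
    placement:
      hosts:
      - dcn2-computehci2-0
      - dcn2-computehci2-1   # remove this entry for the host you are removing
      - dcn2-computehci2-2
    ---
    service_type: osd
    service_id: default_drive_group
    placement:
      hosts:
      - dcn2-computehci2-0
      - dcn2-computehci2-1   # remove this entry for the host you are removing
      - dcn2-computehci2-2

After you reapply the specification and remove the host, you can confirm that the host no longer appears in the output of ceph orch host ls.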

8.2. Removing the Image service (glance) services

Remove the Image service (glance) services from a DistributedComputeHCI node when you remove the node from service.

Procedure

  • Stop and disable the Image service (glance) systemd services on the node that you are removing:

    [root@dcn2-computehci2-1 ~]# systemctl stop tripleo_glance_api.service
    [root@dcn2-computehci2-1 ~]# systemctl stop  tripleo_glance_api_tls_proxy.service
    
    [root@dcn2-computehci2-1 ~]# systemctl disable tripleo_glance_api.service
    Removed /etc/systemd/system/multi-user.target.wants/tripleo_glance_api.service.
    [root@dcn2-computehci2-1 ~]# systemctl disable  tripleo_glance_api_tls_proxy.service
    Removed /etc/systemd/system/multi-user.target.wants/tripleo_glance_api_tls_proxy.service.
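
To confirm that the Image service is stopped on the node, you can list the remaining glance units, as the verification section later in this chapter does. The first command is expected to return no running glance services, and the second is expected to report disabled for both units:

    [root@dcn2-computehci2-1 ~]# systemctl --type=service | grep glance
    [root@dcn2-computehci2-1 ~]# systemctl is-enabled tripleo_glance_api.service tripleo_glance_api_tls_proxy.service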

8.3. Removing the Block Storage (cinder) services

You must remove the cinder-volume and etcd services from the DistributedComputeHCI node when you remove it from service.

Procedure

  1. Identify and disable the cinder-volume service on the node you are removing:

    (central) [stack@site-undercloud-0 ~]$ openstack volume service list --service cinder-volume
    | cinder-volume | dcn2-computehci2-1@tripleo_ceph | az-dcn2    | enabled | up    | 2022-03-23T17:41:43.000000 |
    (central) [stack@site-undercloud-0 ~]$ openstack volume service set --disable dcn2-computehci2-1@tripleo_ceph cinder-volume
  2. Log on to a different DistributedComputeHCI node in the stack:

    $ ssh tripleo-admin@dcn2-computehci2-0
  3. Remove the cinder-volume service associated with the node that you are removing:

    [root@dcn2-computehci2-0 ~]# podman exec -it cinder_volume cinder-manage service remove cinder-volume dcn2-computehci2-1@tripleo_ceph
    Service cinder-volume on host dcn2-computehci2-1@tripleo_ceph removed.
  4. Stop and disable the tripleo_cinder_volume service on the node that you are removing:

    [root@dcn2-computehci2-1 ~]# systemctl stop tripleo_cinder_volume.service
    [root@dcn2-computehci2-1 ~]# systemctl disable tripleo_cinder_volume.service
    Removed /etc/systemd/system/multi-user.target.wants/tripleo_cinder_volume.service
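
To confirm the removal, you can list the cinder-volume services again from the central location. The entry for the node that you removed is expected to no longer appear in the output:

    (central) [stack@site-undercloud-0 ~]$ openstack volume service list --service cinder-volume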

8.4. Delete the DistributedComputeHCI node

Set the provisioned parameter for the node to false and remove the node from the stack. Then disable the nova-compute service and delete the relevant network agent (see the sketch at the end of this section).

Procedure

  1. Copy the overcloud-baremetal-deploy.yaml file:

    cp /home/stack/dcn2/overcloud-baremetal-deploy.yaml \
    /home/stack/dcn2/baremetal-deployment-scaledown.yaml
  2. Edit the baremetal-deployment-scaledown.yaml file. Identify the host that you want to remove and set its provisioned parameter to false:

    instances:
    ...
      - hostname: dcn2-computehci2-1
        provisioned: false
  3. Remove the node from the stack:

    openstack overcloud node delete --stack dcn2 --baremetal-deployment /home/stack/dcn2/baremetal-deployment-scaledown.yaml
  4. Optional: If you plan to reuse the node, use the Bare Metal service (ironic) to clean the disks. Cleaning is required if the node will host Ceph OSDs:

    openstack baremetal node manage $UUID
    openstack baremetal node clean $UUID --clean-steps '[{"interface":"deploy", "step": "erase_devices_metadata"}]'
    openstack baremetal node provide $UUID
  5. Redeploy the central site. Include all templates that you used for the initial configuration:

    openstack overcloud deploy \
    --deployed-server \
    --stack central \
    --templates /usr/share/openstack-tripleo-heat-templates/ \
    -r ~/control-plane/central_roles.yaml \
    -n ~/network-data.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/network-environment.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/podman.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/dcn-storage.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/cephadm/cephadm.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/nova-az-config.yaml \
    -e /home/stack/central/overcloud-networks-deployed.yaml \
    -e /home/stack/central/overcloud-vip-deployed.yaml \
    -e /home/stack/central/deployed_metal.yaml \
    -e /home/stack/central/deployed_ceph.yaml \
    -e /home/stack/central/dcn_ceph.yaml \
    -e /home/stack/central/glance_update.yaml
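
The introduction to this section mentions disabling the nova-compute service and deleting the relevant network agent for the removed node. The following commands are a minimal sketch of that cleanup, assuming the removed host is dcn2-computehci2-1.redhat.local; replace the host name and the <network-agent-id> placeholder with your own values:

    (central) [stack@site-undercloud-0 ~]$ openstack compute service set --disable dcn2-computehci2-1.redhat.local nova-compute
    (central) [stack@site-undercloud-0 ~]$ openstack network agent list --host dcn2-computehci2-1.redhat.local
    (central) [stack@site-undercloud-0 ~]$ openstack network agent delete <network-agent-id>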

8.5. Replacing a removed DistributedComputeHCI node

8.5.1. Replacing a removed DistributedComputeHCI node

To add new HCI nodes to your DCN deployment, you must redeploy the edge stack with the additional node, perform a Ceph export of that stack, and then perform a stack update for the central location. The stack update of the central location adds configuration that is specific to the edge sites.

Prerequisites

Ensure that the node counts are correct in the nodes_data.yaml file of the stack in which you are replacing a node or adding a new node.
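
For example, if the node counts are set through Heat parameters, the entry for the role might look like the following. This is a minimal sketch; the parameter name is illustrative and depends on how your roles are named in your deployment:

    parameter_defaults:
      DistributedComputeHCICount: 3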

Procedure

  1. Set the EtcdInitialClusterState parameter to existing in one of the templates that are called by your deploy script (see the sketch after this procedure):

    parameter_defaults:
      EtcdInitialClusterState: existing
  2. Redeploy using the deployment script specific to the stack:

    (undercloud) [stack@site-undercloud-0 ~]$ ./overcloud_deploy_dcn2.sh
    …
    Overcloud Deployed without error
  3. Export the Red Hat Ceph Storage data from the stack:

    (undercloud) [stack@site-undercloud-0 ~]$ sudo -E openstack overcloud export ceph --stack dcn1,dcn2 --config-download-dir /var/lib/mistral --output-file ~/central/dcn2_scale_up_ceph_external.yaml
  4. Replace dcn_ceph_external.yaml with the newly generated dcn2_scale_up_ceph_external.yaml in the deploy script for the central location.
  5. Perform a stack update at central:

    (undercloud) [stack@site-undercloud-0 ~]$ ./overcloud_deploy.sh
    ...
    Overcloud Deployed without error
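
If you keep the EtcdInitialClusterState setting from step 1 in its own environment file, that file must be passed to the openstack overcloud deploy command inside your deployment script. The following is a minimal sketch with an illustrative file name:

    $ cat /home/stack/dcn2/etcd-initial-cluster-state.yaml
    parameter_defaults:
      EtcdInitialClusterState: existing

    # Inside overcloud_deploy_dcn2.sh, add the file to the deploy command:
    openstack overcloud deploy \
      ...
      -e /home/stack/dcn2/etcd-initial-cluster-state.yaml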

8.6. Verify the functionality of a replaced DistributedComputeHCI node

Procedure

  1. Ensure that the value of the Status field is enabled and that the value of the State field is up:

    (central) [stack@site-undercloud-0 ~]$ openstack compute service list -c Binary -c Host -c Zone -c Status -c State
    +----------------+-----------------------------------------+------------+---------+-------+
    | Binary         | Host                                    | Zone       | Status  | State |
    +----------------+-----------------------------------------+------------+---------+-------+
    ...
    | nova-compute   | dcn1-compute1-0.redhat.local            | az-dcn1    | enabled | up    |
    | nova-compute   | dcn1-compute1-1.redhat.local            | az-dcn1    | enabled | up    |
    | nova-compute   | dcn2-computehciscaleout2-0.redhat.local | az-dcn2    | enabled | up    |
    | nova-compute   | dcn2-computehci2-0.redhat.local         | az-dcn2    | enabled | up    |
    | nova-compute   | dcn2-computescaleout2-0.redhat.local    | az-dcn2    | enabled | up    |
    | nova-compute   | dcn2-computehci2-2.redhat.local         | az-dcn2    | enabled | up    |
    ...
  2. Ensure that all network agents are in the up state:

    (central) [stack@site-undercloud-0 ~]$ openstack network agent list -c "Agent Type" -c Host -c Alive -c State
    +--------------------+-----------------------------------------+-------+-------+
    | Agent Type         | Host                                    | Alive | State |
    +--------------------+-----------------------------------------+-------+-------+
    | DHCP agent         | dcn3-compute3-1.redhat.local            | :-)   | UP    |
    | Open vSwitch agent | central-computehci0-1.redhat.local      | :-)   | UP    |
    | DHCP agent         | dcn3-compute3-0.redhat.local            | :-)   | UP    |
    | DHCP agent         | central-controller0-2.redhat.local      | :-)   | UP    |
    | Open vSwitch agent | dcn3-compute3-1.redhat.local            | :-)   | UP    |
    | Open vSwitch agent | dcn1-compute1-1.redhat.local            | :-)   | UP    |
    | Open vSwitch agent | central-computehci0-0.redhat.local      | :-)   | UP    |
    | DHCP agent         | central-controller0-1.redhat.local      | :-)   | UP    |
    | L3 agent           | central-controller0-2.redhat.local      | :-)   | UP    |
    | Metadata agent     | central-controller0-1.redhat.local      | :-)   | UP    |
    | Open vSwitch agent | dcn2-computescaleout2-0.redhat.local    | :-)   | UP    |
    | Open vSwitch agent | dcn2-computehci2-5.redhat.local         | :-)   | UP    |
    | Open vSwitch agent | central-computehci0-2.redhat.local      | :-)   | UP    |
    | DHCP agent         | central-controller0-0.redhat.local      | :-)   | UP    |
    | Open vSwitch agent | central-controller0-1.redhat.local      | :-)   | UP    |
    | Open vSwitch agent | dcn2-computehci2-0.redhat.local         | :-)   | UP    |
    | Open vSwitch agent | dcn1-compute1-0.redhat.local            | :-)   | UP    |
    ...
  3. Verify the status of the Ceph Cluster:

    1. Use SSH to connect to the new DistributedComputeHCI node and check the status of the Ceph cluster:

      [root@dcn2-computehci2-5 ~]# podman exec -it ceph-mon-dcn2-computehci2-5 \
      ceph -s -c /etc/ceph/dcn2.conf
    2. Verify that both the ceph mon and ceph mgr services exist for the new node:

      services:
          mon: 3 daemons, quorum dcn2-computehci2-2,dcn2-computehci2-0,dcn2-computehci2-5 (age 3d)
          mgr: dcn2-computehci2-2(active, since 3d), standbys: dcn2-computehci2-0, dcn2-computehci2-5
          osd: 20 osds: 20 up (since 3d), 20 in (since 3d)
    3. Verify the status of the Ceph OSDs with ceph osd tree. Ensure that all OSDs for the new node have a STATUS of up:

      [root@dcn2-computehci2-5 ~]# podman exec -it ceph-mon-dcn2-computehci2-5 ceph osd tree -c /etc/ceph/dcn2.conf
      ID CLASS WEIGHT  TYPE NAME                           STATUS REWEIGHT PRI-AFF
      -1       0.97595 root default
      -5       0.24399     host dcn2-computehci2-0
       0   hdd 0.04880         osd.0                           up  1.00000 1.00000
       4   hdd 0.04880         osd.4                           up  1.00000 1.00000
       8   hdd 0.04880         osd.8                           up  1.00000 1.00000
      13   hdd 0.04880         osd.13                          up  1.00000 1.00000
      17   hdd 0.04880         osd.17                          up  1.00000 1.00000
      -9       0.24399     host dcn2-computehci2-2
       3   hdd 0.04880         osd.3                           up  1.00000 1.00000
       5   hdd 0.04880         osd.5                           up  1.00000 1.00000
      10   hdd 0.04880         osd.10                          up  1.00000 1.00000
      14   hdd 0.04880         osd.14                          up  1.00000 1.00000
      19   hdd 0.04880         osd.19                          up  1.00000 1.00000
      -3       0.24399     host dcn2-computehci2-5
       1   hdd 0.04880         osd.1                           up  1.00000 1.00000
       7   hdd 0.04880         osd.7                           up  1.00000 1.00000
      11   hdd 0.04880         osd.11                          up  1.00000 1.00000
      15   hdd 0.04880         osd.15                          up  1.00000 1.00000
      18   hdd 0.04880         osd.18                          up  1.00000 1.00000
      -7       0.24399     host dcn2-computehciscaleout2-0
       2   hdd 0.04880         osd.2                           up  1.00000 1.00000
       6   hdd 0.04880         osd.6                           up  1.00000 1.00000
       9   hdd 0.04880         osd.9                           up  1.00000 1.00000
      12   hdd 0.04880         osd.12                          up  1.00000 1.00000
      16   hdd 0.04880         osd.16                          up  1.00000 1.00000
  4. Verify that the cinder-volume service for the new DistributedComputeHCI node has a Status of enabled and a State of up:

    (central) [stack@site-undercloud-0 ~]$ openstack volume service list --service cinder-volume -c Binary -c Host -c Zone -c Status -c State
    +---------------+---------------------------------+------------+---------+-------+
    | Binary        | Host                            | Zone       | Status  | State |
    +---------------+---------------------------------+------------+---------+-------+
    | cinder-volume | hostgroup@tripleo_ceph          | az-central | enabled | up    |
    | cinder-volume | dcn1-compute1-1@tripleo_ceph    | az-dcn1    | enabled | up    |
    | cinder-volume | dcn1-compute1-0@tripleo_ceph    | az-dcn1    | enabled | up    |
    | cinder-volume | dcn2-computehci2-0@tripleo_ceph | az-dcn2    | enabled | up    |
    | cinder-volume | dcn2-computehci2-2@tripleo_ceph | az-dcn2    | enabled | up    |
    | cinder-volume | dcn2-computehci2-5@tripleo_ceph | az-dcn2    | enabled | up    |
    +---------------+---------------------------------+------------+---------+-------+
    Note

    If the State of the cinder-volume service is down, then the service has not been started on the node.

  5. Use SSH to connect to the new DistributedComputeHCI node and check the status of the Image service (glance) services with systemctl:

    [root@dcn2-computehci2-5 ~]# systemctl --type service | grep glance
      tripleo_glance_api.service                        loaded active     running       glance_api container
      tripleo_glance_api_healthcheck.service            loaded activating start   start glance_api healthcheck
      tripleo_glance_api_tls_proxy.service              loaded active     running       glance_api_tls_proxy container

8.7. Troubleshooting DistributedComputeHCI state down

If the replacement node was deployed without the EtcdInitialClusterState parameter set to existing, the cinder-volume service of the replacement node shows down when you run openstack volume service list.

Procedure

  1. Log on to the replacement node and check the logs for the etcd service. Confirm that the /var/log/containers/stdouts/etcd.log log file shows the etcd service reporting a cluster ID mismatch:

    2022-04-06T18:00:11.834104130+00:00 stderr F 2022-04-06 18:00:11.834045 E | rafthttp: request cluster ID mismatch (got 654f4cf0e2cfb9fd want 918b459b36fe2c0c)
  2. Set the EtcdInitialClusterState parameter to the value of existing in your deployment templates and rerun the deployment script.
  3. Use SSH to connect to the replacement node and run the following commands as root:

    [root@dcn2-computehci2-4 ~]# systemctl stop tripleo_etcd
    [root@dcn2-computehci2-4 ~]# rm -rf /var/lib/etcd/*
    [root@dcn2-computehci2-4 ~]# systemctl start tripleo_etcd
  4. Recheck the /var/log/containers/stdouts/etcd.log log file to verify that the node successfully joined the cluster:

    2022-04-06T18:24:22.130059875+00:00 stderr F 2022-04-06 18:24:22.129395 I | etcdserver/membership: added member 96f61470cd1839e5 [https://dcn2-computehci2-4.internalapi.redhat.local:2380] to cluster 654f4cf0e2cfb9fd
  5. Check the state of the cinder-volume service, and confirm it reads up on the replacement node when you run openstack volume service list.
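
For example, you can limit the output of the command in step 5 to the relevant columns and confirm that the entry for the replacement node (dcn2-computehci2-4@tripleo_ceph in this chapter's examples) shows a State of up:

    (central) [stack@site-undercloud-0 ~]$ openstack volume service list --service cinder-volume -c Host -c Status -c State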