Chapter 9. Replacing DistributedComputeHCI nodes

During hardware maintenance you may need to scale down, scale up, or replace a DistributedComputeHCI node at an edge site. To replace a DistributedComputeHCI node, remove services from the node you are replacing, scale the number of nodes down, and then follow the procedures for scaling those nodes back up.

9.1. Removing Red Hat Ceph Storage services

Before you remove an HCI (hyperconverged) node from a cluster, you must remove its Red Hat Ceph Storage services. To do so, disable and remove the ceph-osd service from the cluster services on the node you are removing, then stop and disable the mon, mgr, and osd services.

Procedure

On the undercloud, use SSH to connect to the DistributedComputeHCI node that you want to remove:
```
$ ssh tripleo-admin@<dcn-computehci-node>
```
Start a cephadm shell. Use the configuration file and keyring file for the site that the host being removed is in:
```
$ sudo cephadm shell --config /etc/ceph/dcn2.conf \
--keyring /etc/ceph/dcn2.client.admin.keyring
```

Record the OSDs (object storage devices) associated with the DistributedComputeHCI node you are removing, for reference in a later step:

[ceph: root@dcn2-computehci2-1 ~]# ceph osd tree -c /etc/ceph/dcn2.conf
…
-3       0.24399     host dcn2-computehci2-1
 1   hdd 0.04880         osd.1                           up  1.00000 1.00000
 7   hdd 0.04880         osd.7                           up  1.00000 1.00000
11   hdd 0.04880         osd.11                          up  1.00000 1.00000
15   hdd 0.04880         osd.15                          up  1.00000 1.00000
18   hdd 0.04880         osd.18                          up  1.00000 1.00000
…
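
The OSD IDs under one host can also be pulled out of the tree output with a small filter. The following is a minimal sketch, not part of the documented procedure: the function name `osd_ids_for_host` and the `sample_tree` text are illustrative assumptions, and the parsing assumes the column layout shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a Ceph command): read `ceph osd tree` text on stdin
# and print the OSD IDs listed under the given host bucket.
osd_ids_for_host() {
  local host="$1"
  awk -v host="$host" '
    $3 == "host" { in_host = ($4 == host); next }  # a host bucket starts
    $3 == "root" { in_host = 0; next }             # a root bucket ends any host
    in_host && $4 ~ /^osd\.[0-9]+$/ { print $1 }   # OSD row: first column is the ID
  '
}

# Illustrative sample in the same layout as the output above.
sample_tree='-5       0.24399     host dcn2-computehci2-0
 0   hdd 0.04880         osd.0                           up  1.00000 1.00000
 4   hdd 0.04880         osd.4                           up  1.00000 1.00000
-3       0.24399     host dcn2-computehci2-1
 1   hdd 0.04880         osd.1                           up  1.00000 1.00000
 7   hdd 0.04880         osd.7                           up  1.00000 1.00000'

# Join the IDs onto one line so they are easy to copy into a later command.
printf '%s\n' "$sample_tree" | osd_ids_for_host dcn2-computehci2-1 | paste -sd' ' -
```

With the sample text, only the IDs under `dcn2-computehci2-1` are printed. The same filter could be fed from saved `ceph osd tree` output instead of the sample variable.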

Use SSH to connect to another node in the same cluster and remove the monitor from the cluster:

$ sudo cephadm shell --config /etc/ceph/dcn2.conf \
--keyring /etc/ceph/dcn2.client.admin.keyring

[ceph: root@dcn2-computehci2-0]# ceph mon remove dcn2-computehci2-1 -c /etc/ceph/dcn2.conf
removing mon.dcn2-computehci2-1 at [v2:172.23.3.153:3300/0,v1:172.23.3.153:6789/0], there will be 2 monitors

Use SSH to log in again to the node that you are removing from the cluster.

Stop and disable the mgr service:

[tripleo-admin@dcn2-computehci2-1 ~]$ sudo systemctl --type=service | grep ceph
ceph-crash@dcn2-computehci2-1.service    loaded active     running       Ceph crash dump collector
ceph-mgr@dcn2-computehci2-1.service      loaded active     running       Ceph Manager

[tripleo-admin@dcn2-computehci2-1 ~]$ sudo systemctl stop ceph-mgr@dcn2-computehci2-1

[tripleo-admin@dcn2-computehci2-1 ~]$ sudo systemctl --type=service | grep ceph
ceph-crash@dcn2-computehci2-1.service  loaded active running Ceph crash dump collector

[tripleo-admin@dcn2-computehci2-1 ~]$ sudo systemctl disable ceph-mgr@dcn2-computehci2-1
Removed /etc/systemd/system/multi-user.target.wants/ceph-mgr@dcn2-computehci2-1.service.

Start the cephadm shell:

$ sudo cephadm shell --config /etc/ceph/dcn2.conf \
--keyring /etc/ceph/dcn2.client.admin.keyring

Verify that the mgr service for the node is removed from the cluster:

[ceph: root@dcn2-computehci2-1 ~]# ceph -s

cluster:
    id:     b9b53581-d590-41ac-8463-2f50aa985001
    health: HEALTH_WARN
            3 pools have too many placement groups
            mons are allowing insecure global_id reclaim

  services:
    mon: 2 daemons, quorum dcn2-computehci2-2,dcn2-computehci2-0 (age 2h)
    mgr: dcn2-computehci2-2(active, since 20h), standbys: dcn2-computehci2-0
    osd: 15 osds: 15 up (since 3h), 15 in (since 3h)

  data:
    pools:   3 pools, 384 pgs
    objects: 32 objects, 88 MiB
    usage:   16 GiB used, 734 GiB / 750 GiB avail
    pgs:     384 active+clean

Note

When the mgr service is removed successfully, the node no longer appears in the mgr line of the output.
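
The check on the `mgr:` line can be scripted. The following is a minimal sketch under stated assumptions: `service_lists_host` is a hypothetical helper name, and `sample_status` is a trimmed copy of the services section shown above rather than live output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `ceph -s` text on stdin and succeed (exit 0) if the
# line for the given daemon type (mon or mgr) still names the given host.
service_lists_host() {
  local service="$1" host="$2"
  # Keep only the line whose first field is "<service>:", then look for the
  # host as a whole word so that host-1 does not match host-10.
  awk -v svc="$service:" '$1 == svc' | grep -qwF -- "$host"
}

# Trimmed copy of the services section from the output above.
sample_status='  services:
    mon: 2 daemons, quorum dcn2-computehci2-2,dcn2-computehci2-0 (age 2h)
    mgr: dcn2-computehci2-2(active, since 20h), standbys: dcn2-computehci2-0
    osd: 15 osds: 15 up (since 3h), 15 in (since 3h)'

if printf '%s\n' "$sample_status" | service_lists_host mgr dcn2-computehci2-1; then
  echo "mgr still listed for dcn2-computehci2-1"
else
  echo "mgr no longer listed for dcn2-computehci2-1"
fi
```

Passing `mon` instead of `mgr` applies the same check to the monitor quorum line.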

Export the Red Hat Ceph Storage specification:

[ceph: root@dcn2-computehci2-1 ~]# ceph orch ls --export > spec.yml

Edit the specifications in the spec.yml file:
- Remove all instances of the host <dcn-computehci-node> from spec.yml
- Remove all instances of the <dcn-computehci-node> entry from the following:
  - service_type: osd
  - service_type: mon
  - service_type: host
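
Before you reapply the specification, it can help to confirm that no reference to the host is left in the file. The following is a minimal sketch: `spec_lines_for_host` is a hypothetical helper, and the spec fragment written to the temporary file is illustrative rather than a full export.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every line of a spec file that still names the
# host, with line numbers. It exits non-zero when nothing matches, which is
# the desired result after editing.
spec_lines_for_host() {
  local spec="$1" host="$2"
  grep -nwF -- "$host" "$spec"
}

# Illustrative fragment standing in for an edited spec.yml.
spec="$(mktemp)"
cat > "$spec" <<'EOF'
service_type: mon
service_name: mon
placement:
  hosts:
  - dcn2-computehci2-0
  - dcn2-computehci2-2
EOF

if spec_lines_for_host "$spec" dcn2-computehci2-1; then
  echo "host is still referenced in the spec"
else
  echo "host is no longer referenced in the spec"
fi
```

The `-w` option matches whole words only, so a search for `dcn2-computehci2-1` does not report a host named `dcn2-computehci2-10`.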

Reapply the Red Hat Ceph Storage specification:

[ceph: root@dcn2-computehci2-1 /]# ceph orch apply -i spec.yml

Remove the OSDs that you identified using ceph osd tree:

[ceph: root@dcn2-computehci2-1 /]# ceph orch osd rm --zap 1 7 11 15 18
Scheduled OSD(s) for removal

Verify the status of the OSDs being removed. Do not continue until the following command returns no output:

[ceph: root@dcn2-computehci2-1 /]# ceph orch osd rm status
OSD_ID  HOST                    STATE     PG_COUNT  REPLACE  FORCE  DRAIN_STARTED_AT
1       dcn2-computehci2-1      draining  27        False    False  2021-04-23 21:35:51.215361
7       dcn2-computehci2-1      draining  8         False    False  2021-04-23 21:35:49.111500
11      dcn2-computehci2-1      draining  14        False    False  2021-04-23 21:35:50.243762
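
Waiting for the removal queue to empty can be done with a polling loop. The following is a minimal sketch: `wait_until_empty` is a hypothetical helper, and it assumes, as the step above states, that the status command prints nothing once draining has finished. If your release prints a status message instead, adjust the condition.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rerun a command until it prints nothing to stdout,
# sleeping the given number of seconds between attempts.
wait_until_empty() {
  local interval="$1"
  shift
  while [ -n "$("$@")" ]; do
    sleep "$interval"
  done
}

# Assumed usage inside the cephadm shell, polling every 30 seconds:
#   wait_until_empty 30 ceph orch osd rm status
```

The loop has no timeout, so it keeps polling for as long as the command produces output.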

Verify that no daemons remain on the host you are removing:

[ceph: root@dcn2-computehci2-1 /]# ceph orch ps dcn2-computehci2-1

If daemons are still present, you can remove them with the following command:

[ceph: root@dcn2-computehci2-1 /]# ceph orch host drain dcn2-computehci2-1

Remove the <dcn-computehci-node> host from the Red Hat Ceph Storage cluster:

[ceph: root@dcn2-computehci2-1 /]# ceph orch host rm dcn2-computehci2-1
Removed host 'dcn2-computehci2-1'

9.2. Removing the Image service (glance) services

Remove image services from a node when you remove it from service.

Procedure

Stop and disable the Image service services with systemctl on the node you are removing:

[root@dcn2-computehci2-1 ~]# systemctl stop tripleo_glance_api.service
[root@dcn2-computehci2-1 ~]# systemctl stop  tripleo_glance_api_tls_proxy.service

[root@dcn2-computehci2-1 ~]# systemctl disable tripleo_glance_api.service
Removed /etc/systemd/system/multi-user.target.wants/tripleo_glance_api.service.
[root@dcn2-computehci2-1 ~]# systemctl disable  tripleo_glance_api_tls_proxy.service
Removed /etc/systemd/system/multi-user.target.wants/tripleo_glance_api_tls_proxy.service.

9.3. Removing the Block Storage (cinder) services

You must remove the cinder-volume and etcd services from the DistributedComputeHCI node when you remove it from service.

Procedure

Identify and disable the cinder-volume service on the node you are removing:

(central) [stack@site-undercloud-0 ~]$ openstack volume service list --service cinder-volume
| cinder-volume | dcn2-computehci2-1@tripleo_ceph | az-dcn2    | enabled | up    | 2022-03-23T17:41:43.000000 |
(central) [stack@site-undercloud-0 ~]$ openstack volume service set --disable dcn2-computehci2-1@tripleo_ceph cinder-volume

Log on to a different DistributedComputeHCI node in the stack:
```
$ ssh tripleo-admin@dcn2-computehci2-0
```

Remove the cinder-volume service associated with the node that you are removing:

[root@dcn2-computehci2-0 ~]# podman exec -it cinder_volume cinder-manage service remove cinder-volume dcn2-computehci2-1@tripleo_ceph
Service cinder-volume on host dcn2-computehci2-1@tripleo_ceph removed.

Stop and disable the tripleo_cinder_volume service on the node that you are removing:

[root@dcn2-computehci2-1 ~]# systemctl stop tripleo_cinder_volume.service
[root@dcn2-computehci2-1 ~]# systemctl disable tripleo_cinder_volume.service
Removed /etc/systemd/system/multi-user.target.wants/tripleo_cinder_volume.service

9.4. Delete the DistributedComputeHCI node

Set the provisioned parameter to a value of false and remove the node from the stack. Disable the nova-compute service and delete the relevant network agent.

Procedure

Copy the overcloud-baremetal-deploy.yaml file to a new file for the scale-down:

cp /home/stack/dcn2/overcloud-baremetal-deploy.yaml \
/home/stack/dcn2/baremetal-deployment-scaledown.yaml

Edit the baremetal-deployment-scaledown.yaml file. Identify the host that you want to remove and set its provisioned parameter to false:
```
instances:
...
  - hostname: dcn2-computehci2-1
    provisioned: false
```

Remove the node from the stack:

openstack overcloud node delete --stack dcn2 --baremetal-deployment /home/stack/dcn2/baremetal-deployment-scaledown.yaml

Optional: If you are going to reuse the node, use ironic to clean the disk. This is required if the node will host Ceph OSDs:

openstack baremetal node manage $UUID
openstack baremetal node clean $UUID --clean-steps '[{"interface":"deploy", "step": "erase_devices_metadata"}]'
openstack baremetal node provide $UUID

Redeploy the central site. Include all templates that you used for the initial configuration:

openstack overcloud deploy \
--stack central \
--templates /usr/share/openstack-tripleo-heat-templates/ \
-r ~/control-plane/central_roles.yaml \
-n ~/network-data.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-environment.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/dcn-storage.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/cephadm/cephadm.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/nova-az-config.yaml \
-e /home/stack/central/overcloud-networks-deployed.yaml \
-e /home/stack/central/overcloud-vip-deployed.yaml \
-e /home/stack/central/deployed_metal.yaml \
-e /home/stack/central/deployed_ceph.yaml \
-e /home/stack/central/dcn_ceph.yaml \
-e /home/stack/central/glance_update.yaml

9.5. Replacing a removed DistributedComputeHCI node

To add new HCI nodes to your DCN deployment, you must redeploy the edge stack with the additional node, perform a Ceph export of that stack, and then perform a stack update for the central location. A stack update of the central location adds configuration that is specific to edge sites.

Prerequisites

The node counts are correct in the nodes_data.yaml file of the stack in which you want to replace a node or add a new node.

Procedure

Set the EtcdInitialClusterState parameter to existing in one of the templates called by your deploy script:
```
parameter_defaults:
  EtcdInitialClusterState: existing
```

Redeploy using the deployment script specific to the stack:

(undercloud) [stack@site-undercloud-0 ~]$ ./overcloud_deploy_dcn2.sh
…
Overcloud Deployed without error

Export the Red Hat Ceph Storage data from the stack:

(undercloud) [stack@site-undercloud-0 ~]$ sudo -E openstack overcloud export ceph --stack dcn1,dcn2 --config-download-dir /var/lib/mistral --output-file ~/central/dcn2_scale_up_ceph_external.yaml

Replace dcn_ceph_external.yaml with the newly generated dcn2_scale_up_ceph_external.yaml in the deploy script for the central location.

Perform a stack update at central:

(undercloud) [stack@site-undercloud-0 ~]$ ./overcloud_deploy.sh
...
Overcloud Deployed without error

9.6. Verify the functionality of a replaced DistributedComputeHCI node

Verify that the replaced HCI compute node is functional.

Procedure

Ensure that the value of the Status field is enabled and that the value of the State field is up:

(central) [stack@site-undercloud-0 ~]$ openstack compute service list -c Binary -c Host -c Zone -c Status -c State
+----------------+-----------------------------------------+------------+---------+-------+
| Binary         | Host                                    | Zone       | Status  | State |
+----------------+-----------------------------------------+------------+---------+-------+
...
| nova-compute   | dcn1-compute1-0.redhat.local            | az-dcn1    | enabled | up    |
| nova-compute   | dcn1-compute1-1.redhat.local            | az-dcn1    | enabled | up    |
| nova-compute   | dcn2-computehciscaleout2-0.redhat.local | az-dcn2    | enabled | up    |
| nova-compute   | dcn2-computehci2-0.redhat.local         | az-dcn2    | enabled | up    |
| nova-compute   | dcn2-computescaleout2-0.redhat.local    | az-dcn2    | enabled | up    |
| nova-compute   | dcn2-computehci2-2.redhat.local         | az-dcn2    | enabled | up    |
...
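
On sites with many hosts, scanning the table by eye is error-prone. The following is a minimal sketch of a filter that reports only the hosts that fail the check: `compute_hosts_not_ready` is a hypothetical helper, and the sample rows (including the made-up host `dcn2-example-0.redhat.local`) are illustrative. It assumes input with one `Host Status State` row per line, such as the output of the `-f value` formatter.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "Host Status State" rows on stdin and print every
# host whose service is not both enabled and up.
compute_hosts_not_ready() {
  awk '$2 != "enabled" || $3 != "up" { print $1 }'
}

# Illustrative rows; the second host is made up to show a failing case.
sample_rows='dcn2-computehci2-0.redhat.local enabled up
dcn2-example-0.redhat.local enabled down
dcn2-computehci2-2.redhat.local enabled up'

printf '%s\n' "$sample_rows" | compute_hosts_not_ready

# Assumed usage against a live deployment:
#   openstack compute service list --service nova-compute \
#     -f value -c Host -c Status -c State | compute_hosts_not_ready
```

An empty result means that every listed service is enabled and up.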

Ensure that all network agents are in the up state:

(central) [stack@site-undercloud-0 ~]$ openstack network agent list -c "Agent Type" -c Host -c Alive -c State
+--------------------+-----------------------------------------+-------+-------+
| Agent Type         | Host                                    | Alive | State |
+--------------------+-----------------------------------------+-------+-------+
| DHCP agent         | dcn3-compute3-1.redhat.local            | :-)   | UP    |
| Open vSwitch agent | central-computehci0-1.redhat.local      | :-)   | UP    |
| DHCP agent         | dcn3-compute3-0.redhat.local            | :-)   | UP    |
| DHCP agent         | central-controller0-2.redhat.local      | :-)   | UP    |
| Open vSwitch agent | dcn3-compute3-1.redhat.local            | :-)   | UP    |
| Open vSwitch agent | dcn1-compute1-1.redhat.local            | :-)   | UP    |
| Open vSwitch agent | central-computehci0-0.redhat.local      | :-)   | UP    |
| DHCP agent         | central-controller0-1.redhat.local      | :-)   | UP    |
| L3 agent           | central-controller0-2.redhat.local      | :-)   | UP    |
| Metadata agent     | central-controller0-1.redhat.local      | :-)   | UP    |
| Open vSwitch agent | dcn2-computescaleout2-0.redhat.local    | :-)   | UP    |
| Open vSwitch agent | dcn2-computehci2-5.redhat.local         | :-)   | UP    |
| Open vSwitch agent | central-computehci0-2.redhat.local      | :-)   | UP    |
| DHCP agent         | central-controller0-0.redhat.local      | :-)   | UP    |
| Open vSwitch agent | central-controller0-1.redhat.local      | :-)   | UP    |
| Open vSwitch agent | dcn2-computehci2-0.redhat.local         | :-)   | UP    |
| Open vSwitch agent | dcn1-compute1-0.redhat.local            | :-)   | UP    |
...

Verify the status of the Ceph Cluster:

Use SSH to connect to the new DistributedComputeHCI node and check the status of the Ceph cluster:

[root@dcn2-computehci2-5 ~]# podman exec -it ceph-mon-dcn2-computehci2-5 \
ceph -s -c /etc/ceph/dcn2.conf

Verify that both the ceph mon and ceph mgr services exist for the new node:

services:
    mon: 3 daemons, quorum dcn2-computehci2-2,dcn2-computehci2-0,dcn2-computehci2-5 (age 3d)
    mgr: dcn2-computehci2-2(active, since 3d), standbys: dcn2-computehci2-0, dcn2-computehci2-5
    osd: 20 osds: 20 up (since 3d), 20 in (since 3d)

Verify the status of the Ceph OSDs with ceph osd tree. Ensure that all OSDs for the new node have a STATUS of up:

[root@dcn2-computehci2-5 ~]# podman exec -it ceph-mon-dcn2-computehci2-5 ceph osd tree -c /etc/ceph/dcn2.conf
ID CLASS WEIGHT  TYPE NAME                           STATUS REWEIGHT PRI-AFF
-1       0.97595 root default
-5       0.24399     host dcn2-computehci2-0
 0   hdd 0.04880         osd.0                           up  1.00000 1.00000
 4   hdd 0.04880         osd.4                           up  1.00000 1.00000
 8   hdd 0.04880         osd.8                           up  1.00000 1.00000
13   hdd 0.04880         osd.13                          up  1.00000 1.00000
17   hdd 0.04880         osd.17                          up  1.00000 1.00000
-9       0.24399     host dcn2-computehci2-2
 3   hdd 0.04880         osd.3                           up  1.00000 1.00000
 5   hdd 0.04880         osd.5                           up  1.00000 1.00000
10   hdd 0.04880         osd.10                          up  1.00000 1.00000
14   hdd 0.04880         osd.14                          up  1.00000 1.00000
19   hdd 0.04880         osd.19                          up  1.00000 1.00000
-3       0.24399     host dcn2-computehci2-5
 1   hdd 0.04880         osd.1                           up  1.00000 1.00000
 7   hdd 0.04880         osd.7                           up  1.00000 1.00000
11   hdd 0.04880         osd.11                          up  1.00000 1.00000
15   hdd 0.04880         osd.15                          up  1.00000 1.00000
18   hdd 0.04880         osd.18                          up  1.00000 1.00000
-7       0.24399     host dcn2-computehciscaleout2-0
 2   hdd 0.04880         osd.2                           up  1.00000 1.00000
 6   hdd 0.04880         osd.6                           up  1.00000 1.00000
 9   hdd 0.04880         osd.9                           up  1.00000 1.00000
12   hdd 0.04880         osd.12                          up  1.00000 1.00000
16   hdd 0.04880         osd.16                          up  1.00000 1.00000

Verify that the cinder-volume service for the new DistributedComputeHCI node has a Status of enabled and a State of up:

(central) [stack@site-undercloud-0 ~]$ openstack volume service list --service cinder-volume -c Binary -c Host -c Zone -c Status -c State
+---------------+---------------------------------+------------+---------+-------+
| Binary        | Host                            | Zone       | Status  | State |
+---------------+---------------------------------+------------+---------+-------+
| cinder-volume | hostgroup@tripleo_ceph          | az-central | enabled | up    |
| cinder-volume | dcn1-compute1-1@tripleo_ceph    | az-dcn1    | enabled | up    |
| cinder-volume | dcn1-compute1-0@tripleo_ceph    | az-dcn1    | enabled | up    |
| cinder-volume | dcn2-computehci2-0@tripleo_ceph | az-dcn2    | enabled | up    |
| cinder-volume | dcn2-computehci2-2@tripleo_ceph | az-dcn2    | enabled | up    |
| cinder-volume | dcn2-computehci2-5@tripleo_ceph | az-dcn2    | enabled | up    |
+---------------+---------------------------------+------------+---------+-------+

Note

If the State of the cinder-volume service is down, then the service has not been started on the node.

Use SSH to connect to the new DistributedComputeHCI node and check the status of the Image service (glance) services with systemctl:

[root@dcn2-computehci2-5 ~]# systemctl --type service | grep glance
  tripleo_glance_api.service                        loaded active     running       glance_api container
  tripleo_glance_api_healthcheck.service            loaded activating start   start glance_api healthcheck
  tripleo_glance_api_tls_proxy.service              loaded active     running       glance_api_tls_proxy container

9.7. Troubleshooting DistributedComputeHCI state down

If the replacement node was deployed without the EtcdInitialClusterState parameter value set to existing, then the cinder-volume service of the replaced node shows down when you run openstack volume service list.

Procedure

Log on to the replacement node and check the logs for the etcd service. Confirm that the /var/log/containers/stdouts/etcd.log log file shows the etcd service reporting a cluster ID mismatch:
```
2022-04-06T18:00:11.834104130+00:00 stderr F 2022-04-06 18:00:11.834045 E | rafthttp: request cluster ID mismatch (got 654f4cf0e2cfb9fd want 918b459b36fe2c0c)
```
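
Searching the log for the mismatch message is quicker than reading it in full. The following is a minimal sketch: `etcd_log_has_id_mismatch` is a hypothetical helper, and the temporary file stands in for the real log so that the example can run anywhere.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) if the given etcd log file contains a
# cluster ID mismatch report.
etcd_log_has_id_mismatch() {
  grep -qF 'request cluster ID mismatch' "$1"
}

# Stand-in log file that holds the sample line shown above.
sample_log="$(mktemp)"
cat > "$sample_log" <<'EOF'
2022-04-06T18:00:11.834104130+00:00 stderr F 2022-04-06 18:00:11.834045 E | rafthttp: request cluster ID mismatch (got 654f4cf0e2cfb9fd want 918b459b36fe2c0c)
EOF

if etcd_log_has_id_mismatch "$sample_log"; then
  echo "cluster ID mismatch reported"
else
  echo "no cluster ID mismatch found"
fi

# Assumed usage on the replacement node:
#   etcd_log_has_id_mismatch /var/log/containers/stdouts/etcd.log
```

The helper matches any mismatch entry in the file, including old ones, so check the timestamps of the matching lines before you act on the result.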
Set the EtcdInitialClusterState parameter to the value of existing in your deployment templates and rerun the deployment script.

Use SSH to connect to the replacement node and run the following commands as root:

[root@dcn2-computehci2-4 ~]# systemctl stop tripleo_etcd
[root@dcn2-computehci2-4 ~]# rm -rf /var/lib/etcd/*
[root@dcn2-computehci2-4 ~]# systemctl start tripleo_etcd

Recheck the /var/log/containers/stdouts/etcd.log log file to verify that the node successfully joined the cluster:

2022-04-06T18:24:22.130059875+00:00 stderr F 2022-04-06 18:24:22.129395 I | etcdserver/membership: added member 96f61470cd1839e5 [https://dcn2-computehci2-4.internalapi.redhat.local:2380] to cluster 654f4cf0e2cfb9fd

Check the state of the cinder-volume service, and confirm it reads up on the replacement node when you run openstack volume service list.
