Chapter 11. Replacing Controller nodes
If a Controller node in a high availability cluster fails, you must remove the node from the cluster and replace it with a new Controller node.
The Controller node replacement process involves running the openstack overcloud deploy command to update the overcloud with a request to replace a Controller node.
The following procedure applies only to high availability environments. Do not use this procedure if you are using only one Controller node.
11.1. Preparing for Controller replacement
Before you replace an overcloud Controller node, it is important to check the current state of your Red Hat OpenStack Platform environment. Checking the current state can help avoid complications during the Controller replacement process. Use the following list of preliminary checks to determine if it is safe to perform a Controller node replacement. Run all commands for these checks on the undercloud.
Procedure
Check the current status of the overcloud stack on the undercloud:
$ source stackrc
$ openstack overcloud status
Only continue if the overcloud stack has a deployment status of DEPLOY_SUCCESS.
Install the database client tools:
$ sudo dnf -y install mariadb
Configure root user access to the database:
$ sudo cp /var/lib/config-data/puppet-generated/mysql/root/.my.cnf /root/.
Perform a backup of the undercloud databases:
$ mkdir /home/stack/backup
$ sudo mysqldump --all-databases --quick --single-transaction | gzip > /home/stack/backup/dump_db_undercloud.sql.gz
Check that your undercloud contains 10 GB of free storage to accommodate image caching and conversion when you provision the new node:
$ df -h
If you are reusing the IP address for the new Controller node, ensure that you delete the port used by the old Controller:
$ openstack port delete <port>
Check the status of Pacemaker on the running Controller nodes. For example, if 192.168.0.47 is the IP address of a running Controller node, use the following command to view the Pacemaker status:
$ ssh tripleo-admin@192.168.0.47 'sudo pcs status'
The output shows all services that are running on the existing nodes and those that are stopped on the failed node.
Check the following parameters on each node of the overcloud MariaDB cluster:
- wsrep_local_state_comment: Synced
- wsrep_cluster_size: 2
Use the following command to check these parameters on each running Controller node. In this example, the Controller node IP addresses are 192.168.0.47 and 192.168.0.46:
$ for i in 192.168.0.46 192.168.0.47 ; do echo "*** $i ***" ; ssh tripleo-admin@$i "sudo podman exec \$(sudo podman ps --filter name=galera-bundle -q) mysql -e \"SHOW STATUS LIKE 'wsrep_local_state_comment'; SHOW STATUS LIKE 'wsrep_cluster_size';\""; done
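If you check several nodes, the verdict can also be scripted. The following is a minimal sketch, not part of the documented procedure: the `check_wsrep` helper is a hypothetical name, and it parses the tab-separated SHOW STATUS output that the loop above prints.

```shell
# Hypothetical helper, not part of the documented procedure: parse the
# tab-separated `SHOW STATUS` output produced by the loop above and print a
# one-line verdict.
check_wsrep() {
  awk '
    $1 == "wsrep_local_state_comment" && $2 != "Synced" { bad = 1 }
    $1 == "wsrep_cluster_size"                          { size = $2 }
    END { print (bad ? "NOT-SYNCED" : "SYNCED") " cluster_size=" size }
  '
}

# Example: pipe the loop output through the helper.
printf 'wsrep_local_state_comment\tSynced\nwsrep_cluster_size\t2\n' | check_wsrep
# prints: SYNCED cluster_size=2
```

Any node reporting a state other than Synced flips the verdict to NOT-SYNCED, which is a signal to investigate before continuing.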
Check the RabbitMQ status. For example, if 192.168.0.47 is the IP address of a running Controller node, use the following command to view the RabbitMQ status:
$ ssh tripleo-admin@192.168.0.47 "sudo podman exec \$(sudo podman ps -f name=rabbitmq-bundle -q) rabbitmqctl cluster_status"
The running_nodes key should show only the two available nodes and not the failed node.
If fencing is enabled, disable it. For example, if 192.168.0.47 is the IP address of a running Controller node, use the following command to check the status of fencing:
$ ssh tripleo-admin@192.168.0.47 "sudo pcs property show stonith-enabled"
Run the following command to disable fencing:
$ ssh tripleo-admin@192.168.0.47 "sudo pcs property set stonith-enabled=false"
Log in to the failed Controller node and stop all the nova_* containers that are running:
$ sudo systemctl stop tripleo_nova_api.service
$ sudo systemctl stop tripleo_nova_api_cron.service
$ sudo systemctl stop tripleo_nova_conductor.service
$ sudo systemctl stop tripleo_nova_metadata.service
$ sudo systemctl stop tripleo_nova_scheduler.service
If you are using the Bare Metal Service (ironic) as the virt driver, you must reuse the hostname when you replace the Controller node. Reusing the hostname prevents the Compute service (nova) database from being corrupted and prevents the workload from having to be rebalanced when the Bare Metal Provisioning service is redeployed.
11.2. Removing a Ceph Monitor daemon
If your Controller node is running a Ceph monitor service, complete the following steps to remove the ceph-mon daemon.
Adding a new Controller node to the cluster also adds a new Ceph monitor daemon automatically.
If you are using director-deployed Red Hat Ceph Storage, it is important to understand the impact of replacing Controller nodes. The Ceph Monitor service runs on the Controller nodes and is typically assigned IP addresses from the Storage network. These Ceph Monitor service IP addresses are associated with VM instances where Red Hat Ceph Storage is used. They are not dynamically updated if the Ceph Monitor service IP address changes during replacement of the Controller node. This could result in a storage outage, especially if multiple Controller nodes are replaced. Each VM instance would have to be migrated, rebooted, or shelved and unshelved to resolve the IP address change and the resulting outage.
Reuse the IP addresses of the removed Ceph Monitor service instances instead of using new IP addresses to avoid this situation.
For an example, see the fixed_ip configuration example in Step 5 of Provisioning bare metal nodes for the overcloud.
Use the following command on a Controller node to find the current Ceph Monitor service IP addresses:
$ sudo cephadm shell -- ceph mon stat
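If you want to collect those addresses programmatically, the following sketch parses typical `ceph mon stat` output. The exact output format can vary between Ceph releases, so treat the pattern as an assumption to verify, and the `extract_mon_ips` helper name as hypothetical.

```shell
# Hypothetical helper: extract the v1 monitor IP addresses from `ceph mon stat`
# output so that they can be reused as fixed IPs for the replacement node.
# Assumes addresses appear in the v1:<ip>:<port> form shown in this chapter.
extract_mon_ips() {
  grep -oE 'v1:[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' | cut -d: -f2 | sort -u
}

# Example with sample output in the format shown elsewhere in this chapter:
printf 'e2: 3 mons at {controller-0=[v2:172.23.3.153:3300/0,v1:172.23.3.153:6789/0]}\n' | extract_mon_ips
# prints: 172.23.3.153
```

In practice you would pipe `sudo cephadm shell -- ceph mon stat` into the helper and cross-check the result against the addresses on the Storage network.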
Procedure
Connect to the Controller node that you want to replace:
$ ssh tripleo-admin@192.168.0.47
List the Ceph mon services:
$ sudo systemctl --type=service | grep ceph
ceph-4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31@crash.controller-0.service loaded active running Ceph crash.controller-0 for 4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31
ceph-4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31@mgr.controller-0.mufglq.service loaded active running Ceph mgr.controller-0.mufglq for 4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31
ceph-4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31@mon.controller-0.service loaded active running Ceph mon.controller-0 for 4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31
ceph-4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31@rgw.rgw.controller-0.ikaevh.service loaded active running Ceph rgw.rgw.controller-0.ikaevh for 4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31
Stop the Ceph mon service:
$ sudo systemctl stop ceph-4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31@mon.controller-0.service
Disable the Ceph mon service:
$ sudo systemctl disable ceph-4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31@mon.controller-0.service
Disconnect from the Controller node that you want to replace.
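The long unit names in the preceding steps follow the cephadm naming pattern ceph-<cluster_fsid>@<daemon>.<host>.service (the mgr and rgw units in the example output carry an extra random suffix). A quick illustration, using the fsid from the example output:

```shell
# Compose a mon unit name from the cephadm pattern
# ceph-<cluster_fsid>@<daemon>.<host>.service.
# The fsid and host below come from the example output in this section.
fsid="4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31"
host="controller-0"
mon_unit="ceph-${fsid}@mon.${host}.service"
echo "$mon_unit"
# prints: ceph-4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31@mon.controller-0.service
```

Knowing the pattern makes it easier to pick the right unit out of the `systemctl --type=service | grep ceph` listing on your own cluster, where the fsid differs.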
Use SSH to connect to another Controller node in the same cluster:
$ ssh tripleo-admin@192.168.0.46
The Ceph specification file is modified and applied later in this procedure. To manipulate the file, you must export it:
$ sudo cephadm shell -- ceph orch ls --export > spec.yaml
Remove the monitor from the cluster:
$ sudo cephadm shell -- ceph mon remove controller-0
removing mon.controller-0 at [v2:172.23.3.153:3300/0,v1:172.23.3.153:6789/0], there will be 2 monitors
Disconnect from the Controller node and log back in to the Controller node that you are removing from the cluster:
$ ssh tripleo-admin@192.168.0.47
List the Ceph mgr services:
$ sudo systemctl --type=service | grep ceph
ceph-4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31@crash.controller-0.service loaded active running Ceph crash.controller-0 for 4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31
ceph-4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31@mgr.controller-0.mufglq.service loaded active running Ceph mgr.controller-0.mufglq for 4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31
ceph-4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31@rgw.rgw.controller-0.ikaevh.service loaded active running Ceph rgw.rgw.controller-0.ikaevh for 4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31
Stop the Ceph mgr service:
$ sudo systemctl stop ceph-4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31@mgr.controller-0.mufglq.service
Disable the Ceph mgr service:
$ sudo systemctl disable ceph-4cf401f9-dd4c-5cda-9f0a-fa47fbf12b31@mgr.controller-0.mufglq.service
Start a cephadm shell:
$ sudo cephadm shell
Verify that the Ceph mgr service for the Controller node is removed from the cluster:
The node is not listed if the Ceph mgr service is successfully removed.
Export the Red Hat Ceph Storage specification:
$ ceph orch ls --export > spec.yaml
In the spec.yaml specification file, remove all instances of the host, for example controller-0, from the service_type: mon and service_type: mgr sections.
Reapply the Red Hat Ceph Storage specification:
$ ceph orch apply -i spec.yaml
Verify that no Ceph daemons remain on the removed host:
$ ceph orch ps controller-0
Note: If daemons are present, use the following command to remove them:
$ ceph orch host drain controller-0
Prior to running the ceph orch host drain command, back up the contents of /etc/ceph, and restore the contents after the command completes. You must back up prior to running the ceph orch host drain command until https://bugzilla.redhat.com/show_bug.cgi?id=2153827 is resolved.
Remove the controller-0 host from the Red Hat Ceph Storage cluster:
$ ceph orch host rm controller-0
Removed host 'controller-0'
Exit the cephadm shell:
$ exit
Additional Resources
- For more information on controlling Red Hat Ceph Storage services with systemd, see Understanding process management for Ceph.
- For more information on editing and applying Red Hat Ceph Storage specification files, see Deploying the Ceph monitor daemons using the service specification.
11.3. Preparing the cluster for Controller node replacement
Before you replace the node, ensure that Pacemaker is not running on the node and then remove that node from the Pacemaker cluster.
Procedure
To view the list of IP addresses for the Controller nodes, run the following command:
Log in to the node and confirm the Pacemaker status. If Pacemaker is running, use the pcs cluster command to stop Pacemaker. This example stops Pacemaker on overcloud-controller-0:
(undercloud) $ ssh tripleo-admin@192.168.0.45 "sudo pcs status | grep -w Online | grep -w overcloud-controller-0"
(undercloud) $ ssh tripleo-admin@192.168.0.45 "sudo pcs cluster stop overcloud-controller-0"
Note: If the node is physically unavailable or stopped, it is not necessary to perform the previous operation, because Pacemaker is already stopped on that node.
After you stop Pacemaker on the node, delete the node from the Pacemaker cluster. The following example logs in to overcloud-controller-1 to remove overcloud-controller-0:
(undercloud) $ ssh tripleo-admin@192.168.0.45 "sudo pcs cluster node remove overcloud-controller-0"
If the node that you want to replace is unreachable (for example, due to a hardware failure), run the pcs command with the additional --skip-offline and --force options to forcibly remove the node from the cluster:
(undercloud) $ ssh tripleo-admin@192.168.0.45 "sudo pcs cluster node remove overcloud-controller-0 --skip-offline --force"
After you remove the node from the Pacemaker cluster, remove the node from the list of known hosts in Pacemaker:
(undercloud) $ ssh tripleo-admin@192.168.0.45 "sudo pcs host deauth overcloud-controller-0"
You can run this command whether the node is reachable or not.
To ensure that the new Controller node uses the correct STONITH fencing device after replacement, delete the devices from the node by entering the following command:
(undercloud) $ ssh tripleo-admin@192.168.0.45 "sudo pcs stonith delete <stonith_resource_name>"
Replace <stonith_resource_name> with the name of the STONITH resource that corresponds to the node. The resource name uses the format <resource_agent>-<host_mac>. You can find the resource agent and the host MAC address in the FencingConfig section of the fencing.yaml file.
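As an illustration of that naming format, the resource name is the resource agent and the host MAC address joined with a hyphen. Both values below are made-up examples; take the real ones from the FencingConfig section of fencing.yaml, and note that the MAC formatting in real resource names may differ from this sketch.

```shell
# Illustrative only: build a STONITH resource name in the
# <resource_agent>-<host_mac> format described above. Both values are
# invented examples, not taken from a real deployment.
resource_agent="fence_ipmilan"
host_mac="52:54:00:ab:cd:ef"
stonith_resource_name="${resource_agent}-${host_mac}"
echo "$stonith_resource_name"
# prints: fence_ipmilan-52:54:00:ab:cd:ef
```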
The overcloud database must continue to run during the replacement procedure. To ensure that Pacemaker does not stop Galera during this procedure, select a running Controller node and run the following command on the undercloud with the IP address of the Controller node:
(undercloud) $ ssh tripleo-admin@192.168.0.45 "sudo pcs resource unmanage galera-bundle"
Remove the OVN northbound database server for the replaced Controller node from the cluster:
Obtain the server ID of the OVN northbound database server to be replaced:
$ ssh tripleo-admin@<controller_ip> sudo podman exec ovn_cluster_north_db_server ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound 2>/dev/null | grep -A4 Servers:
Replace <controller_ip> with the IP address of any active Controller node.
You should see output similar to the following:
Servers:
96da (96da at tcp:172.17.1.55:6643) (self) next_index=26063 match_index=26063
466b (466b at tcp:172.17.1.51:6643) next_index=26064 match_index=26063 last msg 2936 ms ago
ba77 (ba77 at tcp:172.17.1.52:6643) next_index=26064 match_index=26063 last msg 2936 ms ago
In this example, 172.17.1.55 is the internal IP address of the Controller node that is being replaced, so the northbound database server ID is 96da.
Using the server ID that you obtained in the preceding step, remove the OVN northbound database server by running the following command:
$ ssh tripleo-admin@172.17.1.52 sudo podman exec ovn_cluster_north_db_server ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/kick OVN_Northbound 96da
In this example, you would replace 172.17.1.52 with the IP address of any active Controller node, and replace 96da with the server ID of the OVN northbound database server.
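Reading the server ID out of the cluster/status output can also be scripted. The following hypothetical helper (not part of the documented procedure) prints the ID of the row that mentions a given IP address; the same parsing applies to the southbound output later in this section.

```shell
# Hypothetical helper: print the OVN cluster server ID for the row that
# mentions the given IP address in `cluster/status` output. Dots in the IP
# act as regex wildcards here, which is acceptable for this purpose.
ovn_server_id_for_ip() {
  awk -v ip="$1" '$0 ~ ("tcp:" ip ":") { print $1; exit }'
}

# Example using the sample output format shown above:
printf '96da (96da at tcp:172.17.1.55:6643) (self)\n466b (466b at tcp:172.17.1.51:6643)\n' | ovn_server_id_for_ip 172.17.1.55
# prints: 96da
```

In practice you would pipe the `ovs-appctl ... cluster/status` output through the helper with the internal IP address of the Controller node that you are replacing.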
Remove the OVN southbound database server for the replaced Controller node from the cluster:
Obtain the server ID of the OVN southbound database server to be replaced:
$ ssh tripleo-admin@<controller_ip> sudo podman exec ovn_cluster_south_db_server ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound 2>/dev/null | grep -A4 Servers:
Replace <controller_ip> with the IP address of any active Controller node.
You should see output similar to the following:
Servers:
e544 (e544 at tcp:172.17.1.55:6644) last msg 42802690 ms ago
17ca (17ca at tcp:172.17.1.51:6644) last msg 5281 ms ago
6e52 (6e52 at tcp:172.17.1.52:6644) (self)
In this example, 172.17.1.55 is the internal IP address of the Controller node that is being replaced, so the southbound database server ID is e544.
Using the server ID that you obtained in the preceding step, remove the OVN southbound database server by running the following command:
$ ssh tripleo-admin@172.17.1.52 sudo podman exec ovn_cluster_south_db_server ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/kick OVN_Southbound e544
In this example, you would replace 172.17.1.52 with the IP address of any active Controller node, and replace e544 with the server ID of the OVN southbound database server.
Run the following cleanup commands to prevent cluster rejoins. Substitute <replaced_controller_ip> with the IP address of the Controller node that you are replacing:
$ ssh tripleo-admin@<replaced_controller_ip> sudo systemctl disable --now tripleo_ovn_cluster_south_db_server.service tripleo_ovn_cluster_north_db_server.service
$ ssh tripleo-admin@<replaced_controller_ip> sudo rm -rfv /var/lib/openvswitch/ovn/.ovn* /var/lib/openvswitch/ovn/ovn*.db
11.4. Removing the Controller node from IdM
If your nodes are protected with TLSe, you must remove the host and DNS entries from the IdM (Identity Management) server.
On your IdM server, remove all DNS entries for the Controller node from IdM:
- Replace <host name> with the host name of your Controller.
- Replace <domain name> with the domain name of your Controller.
On the IdM server, remove the client host entry from the IdM LDAP server. This removes all services and revokes all certificates issued for that host:
[root@server ~]# ipa host-del client.idm.example.com
11.5. Replacing a bootstrap Controller node
If you want to replace the Controller node that you use for bootstrap operations and keep the node name, complete the following steps to set the name of the bootstrap Controller node after the replacement process.
Procedure
Find the name of the bootstrap Controller node by running the following command:
$ ssh tripleo-admin@<controller_ip> "sudo hiera -c /etc/puppet/hiera.yaml pacemaker_short_bootstrap_node_name"
Replace <controller_ip> with the IP address of any active Controller node.
Check if your environment files include the ExtraConfig and AllNodesExtraMapData parameters. If the parameters are not set, create the environment file ~/templates/bootstrap_controller.yaml and add the following content:
- Replace NODE_NAME with the name of an existing Controller node that you want to use in bootstrap operations after the replacement process.
- Replace NODE_IP with the IP address mapped to the Controller node named in NODE_NAME. To get the name, run the following command:
$ sudo hiera -c /etc/puppet/hiera.yaml ovn_dbs_node_ips
If your environment files already include the ExtraConfig and AllNodesExtraMapData parameters, add only the lines shown in this step.
For information about troubleshooting the bootstrap Controller node replacement, see the Red Hat knowledgebase solution article Replacement of the first Controller node fails at step 1 if the same hostname is used for a new node.
11.6. Unprovisioning and removing Controller nodes
You can unprovision and remove Controller nodes.
Procedure
Source the stackrc file:
$ source ~/stackrc
Identify the UUID of the overcloud-controller-0 node:
(undercloud) $ NODE=$(metalsmith -c UUID -f value show overcloud-controller-0)
Set the node to maintenance mode:
$ openstack baremetal node maintenance set $NODE
Copy the overcloud-baremetal-deploy.yaml file:
$ cp /home/stack/templates/overcloud-baremetal-deploy.yaml /home/stack/templates/unprovision_controller-0.yaml
In the unprovision_controller-0.yaml file, lower the Controller count to unprovision the Controller node that you are replacing. In this example, the count is reduced from 3 to 2. Move the controller-0 node to the instances dictionary and set the provisioned parameter to false.
Run the node unprovision command:
$ openstack overcloud node delete \
  --stack overcloud \
  --baremetal-deployment /home/stack/templates/unprovision_controller-0.yaml
Optional: Delete the ironic node:
$ openstack baremetal node delete <IRONIC_NODE_UUID>
Replace <IRONIC_NODE_UUID> with the UUID of the node.
11.7. Deploying a new Controller node to the overcloud
To deploy a new Controller node to the overcloud, complete the following steps.
Prerequisites
- The new Controller node must be registered, inspected, and tagged as ready for provisioning. For more information, see Provisioning bare metal overcloud nodes.
Procedure
- Log in to the undercloud host as the stack user.
Source the stackrc undercloud credentials file:
$ source ~/stackrc
If you want to use the same scheduling, placement, or IP addresses, you can edit the overcloud-baremetal-deploy.yaml environment file. Set the hostname, name, and networks for the new controller-0 instance in the instances section:
Note: If you are using the Bare Metal Service (ironic) as the virt driver, you must reuse the hostname when you replace the Controller node. Reusing the hostname prevents the Compute service (nova) database from being corrupted and prevents the workload from having to be rebalanced when the Bare Metal Provisioning service is redeployed.
Provision the overcloud:
$ openstack overcloud node provision --stack overcloud --network-config --output /home/stack/templates/overcloud-baremetal-deployed.yaml /home/stack/templates/overcloud-baremetal-deploy.yaml
If you added a new controller-0 instance, remove the instances section from the overcloud-baremetal-deploy.yaml file when the node is provisioned.
To create the cephadm user on the new Controller node, export a basic Ceph specification that contains the new host information:
$ openstack overcloud ceph spec --stack overcloud \
  /home/stack/templates/overcloud-baremetal-deployed.yaml \
  -o ceph_spec_host.yaml
Note: If your environment uses a custom role, include the --roles-data option.
Add the cephadm user to the new Controller node:
$ openstack overcloud ceph user enable \
  --stack overcloud ceph_spec_host.yaml
Log in to the Controller node and add the new role to the Ceph cluster:
- Replace <IP_ADDRESS> with the IP address of the Controller node.
- Replace <LABELS> with any required Ceph labels.
Re-run the openstack overcloud deploy command:
Note: If the replacement Controller node is the bootstrap node, include the bootstrap_controller.yaml environment file.
11.8. Deploying Ceph services on the new Controller node
After you provision a new Controller node and the Ceph monitor services are running, you can deploy the mgr, rgw, and osd Ceph services on the Controller node.
Prerequisites
- The new Controller node is provisioned and is running Ceph monitor services.
Procedure
Export the current Ceph specification, and then modify the spec.yml environment file to replace the previous Controller node name with the new Controller node name:
$ cephadm shell -- ceph orch ls --export > spec.yml
Note: Do not use the basic Ceph environment file ceph_spec_host.yaml because it does not contain all of the necessary cluster information.
Apply the modified Ceph specification file:
Verify the visibility of the new monitor:
$ sudo cephadm shell -- ceph status
11.9. Cleaning up after Controller node replacement
After you complete the node replacement, you can finalize the Controller cluster.
Procedure
- Log in to a Controller node.
Enable Pacemaker management of the Galera cluster and start Galera on the new node:
[tripleo-admin@overcloud-controller-0 ~]$ sudo pcs resource refresh galera-bundle
[tripleo-admin@overcloud-controller-0 ~]$ sudo pcs resource manage galera-bundle

Enable fencing:
[tripleo-admin@overcloud-controller-0 ~]$ sudo pcs property set stonith-enabled=true

Perform a final status check to ensure that the services are running correctly:
[tripleo-admin@overcloud-controller-0 ~]$ sudo pcs status

Note: If any services have failed, use the pcs resource refresh command to resolve and restart the failed services.

Exit to director:
[tripleo-admin@overcloud-controller-0 ~]$ exit

Source the overcloudrc file so that you can interact with the overcloud:

$ source ~/overcloudrc

Check the network agents in your overcloud environment:
(overcloud) $ openstack network agent list

If any agents appear for the old node, remove them:
(overcloud) $ for AGENT in $(openstack network agent list --host overcloud-controller-1.localdomain -c ID -f value) ; do openstack network agent delete $AGENT ; done

If necessary, add your router to the L3 agent host on the new node. Use the following example command to add a router named r1 to the L3 agent using the UUID 2d1c1dc1-d9d4-4fa9-b2c8-f29cd1a649d4:

(overcloud) $ openstack network agent add router --l3 2d1c1dc1-d9d4-4fa9-b2c8-f29cd1a649d4 r1

Clean the cinder services.
List the cinder services:
(overcloud) $ openstack volume service list

Log in to a Controller node, connect to the cinder_api container, and use the cinder-manage service remove command to remove leftover services:

[tripleo-admin@overcloud-controller-0 ~]$ sudo podman exec -it cinder_api cinder-manage service remove cinder-backup <host>
[tripleo-admin@overcloud-controller-0 ~]$ sudo podman exec -it cinder_api cinder-manage service remove cinder-scheduler <host>
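As a concrete illustration of the commands above, where the replaced host name is a hypothetical example taken from the openstack volume service list output:

```shell
# Remove the stale cinder-backup and cinder-scheduler service records
# for the replaced Controller. The host name below is a hypothetical
# example; use the value shown in 'openstack volume service list'.
sudo podman exec -it cinder_api cinder-manage service remove cinder-backup overcloud-controller-1.localdomain
sudo podman exec -it cinder_api cinder-manage service remove cinder-scheduler overcloud-controller-1.localdomain
```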
Clean the RabbitMQ cluster.
- Log in to a Controller node.
Use the podman exec command to launch bash, and verify the status of the RabbitMQ cluster:

[tripleo-admin@overcloud-controller-0 ~]$ sudo podman exec -it rabbitmq-bundle-podman-0 bash
[root@overcloud-controller-0 /]$ rabbitmqctl cluster_status

Use the rabbitmqctl command to forget the replaced Controller node:

[root@overcloud-controller-0 /]$ rabbitmqctl forget_cluster_node <node_name>
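RabbitMQ node names take the form rabbit@<short_host_name>; for example, with a hypothetical replaced node name:

```shell
# Forget the replaced node. The node name below is a hypothetical
# example and must match an entry in the cluster_status output above.
rabbitmqctl forget_cluster_node rabbit@overcloud-controller-1
```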
- If you replaced a bootstrap Controller node, you must remove the environment file ~/templates/bootstrap_controller.yaml after the replacement process, or delete the pacemaker_short_bootstrap_node_name and mysql_short_bootstrap_node_name parameters from your existing environment file. This step prevents director from attempting to override the Controller node name in subsequent replacements. For more information, see Replacing a bootstrap Controller node.
- If you are using the Object Storage service (swift) on the overcloud, you must synchronize the swift rings after updating the overcloud nodes. Use a script, similar to the following example, to distribute ring files from a previously existing Controller node (Controller node 0 in this example) to all Controller nodes and restart the Object Storage service containers on those nodes:
#!/bin/sh
set -xe

SRC="tripleo-admin@overcloud-controller-0.ctlplane"
ALL="tripleo-admin@overcloud-controller-0.ctlplane tripleo-admin@overcloud-controller-1.ctlplane tripleo-admin@overcloud-controller-2.ctlplane"

Fetch the current set of ring files:
ssh "${SRC}" 'sudo tar -czvf - /var/lib/config-data/puppet-generated/swift_ringbuilder/etc/swift/{*.builder,*.ring.gz,backups/*.builder}' > swift-rings.tar.gz

Upload rings to all nodes, put them into the correct place, and restart swift services:
for DST in ${ALL}; do
  cat swift-rings.tar.gz | ssh "${DST}" 'sudo tar -C / -xvzf -'
  ssh "${DST}" 'sudo podman restart swift_copy_rings'
  ssh "${DST}" 'sudo systemctl restart tripleo_swift*'
done
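After distributing the rings, you can optionally confirm that they are consistent across the cluster with swift-recon; a minimal sketch, assuming the swift_object_server container name used in typical TripleO deployments:

```shell
# Compare MD5 checksums of the ring files across all Swift nodes.
# The container name swift_object_server is an assumption based on
# typical TripleO naming; adjust it to match your deployment.
sudo podman exec -it swift_object_server swift-recon --md5
```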