Chapter 16. Replacing Controller nodes
In certain circumstances a Controller node in a high availability cluster might fail. In these situations, you must remove the node from the cluster and replace it with a new Controller node.
Complete the steps in this section to replace a Controller node. The Controller node replacement process involves running the openstack overcloud deploy command to update the overcloud with a request to replace a Controller node.
The following procedure applies only to high availability environments. Do not use this procedure if you are using only one Controller node.
16.1. Preparing for Controller replacement
Before you replace an overcloud Controller node, it is important to check the current state of your Red Hat OpenStack Platform environment. Checking the current state can help avoid complications during the Controller replacement process. Use the following list of preliminary checks to determine if it is safe to perform a Controller node replacement. Run all commands for these checks on the undercloud.
Procedure
Check the current status of the overcloud stack on the undercloud:
$ source stackrc
(undercloud) $ openstack stack list --nested
The overcloud stack and its subsequent child stacks should have either a CREATE_COMPLETE or UPDATE_COMPLETE status.
Install the database client tools:
(undercloud) $ sudo dnf -y install mariadb
Configure root user access to the database:
(undercloud) $ sudo cp /var/lib/config-data/puppet-generated/mysql/root/.my.cnf /root/.
Perform a backup of the undercloud databases:
(undercloud) $ mkdir /home/stack/backup
(undercloud) $ sudo mysqldump --all-databases --quick --single-transaction | gzip > /home/stack/backup/dump_db_undercloud.sql.gz
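If you want to confirm that the compressed dump is readable before you continue, you can test the archive. This is an optional check and not part of the required steps:
(undercloud) $ gzip -t /home/stack/backup/dump_db_undercloud.sql.gz && echo "backup archive OK"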
Check that your undercloud contains 10 GB of free storage to accommodate image caching and conversion when you provision the new node:
(undercloud) $ df -h
Check the status of Pacemaker on the running Controller nodes. For example, if 192.168.0.47 is the IP address of a running Controller node, use the following command to view the Pacemaker status:
(undercloud) $ ssh heat-admin@192.168.0.47 'sudo pcs status'
The output shows all services that are running on the existing nodes and that are stopped on the failed node.
Check the following parameters on each node of the overcloud MariaDB cluster:
wsrep_local_state_comment: Synced
wsrep_cluster_size: 2
Use the following command to check these parameters on each running Controller node. In this example, the Controller node IP addresses are 192.168.0.47 and 192.168.0.46:
(undercloud) $ for i in 192.168.0.47 192.168.0.46 ; do echo "*** $i ***" ; ssh heat-admin@$i "sudo podman exec \$(sudo podman ps --filter name=galera-bundle -q) mysql -e \"SHOW STATUS LIKE 'wsrep_local_state_comment'; SHOW STATUS LIKE 'wsrep_cluster_size';\""; done
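If you prefer a quick pass/fail view, the following small loop is an illustrative variant of the command above, using the same example IP addresses, that prints whether each node reports the Synced state:
(undercloud) $ for i in 192.168.0.47 192.168.0.46 ; do ssh heat-admin@$i "sudo podman exec \$(sudo podman ps --filter name=galera-bundle -q) mysql -e \"SHOW STATUS LIKE 'wsrep_local_state_comment';\"" | grep -q Synced && echo "$i: Synced" || echo "$i: NOT synced" ; done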
Check the RabbitMQ status. For example, if 192.168.0.47 is the IP address of a running Controller node, use the following command to view the RabbitMQ status:
(undercloud) $ ssh heat-admin@192.168.0.47 "sudo podman exec \$(sudo podman ps -f name=rabbitmq-bundle -q) rabbitmqctl cluster_status"
(undercloud) $ ssh heat-admin@192.168.0.47 "sudo podman exec \$(sudo podman ps -f name=rabbitmq-bundle -q) rabbitmqctl cluster_status"
The running_nodes key should show only the two available nodes and not the failed node.
If fencing is enabled, disable it. For example, if 192.168.0.47 is the IP address of a running Controller node, use the following command to check the status of fencing:
(undercloud) $ ssh heat-admin@192.168.0.47 "sudo pcs property show stonith-enabled"
(undercloud) $ ssh heat-admin@192.168.0.47 "sudo pcs property show stonith-enabled"
Run the following command to disable fencing:
(undercloud) $ ssh heat-admin@192.168.0.47 "sudo pcs property set stonith-enabled=false"
(undercloud) $ ssh heat-admin@192.168.0.47 "sudo pcs property set stonith-enabled=false"
Check that the Compute services are active on the director node:
(undercloud) $ openstack hypervisor list
The output should show all non-maintenance mode nodes as up.
Ensure all undercloud containers are running:
(undercloud) $ sudo podman ps
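If you want to quickly spot any containers that are not running, you can also list exited containers. This is an optional convenience check and not part of the original steps:
(undercloud) $ sudo podman ps -a --filter status=exited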
16.2. Removing a Ceph Monitor daemon
If your Controller node is running a Ceph monitor service, complete the following steps to remove the ceph-mon daemon.
Adding a new Controller node to the cluster also adds a new Ceph monitor daemon automatically.
Procedure
Connect to the Controller node that you want to replace and become the root user:
# ssh heat-admin@192.168.0.47
# sudo su -
Note: If the Controller node is unreachable, skip steps 1 and 2 and continue the procedure at step 3 on any working Controller node.
Stop the monitor:
# systemctl stop ceph-mon@<monitor_hostname>
For example:
# systemctl stop ceph-mon@overcloud-controller-1
Disconnect from the Controller node that you want to replace.
Connect to one of the existing Controller nodes.
# ssh heat-admin@192.168.0.46
# sudo su -
Remove the monitor from the cluster:
# sudo podman exec -it ceph-mon-controller-0 ceph mon remove overcloud-controller-1
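To confirm that the monitor no longer appears in the monitor map, you can query the monitor status from the same container. This is an optional check; the container name matches the example above and might differ in your environment:
# sudo podman exec -it ceph-mon-controller-0 ceph mon stat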
On all Controller nodes, remove the v1 and v2 monitor entries from /etc/ceph/ceph.conf. For example, if you remove controller-1, then remove the IPs and hostname for controller-1.
Before:
mon host = [v2:172.18.0.21:3300,v1:172.18.0.21:6789],[v2:172.18.0.22:3300,v1:172.18.0.22:6789],[v2:172.18.0.24:3300,v1:172.18.0.24:6789]
mon initial members = overcloud-controller-2,overcloud-controller-1,overcloud-controller-0
After:
mon host = [v2:172.18.0.21:3300,v1:172.18.0.21:6789],[v2:172.18.0.24:3300,v1:172.18.0.24:6789]
mon initial members = overcloud-controller-2,overcloud-controller-0
Note: Director updates the ceph.conf file on the relevant overcloud nodes when you add the replacement Controller node. Normally, director manages this configuration file exclusively and you should not edit the file manually. However, you can edit the file manually if you want to ensure consistency in case the other nodes restart before you add the new node.
(Optional) Archive the monitor data and save the archive on another server:
# mv /var/lib/ceph/mon/<cluster>-<daemon_id> /var/lib/ceph/mon/removed-<cluster>-<daemon_id>
16.3. Preparing the cluster for Controller node replacement
Before you replace the old node, you must ensure that Pacemaker is not running on the node and then remove that node from the Pacemaker cluster.
Procedure
To view the list of IP addresses for the Controller nodes, run the following command:
(undercloud) $ openstack server list
If the old node is still reachable, log in to one of the remaining nodes and stop Pacemaker on the old node. For this example, stop Pacemaker on overcloud-controller-1:
(undercloud) $ ssh heat-admin@192.168.0.47 "sudo pcs status | grep -w Online | grep -w overcloud-controller-1" (undercloud) $ ssh heat-admin@192.168.0.47 "sudo pcs cluster stop overcloud-controller-1"
(undercloud) $ ssh heat-admin@192.168.0.47 "sudo pcs status | grep -w Online | grep -w overcloud-controller-1" (undercloud) $ ssh heat-admin@192.168.0.47 "sudo pcs cluster stop overcloud-controller-1"
Note: If the old node is physically unavailable or stopped, it is not necessary to perform the previous operation, because Pacemaker is already stopped on that node.
After you stop Pacemaker on the old node, delete the old node from the Pacemaker cluster. The following example command logs in to overcloud-controller-0 to remove overcloud-controller-1:
(undercloud) $ ssh heat-admin@192.168.0.47 "sudo pcs cluster node remove overcloud-controller-1"
If the node that you want to replace is unreachable (for example, due to a hardware failure), run the pcs command with the additional --skip-offline and --force options to forcibly remove the node from the cluster:
(undercloud) $ ssh heat-admin@192.168.0.47 "sudo pcs cluster node remove overcloud-controller-1 --skip-offline --force"
After you remove the old node from the Pacemaker cluster, remove the node from the list of known hosts in Pacemaker:
(undercloud) $ ssh heat-admin@192.168.0.47 "sudo pcs host deauth overcloud-controller-1"
(undercloud) $ ssh heat-admin@192.168.0.47 "sudo pcs host deauth overcloud-controller-1"
You can run this command whether the node is reachable or not.
The overcloud database must continue to run during the replacement procedure. To ensure that Pacemaker does not stop Galera during this procedure, select a running Controller node and run the following command on the undercloud with the IP address of the Controller node:
(undercloud) $ ssh heat-admin@192.168.0.47 "sudo pcs resource unmanage galera-bundle"
(undercloud) $ ssh heat-admin@192.168.0.47 "sudo pcs resource unmanage galera-bundle"
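To confirm that Pacemaker is no longer managing the Galera resource before you continue, you can inspect the cluster status on the same Controller node. This is an optional check; Pacemaker flags unmanaged resources as such in the status output:
(undercloud) $ ssh heat-admin@192.168.0.47 "sudo pcs status | grep -i -A 3 galera"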
16.4. Replacing a Controller node
To replace a Controller node, identify the index of the node that you want to replace.
- If the node is a virtual node, identify the node that contains the failed disk and restore the disk from a backup. Ensure that the MAC address of the NIC used for PXE boot on the failed server remains the same after disk replacement.
- If the node is a bare metal node, replace the disk, prepare the new disk with your overcloud configuration, and perform a node introspection on the new hardware.
- If the node is a part of a high availability cluster with fencing, you might need to recover the Galera nodes separately. For more information, see the article How Galera works and how to rescue Galera clusters in the context of Red Hat OpenStack Platform.
Complete the following example steps to replace the overcloud-controller-1 node with the overcloud-controller-3 node. The overcloud-controller-3 node has the ID 75b25e9a-948d-424a-9b3b-f0ef70a6eacf.
To replace the node with an existing bare metal node, enable maintenance mode on the outgoing node so that the director does not automatically reprovision the node.
Replacement of an overcloud Controller might cause the Swift rings to become inconsistent across nodes, which can decrease the availability of the Object Storage service. This is a known issue. If this happens, log in to the previously existing Controller node using SSH, deploy the updated rings, and restart the Object Storage containers.
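A minimal sketch of the restart portion of that recovery, assuming systemd-managed TripleO container services whose unit names start with tripleo_swift; verify the exact unit names on your system before you run anything, and substitute the real Controller IP address for the placeholder:
$ ssh heat-admin@<controller_ip>
$ sudo systemctl list-units 'tripleo_swift*'
$ sudo systemctl restart 'tripleo_swift*'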
Procedure
Source the stackrc file:
$ source ~/stackrc
Identify the index of the overcloud-controller-1 node:
$ INSTANCE=$(openstack server list --name overcloud-controller-1 -f value -c ID)
Identify the bare metal node associated with the instance:
$ NODE=$(openstack baremetal node list -f csv --quote minimal | grep $INSTANCE | cut -f1 -d,)
Set the node to maintenance mode:
$ openstack baremetal node maintenance set $NODE
If the Controller node is a virtual node, run the following command on the Controller host to replace the virtual disk from a backup:
$ cp <VIRTUAL_DISK_BACKUP> /var/lib/libvirt/images/<VIRTUAL_DISK>
Replace <VIRTUAL_DISK_BACKUP> with the path to the backup of the failed virtual disk, and replace <VIRTUAL_DISK> with the name of the virtual disk that you want to replace.
If you do not have a backup of the outgoing node, you must use a new virtualized node.
If the Controller node is a bare metal node, complete the following steps to replace the disk with a new bare metal disk:
- Replace the physical hard drive or solid state drive.
- Prepare the node with the same configuration as the failed node.
List unassociated nodes and identify the ID of the new node:
$ openstack baremetal node list --unassociated
Tag the new node with the control profile:
(undercloud) $ openstack baremetal node set --property capabilities='profile:control,boot_option:local' 75b25e9a-948d-424a-9b3b-f0ef70a6eacf
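If you want to confirm the profile assignment before you trigger the replacement, the director profile listing is a convenient optional check:
(undercloud) $ openstack overcloud profiles list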
16.5. Triggering the Controller node replacement
Complete the following steps to remove the old Controller node and replace it with a new Controller node.
Procedure
Determine the UUID of the node that you want to remove and store it in the NODEID variable. Ensure that you replace NODE_NAME with the name of the node that you want to remove:
$ NODEID=$(openstack server list -f value -c ID --name NODE_NAME)
To identify the Heat resource ID, enter the following command:
$ openstack stack resource show overcloud ControllerServers -f json -c attributes | jq --arg NODEID "$NODEID" -c '.attributes.value | keys[] as $k | if .[$k] == $NODEID then "Node index \($k) for \(.[$k])" else empty end'
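The command prints one line for the matching node in the format defined by the jq expression, for example (illustrative only; the actual index and ID depend on your deployment):
"Node index 1 for <NODEID>"
Note the index, because you use it in the next step.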
Create the following environment file ~/templates/remove-controller.yaml and include the node index of the Controller node that you want to remove:
parameters:
  ControllerRemovalPolicies: [{'resource_list': ['NODE_INDEX']}]
Enter the overcloud deployment command, and include the remove-controller.yaml environment file with any other environment files relevant to your environment:
(undercloud) $ openstack overcloud deploy --templates \
    -e /home/stack/templates/remove-controller.yaml \
    -e /home/stack/templates/node-info.yaml \
    [OTHER OPTIONS]
Note: Include -e ~/templates/remove-controller.yaml only for this instance of the deployment command. Remove this environment file from subsequent deployment operations.
Director removes the old node, creates a new node, and updates the overcloud stack. You can check the status of the overcloud stack with the following command:
(undercloud) $ openstack stack list --nested
When the deployment command completes, director shows that the old node is replaced with the new node. The new node now hosts running control plane services.
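To verify the result from the undercloud, you can list the overcloud servers and confirm that the new Controller node appears in place of the old one; the exact output depends on your environment:
(undercloud) $ openstack server list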
16.6. Cleaning up after Controller node replacement
After you complete the node replacement, complete the following steps to finalize the Controller cluster.
Procedure
- Log in to a Controller node.
Enable Pacemaker management of the Galera cluster and start Galera on the new node:
[heat-admin@overcloud-controller-0 ~]$ sudo pcs resource refresh galera-bundle
[heat-admin@overcloud-controller-0 ~]$ sudo pcs resource manage galera-bundle
Perform a final status check to ensure that the services are running correctly:
[heat-admin@overcloud-controller-0 ~]$ sudo pcs status
Note: If any services have failed, use the pcs resource refresh command to resolve and restart the failed services.
Exit to director:
[heat-admin@overcloud-controller-0 ~]$ exit
Source the overcloudrc file so that you can interact with the overcloud:
$ source ~/overcloudrc
Check the network agents in your overcloud environment:
(overcloud) $ openstack network agent list
If any agents appear for the old node, remove them:
(overcloud) $ for AGENT in $(openstack network agent list --host overcloud-controller-1.localdomain -c ID -f value) ; do openstack network agent delete $AGENT ; done
If necessary, add your router to the L3 agent host on the new node. Use the following example command to add a router named r1 to the L3 agent using the UUID 2d1c1dc1-d9d4-4fa9-b2c8-f29cd1a649d4:
(overcloud) $ openstack network agent add router --l3 2d1c1dc1-d9d4-4fa9-b2c8-f29cd1a649d4 r1
Because compute services for the removed node still exist in the overcloud, you must remove them. First, check the compute services for the removed node:
[stack@director ~]$ source ~/overcloudrc
(overcloud) $ openstack compute service list --host overcloud-controller-1.localdomain
Remove the compute services for the removed node:
(overcloud) $ for SERVICE in $(openstack compute service list --host overcloud-controller-1.localdomain -c ID -f value ) ; do openstack compute service delete $SERVICE ; done
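As a final check, you can rerun the listing for the removed node; it should return no compute services:
(overcloud) $ openstack compute service list --host overcloud-controller-1.localdomain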