Chapter 13. Replacing Controller Nodes
In certain circumstances a Controller node in a high availability cluster might fail. In these situations, you must remove the node from the cluster and replace it with a new Controller node.
			Complete the steps in this section to replace a Controller node. The Controller node replacement process involves running the openstack overcloud deploy command to update the overcloud with a request to replace a Controller node.
		
The following procedure applies only to high availability environments. Do not use this procedure if your environment uses only one Controller node.
13.1. Preparing for Controller replacement
Before attempting to replace an overcloud Controller node, it is important to check the current state of your Red Hat OpenStack Platform environment. Checking the current state can help avoid complications during the Controller replacement process. Use the following list of preliminary checks to determine if it is safe to perform a Controller node replacement. Run all commands for these checks on the undercloud.
Procedure
Check the current status of the overcloud stack on the undercloud:

$ source stackrc
(undercloud) $ openstack stack list --nested

The overcloud stack and its subsequent child stacks should have either a CREATE_COMPLETE or UPDATE_COMPLETE status.

Perform a backup of the undercloud databases:

(undercloud) $ mkdir /home/stack/backup
(undercloud) $ sudo mysqldump --all-databases --quick --single-transaction | gzip > /home/stack/backup/dump_db_undercloud.sql.gz

Check that your undercloud contains 10 GB of free storage to accommodate image caching and conversion when provisioning the new node.
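For example, one quick way to check the available space is to run df against the file system where the undercloud stores its data; the path below is an assumption and you might need to adjust it for your layout:

(undercloud) $ df -h /var/lib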
If you are reusing the IP address for the new Controller node, ensure that you delete the port that the old Controller node used:
(undercloud) $ openstack port delete <port>
Check the status of Pacemaker on the running Controller nodes. For example, if 192.168.0.47 is the IP address of a running Controller node, use the following command to get the Pacemaker status:
(undercloud) $ ssh heat-admin@192.168.0.47 'sudo pcs status'
The output should show all services running on the existing nodes and stopped on the failed node.
Check the following parameters on each node of the overcloud MariaDB cluster:
- wsrep_local_state_comment: Synced
- wsrep_cluster_size: 2

Use the following command to check these parameters on each running Controller node. In this example, the Controller node IP addresses are 192.168.0.47 and 192.168.0.46:
(undercloud) $ for i in 192.168.0.47 192.168.0.46 ; do echo "*** $i ***" ; ssh heat-admin@$i "sudo mysql -p\$(sudo hiera -c /etc/puppet/hiera.yaml mysql::server::root_password) --execute=\"SHOW STATUS LIKE 'wsrep_local_state_comment'; SHOW STATUS LIKE 'wsrep_cluster_size';\""; done
 Check the RabbitMQ status. For example, if 192.168.0.47 is the IP address of a running Controller node, use the following command to get the status:
(undercloud) $ ssh heat-admin@192.168.0.47 "sudo docker exec \$(sudo docker ps -f name=rabbitmq-bundle -q) rabbitmqctl cluster_status"
The running_nodes key should only show the two available nodes and not the failed node.

Disable fencing, if enabled. For example, if 192.168.0.47 is the IP address of a running Controller node, use the following command to disable fencing:
(undercloud) $ ssh heat-admin@192.168.0.47 "sudo pcs property set stonith-enabled=false"
Check the fencing status with the following command:
(undercloud) $ ssh heat-admin@192.168.0.47 "sudo pcs property show stonith-enabled"
Check the nova-compute service on the director node:

(undercloud) $ sudo systemctl status openstack-nova-compute
(undercloud) $ openstack hypervisor list

The output should show all non-maintenance mode nodes as up.

Make sure all undercloud services are running:
(undercloud) $ sudo systemctl -t service
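Optionally, you can also confirm that no undercloud service is in a failed state by listing failed systemd units, for example:

(undercloud) $ sudo systemctl list-units --state=failed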
13.2. Restoring a Controller node from a backup or snapshot
In certain cases where a Controller node fails but the physical disk is still functional, you can restore the node from a backup or a snapshot without replacing the node itself.
Ensure that the MAC address of the NIC used for PXE boot on the failed Controller node remains the same after disk replacement.
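One way to review the MAC address that OpenStack Bare Metal (ironic) has registered for the node is to list the node ports from the undercloud and compare them with the replacement hardware. The node reference in this sketch is a placeholder:

(undercloud) $ openstack baremetal port list --node <controller_node_name_or_uuid>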
Procedure
- If the Controller node is a Red Hat Virtualization node and you use snapshots to back up your Controller nodes, restore the node from the snapshot. For more information, see "Using a Snapshot to Restore a Virtual Machine" in the Red Hat Virtualization Virtual Machine Management Guide.
 - If the Controller node is a Red Hat Virtualization node and you use a backup storage domain, restore the node from the backup storage domain. For more information, see "Backing Up and Restoring Virtual Machines Using a Backup Storage Domain" in the Red Hat Virtualization Administration Guide.
 - If you have a backup image of the Controller node from the Relax-and-Recover (ReaR) tool, restore the node using the ReaR tool. For more information, see "Restoring the control plane" in the Undercloud and Control Plane Back Up and Restore guide.
 - After recovering the node from backup or snapshot, you might need to recover the Galera nodes separately. For more information, see the article How Galera works and how to rescue Galera clusters in the context of Red Hat OpenStack Platform.
- After you complete the backup restoration, run your openstack overcloud deploy command with all necessary environment files to ensure that the Controller node configuration matches the configuration of the other nodes in the cluster (a sketch of this command follows this list).
- If you do not have a backup of the node, you must follow the standard Controller replacement procedure.
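The exact deployment command depends on your environment. The following is a minimal sketch that assumes you pass the same templates and environment files that you used for the original deployment; the environment file names are placeholders:

(undercloud) $ source ~/stackrc
(undercloud) $ openstack overcloud deploy --templates \
  -e <environment_file_1> \
  -e <environment_file_2> \
  [OTHER OPTIONS]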
 
13.3. Removing a Ceph Monitor daemon
If your Controller node runs a Ceph Monitor service, follow this procedure to remove the ceph-mon daemon from the storage cluster. This procedure assumes that the Controller node is reachable.
			
Adding a new Controller to the cluster also adds a new Ceph monitor daemon automatically.
Procedure
Connect to the Controller you want to replace and become root:
ssh heat-admin@192.168.0.47
sudo su -

Note: If the Controller node is unreachable, skip steps 1 and 2 and continue the procedure at step 3 on any working Controller node.
As root, stop the monitor:
systemctl stop ceph-mon@<monitor_hostname>
For example:
systemctl stop ceph-mon@overcloud-controller-1
Remove the monitor from the cluster:
ceph mon remove <mon_id>
On the Ceph monitor node, remove the monitor entry from /etc/ceph/ceph.conf. For example, if you remove controller-1, then remove the IP and hostname for controller-1.

Before:
mon host = 172.18.0.21,172.18.0.22,172.18.0.24
mon initial members = overcloud-controller-2,overcloud-controller-1,overcloud-controller-0
After:
mon host = 172.18.0.22,172.18.0.24
mon initial members = overcloud-controller-2,overcloud-controller-0
Apply the same change to /etc/ceph/ceph.conf on the other overcloud nodes.

Note: The director updates the ceph.conf file on the relevant overcloud nodes when you add the replacement Controller node. Normally, director manages this configuration file exclusively and you should not edit the file manually. However, you can edit the file manually to ensure consistency in case the other nodes restart before you add the new node.

Optionally, archive the monitor data and save the archive on another server:
mv /var/lib/ceph/mon/<cluster>-<daemon_id> /var/lib/ceph/mon/removed-<cluster>-<daemon_id>
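Optionally, verify that the monitor no longer appears in the monitor map. This check assumes that the ceph client and an admin keyring are available on the node:

# ceph mon stat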
13.4. Preparing the cluster for Controller replacement
Before replacing the old node, you must ensure that Pacemaker is no longer running on the node and then remove that node from the Pacemaker cluster.
Procedure
Get a list of IP addresses for the Controller nodes:
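For example, a command such as the following lists the node names and their network addresses; the column names assume the default openstack client output:

(undercloud) $ openstack server list -c Name -c Networks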
If the old node is still reachable, log in to one of the remaining nodes and stop pacemaker on the old node. For this example, stop pacemaker on overcloud-controller-1:
(undercloud) $ ssh heat-admin@192.168.0.47 "sudo pcs status | grep -w Online | grep -w overcloud-controller-1"
(undercloud) $ ssh heat-admin@192.168.0.47 "sudo pcs cluster stop overcloud-controller-1"
Note: If the old node is physically unavailable or stopped, you do not need to perform the previous operation, because Pacemaker is already stopped on that node.
After you stop Pacemaker on the old node (that is, the node is shown as Stopped in pcs status), delete the old node from the corosync configuration on each node and restart Corosync. For this example, the following command logs in to overcloud-controller-0 and overcloud-controller-2 and removes the node:

(undercloud) $ for NAME in overcloud-controller-0 overcloud-controller-2; do IP=$(openstack server list -c Networks -f value --name $NAME | cut -d "=" -f 2) ; ssh heat-admin@$IP "sudo pcs cluster localnode remove overcloud-controller-1; sudo pcs cluster reload corosync"; done

Log in to one of the remaining nodes and delete the node from the cluster with the crm_node command:

(undercloud) $ ssh heat-admin@192.168.0.47
[heat-admin@overcloud-controller-0 ~]$ sudo crm_node -R overcloud-controller-1 --force

The overcloud database must continue to run during the replacement procedure. To ensure that Pacemaker does not stop Galera during this procedure, select a running Controller node and run the following command on the undercloud using the Controller node's IP address:
(undercloud) $ ssh heat-admin@192.168.0.47 "sudo pcs resource unmanage galera-bundle"
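Optionally, confirm that the galera-bundle resource is now unmanaged before you continue. The exact output depends on your Pacemaker version; look for the galera-bundle entries marked as unmanaged:

(undercloud) $ ssh heat-admin@192.168.0.47 "sudo pcs status | grep -A3 galera"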
13.5. Reusing a Controller node
You can reuse a failed Controller node and redeploy it as a new node. Use this method when you do not have an extra node to use for replacement.
Procedure
Source the stackrc file:

$ source ~/stackrc

Disassociate the failed node from the overcloud:
openstack baremetal node undeploy <FAILED_NODE>
Replace <FAILED_NODE> with the UUID of the failed node. This command disassociates the node in OpenStack Bare Metal (ironic) from the overcloud servers in OpenStack Compute (nova). If you have enabled node cleaning, this command also removes the file system from the node disks.

Tag the new node with the control profile:

(undercloud) $ openstack baremetal node set --property capabilities='profile:control,boot_option:local' <FAILED NODE>
If your Controller node failed due to a faulty disk, you can replace the disk at this point and perform an introspection on the node to refresh the introspection data from the new disk.
$ openstack baremetal node manage <FAILED NODE>
$ openstack overcloud node introspect --all-manageable --provide
				The failed node is now ready for the node replacement and redeployment. When you perform the node replacement, the failed node acts as a new node and uses an increased index. For example, if your control plane cluster contains overcloud-controller-0, overcloud-controller-1, and overcloud-controller-2 and you reuse overcloud-controller-1 as a new node, the new node name will be overcloud-controller-3.
			
13.6. Reusing a BMC IP address
You can replace a failed Controller node with a new node but retain the same BMC IP address. Remove the failed node, reassign the BMC IP address, add the new node as a new baremetal record, and execute introspection.
Procedure
Source the stackrc file:

$ source ~/stackrc

Remove the failed node:
$ openstack baremetal node undeploy <FAILED_NODE>
$ openstack baremetal node maintenance set <FAILED_NODE>
$ openstack baremetal node delete <FAILED_NODE>
Replace <FAILED_NODE> with the UUID of the failed node. The openstack baremetal node delete command might fail temporarily if there is a previous command in the queue. If the openstack baremetal node delete command fails, wait for the previous command to complete. This might take up to five minutes.

Assign the BMC IP address of the failed node to the new node.
 Add the new node as a new baremetal record:
openstack overcloud node import newnode.json
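The newnode.json file uses the same node definition format that you used when you first registered nodes. The following is a minimal sketch; every value, including the pm_type driver, is a placeholder that you must replace with the details of your hardware and the reassigned BMC IP address:

{
  "nodes": [
    {
      "name": "overcloud-replacement-controller",
      "mac": ["dd:dd:dd:dd:dd:dd"],
      "pm_type": "ipmi",
      "pm_user": "admin",
      "pm_password": "<password>",
      "pm_addr": "<reassigned_bmc_ip_address>"
    }
  ]
}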
For more information about registering overcloud nodes, see Registering nodes for the overcloud.
Perform introspection on the new node:
openstack overcloud node introspect --all-manageable --provide
List unassociated nodes and identify the ID of the new node:
openstack baremetal node list --unassociated
Tag the new node with the control profile:

(undercloud) $ openstack baremetal node set --property capabilities='profile:control,boot_option:local' <NEW NODE UUID>
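Optionally, verify the profile assignment, for example:

(undercloud) $ openstack overcloud profiles list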
13.7. Triggering the Controller node replacement
Complete the following steps to remove the old Controller node and replace it with a new Controller node.
Procedure
Determine the UUID of the node that you want to remove and store it in the NODEID variable. Ensure that you replace NODE_NAME with the name of the node that you want to remove:

$ NODEID=$(openstack server list -f value -c ID --name NODE_NAME)

To identify the Heat resource ID, enter the following command:
openstack stack resource show overcloud ControllerServers -f json -c attributes | jq --arg NODEID "$NODEID" -c '.attributes.value | keys[] as $k | if .[$k] == $NODEID then "Node index \($k) for \(.[$k])" else empty end'
Create the following environment file ~/templates/remove-controller.yaml and include the node index of the Controller node that you want to remove:

parameters:
  ControllerRemovalPolicies: [{'resource_list': ['NODE_INDEX']}]
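For example, if the node index that you identified in the previous step is 1, the file contents would be:

parameters:
  ControllerRemovalPolicies: [{'resource_list': ['1']}]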
Run your overcloud deployment command, including the remove-controller.yaml environment file along with any other environment files relevant to your environment:

(undercloud) $ openstack overcloud deploy --templates \
  -e /home/stack/templates/remove-controller.yaml \
  [OTHER OPTIONS]

Note: Include -e ~/templates/remove-controller.yaml only for this instance of the deployment command. Remove this environment file from subsequent deployment operations.

The director removes the old node, creates a new node with the next node index ID, and updates the overcloud stack. You can check the status of the overcloud stack with the following command:
(undercloud) $ openstack stack list --nested
After the deployment command completes, the director shows the old node replaced with the new node. The new node now hosts running control plane services.
13.8. Cleaning up after Controller node replacement
After completing the node replacement, complete the following steps to finalize the Controller cluster.
Procedure
- Log in to a Controller node.
 Enable Pacemaker management of the Galera cluster and start Galera on the new node:
[heat-admin@overcloud-controller-0 ~]$ sudo pcs resource refresh galera-bundle
[heat-admin@overcloud-controller-0 ~]$ sudo pcs resource manage galera-bundle

Perform a final status check to make sure services are running correctly:
[heat-admin@overcloud-controller-0 ~]$ sudo pcs status

Note: If any services have failed, use the pcs resource refresh command to resolve and restart the failed services.

Exit to the director node:
[heat-admin@overcloud-controller-0 ~]$ exit

Source the overcloudrc file so that you can interact with the overcloud:

$ source ~/overcloudrc

Check the network agents in your overcloud environment:
(overcloud) $ openstack network agent list
If any agents appear for the old node, remove them:
(overcloud) $ for AGENT in $(openstack network agent list --host overcloud-controller-1.localdomain -c ID -f value) ; do openstack network agent delete $AGENT ; done
If necessary, add your hosting router to the L3 agent on the new node. Use the following example command to add a hosting router r1 to the L3 agent using the UUID 2d1c1dc1-d9d4-4fa9-b2c8-f29cd1a649d4:
(overcloud) $ openstack network agent add router --l3 2d1c1dc1-d9d4-4fa9-b2c8-f29cd1a649d4 r1
Compute services for the removed node still exist in the overcloud and require removal. Check the compute services for the removed node:
[stack@director ~]$ source ~/overcloudrc
(overcloud) $ openstack compute service list --host overcloud-controller-1.localdomain

Remove the compute services for the removed node:
(overcloud) $ for SERVICE in $(openstack compute service list --host overcloud-controller-1.localdomain -c ID -f value ) ; do openstack compute service delete $SERVICE ; done
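Optionally, run a final check to confirm that only the active Controller nodes remain in the service and agent lists, for example:

(overcloud) $ openstack compute service list
(overcloud) $ openstack network agent list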