Chapter 3. Maintaining the process of evacuating instances from failed Compute nodes
The Red Hat OpenStack Services on OpenShift (RHOSO) high availability for Compute instances (Instance HA) service assists you in managing the process of evacuating instances from failed Compute nodes. However, this service requires your supervision and at times you need to take action.
The Instance HA service ignores any Compute node that you have intentionally disabled. You can intentionally disable a Compute node for the following reasons:
- You can intentionally disable healthy Compute nodes to ensure that there is sufficient capacity to evacuate the instances of failed Compute nodes. For more information, see Reserving healthy Compute nodes.
- You can intentionally disable Compute nodes that you must maintain or configure because the Instance HA service is not notified when a Compute node is being rebooted or powered off. For more information, see Disabling the evacuation of Compute nodes.
- You can prevent the Instance HA service from evacuating the failed Compute nodes by setting the DISABLED Instance HA service parameter to true. In this mode, the Instance HA service monitors your Compute nodes and logs those that have failed but does not do anything further. For more information, see Editing the Instance HA service parameters.
- You can prevent the Instance HA service from re-enabling the failed Compute nodes after they have been evacuated by setting the LEAVE_DISABLED Instance HA service parameter to true, because their failure might indicate faulty hardware and you do not want the Compute nodes to simply fail again.
- You can view the log file produced by the Instance HA service to troubleshoot this process. For more information, see Troubleshooting the Instance HA service.
- When the evacuation of a Compute node is either slow or unsuccessful, you can manually set the Forced Down flag to false. For more information, see Rehabilitating evacuated Compute nodes.
3.1. Disabling the evacuation of Compute nodes
The Red Hat OpenStack Services on OpenShift (RHOSO) high availability for Compute instances (Instance HA) service is not notified when a Compute node is being rebooted or powered off. Therefore, you must disable the evacuation of Compute nodes that you maintain or configure.
Procedure
- If you plan to configure or maintain a limited number of Compute nodes, you can individually disable each Compute node. For example, you can use the following command to disable a Compute node named compute-7 for maintenance:
  $ openstack compute service set --disable --disable-reason "maintenance" compute-7 nova-compute
  Note: You can use the optional --disable-reason argument to specify your reason for disabling the Compute node. Do not use the word reserved in your description, because the Instance HA service uses this word to identify healthy Compute nodes that are intentionally disabled to provide reserve capacity for evacuating instances. For more information, see Reserving healthy Compute nodes.
- If you plan to configure or maintain a large number of Compute nodes, you can temporarily prevent the Instance HA service from evacuating all failed Compute nodes by setting the DISABLED Instance HA service parameter to true. For more information, see Editing the Instance HA service parameters.
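When you need to disable a handful of Compute nodes for the same maintenance window, the single-node command can be wrapped in a loop. The following is a minimal sketch rather than a supported tool: the function names and the OPENSTACK_CMD override (used here only so the commands can be printed instead of executed) are hypothetical, and the case-insensitive check for the word reserved is a cautious assumption about how strictly that word should be avoided.

```shell
# Hypothetical override: set OPENSTACK_CMD=echo to print the commands instead of running them.
OPENSTACK_CMD="${OPENSTACK_CMD:-openstack}"

# Disable nova-compute on every host given after the reason.
# Usage: disable_for_maintenance "<reason>" <host> [<host> ...]
disable_for_maintenance() {
    local reason="$1"
    shift
    # The Instance HA service treats "reserved" as a marker for reserve capacity,
    # so refuse any reason that contains it (checked case-insensitively as a precaution).
    case "$(printf '%s' "$reason" | tr '[:upper:]' '[:lower:]')" in
        *reserved*)
            echo "error: the disable reason must not contain 'reserved'" >&2
            return 1
            ;;
    esac
    local host
    for host in "$@"; do
        $OPENSTACK_CMD compute service set --disable --disable-reason "$reason" "$host" nova-compute || return 1
    done
}

# Re-enable nova-compute on every host once maintenance is finished.
# Usage: enable_after_maintenance <host> [<host> ...]
enable_after_maintenance() {
    local host
    for host in "$@"; do
        $OPENSTACK_CMD compute service set --enable "$host" nova-compute || return 1
    done
}
```

For example, `disable_for_maintenance "maintenance" compute-7 compute-8` disables both nodes, and `enable_after_maintenance compute-7 compute-8` re-enables them afterwards. The host names here are placeholders.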
3.2. Troubleshooting the Instance HA service
The Red Hat OpenStack Services on OpenShift (RHOSO) high availability for Compute instances (Instance HA) service pod provides a log file which records the status of the process of evacuating instances from your failed Compute nodes.
These log file entries are created and associated with the fully qualified pod name, which includes the unique string that is appended to the .metadata.name that you specified in the manifest file, for example, instanceha-0-54f865b6dd-w6h4t.
When the Instance HA service pod restarts, a new unique string is appended to its name and all the log entries associated with the previous unique Instance HA service pod name are removed. You can implement centralized logging to prevent the loss of the log file entries.
The Instance HA service pod restarts when you update the configuration of the Instance HA service pod, for example, by editing an Instance HA service parameter which changes its ConfigMap or by changing the fencingSecret YAML file and reapplying this file.
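Because a restart removes the log entries of the previous pod, it can help to save a copy of the current logs before you apply a change that restarts the pod. The following sketch is a stopgap and not a substitute for centralized logging. The function name, the destination directory argument, and the OC_CMD override are hypothetical; the service=instanceha label is the one used by the oc logs command in this section, and the commands run against your current project.

```shell
# Hypothetical override: lets the oc binary be replaced, for example in a dry run.
OC_CMD="${OC_CMD:-oc}"

# Save the current log of every Instance HA service pod to <dest_dir>/<podname>.log.
# Usage: archive_instanceha_logs <dest_dir>
archive_instanceha_logs() {
    local dest_dir="$1"
    mkdir -p "$dest_dir" || return 1
    local pod
    # "-o name" prints one resource per line in the form pod/<podname>.
    $OC_CMD get pods -l service=instanceha -o name | while read -r pod; do
        # Strip the "pod/" prefix to build the file name.
        $OC_CMD logs "$pod" > "$dest_dir/${pod#pod/}.log"
    done
}
```

Running `archive_instanceha_logs ./instanceha-logs` before you edit a parameter or reapply the fencingSecret file keeps a local copy of the entries that the restart would otherwise remove.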
By default, the Instance HA service log file provides the information, warning, and error log messages, because the LOGLEVEL Instance HA service parameter is set to info. When you are troubleshooting the Instance HA service, you can change the LOGLEVEL parameter to debug to increase the number and detail of the log messages. For more information, see Editing the Instance HA service parameters.
You can use the following commands to view the Instance HA service pod log file:
$ oc logs -l service=instanceha
$ oc logs <podname>
- Replace <podname> with the fully qualified name of your deployed Instance HA service pod, for example, instanceha-0-54f865b6dd-w6h4t. Run the following command to determine this fully qualified pod name:
  $ oc get pods | grep instanceha
  The oc logs <podname> command is useful when you have multiple clouds defined and you have deployed a separate Instance HA service pod to monitor each cloud. For more information, see Configuring the Instance HA service pod specification.
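When several Instance HA service pods are deployed, one for each cloud, it can be convenient to print the most recent log lines of each pod under its own heading. The following is a rough sketch under the same assumptions as the commands above: the pods carry the service=instanceha label and are in your current project. The function name, the default of 20 lines, and the OC_CMD override are hypothetical.

```shell
# Hypothetical override: lets the oc binary be replaced, for example in a dry run.
OC_CMD="${OC_CMD:-oc}"

# Print the last <lines> log lines of each Instance HA service pod, one section per pod.
# Usage: tail_instanceha_logs [<lines>]
tail_instanceha_logs() {
    local lines="${1:-20}"
    local pod
    $OC_CMD get pods -l service=instanceha -o name | while read -r pod; do
        echo "=== ${pod#pod/} ==="
        $OC_CMD logs --tail="$lines" "$pod"
    done
}
```

The headings make it clear which pod, and therefore which cloud, each group of log lines belongs to.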
The following is an example of the log file messages generated for a successful evacuation of a Compute node when the LOGLEVEL Instance HA service parameter is set to info.
3.3. Rehabilitating evacuated Compute nodes
During the normal process of evacuating instances from a failed Compute node, the Red Hat OpenStack Services on OpenShift (RHOSO) high availability for Compute instances (Instance HA) service sets the Forced Down flag of the Compute node to true. This ensures that the Compute node is successfully evacuated and, if it is re-enabled, that it is successfully rebooted. When this process completes, the Instance HA service sets the Forced Down flag of the Compute node back to false.
When the evacuation of a Compute node is either slow or unsuccessful, you can set the Forced Down flag of the Compute node to false.
Procedure
1. Use the --long argument to display the status of the Forced Down flags when viewing the list of Compute nodes:
   $ openstack compute service list --long
   A Compute node, for example compute-fn93pyp7-2.ctlplane.example.com, that has a Status of enabled while its Forced Down flag is still set to true indicates that something has gone wrong during the evacuation of this Compute node.
2. Force the Compute node up:
   $ openstack compute service set --up <Compute_node> nova-compute
   Replace <Compute_node> with the listed name of the Compute node, for example, compute-fn93pyp7-2.ctlplane.example.com.
   The Forced Down flag of <Compute_node> is set to false.
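In a large deployment, spotting the Compute nodes that are enabled but still forced down can be easier with a small filter than by reading the full table. The following sketch makes several assumptions: the function names and the OPENSTACK_CMD override are hypothetical, the value output format is assumed to print the Host, Status, and Forced Down columns in that order, and the Forced Down value is assumed to be printed as True or False (the comparison is case-insensitive to be safe).

```shell
# Hypothetical override: lets the openstack binary be replaced, for example in a dry run.
OPENSTACK_CMD="${OPENSTACK_CMD:-openstack}"

# Read "<host> <status> <forced_down>" rows on stdin and print the hosts that are
# enabled but still forced down.
filter_stuck_nodes() {
    awk '$2 == "enabled" && tolower($3) == "true" { print $1 }'
}

# List the nova-compute services and keep only the hosts that match the filter.
list_stuck_nodes() {
    $OPENSTACK_CMD compute service list --long --service nova-compute \
        -f value -c Host -c Status -c "Forced Down" | filter_stuck_nodes
}
```

Each host that `list_stuck_nodes` prints is a candidate for the --up command in this procedure. Investigate why the evacuation stalled before you force the node up.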