Chapter 3. Maintaining the process of evacuating instances from failed Compute nodes
The Red Hat OpenStack Services on OpenShift (RHOSO) high availability for Compute instances (Instance HA) service assists you in managing the process of evacuating instances from failed Compute nodes. However, this service requires your supervision and at times you need to take action.
The Instance HA service ignores any Compute node that you have intentionally disabled. You can intentionally disable a Compute node for the following reasons:
- You can intentionally disable healthy Compute nodes to ensure that there is sufficient capacity to evacuate the instances of failed Compute nodes. For more information, see Reserving healthy Compute nodes.
- You can intentionally disable Compute nodes that you must maintain or configure because the Instance HA service is not notified when a Compute node is being rebooted or powered off. For more information, see Disabling the evacuation of Compute nodes.
- You can prevent the Instance HA service from evacuating the failed Compute nodes by setting the DISABLED Instance HA service parameter to true. In this mode, the Instance HA service monitors your Compute nodes and logs those that have failed but does not do anything further. For more information, see Editing the Instance HA service parameters.
- You can prevent the Instance HA service from re-enabling the failed Compute nodes after they have been evacuated by setting the LEAVE_DISABLED Instance HA service parameter to true, since their failure might indicate faulty hardware and you do not want the Compute nodes to simply fail again.
- You can view the log file produced by the Instance HA service to troubleshoot this process. For more information, see Troubleshooting the Instance HA service.
- When the evacuation of a Compute node is either slow or unsuccessful, you can manually set the Forced Down flag to false. For more information, see Rehabilitating evacuated Compute nodes.
3.1. Disabling the evacuation of Compute nodes
The Red Hat OpenStack Services on OpenShift (RHOSO) high availability for Compute instances (Instance HA) service is not notified when a Compute node is being rebooted or powered off. Therefore, you must disable the evacuation of Compute nodes that you maintain or configure.
Procedure
- If you plan to configure or maintain a limited number of Compute nodes, you can individually disable each Compute node. For example, you can use the following command to disable a Compute node named compute-7 for maintenance:
  $ openstack compute service set --disable --disable-reason "maintenance" compute-7 nova-compute
  Note: You can use the optional --disable-reason argument to specify your reason for disabling the Compute node. Do not use the word reserved in your description, because the Instance HA service uses this word to identify healthy Compute nodes that are intentionally disabled to provide reserve capacity for evacuating instances. For more information, see Reserving healthy Compute nodes.
- If you plan to configure or maintain a large number of Compute nodes, you can temporarily disable the Instance HA service from evacuating all failed Compute nodes by setting the DISABLED Instance HA service parameter to true. For more information, see Editing the Instance HA service parameters.
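When several Compute nodes need the same treatment, the per-node command can be generated rather than typed by hand. The following is a minimal Python sketch, not part of the product: the helper name build_disable_commands and the node names are hypothetical, and the case-insensitive check for the word reserved is a conservative assumption, since the source does not say how the Instance HA service matches that word. The sketch only builds the argument lists; it does not run them.

```python
import shlex

# Word that the Instance HA service uses to recognize reserve-capacity nodes.
RESERVED_WORD = "reserved"


def build_disable_commands(nodes, reason="maintenance"):
    """Return one `openstack compute service set --disable` argv list per node.

    Rejects any reason containing the word "reserved" (checked
    case-insensitively, which is an assumption made for safety).
    """
    if RESERVED_WORD in reason.lower():
        raise ValueError(
            "the disable reason must not contain the word 'reserved'"
        )
    return [
        [
            "openstack", "compute", "service", "set",
            "--disable", "--disable-reason", reason,
            node, "nova-compute",
        ]
        for node in nodes
    ]


if __name__ == "__main__":
    # Hypothetical node names; print the commands for review before running.
    for argv in build_disable_commands(["compute-7", "compute-8"]):
        print(shlex.join(argv))
```

Printing the commands first leaves the decision to run them with the operator, which suits a maintenance window where each node is usually handled deliberately.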
3.2. Troubleshooting the Instance HA service
The Red Hat OpenStack Services on OpenShift (RHOSO) high availability for Compute instances (Instance HA) service pod provides a log file which records the status of the process of evacuating instances from your failed Compute nodes.
These log file entries are created and associated with the fully qualified pod name, which includes the unique string that is appended to the .metadata.name that you specified in the manifest file, for example, instanceha-0-54f865b6dd-w6h4t.
When the Instance HA service pod restarts, a new unique string is appended to its name and all the log entries associated with the previous unique Instance HA service pod name are removed. You can implement centralized logging to prevent the loss of the log file entries.
The Instance HA service pod restarts when you update the configuration of the Instance HA service pod, for example, by editing an Instance HA service parameter which changes its ConfigMap or by changing the fencingSecret YAML file and reapplying this file.
By default, the Instance HA service log file provides the information, warning, and error log messages, because the LOGLEVEL Instance HA service parameter is set to info. When you are troubleshooting the Instance HA service, you can change the LOGLEVEL parameter to debug to increase the number and detail of the log messages. For more information, see Editing the Instance HA service parameters.
You can use the following commands to view the Instance HA service pod log file:
$ oc logs -l service=instanceha
$ oc logs <podname>
- Replace <podname> with the fully qualified name of your deployed Instance HA service pod, for example, instanceha-0-54f865b6dd-w6h4t. You must run this command to determine this fully qualified pod name:
  $ oc get pods | grep instanceha
  The oc logs <podname> command is useful when you have multiple clouds defined and you have deployed a separate Instance HA service pod to monitor each cloud. For more information, see Configuring the Instance HA service pod specification.
The following is an example of the log file messages generated for a successful evacuation of a Compute node when the LOGLEVEL Instance HA service parameter is set to info:
$ oc logs instanceha-0-54f865b6dd-w6h4t
2024-09-15 23:19:07,065 INFO Nova login successful
2024-09-15 23:21:38,105 WARNING The following computes are down:['compute-0.ctlplane.example.com']
2024-09-15 23:21:39,137 INFO Fencing compute-0.ctlplane.example.com
2024-09-15 23:21:39,137 INFO Fencing host compute-0.ctlplane.example.com off
2024-09-15 23:21:39,407 INFO Power off of compute-0.ctlplane.example.com ok
2024-09-15 23:21:39,824 INFO Nova login successful
2024-09-15 23:21:39,824 INFO Disabling compute-0.ctlplane.example.com before evacuation
2024-09-15 23:21:39,824 INFO Forcing compute-0.ctlplane.example.com down before evacuation
2024-09-15 23:21:40,094 INFO Service nova-compute on host compute-0.ctlplane.example.com is now disabled
2024-09-15 23:21:40,094 INFO Start evacuation of compute-0.ctlplane.example.com
2024-09-15 23:21:41,740 INFO Evacuation successful. Re-enabling compute-0.ctlplane.example.com
2024-09-15 23:21:41,741 INFO Fencing host compute-0.ctlplane.example.com on
2024-09-15 23:21:42,486 INFO Power on of compute-0.ctlplane.example.com ok
2024-09-15 23:21:42,486 INFO Trying to enable compute-0.ctlplane.example.com
2024-09-15 23:21:42,572 INFO Host compute-0.ctlplane.example.com is now enabled
2024-09-15 23:22:43,074 INFO Unsetting force-down on host compute-0.ctlplane.example.com after evacuation
2024-09-15 23:22:43,187 INFO Successfully unset force-down on host compute-0.ctlplane.example.com
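When a log covers several Compute nodes, it can help to check which hosts reached each stage of the sequence shown above. The following is a minimal Python sketch under stated assumptions: the message strings are copied only from the example output above and might differ in other versions or at other log levels, and the function names summarize_evacuations and incomplete_hosts are hypothetical. The sketch reads log text that you have already saved, for example from oc logs.

```python
import re

# Milestone messages taken from the example log output; treat the exact
# wording as an assumption that may not hold for every release.
MILESTONES = [
    ("started", re.compile(r"Start evacuation of (\S+)")),
    ("evacuated", re.compile(r"Evacuation successful\. Re-enabling (\S+)")),
    ("force_down_cleared",
     re.compile(r"Successfully unset force-down on host (\S+)")),
]

SAMPLE_LOG = """\
2024-09-15 23:21:40,094 INFO Start evacuation of compute-0.ctlplane.example.com
2024-09-15 23:21:41,740 INFO Evacuation successful. Re-enabling compute-0.ctlplane.example.com
2024-09-15 23:22:43,187 INFO Successfully unset force-down on host compute-0.ctlplane.example.com
"""


def summarize_evacuations(log_text):
    """Map each host to the ordered list of milestones found in the log."""
    progress = {}
    for line in log_text.splitlines():
        for name, pattern in MILESTONES:
            match = pattern.search(line)
            if match:
                progress.setdefault(match.group(1), []).append(name)
    return progress


def incomplete_hosts(progress):
    """Return hosts whose log never shows the force-down flag being unset."""
    return sorted(
        host for host, seen in progress.items()
        if "force_down_cleared" not in seen
    )


if __name__ == "__main__":
    summary = summarize_evacuations(SAMPLE_LOG)
    print(summary)
    print("needs attention:", incomplete_hosts(summary))
```

A host that appears in the second list is only a candidate for investigation; the evacuation might still be in progress when the log was captured.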
3.3. Rehabilitating evacuated Compute nodes
During the normal process of evacuating instances from a failed Compute node, the Red Hat OpenStack Services on OpenShift (RHOSO) high availability for Compute instances (Instance HA) service sets the Forced Down flag of the Compute node to true to ensure that it is successfully evacuated and, if it is re-enabled, that it is successfully rebooted. When this process completes normally, the Instance HA service sets the Forced Down flag of the Compute node back to false.
When the evacuation of a Compute node is either slow or unsuccessful, you can set the Forced Down flag of the Compute node to false.
Procedure
- Use the --long argument to display the status of their Forced Down flags when viewing the list of Compute nodes:
  $ openstack compute service list --long
  In this example, compute-fn93pyp7-2.ctlplane.example.com has a Status of enabled but the Forced Down flag is still set to true, which indicates that something has gone wrong during the evacuation of this Compute node.
  $ openstack compute service list --long
  +--------------------------------------+----------------+-----------------------------------------+----------+---------+-------+----------------------------+-----------------+-------------+
  | ID                                   | Binary         | Host                                    | Zone     | Status  | State | Updated At                 | Disabled Reason | Forced Down |
  +--------------------------------------+----------------+-----------------------------------------+----------+---------+-------+----------------------------+-----------------+-------------+
  ...
  | c9590263-7fbd-4782-85d8-7cf7526ed292 | nova-compute   | compute-fn93pyp7-0.ctlplane.example.com | nova     | enabled | up    | 2024-10-08T04:59:23.000000 | None            | False       |
  | 3319da3d-31e8-4877-a9d5-9f407e1356fa | nova-compute   | compute-fn93pyp7-1.ctlplane.example.com | nova     | enabled | up    | 2024-10-08T04:59:18.000000 | None            | False       |
  | 1b89c4b9-bf16-45c0-9517-8f652ea5a129 | nova-compute   | compute-fn93pyp7-2.ctlplane.example.com | nova     | enabled | down  | 2024-10-08T04:59:18.000000 | None            | True        |
- Force the Compute node up:
  $ openstack compute service set --up <Compute_node> nova-compute
  Replace <Compute_node> with the listed name of the Compute node, for example, compute-fn93pyp7-2.ctlplane.example.com.
  The Forced Down flag of <Compute_node> is set to false.
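On a large deployment, scanning the table by eye for the enabled-but-forced-down combination is error prone. The following is a minimal Python sketch of that check, assuming that you capture the listing in JSON form (for example with the generic -f json output option of the openstack client) and that the JSON keys match the column headings shown in the table above. Both points are assumptions to verify in your environment, and the function name stuck_forced_down is hypothetical.

```python
import json

# Two records shaped like the table rows above; the key names are assumed
# to mirror the column headings of the --long listing.
SAMPLE_JSON = """[
  {"Binary": "nova-compute",
   "Host": "compute-fn93pyp7-1.ctlplane.example.com",
   "Status": "enabled", "State": "up", "Forced Down": false},
  {"Binary": "nova-compute",
   "Host": "compute-fn93pyp7-2.ctlplane.example.com",
   "Status": "enabled", "State": "down", "Forced Down": true}
]"""


def stuck_forced_down(services):
    """Return hosts whose nova-compute service is enabled yet forced down."""
    stuck = []
    for svc in services:
        forced = svc.get("Forced Down")
        # Accept a JSON boolean or the string form printed in the table.
        is_forced = forced is True or str(forced).lower() == "true"
        if (
            svc.get("Binary") == "nova-compute"
            and svc.get("Status") == "enabled"
            and is_forced
        ):
            stuck.append(svc["Host"])
    return stuck


if __name__ == "__main__":
    for host in stuck_forced_down(json.loads(SAMPLE_JSON)):
        print(host)
```

Each host that the sketch reports is a candidate for the openstack compute service set --up command described in the procedure, after you have confirmed why its evacuation stalled.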