Chapter 3. Maintaining the process of evacuating instances from failed Compute nodes

The Red Hat OpenStack Services on OpenShift (RHOSO) high availability for Compute instances (Instance HA) service assists you in managing the process of evacuating instances from failed Compute nodes. However, this service requires your supervision and at times you need to take action.

The Instance HA service ignores any Compute node that you have intentionally disabled. You can intentionally disable a Compute node for the following reasons:
- You can intentionally disable healthy Compute nodes to ensure that there is sufficient capacity to evacuate the instances of failed Compute nodes. For more information, see Reserving healthy Compute nodes.
- You can intentionally disable Compute nodes that you must maintain or configure because the Instance HA service is not notified when a Compute node is being rebooted or powered off. For more information, see Disabling the evacuation of Compute nodes.

You can prevent the Instance HA service from evacuating failed Compute nodes by setting the DISABLED Instance HA service parameter to true. In this mode, the Instance HA service monitors your Compute nodes and logs those that have failed but takes no further action. For more information, see Editing the Instance HA service parameters.

You can prevent the Instance HA service from re-enabling failed Compute nodes after they have been evacuated by setting the LEAVE_DISABLED Instance HA service parameter to true. Use this setting when a failure might indicate faulty hardware and you do not want the Compute nodes to simply fail again.

You can view the log file produced by the Instance HA service to troubleshoot this process. For more information, see Troubleshooting the Instance HA service.

When the evacuation of a Compute node is either slow or unsuccessful, you can manually set the Forced Down flag of the Compute node to false. For more information, see Rehabilitating evacuated Compute nodes.

3.1. Disabling the evacuation of Compute nodes

The Red Hat OpenStack Services on OpenShift (RHOSO) high availability for Compute instances (Instance HA) service is not notified when a Compute node is being rebooted or powered off. Therefore, you must disable the evacuation of Compute nodes that you maintain or configure.

Procedure

If you plan to configure or maintain a limited number of Compute nodes, you can disable each Compute node individually. For example, you can use the following command to disable a Compute node named compute-7 for maintenance:
```
$ openstack compute service set --disable --disable-reason "maintenance" compute-7 nova-compute
```
Note
You can use the optional --disable-reason argument to specify your reason for disabling the Compute node. Do not use the word reserved in your description, because the Instance HA service uses this word to identify healthy Compute nodes that are intentionally disabled to provide reserve capacity for evacuating instances. For more information, see Reserving healthy Compute nodes.
If you plan to configure or maintain a large number of Compute nodes, you can temporarily prevent the Instance HA service from evacuating any failed Compute nodes by setting the DISABLED Instance HA service parameter to true. For more information, see Editing the Instance HA service parameters.
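
When several Compute nodes need the same treatment, the per-node command from this procedure can be wrapped in a small loop. The following is a minimal sketch, not part of the product: the function names disable_for_maintenance and enable_after_maintenance are hypothetical, and the check that rejects the word reserved is deliberately conservative (it ignores case, which might be stricter than the Instance HA service itself).

```shell
# Disable the nova-compute service on each listed Compute node.
# Hypothetical helper: only the openstack command comes from the procedure.
# Usage: disable_for_maintenance <reason> <node> [<node> ...]
disable_for_maintenance() {
    reason=$1
    shift
    # The Instance HA service uses the word "reserved" to identify reserve
    # capacity, so refuse any reason that contains it (case-insensitive,
    # which is an assumption made here to stay on the safe side).
    lower=$(printf '%s' "$reason" | tr '[:upper:]' '[:lower:]')
    case "$lower" in
        *reserved*)
            echo "error: the disable reason must not contain 'reserved'" >&2
            return 1
            ;;
    esac
    for node in "$@"; do
        openstack compute service set --disable --disable-reason "$reason" \
            "$node" nova-compute || return 1
    done
}

# Re-enable the same Compute nodes when the maintenance is complete.
# Usage: enable_after_maintenance <node> [<node> ...]
enable_after_maintenance() {
    for node in "$@"; do
        openstack compute service set --enable "$node" nova-compute || return 1
    done
}
```

For example, disable_for_maintenance "maintenance" compute-7 compute-8 disables both nodes, and enable_after_maintenance compute-7 compute-8 returns them to service afterwards.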

Additional resources

Maintaining the process of evacuating instances from failed Compute nodes

3.2. Troubleshooting the Instance HA service

The Red Hat OpenStack Services on OpenShift (RHOSO) high availability for Compute instances (Instance HA) service pod provides a log file that records the status of the process of evacuating instances from your failed Compute nodes.

These log file entries are created and associated with the fully qualified pod name, which includes the unique string that is appended to the .metadata.name that you specified in the manifest file, for example, instanceha-0-54f865b6dd-w6h4t.

Warning

When the Instance HA service pod restarts, a new unique string is appended to its name and all the log entries associated with the previous unique Instance HA service pod name are removed. You can implement centralized logging to prevent the loss of the log file entries.

The Instance HA service pod restarts when you update the configuration of the Instance HA service pod, for example, by editing an Instance HA service parameter which changes its ConfigMap or by changing the fencingSecret YAML file and reapplying this file.
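
If centralized logging is not yet in place, one stopgap is to save the current log to a file before you make a change that restarts the pod. The following is a minimal sketch that uses the oc logs command described later in this section. The function name save_instanceha_log and the file naming scheme are hypothetical, and a manual snapshot is not a substitute for centralized logging.

```shell
# Save the current log of one Instance HA service pod to a timestamped file
# and print the path of that file.
# Hypothetical helper: the function name and file naming are assumptions.
# Usage: save_instanceha_log <podname> [<output_directory>]
save_instanceha_log() {
    pod=$1
    outdir=${2:-.}
    stamp=$(date -u +%Y%m%dT%H%M%SZ)
    outfile="$outdir/$pod-$stamp.log"
    oc logs "$pod" > "$outfile" || return 1
    echo "$outfile"
}
```

Run the helper immediately before you edit an Instance HA service parameter or reapply the fencingSecret file, while the pod name is still valid.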

By default, the Instance HA service log file provides the information, warning, and error log messages, because the LOGLEVEL Instance HA service parameter is set to info. When you are troubleshooting the Instance HA service, you can change the LOGLEVEL parameter to debug to increase the number and detail of the log messages. For more information, see Editing the Instance HA service parameters.

You can use the following commands to view the Instance HA service pod log file:

$ oc logs -l service=instanceha
$ oc logs <podname>

Replace <podname> with the fully qualified name of your deployed Instance HA service pod, for example, instanceha-0-54f865b6dd-w6h4t. To determine this fully qualified pod name, run the following command: $ oc get pods | grep instanceha

Note

The oc logs <podname> command is useful when you have multiple clouds defined and you have deployed a separate Instance HA service pod to monitor each cloud. For more information, see Configuring the Instance HA service pod specification.
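
When you have several Instance HA service pods, looking up each name by hand becomes tedious. The following is a minimal sketch that lists the pods by label and prints each log under its own header. It assumes that every Instance HA service pod carries the service=instanceha label used by the oc logs -l command above. The function names are hypothetical.

```shell
# Print the name of every pod that carries the service=instanceha label,
# one per line, in the pod/<name> form that oc logs accepts.
instanceha_pods() {
    oc get pods -l service=instanceha -o name
}

# Print the log of each Instance HA service pod under a header line
# that shows which pod the following lines belong to.
instanceha_logs_by_pod() {
    for pod in $(instanceha_pods); do
        echo "==> $pod <=="
        oc logs "$pod"
    done
}
```

Unlike oc logs -l service=instanceha, which combines the output of all matching pods, this keeps the log of each cloud's pod separate.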

The following is an example of the log file messages generated for a successful evacuation of a Compute node, when the LOGLEVEL Instance HA service parameter is set to info.

$ oc logs instanceha-0-54f865b6dd-w6h4t
2024-09-15 23:19:07,065 INFO Nova login successful
2024-09-15 23:21:38,105 WARNING The following computes are down:['compute-0.ctlplane.example.com']
2024-09-15 23:21:39,137 INFO Fencing compute-0.ctlplane.example.com
2024-09-15 23:21:39,137 INFO Fencing host compute-0.ctlplane.example.com off
2024-09-15 23:21:39,407 INFO Power off of compute-0.ctlplane.example.com ok
2024-09-15 23:21:39,824 INFO Nova login successful
2024-09-15 23:21:39,824 INFO Disabling compute-0.ctlplane.example.com before evacuation
2024-09-15 23:21:39,824 INFO Forcing compute-0.ctlplane.example.com down before evacuation
2024-09-15 23:21:40,094 INFO Service nova-compute on host compute-0.ctlplane.example.com is now disabled
2024-09-15 23:21:40,094 INFO Start evacuation of compute-0.ctlplane.example.com
2024-09-15 23:21:41,740 INFO Evacuation successful. Re-enabling compute-0.ctlplane.example.com
2024-09-15 23:21:41,741 INFO Fencing host compute-0.ctlplane.example.com on
2024-09-15 23:21:42,486 INFO Power on of compute-0.ctlplane.example.com ok
2024-09-15 23:21:42,486 INFO Trying to enable compute-0.ctlplane.example.com
2024-09-15 23:21:42,572 INFO Host compute-0.ctlplane.example.com is now enabled
2024-09-15 23:22:43,074 INFO Unsetting force-down on host compute-0.ctlplane.example.com after evacuation
2024-09-15 23:22:43,187 INFO Successfully unset force-down on host compute-0.ctlplane.example.com
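
On a busy deployment it can help to narrow the log to the lines that matter. The following is a minimal sketch of two filters that read log lines on standard input. It assumes the date, time, level, message layout shown in the example above, and that error messages use the level name ERROR, which does not appear in that example. The function names are hypothetical.

```shell
# Print only the log lines at WARNING or ERROR level.
# Assumes the third whitespace-separated field is the log level.
instanceha_problems() {
    awk '$3 == "WARNING" || $3 == "ERROR"'
}

# Print every log line that mentions the given Compute node host name,
# which gives the sequence of evacuation steps for that node.
# Usage: instanceha_host_timeline <hostname>
instanceha_host_timeline() {
    grep -F -- "$1"
}
```

For example, oc logs <podname> | instanceha_host_timeline compute-0.ctlplane.example.com shows the fencing, evacuation, and re-enabling messages for that Compute node only.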

3.3. Rehabilitating evacuated Compute nodes

During the normal process of evacuating instances from a failed Compute node, the Red Hat OpenStack Services on OpenShift (RHOSO) high availability for Compute instances (Instance HA) service sets the Forced Down flag of the Compute node to true. This ensures that the Compute node is successfully evacuated and, if it is re-enabled, that it is successfully rebooted. When this process completes, the Instance HA service sets the Forced Down flag of the Compute node back to false.

When the evacuation of a Compute node is either slow or unsuccessful, you can set the Forced Down flag of the Compute node to false.

Procedure

When you view the list of Compute nodes, use the --long argument to display the status of their Forced Down flags:

$ openstack compute service list --long

In this example, compute-fn93pyp7-2.ctlplane.example.com has a Status of enabled but its Forced Down flag is still set to true, which indicates that something went wrong during the evacuation of this Compute node.

$ openstack compute service list --long
+--------------------------------------+--------------+-----------------------------------------+------+---------+-------+----------------------------+-----------------+-------------+
| ID                                   | Binary       | Host                                    | Zone | Status  | State | Updated At                 | Disabled Reason | Forced Down |
+--------------------------------------+--------------+-----------------------------------------+------+---------+-------+----------------------------+-----------------+-------------+
...
| c9590263-7fbd-4782-85d8-7cf7526ed292 | nova-compute | compute-fn93pyp7-0.ctlplane.example.com | nova | enabled | up    | 2024-10-08T04:59:23.000000 | None            | False       |
| 3319da3d-31e8-4877-a9d5-9f407e1356fa | nova-compute | compute-fn93pyp7-1.ctlplane.example.com | nova | enabled | up    | 2024-10-08T04:59:18.000000 | None            | False       |
| 1b89c4b9-bf16-45c0-9517-8f652ea5a129 | nova-compute | compute-fn93pyp7-2.ctlplane.example.com | nova | enabled | down  | 2024-10-08T04:59:18.000000 | None            | True        |
+--------------------------------------+--------------+-----------------------------------------+------+---------+-------+----------------------------+-----------------+-------------+
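
In a large deployment, the affected rows are easier to find with a filter than by reading the table. The following is a minimal sketch that prints the host name of every nova-compute service that is enabled but still has its Forced Down flag set. It assumes that the openstack client's --service filter and its -f value and -c output options behave as in recent client releases. The function name forced_down_hosts is hypothetical.

```shell
# Print the host name of each nova-compute service whose Status is enabled
# while its Forced Down flag is still True.
# Assumes -f value prints the selected columns separated by spaces,
# in the order Host, Status, Forced Down.
forced_down_hosts() {
    openstack compute service list --long --service nova-compute \
        -f value -c Host -c Status -c "Forced Down" |
        awk '$2 == "enabled" && $3 == "True" { print $1 }'
}
```

With the services listed in this example, the function would print only compute-fn93pyp7-2.ctlplane.example.com.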

Force the Compute node up:
```
$ openstack compute service set --up <Compute_node> nova-compute
```
- Replace <Compute_node> with the listed name of the Compute node, for example, compute-fn93pyp7-2.ctlplane.example.com.
  The Forced Down flag of <Compute_node> is set to false.
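
To confirm that the flag was cleared, you can list the service again after forcing it up. The following is a minimal sketch that combines both steps. It assumes that the --host and --service filters and the -f value and -c output options of the openstack client are available. The function name clear_forced_down is hypothetical.

```shell
# Force the nova-compute service on one Compute node up, then confirm that
# its Forced Down flag now reads False. Returns non-zero if either step fails.
# Usage: clear_forced_down <Compute_node>
clear_forced_down() {
    host=$1
    openstack compute service set --up "$host" nova-compute || return 1
    # Re-read only the Forced Down column for this host.
    flag=$(openstack compute service list --long --service nova-compute \
        --host "$host" -f value -c "Forced Down")
    [ "$flag" = "False" ]
}
```

For example, clear_forced_down compute-fn93pyp7-2.ctlplane.example.com && echo "flag cleared" prints the message only when the flag is confirmed as False.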
