
Chapter 1. Overview


You can use the Red Hat OpenStack Services on OpenShift (RHOSO) high availability for Compute instances (Instance HA) service to manage the high availability of Compute instances. The Instance HA service evacuates instances from failed Compute nodes and re-creates them on different healthy Compute nodes:

  • The evacuated instances retain the same network configuration, such as static IP addresses and floating IP addresses, because the Instance HA service works with both shared storage and local storage environments.
  • The re-created instances retain the same characteristics on the new Compute node.

By default, every Compute node is eligible for evacuation, but typically only the Compute nodes that contain your critical instances require it. The Instance HA service provides the following two methods to identify the Compute nodes that run your critical instances:

  • You can tag specific images or flavors, so that the Compute nodes that contain instances that implement these tagged flavors or images are eligible for evacuation.
  • You can tag host aggregates, so that all the Compute nodes contained within a tagged host aggregate are eligible for evacuation.

By default, if you tag an image, flavor, or host aggregate, the Instance HA service ignores every untagged Compute node or instance. For more information, see Tag images, flavors, or host aggregates for evacuation.

The Instance HA service provides a number of parameters and options that you can use to customize the evacuation process of your instances from failed Compute nodes. For more information, see How the Instance HA service evacuates failed Compute nodes.

The Instance HA service assists you in managing the process of evacuating failed Compute nodes, and at times you need to take action. For more information, see Maintaining the process of evacuating instances from failed Compute nodes.

1.1. How the Instance HA service evacuates failed Compute nodes

The Red Hat OpenStack Services on OpenShift (RHOSO) high availability for Compute instances (Instance HA) service uses the following iterative process to evacuate instances from your failed Compute nodes.

Procedure

  1. The Instance HA service polls the Compute service (nova) database to identify the failed Compute nodes that are eligible for evacuation. The polling interval is specified by the POLL Instance HA service parameter, which is 45 seconds by default.

    The Instance HA service uses the following filtering methodology:

    • Compute nodes must contain instances to be eligible for evacuation.
    • By default, if you have tagged specific flavors or images, Compute nodes containing instances that use these flavors or images are eligible for evacuation. For more information, see Tag images, flavors, or host aggregates for evacuation.
    • If you have tagged specific host aggregates, Compute nodes in these host aggregates are eligible for evacuation.

      All the Compute nodes that are not eligible for evacuation are ignored.

  2. The status of each Compute node is either enabled or disabled. Any Compute node that is eligible for evacuation and has a status of disabled is ignored, because it is assumed that this Compute node has been intentionally disabled. For more information, see Disabling the evacuation of Compute nodes.

    Note

    You can intentionally disable healthy Compute nodes to provide sufficient capacity to evacuate the instances of failed Compute nodes. For more information, see Reserving healthy Compute nodes.

  3. The state of each Compute node is either up or down. Any Compute node that is eligible for evacuation and has a state of down is considered failed.

    The DELTA Instance HA service parameter can reduce the time that the Instance HA service takes to detect failed Compute nodes: any Compute node that is eligible for evacuation and reports a status of enabled, but whose status has not been updated within the specified DELTA period, is also considered failed. By default, the DELTA period is 30 seconds. For more information, see Editing the Instance HA service parameters.
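    The status, state, and last update time that these checks rely on are exposed through the Compute service API. As an illustrative sketch, assuming a host where the openstack client is configured, you can inspect them as follows:

    ```shell
    # List the nova-compute services, showing the Status (enabled/disabled),
    # State (up/down), and Updated At columns that the Instance HA service
    # evaluates when it selects Compute nodes for evacuation.
    openstack compute service list --service nova-compute
    ```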

  4. The Instance HA service calculates the percentage of failed Compute nodes and compares this percentage to the value of the THRESHOLD Instance HA service parameter:
Note

When the TAGGED_AGGREGATES parameter is true, the THRESHOLD parameter is calculated based on the total number of Compute nodes that are tagged by using the EVACUABLE_TAG parameter.

  • If the percentage is less than or equal to the THRESHOLD value, then the Instance HA service continues evacuating Compute nodes.
  • If the percentage exceeds the THRESHOLD value, then the Instance HA service indicates this in a log message and stops evacuating Compute nodes.

    Note

    You must set the THRESHOLD parameter to specify how many Compute nodes can fail before the evacuation process becomes impractical, for example, when the network is severely compromised or when there are insufficient healthy Compute nodes left to evacuate instances from the failed Compute nodes.
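    The comparison in this step is simple integer arithmetic. The following sketch, with illustrative node counts and a THRESHOLD of 50 percent, shows the decision that the service makes; in the real service, the counts come from the Compute service (nova) database:

    ```shell
    # Illustrative values only; the Instance HA service derives them itself.
    THRESHOLD=50        # maximum tolerated percentage of failed nodes
    total_nodes=10      # Compute nodes eligible for evacuation
    failed_nodes=3      # nodes with a state of down, or a stale DELTA status

    # Percentage of failed Compute nodes (integer arithmetic).
    percent=$(( failed_nodes * 100 / total_nodes ))

    if [ "$percent" -le "$THRESHOLD" ]; then
      echo "continuing evacuation: ${percent}% <= ${THRESHOLD}%"
    else
      echo "stopping evacuation: ${percent}% > ${THRESHOLD}%"
    fi
    ```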

    1. If you set the DISABLED Instance HA service parameter to true, then the Instance HA service does not evacuate the failed Compute nodes. In this case, the Instance HA service logs that a Compute node is down but does nothing further. The DISABLED parameter is false by default. For more information, see Disabling the evacuation of Compute nodes.
    2. If you have reserved healthy Compute nodes, then the Instance HA service attempts to enable one reserved Compute node for each failed Compute node. For more information, see Reserving healthy Compute nodes.
    3. If you configure the Instance HA service to detect if a Compute node is capturing a kernel dump, then the Instance HA service waits for the kdump service to finish before fencing or powering off and evacuating each Compute node. For more information, see Detecting if a Compute node is capturing a kernel dump.

      You must configure the fencing agent of every Compute node that is eligible for evacuation. For more information, see Configuring the fencing of Compute nodes.

      Note

      You cannot evacuate a Compute node unless it has a configured fencing agent.

    4. After a failed Compute node has been fenced, the Instance HA service calls the Compute service (nova) to perform the following actions:
  • Set the status of this failed Compute node to disabled and provide a descriptive, timestamped message for the --disable-reason argument.
  • Set the Forced Down flag of this Compute node to true.

    Note

    The Instance HA service evacuates instances that are in a state of ACTIVE, ERROR, or STOPPED. The Compute service prevents instances in any other state from being evacuated.
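    These two actions map to standard Compute service API operations. As an illustrative sketch, the host name compute-1 and the reason text are hypothetical; the equivalent openstack client commands are:

    ```shell
    # Disable the failed Compute node with a descriptive, timestamped reason.
    openstack compute service set --disable \
      --disable-reason "instance-ha: fenced at $(date -u +%Y-%m-%dT%H:%M:%SZ)" \
      compute-1 nova-compute

    # Set the Forced Down flag of the Compute node to true.
    openstack compute service set --down compute-1 nova-compute
    ```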

    5. If you set the LEAVE_DISABLED Instance HA service parameter to true, then the fenced Compute nodes remain disabled after they have been evacuated. This is useful when, for example, faulty hardware caused the Compute node to fail and you do not want the node re-enabled only to fail again. By default, the LEAVE_DISABLED parameter is set to false, so the disabled Compute nodes are enabled after they have been evacuated successfully. In this case, the Instance HA service instructs the associated fencing agent to power on the Compute node.
    6. The Instance HA service sets the Forced Down flag of a failed Compute node to false when the Compute node has been successfully evacuated, successfully rebooted, and re-enabled. When the evacuation of a Compute node is either slow or unsuccessful, you can set the Forced Down flag of the Compute node to false yourself. For more information, see Rehabilitating evacuated Compute nodes.
    7. By default, the Instance HA service does not monitor the evacuation of instances from failed Compute nodes: it assumes that an evacuation succeeds if the respective Compute service (nova) API request is accepted. If you set the SMART_EVACUATION Instance HA service parameter to true, then the Instance HA service monitors the evacuation of each instance. In this case, the Instance HA service retries the evacuation of an instance up to 5 times when it fails. If the evacuation of any instance fails 5 times, the Instance HA service performs the following actions:
  • The Instance HA service stops evacuating all instances from this Compute node.
  • The Instance HA service sets the status of this failed Compute node to disabled and provides a descriptive, timestamped message for the --disable-reason argument.
  • The Instance HA service sets the Forced Down flag of this Compute node to true.

    You must perform the following actions to rehabilitate this failed Compute node:

  • You must determine the reason why this instance could not be evacuated.
  • You must evacuate all the other instances from this Compute node.
  • You must set the Forced Down flag of this Compute node to false.
  • You must enable this Compute node.
  • You must ensure that this Compute node successfully reboots.
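    After you resolve the underlying problem, the last three actions can be performed with the openstack client. A sketch, assuming the failed Compute node is named compute-1:

    ```shell
    # Clear the Forced Down flag of the Compute node.
    openstack compute service set --up compute-1 nova-compute

    # Re-enable the Compute node.
    openstack compute service set --enable compute-1 nova-compute

    # Confirm that the node reports a status of enabled and a state of up
    # after it has rebooted.
    openstack compute service list --service nova-compute --host compute-1
    ```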

    When you set the SMART_EVACUATION Instance HA service parameter to true, you can use the WORKERS Instance HA service parameter to specify the number of instances that the Instance HA service can evacuate at the same time, which is 4 by default. The SMART_EVACUATION parameter is false by default.

    8. By default, the Instance HA service periodically polls each Compute node that is enabled but has a Forced Down flag set to true to ensure that it has been successfully rebooted. The polling interval is specified by the POLL Instance HA service parameter, which is 45 seconds by default.

      If you set the FORCE_ENABLE Instance HA service parameter to true, then the Instance HA service ignores the unsuccessful evacuation of instances from a failed Compute node. However, this does not apply when the unsuccessful evacuation was being monitored because you set the SMART_EVACUATION Instance HA service parameter to true. In that case, as explained in the previous step, the failed Compute node is explicitly disabled and you must clean up the Compute node yourself. By default, the FORCE_ENABLE parameter is false.

1.2. Tag images, flavors, or host aggregates for evacuation

By default, every instance on every Compute node is eligible for evacuation. But only the Compute nodes containing your critical instances require evacuation. You can use tagging to identify the instances that you want the Red Hat OpenStack Services on OpenShift (RHOSO) high availability for Compute instances (Instance HA) service to evacuate.

You can tag your critical flavors or images by adding a trait to the flavor or image metadata that specifies the value of the EVACUABLE_TAG Instance HA service parameter. For more information, see Flavor metadata in Configuring the Compute service for instance creation.

Note

The default value of the EVACUABLE_TAG Instance HA service parameter is evacuable.

In this case, all the Compute nodes that contain instances that implement these tagged flavors or images are eligible for evacuation, provided that one or both of the following Instance HA service parameters are set to true:

  • By default, the TAGGED_FLAVORS Instance HA service parameter is set to true, so that the Instance HA service checks for tagged flavors. If you set the TAGGED_FLAVORS parameter to false then the Instance HA service does not check for tagged flavors. For more information, see Editing the Instance HA service parameters.
  • By default, the TAGGED_IMAGES Instance HA service parameter is set to true, so that the Instance HA service checks for tagged images. If you set the TAGGED_IMAGES parameter to false then the Instance HA service does not check for tagged images.

You can tag your critical host aggregates by adding a trait to the host aggregate metadata that specifies the value of the EVACUABLE_TAG Instance HA service parameter. For more information, see Creating a host aggregate in Configuring the Compute service for instance creation.

In this case, all the Compute nodes contained in these tagged host aggregates are eligible for evacuation, provided that the following Instance HA service parameter is set to true:

  • By default, the TAGGED_AGGREGATES Instance HA service parameter is set to true, so that the Instance HA service checks for tagged host aggregates. If you set the TAGGED_AGGREGATES parameter to false then the Instance HA service does not check for tagged host aggregates and therefore will evacuate all the eligible Compute nodes.
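For illustration, assuming the default EVACUABLE_TAG value of evacuable and the hypothetical names my-flavor, my-image, and my-aggregate, the tags can be applied with the openstack client; verify the exact metadata key format against your deployment before using it:

```shell
# Tag a flavor by adding the evacuable property to its metadata.
openstack flavor set --property evacuable=true my-flavor

# Tag an image.
openstack image set --tag evacuable my-image

# Tag a host aggregate so that all of its Compute nodes become eligible.
openstack aggregate set --property evacuable=true my-aggregate
```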

1.3. Reserving healthy Compute nodes

You can intentionally reserve healthy Compute nodes to provide sufficient capacity to evacuate the instances of failed Compute nodes. The Red Hat OpenStack Services on OpenShift (RHOSO) high availability for Compute instances (Instance HA) service enables one reserved Compute node for each failed Compute node.

Procedure

  1. Enable the RESERVED_HOSTS parameter of the Instance HA service, by setting this value to true. This parameter is false by default. For more information, see Editing the Instance HA service parameters.
  2. Disable each healthy Compute node that you want to reserve, and specify the word reserved for the --disable-reason argument. For example, to reserve compute-1, run the following command:

    $ openstack compute service set --disable --disable-reason "reserved" compute-1 nova-compute
  3. Repeat step 2 until you have disabled all the Compute nodes that you want to reserve.

1.4. Detecting if a Compute node is capturing a kernel dump

You can configure the Red Hat OpenStack Services on OpenShift (RHOSO) high availability for Compute instances (Instance HA) service to detect if a Compute node is capturing a kernel dump, in which case the Instance HA service waits for the kdump service to finish before fencing and evacuating each Compute node.

Prerequisites

  • You must install the fence_kdump STONITH agent on every Compute node that you want to monitor for failure.
  • You have administrator privileges on all the Compute nodes that require their instances to be evacuated.
  • You have installed Red Hat Enterprise Linux (RHEL) 9 with the High Availability Add-on on all the Compute nodes that require their instances to be evacuated.
  • You have planned the configuration options of the kdump service and the fence_kdump STONITH agent.
  • The network that you have selected to send the UDP kdump notifications of the fence_kdump STONITH agent must meet the following requirements:

    • This network must be shared with the Compute nodes that require their instances to be evacuated.
    • This network must support reverse DNS lookup, so that the source IP address of the failed Compute node is translated into the respective Compute hostname, known by the Compute service (nova).
    • This network must not use OVS Bonds or OVS Bridges. You must use an untagged interface or a tagged VLAN and Linux bonds where appropriate.

Procedure

  1. Configure the kdump service on the Compute nodes.

  2. Check that the kdump service is active on all the Compute nodes. You can do this by checking that the output of the following command is active:

    # systemctl is-active kdump
  3. Install the fence_kdump STONITH agent on the Compute nodes:

    # dnf install fence-agents-kdump
  4. Include the following options when you define the specification of your Instance HA service pod in the YAML Instance HA service manifest file:

    • .spec.networkAttachments: You must specify the network that you have selected to send the UDP kdump notifications of the fence_kdump STONITH agent. For example, networkAttachments: ['internalapi'] specifies internalapi as the designated network.
    • .spec.instanceHaKdumpPort: If you do not use the default UDP port of the fence_kdump STONITH agent, 7410, you must use this option to specify your selected UDP port.

      For more information about creating the YAML Instance HA service manifest file and defining the specification, see Configuring the Instance HA service pod specification.
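      As an illustrative sketch, the two options might appear in the manifest as follows; the apiVersion, kind, and resource name shown here are assumptions, so confirm them against the Instance HA service documentation for your release:

      ```yaml
      apiVersion: client.openstack.org/v1beta1   # assumed API group for the InstanceHa resource
      kind: InstanceHa
      metadata:
        name: instanceha
      spec:
        networkAttachments:
          - internalapi             # network selected for the UDP kdump notifications
        instanceHaKdumpPort: 7410   # only required if you change the default port
      ```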

  5. Deploy the Instance HA service. For more information, see Deploying the Instance HA service.
  6. Set the CHECK_KDUMP Instance HA service parameter to true. For more information, see Editing the Instance HA service parameters.
  7. Retrieve the IP address of the deployed Instance HA service pod. For example, if internalapi is your designated network, use the following command:

    $ oc get instanceha -o json |jq -r '.items[0].status.networkAttachments["openstack/internalapi"][0]'
  8. Configure the fence_kdump STONITH agent on the Compute nodes to specify the IP address of the deployed Instance HA service pod. For example, if 172.16.0.178 is the IP address of the Instance HA service pod:

    # echo "fence_kdump_nodes 172.16.0.178 >> /etc/kdump.conf
  9. Configure the fence_kdump STONITH agent on the Compute nodes to specify the UDP port, the IP network family, the number of notification messages, and their interval. For example, the following configuration uses the default UDP port of 7410, selects the network family automatically, and sends an unlimited number of messages at a 5-second interval until the kdump process has completed:

    # echo 'fence_kdump_args -p 7410 -f auto -c 0 -i 5' >> /etc/kdump.conf
  10. Regenerate the initrd image on the Compute nodes by restarting the kdump service:

    # systemctl restart kdump

Verification

  1. Crash one of the Compute nodes to verify that the Instance HA service pod and the fence_kdump agent have been configured correctly:

    # echo c > /proc/sysrq-trigger
  2. Open the relevant console on the crashed Compute node and verify that you can see the progress of the kdump process.
  3. Open the Instance HA service pod log file and verify that it contains the message the following compute(s) are kdumping:, followed by the name of the Compute node that you crashed. For information about viewing the log file, see Troubleshooting the Instance HA service.