Chapter 1. Overview
You can use the Red Hat OpenStack Services on OpenShift (RHOSO) high availability for Compute instances (Instance HA) service to assist you in managing the high availability of Compute instances by evacuating instances from failed Compute nodes and re-creating them on different healthy Compute nodes:
- The evacuated instances maintain the same network configuration, such as static IP addresses and floating IP addresses, because the Instance HA service works with shared storage or local storage environments.
- The re-created instances maintain the same characteristics inside the new Compute node.
By default, every Compute node is eligible for evacuation. But only the Compute nodes containing your critical instances require evacuation. The Instance HA service provides the following two methods for you to identify the Compute nodes running your critical instances:
- You can tag specific images or flavors, so that the Compute nodes that contain instances that implement these tagged flavors or images are eligible for evacuation.
- You can tag host aggregates, so that all the Compute nodes contained within a tagged host aggregate are eligible for evacuation.
By default, if you tag an image, flavor, or host aggregate, the Instance HA service ignores every untagged Compute node or instance. For more information, see Tag images, flavors, or host aggregates for evacuation.
The Instance HA service provides a number of parameters and options that you can use to customize the evacuation process of your instances from failed Compute nodes. For more information, see How the Instance HA service evacuates failed Compute nodes.
The Instance HA service assists you in managing the process of evacuating failed Compute nodes, and at times you need to take action. For more information, see Maintaining the process of evacuating instances from failed Compute nodes.
1.1. How the Instance HA service evacuates failed Compute nodes
The Red Hat OpenStack Services on OpenShift (RHOSO) high availability for Compute instances (Instance HA) service uses the following iterative process to evacuate instances from your failed Compute nodes.
Procedure
- The Instance HA service polls the Compute service (nova) database to identify the failed Compute nodes that are eligible for evacuation. The polling interval is specified by the POLL Instance HA service parameter, which is 45 seconds by default.
  The Instance HA service uses the following filtering methodology:
  - Compute nodes must contain instances to be eligible for evacuation.
  - By default, if you have tagged specific flavors or images, Compute nodes containing instances that use these flavors or images are eligible for evacuation. For more information, see Tag images, flavors, or host aggregates for evacuation.
  - If you have tagged specific host aggregates, Compute nodes in these host aggregates are eligible for evacuation.
  - All the Compute nodes that are not eligible for evacuation are ignored.
  - The status of each Compute node is either enabled or disabled. Any Compute node that is eligible for evacuation and has a status of disabled is ignored, because it is assumed that this Compute node has been intentionally disabled. For more information, see Disabling the evacuation of Compute nodes.
    Note: You can intentionally disable healthy Compute nodes to provide sufficient capacity to evacuate the instances of failed Compute nodes. For more information, see Reserving healthy Compute nodes.
  - The state of each Compute node is either up or down. Any Compute node that is eligible for evacuation with a state of down has failed.
  - The DELTA Instance HA service parameter can reduce the time taken by the Instance HA service to detect failed Compute nodes: any Compute node eligible for evacuation that reports a status of enabled, when this status is not updated within the specified DELTA period, has also failed. By default, the DELTA period is 30 seconds. For more information, see Editing the Instance HA service parameters.
- The Instance HA service calculates the percentage of failed Compute nodes and compares this percentage to the value of the THRESHOLD Instance HA service parameter.
  When the TAGGED_AGGREGATES parameter is true, the percentage is calculated based on the total number of Compute nodes that are tagged by using the EVACUABLE_TAG parameter.
  - If the percentage is less than or equal to the THRESHOLD value, then the Instance HA service continues evacuating Compute nodes.
  - If the percentage exceeds the THRESHOLD value, then the Instance HA service indicates this in a log message and stops evacuating Compute nodes.
    Note: You must set the THRESHOLD parameter to specify how many Compute nodes can fail before the evacuation process becomes impractical, for example, when the network is severely compromised or when there are insufficient healthy Compute nodes left to evacuate instances from the failed Compute nodes.
- If you set the DISABLED Instance HA service parameter to true, then the Instance HA service does not evacuate the failed Compute nodes. In this case, the Instance HA service logs that a Compute node is down but does nothing further. The DISABLED parameter is false by default. For more information, see Disabling the evacuation of Compute nodes.
- If you have reserved healthy Compute nodes, then the Instance HA service attempts to enable one reserved Compute node for each failed Compute node. For more information, see Reserving healthy Compute nodes.
- If you configure the Instance HA service to detect if a Compute node is capturing a kernel dump, then the Instance HA service waits for the kdump service to finish before fencing or powering off and evacuating each Compute node. For more information, see Detecting if a Compute node is capturing a kernel dump.
- You must configure the fencing agent of every Compute node that is eligible for evacuation. For more information, see Configuring the fencing of Compute nodes.
  Note: You cannot evacuate a Compute node unless it has a configured fencing agent.
- After a failed Compute node has been fenced, the Instance HA service calls the Compute service (nova) to perform the following actions:
  - Set the status of this failed Compute node to disabled and provide a descriptive, timestamped message for the --disable-reason argument.
  - Set the Forced Down flag of this Compute node to true.
  Note: The Instance HA service evacuates instances that are in a state of ACTIVE, ERROR, or STOPPED. The Compute service prevents instances in any other state from being evacuated.
- If you set the LEAVE_DISABLED Instance HA service parameter to true, then the fenced Compute nodes remain disabled after they have been evacuated. For example, if faulty hardware caused the Compute node to fail, you do not want the Compute node to be enabled only to fail again. By default, the LEAVE_DISABLED parameter is set to false, so the disabled Compute nodes are enabled after they have been evacuated successfully. In this case, the Instance HA service instructs the associated fencing agent to power on the Compute node.
- The Instance HA service sets the Forced Down flag of a failed Compute node to false when the Compute node is re-enabled after it has been successfully evacuated and successfully rebooted. When the evacuation of a Compute node is either slow or unsuccessful, you can set the Forced Down flag of the Compute node to false yourself. For more information, see Rehabilitating evacuated Compute nodes.
- By default, the Instance HA service does not monitor the evacuation of instances from failed Compute nodes. It assumes that the evacuations are successful if the respective Compute service (nova) API requests are accepted. If you set the SMART_EVACUATION Instance HA service parameter to true, then the Instance HA service monitors the evacuation of each instance. In this case, the Instance HA service retries evacuating an instance up to 5 times when it fails. If the evacuation of any instance fails 5 times, the Instance HA service performs the following actions:
  - The Instance HA service stops evacuating all instances from this Compute node.
  - The Instance HA service sets the status of this failed Compute node to disabled and provides a descriptive, timestamped message for the --disable-reason argument.
  - The Instance HA service sets the Forced Down flag of this Compute node to true.
  You must perform the following actions to rehabilitate this failed Compute node:
  - You must determine the reason why this instance could not be evacuated.
  - You must evacuate all the other instances from this Compute node.
  - You must set the Forced Down flag of this Compute node to false.
  - You must enable this Compute node.
  - You must ensure that this Compute node successfully reboots.
  When you set the SMART_EVACUATION Instance HA service parameter to true, you can use the WORKERS Instance HA service parameter to specify the number of instances that the Instance HA service can evacuate at the same time, which is 4 by default. The SMART_EVACUATION parameter is false by default.
- By default, the Instance HA service periodically polls each Compute node that is enabled but has a Forced Down flag set to true to ensure that it has been successfully rebooted. The polling interval is specified by the POLL Instance HA service parameter, which is 45 seconds by default.
- If you set the FORCE_ENABLE Instance HA service parameter to true, then the Instance HA service ignores the unsuccessful evacuation of instances from a failed Compute node. But this does not apply when this unsuccessful evacuation was being monitored by setting the SMART_EVACUATION Instance HA service parameter to true. In this case, as explained in the SMART_EVACUATION step above, this failed Compute node is explicitly disabled and you must clean up this Compute node yourself. By default, the FORCE_ENABLE parameter is false.
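The detection and threshold rules in this procedure can be sketched in a few lines of Python. This is an illustration only, not the code of the Instance HA service: the ComputeService record, the has_failed and within_threshold functions, and the THRESHOLD value of 50 are hypothetical names and values chosen for the example. Which Compute nodes count toward the total is also an assumption here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class ComputeService:
    """Hypothetical, simplified view of one nova-compute service record."""
    host: str
    status: str           # "enabled" or "disabled"
    state: str            # "up" or "down"
    updated_at: datetime  # last time the service status was updated


def has_failed(svc: ComputeService, now: datetime, delta_seconds: int = 30) -> bool:
    """Apply the status, state, and DELTA rules to one eligible Compute node."""
    if svc.status == "disabled":
        # Disabled nodes are ignored: assumed to be intentionally disabled.
        return False
    if svc.state == "down":
        return True
    # Enabled, but the status was not updated within the DELTA period.
    return now - svc.updated_at > timedelta(seconds=delta_seconds)


def within_threshold(failed: int, total: int, threshold_percent: float) -> bool:
    """Return True if evacuation should continue in this polling iteration."""
    if total == 0:
        # Assumption: nothing to compare, so treat this as 0% failed.
        return True
    return failed / total * 100 <= threshold_percent


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
services = [
    ComputeService("compute-0", "enabled", "up", now - timedelta(seconds=10)),
    ComputeService("compute-1", "enabled", "down", now - timedelta(seconds=90)),
    ComputeService("compute-2", "enabled", "up", now - timedelta(seconds=45)),
    ComputeService("compute-3", "disabled", "down", now - timedelta(seconds=300)),
]
failed = [s.host for s in services if has_failed(s, now)]
print(failed)
# The threshold of 50 is an example value, not a documented default.
print(within_threshold(len(failed), len(services), threshold_percent=50))
```

In this example, compute-1 is reported as failed because its state is down, compute-2 because its status is older than the 30-second DELTA period, and compute-3 is ignored because it is disabled.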
1.2. Tag images, flavors, or host aggregates for evacuation
By default, every instance on every Compute node is eligible for evacuation. But only the Compute nodes containing your critical instances require evacuation. You can use tagging to identify the instances that you want the Red Hat OpenStack Services on OpenShift (RHOSO) high availability for Compute instances (Instance HA) service to evacuate.
You can tag your critical flavors or images by adding a trait to the flavor or image metadata that specifies the value of the EVACUABLE_TAG Instance HA service parameter. For more information, see Flavor metadata in Configuring the Compute service for instance creation.
The default value of the EVACUABLE_TAG Instance HA service parameter is evacuable.
In this case, all the Compute nodes that contain instances that implement these tagged flavors or images are eligible for evacuation, provided that one or both of the following Instance HA service parameters are set to true:
- By default, the TAGGED_FLAVORS Instance HA service parameter is set to true, so that the Instance HA service checks for tagged flavors. If you set the TAGGED_FLAVORS parameter to false, then the Instance HA service does not check for tagged flavors. For more information, see Editing the Instance HA service parameters.
- By default, the TAGGED_IMAGES Instance HA service parameter is set to true, so that the Instance HA service checks for tagged images. If you set the TAGGED_IMAGES parameter to false, then the Instance HA service does not check for tagged images.
You can tag your critical host aggregates by adding a trait to the host aggregate metadata that specifies the value of the EVACUABLE_TAG Instance HA service parameter. For more information, see Creating a host aggregate in Configuring the Compute service for instance creation.
In this case, all the Compute nodes contained in these tagged host aggregates are eligible for evacuation, provided that the following Instance HA service parameter is set to true:
- By default, the TAGGED_AGGREGATES Instance HA service parameter is set to true, so that the Instance HA service checks for tagged host aggregates. If you set the TAGGED_AGGREGATES parameter to false, then the Instance HA service does not check for tagged host aggregates and therefore will evacuate all the eligible Compute nodes.
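The following Python sketch shows one way to reason about which Compute nodes become eligible when tags are in use. It is a simplified model and not the service's implementation: the Instance and Aggregate records and the eligible_hosts function are hypothetical, and the sketch assumes that the flavor or image method and the host aggregate method combine as a union, which this section does not state.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Hypothetical record: metadata keys found on the flavor and the image."""
    name: str
    host: str
    flavor_metadata: set = field(default_factory=set)
    image_metadata: set = field(default_factory=set)


@dataclass
class Aggregate:
    """Hypothetical record: a host aggregate, its hosts, and its metadata keys."""
    name: str
    hosts: set = field(default_factory=set)
    metadata: set = field(default_factory=set)


def eligible_hosts(instances, aggregates, *, evacuable_tag="evacuable",
                   tagged_flavors=True, tagged_images=True,
                   tagged_aggregates=True):
    """Return the set of Compute nodes that are eligible for evacuation."""
    # Compute nodes must contain instances to be eligible.
    hosts_with_instances = {inst.host for inst in instances}

    tagged = set()
    for inst in instances:
        if tagged_flavors and evacuable_tag in inst.flavor_metadata:
            tagged.add(inst.host)
        if tagged_images and evacuable_tag in inst.image_metadata:
            tagged.add(inst.host)
    if tagged_aggregates:
        for agg in aggregates:
            if evacuable_tag in agg.metadata:
                tagged.update(agg.hosts)

    if not tagged:
        # Nothing tagged (or tag checks disabled): every node with instances.
        return hosts_with_instances
    # Assumption: untagged nodes are ignored once any tag is found.
    return tagged & hosts_with_instances


instances = [
    Instance("db", "compute-0", flavor_metadata={"evacuable"}),
    Instance("web", "compute-1"),
    Instance("cache", "compute-2", image_metadata={"evacuable"}),
]
aggregates = [Aggregate("critical", hosts={"compute-3"}, metadata={"evacuable"})]
print(sorted(eligible_hosts(instances, aggregates)))
```

In this example, compute-1 is skipped because nothing on it is tagged, and compute-3 is skipped because it contains no instances even though its host aggregate is tagged.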
1.3. Reserving healthy Compute nodes
You can intentionally reserve healthy Compute nodes to provide sufficient capacity to evacuate the instances of failed Compute nodes. The Red Hat OpenStack Services on OpenShift (RHOSO) high availability for Compute instances (Instance HA) service enables one reserved Compute node for each failed Compute node.
Procedure
- Enable the RESERVED_HOSTS parameter of the Instance HA service by setting this value to true. This parameter is false by default. For more information, see Editing the Instance HA service parameters.
- Disable each healthy Compute node that you want to reserve, and specify the word reserved for the --disable-reason argument. For example, to reserve compute-1, specify this command:
  $ openstack compute service set --disable --disable-reason "reserved" compute-1 nova-compute
- Repeat the previous step until you have disabled all the Compute nodes that you want to reserve.
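As an illustration of the one-for-one rule described in this section, the following Python sketch pairs each failed Compute node with a reserved one. It is a minimal model under stated assumptions: the pair_reserved_hosts function and the dictionary layout of each service record are hypothetical, and the exact way that the Instance HA service matches the disable reason is not documented here.

```python
def pair_reserved_hosts(services, failed_hosts, reason="reserved"):
    """Map each failed Compute node to one reserved Compute node.

    A node counts as reserved when it is disabled and its disable reason
    equals `reason` (assumption: an exact match).
    """
    reserved = [
        svc["host"]
        for svc in services
        if svc["status"] == "disabled" and svc.get("disabled_reason") == reason
    ]
    # zip() stops at the shorter list, so surplus failed nodes get no partner.
    return dict(zip(failed_hosts, reserved))


services = [
    {"host": "compute-0", "status": "enabled", "disabled_reason": None},
    {"host": "compute-1", "status": "disabled", "disabled_reason": "reserved"},
    {"host": "compute-2", "status": "enabled", "disabled_reason": None},
    {"host": "compute-3", "status": "disabled", "disabled_reason": "maintenance"},
]
print(pair_reserved_hosts(services, ["compute-0", "compute-2"]))
```

In this example only compute-1 is reserved, so only the first failed Compute node is paired with a reserved Compute node. Reserve at least as many Compute nodes as you expect to fail at the same time.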
1.4. Detecting if a Compute node is capturing a kernel dump
You can configure the Red Hat OpenStack Services on OpenShift (RHOSO) high availability for Compute instances (Instance HA) service to detect if a Compute node is capturing a kernel dump, in which case the Instance HA service waits for the kdump service to finish before fencing and evacuating each Compute node.
Prerequisites
- You must also install the fence_kdump STONITH agent on every Compute node that you want to monitor for failure.
- You have administrator privileges on all the Compute nodes that require their instances to be evacuated.
- You have installed Red Hat Enterprise Linux (RHEL) 9 with the High Availability Add-on on all the Compute nodes that require their instances to be evacuated.
- You have planned the configuration options of the kdump service and the fence_kdump STONITH agent.
- The network that you have selected to send the UDP kdump notifications of the fence_kdump STONITH agent must meet the following requirements:
  - This network must be shared with the Compute nodes that require their instances to be evacuated.
  - This network must support reverse DNS lookup, so that the source IP address of the failed Compute node is translated into the respective Compute hostname, known by the Compute service (nova).
  - This network must not use OVS Bonds or OVS Bridges. You must use an untagged interface or a tagged VLAN and Linux bonds where appropriate.
Procedure
- Use one of the following methods to configure the kdump service on the Compute nodes:
  - You can use the command line to configure the kdump service. For more information, see Configuring kdump on the command line in RHEL Managing, monitoring, and updating the kernel.
  - You can use the web console to configure the kdump service. For more information, see Configuring kdump in the web console in RHEL Managing, monitoring, and updating the kernel.
- Check that the kdump service is active on all the Compute nodes. You can do this by checking that the output of the following command is active:
  # systemctl is-active kdump
- Install the fence_kdump STONITH agent on the Compute nodes:
  # dnf install fence-agents-kdump
- Include the following options when you define the specification of your Instance HA service pod in the YAML Instance HA service manifest file:
  - .spec.networkAttachments: You must specify the network that you have selected to send the UDP kdump notifications of the fence_kdump STONITH agent. For example, networkAttachments: ['internalapi'] specifies internalapi as the designated network.
  - .spec.instanceHaKdumpPort: If you do not use the default UDP port of 7410 for the fence_kdump STONITH agent, then you must use this option to specify your selected UDP port.
  For more information about creating the YAML Instance HA service manifest file and defining the specification, see Configuring the Instance HA service pod specification.
- Deploy the Instance HA service. For more information, see Deploying the Instance HA service.
- Set the CHECK_KDUMP Instance HA service parameter to true. For more information, see Editing the Instance HA service parameters.
- Retrieve the IP address of the deployed Instance HA service pod. For example, if internalapi is your designated network, use the following command:
  $ oc get instanceha -o json | jq -r '.items[0].status.networkAttachments["openstack/internalapi"][0]'
- Configure the fence_kdump STONITH agent on the Compute nodes to specify the IP address of the deployed Instance HA service pod. For example, if 172.16.0.178 is the IP address of the Instance HA service pod:
  # echo "fence_kdump_nodes 172.16.0.178" >> /etc/kdump.conf
- Configure the fence_kdump STONITH agent on the Compute nodes to specify the UDP port, the IP address family, the number of notification messages, and their interval. For example, the following configuration uses the default 7410 UDP port, selects the IP address family automatically, and sends an unlimited number of messages at 5-second intervals until the kdump process has completed:
  # echo 'fence_kdump_args -p 7410 -f auto -c 0 -i 5' >> /etc/kdump.conf
- Regenerate the initrd image on the Compute nodes by restarting the kdump service:
  # systemctl restart kdump
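Before you crash a Compute node to test the setup, you might want to confirm that both fence_kdump directives reached /etc/kdump.conf. The following Python sketch is one possible check and is not part of the product: the parse_kdump_conf and missing_fence_kdump_directives functions are hypothetical helpers, and the sample text stands in for the contents of the real file.

```python
REQUIRED = ("fence_kdump_nodes", "fence_kdump_args")


def parse_kdump_conf(text):
    """Return a dict of directive name to value, skipping blanks and comments."""
    directives = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        directives[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return directives


def missing_fence_kdump_directives(text):
    """List the required fence_kdump directives that are absent or empty."""
    directives = parse_kdump_conf(text)
    return [name for name in REQUIRED if not directives.get(name)]


# Sample content that mirrors the two echo commands in this procedure.
sample = """\
# kdump.conf excerpt
path /var/crash
fence_kdump_nodes 172.16.0.178
fence_kdump_args -p 7410 -f auto -c 0 -i 5
"""
print(missing_fence_kdump_directives(sample))
print(parse_kdump_conf(sample)["fence_kdump_nodes"])
```

On a Compute node, you could pass the text of /etc/kdump.conf to missing_fence_kdump_directives instead of the sample. An empty list means that both directives are present.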
Verification
- Crash one of the Compute nodes to ensure that the Instance HA service pod and the fence_kdump agent have been configured correctly:
  # echo c > /proc/sysrq-trigger
- Open the relevant console on the crashed Compute node and verify that you can see the progress of the kdump process.
- Open the Instance HA service pod log file and verify that this log file contains this message: the following compute(s) are kdumping:, which specifies the Compute node that you have crashed. For information on viewing the log file, see Troubleshooting the Instance HA service.