Chapter 1. Introduction and planning an Instance HA deployment
High availability for Compute instances (Instance HA) is a tool that you can use to evacuate instances from a failed Compute node and re-create the instances on a different Compute node.
Instance HA works with shared storage or local storage environments, which means that evacuated instances maintain the same network configuration, such as static IP addresses and floating IP addresses. The re-created instances also maintain the same characteristics inside the new Compute node.
1.1. How Instance HA works
When a Compute node fails, the overcloud fencing agent fences the node, then the Instance HA agents evacuate instances from the failed Compute node to a different Compute node.
The following events occur when a Compute node fails and triggers Instance HA:
- At the time of failure, the IPMI agent performs first-layer fencing, which includes physically resetting the node to ensure that it shuts down and to prevent data corruption or multiple identical instances on the overcloud. When the node is offline, it is considered fenced. After the physical IPMI fencing, the `fence-nova` agent automatically performs second-layer fencing and marks the fenced node with the `"evacuate=yes"` cluster per-node attribute by running the following command:

      $ attrd_updater -n evacuate -A name="evacuate" host="FAILEDHOST" value="yes"

  `FAILEDHOST` is the name of the failed Compute node.
- The `nova-evacuate` agent continually runs in the background and periodically checks the cluster for nodes with the `"evacuate=yes"` attribute. When `nova-evacuate` detects that the fenced node contains this attribute, the agent starts evacuating the node. The evacuation process is similar to the manual instance evacuation process that you can perform at any time.
- When the failed node restarts after the IPMI reset, the `nova-compute` process on that node also starts automatically. Because the node was previously fenced, it does not run any new instances until Pacemaker un-fences the node.
- When Pacemaker detects that the Compute node is online, it starts the `compute-unfence-trigger` resource agent on the node, which releases the node so that it can run instances again.
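To see which nodes are currently marked for evacuation, you can filter the per-node attribute lines for `evacuate` values of `yes`. The following is a minimal sketch, not part of Instance HA itself. The function name `hosts_marked_for_evacuation` is hypothetical, and the sketch assumes that each input line uses the `name="evacuate" host="..." value="..."` layout that appears in the command above.

```shell
# Print the host name of every node whose "evacuate" attribute is "yes".
# Reads one attribute line per node on stdin, in the assumed layout:
#   name="evacuate" host="HOSTNAME" value="yes"
# Lines with any other value (for example "no") produce no output.
hosts_marked_for_evacuation() {
    sed -n 's/^.*name="evacuate".* host="\([^"]*\)".* value="yes".*$/\1/p'
}

# Hypothetical usage on a cluster node (assumes that your Pacemaker
# version prints query results in the layout described above):
#   attrd_updater -n evacuate -A -Q | hosts_marked_for_evacuation
```

Because the function only reads standard input, you can try it on saved output before you run it against a live cluster.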
1.2. Planning your Instance HA deployment
Before you deploy Instance HA, review the resource names for compliance and configure your storage and networking based on your environment.
- Compute node host names and Pacemaker remote resource names must comply with the W3C naming conventions. For more information, see Declaring Namespaces and Names and Tokens in the W3C documentation.
- Typically, Instance HA requires that you configure shared storage for disk images of instances. Therefore, if you attempt to use the `no-shared-storage` option, you might receive an `InvalidSharedStorage` error during evacuation, and the instances will not start on another Compute node.

  However, if all your instances are configured to boot from an OpenStack Block Storage (`cinder`) volume, you do not need to configure shared storage for the disk image of the instances, and you can evacuate all instances using the `no-shared-storage` option.

  During evacuation, if your instances are configured to boot from a Block Storage volume, any evacuated instances boot from the same volume on another Compute node. Therefore, the evacuated instances immediately restart their jobs because the OS image and the application data are stored on the OpenStack Block Storage volume.
- If you deploy Instance HA in a Spine-Leaf environment, you must define a single `internal_api` network for the Controller and Compute nodes. You can then define a subnet for each leaf. For more information about configuring Spine-Leaf networks, see Creating a roles data file in the Spine Leaf Networking guide.
- In Red Hat OpenStack Platform 13 and later, you use director to upgrade Instance HA as part of the overcloud upgrade. For more information about upgrading the overcloud, see the Keeping Red Hat OpenStack Platform Updated guide.
- Disabling Instance HA with director after installation is not supported. For a workaround to manually remove Instance HA components from your deployment, see the article How can I remove Instance HA components from the controller nodes?

  Important: This workaround is not verified for production environments. You must verify the procedure in a test environment before you implement it in a production environment.
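The naming requirement for Compute node host names and Pacemaker remote resource names can be checked before deployment. The following is a minimal sketch that assumes an ASCII-only approximation of the W3C name rules: the first character is a letter or underscore, and the remaining characters are letters, digits, periods, hyphens, or underscores. The function name `is_valid_resource_name` is hypothetical. The check deliberately rejects colons and non-ASCII characters, which is stricter than the W3C definition.

```shell
# Return 0 if the name looks compliant under the ASCII-only assumption,
# and 1 otherwise. This is a conservative pre-check, not a full
# implementation of the W3C Names and Tokens rules.
is_valid_resource_name() {
    case "$1" in
        # Reject an empty name, a bad first character, or any
        # disallowed character anywhere in the name.
        ''|[!A-Za-z_]*|*[!A-Za-z0-9._-]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Hypothetical usage:
#   is_valid_resource_name "$(hostname -s)" || echo "rename this node"
```

A name that begins with a digit fails this check, so it is worth running against every planned Compute node name.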
1.3. Instance HA resource agents
Instance HA uses the `fence_compute`, `NovaEvacuate`, and `compute-unfence-trigger` resource agents to evacuate and re-create instances if a Compute node fails.
Agent name | Name inside cluster | Role |
---|---|---|
`fence_compute` | `fence-nova` | Marks a Compute node for evacuation when the node becomes unavailable. |
`NovaEvacuate` | `nova-evacuate` | Evacuates instances from failed nodes. This agent runs on one of the Controller nodes. |
`compute-unfence-trigger` | `compute-unfence-trigger` | Releases a fenced node and enables the node to run instances again. |
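To confirm that the Instance HA resources exist in a cluster, you can filter a status listing for the cluster resource names used in this chapter: `fence-nova`, `nova-evacuate`, and `compute-unfence-trigger`. The following is a minimal sketch. The function name `instance_ha_resources` is hypothetical, and the sketch assumes only that each resource appears by name on a line of the status text, not any particular output layout.

```shell
# Keep only the lines on stdin that mention one of the Instance HA
# cluster resource names. As with grep itself, the exit status is
# non-zero when no line matches.
instance_ha_resources() {
    grep -E 'fence-nova|nova-evacuate|compute-unfence-trigger'
}

# Hypothetical usage on a Controller node:
#   pcs status | instance_ha_resources
```

An empty result suggests that the resources are not configured, or that they use different names in your deployment.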