Chapter 12. Scaling overcloud nodes
If you want to add or remove nodes after the creation of the overcloud, you must update the overcloud.
Do not use the openstack server delete command to remove nodes from the overcloud. Read the procedures defined in this section to properly remove and replace nodes.
Ensure that your bare metal nodes are not in maintenance mode before you begin scaling out or removing an overcloud node.
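One way to verify this is to filter the maintenance column of the bare metal node list. This is a hedged sketch: the sample text stands in for the real output of `openstack baremetal node list -f value -c UUID -c Maintenance`, and the UUIDs are placeholders:

```shell
# Print any node whose Maintenance column is True; these must be cleared
# (openstack baremetal node maintenance unset <node_uuid>) before scaling.
sample="aaaa-1111 False
bbbb-2222 True"
echo "$sample" | awk '$2 == "True" { print $1 " is in maintenance mode" }'
```

In practice, pipe the real `openstack baremetal node list` output into the same awk filter instead of the sample variable.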
Use the following table to determine support for scaling each node type:
Node Type | Scale Up? | Scale Down? | Notes |
Controller | N | N | You can replace Controller nodes using the procedures in Chapter 13, Replacing Controller Nodes. |
Compute | Y | Y | |
Ceph Storage Nodes | Y | N | You must have at least 1 Ceph Storage node from the initial overcloud creation. |
Object Storage Nodes | Y | Y | |
Ensure that you leave at least 10 GB of free space before scaling the overcloud. This free space accommodates image conversion and caching during the node provisioning process.
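A quick check along these lines can confirm the headroom before you start. This is a hedged sketch: the path is an assumption, so point it at whichever partition holds the undercloud's image cache in your layout:

```shell
# Warn when less than 10 GB is available on the partition that stages images.
need_kb=$((10 * 1024 * 1024))   # 10 GB expressed in KiB
avail_kb=$(df --output=avail -k /var | tail -n 1 | tr -d ' ')
if [ "$avail_kb" -lt "$need_kb" ]; then
  echo "WARNING: only $((avail_kb / 1024)) MiB available; need at least 10 GB"
else
  echo "OK: enough free space for image conversion and caching"
fi
```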
12.1. Adding nodes to the overcloud
Complete the following steps to add more nodes to the director node pool.
Procedure
Create a new JSON file (newnodes.json) containing the new node details to register:

{
  "nodes": [
    {
      "mac": ["dd:dd:dd:dd:dd:dd"],
      "cpu": "4",
      "memory": "6144",
      "disk": "40",
      "arch": "x86_64",
      "pm_type": "ipmi",
      "pm_user": "admin",
      "pm_password": "p@55w0rd!",
      "pm_addr": "192.168.24.207"
    },
    {
      "mac": ["ee:ee:ee:ee:ee:ee"],
      "cpu": "4",
      "memory": "6144",
      "disk": "40",
      "arch": "x86_64",
      "pm_type": "ipmi",
      "pm_user": "admin",
      "pm_password": "p@55w0rd!",
      "pm_addr": "192.168.24.208"
    }
  ]
}
Run the following command to register the new nodes:
$ source ~/stackrc
(undercloud) $ openstack overcloud node import newnodes.json
After registering the new nodes, run the following command to list your nodes and identify the UUIDs of the new nodes:
(undercloud) $ openstack baremetal node list
Run the following commands to launch the introspection process for each new node:
(undercloud) $ openstack baremetal node manage [NODE UUID]
(undercloud) $ openstack overcloud node introspect [NODE UUID] --provide
This process detects and benchmarks the hardware properties of the nodes.
Configure the image properties for the node:
(undercloud) $ openstack overcloud node configure [NODE UUID]
12.2. Increasing node counts for roles
Complete the following steps to scale overcloud nodes for a specific role, such as a Compute node.
Procedure
Tag each new node with the role you want. For example, to tag a node with the Compute role, run the following command:
(undercloud) $ openstack baremetal node set --property capabilities='profile:compute,boot_option:local' [NODE UUID]
Scaling the overcloud requires that you edit the environment file that contains your node counts and re-deploy the overcloud. For example, to scale your overcloud to 5 Compute nodes, edit the ComputeCount parameter:

parameter_defaults:
  ...
  ComputeCount: 5
  ...
Rerun the deployment command with the updated file, which in this example is called node-info.yaml:

(undercloud) $ openstack overcloud deploy --templates -e /home/stack/templates/node-info.yaml [OTHER_OPTIONS]
Ensure you include all environment files and options from your initial overcloud creation. This includes the same scale parameters for non-Compute nodes.
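For example, a node-info.yaml that pins the count for every role might look like the following. The role names and counts here are illustrative; use the roles and values from your own deployment:

```
parameter_defaults:
  ControllerCount: 3
  ComputeCount: 5
  CephStorageCount: 3
  ObjectStorageCount: 3
```

Keeping all role counts in one file makes it harder to accidentally scale a role down by omitting its parameter on a later deploy run.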
- Wait until the deployment operation completes.
12.3. Removing or replacing a Compute node
In some situations you need to remove a Compute node from the overcloud. For example, you might need to replace a problematic Compute node. When you delete a Compute node, the node's index is added to the denylist by default to prevent the index from being reused during scale-out operations.
You can replace the removed Compute node after you have removed the node from your overcloud deployment.
Prerequisites
The Compute service is disabled on the nodes that you want to remove to prevent the nodes from scheduling new instances. To confirm that the Compute service is disabled, use the following command:
(overcloud)$ openstack compute service list
If the Compute service is not disabled then disable it:
(overcloud)$ openstack compute service set <hostname> nova-compute --disable
Tip: Use the --disable-reason option to add a short explanation of why the service is being disabled. This is useful if you intend to redeploy the Compute service.
The workloads on the Compute nodes have been migrated to other Compute nodes. For more information, see Migrating virtual machine instances between Compute nodes.
If Instance HA is enabled, choose one of the following options:
If the Compute node is accessible, log in to the Compute node as the root user and perform a clean shutdown with the shutdown -h now command.
If the Compute node is not accessible, log in to a Controller node as the root user, disable the STONITH device for the Compute node, and shut down the bare metal node:

[root@controller-0 ~]# pcs stonith disable <stonith_resource_name>
[stack@undercloud ~]$ source stackrc
[stack@undercloud ~]$ openstack baremetal node power off <UUID>
Procedure
Source the undercloud configuration:
(overcloud)$ source ~/stackrc
Identify the UUID of the overcloud stack:
(undercloud)$ openstack stack list
Identify the UUIDs or hostnames of the Compute nodes that you want to delete:
(undercloud)$ openstack server list
Optional: Run the overcloud deploy command with the --update-plan-only option to update the plans with the most recent configurations from the templates. This ensures that the overcloud configuration is up-to-date before you delete any Compute nodes:

$ openstack overcloud deploy --update-plan-only \
  --templates \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
  -e /home/stack/templates/network-environment.yaml \
  -e /home/stack/templates/storage-environment.yaml \
  -e /home/stack/templates/rhel-registration/environment-rhel-registration.yaml \
  [-e |...]
Note: This step is required if you updated the overcloud node denylist. For more information about adding overcloud nodes to the denylist, see Blacklisting nodes.
Delete the Compute nodes from the stack:
$ openstack overcloud node delete --stack <overcloud> \ <node_1> ... [node_n]
Replace <overcloud> with the name or UUID of the overcloud stack.
Replace <node_1>, and optionally all nodes up to [node_n], with the Compute service hostname or UUID of the Compute nodes that you want to delete. Do not use a mix of UUIDs and hostnames; use either only UUIDs or only hostnames.
Note: If the node has already been powered off, this command returns a WARNING message:

Ansible failed, check log at /var/lib/mistral/overcloud/ansible.log
WARNING: Scale-down configuration error. Manual cleanup of some actions may be necessary. Continuing with node removal.

You can ignore this message.
- Wait for the Compute nodes to delete.
Check the status of the overcloud stack when the node deletion is complete:
(undercloud)$ openstack stack list
Table 12.2. Result
Status | Description |
UPDATE_COMPLETE | The delete operation completed successfully. |
UPDATE_FAILED | The delete operation failed. |
A common reason for a failed delete operation is an unreachable IPMI interface on the node that you want to remove.
When the delete operation fails, you must manually remove the Compute node. For more information, see Removing a Compute node manually.
If Instance HA is enabled, perform the following actions:
Clean up the Pacemaker resources for the node:
$ sudo pcs resource delete <scaled_down_node>
$ sudo cibadmin -o nodes --delete --xml-text '<node id="<scaled_down_node>"/>'
$ sudo cibadmin -o fencing-topology --delete --xml-text '<fencing-level target="<scaled_down_node>"/>'
$ sudo cibadmin -o status --delete --xml-text '<node_state id="<scaled_down_node>"/>'
$ sudo cibadmin -o status --delete-all --xml-text '<node id="<scaled_down_node>"/>' --force
Delete the STONITH device for the node:
$ sudo pcs stonith delete <device-name>
If you are not replacing the removed Compute nodes on the overcloud, then decrease the ComputeCount parameter in the environment file that contains your node counts. This file is usually named node-info.yaml. For example, decrease the node count from four nodes to three nodes if you removed one node:

parameter_defaults:
  ...
  ComputeCount: 3

Decreasing the node count ensures that director does not provision any new nodes when you run openstack overcloud deploy.
If you are replacing the removed Compute node on your overcloud deployment, see Replacing a removed Compute node.
12.3.1. Removing a Compute node manually
If the openstack overcloud node delete command failed due to an unreachable node, then you must manually complete the removal of the Compute node from the overcloud.
Prerequisites
- Performing the Removing or replacing a Compute node procedure returned a status of UPDATE_FAILED.
Procedure
Identify the UUID of the overcloud stack:
(undercloud)$ openstack stack list
Identify the UUID of the node that you want to manually delete:
(undercloud)$ openstack baremetal node list
Move the node that you want to delete to maintenance mode:
(undercloud)$ openstack baremetal node maintenance set <node_uuid>
- Wait for the Compute service to synchronize its state with the Bare Metal service. This can take up to four minutes.
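Rather than waiting a fixed four minutes, you can poll until the node disappears from the hypervisor list. This is a hedged sketch assuming bash; the commented `openstack` check and hostname are placeholders for your environment:

```shell
# Poll a check command until it succeeds, or give up after a number of tries.
wait_for() {
  local tries=$1; shift
  local n=0
  until "$@"; do
    n=$((n + 1))
    if [ "$n" -ge "$tries" ]; then
      return 1
    fi
    sleep 1   # use 'sleep 10' with tries=24 for a ~4 minute window
  done
}

# Illustrative real-world check (hostname is a placeholder):
#   wait_for 24 sh -c '! openstack hypervisor list -f value | grep -q compute-1'
wait_for 3 true && echo "synchronized"
```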
Source the overcloud configuration:
(undercloud)$ source ~/overcloudrc
Delete the network agents for the node that you deleted:
(overcloud)$ for AGENT in $(openstack network agent list --host <scaled_down_node> -c ID -f value) ; do openstack network agent delete $AGENT ; done
Replace <scaled_down_node> with the name of the node that you want to remove.
Confirm that the Compute service is disabled on the deleted node on the overcloud, to prevent the node from scheduling new instances:
(overcloud)$ openstack compute service list
If the Compute service is not disabled then disable it:
(overcloud)$ openstack compute service set <hostname> nova-compute --disable
Tip: Use the --disable-reason option to add a short explanation of why the service is being disabled. This is useful if you intend to redeploy the Compute service.
Source the undercloud configuration:
(overcloud)$ source ~/stackrc
Delete the Compute node from the stack:
(undercloud)$ openstack overcloud node delete --stack <overcloud> <node>
Replace <overcloud> with the name or UUID of the overcloud stack.
Replace <node> with the Compute service host name or UUID of the Compute node that you want to delete.
Note: If the node has already been powered off, this command returns a WARNING message:

Ansible failed, check log at /var/lib/mistral/overcloud/ansible.log
WARNING: Scale-down configuration error. Manual cleanup of some actions may be necessary. Continuing with node removal.

You can ignore this message.
- Wait for the overcloud node to delete.
Check the status of the overcloud stack when the node deletion is complete:
(undercloud)$ openstack stack list
Table 12.3. Result
Status | Description |
UPDATE_COMPLETE | The delete operation completed successfully. |
UPDATE_FAILED | The delete operation failed. |
If the overcloud node fails to delete while in maintenance mode, then the problem might be with the hardware.
If Instance HA is enabled, perform the following actions:
Clean up the Pacemaker resources for the node:
$ sudo pcs resource delete <scaled_down_node>
$ sudo cibadmin -o nodes --delete --xml-text '<node id="<scaled_down_node>"/>'
$ sudo cibadmin -o fencing-topology --delete --xml-text '<fencing-level target="<scaled_down_node>"/>'
$ sudo cibadmin -o status --delete --xml-text '<node_state id="<scaled_down_node>"/>'
$ sudo cibadmin -o status --delete-all --xml-text '<node id="<scaled_down_node>"/>' --force
Delete the STONITH device for the node:
$ sudo pcs stonith delete <device-name>
If you are not replacing the removed Compute node on the overcloud, then decrease the ComputeCount parameter in the environment file that contains your node counts. This file is usually named node-info.yaml. For example, decrease the node count from four nodes to three nodes if you removed one node:

parameter_defaults:
  ...
  ComputeCount: 3
  ...

Decreasing the node count ensures that director does not provision any new nodes when you run openstack overcloud deploy.
If you are replacing the removed Compute node on your overcloud deployment, see Replacing a removed Compute node.
12.3.2. Replacing a removed Compute node
To replace a removed Compute node on your overcloud deployment, you can register and inspect a new Compute node or re-add the removed Compute node. You must also configure your overcloud to provision the node.
Procedure
Optional: To reuse the index of the removed Compute node, configure the RemovalPoliciesMode and the RemovalPolicies parameters for the role to replace the denylist when a Compute node is removed:

parameter_defaults:
  <RoleName>RemovalPoliciesMode: update
  <RoleName>RemovalPolicies: [{'resource_list': []}]
Replace the removed Compute node:
- To add a new Compute node, register, inspect, and tag the new node to prepare it for provisioning. For more information, see Configuring a basic overcloud.
To re-add a Compute node that you removed manually, remove the node from maintenance mode:
(undercloud)$ openstack baremetal node maintenance unset <node_uuid>
Rerun the openstack overcloud deploy command that you used to deploy the existing overcloud.
Wait until the deployment process completes.
Confirm that director has successfully registered the new Compute node:
(undercloud)$ openstack baremetal node list
If you performed step 1 to set the RemovalPoliciesMode for the role to update, then you must reset the RemovalPoliciesMode for the role to the default value, append, to add the Compute node index to the current denylist when a Compute node is removed:

parameter_defaults:
  <RoleName>RemovalPoliciesMode: append
Rerun the openstack overcloud deploy command that you used to deploy the existing overcloud.
12.4. Replacing Ceph Storage nodes
You can use the director to replace Ceph Storage nodes in a director-created cluster. You can find these instructions in the Deploying an Overcloud with Containerized Red Hat Ceph guide.
12.5. Replacing Object Storage nodes
Follow the instructions in this section to understand how to replace Object Storage nodes while maintaining the integrity of the cluster. In this example, the node overcloud-objectstorage-1 in the Object Storage cluster must be replaced. The goal of the procedure is to add one more node and then remove overcloud-objectstorage-1, effectively replacing it.
Procedure
Increase the Object Storage count using the ObjectStorageCount parameter. This parameter is usually located in node-info.yaml, which is the environment file containing your node counts:

parameter_defaults:
  ObjectStorageCount: 4

The ObjectStorageCount parameter defines the quantity of Object Storage nodes in your environment. In this situation, we scale from 3 to 4 nodes.
Run the deployment command with the updated ObjectStorageCount parameter:

$ source ~/stackrc
(undercloud) $ openstack overcloud deploy --templates -e node-info.yaml ENVIRONMENT_FILES
- After the deployment command completes, the overcloud contains an additional Object Storage node.
Replicate data to the new node. Before removing a node (in this case, overcloud-objectstorage-1), wait for a replication pass to finish on the new node. Check the replication pass progress in the /var/log/swift/swift.log file. When the pass finishes, the Object Storage service should log entries similar to the following example:

Mar 29 08:49:05 localhost object-server: Object replication complete.
Mar 29 08:49:11 localhost container-server: Replication run OVER
Mar 29 08:49:13 localhost account-server: Replication run OVER
To remove the old node from the ring, reduce the ObjectStorageCount parameter to omit the old node. In this case, reduce it to 3:

parameter_defaults:
  ObjectStorageCount: 3
Create a new environment file named remove-object-node.yaml. This file identifies and removes the specified Object Storage node. The following content specifies the removal of overcloud-objectstorage-1:

parameter_defaults:
  ObjectStorageRemovalPolicies: [{'resource_list': ['1']}]
Include both the node-info.yaml and remove-object-node.yaml files in the deployment command:

(undercloud) $ openstack overcloud deploy --templates -e node-info.yaml ENVIRONMENT_FILES -e remove-object-node.yaml
The director deletes the Object Storage node from the overcloud and updates the rest of the nodes on the overcloud to accommodate the node removal.
12.6. Blacklisting nodes
You can exclude overcloud nodes from receiving an updated deployment. This is useful in scenarios where you aim to scale new nodes while excluding existing nodes from receiving an updated set of parameters and resources from the core Heat template collection. In other words, the blacklisted nodes are isolated from the effects of the stack operation.
Use the DeploymentServerBlacklist parameter in an environment file to create a blacklist.
Setting the Blacklist
The DeploymentServerBlacklist parameter is a list of server names. Write a new environment file, or add the parameter value to an existing custom environment file and pass the file to the deployment command:
parameter_defaults:
  DeploymentServerBlacklist:
    - overcloud-compute-0
    - overcloud-compute-1
    - overcloud-compute-2
The server names in the parameter value are the names according to OpenStack Orchestration (heat), not the actual server hostnames.
Include this environment file with your openstack overcloud deploy command:
$ source ~/stackrc
(undercloud) $ openstack overcloud deploy --templates \
  -e server-blacklist.yaml \
  [OTHER OPTIONS]
Heat blacklists any servers in the list from receiving updated Heat deployments. After the stack operation completes, any blacklisted servers remain unchanged. You can also power off or stop the os-collect-config agents during the operation.
- Exercise caution when blacklisting nodes. Only use a blacklist if you fully understand how to apply the requested change with a blacklist in effect. It is possible to create a hung stack or configure the overcloud incorrectly using the blacklist feature. For example, if a cluster configuration change applies to all members of a Pacemaker cluster, blacklisting a Pacemaker cluster member during this change can cause the cluster to fail.
- Do not use the blacklist during update or upgrade procedures. Those procedures have their own methods for isolating changes to particular servers. See the Upgrading Red Hat OpenStack Platform documentation for more information.
- When you add servers to the blacklist, further changes to those nodes are not supported until you remove the servers from the blacklist. This includes updates, upgrades, scale up, scale down, and node replacement. For example, when you blacklist existing Compute nodes while scaling out the overcloud with new Compute nodes, the blacklisted nodes miss the information added to /etc/hosts and /etc/ssh/ssh_known_hosts. This can cause live migration to fail, depending on the destination host. The Compute nodes are updated with the information added to /etc/hosts and /etc/ssh/ssh_known_hosts during the next overcloud deployment where they are no longer blacklisted. Do not modify the /etc/hosts and /etc/ssh/ssh_known_hosts files manually. To modify the /etc/hosts and /etc/ssh/ssh_known_hosts files, run the overcloud deploy command as described in the Clearing the Blacklist section.
Clearing the Blacklist
To clear the blacklist for subsequent stack operations, edit the DeploymentServerBlacklist to use an empty array:
parameter_defaults:
  DeploymentServerBlacklist: []
Do not just omit the DeploymentServerBlacklist parameter. If you omit the parameter, the overcloud deployment uses the previously saved value.