8.4. Replacing Controller Nodes
In certain circumstances, a Controller node in a high availability cluster might fail. In these situations, you must remove the node from the cluster and replace it with a new Controller node. This includes ensuring that the new node connects to the other nodes in the cluster.
This section provides instructions on how to replace a Controller node. The process involves running the openstack overcloud deploy command to update the Overcloud with a request to replace a Controller node. Note that this process is not completely automatic; during the Overcloud stack update process, the openstack overcloud deploy command reports a failure and halts the Overcloud stack update. At this point, the process requires some manual intervention, after which the openstack overcloud deploy process can continue.
Important
The following procedure applies only to high availability environments. Do not use this procedure if you are using only one Controller node.
8.4.1. Preliminary Checks
Before attempting to replace an Overcloud Controller node, it is important to check the current state of your Red Hat OpenStack Platform environment. Checking the current state can help avoid complications during the Controller replacement process. Use the following list of preliminary checks to determine if it is safe to perform a Controller node replacement. Run all commands for these checks on the Undercloud.
- Check the current status of the overcloud stack on the Undercloud:
$ source stackrc
$ heat stack-list --show-nested
The overcloud stack and its subsequent child stacks should have either a CREATE_COMPLETE or UPDATE_COMPLETE status.
- Perform a backup of the Undercloud databases:
$ mkdir /home/stack/backup
$ sudo mysqldump --all-databases --quick --single-transaction | gzip > /home/stack/backup/dump_db_undercloud.sql.gz
$ sudo systemctl stop openstack-ironic-api.service openstack-ironic-conductor.service openstack-ironic-discoverd.service openstack-ironic-discoverd-dnsmasq.service
$ sudo cp /var/lib/ironic-discoverd/inspector.sqlite /home/stack/backup
$ sudo systemctl start openstack-ironic-api.service openstack-ironic-conductor.service openstack-ironic-discoverd.service openstack-ironic-discoverd-dnsmasq.service
- Check that your Undercloud has 10 GB of free storage to accommodate image caching and conversion when provisioning the new node (see the example check after this list).
- Check the status of Pacemaker on the running Controller nodes. For example, if 192.168.0.47 is the IP address of a running Controller node, use the following command to get the Pacemaker status:
$ ssh heat-admin@192.168.0.47 'sudo pcs status'
The output should show all services running on the existing nodes and stopped on the failed node.
- Check the following parameters on each node of the Overcloud's MariaDB cluster:
wsrep_local_state_comment: Synced
wsrep_cluster_size: 2
Use the following command to check these parameters on each running Controller node (respectively using 192.168.0.47 and 192.168.0.46 for IP addresses):
$ for i in 192.168.0.47 192.168.0.46 ; do echo "*** $i ***" ; ssh heat-admin@$i "sudo mysql --exec=\"SHOW STATUS LIKE 'wsrep_local_state_comment'\" ; sudo mysql --exec=\"SHOW STATUS LIKE 'wsrep_cluster_size'\""; done
- Check the RabbitMQ status. For example, if 192.168.0.47 is the IP address of a running Controller node, use the following command to get the status:
$ ssh heat-admin@192.168.0.47 "sudo rabbitmqctl cluster_status"
The running_nodes key should only show the two available nodes and not the failed node.
- Disable fencing, if enabled. For example, if 192.168.0.47 is the IP address of a running Controller node, use the following command to disable fencing:
$ ssh heat-admin@192.168.0.47 "sudo pcs property set stonith-enabled=false"
Check the fencing status with the following command:
$ ssh heat-admin@192.168.0.47 "sudo pcs property show stonith-enabled"
- Check the nova-compute service on the director node:
$ sudo systemctl status openstack-nova-compute
$ nova hypervisor-list
The output should show all non-maintenance mode nodes as up.
- Make sure all Undercloud services are running:
$ sudo systemctl -t service
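For the free storage check mentioned earlier in this list, a simple approach is to inspect the free space on the Undercloud file systems. This is only a sketch; the relevant mount points depend on how your Undercloud is partitioned:
$ df -h /var /home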
8.4.2. Node Replacement
Identify the index of the node to remove. The node index is the suffix on the instance name from the nova list output:
[stack@director ~]$ nova list
+--------------------------------------+------------------------+
| ID | Name |
+--------------------------------------+------------------------+
| 861408be-4027-4f53-87a6-cd3cf206ba7a | overcloud-compute-0 |
| 0966e9ae-f553-447a-9929-c4232432f718 | overcloud-compute-1 |
| 9c08fa65-b38c-4b2e-bd47-33870bff06c7 | overcloud-compute-2 |
| a7f0f5e1-e7ce-4513-ad2b-81146bc8c5af | overcloud-controller-0 |
| cfefaf60-8311-4bc3-9416-6a824a40a9ae | overcloud-controller-1 |
| 97a055d4-aefd-481c-82b7-4a5f384036d2 | overcloud-controller-2 |
+--------------------------------------+------------------------+
In this example, the aim is to remove the overcloud-controller-1 node and replace it with overcloud-controller-3. First, set the node into maintenance mode so the director does not reprovision the failed node. Correlate the instance ID from nova list with the node ID from ironic node-list:
[stack@director ~]$ ironic node-list
+--------------------------------------+------+--------------------------------------+
| UUID | Name | Instance UUID |
+--------------------------------------+------+--------------------------------------+
| 36404147-7c8a-41e6-8c72-a6e90afc7584 | None | 7bee57cf-4a58-4eaf-b851-2a8bf6620e48 |
| 91eb9ac5-7d52-453c-a017-c0e3d823efd0 | None | None |
| 75b25e9a-948d-424a-9b3b-f0ef70a6eacf | None | None |
| 038727da-6a5c-425f-bd45-fda2f4bd145b | None | 763bfec2-9354-466a-ae65-2401c13e07e5 |
| dc2292e6-4056-46e0-8848-d6e96df1f55d | None | 2017b481-706f-44e1-852a-2ee857c303c4 |
| c7eadcea-e377-4392-9fc3-cf2b02b7ec29 | None | 5f73c7d7-4826-49a5-b6be-8bfd558f3b41 |
| da3a8d19-8a59-4e9d-923a-6a336fe10284 | None | cfefaf60-8311-4bc3-9416-6a824a40a9ae |
| 807cb6ce-6b94-4cd1-9969-5c47560c2eee | None | c07c13e6-a845-4791-9628-260110829c3a |
+--------------------------------------+------+--------------------------------------+
Set the node into maintenance mode:
[stack@director ~]$ ironic node-set-maintenance da3a8d19-8a59-4e9d-923a-6a336fe10284 true
Tag the new node with the control profile:
[stack@director ~]$ ironic node-update 75b25e9a-948d-424a-9b3b-f0ef70a6eacf add properties/capabilities='profile:control,boot_option:local'
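Optionally, confirm that the capability was applied by inspecting the node and checking that its properties field includes profile:control. This is only a quick sanity check using the same node UUID as above:
[stack@director ~]$ ironic node-show 75b25e9a-948d-424a-9b3b-f0ef70a6eacf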
Create a YAML file (~/templates/remove-controller.yaml) that defines the node index to remove:
parameters:
  ControllerRemovalPolicies:
    [{'resource_list': ['1']}]
Important
If replacing the node with index 0, edit the heat templates and change the bootstrap node index and node validation index before starting replacement. Create a copy of the director's Heat template collection (see Chapter 10, Creating Custom Configuration) and run the following command on the overcloud-without-mergepy.yaml file:
$ sudo sed -i "s/resource\.0/resource.1/g" ~/templates/my-overcloud/overcloud-without-mergepy.yaml
This changes the node index for the following resources:
ControllerBootstrapNodeConfig:
  type: OS::TripleO::BootstrapNode::SoftwareConfig
  properties:
    bootstrap_nodeid: {get_attr: [Controller, resource.0.hostname]}
    bootstrap_nodeid_ip: {get_attr: [Controller, resource.0.ip_address]}
And:
AllNodesValidationConfig:
  type: OS::TripleO::AllNodes::Validation
  properties:
    PingTestIps:
      list_join:
      - ' '
      - - {get_attr: [Controller, resource.0.external_ip_address]}
        - {get_attr: [Controller, resource.0.internal_api_ip_address]}
        - {get_attr: [Controller, resource.0.storage_ip_address]}
        - {get_attr: [Controller, resource.0.storage_mgmt_ip_address]}
        - {get_attr: [Controller, resource.0.tenant_ip_address]}
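Optionally, verify that the substitution succeeded before redeploying. This is only a quick check; the path assumes the copied template collection used in the sed command above:
$ grep -n "resource\.1" ~/templates/my-overcloud/overcloud-without-mergepy.yaml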
After identifying the node index, redeploy the Overcloud and include the remove-controller.yaml environment file:
[stack@director ~]$ openstack overcloud deploy --templates --control-scale 3 -e ~/templates/remove-controller.yaml [OTHER OPTIONS]
Important
If you passed any extra environment files or options when you created the Overcloud, pass them again here to avoid making undesired changes to the Overcloud.
However, note that the -e ~/templates/remove-controller.yaml option is only required for this single run of the command.
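For example, if the original deployment included additional environment files, the command might look like the following. The network-environment.yaml and storage-environment.yaml names are illustrative placeholders; substitute the environment files you actually used:
[stack@director ~]$ openstack overcloud deploy --templates --control-scale 3 \
  -e ~/templates/network-environment.yaml \
  -e ~/templates/storage-environment.yaml \
  -e ~/templates/remove-controller.yaml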
The director removes the old node, creates a new one, and updates the Overcloud stack. You can check the status of the Overcloud stack with the following command:
[stack@director ~]$ heat stack-list --show-nested
8.4.3. Manual Intervention
During the ControllerNodesPostDeployment stage, the Overcloud stack update halts with an UPDATE_FAILED error at ControllerLoadBalancerDeployment_Step1. This is because some Puppet modules do not support node replacement. This point in the process requires some manual intervention. Follow these configuration steps:
- Get a list of IP addresses for the Controller nodes. For example:
[stack@director ~]$ nova list
... +------------------------+ ... +-------------------------+
... | Name                   | ... | Networks                |
... +------------------------+ ... +-------------------------+
... | overcloud-compute-0    | ... | ctlplane=192.168.0.44   |
... | overcloud-controller-0 | ... | ctlplane=192.168.0.47   |
... | overcloud-controller-2 | ... | ctlplane=192.168.0.46   |
... | overcloud-controller-3 | ... | ctlplane=192.168.0.48   |
... +------------------------+ ... +-------------------------+
- Check the nodeid value of the removed node in the /etc/corosync/corosync.conf file on an existing node. For example, the existing node is overcloud-controller-0 at 192.168.0.47:
[stack@director ~]$ ssh heat-admin@192.168.0.47 "sudo cat /etc/corosync/corosync.conf"
This displays a nodelist that contains the ID for the removed node (overcloud-controller-1):
nodelist {
  node {
    ring0_addr: overcloud-controller-0
    nodeid: 1
  }
  node {
    ring0_addr: overcloud-controller-1
    nodeid: 2
  }
  node {
    ring0_addr: overcloud-controller-2
    nodeid: 3
  }
}
Note the nodeid value of the removed node for later. In this example, it is 2.
- Delete the failed node from the Corosync configuration on each node and restart Corosync. For this example, log in to overcloud-controller-0 and overcloud-controller-2 and run the following commands:
[stack@director ~]$ ssh heat-admin@192.168.0.47 "sudo pcs cluster localnode remove overcloud-controller-1"
[stack@director ~]$ ssh heat-admin@192.168.0.47 "sudo pcs cluster reload corosync"
[stack@director ~]$ ssh heat-admin@192.168.0.46 "sudo pcs cluster localnode remove overcloud-controller-1"
[stack@director ~]$ ssh heat-admin@192.168.0.46 "sudo pcs cluster reload corosync"
- Log in to one of the remaining nodes and delete the node from the cluster with the crm_node command:
[stack@director ~]$ ssh heat-admin@192.168.0.47
[heat-admin@overcloud-controller-0 ~]$ sudo crm_node -R overcloud-controller-1 --force
Stay logged into this node.
- Delete the failed node from the RabbitMQ cluster:
[heat-admin@overcloud-controller-0 ~]$ sudo rabbitmqctl forget_cluster_node rabbit@overcloud-controller-1
- Delete the failed node from MongoDB. First, find the IP address for the node's Internal API connection:
[heat-admin@overcloud-controller-0 ~]$ sudo netstat -tulnp | grep 27017
tcp        0      0 192.168.0.47:27017      0.0.0.0:*               LISTEN      13415/mongod
Check whether the node is the primary of the replica set:
[root@overcloud-controller-0 ~]# echo "db.isMaster()" | mongo --host 192.168.0.47:27017
MongoDB shell version: 2.6.11
connecting to: 192.168.0.47:27017/echo
{
    "setName" : "tripleo",
    "setVersion" : 1,
    "ismaster" : true,
    "secondary" : false,
    "hosts" : [
        "192.168.0.47:27017",
        "192.168.0.46:27017",
        "192.168.0.45:27017"
    ],
    "primary" : "192.168.0.47:27017",
    "me" : "192.168.0.47:27017",
    "electionId" : ObjectId("575919933ea8637676159d28"),
    "maxBsonObjectSize" : 16777216,
    "maxMessageSizeBytes" : 48000000,
    "maxWriteBatchSize" : 1000,
    "localTime" : ISODate("2016-06-09T09:02:43.340Z"),
    "maxWireVersion" : 2,
    "minWireVersion" : 0,
    "ok" : 1
}
bye
This should indicate if the current node is the primary. If not, use the IP address of the node indicated in the primary key.
Connect to MongoDB on the primary node:
[heat-admin@overcloud-controller-0 ~]$ mongo --host 192.168.0.47
MongoDB shell version: 2.6.9
connecting to: 192.168.0.47:27017/test
Welcome to the MongoDB shell.
For interactive help, type "help".
For more comprehensive documentation, see
    http://docs.mongodb.org/
Questions? Try the support group
    http://groups.google.com/group/mongodb-user
tripleo:PRIMARY>
Check the status of the MongoDB cluster:
tripleo:PRIMARY> rs.status()
Identify the node using the _id key and remove the failed node using the name key. In this case, we remove Node 1, which has 192.168.0.45:27017 for name:
tripleo:PRIMARY> rs.remove('192.168.0.45:27017')
Important
You must run the command against the PRIMARY replica set. If you see the following message:
"replSetReconfig command must be sent to the current replica set primary."
Log in to MongoDB again on the node designated as PRIMARY.
Note
The following output is normal after removing the failed node from the replica set:
2016-05-07T03:57:19.541+0000 DBClientCursor::init call() failed
2016-05-07T03:57:19.543+0000 Error: error doing query: failed at src/mongo/shell/query.js:81
2016-05-07T03:57:19.545+0000 trying reconnect to 192.168.0.47:27017 (192.168.0.47) failed
2016-05-07T03:57:19.547+0000 reconnect 192.168.0.47:27017 (192.168.0.47) ok
Exit MongoDB:
tripleo:PRIMARY> exit
- Update the list of nodes in the Galera cluster:
[heat-admin@overcloud-controller-0 ~]$ sudo pcs resource update galera wsrep_cluster_address=gcomm://overcloud-controller-0,overcloud-controller-3,overcloud-controller-2
- Add the new node to the cluster:
[heat-admin@overcloud-controller-0 ~]$ sudo pcs cluster node add overcloud-controller-3
- Check the /etc/corosync/corosync.conf file on each node. If the nodeid of the new node is the same as that of the removed node, update it to an unused nodeid value. For example, the /etc/corosync/corosync.conf file contains an entry for the new node (overcloud-controller-3):
nodelist {
  node {
    ring0_addr: overcloud-controller-0
    nodeid: 1
  }
  node {
    ring0_addr: overcloud-controller-2
    nodeid: 3
  }
  node {
    ring0_addr: overcloud-controller-3
    nodeid: 2
  }
}
Note that in this example, the new node uses the same nodeid as the removed node. Update this value to an unused node ID value. For example:
node {
  ring0_addr: overcloud-controller-3
  nodeid: 4
}
Update this nodeid value on each Controller node's /etc/corosync/corosync.conf file, including the new node.
- Restart the Corosync service on the existing nodes only. For example, on overcloud-controller-0:
[heat-admin@overcloud-controller-0 ~]$ sudo pcs cluster reload corosync
And on overcloud-controller-2:
[heat-admin@overcloud-controller-2 ~]$ sudo pcs cluster reload corosync
Do not run this command on the new node.
- Start the new Controller node:
[heat-admin@overcloud-controller-0 ~]$ sudo pcs cluster start overcloud-controller-3
- Enable the keystone service on the new node. Copy the /etc/keystone directory from a remaining node to the director host:
[heat-admin@overcloud-controller-0 ~]$ sudo -i
[root@overcloud-controller-0 ~]$ scp -r /etc/keystone stack@192.168.0.1:~/.
Log in to the new Controller node. Remove the /etc/keystone directory from the new Controller node and copy the keystone files from the director host:
[heat-admin@overcloud-controller-3 ~]$ sudo -i
[root@overcloud-controller-3 ~]$ rm -rf /etc/keystone
[root@overcloud-controller-3 ~]$ scp -r stack@192.168.0.1:~/keystone /etc/.
[root@overcloud-controller-3 ~]$ chown -R keystone: /etc/keystone
[root@overcloud-controller-3 ~]$ chown root /etc/keystone/logging.conf /etc/keystone/default_catalog.templates
Edit /etc/keystone/keystone.conf and set the admin_bind_host and public_bind_host parameters to the new Controller node's IP addresses. To find these IP addresses, use the ip addr command and look for the IP address within the following networks:
admin_bind_host - Provisioning network
public_bind_host - Internal API network
A sample of the resulting configuration is shown after this procedure.
Note
These networks might differ if you deployed the Overcloud using a custom ServiceNetMap parameter.
For example, if the Provisioning network uses the 192.168.0.0/24 subnet and the Internal API network uses the 172.17.0.0/24 subnet, use the following commands to find the node's IP addresses on those networks:
[root@overcloud-controller-3 ~]$ ip addr | grep "192\.168\.0\..*/24"
[root@overcloud-controller-3 ~]$ ip addr | grep "172\.17\.0\..*/24"
- Enable and restart some services through Pacemaker. The cluster is currently in maintenance mode and you need to temporarily disable maintenance mode to enable the services. For example:
[heat-admin@overcloud-controller-3 ~]$ sudo pcs property set maintenance-mode=false --wait
- Wait until the Galera service starts on all nodes:
[heat-admin@overcloud-controller-3 ~]$ sudo pcs status | grep galera -A1
 Master/Slave Set: galera-master [galera]
     Masters: [ overcloud-controller-0 overcloud-controller-2 overcloud-controller-3 ]
If need be, perform a cleanup on the new node:
[heat-admin@overcloud-controller-3 ~]$ sudo pcs resource cleanup galera overcloud-controller-3
- Wait until the Keystone service starts on all nodes:
[heat-admin@overcloud-controller-3 ~]$ sudo pcs status | grep keystone -A1
 Clone Set: openstack-keystone-clone [openstack-keystone]
     Started: [ overcloud-controller-0 overcloud-controller-2 overcloud-controller-3 ]
If need be, perform a cleanup on the new node:
[heat-admin@overcloud-controller-3 ~]$ sudo pcs resource cleanup openstack-keystone-clone overcloud-controller-3
- Switch the cluster back into maintenance mode:
[heat-admin@overcloud-controller-3 ~]$ sudo pcs property set maintenance-mode=true --wait
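The keystone.conf edit described earlier in this procedure might end up looking like the following sketch. The Provisioning address (192.168.0.48) is the example value used throughout this section, while 172.17.0.12 is only a placeholder for the new node's Internal API address; the configuration section that holds these parameters can differ between keystone releases, so compare against the file on an existing Controller node:
# Excerpt from /etc/keystone/keystone.conf on overcloud-controller-3 (illustrative values)
admin_bind_host = 192.168.0.48
public_bind_host = 172.17.0.12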
The manual configuration is complete. Re-run the Overcloud deployment command to continue the stack update:
[stack@director ~]$ openstack overcloud deploy --templates --control-scale 3 [OTHER OPTIONS]
Important
If you passed any extra environment files or options when you created the Overcloud, pass them again here to avoid making undesired changes to the Overcloud.
However, note that the remove-controller.yaml file is no longer needed.
8.4.4. Finalizing Overcloud Services
After the Overcloud stack update completes, some final configuration is required. Log in to one of the Controller nodes and refresh any stopped services in Pacemaker:
[heat-admin@overcloud-controller-0 ~]$ for i in `sudo pcs status|grep -B2 Stop |grep -v "Stop\|Start"|awk -F"[" '/\[/ {print substr($NF,0,length($NF)-1)}'`; do echo $i; sudo pcs resource cleanup $i; done
Perform a final status check to make sure services are running correctly:
[heat-admin@overcloud-controller-0 ~]$ sudo pcs status
Note
If any services have failed, use the pcs resource cleanup command to restart them after resolving the underlying issue.
Enable fencing if you disabled it during the node replacement. For example, if 192.168.0.47 is the IP address of a running Controller node, use the following command to enable fencing:
[heat-admin@overcloud-controller-0 ~]$ sudo pcs property set stonith-enabled=true
Exit the Controller node and return to the director:
[heat-admin@overcloud-controller-0 ~]$ exit
8.4.5. Finalizing Overcloud Network Agents
Source the overcloudrc file so that you can interact with the Overcloud. Check your routers to make sure the L3 agents are properly hosting the routers in your Overcloud environment. In this example, we use a router with the name r1:
[stack@director ~]$ source ~/overcloudrc
[stack@director ~]$ neutron l3-agent-list-hosting-router r1
This list might still show the old node instead of the new node. To replace it, list the L3 network agents in your environment:
[stack@director ~]$ neutron agent-list | grep "neutron-l3-agent"
Identify the UUID for the agents on the new node and the old node. Add the router to the agent on the new node and remove the router from the old node. For example:
[stack@director ~]$ neutron l3-agent-router-add fd6b3d6e-7d8c-4e1a-831a-4ec1c9ebb965 r1
[stack@director ~]$ neutron l3-agent-router-remove b40020af-c6dd-4f7a-b426-eba7bac9dbc2 r1
Perform a final check on the router and make sure all agents are active:
[stack@director ~]$ neutron l3-agent-list-hosting-router r1
Delete the existing Neutron agents that point to the old Controller node. For example:
[stack@director ~]$ neutron agent-list -F id -F host | grep overcloud-controller-1
| ddae8e46-3e8e-4a1b-a8b3-c87f13c294eb | overcloud-controller-1.localdomain |
[stack@director ~]$ neutron agent-delete ddae8e46-3e8e-4a1b-a8b3-c87f13c294eb
8.4.6. Finalizing Compute Services
Compute services for the removed node still exist in the Overcloud and require removal. Source the overcloudrc file so that you can interact with the Overcloud. Check the compute services for the removed node:
[stack@director ~]$ source ~/overcloudrc
[stack@director ~]$ nova service-list | grep "overcloud-controller-1.localdomain"
Remove the compute services for the node. For example, if the nova-scheduler service for overcloud-controller-1.localdomain has an ID of 5, run the following command:
[stack@director ~]$ nova service-delete 5
Perform this task for each service of the removed node.
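If several services are listed for the removed node, a small loop can remove them in one pass. This is only a sketch; it assumes the service ID is the first column in the nova service-list table, so verify the parsed IDs before running the delete:
[stack@director ~]$ for id in $(nova service-list | grep "overcloud-controller-1.localdomain" | awk -F '|' '{print $2}'); do nova service-delete $id; done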
Check the openstack-nova-consoleauth service on the new node:
[stack@director ~]$ nova service-list | grep consoleauth
If the service is not running, log into a Controller node and restart the service:
[stack@director ~]$ ssh heat-admin@192.168.0.47
[heat-admin@overcloud-controller-0 ~]$ sudo pcs resource restart openstack-nova-consoleauth
8.4.7. Conclusion
The failed Controller node and its related services are now replaced with a new node.