11.6. Troubleshooting the Overcloud after Creation
After creating your Overcloud, you might want to perform certain operations on it in the future, such as scaling the available nodes or replacing faulty nodes. Issues might arise when performing these operations. This section provides advice for diagnosing and troubleshooting failed post-creation operations.
11.6.1. Overcloud Stack Modifications
Problems can occur when modifying the overcloud stack through the director. Examples of stack modifications include:
- Scaling Nodes
- Removing Nodes
- Replacing Nodes
Modifying the stack is similar to the process of creating the stack, in that the director checks the availability of the requested number of nodes, provisions additional nodes or removes existing nodes, and then applies the Puppet configuration. Here are some guidelines to follow when modifying the overcloud stack.
As an initial step, follow the advice in Section 11.3, “Troubleshooting Overcloud Creation”. These same steps can help diagnose problems with updating the Overcloud heat stack. In particular, use the following commands to help identify problematic resources:
heat stack-list --show-nested
- List all stacks. The --show-nested displays all child stacks and their respective parent stacks. This command helps identify the point where a stack failed.
heat resource-list overcloud
- List all resources in the overcloud stack and their current states. This helps identify which resource is causing failures in the stack. You can trace this resource failure to its respective parameters and configuration in the heat template collection and the Puppet modules.
heat event-list overcloud
- List all events related to the overcloud stack in chronological order. This includes the initiation, completion, and failure of all resources in the stack. This helps identify points of resource failure.
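For example, when diagnosing a failed stack update, you can filter the output of these commands for failed items. This is only an illustrative workflow; the exact output depends on your environment:
$ heat stack-list --show-nested | grep -i failed
$ heat resource-list overcloud | grep -i failed
$ heat event-list overcloud | grep -i failed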
The next few sections provide advice to diagnose issues on specific node types.
11.6.2. Controller Service Failures
The Overcloud Controller nodes contain the bulk of Red Hat OpenStack Platform services. You might also use multiple Controller nodes in a high availability cluster. If a service on a node is faulty, the high availability cluster provides a level of failover. However, it then becomes necessary to diagnose the faulty service to ensure your Overcloud operates at full capacity.
The Controller nodes use Pacemaker to manage the resources and services in the high availability cluster. The Pacemaker Configuration System (pcs) command is a tool that manages a Pacemaker cluster. Run this command on a Controller node in the cluster to perform configuration and monitoring functions. Here are a few commands to help troubleshoot Overcloud services on a high availability cluster:
pcs status
- Provides a status overview of the entire cluster including enabled resources, failed resources, and online nodes.
pcs resource show
- Shows a list of resources and their respective nodes.
pcs resource disable [resource]
- Stops a particular resource.
pcs resource enable [resource]
- Starts a particular resource.
pcs cluster standby [node]
- Places a node in standby mode. The node is no longer available in the cluster. This is useful for performing maintenance on a specific node without affecting the cluster.
pcs cluster unstandby [node]
- Removes a node from standby mode. The node becomes available in the cluster again.
Use these Pacemaker commands to identify the faulty component or node. After identifying the component, view the respective component log file in /var/log/.
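For example, a minimal maintenance workflow using these commands might be to check the cluster, place the node in standby, perform the maintenance, and then return the node to the cluster. The node name overcloud-controller-0 is a hypothetical example; substitute the node names that pcs status reports in your cluster:
$ sudo pcs status
$ sudo pcs cluster standby overcloud-controller-0
$ sudo pcs cluster unstandby overcloud-controller-0
$ sudo pcs status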
11.6.3. Compute Service Failures
Compute nodes use the Compute service to perform hypervisor-based operations. This means the main diagnosis for Compute nodes revolves around this service. For example:
- View the status of the service using the following systemd function:
  $ sudo systemctl status openstack-nova-compute.service
  Likewise, view the systemd journal for the service using the following command:
  $ sudo journalctl -u openstack-nova-compute.service
- The primary log file for Compute nodes is /var/log/nova/nova-compute.log. If issues occur with Compute node communication, this log file is usually a good place to start a diagnosis.
- If performing maintenance on the Compute node, migrate the existing instances from the host to an operational Compute node, then disable the node. See Section 8.9, “Migrating VMs from an Overcloud Compute Node” for more information on node migrations.
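For example, a quick first-pass check on a Compute node might combine the service status, recent journal entries, and the end of the log file. The one-hour window and line count are only illustrative values:
$ sudo systemctl status openstack-nova-compute.service
$ sudo journalctl -u openstack-nova-compute.service --since "1 hour ago"
$ sudo tail -n 50 /var/log/nova/nova-compute.log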
11.6.4. Ceph Storage Service Failures
For any issues that occur with Red Hat Ceph Storage clusters, see Part X. Logging and Debugging in the Red Hat Ceph Storage Configuration Guide. That part of the guide provides information on diagnosing logs for all Ceph Storage services.