Chapter 11. Troubleshooting Director Issues
Errors can occur at certain stages of the director's processes. This section provides some information about diagnosing common problems.
Note the common logs for the director’s components:
- The /var/log directory contains logs for many common OpenStack Platform components as well as logs for standard Red Hat Enterprise Linux applications. The journald service provides logs for various components. Note that ironic uses two units: openstack-ironic-api and openstack-ironic-conductor. Likewise, ironic-inspector also uses two units: openstack-ironic-inspector and openstack-ironic-inspector-dnsmasq. Use both units for each respective component. For example:

$ sudo journalctl -u openstack-ironic-inspector -u openstack-ironic-inspector-dnsmasq

- ironic-inspector also stores the ramdisk logs in /var/log/ironic-inspector/ramdisk/ as gz-compressed tar files. Filenames contain the date, time, and IPMI address of the node. Use these logs to diagnose introspection issues.
11.1. Troubleshooting Node Registration
Issues with node registration usually arise from incorrect node details. In this case, use ironic to fix problems with the registered node data. Here are a few examples:
Find out the assigned port UUID:
$ ironic node-port-list [NODE UUID]
Update the MAC address:
$ ironic port-update [PORT UUID] replace address=[NEW MAC]
Update the IPMI address:
$ ironic node-update [NODE UUID] replace driver_info/ipmi_address=[NEW IPMI ADDRESS]
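If you do not know a node's UUID, list the registered nodes first and then drill down to its ports. The following is a minimal sketch of the typical correction flow, assuming the node is already registered and the new MAC address is known:

$ ironic node-list
$ ironic node-port-list [NODE UUID]
$ ironic port-update [PORT UUID] replace address=[NEW MAC]
$ ironic node-show [NODE UUID]

The final node-show call is only a sanity check that the updated values now appear in the node's details.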
11.2. Troubleshooting Hardware Introspection
The introspection process must run to completion. However, ironic's discovery daemon (ironic-inspector) times out after a default 1 hour period if the discovery ramdisk provides no response. Sometimes this might indicate a bug in the discovery ramdisk, but usually it happens due to an environment misconfiguration, particularly BIOS boot settings.
Here are some common scenarios where environment misconfiguration occurs and advice on how to diagnose and resolve them.
Errors with Starting Node Introspection
Normally the introspection process uses the baremetal introspection command, which acts as an umbrella command for ironic's services. However, if you run the introspection directly with ironic-inspector, it might fail to discover nodes in the AVAILABLE state, which is meant for deployment and not for discovery. Change the node status to the MANAGEABLE state before discovery:
$ ironic node-set-provision-state [NODE UUID] manage
Then, when discovery completes, change back to AVAILABLE before provisioning:
$ ironic node-set-provision-state [NODE UUID] provide
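When many nodes need the same state change, a short shell loop saves repetition. This is a minimal sketch, assuming every node currently listed as available should move to manageable; adjust the grep pattern if you only want to target specific nodes:

$ for node in $(ironic node-list | grep available | awk '{print $2}'); do ironic node-set-provision-state $node manage; done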
Introspected Node Does Not Boot over PXE
Before a node reboots, ironic-inspector adds the MAC address of the node to the undercloud firewall's ironic-inspector chain. This allows the node to boot over PXE. To verify the correct configuration, run the following command:
$ sudo iptables -L
The output should display the following chain table with the MAC address:
Chain ironic-inspector (1 references)
target     prot opt source               destination
DROP       all  --  anywhere             anywhere             MAC xx:xx:xx:xx:xx:xx
ACCEPT     all  --  anywhere             anywhere
If the MAC address is not there, the most common cause is corruption in the ironic-inspector cache, which is an SQLite database. To fix it, delete the SQLite file:
$ sudo rm /var/lib/ironic-inspector/inspector.sqlite
And recreate it:
$ sudo ironic-inspector-dbsync --config-file /etc/ironic-inspector/inspector.conf upgrade
$ sudo systemctl restart openstack-ironic-inspector
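After the restart, start the introspection again and verify that the chain repopulates. A hedged check that lists only the inspector chain:

$ sudo iptables -L ironic-inspector -n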
Stopping the Discovery Process
Currently ironic-inspector does not provide a direct means for stopping discovery. The recommended approach is to wait until the process times out. If necessary, change the timeout setting in /etc/ironic-inspector/inspector.conf to adjust the timeout period.
In worst-case scenarios, you can stop discovery for all nodes using the following process:
Change the power state of each node to off:
$ ironic node-set-power-state [NODE UUID] off
Remove the ironic-inspector cache:
$ sudo rm /var/lib/ironic-inspector/inspector.sqlite
Resynchronize the ironic-inspector cache:
$ sudo ironic-inspector-dbsync --config-file /etc/ironic-inspector/inspector.conf upgrade
$ sudo systemctl restart openstack-ironic-inspector
Accessing the Introspection Ramdisk
The introspection ramdisk uses a dynamic login element. This means you can provide either a temporary password or an SSH key to access the node during introspection debugging. Use the following process to set up ramdisk access:
Provide a temporary password to the openssl passwd -1 command to generate an MD5 hash. For example:
$ openssl passwd -1 mytestpassword
$1$enjRSyIw$/fYUpJwr6abFy/d.koRgQ/
Edit the /httpboot/inspector.ipxe file, find the line starting with kernel, and append the rootpwd parameter and the MD5 hash. For example:
kernel http://192.2.0.1:8088/agent.kernel ipa-inspection-callback-url=http://192.168.0.1:5050/v1/continue ipa-inspection-collectors=default,extra-hardware,logs systemd.journald.forward_to_console=yes BOOTIF=${mac} ipa-debug=1 ipa-inspection-benchmarks=cpu,mem,disk rootpwd="$1$enjRSyIw$/fYUpJwr6abFy/d.koRgQ/" selinux=0
Alternatively, you can append the sshkey parameter with your public SSH key.

Note
Quotation marks are required for both the rootpwd and sshkey parameters.

Start the introspection and find the IP address from either the arp command or the DHCP logs:
$ arp
$ sudo journalctl -u openstack-ironic-inspector-dnsmasq
SSH as a root user with the temporary password or the SSH key.
$ ssh root@192.0.2.105
Checking Introspection Storage
The director uses OpenStack Object Storage (swift) to save the hardware data obtained during the introspection process. If this service is not running, the introspection can fail. Check all services related to OpenStack Object Storage to ensure the service is running:
$ sudo systemctl list-units openstack-swift*
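If the service is running but you still suspect missing or incomplete data, you can try to retrieve the stored introspection data for a node. This is a hedged check that assumes the ironic-inspector client plugin is installed on the undercloud:

$ openstack baremetal introspection data save [NODE UUID]

The command prints the stored hardware data as JSON. If the data cannot be retrieved, investigate the Object Storage service further.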
11.3. Troubleshooting Workflows and Executions
The OpenStack Workflow (mistral) service groups multiple OpenStack tasks into workflows. Red Hat OpenStack Platform uses a set of these workflows to perform common functions across the CLI and web UI. This includes bare metal node control, validations, plan management, and overcloud deployment.
For example, when running the openstack overcloud deploy command, the OpenStack Workflow service executes two workflows. The first one uploads the deployment plan:
Removing the current plan files
Uploading new plan files
Started Mistral Workflow. Execution ID: aef1e8c6-a862-42de-8bce-073744ed5e6b
Plan updated
The second one starts the overcloud deployment:
Deploying templates in the directory /tmp/tripleoclient-LhRlHX/tripleo-heat-templates
Started Mistral Workflow. Execution ID: 97b64abe-d8fc-414a-837a-1380631c764d
2016-11-28 06:29:26Z [overcloud]: CREATE_IN_PROGRESS  Stack CREATE started
2016-11-28 06:29:26Z [overcloud.Networks]: CREATE_IN_PROGRESS  state changed
2016-11-28 06:29:26Z [overcloud.HeatAuthEncryptionKey]: CREATE_IN_PROGRESS  state changed
2016-11-28 06:29:26Z [overcloud.ServiceNetMap]: CREATE_IN_PROGRESS  state changed
...
Workflow Objects
OpenStack Workflow uses the following objects to keep track of the workflow:
- Actions
- A particular instruction that OpenStack performs once an associated task runs. Examples include running shell scripts or performing HTTP requests. Some OpenStack components have built-in actions that OpenStack Workflow uses.
- Tasks
- Defines the action to run and the result of running the action. Tasks usually have actions or other workflows associated with them. Once a task completes, the workflow moves on to another task, usually depending on whether the previous task succeeded or failed.
- Workflows
- A set of tasks grouped together and executed in a specific order.
- Executions
- A record of a particular action, task, or workflow run.
Workflow Error Diagnosis
OpenStack Workflow also provides robust logging of executions, which helps you identify issues with certain command failures. For example, if a workflow execution fails, you can identify the point of failure. List the workflow executions that have the ERROR state:
$ mistral execution-list | grep "ERROR"
Get the UUID of the failed workflow execution (for example, 3c87a885-0d37-4af8-a471-1b392264a7f5) and view the execution and its output:
$ mistral execution-get 3c87a885-0d37-4af8-a471-1b392264a7f5
$ mistral execution-get-output 3c87a885-0d37-4af8-a471-1b392264a7f5
This provides information about the failed task in the execution. The mistral execution-get command also displays the workflow used for the execution (for example, tripleo.plan_management.v1.update_deployment_plan). You can view the full workflow definition using the following command:
$ mistral workflow-get-definition tripleo.plan_management.v1.update_deployment_plan
This is useful for identifying where in the workflow a particular task occurs.
You can also view action executions and their results using a similar command syntax:
$ mistral action-execution-list
$ mistral action-execution-get b59245bf-7183-4fcf-9508-c83ec1a26908
$ mistral action-execution-get-output b59245bf-7183-4fcf-9508-c83ec1a26908
This is useful for identifying a specific action causing issues.
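To narrow a failed workflow execution down to the individual task that failed, you can also list the tasks for that execution and inspect a specific one. A minimal sketch, reusing the execution UUID from the earlier example:

$ mistral task-list 3c87a885-0d37-4af8-a471-1b392264a7f5
$ mistral task-get [TASK ID]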
11.4. Troubleshooting Overcloud Creation
There are three layers where the deployment can fail:
- Orchestration (heat and nova services)
- Bare Metal Provisioning (ironic service)
- Post-Deployment Configuration (Puppet)
If an overcloud deployment has failed at any of these levels, use the OpenStack clients and service log files to diagnose the failed deployment.
11.4.1. Orchestration
In most cases, Heat shows the failed overcloud stack after the overcloud creation fails:
$ heat stack-list
+-----------------------+------------+--------------------+----------------------+
| id                    | stack_name | stack_status       | creation_time        |
+-----------------------+------------+--------------------+----------------------+
| 7e88af95-535c-4a55... | overcloud  | CREATE_FAILED      | 2015-04-06T17:57:16Z |
+-----------------------+------------+--------------------+----------------------+
If the stack list is empty, this indicates an issue with the initial Heat setup. Check your Heat templates and configuration options, and check for any error messages presented after running openstack overcloud deploy.
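For more detail on why the stack failed, inspect the stack itself and its event history. A short sketch using standard heat client commands:

$ heat stack-show overcloud
$ heat event-list overcloud

The status reason and event output usually point to the layer (orchestration, bare metal provisioning, or post-deployment configuration) where the failure occurred.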
11.4.2. Bare Metal Provisioning
Check ironic to see all registered nodes and their current status:
$ ironic node-list
+----------+------+---------------+-------------+-----------------+-------------+
| UUID     | Name | Instance UUID | Power State | Provision State | Maintenance |
+----------+------+---------------+-------------+-----------------+-------------+
| f1e261...| None | None          | power off   | available       | False       |
| f0b8c1...| None | None          | power off   | available       | False       |
+----------+------+---------------+-------------+-----------------+-------------+
Here are some common issues that arise from the provisioning process.
Review the Provision State and Maintenance columns in the resulting table. Check for the following:
- An empty table, or fewer nodes than you expect
- Maintenance is set to True
- Provision State is set to manageable

This usually indicates an issue with the registration or discovery processes. For example, if Maintenance sets itself to True automatically, the nodes are usually using the wrong power management credentials.
- If Provision State is available, then the problem occurred before bare metal deployment has even started.
- If Provision State is active and Power State is power on, the bare metal deployment has finished successfully. This means that the problem occurred during the post-deployment configuration step.
- If Provision State is wait call-back for a node, the bare metal provisioning process has not yet finished for this node. Wait until this status changes; otherwise, connect to the virtual console of the failed node and check the output.
- If Provision State is error or deploy failed, then bare metal provisioning has failed for this node. Check the bare metal node's details:

$ ironic node-show [NODE UUID]

Look for the last_error field, which contains an error description. If the error message is vague, you can use the logs to clarify it:

$ sudo journalctl -u openstack-ironic-conductor -u openstack-ironic-api

- If you see a wait timeout error and the node Power State is power on, connect to the virtual console of the failed node and check the output.
11.4.3. Post-Deployment Configuration
Many things can occur during the configuration stage. For example, a particular Puppet module could fail to complete due to an issue with the setup. This section provides a process to diagnose such issues.
List all the resources from the overcloud stack to see which one failed:
$ heat resource-list overcloud
This shows a table of all resources and their states. Look for any resources with a CREATE_FAILED status.
Show the failed resource:
$ heat resource-show overcloud [FAILED RESOURCE]
Check for any information in the resource_status_reason field that can help your diagnosis.
Use the nova command to see the IP addresses of the overcloud nodes:
$ nova list
Log in as the heat-admin user to one of the deployed nodes. For example, if the stack's resource list shows that the error occurred on a Controller node, log in to a Controller node. The heat-admin user has sudo access.
$ ssh heat-admin@192.0.2.14
Check the os-collect-config log for a possible reason for the failure:
$ sudo journalctl -u os-collect-config
In some cases, nova fails to deploy the node entirely. This situation is indicated by a failed OS::Heat::ResourceGroup for one of the overcloud role types. Use nova to see the failure in this case:
$ nova list
$ nova show [SERVER ID]
The most common error shown will reference the error message No valid host was found
. See Section 11.6, “Troubleshooting "No Valid Host Found" Errors” for details on troubleshooting this error. In other cases, look at the following log files for further troubleshooting:
- /var/log/nova/*
- /var/log/heat/*
- /var/log/ironic/*
Use the SOS toolset, which gathers information about system hardware and configuration. Use this information for diagnostic purposes and debugging. SOS is commonly used to help support technicians and developers. SOS is useful on both the undercloud and overcloud. Install the sos package:
$ sudo yum install sos
Generate a report:
$ sudo sosreport --all-logs
The post-deployment process for Controller nodes uses five main steps for the deployment:

Step | Description
---|---
Step 1 | Initial load balancing software configuration, including Pacemaker, RabbitMQ, Memcached, Redis, and Galera.
Step 2 | Initial cluster configuration, including Pacemaker configuration, HAProxy, MongoDB, Galera, Ceph Monitor, and database initialization for OpenStack Platform services.
Step 3 | Initial ring build for OpenStack Object Storage (swift).
Step 4 | Configure service startup settings in Pacemaker, including constraints to determine service startup order and service startup parameters.
Step 5 | Initial configuration of projects, roles, and users in OpenStack Identity (keystone).
11.5. Troubleshooting IP Address Conflicts on the Provisioning Network
Discovery and deployment tasks will fail if the destination hosts are allocated an IP address which is already in use. To avoid this issue, you can perform a port scan of the Provisioning network to determine whether the discovery IP range and host IP range are free.
Perform the following steps from the undercloud host:
Install nmap
:
# yum install nmap
Use nmap to scan the IP address range for active addresses. This example scans the 192.0.2.0/24 range; replace this with the IP subnet of the Provisioning network (using CIDR bitmask notation):
# nmap -sn 192.0.2.0/24
Review the output of the nmap scan:
For example, you should see the IP address(es) of the undercloud, and any other hosts that are present on the subnet. If any of the active IP addresses conflict with the IP ranges in undercloud.conf, you will need to either change the IP address ranges or free up the IP addresses before introspecting or deploying the overcloud nodes.
# nmap -sn 192.0.2.0/24

Starting Nmap 6.40 ( http://nmap.org ) at 2015-10-02 15:14 EDT
Nmap scan report for 192.0.2.1
Host is up (0.00057s latency).
Nmap scan report for 192.0.2.2
Host is up (0.00048s latency).
Nmap scan report for 192.0.2.3
Host is up (0.00045s latency).
Nmap scan report for 192.0.2.5
Host is up (0.00040s latency).
Nmap scan report for 192.0.2.9
Host is up (0.00019s latency).
Nmap done: 256 IP addresses (5 hosts up) scanned in 2.45 seconds
11.6. Troubleshooting "No Valid Host Found" Errors
Sometimes the /var/log/nova/nova-conductor.log file contains the following error:
NoValidHost: No valid host was found. There are not enough hosts available.
This means the nova Scheduler could not find a bare metal node suitable for booting the new instance. This in turn usually means a mismatch between resources that nova expects to find and resources that ironic advertised to nova. Check the following in this case:
Make sure introspection succeeded for all nodes. Otherwise, check that each node contains the required ironic node properties. For each node:
$ ironic node-show [NODE UUID]
Check that the properties JSON field has valid values for the keys cpus, cpu_arch, memory_mb, and local_gb.

Check that the nova flavor used does not exceed the ironic node properties above for the required number of nodes:
$ nova flavor-show [FLAVOR NAME]
- Check that sufficient nodes are in the available state according to ironic node-list. Nodes in the manageable state usually indicate a failed introspection.
- Check that the nodes are not in maintenance mode. Use ironic node-list to check. A node automatically changing to maintenance mode usually means incorrect power credentials. Check them and then remove maintenance mode:

$ ironic node-set-maintenance [NODE UUID] off
- If you are using the Automated Health Check (AHC) tools to perform automatic node tagging, check that you have enough nodes corresponding to each flavor/profile. Check the capabilities key in the properties field shown by ironic node-show. For example, a node tagged for the Compute role should contain profile:compute (see the sketch after this list).
- It takes some time for node information to propagate from ironic to nova after introspection. The director's tool usually accounts for it. However, if you performed some steps manually, there might be a short period of time when nodes are not available to nova. Use the following command to check the total resources in your system:
$ nova hypervisor-stats
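To review how the available nodes are tagged, you can loop over them and print the capabilities portion of each node's properties. This is a minimal sketch; adjust the grep context lines if your client wraps the table output differently:

$ for node in $(ironic node-list | grep available | awk '{print $2}'); do echo "== $node"; ironic node-show $node | grep -A2 capabilities; done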
11.7. Troubleshooting the Overcloud after Creation
After creating your overcloud, you might want to perform certain overcloud operations in the future. For example, you might aim to scale your available nodes, or replace faulty nodes. Certain issues might arise when performing these operations. This section provides some advice to diagnose and troubleshoot failed post-creation operations.
11.7.1. Overcloud Stack Modifications
Problems can occur when modifying the overcloud stack through the director. Examples of stack modifications include:
- Scaling Nodes
- Removing Nodes
- Replacing Nodes
Modifying the stack is similar to the process of creating the stack, in that the director checks the availability of the requested number of nodes, provisions additional or removes existing nodes, and then applies the Puppet configuration. Here are some guidelines to follow in situations when modifying the overcloud stack.
As an initial step, follow the advice set out in Section 11.4.3, “Post-Deployment Configuration”. These same steps can help diagnose problems with updating the overcloud heat stack. In particular, use the following commands to help identify problematic resources:
heat stack-list --show-nested
- List all stacks. The --show-nested option displays all child stacks and their respective parent stacks. This command helps identify the point where a stack failed.
heat resource-list overcloud
- List all resources in the overcloud stack and their current states. This helps identify which resource is causing failures in the stack. You can trace this resource failure to its respective parameters and configuration in the heat template collection and the Puppet modules.
heat event-list overcloud
- List all events related to the overcloud stack in chronological order. This includes the initiation, completion, and failure of all resources in the stack. This helps identify points of resource failure.
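When the stack is large, filtering these listings for failures makes the problem resource easier to spot. A minimal sketch using the commands above:

$ heat stack-list --show-nested | grep -i "FAILED"
$ heat resource-list overcloud | grep -i "FAILED"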
The next few sections provide advice to diagnose issues on specific node types.
11.7.2. Controller Service Failures
The overcloud Controller nodes contain the bulk of Red Hat OpenStack Platform services. Likewise, you might use multiple Controller nodes in a high availability cluster. If a certain service on a node is faulty, the high availability cluster provides a certain level of failover. However, it then becomes necessary to diagnose the faulty service to ensure your overcloud operates at full capacity.
The Controller nodes use Pacemaker to manage the resources and services in the high availability cluster. The Pacemaker Configuration System (pcs) command is a tool that manages a Pacemaker cluster. Run this command on a Controller node in the cluster to perform configuration and monitoring functions. Here are a few commands to help troubleshoot overcloud services on a high availability cluster:
pcs status
- Provides a status overview of the entire cluster including enabled resources, failed resources, and online nodes.
pcs resource show
- Shows a list of resources and their respective nodes.
pcs resource disable [resource]
- Stops a particular resource.
pcs resource enable [resource]
- Starts a particular resource.
pcs cluster standby [node]
- Places a node in standby mode. The node is no longer available in the cluster. This is useful for performing maintenance on a specific node without affecting the cluster.
pcs cluster unstandby [node]
- Removes a node from standby mode. The node becomes available in the cluster again.
Use these Pacemaker commands to identify the faulty component and/or node. After identifying the component, view the respective component log file in /var/log/.
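For example, if pcs status reports a failed resource, you can clear its failure state and retry it once you have reviewed the relevant logs. This is a hedged sketch; the resource name used here (openstack-cinder-api) is only an example and should be replaced with the failed resource from your own cluster:

$ sudo pcs status
$ sudo pcs resource cleanup openstack-cinder-api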
11.7.3. Compute Service Failures
Compute nodes use the Compute service to perform hypervisor-based operations. This means the main diagnosis for Compute nodes revolves around this service. For example:
- View the status of the service using the following systemd function:

$ sudo systemctl status openstack-nova-compute.service

- Likewise, view the systemd journal for the service using the following command:

$ sudo journalctl -u openstack-nova-compute.service

- The primary log file for Compute nodes is /var/log/nova/nova-compute.log. If issues occur with Compute node communication, this log file is usually a good place to start a diagnosis.
- If performing maintenance on the Compute node, migrate the existing instances from the host to an operational Compute node, then disable the node. See Chapter 8, Migrating Virtual Machines Between Compute Nodes for more information on node migrations.
11.7.4. Ceph Storage Service Failures
For any issues that occur with Red Hat Ceph Storage clusters, see Chapter 10. Logging and Debugging in the Red Hat Ceph Storage Configuration Guide. This section provides information on diagnosing logs for all Ceph storage services.
11.8. Tuning the Undercloud
The advice in this section aims to help increase the performance of your undercloud. Implement the recommendations as necessary.
- The Identity Service (keystone) uses a token-based system for access control against the other OpenStack services. After a certain period, the database will accumulate a large number of unused tokens; a default cronjob flushes the token table every day. It is recommended that you monitor your environment and adjust the token flush interval as needed. For the undercloud, you can adjust the interval using crontab -u keystone -e. Note that this is a temporary change and that openstack undercloud update will reset this cronjob back to its default.
will reset this cronjob back to its default. Heat stores a copy of all template files in its database’s
raw_template
table each time you runopenstack overcloud deploy
. Theraw_template
table retains all past templates and grows in size. To remove unused templates in theraw_templates
table, create a daily cronjob that clears unused templates that exist in the database for longer than a day:0 04 * * * /bin/heat-manage purge_deleted -g days 1
- The openstack-heat-engine and openstack-heat-api services might consume too many resources at times. If so, set max_resources_per_stack=-1 in /etc/heat/heat.conf and restart the heat services:

$ sudo systemctl restart openstack-heat-engine openstack-heat-api
- Sometimes the director might not have enough resources to perform concurrent node provisioning. The default is 10 nodes at the same time. To reduce the number of concurrent nodes, set the max_concurrent_builds parameter in /etc/nova/nova.conf to a value less than 10 and restart the nova services:

$ sudo systemctl restart openstack-nova-api openstack-nova-scheduler
- Edit the /etc/my.cnf.d/server.cnf file. Some recommended values to tune are listed below (a sample configuration appears after this list):
- max_connections
- Number of simultaneous connections to the database. The recommended value is 4096.
- innodb_additional_mem_pool_size
- The size in bytes of a memory pool the database uses to store data dictionary information and other internal data structures. The default is usually 8M and an ideal value is 20M for the undercloud.
- innodb_buffer_pool_size
- The size in bytes of the buffer pool, the memory area where the database caches table and index data. The default is usually 128M and an ideal value is 1000M for the undercloud.
- innodb_flush_log_at_trx_commit
- Controls the balance between strict ACID compliance for commit operations, and higher performance that is possible when commit-related I/O operations are rearranged and done in batches. Set to 1.
- innodb_lock_wait_timeout
- The length of time in seconds a database transaction waits for a row lock before giving up. Set to 50.
- innodb_max_purge_lag
- This variable controls how to delay INSERT, UPDATE, and DELETE operations when purge operations are lagging. Set to 10000.
- innodb_thread_concurrency
- The limit of concurrent operating system threads. Ideally, provide at least two threads for each CPU and disk resource. For example, if using a quad-core CPU and a single disk, use 10 threads.
- Ensure that heat has enough workers to perform an overcloud creation. Usually, this depends on how many CPUs the undercloud has. To manually set the number of workers, edit the /etc/heat/heat.conf file, set the num_engine_workers parameter to the number of workers you need (ideally 4), and restart the heat engine:

$ sudo systemctl restart openstack-heat-engine
11.9. Important Logs for Undercloud and Overcloud
Use the following logs to find out information about the undercloud and overcloud when troubleshooting.
Logs on the undercloud:

Information | Log Location
---|---
OpenStack Compute log | /var/log/nova/nova-compute.log
OpenStack Compute API interactions | /var/log/nova/nova-api.log
OpenStack Compute Conductor log | /var/log/nova/nova-conductor.log
OpenStack Orchestration log | /var/log/heat/heat-engine.log
OpenStack Orchestration API interactions | /var/log/heat/heat-api.log
OpenStack Orchestration CloudFormations log | /var/log/heat/heat-api-cfn.log
OpenStack Bare Metal Conductor log | /var/log/ironic/ironic-conductor.log
OpenStack Bare Metal API interactions | /var/log/ironic/ironic-api.log
Introspection | /var/log/ironic-inspector/ironic-inspector.log
OpenStack Workflow Engine log | /var/log/mistral/engine.log
OpenStack Workflow Executor log | /var/log/mistral/executor.log
OpenStack Workflow API interactions | /var/log/mistral/api.log
Logs on the overcloud nodes:

Information | Log Location
---|---
Cloud-Init Log | /var/log/cloud-init.log
Overcloud Configuration (Summary of Last Puppet Run) | /var/lib/puppet/state/last_run_summary.yaml
Overcloud Configuration (Report from Last Puppet Run) | /var/lib/puppet/state/last_run_report.yaml
Overcloud Configuration (All Puppet Reports) | /var/lib/puppet/reports/
Overcloud Configuration (stdout from each Puppet Run) | /var/run/heat-config/deployed/
Overcloud Configuration (stderr from each Puppet Run) | /var/run/heat-config/deployed/
High availability log | /var/log/pacemaker.log