Chapter 28. Troubleshooting director errors
Errors can occur at certain stages of the director processes. This section contains some information about diagnosing common problems.
28.1. Troubleshooting node registration
Issues with node registration usually occur due to incorrect node details. In these situations, validate the template file that contains your node details and correct the imported node details.
Procedure
1. Source the stackrc file:

   $ source ~/stackrc

2. Run the node import command with the --validate-only option. This option validates your node template without performing an import:

   (undercloud) $ openstack overcloud node import --validate-only ~/nodes.json
   Waiting for messages on queue 'tripleo' with no timeout.
   Successfully validated environment file

3. To fix incorrect details with imported nodes, run the openstack baremetal commands to update node details. The following example shows how to change networking details:

   a. Identify the assigned port UUID for the imported node:

      (undercloud) $ openstack baremetal port list --node [NODE UUID]

   b. Update the MAC address:

      (undercloud) $ openstack baremetal port set --address=[NEW MAC] [PORT UUID]

   c. Configure a new IPMI address on the node:

      (undercloud) $ openstack baremetal node set --driver-info ipmi_address=[NEW IPMI ADDRESS] [NODE UUID]
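After you correct the node details, you can spot-check the power management settings of every registered node before you retry the import or introspection. The following loop is a minimal sketch that uses the standard openstack client column-selection options; adjust the selected fields to your needs:

   (undercloud) $ for node in $(openstack baremetal node list -f value -c UUID); do
         openstack baremetal node show $node -f value -c name -c driver_info
     done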
28.2. Troubleshooting hardware introspection
The Bare Metal Provisioning inspector service, ironic-inspector, times out after a default one-hour period if the inspection RAM disk does not respond. The timeout might indicate a bug in the inspection RAM disk, but usually the timeout occurs due to an environment misconfiguration.
You can diagnose and resolve common environment misconfiguration issues to ensure the introspection process runs to completion.
Procedure
1. Source the stackrc undercloud credentials file:

   $ source ~/stackrc

2. Ensure that your nodes are in a manageable state. Introspection does not inspect nodes in an available state, which is meant for deployment. If you want to inspect nodes that are in an available state, change the node status to manageable before introspection:

   (undercloud) $ openstack baremetal node manage <node_uuid>

3. To configure temporary access to the introspection RAM disk during introspection debugging, use the sshkey parameter to append your public SSH key to the kernel configuration in the /httpboot/inspector.ipxe file:

   kernel http://192.2.0.1:8088/agent.kernel ipa-inspection-callback-url=http://192.168.0.1:5050/v1/continue ipa-inspection-collectors=default,extra-hardware,logs systemd.journald.forward_to_console=yes BOOTIF=${mac} ipa-debug=1 ipa-inspection-benchmarks=cpu,mem,disk selinux=0 sshkey="<public_ssh_key>"

4. Run the introspection on the node:

   (undercloud) $ openstack overcloud node introspect <node_uuid> --provide

   Use the --provide option to change the node state to available after the introspection completes.

5. Identify the IP address of the node from the dnsmasq logs:

   (undercloud) $ sudo tail -f /var/log/containers/ironic-inspector/dnsmasq.log

6. If an error occurs, access the node using the root user and the temporary access details:

   $ ssh root@192.168.24.105

7. Access the node during introspection to run diagnostic commands and troubleshoot the introspection failure.

8. To stop the introspection process, run the following command:

   (undercloud) $ openstack baremetal introspection abort <node_uuid>

   You can also wait until the process times out.

   Note: Red Hat OpenStack Platform director retries introspection three times after the initial abort. Run the openstack baremetal introspection abort command at each attempt to abort the introspection completely.
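If the node never calls back and you cannot find its address in the live dnsmasq log, you can search the log history for the node's MAC address instead. This is a minimal sketch that assumes the node has a single registered port:

   (undercloud) $ MAC=$(openstack baremetal port list --node <node_uuid> -f value -c Address | head -1)
   (undercloud) $ sudo grep -i "$MAC" /var/log/containers/ironic-inspector/dnsmasq.log | tail -5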
28.3. Troubleshooting overcloud creation and deployment
The initial creation of the overcloud occurs with the OpenStack Orchestration (heat) service. If an overcloud deployment fails, use the OpenStack clients and service log files to diagnose the failed deployment.
Procedure
1. Source the stackrc file:

   $ source ~/stackrc

2. Run the deployment failures command:

   $ openstack overcloud failures

3. Run the following command to display the details of the failure:

   (undercloud) $ openstack stack failures list <OVERCLOUD_NAME> --long

   Replace <OVERCLOUD_NAME> with the name of your overcloud.

4. Run the following command to identify the stacks that failed:

   (undercloud) $ openstack stack list --nested --property status=FAILED
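After you identify a failed stack, you can drill down to the individual resources that failed within it. The following commands are a minimal sketch; the stack name overcloud and the resource name are placeholders for your environment:

   (undercloud) $ openstack stack resource list --filter status=FAILED -n 5 overcloud
   (undercloud) $ openstack stack resource show <stack_name> <resource_name>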
28.4. Troubleshooting node provisioning
The OpenStack Orchestration (heat) service controls the provisioning process. If node provisioning fails, use the OpenStack clients and service log files to diagnose the issues.
Procedure
1. Source the stackrc file:

   $ source ~/stackrc

2. Check the bare metal service to see all registered nodes and their current status:

   (undercloud) $ openstack baremetal node list

   All nodes available for provisioning should have the following states set:

   - Maintenance set to False.
   - Provision State set to available before provisioning.
3. If a node does not have Maintenance set to False or Provision State set to available, use the following table to identify the problem and the solution:

| Problem | Cause | Solution |
|---|---|---|
| Maintenance sets itself to True automatically. | The director cannot access the power management for the nodes. | Check the credentials for node power management. |
| Provision State is set to available but nodes do not provision. | The problem occurred before bare metal deployment started. | Check the node details, including the profile and flavor mapping. Check that the node hardware details are within the requirements for the flavor. |
| Provision State is set to wait call-back for a node. | The node provisioning process has not yet finished for this node. | Wait until this status changes. Otherwise, connect to the virtual console of the node and check the output. |
| Provision State is active and Power State is power on but the nodes do not respond. | The node provisioning has finished successfully and there is a problem during the post-deployment configuration step. | Diagnose the node configuration process. Connect to the virtual console of the node and check the output. |
| Provision State is error or deploy failed. | Node provisioning has failed. | View the bare metal node details with the openstack baremetal node show command and check the last_error field, which contains the error description. |
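For nodes in the error or deploy failed state, you can read the provisioning state and the error message with a single command. This is a minimal sketch that uses the openstack client column-selection options; the node UUID is a placeholder:

   (undercloud) $ openstack baremetal node show [NODE UUID] -f value -c provision_state -c last_error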
28.5. Troubleshooting IP address conflicts during provisioning
Introspection and deployment tasks fail if the destination hosts are allocated an IP address that is already in use. To prevent these failures, you can perform a port scan of the Provisioning network to determine whether the discovery IP range and host IP range are free.
Procedure
1. Install nmap:

   $ sudo dnf install nmap

2. Use nmap to scan the IP address range for active addresses. This example scans the 192.168.24.0/24 range. Replace this range with the IP subnet of the Provisioning network, using CIDR bitmask notation:

   $ sudo nmap -sn 192.168.24.0/24

3. Review the output of the nmap scan. The output includes the IP address of the undercloud and any other hosts that are present on the subnet. If any of the active IP addresses conflict with the IP ranges in undercloud.conf, you must either change the IP address ranges or release the IP addresses before you introspect or deploy the overcloud nodes.
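To make the comparison easier, you can print only the addresses that respond by using the greppable output format of nmap. This is a minimal sketch; the subnet is an example value:

   $ sudo nmap -sn 192.168.24.0/24 -oG - | awk '/Up$/{print $2}'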
28.6. Troubleshooting "No Valid Host Found" errors
Sometimes the /var/log/nova/nova-conductor.log contains the following error:

NoValidHost: No valid host was found. There are not enough hosts available.

This error occurs when the Compute scheduler cannot find a bare metal node that is suitable for booting the new instance. This usually means that there is a mismatch between the resources that the Compute service expects to find and the resources that the Bare Metal service advertised to Compute. To check for a mismatch, complete the following steps:
Procedure
1. Source the stackrc file:

   $ source ~/stackrc

2. Check that the introspection succeeded on the node. If the introspection fails, check that each node contains the required ironic node properties:

   (undercloud) $ openstack baremetal node show [NODE UUID]

   Check that the properties JSON field has valid values for the keys cpus, cpu_arch, memory_mb, and local_gb.

3. Ensure that the Compute flavor that is mapped to the node does not exceed the node properties for the required number of nodes:

   (undercloud) $ openstack flavor show [FLAVOR NAME]

4. Run the openstack baremetal node list command to ensure that there are sufficient nodes in the available state. Nodes in the manageable state usually signify a failed introspection.

5. Run the openstack baremetal node list command and ensure that the nodes are not in maintenance mode. If a node changes to maintenance mode automatically, the likely cause is an issue with incorrect power management credentials. Check the power management credentials and then remove maintenance mode:

   (undercloud) $ openstack baremetal node maintenance unset [NODE UUID]

6. If you are using automatic profile tagging, check that you have enough nodes that correspond to each flavor and profile. Run the openstack baremetal node show command on a node and check the capabilities key in the properties field. For example, a node tagged for the Compute role contains the profile:compute value.

7. You must wait for node information to propagate from Bare Metal to Compute after introspection. However, if you performed some steps manually, there might be a short period of time when nodes are not available to the Compute service (nova). Use the following command to check the total resources in your system:

   (undercloud) $ openstack hypervisor stats show
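To compare the resources that a node advertises against the flavor that it must satisfy, you can limit the output of both commands to the relevant fields. This is a minimal sketch; the node UUID and flavor name are placeholders:

   (undercloud) $ openstack baremetal node show [NODE UUID] --fields properties
   (undercloud) $ openstack flavor show [FLAVOR NAME] -c vcpus -c ram -c disk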
28.7. Troubleshooting container configuration
Red Hat OpenStack Platform director uses podman to manage containers and puppet to create container configuration. This procedure shows how to diagnose a container when errors occur.
Accessing the host
1. Source the stackrc file:

   $ source ~/stackrc

2. Get the IP address of the node with the container failure:

   (undercloud) $ metalsmith list

3. Log in to the node:

   (undercloud) $ ssh tripleo-admin@192.168.24.60
Identifying failed containers
1. View all containers:

   $ sudo podman ps --all

2. Identify the failed container. The failed container usually exits with a non-zero status.
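To narrow the list to containers that have stopped, you can filter on the container status. This is a minimal sketch that uses the built-in podman filters and Go-template formatting:

   $ sudo podman ps --all --filter status=exited --format "{{.Names}} {{.Status}}"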
Checking container logs
1. Each container retains standard output from its main process. Use this output as a log to help determine what actually occurs during a container run. For example, to view the log for the keystone container, run the following command:

   $ sudo podman logs keystone

   In most cases, this log contains information about the cause of a container failure.

2. The host also retains the stdout log for the failed service. You can find the stdout logs in /var/log/containers/stdouts/. For example, to view the log for a failed keystone container, run the following command:

   $ cat /var/log/containers/stdouts/keystone.log
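If you restart a failed container and want to watch it fail in real time, you can follow the log instead of reading it after the fact. A minimal sketch, assuming the keystone container from the previous example:

   $ sudo podman logs --tail 100 -f keystone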
Inspecting containers
In some situations, you might need to verify information about a container. For example, use the following command to view keystone container data:
$ sudo podman inspect keystone
This command returns a JSON object containing low-level configuration data. You can pipe the output to the jq command to parse specific data. For example, to view the container mounts for the keystone container, run the following command:
$ sudo podman inspect keystone | jq .[0].Mounts
You can also use the --format option to parse data to a single line, which is useful for running commands against sets of container data. For example, to recreate the options used to run the keystone container, use the following inspect command with the --format option:
$ sudo podman inspect --format='{{range .Config.Env}} -e "{{.}}" {{end}} {{range .Mounts}} -v {{.Source}}:{{.Destination}}:{{ join .Options "," }}{{end}} -ti {{.Config.Image}}' keystone
The --format option uses Go syntax to create queries.
Use these options in conjunction with the podman run command to recreate the container for troubleshooting purposes:
$ OPTIONS=$( sudo podman inspect --format='{{range .Config.Env}} -e "{{.}}" {{end}} {{range .Mounts}} -v {{.Source}}:{{.Destination}}{{if .Mode}}:{{.Mode}}{{end}}{{end}} -ti {{.Config.Image}}' keystone )
$ sudo podman run --rm $OPTIONS /bin/bash
Running commands in a container
In some cases, you might need to obtain information from within a container through a specific Bash command. In this situation, use the following podman command to execute commands within a running container. For example, run the podman exec command to run a command inside the keystone container:
$ sudo podman exec -ti keystone <COMMAND>
The -ti options run the command through an interactive pseudoterminal.
Replace <COMMAND> with the command you want to run. For example, each container has a health check script to verify the service connection. You can run the health check script for keystone with the following command:
$ sudo podman exec -ti keystone /openstack/healthcheck
To access the container shell, run podman exec using /bin/bash as the command you want to run inside the container:
$ sudo podman exec -ti keystone /bin/bash
Viewing a container filesystem
To view the file system for the failed container, run the podman mount command. For example, to view the file system for a failed keystone container, run the following command:

$ sudo podman mount keystone

This command provides a mounted location to view the filesystem contents:

/var/lib/containers/storage/overlay/78946a109085aeb8b3a350fc20bd8049a08918d74f573396d7358270e711c610/merged

This is useful for viewing the Puppet reports within the container. You can find these reports in the var/lib/puppet/ directory within the container mount.
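When you finish examining the filesystem, release the mount point. A minimal sketch, assuming the keystone container from the previous example:

$ sudo podman umount keystone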
Exporting a container
When a container fails, you might need to investigate the full contents of the file system. In this case, you can export the full file system of a container as a tar archive. For example, to export the keystone container file system, run the following command:
$ sudo podman export keystone -o keystone.tar
This command creates the keystone.tar archive, which you can extract and explore.
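To examine the archive, extract it into an empty directory and browse it with standard tools. A minimal sketch:

$ mkdir keystone-export
$ tar -xf keystone.tar -C keystone-export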
28.8. Troubleshooting Compute node failures
Compute nodes use the Compute service to perform hypervisor-based operations. This means the main diagnosis for Compute nodes revolves around this service.
Procedure
1. Source the stackrc file:

   $ source ~/stackrc

2. Get the IP address of the Compute node that contains the failure:

   (undercloud) $ openstack server list

3. Log in to the node:

   (undercloud) $ ssh tripleo-admin@192.168.24.60

4. Change to the root user:

   $ sudo -i

5. View the status of the container:

   $ sudo podman ps -f name=nova_compute

6. The primary log file for Compute nodes is /var/log/containers/nova/nova-compute.log. If issues occur with Compute node communication, use this file to begin the diagnosis.

7. If you perform maintenance on the Compute node, migrate the existing instances from the host to an operational Compute node, then disable the node.
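Before maintenance, you can stop the scheduler from placing new instances on the affected host and scan the Compute log for recent errors. This is a minimal sketch; the host name is a placeholder, the exact client options can vary between releases, and you run the first command from the undercloud and the second on the Compute node:

   (undercloud) $ openstack compute service set --disable --disable-reason "maintenance" <compute_host> nova-compute
   $ sudo grep -i error /var/log/containers/nova/nova-compute.log | tail -20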
28.9. Creating an sosreport
If you need to contact Red Hat for support with Red Hat OpenStack Platform, you might need to generate an sosreport. For more information about creating an sosreport, see the Red Hat Knowledgebase.
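As a minimal sketch, on most Red Hat Enterprise Linux systems you can generate the report with the sos package; the plugins and options to include depend on the guidance from Red Hat Support:

   $ sudo dnf install -y sos
   $ sudo sosreport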
28.10. Log locations
Use the following logs to gather information about the undercloud and overcloud when you troubleshoot issues.
Undercloud and overcloud nodes:

| Information | Log location |
|---|---|
| Containerized service logs | /var/log/containers/ |
| Standard output from containerized services | /var/log/containers/stdouts/ |
| Ansible configuration logs | ~/ansible.log |
Undercloud node:

| Information | Log location |
|---|---|
| Command history for openstack overcloud deploy | /home/stack/.tripleo/history |
| Undercloud installation log | /home/stack/install-undercloud.log |
Overcloud nodes:

| Information | Log location |
|---|---|
| Cloud-Init log | /var/log/cloud-init.log |
| High availability log | /var/log/pacemaker.log |