Chapter 15. Troubleshooting Director Issues
An error can occur at certain stages of the director’s processes. This section provides some information for diagnosing common problems.
Note the common logs for the director’s components:
- The /var/log directory contains logs for many common OpenStack Platform components as well as logs for standard Red Hat Enterprise Linux applications.
- The journald service provides logs for various components. Note that ironic uses two units: openstack-ironic-api and openstack-ironic-conductor. Likewise, ironic-inspector uses two units as well: openstack-ironic-inspector and openstack-ironic-inspector-dnsmasq. Use both units for each respective component. For example:

  $ source ~/stackrc
  (undercloud) $ sudo journalctl -u openstack-ironic-inspector -u openstack-ironic-inspector-dnsmasq

- ironic-inspector also stores the ramdisk logs in /var/log/ironic-inspector/ramdisk/ as gz-compressed tar files. Filenames contain the date, time, and IPMI address of the node. Use these logs for diagnosing introspection issues.
15.1. Troubleshooting Node Registration
Issues with node registration usually arise from incorrect node details. In this case, use ironic to fix problems with the registered node data. Here are a few examples:
Find out the assigned port UUID:
$ source ~/stackrc
(undercloud) $ openstack baremetal port list --node [NODE UUID]
Update the MAC address:
(undercloud) $ openstack baremetal port set --address=[NEW MAC] [PORT UUID]
Update the IPMI address:

(undercloud) $ openstack baremetal node set --driver-info ipmi_address=[NEW IPMI ADDRESS] [NODE UUID]
15.2. Troubleshooting Hardware Introspection
The introspection process must run to completion. However, ironic’s Discovery daemon (ironic-inspector) times out after a default period of one hour if the discovery ramdisk provides no response. Sometimes this might indicate a bug in the discovery ramdisk, but usually it happens due to an environment misconfiguration, particularly BIOS boot settings.
Here are some common scenarios where environment misconfiguration occurs and advice on how to diagnose and resolve them.
Errors with Starting Node Introspection
Normally the introspection process uses the openstack overcloud node introspect command. However, if running the introspection directly with ironic-inspector, it might fail to discover nodes in the AVAILABLE state, which is meant for deployment and not for discovery. Change the node status to the MANAGEABLE state before discovery:
$ source ~/stackrc
(undercloud) $ openstack baremetal node manage [NODE UUID]
Then, when discovery completes, change the node back to the AVAILABLE state before provisioning:
(undercloud) $ openstack baremetal node provide [NODE UUID]
Stopping the Discovery Process
Stop the introspection process:
$ source ~/stackrc
(undercloud) $ openstack baremetal introspection abort [NODE UUID]
You can also wait until the process times out. If necessary, change the timeout setting in /etc/ironic-inspector/inspector.conf to another period in seconds.
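A minimal sketch of the change, assuming the setting is the timeout option in the [DEFAULT] section of inspector.conf (verify the option name in your release):

[DEFAULT]
# Introspection timeout in seconds; the default corresponds to one hour
timeout = 7200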
Accessing the Introspection Ramdisk
The introspection ramdisk uses a dynamic login element. This means you can provide either a temporary password or an SSH key to access the node during introspection debugging. Use the following process to set up ramdisk access:
1. Provide a temporary password to the openssl passwd -1 command to generate an MD5 hash. For example:

   $ openssl passwd -1 mytestpassword
   $1$enjRSyIw$/fYUpJwr6abFy/d.koRgQ/

2. Edit the /httpboot/inspector.ipxe file, find the line starting with kernel, and append the rootpwd parameter and the MD5 hash. For example:

   kernel http://192.2.0.1:8088/agent.kernel ipa-inspection-callback-url=http://192.168.0.1:5050/v1/continue ipa-inspection-collectors=default,extra-hardware,logs systemd.journald.forward_to_console=yes BOOTIF=${mac} ipa-debug=1 ipa-inspection-benchmarks=cpu,mem,disk rootpwd="$1$enjRSyIw$/fYUpJwr6abFy/d.koRgQ/" selinux=0

   Alternatively, you can append the sshkey parameter with your public SSH key.

   Note: Quotation marks are required for both the rootpwd and sshkey parameters.

3. Start the introspection and find the IP address from either the arp command or the DHCP logs:

   $ arp
   $ sudo journalctl -u openstack-ironic-inspector-dnsmasq

4. SSH as a root user with the temporary password or the SSH key:

   $ ssh root@192.168.24.105
Checking Introspection Storage
The director uses OpenStack Object Storage (swift) to save the hardware data obtained during the introspection process. If this service is not running, the introspection can fail. Check all services related to OpenStack Object Storage to ensure the service is running:
$ sudo systemctl list-units openstack-swift*
15.3. Troubleshooting Workflows and Executions
The OpenStack Workflow (mistral) service groups multiple OpenStack tasks into workflows. Red Hat OpenStack Platform uses a set of these workflows to perform common functions across the CLI and web UI. This includes bare metal node control, validations, plan management, and overcloud deployment.
For example, when running the openstack overcloud deploy command, the OpenStack Workflow service executes two workflows. The first one uploads the deployment plan:
Removing the current plan files
Uploading new plan files
Started Mistral Workflow. Execution ID: aef1e8c6-a862-42de-8bce-073744ed5e6b
Plan updated
The second one starts the overcloud deployment.
Workflow Objects
OpenStack Workflow uses the following objects to keep track of the workflow:
- Actions: A particular instruction that OpenStack performs once an associated task runs. Examples include running shell scripts or performing HTTP requests. Some OpenStack components have in-built actions that OpenStack Workflow uses.
- Tasks: Define the action to run and the result of running the action. Tasks usually have actions or other workflows associated with them. Once a task completes, the workflow directs to another task, usually depending on whether the task succeeded or failed.
- Workflows: A set of tasks grouped together and executed in a specific order.
- Executions: Define a particular action, task, or workflow running.
Workflow Error Diagnosis
OpenStack Workflow also provides robust logging of executions, which helps you identify issues with certain command failures. For example, if a workflow execution fails, you can identify the point of failure. List the workflow executions that are in the ERROR state:
$ source ~/stackrc
(undercloud) $ openstack workflow execution list | grep "ERROR"
Get the UUID of the failed workflow execution (for example, dffa96b0-f679-4cd2-a490-4769a3825262) and view the execution and its output:
(undercloud) $ openstack workflow execution show dffa96b0-f679-4cd2-a490-4769a3825262
(undercloud) $ openstack workflow execution output show dffa96b0-f679-4cd2-a490-4769a3825262
This provides information about the failed task in the execution. The openstack workflow execution show command also displays the workflow used for the execution (for example, tripleo.plan_management.v1.publish_ui_logs_to_swift). You can view the full workflow definition using the following command:
(undercloud) $ openstack workflow definition show tripleo.plan_management.v1.publish_ui_logs_to_swift
This is useful for identifying where in the workflow a particular task occurs.
You can also view action executions and their results using a similar command syntax:
(undercloud) $ openstack action execution list
(undercloud) $ openstack action execution show 8a68eba3-0fec-4b2a-adc9-5561b007e886
(undercloud) $ openstack action execution output show 8a68eba3-0fec-4b2a-adc9-5561b007e886
This is useful for identifying a specific action causing issues.
15.4. Troubleshooting Overcloud Creation
There are three layers where the deployment can fail:
- Orchestration (heat and nova services)
- Bare Metal Provisioning (ironic service)
- Post-Deployment Configuration (Puppet)
If an overcloud deployment has failed at any of these levels, use the OpenStack clients and service log files to diagnose the failed deployment. You can also run the following command to display details of the failure:
$ openstack stack failures list <OVERCLOUD_NAME> --long
Replace <OVERCLOUD_NAME> with the name of your overcloud.
If the initial overcloud creation fails, you can delete the partially deployed overcloud with the openstack stack delete overcloud command and try again. Only run this command if the initial overcloud creation fails. Do not run this command on a fully deployed and operational overcloud, or you will delete the entire overcloud.
15.4.1. Accessing deployment command history
Understanding historical director deployment commands and arguments can be useful for troubleshooting and support. You can view this information in /home/stack/.tripleo/history.
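For example, to review past deployment commands and arguments:

$ cat /home/stack/.tripleo/history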
15.4.2. Orchestration
In most cases, Heat shows the failed overcloud stack after the overcloud creation fails. You can query the stack list for failures, as in the sketch below.
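A minimal sketch (the --nested and --property filter options are assumptions based on the heat plugin for the openstack client; verify them against your release):

$ source ~/stackrc
(undercloud) $ openstack stack list --nested --property status=FAILED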
If the stack list is empty, this indicates an issue with the initial Heat setup. Check your Heat templates and configuration options, and check for any error messages that appeared after you ran openstack overcloud deploy.
15.4.3. Bare Metal Provisioning
Check ironic to see all registered nodes and their current status.
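For example:

$ source ~/stackrc
(undercloud) $ openstack baremetal node list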
Here are some common issues that arise from the provisioning process.
Review the Provision State and Maintenance columns in the resulting table. Check for the following:
- An empty table, or fewer nodes than you expect.
- Maintenance is set to True.
- Provision State is set to manageable. This usually indicates an issue with the registration or discovery processes. For example, if Maintenance sets itself to True automatically, the nodes are usually using the wrong power management credentials.
- If Provision State is available, then the problem occurred before bare metal deployment even started.
- If Provision State is active and Power State is power on, the bare metal deployment finished successfully. This means that the problem occurred during the post-deployment configuration step.
- If Provision State is wait call-back for a node, the bare metal provisioning process has not yet finished for this node. Wait until this status changes; otherwise, connect to the virtual console of the failed node and check the output.
- If Provision State is error or deploy failed, then bare metal provisioning has failed for this node. Check the bare metal node’s details:

  (undercloud) $ openstack baremetal node show [NODE UUID]

  Look for the last_error field, which contains the error description. If the error message is vague, you can use logs to clarify it:

  (undercloud) $ sudo journalctl -u openstack-ironic-conductor -u openstack-ironic-api

- If you see wait timeout error and the node Power State is power on, connect to the virtual console of the failed node and check the output.
15.4.4. Post-Deployment Configuration
Many things can occur during the configuration stage. For example, a particular Puppet module could fail to complete due to an issue with the setup. This section provides a process to diagnose such issues.
List all the resources from the overcloud stack to see which one failed:
$ source ~/stackrc
(undercloud) $ openstack stack resource list overcloud --filter status=FAILED
This shows a table of all failed resources.
Show the failed resource:
(undercloud) $ openstack stack resource show overcloud [FAILED RESOURCE]
Check for any information in the resource_status_reason field that can help your diagnosis.
Use the openstack server list command to see the IP addresses of the overcloud nodes:
(undercloud) $ openstack server list
Log in as the heat-admin user to one of the deployed nodes. For example, if the stack’s resource list shows the error occurred on a Controller node, log in to a Controller node. The heat-admin user has sudo access.
(undercloud) $ ssh heat-admin@192.168.24.14
Check the os-collect-config log for a possible reason for the failure.
[heat-admin@overcloud-controller-0 ~]$ sudo journalctl -u os-collect-config
In some cases, nova fails to deploy the node entirely. This situation is indicated by a failed OS::Heat::ResourceGroup for one of the overcloud role types. Use nova to see the failure in this case:
(undercloud) $ openstack server list
(undercloud) $ openstack server show [SERVER ID]
The most common error shown will reference the error message No valid host was found. See Section 15.6, “Troubleshooting "No Valid Host Found" Errors” for details on troubleshooting this error. In other cases, look at the following log files for further troubleshooting:
- /var/log/nova/*
- /var/log/heat/*
- /var/log/ironic/*
The post-deployment process for Controller nodes uses five main steps for the deployment:
| Step | Description |
|---|---|
| Step 1 | Initial load balancing software configuration, including Pacemaker, RabbitMQ, Memcached, Redis, and Galera. |
| Step 2 | Initial cluster configuration, including Pacemaker configuration, HAProxy, MongoDB, Galera, Ceph Monitor, and database initialization for OpenStack Platform services. |
| Step 3 | Initial ring build for OpenStack Object Storage (swift). |
| Step 4 | Configure service start up settings in Pacemaker, including constraints to determine service start up order and service start up parameters. |
| Step 5 | Initial configuration of projects, roles, and users in OpenStack Identity (keystone). |
15.5. Troubleshooting IP Address Conflicts on the Provisioning Network
Discovery and deployment tasks will fail if the destination hosts are allocated an IP address which is already in use. To avoid this issue, you can perform a port scan of the Provisioning network to determine whether the discovery IP range and host IP range are free.
Perform the following steps from the undercloud host:
Install nmap:
$ sudo yum install nmap
Use nmap to scan the IP address range for active addresses. This example scans the 192.168.24.0/24 range; replace this with the IP subnet of the Provisioning network (using CIDR bitmask notation):
$ sudo nmap -sn 192.168.24.0/24
Review the output of the nmap scan.
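A hedged sample of host-discovery output (addresses, host count, and timings are illustrative):

Nmap scan report for 192.168.24.1
Host is up (0.00046s latency).
Nmap scan report for 192.168.24.2
Host is up (0.00041s latency).
Nmap done: 256 IP addresses (2 hosts up) scanned in 2.42 seconds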
For example, you should see the IP address(es) of the undercloud, and any other hosts that are present on the subnet. If any of the active IP addresses conflict with the IP ranges in undercloud.conf, you will need to either change the IP address ranges or free up the IP addresses before introspecting or deploying the overcloud nodes.
15.6. Troubleshooting "No Valid Host Found" Errors
Sometimes the /var/log/nova/nova-conductor.log contains the following error:
NoValidHost: No valid host was found. There are not enough hosts available.
This means the nova Scheduler could not find a bare metal node suitable for booting the new instance. This in turn usually means a mismatch between resources that nova expects to find and resources that ironic advertised to nova. Check the following in this case:
- Make sure introspection succeeds for you. Otherwise, check that each node contains the required ironic node properties. For each node:

  $ source ~/stackrc
  (undercloud) $ openstack baremetal node show [NODE UUID]

  Check that the properties JSON field has valid values for the keys cpus, cpu_arch, memory_mb, and local_gb.

- Check that the nova flavor used does not exceed the ironic node properties above for a required number of nodes:

  (undercloud) $ openstack flavor show [FLAVOR NAME]

- Check that sufficient nodes are in the available state according to openstack baremetal node list. Nodes in the manageable state usually indicate a failed introspection.
- Check that the nodes are not in maintenance mode. Use openstack baremetal node list to check. A node automatically changing to maintenance mode usually means incorrect power credentials. Check them and then remove maintenance mode:

  (undercloud) $ openstack baremetal node maintenance unset [NODE UUID]

- If you are using the Automated Health Check (AHC) tools to perform automatic node tagging, check that you have enough nodes corresponding to each flavor/profile. Check the capabilities key in the properties field of openstack baremetal node show. For example, a node tagged for the Compute role should contain profile:compute.
- It takes some time for node information to propagate from ironic to nova after introspection. The director’s tool usually accounts for it. However, if you performed some steps manually, there might be a short period of time when nodes are not available to nova. Use the following command to check the total resources in your system:

  (undercloud) $ openstack hypervisor stats show
15.7. Troubleshooting the Overcloud after Creation
After creating your overcloud, you might want to perform certain overcloud operations in the future. For example, you might aim to scale your available nodes, or replace faulty nodes. Certain issues might arise when performing these operations. This section provides some advice to diagnose and troubleshoot failed post-creation operations.
15.7.1. Overcloud Stack Modifications
Problems can occur when modifying the overcloud stack through the director. Examples of stack modifications include:
- Scaling Nodes
- Removing Nodes
- Replacing Nodes
Modifying the stack is similar to the process of creating the stack, in that the director checks the availability of the requested number of nodes, provisions additional nodes or removes existing nodes, and then applies the Puppet configuration. Here are some guidelines to follow when modifying the overcloud stack.
As an initial step, follow the advice set in Section 15.4.4, “Post-Deployment Configuration”. These same steps can help diagnose problems with updating the overcloud heat stack. In particular, use the following commands to help identify problematic resources (see the example after this list):
- openstack stack list --show-nested: Lists all stacks. The --show-nested option displays all child stacks and their respective parent stacks. This command helps identify the point where a stack failed.
- openstack stack resource list overcloud: Lists all resources in the overcloud stack and their current states. This helps identify which resource is causing failures in the stack. You can trace this resource failure to its respective parameters and configuration in the heat template collection and the Puppet modules.
- openstack stack event list overcloud: Lists all events related to the overcloud stack in chronological order. This includes the initiation, completion, and failure of all resources in the stack. This helps identify points of resource failure.
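For example, run the three commands in sequence from the undercloud:

$ source ~/stackrc
(undercloud) $ openstack stack list --show-nested
(undercloud) $ openstack stack resource list overcloud
(undercloud) $ openstack stack event list overcloud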
The next few sections provide advice to diagnose issues on specific node types.
15.7.2. Controller Service Failures
The overcloud Controller nodes contain the bulk of Red Hat OpenStack Platform services. Likewise, you might use multiple Controller nodes in a high availability cluster. If a certain service on a node is faulty, the high availability cluster provides a certain level of failover. However, it then becomes necessary to diagnose the faulty service to ensure your overcloud operates at full capacity.
The Controller nodes use Pacemaker to manage the resources and services in the high availability cluster. The Pacemaker Configuration System (pcs) command is a tool that manages a Pacemaker cluster. Run this command on a Controller node in the cluster to perform configuration and monitoring functions. Here are few commands to help troubleshoot overcloud services on a high availability cluster:
- pcs status: Provides a status overview of the entire cluster, including enabled resources, failed resources, and online nodes.
- pcs resource show: Shows a list of resources and their respective nodes.
- pcs resource disable [resource]: Stops a particular resource.
- pcs resource enable [resource]: Starts a particular resource.
- pcs cluster standby [node]: Places a node in standby mode. The node is no longer available in the cluster. This is useful for performing maintenance on a specific node without affecting the cluster.
- pcs cluster unstandby [node]: Removes a node from standby mode. The node becomes available in the cluster again.
Use these Pacemaker commands to identify the faulty component and/or node, as in the sketch below. After identifying the component, view the respective component log file in /var/log/.
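A short usage sketch for taking one Controller out of the cluster for maintenance and returning it (the node name is a placeholder):

[heat-admin@overcloud-controller-0 ~]$ sudo pcs status
[heat-admin@overcloud-controller-0 ~]$ sudo pcs cluster standby overcloud-controller-0
# ...perform maintenance on the node...
[heat-admin@overcloud-controller-0 ~]$ sudo pcs cluster unstandby overcloud-controller-0
[heat-admin@overcloud-controller-0 ~]$ sudo pcs status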
15.7.3. Containerized Service Failures
If a containerized service fails during or after overcloud deployment, use the following recommendations to determine the root cause for the failure:
Before running these commands, check that you are logged into an overcloud node and not running these commands on the undercloud.
Checking the container logs
Each container retains standard output from its main process. This output acts as a log to help determine what actually occurs during a container run. For example, to view the log for the keystone container, use the following command:
$ sudo docker logs keystone
In most cases, this log provides the cause of a container’s failure.
Inspecting the container
In some situations, you might need to verify information about a container. For example, use the following command to view keystone container data:
$ sudo docker inspect keystone
This provides a JSON object containing low-level configuration data. You can pipe the output to the jq command to parse specific data. For example, to view the container mounts for the keystone container, run the following command:
$ sudo docker inspect keystone | jq .[0].Mounts
You can also use the --format option to parse data to a single line, which is useful for running commands against sets of container data. For example, to recreate the options used to run the keystone container, use the following inspect command with the --format option:
$ sudo docker inspect --format='{{range .Config.Env}} -e "{{.}}" {{end}} {{range .Mounts}} -v {{.Source}}:{{.Destination}}{{if .Mode}}:{{.Mode}}{{end}}{{end}} -ti {{.Config.Image}}' keystone
The --format option uses Go syntax to create queries.
Use these options in conjunction with the docker run command to recreate the container for troubleshooting purposes:
$ OPTIONS=$( sudo docker inspect --format='{{range .Config.Env}} -e "{{.}}" {{end}} {{range .Mounts}} -v {{.Source}}:{{.Destination}}{{if .Mode}}:{{.Mode}}{{end}}{{end}} -ti {{.Config.Image}}' keystone )
$ sudo docker run --rm $OPTIONS /bin/bash
Running commands in the container
In some cases, you might need to obtain information from within a container through a specific Bash command. In this situation, use the following docker command to execute commands within a running container. For example, to run a command in the keystone container:
$ sudo docker exec -ti keystone <COMMAND>
The -ti options run the command through an interactive pseudoterminal.
Replace <COMMAND> with your desired command. For example, each container has a health check script to verify the service connection. You can run the health check script for keystone with the following command:
$ sudo docker exec -ti keystone /openstack/healthcheck
To access the container’s shell, run docker exec using /bin/bash as the command:
$ sudo docker exec -ti keystone /bin/bash
Exporting a container
When a container fails, you might need to investigate the full contents of its file system. In this case, you can export the full file system of a container as a tar archive. For example, to export the keystone container’s file system, run the following command:
$ sudo docker export keystone -o keystone.tar
This command creates the keystone.tar archive, which you can extract and explore.
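For example, to extract the archive into a scratch directory and explore it (the directory name here is arbitrary):

$ mkdir keystone-fs
$ tar -xf keystone.tar -C keystone-fs
$ ls keystone-fs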
15.7.4. Compute Service Failures
Compute nodes use the Compute service to perform hypervisor-based operations. This means the main diagnosis for Compute nodes revolves around this service. For example:
- View the status of the container:

  $ sudo docker ps -f name=nova_compute

- The primary log file for Compute nodes is /var/log/containers/nova/nova-compute.log. If issues occur with Compute node communication, this log file is usually a good place to start a diagnosis (see the example after this list).
- When you perform maintenance on the Compute node, migrate the existing instances from the host to an operational Compute node, then disable the node. For more information, see Migrating virtual machine instances between Compute nodes.
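For example, to follow the Compute service log while reproducing a communication issue:

$ sudo tail -f /var/log/containers/nova/nova-compute.log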
15.7.5. Ceph Storage Service Failures
For any issues that occur with Red Hat Ceph Storage clusters, see "Logging Configuration Reference" in the Red Hat Ceph Storage Configuration Guide. This section provides information on diagnosing logs for all Ceph storage services.
15.8. Tuning the Undercloud
The advice in this section aims to help increase the performance of your undercloud. Implement the recommendations as necessary.
- The Identity Service (keystone) uses a token-based system for access control against the other OpenStack services. After a certain period, the database accumulates a large number of unused tokens; a default cronjob flushes the token table every day. It is recommended that you monitor your environment and adjust the token flush interval as needed. For the undercloud, you can adjust the interval using crontab -u keystone -e, as in the sketch below. Note that this is a temporary change and that openstack undercloud update will reset this cronjob back to its default.
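  A hedged sketch of an adjusted crontab entry, assuming the default job invokes keystone-manage token_flush (the exact entry, user, and any log redirection vary by release); this example flushes hourly instead of daily:

  0 * * * * keystone-manage token_flush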
- Heat stores a copy of all template files in its database’s raw_template table each time you run openstack overcloud deploy. The raw_template table retains all past templates and grows in size. To remove unused templates in the raw_template table, create a daily cronjob that clears unused templates that exist in the database for longer than a day:

  0 04 * * * /bin/heat-manage purge_deleted -g days 1
- The openstack-heat-engine and openstack-heat-api services might consume too many resources at times. If so, set max_resources_per_stack=-1 in /etc/heat/heat.conf and restart the heat services:

  $ sudo systemctl restart openstack-heat-engine openstack-heat-api
- Sometimes the director might not have enough resources to perform concurrent node provisioning. The default is 10 nodes at the same time. To reduce the number of concurrent nodes, set the max_concurrent_builds parameter in /etc/nova/nova.conf to a value less than 10 and restart the nova services:

  $ sudo systemctl restart openstack-nova-api openstack-nova-scheduler
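  A minimal sketch of the corresponding nova.conf change, assuming max_concurrent_builds lives in the [DEFAULT] section:

  [DEFAULT]
  # Reduce concurrent node provisioning from the default of 10
  max_concurrent_builds = 5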
- Edit the /etc/my.cnf.d/galera.cnf file. Some recommended values to tune include the following (a combined snippet follows this list):
  - max_connections: Number of simultaneous connections to the database. The recommended value is 4096.
  - innodb_additional_mem_pool_size: The size in bytes of a memory pool the database uses to store data dictionary information and other internal data structures. The default is usually 8M and an ideal value is 20M for the undercloud.
  - innodb_buffer_pool_size: The size in bytes of the buffer pool, the memory area where the database caches table and index data. The default is usually 128M and an ideal value is 1000M for the undercloud.
  - innodb_flush_log_at_trx_commit: Controls the balance between strict ACID compliance for commit operations and the higher performance that is possible when commit-related I/O operations are rearranged and done in batches. Set to 1.
  - innodb_lock_wait_timeout: The length of time in seconds a database transaction waits for a row lock before giving up. Set to 50.
  - innodb_max_purge_lag: Controls how to delay INSERT, UPDATE, and DELETE operations when purge operations are lagging. Set to 10000.
  - innodb_thread_concurrency: The limit of concurrent operating system threads. Ideally, provide at least two threads for each CPU and disk resource. For example, if using a quad-core CPU and a single disk, use 10 threads.
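  Putting the recommended values together, a sketch of the resulting /etc/my.cnf.d/galera.cnf settings (the [mysqld] section header is an assumption based on standard MariaDB configuration):

  [mysqld]
  max_connections = 4096
  innodb_additional_mem_pool_size = 20M
  innodb_buffer_pool_size = 1000M
  innodb_flush_log_at_trx_commit = 1
  innodb_lock_wait_timeout = 50
  innodb_max_purge_lag = 10000
  innodb_thread_concurrency = 10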
- Ensure that heat has enough workers to perform an overcloud creation. Usually, this depends on how many CPUs the undercloud has. To manually set the number of workers, edit the /etc/heat/heat.conf file, set the num_engine_workers parameter to the number of workers you need (ideally 4), and restart the heat engine:

  $ sudo systemctl restart openstack-heat-engine
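  The corresponding heat.conf setting, sketched under the assumption that num_engine_workers is in the [DEFAULT] section:

  [DEFAULT]
  num_engine_workers = 4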
15.9. Creating an sosreport
If you need to contact Red Hat for support on OpenStack Platform, you might need to generate an sosreport. See the Red Hat Knowledgebase for more information on how to create an sosreport.
15.10. Important Logs for Undercloud and Overcloud
Use the following logs to find out information about the undercloud and overcloud when troubleshooting.
Undercloud logs:

| Information | Log Location |
|---|---|
| OpenStack Compute log | /var/log/nova/nova-compute.log |
| OpenStack Compute API interactions | /var/log/nova/nova-api.log |
| OpenStack Compute Conductor log | /var/log/nova/nova-conductor.log |
| OpenStack Orchestration log | /var/log/heat/heat-engine.log |
| OpenStack Orchestration API interactions | /var/log/heat/heat-api.log |
| OpenStack Orchestration CloudFormations log | /var/log/heat/heat-api-cfn.log |
| OpenStack Bare Metal Conductor log | /var/log/ironic/ironic-conductor.log |
| OpenStack Bare Metal API interactions | /var/log/ironic/ironic-api.log |
| Introspection | /var/log/ironic-inspector/ironic-inspector.log |
| OpenStack Workflow Engine log | /var/log/mistral/engine.log |
| OpenStack Workflow Executor log | /var/log/mistral/executor.log |
| OpenStack Workflow API interactions | /var/log/mistral/api.log |
Overcloud logs:

| Information | Log Location |
|---|---|
| Cloud-Init Log | /var/log/cloud-init.log |
| Overcloud Configuration (Summary of Last Puppet Run) | /var/lib/puppet/state/last_run_summary.yaml |
| Overcloud Configuration (Report from Last Puppet Run) | /var/lib/puppet/state/last_run_report.yaml |
| Overcloud Configuration (All Puppet Reports) | /var/lib/puppet/reports/overcloud-*/* |
| Overcloud Configuration (stdout from each Puppet Run) | /var/run/heat-config/deployed/*-stdout.log |
| Overcloud Configuration (stderr from each Puppet Run) | /var/run/heat-config/deployed/*-stderr.log |
| High availability log | /var/log/pacemaker.log |