Troubleshooting Ansible Automation Platform
Troubleshoot issues with Ansible Automation Platform
Abstract
Preface
Use the Troubleshooting Ansible Automation Platform guide to troubleshoot your Ansible Automation Platform installation.
Providing feedback on Red Hat documentation
If you have a suggestion to improve this documentation, or find an error, you can contact technical support at https://access.redhat.com to open a request.
Disclaimer: Links contained in this information to external website(s) are provided for convenience only. Red Hat has not reviewed the links and is not responsible for the content or its availability. The inclusion of any link to an external website does not imply endorsement by Red Hat of the website or their entities, products or services. You agree that Red Hat is not responsible or liable for any loss or expenses that may result due to your use of (or reliance on) the external site or content.
Chapter 1. Diagnosing the problem
To start troubleshooting Ansible Automation Platform, use the must-gather command on OpenShift Container Platform or the sos utility on a VM-based installation to collect configuration and diagnostic information. You can attach the output of these utilities to your support case.
1.1. Troubleshooting Ansible Automation Platform on OpenShift Container Platform by using the must-gather command
The oc adm must-gather command line interface (CLI) command collects information from your Ansible Automation Platform installation deployed on OpenShift Container Platform. It gathers information that is often needed for debugging issues, including resource definitions and service logs.
Running the oc adm must-gather CLI command creates a new directory containing the collected data that you can use to troubleshoot or attach to your support case.
If your OpenShift environment does not have access to registry.redhat.io and you cannot run the must-gather command, then run the oc adm inspect command instead.
Prerequisites
- The OpenShift CLI (oc) is installed.
Procedure
- Log in to your cluster:
    oc login <openshift_url>
- Run one of the following commands based on your level of access in the cluster:
  - Run must-gather across the entire cluster:
      oc adm must-gather --image=registry.redhat.io/ansible-automation-platform-25/aap-must-gather-rhel8 --dest-dir <dest_dir>
    - --image specifies the image that gathers data
    - --dest-dir specifies the directory for the output
  - Run must-gather for a specific namespace in the cluster:
      oc adm must-gather --image=registry.redhat.io/ansible-automation-platform-25/aap-must-gather-rhel8 --dest-dir <dest_dir> -- /usr/bin/ns-gather <namespace>
    - -- /usr/bin/ns-gather limits the must-gather data collection to a specified namespace
- To attach the must-gather archive to your support case, create a compressed file from the must-gather directory that the previous step created, and attach that file to your support case. For example, on a computer that uses a Linux operating system, run the following command, replacing <must-gather.local.5421342344627712289/> with the must-gather directory name:
    $ tar cvaf must-gather.tar.gz <must-gather.local.5421342344627712289/>
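Where the tar command is not available, the archive step can be approximated with the Python standard library. The following is a minimal sketch, not part of the product tooling: the function name archive_must_gather and its default archive name are assumptions made for illustration.

```python
import tarfile
from pathlib import Path


def archive_must_gather(source_dir: str, archive_path: str = "must-gather.tar.gz") -> Path:
    """Create a gzip-compressed tar archive of a must-gather output directory.

    Hypothetical helper: roughly equivalent to `tar caf <archive> <dir>`.
    """
    source = Path(source_dir)
    if not source.is_dir():
        raise NotADirectoryError(f"{source} is not a directory")
    target = Path(archive_path)
    # "w:gz" writes a gzip-compressed archive.
    with tarfile.open(target, "w:gz") as tar:
        # arcname stores only the directory name, not its absolute path.
        tar.add(source, arcname=source.name)
    return target
```

Write the archive outside the must-gather directory so that the archive does not include itself.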
1.2. Troubleshooting Ansible Automation Platform on VM-based installations by generating an sos report
The sos utility collects configuration, diagnostic, and troubleshooting data from a VM-based installation of Ansible Automation Platform.
For more information about installing and using the sos utility, see Generating an sos report for technical support.
Chapter 2. Resources for troubleshooting automation controller
Find information about troubleshooting automation controller performance and logging issues.
- For information about performance troubleshooting for automation controller, see Performance troubleshooting for automation controller in Configuring automation execution.
- For information about troubleshooting automation controller logging, see Troubleshooting logging in Configuring automation execution.
Chapter 3. Backup and recovery
Find information about troubleshooting backup and recovery operations for Ansible Automation Platform.
- For information about troubleshooting backup and recovery for installations of Ansible Automation Platform Operator on OpenShift Container Platform, see the Troubleshooting section in Backup and recovery for operator environments.
Chapter 4. Execution environments
Resolve issues with execution environment images, including problems with the "Use in Controller" option.
4.1. Issue - Cannot select "Use in Controller" for execution environment on private automation hub
You cannot use the Use in Controller option for an execution environment image on private automation hub. You also receive the error message: “No Controllers available”.
To resolve this issue, connect automation controller to your private automation hub instance.
Procedure
- Edit the /etc/pulp/settings.py file on private automation hub and add one of the following parameters, depending on your configuration:
  - Single controller
      CONNECTED_ANSIBLE_CONTROLLERS = ['<https://my.controller.node>']
  - Many controllers behind a load balancer
      CONNECTED_ANSIBLE_CONTROLLERS = ['<https://my.controller.loadbalancer>']
  - Many controllers without a load balancer
      CONNECTED_ANSIBLE_CONTROLLERS = ['<https://my.controller.node1>', '<https://my.controller2.node2>']
- Stop all of the private automation hub services:
    # systemctl stop pulpcore.service pulpcore-api.service pulpcore-content.service pulpcore-worker@1.service pulpcore-worker@2.service nginx.service redis.service
- Restart all of the private automation hub services:
    # systemctl start pulpcore.service pulpcore-api.service pulpcore-content.service pulpcore-worker@1.service pulpcore-worker@2.service nginx.service redis.service
Verification
- Verify that you can now use the Use in Controller option in private automation hub.
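Because CONNECTED_ANSIBLE_CONTROLLERS is a Python list of URLs, a malformed entry is easy to miss. As an illustration only, the following sketch checks a list of controller URLs and renders the line to paste into /etc/pulp/settings.py. The function name render_connected_controllers and the example.com hosts are hypothetical.

```python
from urllib.parse import urlparse


def render_connected_controllers(urls):
    """Return a settings.py line for CONNECTED_ANSIBLE_CONTROLLERS.

    Hypothetical helper: rejects entries that are not absolute http(s)
    URLs and drops duplicates while keeping the original order.
    """
    cleaned = []
    for url in urls:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
        if url not in cleaned:
            cleaned.append(url)
    return f"CONNECTED_ANSIBLE_CONTROLLERS = {cleaned!r}"


# Example with placeholder hosts: two controllers without a load balancer.
print(render_connected_controllers(
    ["https://controller1.example.com", "https://controller2.example.com"]
))
```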
Chapter 5. Installation
Find information about troubleshooting containerized, operator, and RPM-based installations of Ansible Automation Platform.
- For information about troubleshooting your containerized Ansible Automation Platform installation, see Troubleshooting containerized Ansible Automation Platform installation.
- For information about troubleshooting your Red Hat Ansible Automation Platform Operator on OpenShift Container Platform installation, see Troubleshooting the Red Hat Ansible Automation Platform Operator on OpenShift Container Platform.
- For information about troubleshooting your RPM-based installation of Ansible Automation Platform, see Troubleshooting RPM installation of Ansible Automation Platform.
Chapter 6. Jobs
Resolve common job issues including module resolution errors, timeout errors, pending jobs, and permission errors.
6.1. Issue - Jobs are failing with “ERROR! couldn’t resolve module/action” error message
Jobs are failing with the error message “ERROR! couldn’t resolve module/action 'module name'. This often indicates a misspelling, missing collection, or incorrect module path”.
This error can happen when the collection associated with the module is missing from the execution environment.
The recommended resolution is to create a custom execution environment and add the required collections inside of that execution environment. For more information about creating an execution environment, see Using Ansible Builder in Creating and using execution environments.
Alternatively, you can complete these steps:
Procedure
- Create a collections folder inside of the project repository.
- Add a requirements.yml file inside of the collections folder and add the collection:
    collections:
      - <collection_name>
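The two steps above can be sketched as a small script that creates the collections folder and writes requirements.yml in a project checkout. This is an illustrative sketch only: the function name write_collection_requirements is hypothetical, and the collection name in the usage comment is an example value, not a recommendation.

```python
from pathlib import Path


def write_collection_requirements(project_root, collection_names):
    """Write <project_root>/collections/requirements.yml listing the given collections.

    Hypothetical helper: emits the simple "collections:" list form shown
    in the procedure, one collection name per entry.
    """
    collections_dir = Path(project_root) / "collections"
    collections_dir.mkdir(parents=True, exist_ok=True)
    lines = ["collections:"] + [f"  - {name}" for name in collection_names]
    path = collections_dir / "requirements.yml"
    path.write_text("\n".join(lines) + "\n")
    return path


# Usage (example values): write_collection_requirements(".", ["community.general"])
```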
6.2. Issue - Jobs failing with timeout waiting for privilege escalation prompt
This error can happen when the timeout value is too small, causing the job to stop before completion. The default timeout value for connection plugins is 10 seconds.
To resolve the issue, increase the timeout value by completing one of the following methods.
The following changes will affect all of the jobs in automation controller. To use a timeout value for a specific project, add an ansible.cfg file in the root of the project directory and add the timeout parameter value to that ansible.cfg file.
Procedure
Increase the timeout value by using one of the following methods:
- Add ANSIBLE_TIMEOUT as an environment variable in the automation controller UI:
  - Go to automation controller.
  - From the navigation panel, select → .
  - Under Extra Environment Variables add the following:
      { "ANSIBLE_TIMEOUT": 60 }
- Add a timeout value in the [defaults] section of the ansible.cfg file:
  - Edit the /etc/ansible/ansible.cfg file and add the following:
      [defaults]
      timeout = 60
- Run ad hoc commands with a timeout:
  - To run an ad hoc playbook in the command line, add the --timeout flag to the ansible-playbook command, for example:
      # ansible-playbook --timeout=60 <your_playbook.yml>
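For the per-project alternative mentioned earlier (an ansible.cfg file in the root of the project directory), the following sketch sets the timeout value with the Python standard library. It is a minimal illustration under stated assumptions: the function name set_project_timeout is hypothetical, and configparser discards any comments in an existing file when it rewrites it.

```python
import configparser
from pathlib import Path


def set_project_timeout(project_root, seconds):
    """Set timeout in the [defaults] section of <project_root>/ansible.cfg.

    Hypothetical helper. Creates the file if it does not exist.
    Note: existing comments in the file are lost on rewrite.
    """
    cfg_path = Path(project_root) / "ansible.cfg"
    # interpolation=None keeps literal "%" characters in existing values intact.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(cfg_path)  # a missing file is skipped without error
    if not parser.has_section("defaults"):
        parser.add_section("defaults")
    parser.set("defaults", "timeout", str(seconds))
    with cfg_path.open("w") as handle:
        parser.write(handle)
    return cfg_path
```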
6.3. Issue - Jobs in automation controller are stuck in a pending state
After launching jobs in automation controller, the jobs stay in a pending state and do not start.
Jobs can become stuck in a pending state for several reasons. For more information about troubleshooting this issue, see Playbook stays in pending in Configuring automation execution.
Procedure
- Run the following commands to list all of the pending jobs:
    # awx-manage shell_plus
    >>> UnifiedJob.objects.filter(status='pending')
- Cancel the pending jobs by using one of the following methods:
  - To cancel all pending jobs, run the following command:
      >>> UnifiedJob.objects.filter(status='pending').update(status='canceled')
  - To cancel a single job, run the following command, replacing <job_id> with the job ID to cancel:
      >>> UnifiedJob.objects.filter(id=<job_id>).update(status='canceled')
6.4. Issue - Jobs failing with insufficient permissions error in private automation hub
Jobs are failing with the error message "denied: requested access to the resource is denied, unauthorized: Insufficient permissions". This happens when using an execution environment in private automation hub.
This issue occurs when you protect private automation hub with a password or token but do not assign the registry credential to the execution environment.
Procedure
- Go to automation controller.
- From the navigation panel, select → .
- Click the execution environment assigned to the job template that is failing.
- Click .
- Assign the appropriate Registry credential from your private automation hub to the execution environment.
Chapter 7. Networking
Resolve networking issues including subnet conflicts and SSL/TLS certificate problems.
7.1. Issue - Container subnet conflicts with internal network
The default subnet used in Ansible Automation Platform containers conflicts with the internal network, resulting in "No route to host" errors.
To resolve this issue, update the default classless inter-domain routing (CIDR) value so it does not conflict with the CIDR used by the default Podman networking plugin.
Procedure
- In all controller and hybrid nodes, run the following commands to create a file called custom.py:
    # touch /etc/tower/conf.d/custom.py
    # chmod 640 /etc/tower/conf.d/custom.py
    # chown root:awx /etc/tower/conf.d/custom.py
- Add the following to the /etc/tower/conf.d/custom.py file:
    DEFAULT_CONTAINER_RUN_OPTIONS = ['--network', 'slirp4netns:enable_ipv6=true,cidr=192.168.1.0/24']
  - 192.168.1.0/24 is the value for the new CIDR in this example.
- Stop and start the automation controller service in all controller and hybrid nodes:
    # automation-controller-service stop
    # automation-controller-service start
  All containers will start on the new CIDR.
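Before choosing a new CIDR, it can help to confirm that the candidate range does not overlap any of your internal networks. The following sketch uses the Python ipaddress module for that check and builds the option list in the form shown above. The function names and the internal network ranges in the usage example are hypothetical values chosen for illustration.

```python
import ipaddress


def find_conflicts(container_cidr, internal_cidrs):
    """Return the internal CIDRs that overlap the candidate container CIDR."""
    container = ipaddress.ip_network(container_cidr)
    return [
        cidr for cidr in internal_cidrs
        if container.overlaps(ipaddress.ip_network(cidr))
    ]


def container_run_options(container_cidr):
    """Build a DEFAULT_CONTAINER_RUN_OPTIONS value for the given CIDR."""
    # ip_network raises ValueError if the CIDR is malformed or has host bits set.
    ipaddress.ip_network(container_cidr)
    return ["--network", f"slirp4netns:enable_ipv6=true,cidr={container_cidr}"]


# Example with made-up internal ranges: an empty list means no overlap was found.
print(find_conflicts("192.168.1.0/24", ["10.0.0.0/8", "172.16.0.0/12"]))
```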
7.2. Troubleshooting SSL/TLS issues
To troubleshoot SSL/TLS issues, verify the certificate chain, use the correct certificates, and confirm that a trusted Certificate Authority (CA) signed the certificate.
Procedure
- Check if the server is reachable over SSL/TLS. Run the following command to confirm whether the server is reachable over SSL/TLS and to see the full certificate chain:
    # true | openssl s_client -showcerts -connect <fqdn_or_ip>:<port>
  - Replace <fqdn_or_ip> and <port> with suitable values.
- Verify the certificate details. Run the following command to view the details of a certificate:
    # openssl x509 -in <path_to_certificate> -noout -text
  - Replace <path_to_certificate> with the path to the certificate file you want to inspect.
  The result of the command shows information such as:
  - Subject - The entity the certificate has been issued to.
  - Issuer - The CA that issued the certificate.
  - Validity "Not Before" - The date the certificate was issued.
  - Validity "Not After" - The date the certificate expires.
- Verify a trusted CA signed the certificate. Run the following command to verify that a specific certificate is valid and was signed by a trusted CA:
    openssl verify -CAfile <path_to_ca_public_certificate> <path_to_server_certificate_file_to_verify>
  - If the command returns OK, it means the certificate file is valid and signed by a trusted CA.
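An expired certificate is a common cause of SSL/TLS failures, so the "Not After" date is often the first field to check. The following sketch extracts that date from the text output of openssl x509 and reports the days remaining. It is a minimal illustration: the function names are hypothetical, and it assumes the date appears in the form "Not After : Jan  5 09:34:43 2018 GMT".

```python
import re
import ssl


def not_after_from_text(x509_text):
    """Extract the "Not After" timestamp from `openssl x509 -text` output."""
    match = re.search(r"Not After\s*:\s*(.+ GMT)", x509_text)
    if match is None:
        raise ValueError("no 'Not After' field found")
    return match.group(1).strip()


def days_until_expiry(not_after, now_epoch):
    """Return the number of days between now_epoch and the expiry date.

    now_epoch is seconds since the epoch, for example time.time().
    A negative result means the certificate has already expired.
    """
    # cert_time_to_seconds parses the "Mon DD HH:MM:SS YYYY GMT" format.
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - now_epoch) / 86400
```

For example, pass the captured command output to not_after_from_text, then call days_until_expiry with time.time() as the second argument.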
Chapter 8. Playbooks
You can use automation content navigator to interactively troubleshoot your playbook. For more information, see Troubleshooting Ansible content with automation content navigator.
Chapter 9. Upgrading
Troubleshoot issues when upgrading to Ansible Automation Platform 2.5.
9.1. Issue - automation controller API connection fails after upgrade with load balancer
When upgrading from Ansible Automation Platform 2.4 to 2.5, the upgrade completes successfully. However, connections to the platform gateway URL fail if you are using automation controller behind a load balancer.
You see this error message in the logs:
Error connecting to Controller API
Procedure
To resolve this issue, perform the following tasks for all controller hosts:
- For each controller host, add the platform gateway URL as a trusted source in the CSRF_TRUSTED_ORIGINS setting in the settings.py file. For example, if you configured the platform gateway URL as https://www.example.com, you must also add that URL in the settings.py file, as shown:
    CSRF_TRUSTED_ORIGINS = ['https://appX.example.com:8443','https://www.example.com']
- Restart each controller host by using the automation-controller-service restart command so that the URL changes take effect. For the procedure, see Start, stop, and restart automation controller in Configuring automation execution.
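When the same change has to be made on several controller hosts, a small helper can keep the CSRF_TRUSTED_ORIGINS list consistent and free of duplicates. The following is a sketch only: the function name add_trusted_origin is hypothetical, and it assumes each origin must include its http or https scheme, as in the example above.

```python
def add_trusted_origin(origins, gateway_url):
    """Return a new origins list that includes the platform gateway URL.

    Hypothetical helper: strips a trailing slash, requires an explicit
    scheme, and leaves the list unchanged if the URL is already present.
    """
    url = gateway_url.rstrip("/")
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"origin must include a scheme: {gateway_url!r}")
    if url in origins:
        return list(origins)
    return [*origins, url]


# Example using the values from the procedure above.
current = ["https://appX.example.com:8443"]
print(f"CSRF_TRUSTED_ORIGINS = {add_trusted_origin(current, 'https://www.example.com')!r}")
```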