Appendix A. Troubleshooting containerized Ansible Automation Platform
Use this information to troubleshoot your containerized Ansible Automation Platform installation.
A.1. Troubleshooting containerized Ansible Automation Platform installation
The installation takes a long time, or has errors, what should I check?
- Ensure your system meets the minimum requirements as outlined in the installation guide. Items such as improper storage choices and high latency when distributing across many hosts will all have a significant impact.
-
Check the installation log file located by default at
./aap_install.log
unless otherwise changed within the local installeransible.cfg
. -
Enable task profiling callbacks on an ad hoc basis to give an overview of where the installation program spends the most time. To do this, use the local
ansible.cfg
file. Add a callback line such as this under the[defaults]
section:
$ cat ansible.cfg [defaults] callbacks_enabled = ansible.posix.profile_tasks
Automation controller returns an error of 413
This error is due to manifest.zip
license files that are larger than the nginx_client_max_body_size
setting. If this error occurs, you will need to change the installation inventory file to include the following variables:
nginx_disable_hsts: false nginx_http_port: 8081 nginx_https_port: 8444 nginx_client_max_body_size: 20m nginx_user_headers: []
The current default setting of 20m
should be enough to avoid this issue.
The installation failed with a “502 Bad Gateway” when going to the controller UI.
This error can occur and manifest itself in the installation application output as:
TASK [ansible.containerized_installer.automationcontroller : Wait for the Controller API to te ready] ****************************************************** fatal: [daap1.lan]: FAILED! => {"changed": false, "connection": "close", "content_length": "150", "content_type": "text/html", "date": "Fri, 29 Sep 2023 09:42:32 GMT", "elapsed": 0, "msg": "Status code was 502 and not [200]: HTTP Error 502: Bad Gateway", "redirected": false, "server": "nginx", "status": 502, "url": "https://daap1.lan:443/api/v2/ping/"}
-
Check if you have an
automation-controller-web
container running and a systemd service.
This is used at the regular unprivileged user not system wide level. If you have used su
to switch to the user running the containers, you must set your XDG_RUNTIME_DIR
environment variable to the correct value to be able to interact with the user systemctl
units.
export XDG_RUNTIME_DIR="/run/user/$UID"
podman ps | grep web systemctl --user | grep web
No output indicates a problem.
Try restarting the
automation-controller-web
service:systemctl start automation-controller-web.service --user systemctl --user | grep web systemctl status automation-controller-web.service --user
Sep 29 10:55:16 daap1.lan automation-controller-web[29875]: nginx: [emerg] bind() to 0.0.0.0:443 failed (98: Address already in use) Sep 29 10:55:16 daap1.lan automation-controller-web[29875]: nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)
The output indicates that the port is already, or still, in use by another service. In this case
nginx
.Run:
sudo pkill nginx
- Restart and status check the web service again.
Normal service output should look similar to the following, and should still be running:
Sep 29 10:59:26 daap1.lan automation-controller-web[30274]: WSGI app 0 (mountpoint='/') ready in 3 seconds on interpreter 0x1a458c10 pid: 17 (default app) Sep 29 10:59:26 daap1.lan automation-controller-web[30274]: WSGI app 0 (mountpoint='/') ready in 3 seconds on interpreter 0x1a458c10 pid: 20 (default app) Sep 29 10:59:27 daap1.lan automation-controller-web[30274]: 2023-09-29 09:59:27,043 INFO [-] daphne.cli Starting server at tcp:port=8051:interface=127.0.> Sep 29 10:59:27 daap1.lan automation-controller-web[30274]: 2023-09-29 09:59:27,043 INFO Starting server at tcp:port=8051:interface=127.0.0.1 Sep 29 10:59:27 daap1.lan automation-controller-web[30274]: 2023-09-29 09:59:27,048 INFO [-] daphne.server HTTP/2 support not enabled (install the http2 > Sep 29 10:59:27 daap1.lan automation-controller-web[30274]: 2023-09-29 09:59:27,048 INFO HTTP/2 support not enabled (install the http2 and tls Twisted ex> Sep 29 10:59:27 daap1.lan automation-controller-web[30274]: 2023-09-29 09:59:27,049 INFO [-] daphne.server Configuring endpoint tcp:port=8051:interface=1> Sep 29 10:59:27 daap1.lan automation-controller-web[30274]: 2023-09-29 09:59:27,049 INFO Configuring endpoint tcp:port=8051:interface=127.0.0.1 Sep 29 10:59:27 daap1.lan automation-controller-web[30274]: 2023-09-29 09:59:27,051 INFO [-] daphne.server Listening on TCP address 127.0.0.1:8051 Sep 29 10:59:27 daap1.lan automation-controller-web[30274]: 2023-09-29 09:59:27,051 INFO Listening on TCP address 127.0.0.1:8051 Sep 29 10:59:54 daap1.lan automation-controller-web[30274]: 2023-09-29 09:59:54,139 INFO success: nginx entered RUNNING state, process has stayed up for > th> Sep 29 10:59:54 daap1.lan automation-controller-web[30274]: 2023-09-29 09:59:54,139 INFO success: nginx entered RUNNING state, process has stayed up for > th> Sep 29 10:59:54 daap1.lan automation-controller-web[30274]: 2023-09-29 09:59:54,139 INFO success: uwsgi entered RUNNING state, process has stayed up for > th> Sep 29 10:59:54 daap1.lan automation-controller-web[30274]: 2023-09-29 09:59:54,139 INFO success: uwsgi entered RUNNING state, process has stayed up for > th> Sep 29 10:59:54 daap1.lan automation-controller-web[30274]: 2023-09-29 09:59:54,139 INFO success: daphne entered RUNNING state, process has stayed up for > t> Sep 29 10:59:54 daap1.lan automation-controller-web[30274]: 2023-09-29 09:59:54,139 INFO success: daphne entered RUNNING state, process has stayed up for > t> Sep 29 10:59:54 daap1.lan automation-controller-web[30274]: 2023-09-29 09:59:54,139 INFO success: ws-heartbeat entered RUNNING state, process has stayed up f> Sep 29 10:59:54 daap1.lan automation-controller-web[30274]: 2023-09-29 09:59:54,139 INFO success: ws-heartbeat entered RUNNING state, process has stayed up f> Sep 29 10:59:54 daap1.lan automation-controller-web[30274]: 2023-09-29 09:59:54,139 INFO success: cache-clear entered RUNNING state, process has stayed up fo> Sep 29 10:59:54 daap1.lan automation-controller-web[30274]: 2023-09-29 09:59:54,139 INFO success: cache-clear entered RUNNING state, process has stayed up
You can run the installation program again to ensure everything installs as expected.
When attempting to install containerized Ansible Automation Platform in Amazon Web Services you receive output that there is no space left on device
TASK [ansible.containerized_installer.automationcontroller : Create the receptor container] *************************************************** fatal: [ec2-13-48-25-168.eu-north-1.compute.amazonaws.com]: FAILED! => {"changed": false, "msg": "Can't create container receptor", "stderr": "Error: creating container storage: creating an ID-mapped copy of layer \"98955f43cc908bd50ff43585fec2c7dd9445eaf05eecd1e3144f93ffc00ed4ba\": error during chown: storage-chown-by-maps: lchown usr/local/lib/python3.9/site-packages/azure/mgmt/network/v2019_11_01/operations/__pycache__/_available_service_aliases_operations.cpython-39.pyc: no space left on device: exit status 1\n", "stderr_lines": ["Error: creating container storage: creating an ID-mapped copy of layer \"98955f43cc908bd50ff43585fec2c7dd9445eaf05eecd1e3144f93ffc00ed4ba\": error during chown: storage-chown-by-maps: lchown usr/local/lib/python3.9/site-packages/azure/mgmt/network/v2019_11_01/operations/__pycache__/_available_service_aliases_operations.cpython-39.pyc: no space left on device: exit status 1"], "stdout": "", "stdout_lines": []}
If you are installing a /home
filesystem into a default Amazon Web Services marketplace RHEL instance, it might be too small since /home
is part of the root /
filesystem. You will need to make more space available. The documentation specifies a minimum of 40GB for a single-node deployment of containerized Ansible Automation Platform.
A.2. Troubleshooting containerized Ansible Automation Platform configuration
Sometimes the post install for seeding my Ansible Automation Platform content errors out. This could manifest itself as output similar to this:
TASK [infra.controller_configuration.projects : Configure Controller Projects | Wait for finish the projects creation] *************************************** Friday 29 September 2023 11:02:32 +0100 (0:00:00.443) 0:00:53.521 ****** FAILED - RETRYING: [daap1.lan]: Configure Controller Projects | Wait for finish the projects creation (1 retries left). failed: [daap1.lan] (item={'failed': 0, 'started': 1, 'finished': 0, 'ansible_job_id': '536962174348.33944', 'results_file': '/home/aap/.ansible_async/536962174348.33944', 'changed': False, '__controller_project_item': {'name': 'AAP Config-As-Code Examples', 'organization': 'Default', 'scm_branch': 'main', 'scm_clean': 'no', 'scm_delete_on_update': 'no', 'scm_type': 'git', 'scm_update_on_launch': 'no', 'scm_url': 'https://github.com/user/repo.git'}, 'ansible_loop_var': '__controller_project_item'}) => {"__projects_job_async_results_item": {"__controller_project_item": {"name": "AAP Config-As-Code Examples", "organization": "Default", "scm_branch": "main", "scm_clean": "no", "scm_delete_on_update": "no", "scm_type": "git", "scm_update_on_launch": "no", "scm_url": "https://github.com/user/repo.git"}, "ansible_job_id": "536962174348.33944", "ansible_loop_var": "__controller_project_item", "changed": false, "failed": 0, "finished": 0, "results_file": "/home/aap/.ansible_async/536962174348.33944", "started": 1}, "ansible_job_id": "536962174348.33944", "ansible_loop_var": "__projects_job_async_results_item", "attempts": 30, "changed": false, "finished": 0, "results_file": "/home/aap/.ansible_async/536962174348.33944", "started": 1, "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
The infra.controller_configuration.dispatch
role uses an asynchronous loop with 30 retries to apply each configuration type, and the default delay between retries is 1 second. If the configuration is large, this might not be enough time to apply everything before the last retry occurs.
Increase the retry delay by setting the controller_configuration_async_delay
variable to something other than 1 second. For example, setting it to 2 seconds doubles the retry time. The place to do this would be in the repository where the controller configuration is defined. It could also be added to the [all:vars]
section of the installation program inventory file.
A few instances have shown that no additional modification is required, and re-running the installation program again worked.
A.3. Containerized Ansible Automation Platform reference
Can you provide details of the architecture for the Ansible Automation Platform containerized design?
We use as much of the underlying native RHEL technology as possible. For the container runtime and management of services we use Podman. Many Podman services and commands are used to show and investigate the solution.
For instance, use podman ps
, and podman images
to see some of the foundational and running pieces:
[aap@daap1 aap]$ podman ps CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES 88ed40495117 registry.redhat.io/rhel8/postgresql-13:latest run-postgresql 48 minutes ago Up 47 minutes postgresql 8f55ba612f04 registry.redhat.io/rhel8/redis-6:latest run-redis 47 minutes ago Up 47 minutes redis 56c40445c590 registry.redhat.io/ansible-automation-platform-24/ee-supported-rhel8:latest /usr/bin/receptor... 47 minutes ago Up 47 minutes receptor f346f05d56ee registry.redhat.io/ansible-automation-platform-24/controller-rhel8:latest /usr/bin/launch_a... 47 minutes ago Up 45 minutes automation-controller-rsyslog 26e3221963e3 registry.redhat.io/ansible-automation-platform-24/controller-rhel8:latest /usr/bin/launch_a... 46 minutes ago Up 45 minutes automation-controller-task c7ac92a1e8a1 registry.redhat.io/ansible-automation-platform-24/controller-rhel8:latest /usr/bin/launch_a... 46 minutes ago Up 28 minutes automation-controller-web [aap@daap1 aap]$ podman images REPOSITORY TAG IMAGE ID CREATED SIZE registry.redhat.io/ansible-automation-platform-24/ee-supported-rhel8 latest b497bdbee59e 10 days ago 3.16 GB registry.redhat.io/ansible-automation-platform-24/controller-rhel8 latest ed8ebb1c1baa 10 days ago 1.48 GB registry.redhat.io/rhel8/redis-6 latest 78905519bb05 2 weeks ago 357 MB registry.redhat.io/rhel8/postgresql-13 latest 9b65bc3d0413 2 weeks ago 765 MB [aap@daap1 aap]$
Containerized Ansible Automation Platform runs as rootless containers for maximum out-of-the-box security. This means you can install containerized Ansible Automation Platform by using any local unprivileged user account. Privilege escalation is only needed for certain root level tasks, and by default is not needed to use root directly.
Once installed, you will notice certain items have populate on the filesystem where the installation program is run (the underlying RHEL host).
[aap@daap1 aap]$ tree -L 1 . ├── aap_install.log ├── ansible.cfg ├── collections ├── galaxy.yml ├── inventory ├── LICENSE ├── meta ├── playbooks ├── plugins ├── README.md ├── requirements.yml ├── roles
Other containerized services that make use of things such as Podman volumes, reside under the installation root directory used. Here are some examples for further reference:
The containers directory contains some of the Podman specifics used and installed for the execution plane:
containers/ ├── podman ├── storage │ ├── defaultNetworkBackend │ ├── libpod │ ├── networks │ ├── overlay │ ├── overlay-containers │ ├── overlay-images │ ├── overlay-layers │ ├── storage.lock │ └── userns.lock └── storage.conf
The controller directory has some of the installed configuration and runtime data points:
controller/ ├── data │ ├── job_execution │ ├── projects │ └── rsyslog ├── etc │ ├── conf.d │ ├── launch_awx_task.sh │ ├── settings.py │ ├── tower.cert │ └── tower.key ├── nginx │ └── etc ├── rsyslog │ └── run └── supervisor └── run
The receptor directory has the automation mesh configuration:
receptor/ ├── etc │ └── receptor.conf └── run ├── receptor.sock └── receptor.sock.lock
After installation, you will also find other pieces in the local users home directory such as the .cache
directory:
.cache/ ├── containers │ └── short-name-aliases.conf.lock └── rhsm └── rhsm.log
As we run by default in the most secure manner, such as rootless Podman, we can also use other services such as running systemd
as non-privileged users. Under systemd
you can see some of the component service controls available:
The .config
directory:
.config/ ├── cni │ └── net.d │ └── cni.lock ├── containers │ ├── auth.json │ └── containers.conf └── systemd └── user ├── automation-controller-rsyslog.service ├── automation-controller-task.service ├── automation-controller-web.service ├── default.target.wants ├── podman.service.d ├── postgresql.service ├── receptor.service ├── redis.service └── sockets.target.wants
This is specific to Podman and conforms to the Open Container Initiative (OCI) specifications. Whereas Podman run as the root user would use /var/lib/containers
by default, for standard users the hierarchy under $HOME/.local
is used.
The .local
directory:
.local/ └── share └── containers ├── cache ├── podman └── storage As an example `.local/storage/volumes` contains what the output from `podman volume ls` provides: [aap@daap1 containers]$ podman volume ls DRIVER VOLUME NAME local d73d3fe63a957bee04b4853fd38c39bf37c321d14fdab9ee3c9df03645135788 local postgresql local redis_data local redis_etc local redis_run
We isolate the execution plane from the control plane main services (PostgreSQL, Redis, automation controller, receptor, automation hub and Event-Driven Ansible).
Control plane services run with the standard Podman configuration (~/.local/share/containers/storage
).
Execution plane services use a dedicated configuration or storage (~/aap/containers/storage
) to avoid execution plane containers to be able to interact with the control plane.
How can I see host resource utilization statistics?
- Run:
$ podman container stats -a
podman container stats -a ID NAME CPU % MEM USAGE / LIMIT MEM % NET IO BLOCK IO PIDS CPU TIME AVG CPU % 0d5d8eb93c18 automation-controller-web 0.23% 959.1MB / 3.761GB 25.50% 0B / 0B 0B / 0B 16 20.885142s 1.19% 3429d559836d automation-controller-rsyslog 0.07% 144.5MB / 3.761GB 3.84% 0B / 0B 0B / 0B 6 4.099565s 0.23% 448d0bae0942 automation-controller-task 1.51% 633.1MB / 3.761GB 16.83% 0B / 0B 0B / 0B 33 34.285272s 1.93% 7f140e65b57e receptor 0.01% 5.923MB / 3.761GB 0.16% 0B / 0B 0B / 0B 7 1.010613s 0.06% c1458367ca9c redis 0.48% 10.52MB / 3.761GB 0.28% 0B / 0B 0B / 0B 5 9.074042s 0.47% ef712cc2dc89 postgresql 0.09% 21.88MB / 3.761GB 0.58% 0B / 0B 0B / 0B 21 15.571059s 0.80%
The previous is an example of a Dell sold and offered containerized Ansible Automation Platform solution (DAAP) install and utilizes ~1.8Gb RAM.
How much storage is used and where?
As we run rootless Podman the container volume storage is under the local user at $HOME/.local/share/containers/storage/volumes
.
To view the details of each volume run:
$ podman volume ls
Then run:
$ podman volume inspect <volume_name>
Here is an example:
$ podman volume inspect postgresql [ { "Name": "postgresql", "Driver": "local", "Mountpoint": "/home/aap/.local/share/containers/storage/volumes/postgresql/_data", "CreatedAt": "2024-01-08T23:39:24.983964686Z", "Labels": {}, "Scope": "local", "Options": {}, "MountCount": 0, "NeedsCopyUp": true } ]
Several files created by the installation program are located in $HOME/aap/
and bind-mounted into various running containers.
To view the mounts associated with a container run:
$ podman ps --format "{{.ID}}\t{{.Command}}\t{{.Names}}"
Example: $ podman ps --format "{{.ID}}\t{{.Command}}\t{{.Names}}" 89e779b81b83 run-postgresql postgresql 4c33cc77ef7d run-redis redis 3d8a028d892d /usr/bin/receptor... receptor 09821701645c /usr/bin/launch_a... automation-controller-rsyslog a2ddb5cac71b /usr/bin/launch_a... automation-controller-task fa0029a3b003 /usr/bin/launch_a... automation-controller-web 20f192534691 gunicorn --bind 1... automation-eda-api f49804c7e6cb daphne -b 127.0.0... automation-eda-daphne d340b9c1cb74 /bin/sh -c nginx ... automation-eda-web 111f47de5205 aap-eda-manage rq... automation-eda-worker-1 171fcb1785af aap-eda-manage rq... automation-eda-worker-2 049d10555b51 aap-eda-manage rq... automation-eda-activation-worker-1 7a78a41a8425 aap-eda-manage rq... automation-eda-activation-worker-2 da9afa8ef5e2 aap-eda-manage sc... automation-eda-scheduler 8a2958be9baf gunicorn --name p... automation-hub-api 0a8b57581749 gunicorn --name p... automation-hub-content 68005b987498 nginx -g daemon o... automation-hub-web cb07af77f89f pulpcore-worker automation-hub-worker-1 a3ba05136446 pulpcore-worker automation-hub-worker-2
Then run:
$ podman inspect <container_name> | jq -r .[].Mounts[].Source
Example: /home/aap/.local/share/containers/storage/volumes/receptor_run/_data /home/aap/.local/share/containers/storage/volumes/redis_run/_data /home/aap/aap/controller/data/rsyslog /home/aap/aap/controller/etc/tower.key /home/aap/aap/controller/etc/conf.d/callback_receiver_workers.py /home/aap/aap/controller/data/job_execution /home/aap/aap/controller/nginx/etc/controller.conf /home/aap/aap/controller/etc/conf.d/subscription_usage_model.py /home/aap/aap/controller/etc/conf.d/cluster_host_id.py /home/aap/aap/controller/etc/conf.d/insights.py /home/aap/aap/controller/rsyslog/run /home/aap/aap/controller/data/projects /home/aap/aap/controller/etc/settings.py /home/aap/aap/receptor/etc/receptor.conf /home/aap/aap/controller/etc/conf.d/execution_environments.py /home/aap/aap/tls/extracted /home/aap/aap/controller/supervisor/run /home/aap/aap/controller/etc/uwsgi.ini /home/aap/aap/controller/etc/conf.d/container_groups.py /home/aap/aap/controller/etc/launch_awx_task.sh /home/aap/aap/controller/etc/tower.cert
If the
jq
RPM is not installed, install with:$ sudo dnf -y install jq