Chapter 3. Before you start
Before deploying AAP on Red Hat OpenShift, it is important to understand several key considerations that need to be addressed. These factors determine the health and scalability of your AAP environment throughout its lifecycle.
In this section, you will find a breakdown of those key considerations, including:
- Red Hat OpenShift resource management
- Sizing recommendations for your automation controller pod containers
- Sizing recommendations for your Postgres pod
- Sizing recommendations for your automation job pods
- Sizing recommendations for your automation hub pods
One of the key aspects of a successful deployment is proper resource management for pods and containers, which ensures optimal performance and availability of your AAP application on your Red Hat OpenShift cluster.
3.1. Resource management for pods and containers
The two key resources to manage are CPU and memory (RAM). Red Hat OpenShift uses resource requests and resource limits to control the amount of resources that a container can consume in a pod.
3.1.1. What is a resource request?
A resource request is the minimum amount of resources a container needs to run and function properly. The Kubernetes scheduler uses this value to ensure that there are enough resources available for the container.
3.1.2. What is a resource limit?
Resource limits, on the other hand, are the maximum amount of resources that a container can consume. Setting resource limits ensures a container does not consume more resources than it should, which can cause other containers to suffer from resource starvation.
3.1.3. Why does resource management matter?
When it comes to AAP, setting the correct resource requests and limits is crucial. Inadequate resource allocation can result in the termination of the control pod, causing the loss of all automation jobs within the automation controller.
3.1.4. Planning of resources
While setting the proper resource management values, organizations need to consider which architecture best suits their needs based on the resources available. For example, is high availability of the Ansible Automation Platform environment more important than maximizing the capacity to run automation jobs?
To better illustrate, let’s take the existing Red Hat OpenShift environment used in this reference architecture. It consists of:
- 3 control plane nodes
- 3 worker nodes
Each of these nodes consists of 4 vCPUs and 16 GiB of RAM.
Since the control plane nodes of a Red Hat OpenShift cluster do not run any applications, the example focuses on the 3 worker nodes that are available.
With these 3 worker nodes, we need to determine which is more important: maximizing Ansible Automation Platform availability, running as many automation jobs as possible, or both.
If availability is of utmost importance, then the focus will be on ensuring that two control pods run on separate worker nodes (e.g. worker0 and worker1), while all automation jobs are run within the remaining worker node (e.g. worker2).
However, this cuts the resources available for running automation jobs in half, as the recommended practice is to keep control pods and automation job pods from running on the same worker node.
If maximizing the number of automation jobs is the main goal, then using one worker node (e.g. worker0) for the control pod and the remaining two worker nodes (e.g. worker1 and worker2) for running automation jobs doubles the resources available for jobs, but at the cost of no redundancy for the control pod.
Of course, the solution may be that both are equally important and if that is the case, additional resources (e.g. adding more worker nodes) would be needed to satisfy both requirements.
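The trade-off described above can be sketched numerically. The following is a minimal, hypothetical illustration (the `execution_capacity` helper is not part of AAP), assuming the 4 vCPU / 16 GiB worker nodes of this reference environment:

```python
# Hypothetical sketch of the availability vs. capacity trade-off,
# using the 4 vCPU / 16 GiB worker nodes of this reference environment.
node_cpu_millicores = 4_000
node_memory_gib = 16
total_workers = 3

def execution_capacity(nodes_for_jobs):
    """Total CPU (millicores) and memory (GiB) left for automation job pods."""
    return (nodes_for_jobs * node_cpu_millicores,
            nodes_for_jobs * node_memory_gib)

# High availability: 2 control pods on separate nodes, 1 node left for jobs.
ha_cpu, ha_mem = execution_capacity(total_workers - 2)
# Maximum job capacity: 1 control node, 2 nodes left for jobs.
cap_cpu, cap_mem = execution_capacity(total_workers - 1)

print(ha_cpu, ha_mem)    # 4000 16
print(cap_cpu, cap_mem)  # 8000 32 - double the resources for jobs
```

Satisfying both goals at once (two control nodes plus two job nodes) would require a fourth worker node, which is the point made above.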
3.2. Size recommendations for your automation controller pod containers
Looking back at Figure 1.1, “automation controller architecture” from the overview section, you will notice the control pod contains four containers:
- web
- ee
- redis
- task
Each of these containers performs a unique function in Ansible automation controller, and it is critical to understand how resource configurations affect the control pod. By default, Red Hat OpenShift provides low values that are sufficient for a minimal test installation but are not optimal for running Ansible Automation Platform in production.
The Red Hat OpenShift defaults for Ansible Automation Platform are:
- CPU: 100m
- Memory: 128Mi
Red Hat OpenShift, by default, does not configure any maximum resource limits, so the Ansible Automation Platform control pod can consume as many resources as it requests. This configuration can starve other applications running on the Red Hat OpenShift cluster of resources.
To demonstrate a starting point for the resource requests and limits of the containers in the control pod, I will be using the following assumptions:
- 3 worker nodes available within a Red Hat OpenShift cluster, each with 4 vCPUs and 16 GiB of RAM
- Maximizing resources for automation jobs is more important than high availability
- One dedicated worker node for running automation controller
- Remaining two worker nodes for running automation jobs
When it comes to sizing the containers within the control pod, it is important to consider the specifics of your workload. While I have conducted performance tests that provide specific recommendations for this reference environment, these recommendations may not be applicable to all types of workloads.
As a starting point, I decided to take advantage of the performance collection playbooks, specifically the chatty_tasks.yml.
The performance benchmark consisted of:
- Creating an inventory with 1 host
- Creating a job template that runs the chatty_tasks.yml file
The chatty_tasks job template utilizes the ansible.builtin.debug module to generate a set number of debug messages per host and generates the necessary inventory. By utilizing the ansible.builtin.debug module, I can obtain an accurate representation of the automation controller’s performance without introducing any additional overhead.
The job template was executed with a specified concurrency level ranging from 10 to 50, indicating the number of simultaneous invocations of the job template.
The resource requests and resource limits depicted below are the results of the performance benchmark and can be used as a starting baseline to run AAP on a Red Hat OpenShift cluster with similar resources.
spec:
  ...
  ee_resource_requirements:
    limits:
      cpu: 500m
      memory: 400Mi
    requests:
      cpu: 100m
      memory: 400Mi
  task_resource_requirements:
    limits:
      cpu: 4000m
      memory: 8Gi
    requests:
      cpu: 1000m
      memory: 8Gi
  web_resource_requirements:
    limits:
      cpu: 2000m
      memory: 1.5Gi
    requests:
      cpu: 500m
      memory: 1.5Gi
  redis_resource_requirements:
    limits:
      cpu: 500m
      memory: 1.5Gi
    requests:
      cpu: 250m
      memory: 1.5Gi
Memory resource requests and limits are matched to prevent overutilization of memory within your Red Hat OpenShift cluster, which can cause an Out Of Memory (OOM) kill of your pods. When memory limits are greater than memory requests, you allow the memory of your Red Hat OpenShift nodes to be overcommitted.
CPU resource requests and limits differ from memory because CPU is considered a compressible resource: when a container hits its CPU limit, Red Hat OpenShift throttles the container rather than terminating it. For the containers in the control pod above, the CPU requests provide sufficient CPU for the given workload, while the higher CPU limits allow each container to burst under load.
The scenario above is making the assumption that no other applications are using resources within that worker node that the control pod resides in as it uses a dedicated Red Hat OpenShift worker node. More details can be found Section 3.7, “Specifying dedicated nodes for automation controller pod”.
3.3. Size recommendations for your Postgres pod
After conducting the performance benchmark tests using the chatty_task playbook, it was observed that a CPU resource request below 500m may cause CPU throttling in a Postgres pod, as additional CPU consumed above the initial resource request, but below the resource limit, is not guaranteed to the pod. The CPU limit was set to 1000m (1 vCPU) because there were bursts during the test that exceeded the 500m request.
With regard to memory, which is not a compressible resource, the Postgres pod consumed slightly over 650Mi of RAM at its peak during the chatty_task performance tests.
Therefore, based on the results, my memory resource request and limit recommendation for this reference environment is 1Gi to provide a sufficient buffer and avoid a potential Out Of Memory (OOM) Kill of the Postgres pod.
The resource requests and resource limits depicted below are the results of the performance benchmark test and can be used as a starting baseline for your Postgres pod.
spec:
  ...
  postgres_resource_requirements:
    limits:
      cpu: 1000m
      memory: 1Gi
    requests:
      cpu: 500m
      memory: 1Gi
The values above are specific to this reference environment and may not be sufficient for your workload. It is important to monitor the performance of your Postgres pod and adjust the resource allocations to meet your performance needs.
3.4. Size recommendations for your automation job pods
An Ansible Automation Platform job is an instance of automation controller launching an Ansible playbook against an inventory of hosts. When Ansible Automation Platform runs on Red Hat OpenShift, the default execution queue is a Container Group created by the operator at install time.
A container group consists of a Kubernetes credential and a default Pod specification. When jobs are launched into a Container Group, a pod is created by automation controller in the namespace specified by the Container Group pod specification. These pods are referred to as automation job pods.
In order to determine an appropriate size for the automation job pods, you must first understand how many jobs the automation controller control plane can launch concurrently.
In this example, we have 3 worker nodes (each 4 vCPU and 16GiB of RAM). One worker node hosts the control pod and the other two worker nodes are used for automation jobs.
Based on these values, we can determine the control capacity that the automation controller control plane can run.
The following formula provides the breakdown:
Total control capacity = Total Memory in MB / Fork size in MB
Based on a worker node, this can be expressed as:
Total control capacity = 16,000 MB / 100 MB = 160
For simplicity, 16 GiB is rounded to 16,000 MB, and the size of one fork is 100 MB by default.
For those interested in more details about the calculations, review Resource Determination for Capacity Algorithm.
What this means is that the automation controller is configured to launch up to 160 jobs concurrently. However, this value needs to be adjusted to match the container group/execution plane capacity, which we will get to shortly.
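As a rough illustration, the control-capacity formula can be sketched in Python (the variable names are mine; the node size and default fork size are those of this reference environment):

```python
# Sketch of: Total control capacity = Total Memory in MB / Fork size in MB
node_memory_mb = 16_000   # one 16 GiB worker node, rounded to 16,000 MB
fork_size_mb = 100        # default memory assumed per fork/job

total_control_capacity = node_memory_mb // fork_size_mb
print(total_control_capacity)  # 160
```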
Now that we’ve calculated the available control capacity, we can determine the maximum number of concurrent automation jobs.
To determine this, we must be aware that an automation job pod specification within a container group/execution plane has a default request of 250m vCPU and 100Mi of RAM.
Using the total memory of one worker node:
16,000 MB / 100 MB = 160 concurrent jobs
Using the total CPU of one worker node:
4000 millicpu / 250 millicpu = 16 concurrent jobs
Based on the above values, the maximum number of concurrent jobs on a node is the smaller of the two values: 16. Since there are two worker nodes allocated to run automation jobs in our example, this number doubles to 32 (16 concurrent jobs per worker node).
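The per-node calculation above can be sketched as follows (variable names are mine; the requests are the defaults stated above):

```python
# Sketch of the concurrent-job calculation for one 4 vCPU / 16 GiB worker
# node, using the default automation job pod requests (250m CPU, 100Mi RAM).
node_memory_mb = 16_000
node_cpu_millicores = 4_000
job_memory_request_mb = 100
job_cpu_request_millicores = 250

jobs_by_memory = node_memory_mb // job_memory_request_mb         # 160
jobs_by_cpu = node_cpu_millicores // job_cpu_request_millicores  # 16

# The scheduler is constrained by whichever resource runs out first.
jobs_per_node = min(jobs_by_memory, jobs_by_cpu)  # 16
worker_nodes_for_jobs = 2
max_concurrent_jobs = jobs_per_node * worker_nodes_for_jobs
print(max_concurrent_jobs)  # 32
```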
Automation controller’s configuration is currently set to 160 concurrent jobs, and the available worker node capacity only allows for 32 concurrent jobs. This is an issue as the numbers are unbalanced.
What this means is automation controller’s control plane believes it can launch 160 jobs concurrently, while the Kubernetes scheduler will only schedule up to 32 automation job pods concurrently in the Container Group namespace.
Unbalanced values between the control plane and the container group/execution plane can lead to issues where:
- If the control plane capacity is higher than the maximum number of concurrent job pods the Container Group can schedule, the control plane will attempt to start jobs by submitting pods to be started. These pods, however, won't actually begin to run until resources are made available. If a job pod does not start within the AWX_CONTAINER_GROUP_POD_PENDING_TIMEOUT timeout (default is 2 hours), the job is aborted.
- If the Container Group is able to support more concurrent automation jobs than the control plane believes it can launch, this capacity is effectively wasted, as the automation controller will never launch enough automation jobs to reach the maximum number of concurrent automation jobs the Container Group could support.
To avoid risking aborted jobs or unused resources, it is recommended to balance the effective control capacity with the max number of concurrent jobs that the default Container Group can support.
The term "effective control capacity" is used because the max number of jobs the control plane will launch is affected by a setting called AWX_CONTROL_NODE_TASK_IMPACT. The AWX_CONTROL_NODE_TASK_IMPACT variable defines the amount of capacity that can be consumed on the control pod per automation job, effectively controlling the number of automation jobs that the control pod will attempt to start.
To achieve a balance between the effective control capacity and the available execution capacity, set the AWX_CONTROL_NODE_TASK_IMPACT variable so that the number of concurrent jobs the automation controller control plane will run matches the number of automation job pods the container group/execution plane can launch.
To calculate the optimal value of AWX_CONTROL_NODE_TASK_IMPACT to avoid launching more concurrent automation jobs than the Container Group can support, we can use the following formula:
AWX_CONTROL_NODE_TASK_IMPACT = control capacity / max concurrent jobs the container group can launch
For our reference environment, this is:
AWX_CONTROL_NODE_TASK_IMPACT = 160 / 32 = 5
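Putting the two earlier results together, the calculation can be sketched as (variable names are mine):

```python
# Sketch of the AWX_CONTROL_NODE_TASK_IMPACT calculation for this
# reference environment: 160 units of control capacity balanced against
# the 32 concurrent job pods the container group can schedule.
control_capacity = 160
max_concurrent_container_group_jobs = 32

task_impact = control_capacity // max_concurrent_container_group_jobs
print(task_impact)  # 5

# With this setting, each running job consumes 5 units of control capacity,
# so the control plane stops launching new jobs at 160 / 5 = 32 jobs,
# matching what the container group can actually schedule.
```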
This concludes that for this reference environment, AWX_CONTROL_NODE_TASK_IMPACT should equal 5. This value is set within the extra_settings portion of the spec in Chapter 6, Installing automation controller, covered later in this document.
3.5. Summary of automation controller pod size recommendations
Properly setting the resource requests and limits of our control plane (control pod) and our container group/execution plane (automation job pods) is necessary to ensure the control and execution capacity is balanced. The correct configuration can be determined by:
- Calculating the control capacity
- Calculating the number of automation jobs that can run concurrently
- Setting the AWX_CONTROL_NODE_TASK_IMPACT variable to the appropriate balancing value during the install of automation controller
3.6. Size recommendations for your automation hub pods
Within Figure 1.2, “automation hub architecture” outlined in the overview section, you’ll notice that the deployment is composed of seven pods, each hosting a container.
The list of pods consists of:
- content (x2)
- redis
- api
- postgres
- worker (x2)
The seven pods that comprise the automation hub architecture work together to efficiently manage and distribute content, and are critical to the overall performance and scalability of your automation hub environment.
Among these pods, the worker pods are particularly important as they are responsible for processing, synchronizing, and distributing content. Due to this, it is important to set the appropriate amount of resources to the worker pods to ensure they can perform their tasks.
The following are guidelines intended to provide an estimate of the resource requests and limits required for your automation hub environment. The actual resource needs will vary depending on the setup.
For example, an environment with a large number of repositories that are performing frequent updates or synchronizations may require more resources to handle the processing load.
In this reference environment, to determine the size of the pods, preliminary tests were done using one of the highest memory consumption tasks that can take place in an automation hub environment: synchronization of remote repositories.
The findings determined that to successfully sync remote repositories within automation hub the following resource requests and resource limits needed to be set for each of the pods:
spec:
  ...
  content:
    resource_requirements:
      limits:
        cpu: 250m
        memory: 400Mi
      requests:
        cpu: 100m
        memory: 400Mi
  redis:
    resource_requirements:
      limits:
        cpu: 250m
        memory: 200Mi
      requests:
        cpu: 100m
        memory: 200Mi
  api:
    resource_requirements:
      limits:
        cpu: 250m
        memory: 400Mi
      requests:
        cpu: 150m
        memory: 400Mi
  postgres_resource_requirements:
    limits:
      cpu: 500m
      memory: 1Gi
    requests:
      cpu: 200m
      memory: 1Gi
  worker:
    resource_requirements:
      limits:
        cpu: 1000m
        memory: 3Gi
      requests:
        cpu: 400m
        memory: 3Gi
3.7. Specifying dedicated nodes for automation controller pod
Running control pods on dedicated nodes is important in order to separate control pods and automation job pods and prevent resource contention between these two types of pods. This separation helps to maintain the stability and reliability of the control pods and the services they provide, without the risk of degradation due to resource constraints.
In this reference environment, the focus is on maximizing the number of automation jobs that can be run. This means that of the available 3 worker nodes within the Red Hat OpenShift environment, one worker node is dedicated to running the control pod, while the other 2 worker nodes are used for execution of automation jobs.
Dedicating only one worker node to run the control pod runs the potential risk of losing the service as it won’t have anywhere else to start up if the dedicated worker node were to go down. To remedy this situation, reducing the number of worker nodes that run automation jobs or adding an additional worker node to run an additional control pod replica within the Red Hat OpenShift cluster are viable options.
3.7.1. Assigning control pods to specific worker nodes for automation controller
To assign a control pod to a specific node in Red Hat OpenShift, a combination of the node_selector field and the topology_spread_constraints field in the pod specification is used. The node_selector field specifies the label criteria that a node must match in order to be eligible to host the pod. For example, if you have a node with the label aap_node_type: control, specify the following in the pod specification to assign the pod to that node:
spec:
  ...
  node_selector: |
    aap_node_type: control
The topology_spread_constraints setting limits the skew (maxSkew) to 1, meaning the number of matching control pods scheduled on any two nodes can differ by at most one. The topologyKey is set to kubernetes.io/hostname, a built-in label that indicates the hostname of the node. The whenUnsatisfiable setting is set to ScheduleAnyway, which still allows the pod to be scheduled when there aren’t enough nodes with the required label to meet the constraint. The labelSelector matches pods with the label aap_node_type: control. The impact of this is that Red Hat OpenShift prioritizes scheduling a single controller pod per node. However, if more replicas are requested than there are available worker nodes, Red Hat OpenShift permits scheduling multiple controller pods on the same worker node if sufficient resources are available.
The tolerations section allows the control pods to be scheduled on nodes carrying the taint dedicated=AutomationController with the NoSchedule effect; pods that do not carry this toleration are not scheduled on those nodes. This is used in combination with topology_spread_constraints to not only specify how to spread the pods across nodes, but also to keep other workloads off the nodes dedicated to the control pods.
spec:
  ...
  topology_spread_constraints: |
    - maxSkew: 1
      topologyKey: "kubernetes.io/hostname"
      whenUnsatisfiable: "ScheduleAnyway"
      labelSelector:
        matchLabels:
          aap_node_type: control
  tolerations: |
    - key: "dedicated"
      operator: "Equal"
      value: "AutomationController"
      effect: "NoSchedule"
The application of the node label and taints can be found within Appendix C, Applying labels and taints to Red Hat OpenShift node. The steps to add a node selector, topology constraints and tolerations to the spec file are shown in Chapter 6, Installing automation controller.
3.8. Handling Database High Availability
The automation controller and automation hub components within Ansible Automation Platform take advantage of persistent volume claims (PVCs) for their PostgreSQL databases. Ensuring the availability of these PVCs is critical for the stability of a running Ansible Automation Platform.
There are several strategies that can be used to handle PVC availability within a Red Hat OpenShift cluster, such as those provided by Crunchy Data via the Postgres Operator (PGO) and OpenShift Data Foundation (ODF).
Crunchy Data provides PGO, the Postgres Operator that gives you a declarative Postgres solution that automatically manages your PostgreSQL clusters. With PGO, users can create their Postgres cluster, scale and create a high availability (HA) Postgres cluster and connect it to their applications such as Ansible Automation Platform.
OpenShift Data Foundation (ODF) is a highly available storage solution that can manage persistent storage for your containerized applications. It consists of multiple open source operators and technologies including Ceph, NooBaa, and Rook. These different operators allow you to provision and manage your File, Block, and Object storage that can then be connected to your applications such as Ansible Automation Platform.
The steps to provide highly available PVCs for the PostgreSQL database are beyond the scope of this reference architecture.