Chapter 3. API Service Performance


The overall performance of your Ansible Automation Platform deployment is affected by the API service for each Ansible Automation Platform component. This chapter describes the following:

  • API service architecture and scalable components
  • Key performance indicators for scaling the API services
  • Considerations for scaling the API service for each Ansible Automation Platform component

3.1. API request flow and latency sources

The Ansible Automation Platform API is provided by a distributed set of services. Overall platform performance is affected by the path that each request takes through the platform and the multiple potential sources of latency and performance considerations at each layer. The following table describes each layer in the API request flow through Ansible Automation Platform:

Table 3.1. API service architecture and performance considerations

Client Request

Description: The request from the client.

Performance consideration: The request from the client may have timeout parameters set. In VM-based and container-based installations, the installer variable client_request_timeout informs downstream component timeouts. This value must match the external load balancer's timeout. The size of the request body or headers can also affect performance. (A client-side example follows this table.)

Ingress Point

Description: The first point of entry into Ansible Automation Platform, typically an OpenShift Container Platform Route or a customer-provided load balancer, directing traffic to an available platform gateway pod or instance.

Performance consideration: Performance depends on the configuration, capacity, and health of the load balancer or OpenShift Container Platform Route. Any external load balancer's timeout must be greater than or equal to the client_request_timeout setting passed to the installer. This layer is responsible for distributing traffic if there are multiple platform gateway nodes or pods.

Envoy Proxy

Description: Located within the platform gateway pod or instance, this proxy handles authorization, internal routing, and applies filters to the request. Authorization by the gRPC service is performed before Envoy forwards the request to the destination service.

Performance consideration: Introduces minimal latency, typically on the order of 10 milliseconds, and can handle hundreds of concurrent requests.

Platform gateway gRPC Authentication Service

Description: A local gRPC service within the platform gateway container responsible for authenticating each request. This service can interact with external authentication services (LDAP/SAML) and the database for validation. Authentication with the gRPC service can be disabled for individual URL routes; notably, requests to the platform gateway service itself (for example, under /api/gateway/v1/) are not authenticated by this gRPC service. They are authenticated by the platform gateway API service instead.

Performance consideration: Potential source of latency. The service is multi-processed and multi-threaded, with capacity determined by GRPC_SERVER_PROCESSES and GRPC_SERVER_MAX_THREADS_PER_PROCESS. If all workers are busy, requests queue, which adds to latency. In VM-based and container-based installations, its timeout is informed by client_request_timeout. Slow database connections for session validation also degrade performance.

External Authentication Service (LDAP/SAML)

Description: An optional external service invoked by the platform gateway gRPC Authentication Service to validate user credentials.

Performance consideration: Potential source of latency. When external authentication services (such as LDAP or SAML) are configured, they are invoked during the gRPC authentication stage. Slow response times from these external systems can significantly increase the total latency of each request. It is your responsibility to provide a low-latency, reliable external authentication service.

API Service Nginx Proxy

Description: After authentication, Envoy forwards the request to the component API node or Service in OpenShift Container Platform. Nginx receives the request. Each distinct API service component has its own Nginx proxy that determines whether the request is for a WSGI application, an ASGI-based WebSocket service, or static content.

Performance consideration: Introduces minimal latency, typically on the order of 10 milliseconds, and can handle hundreds of concurrent requests.

WSGI Server (uWSGI / Gunicorn)

Description: Handles standard API requests forwarded by Nginx. These servers process requests, validate JWT tokens, execute API operations, and frequently interact with the database.

Performance consideration: Primary source of latency. API requests are handled by each component's web application served by a WSGI server (uWSGI for automation controller and platform gateway, Gunicorn for automation hub and Event-Driven Ansible). Their timeout is also informed by client_request_timeout in VM-based and container-based installations; in OpenShift Container Platform, the timeout on the platform gateway Route is propagated to this same setting. The servers are configured with a maximum number of concurrent workers. If all workers are busy, the request is queued. After a worker picks up a request, it validates the authentication and executes the API operation, which typically involves further database communication.

Databases

Description: Almost every request requires interacting with the database for activities such as validating sessions, storing data, and executing API operations.

Performance consideration: Critical performance factor. The responsiveness of database connections impacts both the gRPC authentication service and WSGI server processing. Responsiveness can be affected by network latency between the database and the components, as well as by the performance of the database itself.

Client Response

Description: The final response returned to the client after the request has been processed and traversed back through the system components (Nginx proxy, Envoy, and the initial load balancer/Route).

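As the Client Request row above notes, client-side timeouts interact with client_request_timeout and the external load balancer timeout. The following minimal sketch, which uses the Python requests library with a hypothetical gateway URL and an assumed 90-second platform timeout, shows a client that sets an explicit timeout aligned with those values:

import requests

# Hypothetical endpoint and timeout value; align the client timeout with the
# platform's client_request_timeout and the external load balancer timeout.
GATEWAY_URL = "https://aap.example.com/api/gateway/v1/"
CLIENT_REQUEST_TIMEOUT = 90  # seconds, assumed installer setting

response = requests.get(GATEWAY_URL, timeout=CLIENT_REQUEST_TIMEOUT)
response.raise_for_status()
print(response.status_code)
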
3.1.1. Sources of latency and scaling strategies

The primary sources of latency across all layers are:

  • Queueing delays while awaiting an available worker from either the gRPC authentication service or the WSGI server
  • The authentication phase, particularly if external authentication systems exhibit slow response times
  • The actual processing time and associated database interactions within the Python WSGI application

Scaling strategies include the following:

  • Using more performant authentication methods, such as Session or Token
  • Horizontally scaling platform gateway and API service pods to increase worker availability and minimize queue times

The following sections describe how to identify which of the Ansible Automation Platform services provide which APIs and provide considerations for scaling them.

3.2. Key performance indicators for scaling the API services

Scaling adds resources to handle increased load. This is primarily achieved through horizontal scaling (adding more pods or instances) or vertical scaling (adding CPU or memory resources to pods or instances). Proper scaling ensures high availability and maintains performance under load.

Consider scaling your services when you observe one or more of the following key performance indicators, which suggest a component is reaching its capacity and cannot efficiently handle the current request load:

  • High API latency
  • High CPU utilization
  • Errors that occur during periods of high traffic

3.2.1. High API latency

Sustained high latency on API requests is a key performance indicator. All requests are made through the platform gateway, which acts as a proxy and forwards requests to the services in question. The request is sent to the destination service based on the route in the URL of the API request:

  • platform gateway: /api/gateway
  • automation controller: /api/controller
  • Event-Driven Ansible: /api/eda
  • Event-Driven Ansible Event Streams: /eda-event-streams/api/eda/v1/external_event_stream/
  • automation hub: /api/galaxy

Monitoring latency on the different routes through the Envoy proxy logs enables you to identify which service requires scaling. These routes are present in the proxy container of platform gateway pods in OpenShift Container Platform or in the proxy logs of the platform gateway nodes in VM-based installation and container-based installation. Latency exceeding target API thresholds (for example, 99th percentile >1500ms) indicates a need to trigger alerts or scale web services.
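
The following minimal sketch shows one way to derive per-route latency percentiles from the proxy access logs. It assumes a simplified JSON log format with path and duration fields; the field names and file location are illustrative, not the platform's exact log schema:

import json
import math
from collections import defaultdict

# Route prefixes from the list above, mapped to the component that serves them.
ROUTES = {
    "/api/gateway": "platform gateway",
    "/api/controller": "automation controller",
    "/eda-event-streams": "Event-Driven Ansible event streams",
    "/api/eda": "Event-Driven Ansible",
    "/api/galaxy": "automation hub",
}

def percentile(values, pct):
    """Return the pct-th percentile (nearest-rank) of a list of numbers."""
    ordered = sorted(values)
    rank = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[rank]

def p99_by_route(log_lines):
    """Group request durations by route prefix and compute the 99th percentile."""
    durations = defaultdict(list)
    for line in log_lines:
        entry = json.loads(line)             # assumes one JSON object per line
        path = entry.get("path", "")
        total_ms = entry.get("duration", 0)  # assumed total request time in milliseconds
        for prefix, component in ROUTES.items():
            if path.startswith(prefix):
                durations[component].append(total_ms)
                break
    return {component: percentile(values, 99) for component, values in durations.items()}

if __name__ == "__main__":
    with open("envoy_access.log") as log:    # hypothetical log file location
        for component, p99 in p99_by_route(log).items():
            status = "consider scaling" if p99 > 1500 else "ok"
            print(f"{component}: p99={p99} ms ({status})")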

3.2.2. High CPU utilization

When a service’s API pod shows consistently high CPU usage, it may be unable to process incoming requests in a timely manner, leading to a backlog of requests. The following indicators suggest high CPU utilization:

  • A high total request time in the Envoy proxy logs compared with the processing time in the service’s WSGI logs
  • High total Envoy latency
  • Requests waiting in a queue before being processed

3.2.3. Error codes

Error codes in the platform gateway’s proxy container (on OpenShift Container Platform) or in the proxy logs (for VM-based and container-based installations) indicate that a service must be scaled. These errors are often caused by the services being overloaded and unable to service requests in a timely manner, and are often preceded by periods of higher latency. The sketch after the following list shows one way to tally these response flags in the proxy logs.

  • Upstream Authentication Failures: 502 UAEX (Upstream Authentication Extension) responses in Envoy logs indicate issues during the authentication phase of a request. This suggests the authentication service is overloaded, timing out, or returning broken responses.
  • Upstream Service Unhealthy: 503 UH (Upstream Service Unhealthy) responses for a specific service mean that Envoy has marked one or more of that service’s pods as unhealthy and is not sending traffic to them. This occurs when an upstream pod fails its health checks. Because health checks share the same request queue as client traffic, an overloaded pod that cannot respond to the health check in time is temporarily removed from the load balancing pool.
  • Upstream Connection Failure: 503 UF (Upstream Connection Failure) for a specific service’s requests indicates Envoy attempted to contact an upstream pod, but the connection failed. This can occur if the upstream service is overwhelmed and cannot accept new connections. For more information about Envoy Response Flags (the letter codes that follow the HTTP response code), see Access logging.

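As referenced above, a quick way to surface these conditions is to count response flags per route in the proxy logs. This sketch reuses the simplified JSON log format assumed in the earlier example, with an assumed response_flags field; adapt the field names to your actual log format:

import json
from collections import Counter

WATCHED_FLAGS = {"UAEX", "UH", "UF"}  # the Envoy response flags described above

def count_flags(log_lines):
    """Count (route prefix, response flag) pairs from JSON access log entries."""
    counts = Counter()
    for line in log_lines:
        entry = json.loads(line)
        flags = entry.get("response_flags", "")          # assumed field name
        prefix = "/" + "/".join(entry.get("path", "").split("/")[1:3])
        for flag in flags.split(","):
            if flag:
                counts[(prefix, flag)] += 1
    return counts

if __name__ == "__main__":
    with open("envoy_access.log") as log:                # hypothetical log file location
        for (prefix, flag), total in sorted(count_flags(log).items()):
            marker = "  <-- investigate" if flag in WATCHED_FLAGS else ""
            print(f"{prefix} {flag}: {total}{marker}")
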
3.3. Considerations for scaling the platform gateway

Scaling the platform gateway proxy and authentication service may be appropriate if the volume of requests to proxy or authenticate exceeds the existing platform gateway service capacity. Horizontal scaling is preferred in this case, because vertical scaling does not automatically adjust all worker pool values for the gRPC, Envoy, Nginx, and WSGI services.

Note

Additional platform gateway service pods or instances can increase the necessary database connections for WSGI web service workers and gRPC authentication service workers.

If you observe a bottleneck in CPU utilization, then scaling the gRPC authentication service can improve throughput. However, if external authentication providers are the source of high latency, then horizontal scaling of the gRPC service has minimal benefits. You can determine the gRPC authentication latency for requests by observing the difference between the upstream service time and the total request latency.
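
As a rough illustration of that calculation, the following sketch subtracts the upstream service time from the total request duration for each log entry, using the same illustrative JSON field names assumed in the earlier sketches:

import json

def auth_overhead_ms(entry):
    """Estimate time spent outside the upstream service (authentication and queueing)."""
    total = entry.get("duration", 0)                  # assumed total request time in ms
    upstream = entry.get("upstream_service_time", 0)  # assumed upstream processing time in ms
    return max(0, total - upstream)

if __name__ == "__main__":
    with open("envoy_access.log") as log:             # hypothetical log file location
        overheads = [auth_overhead_ms(json.loads(line)) for line in log]
    if overheads:
        print(f"mean authentication/queue overhead: {sum(overheads) / len(overheads):.1f} ms")
        print(f"max authentication/queue overhead: {max(overheads)} ms")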

The most performant authentication methods are:

  • Token authentication
  • Session authentication

Do not use basic authentication for high-frequency API automation, because CPU-intensive password hashing introduces significant request latency. If basic authentication is used in combination with LDAP authentication, reaching out to the LDAP service can introduce further latency, especially if the LDAP service has limited availability. For this reason, create OAuth tokens to perform automation against the API.

Horizontally scaling the platform gateway service pods also increases the number of health checks to each component’s API, because each Envoy instance tracks this separately. You can observe these in logs with the user agent Envoy/HC. Because health checks flow through the same services and queues as user-initiated requests, if overall request times slow due to overload, health checks can also time out. When this occurs, Envoy stops forwarding requests to these service nodes until they pass their health checks again.

It is particularly important that the service is sufficiently horizontally scaled in OpenShift Container Platform, because if more than 100 requests are backlogged, uWSGI drops the additional requests. This results in clients receiving a timeout for the dropped requests. The following log text shows the corresponding error for this event:

*** uWSGI listen queue of socket ":8000" (fd: 3) full !!! (101/100) ***

This error occurs due to a limitation of uWSGI tying its backlog length to the kernel parameter somaxconn. It is possible to raise this kernel parameter in OpenShift Container Platform, but doing so requires allowing “unsafe sysctls”.

3.4. Considerations for scaling the automation controller API

The automation controller API service handles HTTP requests to the application, including information about user roles in automation controller, project creation, inventory creation or updates, job launches, and job result checks.

3.4.1. Key performance indicators

Key performance indicators for the automation controller API service include the following:

  • High API latency for requests under /api/controller
  • High CPU utilization on the API pods or nodes
  • Platform gateway returning 503 errors because the service is too busy to respond to health checks

The automation controller API service is located in web pods on operator-based installation and in control or hybrid nodes on VM-based installation or container-based installation.

3.4.2. Scaling strategies by deployment type

Consider the following strategies to scale the automation controller API service:

  • OpenShift Container Platform: Adjust the web_replicas attribute on the AutomationController CR (see the sketch after this list). Scaling the replicas attribute instead scales both task and web replicas.
  • VM-based installation and container-based installation: Scale control or hybrid nodes, increasing the ability to control additional automation jobs.
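
The following minimal sketch shows one way to make that change programmatically with the Kubernetes Python client. The API group, version, namespace, and resource name are assumptions for illustration; adjust them to match your operator installation, or make the equivalent change by editing the AutomationController CR directly:

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in a pod
custom_api = client.CustomObjectsApi()

# Assumed coordinates for the AutomationController custom resource.
custom_api.patch_namespaced_custom_object(
    group="automationcontroller.ansible.com",  # assumed API group
    version="v1beta1",                         # assumed API version
    namespace="aap",                           # assumed namespace
    plural="automationcontrollers",
    name="example-controller",                 # assumed CR name
    body={"spec": {"web_replicas": 3}},        # patch only the web replica count
)
print("Requested 3 web replicas")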

3.4.3. Database connection and architecture considerations

On OpenShift Container Platform, each web replica consumes database connections for WSGI web service workers and various background services facilitating task communication and WebSockets. The number of database connections used by the WSGI web server on VM-based installation and container-based installation scales with the machine’s CPU count. Additionally, control and hybrid nodes manage the Dispatcher (tasking system) and the Callback Receiver (job event processing worker pool). These worker pools scale with CPU availability and necessitate database connections.

Provisioning additional control nodes demands more database connections than solely scaling out the web deployment on OpenShift Container Platform. This demand occurs because containerized and RPM control node scaling also expands the tasking system, which operates as a distinct deployment on OpenShift Container Platform. This separation of services on OpenShift Container Platform deployments is an important distinction that allows administrators to more finely tune the deployment and conserve limited resources, such as database connections.

It is particularly important that the service is sufficiently horizontally scaled in OpenShift Container Platform, because if more than 100 requests are backlogged, uWSGI drops the additional requests. This results in clients receiving a timeout for the dropped requests. The following log text shows the corresponding error for this event:

*** uWSGI listen queue of socket ":8000" (fd: 3) full !!! (101/100) ***

This error occurs due to a limitation of uWSGI tying its backlog length to the kernel parameter somaxconn. It is possible to raise this kernel parameter in OpenShift Container Platform, but doing so requires allowing “unsafe sysctls”.

3.5. Considerations for scaling the Event-Driven Ansible APIs

Scaling Event-Driven Ansible involves considerations for each of its service types:

  • API and WebSocket service
  • EventStream service

API requests routed to /api/eda and /api/eda-event-stream are handled by two separate Gunicorn deployments. In OpenShift Container Platform, these services must be scaled independently. For VM-based installation and container-based installation, you can scale these services together by increasing the number of hybrid nodes.

3.5.1. Event-Driven Ansible API and WebSocket service

The Event-Driven Ansible API service handles HTTP requests to the application, including information about user roles in Event-Driven Ansible, creating projects, creating activations, and checking results. Key performance indicators for the API service include the following:

  • High API latency for requests under /api/eda
  • High CPU utilization on the API pods/nodes
  • Platform gateway returning 503 errors because the service is too busy to respond to health checks

A WebSocket service is deployed alongside the API service to manage the output of the rulebook activations. Each activation maintains a persistent WebSocket connection to this service to communicate status and receive instructions. A large number of activations or a large amount of output from activations can overwhelm the WebSocket server, leading to failures.

Consider the following strategies to scale the Event-Driven Ansible API and WebSocket service:

  • OpenShift Container Platform: The WebSocket server (Daphne) runs within the eda-api pod. Horizontally scale the eda-api deployment to scale this service. Ensure that you scale the eda-api deployment in proportion to the number of activations being run.
  • VM-based installation or container-based installation: Horizontally scale the WebSocket service alongside the API service by adding more hybrid nodes. This increases capacity for all Event-Driven Ansible components simultaneously.

To identify whether your deployment experienced a bottleneck in the WebSocket service, check the activation logs for the following errors:

ansible_rulebook.websocket - WARNING - websocket aborted by <class 'websockets.exceptions.InvalidMessage'>: did not receive a valid HTTP response
ansible_rulebook.cli - ERROR - Terminating: did not receive a valid HTTP response
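
The following minimal sketch scans a directory of activation logs for those messages; the directory path and file naming are hypothetical:

from pathlib import Path

# Messages from the log excerpt above that indicate WebSocket pressure.
SIGNATURES = (
    "websocket aborted by",
    "did not receive a valid HTTP response",
)

def find_websocket_failures(log_dir):
    """Yield (file name, line) pairs for activation log lines matching the signatures."""
    for log_file in Path(log_dir).glob("*.log"):
        for line in log_file.read_text(errors="replace").splitlines():
            if any(signature in line for signature in SIGNATURES):
                yield log_file.name, line

if __name__ == "__main__":
    for name, line in find_websocket_failures("./activation-logs"):  # hypothetical path
        print(f"{name}: {line}")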

3.5.2. Event-Driven Ansible EventStream service

The Event Stream API, which handles POST requests to /api/eda-event-stream, is a separate Gunicorn service designed to ingest events from external sources. If this service’s performance degrades, with high latency, low throughput, or availability issues, you must scale it. The platform gateway returns 503 errors for this service when a pod is too busy to respond to health checks in time.

Consider the following strategies to scale the Event-Driven Ansible EventStream service:

  • OpenShift Container Platform: Horizontally scale up the dedicated eda-event-stream worker deployment, because it is managed separately from the main eda-api deployment.
  • VM-based installation or container-based installation: Horizontally scale up this service by adding more hybrid nodes, which increases capacity for all Event-Driven Ansible components simultaneously.

3.6. Considerations for scaling the automation hub APIs

Scaling automation hub involves considerations for each of its service types:

  • API service: manages HTTP requests through the API
  • Pulp workers service: manages syncs and content uploads
  • Content service: manages content delivery after content has been processed and stored

Separate Gunicorn deployments back these services and handle different types of requests. In OpenShift Container Platform, these services must be scaled independently. In VM-based installation and container-based installation, a standard automation hub node hosts all services, and scaling is achieved by adding more nodes.

3.6.1. Automation hub API service

The automation hub API service handles metadata-driven requests for the application, including UI interactions, searches, and remote repository configuration. Key performance indicators for the automation hub API service include:

  • High API latency for requests under /api/galaxy
  • High CPU utilization on the API pods or nodes
  • Platform gateway returning 503 errors because the service is too busy to respond to health checks

Consider the following strategies to scale the automation hub API service:

  • OpenShift Container Platform: Horizontally scale the API pods by increasing the hub.api.replicas attribute on the AnsibleAutomationPlatform Custom Resource (CR).
  • VM-based installation or container-based installation: Horizontally scale the API service by adding more automation hub nodes, which simultaneously scales all other automation hub services.

3.6.2. Automation hub Pulp worker and content services

The Pulp worker and content services manage all operations related to content syncs, uploads, and downloads. Key performance indicators for the Pulp worker and content services include:

  • High Content sync rates: Frequent or large synchronization operations from external repositories demanding significant pulp-content worker processing.
  • High Content upload or download rates: Frequent pushing or pulling of automation execution environments by automation controller, Event-Driven Ansible, or large Collection uploads or downloads by automation clients.
  • Disk I/O bottlenecks: Performance issues related to read/write operations on the underlying content storage (/var/lib/pulp), often shown as high disk I/O wait times.
  • Pulp worker saturation: High CPU utilization or queuing within pulp processes, indicating an inability to keep up with content processing and serving.

To scale your Pulp worker and content services, consider the following scaling strategies:

  • OpenShift Container Platform: Scale these services by increasing the hub.content.replicas and hub.worker.replicas attributes on the AnsibleAutomationPlatform Custom Resource (CR), as shown in the sketch after this list.
  • VM-based installation or container-based installation: Horizontally scale all services by adding more automation hub nodes.
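
As with the automation controller example earlier, the replica counts can be adjusted programmatically with the Kubernetes Python client. The API group, version, namespace, and resource name below are assumptions for illustration; the nested spec layout mirrors the hub.content.replicas and hub.worker.replicas attributes named above:

from kubernetes import client, config

config.load_kube_config()
custom_api = client.CustomObjectsApi()

# Assumed coordinates for the AnsibleAutomationPlatform custom resource.
custom_api.patch_namespaced_custom_object(
    group="aap.ansible.com",                 # assumed API group
    version="v1alpha1",                      # assumed API version
    namespace="aap",                         # assumed namespace
    plural="ansibleautomationplatforms",
    name="example-aap",                      # assumed CR name
    body={
        "spec": {
            "hub": {
                "content": {"replicas": 3},  # hub.content.replicas
                "worker": {"replicas": 3},   # hub.worker.replicas
            }
        }
    },
)
print("Requested 3 content replicas and 3 worker replicas")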