Chapter 2. Scalability and performance of OpenShift Serverless Serving
OpenShift Serverless consists of several components with distinct resource requirements and scaling behaviors. These components are horizontally and vertically scalable, but their resource requirements and configuration highly depend on the actual use case.
- Control-plane components
- These components are responsible for observing and reacting to custom resources and continuously reconfiguring the system, for example, the controller pods.
- Data-plane components
- These components are directly involved in request and response handling, for example, the Knative Serving activator component.
The following metrics and findings were recorded using the following test setup:
- A cluster running OpenShift Container Platform 4.13
- The cluster running 4 compute nodes in AWS with the m6.xlarge machine type
- OpenShift Serverless 1.30
2.1. Overhead of OpenShift Serverless Serving
Because OpenShift Serverless Serving components are part of the data plane, requests from clients are routed through:
- The ingress-gateway (Kourier or Service Mesh)
- The activator component
- The queue-proxy sidecar container in each Knative Service
These components introduce an additional networking hop and perform additional tasks, for example, adding observability and request queuing. The following overheads were measured:
- Each additional network hop adds 0.5 ms to 1 ms of latency to a request. For example, a request that traverses the ingress gateway, the activator, and the queue-proxy accrues roughly 1.5 ms to 3 ms of additional latency. Depending on the current load of the Knative Service, and on whether the Knative Service was scaled to zero before the request, the activator component is not always part of the data plane.
- Depending on the payload size, each of the components consumes up to 1 vCPU for handling 2,500 requests per second.
2.2. Known limitations of OpenShift Serverless Serving
The maximum number of Knative Services that can be created is 3,000. This corresponds to the OpenShift Container Platform Kubernetes Services limit of 10,000, because each Knative Service creates three Kubernetes Services.
2.3. Scaling and performance of OpenShift Serverless Serving
OpenShift Serverless Serving has to be scaled and configured based on the following parameters:
- Number of Knative Services
- Number of Revisions
- Amount of concurrent requests in the system
- Size of payloads of the requests
- The startup latency and the response latency that the user's web application adds to the Knative Service
- Number of changes of the KnativeService custom resource (CR) over time
2.3.1. KnativeServing default configuration
By default, OpenShift Serverless Serving is configured to run all components with high availability and with medium-sized CPU and memory requests and limits. This means that the `high-availability` field in the `KnativeServing` CR is automatically set to a value of `2` and that all system components are scaled to two replicas. This configuration is suitable for medium-workload scenarios and has been tested with:
- 170 Knative Services
- 1-2 Revisions per Knative Service
- 89 test scenarios mainly focused on testing the control plane
- 48 re-creating scenarios where Knative Services are deleted and re-created
- 41 stable scenarios, in which requests are slowly but continuously sent to the system
During these test cases, the system components effectively consumed:
Component | Measured Resources |
---|---|
Operator in the `openshift-serverless` project | 1 GB Memory, 0.2 Cores of CPU |
Serving components in the `knative-serving` and `knative-serving-ingress` projects | 5 GB Memory, 2.5 Cores of CPU |
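For reference, the default high-availability setting described in this section corresponds to the following snippet of the `KnativeServing` CR. This is a minimal sketch shown only for orientation; the full CR supports many more fields, as the later examples in this chapter show:

```yaml
apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  high-availability:
    replicas: 2  # default: all system components run with two replicas
```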
2.3.2. Minimal requirements of OpenShift Serverless Serving
While the default setup is suitable for medium-sized workloads, it might be over-sized for smaller setups or under-sized for high-workload scenarios. To configure OpenShift Serverless Serving for a minimal workload scenario, you need to know the idle consumption of the system components.
2.3.2.1. Idle consumption
The idle consumption depends on the number of Knative Services. The following memory usage has been measured for the components in the `knative-serving` and `knative-serving-ingress` OpenShift Container Platform projects:
Component | 0 Services | 100 Services | 500 Services | 1000 Services |
---|---|---|---|---|
`activator` | 55Mi | 86Mi | 300Mi | 450Mi |
`autoscaler` | 52Mi | 102Mi | 225Mi | 350Mi |
`controller` | 100Mi | 135Mi | 310Mi | 500Mi |
`webhook` | 60Mi | 60Mi | 60Mi | 60Mi |
`3scale-kourier-gateway` or `istio-ingressgateway` | 20Mi | 60Mi | 190Mi | 330Mi |
`net-kourier-controller` or `net-istio-controller` | 90Mi | 170Mi | 340Mi | 430Mi |
Either the `3scale-kourier-gateway` and `net-kourier-controller` components or the `istio-ingressgateway` and `net-istio-controller` components are installed, depending on which ingress you use. The memory consumption of `net-istio` is based on the total number of pods within the mesh.
2.3.3. Configuring Serving for minimal workloads
Procedure
You can configure Knative Serving for minimal workloads by using the `KnativeServing` custom resource (CR).

A minimal workload configuration in the `KnativeServing` CR:

```yaml
apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  high-availability:
    replicas: 1 # 1
  workloads:
    - name: activator
      replicas: 2 # 2
      resources:
        - container: activator
          requests:
            cpu: 250m # 3
            memory: 60Mi # 4
          limits:
            cpu: 1000m
            memory: 600Mi
    - name: controller
      replicas: 1 # 5
      resources:
        - container: controller
          requests:
            cpu: 10m
            memory: 100Mi
          limits: # 6
            cpu: 200m
            memory: 300Mi
    - name: webhook
      replicas: 2
      resources:
        - container: webhook
          requests:
            cpu: 100m # 7
            memory: 60Mi
          limits:
            cpu: 200m
            memory: 200Mi
  podDisruptionBudgets: # 8
    - name: activator-pdb
      minAvailable: 1
    - name: webhook-pdb
      minAvailable: 1
```
1. Setting this to `1` scales all system components to one replica.
2. The activator should always be scaled to a minimum of `2` instances to avoid downtime.
3. The activator CPU request should not be set lower than `250m`, because a `HorizontalPodAutoscaler` uses this value as a reference to scale up and down.
4. Adjust memory requests to the idle values from the previous table. Also adjust memory limits according to your expected load; this might need custom testing to find the best values.
5. One webhook and one controller are sufficient for a minimal-workload scenario.
6. These limits are sufficient for a minimal-workload scenario, but they also might need adjustments depending on your concrete workload.
7. The webhook CPU request should not be set lower than `100m`, because a `HorizontalPodAutoscaler` uses this value as a reference to scale up and down.
8. Adjust the `PodDisruptionBudgets` to a value lower than `replicas` to avoid problems during node maintenance.
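The example above covers the `activator`, `controller`, and `webhook` workloads. If Kourier is installed, the same `workloads` list can also override the replicas of the ingress components listed in the idle-consumption table. The following sketch uses illustrative replica values that are assumptions, not tested recommendations:

```yaml
apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  workloads:
    # Deployment names as listed in the idle-consumption table.
    - name: net-kourier-controller
      replicas: 1  # illustrative: one controller for a minimal setup
    - name: 3scale-kourier-gateway
      replicas: 2  # illustrative: keep two gateways to avoid ingress downtime
```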
2.3.4. Configuring Serving for high workloads
You can configure Knative Serving for high workloads by using the `KnativeServing` custom resource (CR). The following findings are relevant to configuring Knative Serving for a high workload:
These findings have been tested with request payload sizes of 0 kb to 32 kb. The Knative Service backends used in those tests had a startup latency between 0 and 10 seconds and response times between 0 and 5 seconds.
- All data-plane components mostly show increased CPU usage with higher request rates and payload sizes, so the CPU requests and limits have to be tested and potentially increased.
- The activator component might also need more memory when it has to buffer more or bigger request payloads, so the memory requests and limits might need to be increased as well.
- One activator pod can handle approximately 2,500 requests per second before it starts to increase latency and, at some point, leads to errors.
- One `3scale-kourier-gateway` or `istio-ingressgateway` pod can also handle approximately 2,500 requests per second before it starts to increase latency and, at some point, leads to errors.
- Each of the data-plane components consumes up to 1 vCPU for handling 2,500 requests per second. Note that this highly depends on the payload size and the response times of the Knative Service backend.
Fast startup and fast response times of your Knative Service user workloads are critical for good performance of the overall system. The Knative Serving components buffer incoming requests when the Knative Service user backend is scaling up or when request concurrency has reached its capacity. If your Knative Service user workload introduces long startup times or request latency, it either overloads the `activator` component (when the CPU and memory configuration is too low) or leads to errors for the calling clients.
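The buffering behavior is tied to the concurrency capacity that each Knative Service declares. As an illustration, the following sketch shows the standard Knative `containerConcurrency` field on a Service; the service name and image are placeholders, and the value of `100` is an assumption, not a recommendation:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: example-service  # placeholder name
spec:
  template:
    spec:
      # Requests beyond 100 concurrent requests per replica are buffered
      # until capacity frees up or the autoscaler adds replicas.
      containerConcurrency: 100  # illustrative value
      containers:
        - image: registry.example.com/example-app:latest  # placeholder image
```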
Procedure
To fine-tune your installation, use the previous findings combined with your own test results to configure the `KnativeServing` custom resource.

A high workload configuration in the `KnativeServing` CR:

```yaml
apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  high-availability:
    replicas: 2 # 1
  workloads:
    - name: component-name # 2
      replicas: 2 # 3
      resources:
        - container: container-name
          requests:
            cpu: # 4
            memory:
          limits:
            cpu:
            memory:
  podDisruptionBudgets: # 5
    - name: name-of-pod-disruption-budget
      minAvailable: 1
```
1. Set this parameter to at least `2` to make sure you always have at least two instances of every component running. You can also use `workloads` to override the replicas for certain components.
2. Use the `workloads` list to configure specific components. Use the `deployment` name of the component and set the `replicas` field.
3. For the `activator`, `webhook`, and `3scale-kourier-gateway` components, which use horizontal pod autoscalers (HPAs), the `replicas` field sets the minimum number of replicas. The actual number of replicas depends on the CPU load and the scaling done by the HPAs.
4. Set the requested and limited CPU and memory to at least the idle consumption, while also taking the previous findings and your own test results into consideration.
5. Adjust the `PodDisruptionBudgets` to a value lower than `replicas` to avoid problems during node maintenance. The default `minAvailable` is set to `1`, so if you increase the required replicas, you must also increase `minAvailable`.
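As a concrete illustration of this template, the following sketch fills in values for the `activator` component, sized for a hypothetical target of roughly 10,000 requests per second by applying the approximately 2,500 requests per second per pod finding from above. All numbers are assumptions for illustration and need validation against your own load tests:

```yaml
apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  high-availability:
    replicas: 2
  workloads:
    - name: activator
      replicas: 4  # assumption: ~10,000 rps / ~2,500 rps per pod
      resources:
        - container: activator
          requests:
            cpu: 1000m    # ~1 vCPU per 2,500 rps per component (see findings)
            memory: 600Mi # headroom for request buffering; verify by testing
          limits:
            cpu: 2000m
            memory: 1200Mi
  podDisruptionBudgets:
    - name: activator-pdb
      minAvailable: 2  # raised along with replicas, per callout 5
```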
As each environment is highly specific, it is essential to test and find your own ideal configuration. Use the monitoring and alerting functionality of OpenShift Container Platform to continuously monitor your actual resource consumption and make adjustments if needed.
If you are using the OpenShift Serverless and Service Mesh integration, additional CPU processing is added by the `istio-proxy` sidecar containers. For more information, see the Service Mesh documentation.