Chapter 2. Recommended performance and scalability practices

2.1. Recommended control plane practices

To ensure optimal performance and scalability, apply the recommended practices for OpenShift Container Platform control planes. By understanding these recommended practices, you can configure your environment to handle increasing workloads while maintaining stability.

2.1.1. Recommended practices for scaling the cluster

To scale your cluster effectively, apply the recommended practices for installations with cloud provider integration. By understanding this guidance, you can optimize performance and ensure stability as you increase the size of your environment.

Apply the following best practices to scale the number of compute machines in your OpenShift Container Platform cluster. You scale the worker machines by increasing or decreasing the number of replicas that are defined in the compute machine set.

When scaling up the cluster to higher node counts:

Spread nodes across all of the available zones for higher availability.
Scale up by no more than 25 to 50 machines at once.
Consider creating new compute machine sets in each available zone with alternative instance types of similar size to help mitigate any periodic provider capacity constraints. For example, on AWS, use m5.large and m5d.large.
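
The zone-spreading guidance above can be illustrated with a minimal sketch. The function name and the zone names are hypothetical; the sketch only computes replica counts per zone and does not call any cluster API.

```python
def spread_across_zones(total_replicas, zones):
    """Split total_replicas as evenly as possible across the given zones."""
    if not zones:
        raise ValueError("at least one zone is required")
    base, extra = divmod(total_replicas, len(zones))
    # The first `extra` zones each receive one additional replica.
    return {zone: base + (1 if i < extra else 0) for i, zone in enumerate(zones)}


# Hypothetical zone names, used only for illustration.
plan = spread_across_zones(100, ["us-east-1a", "us-east-1b", "us-east-1c"])
print(plan)
```

Each resulting count would be the replica value for the compute machine set in that zone.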

Note

Cloud providers might implement a quota for API services. Therefore, gradually scale the cluster.

If you set the replicas in the compute machine sets to much higher numbers all at one time, the controller might not be able to create the machines. While it creates, checks, and updates the machines, the controller sends more queries to the cloud platform on which OpenShift Container Platform is deployed. The cloud platform enforces API request limits, so excessive queries might cause machine creation to fail.
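
As a rough illustration of gradual scaling, the following sketch lists intermediate replica counts so that no single step adds more than a chosen batch size. The function name and the `max_batch` parameter are assumptions for this example; choose the batch size within the 25 to 50 range recommended above.

```python
def scale_up_steps(current, target, max_batch=50):
    """Return intermediate replica counts, adding at most max_batch machines per step."""
    if max_batch <= 0:
        raise ValueError("max_batch must be positive")
    steps = []
    while current < target:
        current = min(current + max_batch, target)
        steps.append(current)
    return steps


# Growing from 24 to 120 compute machines in batches of 25.
steps = scale_up_steps(24, 120, max_batch=25)
print(steps)
```

You would apply each value in turn, waiting for the new machines to become ready before moving to the next step.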

Enable machine health checks when scaling to large node counts. The health checks monitor machine condition and automatically repair unhealthy machines when failures occur.

Note

When you scale large and dense clusters to lower node counts, the process might take a long time because it involves draining or evicting the objects that run on the nodes being terminated in parallel. Also, the client might start to throttle the requests if there are too many objects to evict. The default client queries per second (QPS) and burst rates are currently set to 50 and 100 respectively. These values cannot be modified in OpenShift Container Platform.
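
To get a feel for how the client limits bound a scale-down, the following sketch estimates a lower limit on the time needed to send the eviction requests. It assumes a token-bucket rate limiter and exactly one request per evicted object; real drains take longer because of graceful termination, retries, and other API traffic. The function name is hypothetical.

```python
def min_drain_seconds(evictions, qps=50, burst=100):
    """Lower bound on the time to send `evictions` requests through a token bucket."""
    if evictions <= burst:
        # The burst allowance covers all requests immediately.
        return 0.0
    # After the burst is spent, requests are limited to `qps` per second.
    return (evictions - burst) / qps


seconds = min_drain_seconds(30_000)
print(seconds)  # 598.0
```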

2.1.2. Control plane node sizing

To ensure optimal performance and stability, determine the resource requirements for control plane nodes. These sizing guidelines depend on the number and type of nodes and objects in your cluster.

The following control plane node size recommendations are based on the results of control plane density focused testing, also called Cluster-density. This test creates the following objects across a given number of namespaces:

1 image stream
1 build
5 deployments, with 2 pod replicas in a sleep state, mounting 4 secrets, 4 config maps, and 1 downward API volume each
5 services, each one pointing to the TCP/8080 and TCP/8443 ports of one of the previous deployments
1 route pointing to the first of the previous services
10 secrets containing 2048 random string characters
10 config maps containing 2048 random string characters

Number of compute nodes	Cluster-density (namespaces)	CPU cores	Memory (GB)
24	500	4	16
120	1000	8	32
252	4000	16, but 24 if using the OVN-Kubernetes network plug-in	64, but 128 if using the OVN-Kubernetes network plug-in
501, but untested with the OVN-Kubernetes network plug-in	4000	16	96

The data in the preceding table is based on an OpenShift Container Platform cluster running on AWS, using r5.4xlarge instances as control plane nodes and m5.2xlarge instances as compute nodes.
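
The table can be read as a lookup, sketched below. Two choices here are assumptions of this example rather than part of the tested guidance: node counts between rows are rounded up to the next tested row, and the first two rows are treated as applying with or without the OVN-Kubernetes network plug-in because the table lists a single value for them.

```python
# (max compute nodes, CPU cores, memory GB, CPU cores with OVN, memory GB with OVN)
CONTROL_PLANE_SIZING = [
    (24, 4, 16, 4, 16),
    (120, 8, 32, 8, 32),
    (252, 16, 64, 24, 128),
    (501, 16, 96, None, None),  # untested with OVN-Kubernetes
]


def recommend_control_plane_size(compute_nodes, ovn_kubernetes=False):
    """Return (cpu_cores, memory_gb) from the smallest tested row that covers the node count."""
    for max_nodes, cpu, mem, ovn_cpu, ovn_mem in CONTROL_PLANE_SIZING:
        if compute_nodes <= max_nodes:
            if not ovn_kubernetes:
                return cpu, mem
            if ovn_cpu is None:
                raise ValueError("this scale is untested with OVN-Kubernetes")
            return ovn_cpu, ovn_mem
    raise ValueError("node count is beyond the tested range")


print(recommend_control_plane_size(100))        # (8, 32)
print(recommend_control_plane_size(200, True))  # (24, 128)
```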

On a large and dense cluster with three control plane nodes, CPU and memory usage spikes when one of the nodes is stopped, rebooted, or fails. Failures can result from unexpected issues with power, network, or the underlying infrastructure, or from intentional cases where the cluster is restarted after being shut down to save costs. The remaining two control plane nodes must handle the load to keep the cluster highly available, which increases their resource usage. The same behavior is expected during upgrades because the control plane nodes are cordoned, drained, and rebooted serially to apply the operating system updates and the control plane Operator updates. To avoid cascading failures, keep the overall CPU and memory usage on the control plane nodes to at most 60% of all available capacity so that the nodes can absorb these spikes. Increase the CPU and memory on the control plane nodes accordingly to avoid potential downtime due to lack of resources.
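
The reasoning behind the 60% ceiling can be checked with simple arithmetic. The sketch below assumes that the load of a lost node is redistributed evenly across the surviving nodes, which is a simplification; the function name is hypothetical.

```python
def usage_after_node_loss(per_node_percent, nodes=3, lost=1):
    """Per-node utilization (percent) on the survivors after an even redistribution."""
    survivors = nodes - lost
    if survivors <= 0:
        raise ValueError("at least one node must survive")
    return per_node_percent * nodes / survivors


print(usage_after_node_loss(60))  # 90.0
print(usage_after_node_loss(70))  # 105.0
```

Under this assumption, nodes running at 60% rise to 90% when one of three nodes is unavailable, while nodes running at 70% would exceed their capacity.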

Important

The node sizing varies depending on the number of nodes and object counts in the cluster. It also depends on whether the objects are actively being created on the cluster. During object creation, the control plane is more active in terms of resource usage compared to when the objects are in the Running phase.

Operator Lifecycle Manager (OLM) runs on the control plane nodes, and its memory footprint depends on the number of namespaces and user-installed Operators that OLM needs to manage on the cluster. Size the control plane nodes accordingly to avoid out-of-memory (OOM) kills. The following data points are based on the results from cluster maximums testing.

Number of namespaces	OLM memory at idle state (GB)	OLM memory with 5 user operators installed (GB)
500	0.823	1.7
1000	1.2	2.5
1500	1.7	3.2
2000	2	4.4
3000	2.7	5.6
4000	3.8	7.6
5000	4.2	9.02
6000	5.8	11.3
7000	6.6	12.9
8000	6.9	14.8
9000	8	17.7
10,000	9.9	21.6
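
For namespace counts that fall between the tested data points, a linear interpolation gives a rough estimate. This is a sketch only: the interpolation is an assumption of this example, not a tested result, and the function name is hypothetical.

```python
from bisect import bisect_left

# (namespaces, idle memory GB, memory GB with 5 user Operators installed)
OLM_MEMORY_GB = [
    (500, 0.823, 1.7), (1000, 1.2, 2.5), (1500, 1.7, 3.2), (2000, 2, 4.4),
    (3000, 2.7, 5.6), (4000, 3.8, 7.6), (5000, 4.2, 9.02), (6000, 5.8, 11.3),
    (7000, 6.6, 12.9), (8000, 6.9, 14.8), (9000, 8, 17.7), (10000, 9.9, 21.6),
]


def estimate_olm_memory_gb(namespaces, five_user_operators=False):
    """Linearly interpolate OLM memory between the tested namespace counts."""
    column = 2 if five_user_operators else 1
    counts = [row[0] for row in OLM_MEMORY_GB]
    if not counts[0] <= namespaces <= counts[-1]:
        raise ValueError("namespace count is outside the tested range")
    i = bisect_left(counts, namespaces)
    if counts[i] == namespaces:
        return OLM_MEMORY_GB[i][column]
    x0, y0 = counts[i - 1], OLM_MEMORY_GB[i - 1][column]
    x1, y1 = counts[i], OLM_MEMORY_GB[i][column]
    return y0 + (y1 - y0) * (namespaces - x0) / (x1 - x0)


print(estimate_olm_memory_gb(2500))  # about 2.35
```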

Important

You can modify the control plane node size in a running OpenShift Container Platform 4.19 cluster for the following configurations only:

Clusters installed with a user-provisioned installation method.
AWS clusters installed with an installer-provisioned infrastructure installation method.
Clusters that use a control plane machine set to manage control plane machines.

For all other configurations, you must estimate your total node count and use the suggested control plane node size during installation.

Note

In OpenShift Container Platform 4.19, half of a CPU core (500 millicore) is reserved by the system by default, unlike in OpenShift Container Platform 3.11 and earlier versions. The recommended sizes take this reservation into account.

2.2. Selecting a larger AWS instance type for control plane machines

If the control plane machines in an Amazon Web Services (AWS) cluster require more resources, you can select a larger AWS instance type for the control plane machines to use.

Note

The procedure for clusters that use a control plane machine set is different from the procedure for clusters that do not use a control plane machine set.

If you are uncertain about the state of the ControlPlaneMachineSet CR in your cluster, you can verify the CR status.

2.2.2. Changing the Amazon Web Services instance type by using a control plane machine set

You can change the Amazon Web Services (AWS) instance type that your control plane machines use by updating the specification in the control plane machine set custom resource (CR).

Prerequisites

Your AWS cluster uses a control plane machine set.

Procedure

Edit the control plane machine set CR and change the following line under the providerSpec field:
```
providerSpec:
  value:
    ...
    instanceType: <compatible_aws_instance_type>
```
- <compatible_aws_instance_type>: Specifies a larger AWS instance type with the same base as the previous selection. For example, you can change m6i.xlarge to m6i.2xlarge or m6i.4xlarge.
Save your changes.
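
Before you edit the CR, you might want to confirm that a candidate instance type is larger and shares the same base as the current one. The following sketch is illustrative only: it interprets "same base" as the same family prefix before the dot, ranks sizes by their `xlarge` multiplier, and handles only `large` and `Nxlarge` size names. The function names are hypothetical.

```python
import re


def size_rank(size):
    """Rank an AWS size name: large < xlarge < 2xlarge < 4xlarge, and so on."""
    if size == "large":
        return 0.5
    match = re.fullmatch(r"(\d*)xlarge", size)
    if not match:
        raise ValueError(f"unsupported size name: {size!r}")
    return float(match.group(1) or 1)


def is_larger_same_base(current, candidate):
    """True if candidate has the same family as current and a larger size."""
    current_family, current_size = current.split(".", 1)
    candidate_family, candidate_size = candidate.split(".", 1)
    return (current_family == candidate_family
            and size_rank(candidate_size) > size_rank(current_size))


print(is_larger_same_base("m6i.xlarge", "m6i.2xlarge"))  # True
print(is_larger_same_base("m6i.xlarge", "m5.4xlarge"))   # False
```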

2.2.3. Changing the Amazon Web Services instance type by using the AWS console

You can change the Amazon Web Services (AWS) instance type that your control plane machines use by updating the instance type in the AWS console.

Prerequisites

You have access to the AWS console with the permissions required to modify the EC2 Instance for your cluster.
You have access to the OpenShift Container Platform cluster as a user with the cluster-admin role.

Procedure

Open the AWS console and fetch the instances for the control plane machines.
Choose one control plane machine instance.
1. For the selected control plane machine, back up the etcd data by creating an etcd snapshot. For more information, see "Backing up etcd".
2. In the AWS console, stop the control plane machine instance.
3. Select the stopped instance, and click Actions → Instance Settings → Change instance type.
4. Change the instance to a larger type, ensuring that the type is the same base as the previous selection, and apply changes. For example, you can change m6i.xlarge to m6i.2xlarge or m6i.4xlarge.
5. Start the instance.
6. If your OpenShift Container Platform cluster has a corresponding Machine object for the instance, update the instance type of the object to match the instance type set in the AWS console.
Repeat this process for each control plane machine.

2.3. Recommended infrastructure practices

This topic provides recommended performance and scalability practices for infrastructure in OpenShift Container Platform.

2.3.1. Infrastructure node sizing

Infrastructure nodes are nodes that are labeled to run pieces of the OpenShift Container Platform environment. The infrastructure node resource requirements depend on the cluster age, nodes, and objects in the cluster, as these factors can lead to an increase in the number of metrics or time series in Prometheus. The following infrastructure node size recommendations are based on the results observed in cluster-density testing detailed in the Control plane node sizing section, where the monitoring stack and the default ingress-controller were moved to these nodes.

Number of worker nodes	Cluster density, or number of namespaces	CPU cores	Memory (GB)
27	500	4	24
120	1000	8	48
252	4000	16	128
501	4000	32	128

In general, three infrastructure nodes are recommended per cluster.

Important

Use these sizing recommendations as a guideline. Prometheus is a highly memory-intensive application; its resource usage depends on various factors, including the number of nodes, the number of objects, the Prometheus metrics scraping interval, the number of metrics or time series, and the age of the cluster. In addition, the router resource usage can also be affected by the number of routes and the amount and type of inbound requests.

These recommendations apply only to infrastructure nodes hosting Monitoring, Ingress and Registry infrastructure components installed during cluster creation.

2.3.2. Scaling the Cluster Monitoring Operator

OpenShift Container Platform exposes metrics that the Cluster Monitoring Operator (CMO) collects and stores in the Prometheus-based monitoring stack. As an administrator, you can view dashboards for system resources, containers, and components metrics in the OpenShift Container Platform web console by navigating to Observe → Dashboards.

2.3.3. Prometheus database storage requirements

Red Hat performed various tests for different scale sizes.

Note

The following Prometheus storage requirements are not prescriptive and should be used as a reference. Higher resource consumption might be observed in your cluster depending on workload activity and resource density, including the number of pods, containers, routes, or other resources exposing metrics collected by Prometheus.
You can configure the size-based data retention policy to suit your storage requirements.

Table 2.1. Prometheus Database storage requirements based on number of nodes/pods in the cluster
Number of nodes	Number of pods (2 containers per pod)	Prometheus storage growth per day	Prometheus storage growth per 15 days	Network (per tsdb chunk)
50	1800	6.3 GB	94 GB	16 MB
100	3600	13 GB	195 GB	26 MB
150	5400	19 GB	283 GB	36 MB
200	7200	25 GB	375 GB	46 MB

Approximately 20 percent of the expected size was added as overhead to ensure that the storage requirements do not exceed the calculated value.

The above calculation is for the default OpenShift Container Platform Cluster Monitoring Operator.
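
To relate the per-day growth figures to a retention period, the following sketch multiplies the daily growth by the number of retained days. Rounding the node count up to the next tested row is an assumption of this example, chosen to keep the estimate conservative. The table values already include the approximately 20 percent overhead, so the sketch does not add it again. The function name is hypothetical.

```python
# (max nodes, Prometheus storage growth per day in GB), taken from Table 2.1
GROWTH_GB_PER_DAY = [(50, 6.3), (100, 13), (150, 19), (200, 25)]


def storage_for_retention_gb(nodes, retention_days=15):
    """Estimate Prometheus storage (GB) for the given node count and retention."""
    for max_nodes, gb_per_day in GROWTH_GB_PER_DAY:
        if nodes <= max_nodes:
            return gb_per_day * retention_days
    raise ValueError("node count is beyond the tested range")


print(storage_for_retention_gb(100))      # 195
print(storage_for_retention_gb(120, 30))  # 570
```

For the 50-node and 150-node rows, this product differs slightly from the 15-day column in the table because the published per-day figures are rounded.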

Note

CPU utilization has a minor impact. The ratio is approximately 1 core out of 40 per 50 nodes and 1800 pods.

Recommendations for OpenShift Container Platform

Use at least two infrastructure (infra) nodes.
Use at least three openshift-container-storage nodes with SSD or non-volatile memory express (NVMe) drives.

2.3.4. Configuring cluster monitoring

You can increase the storage capacity for the Prometheus component in the cluster monitoring stack.

Procedure

To increase the storage capacity for Prometheus:

Create a YAML configuration file, cluster-monitoring-config.yaml. For example:

```
apiVersion: v1
kind: ConfigMap
data:
  config.yaml: |
    prometheusK8s:
      retention: {{PROMETHEUS_RETENTION_PERIOD}}
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      volumeClaimTemplate:
        spec:
          storageClassName: {{STORAGE_CLASS}}
          resources:
            requests:
              storage: {{PROMETHEUS_STORAGE_SIZE}}
    alertmanagerMain:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      volumeClaimTemplate:
        spec:
          storageClassName: {{STORAGE_CLASS}}
          resources:
            requests:
              storage: {{ALERTMANAGER_STORAGE_SIZE}}
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
```
- {{PROMETHEUS_RETENTION_PERIOD}}: Specifies the Prometheus retention period. The default value is 15d. Units are measured in time using one of these suffixes: s, m, h, d.
- {{STORAGE_CLASS}}: Specifies the storage class for your cluster. This placeholder appears in both the prometheusK8s and alertmanagerMain sections.
- {{PROMETHEUS_STORAGE_SIZE}}: Specifies the Prometheus storage size. A typical value is 2000Gi. Storage values can be a plain integer or a fixed-point number using one of these suffixes: E, P, T, G, M, k. You can also use the power-of-two equivalents: Ei, Pi, Ti, Gi, Mi, Ki.
- {{ALERTMANAGER_STORAGE_SIZE}}: Specifies the Alertmanager storage size. A typical value is 20Gi. The same storage value formats apply.

Add values for the retention period, storage class, and storage sizes.
Save the file.

Apply the changes by running:

$ oc create -f cluster-monitoring-config.yaml
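
If you prefer to fill in the placeholders with a script rather than by hand, the substitution step can be sketched as follows. This is a minimal example under stated assumptions: the function name is hypothetical, the storage class name `gp3-csi` is only a sample value, and the retention check accepts only the suffixes listed in this procedure.

```python
import re

# Mirrors the ConfigMap in the procedure; {{NAME}} marks a placeholder.
TEMPLATE = """\
apiVersion: v1
kind: ConfigMap
data:
  config.yaml: |
    prometheusK8s:
      retention: {{PROMETHEUS_RETENTION_PERIOD}}
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      volumeClaimTemplate:
        spec:
          storageClassName: {{STORAGE_CLASS}}
          resources:
            requests:
              storage: {{PROMETHEUS_STORAGE_SIZE}}
    alertmanagerMain:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      volumeClaimTemplate:
        spec:
          storageClassName: {{STORAGE_CLASS}}
          resources:
            requests:
              storage: {{ALERTMANAGER_STORAGE_SIZE}}
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
"""


def render_monitoring_config(values):
    """Substitute every {{NAME}} placeholder in TEMPLATE with values[NAME]."""
    retention = values["PROMETHEUS_RETENTION_PERIOD"]
    if not re.fullmatch(r"\d+[smhd]", retention):
        raise ValueError(f"unsupported retention value: {retention!r}")
    # A missing key raises KeyError instead of leaving a placeholder behind.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: values[m.group(1)], TEMPLATE)


rendered = render_monitoring_config({
    "PROMETHEUS_RETENTION_PERIOD": "15d",
    "STORAGE_CLASS": "gp3-csi",  # sample value; use your cluster's storage class
    "PROMETHEUS_STORAGE_SIZE": "2000Gi",
    "ALERTMANAGER_STORAGE_SIZE": "20Gi",
})
print(rendered)
```

You could redirect the printed output to cluster-monitoring-config.yaml and then apply it with the `oc create` command shown in the procedure.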
