Scalability and performance
Scaling your OpenShift Container Platform cluster and tuning performance in production environments
Chapter 1. OpenShift Container Platform scalability and performance overview
OpenShift Container Platform provides best practices and tools to help you optimize the performance and scale of your clusters. The following documentation provides information on recommended performance and scalability practices, reference design specifications, optimization, and low latency tuning.
To contact Red Hat support, see Getting support.
Some performance and scalability Operators have release cycles that are independent from OpenShift Container Platform release cycles. For more information, see OpenShift Operators.
Recommended performance and scalability practices
Recommended control plane practices
Recommended infrastructure practices
Planning, optimization, and measurement
Recommended practices for IBM Z and IBM LinuxONE
Using the Node Tuning Operator
Using CPU Manager and Topology Manager
Scheduling NUMA-aware workloads
Optimizing storage, routing, networking and CPU usage
Managing bare metal hosts and events
What are huge pages and how are they used by apps
Improving cluster stability in high latency environments using worker latency profiles
Chapter 2. Recommended performance and scalability practices
2.1. Recommended control plane practices
This topic provides recommended performance and scalability practices for control planes in OpenShift Container Platform.
2.1.1. Recommended practices for scaling the cluster
The guidance in this section is only relevant for installations with cloud provider integration.
Apply the following best practices to scale the number of worker machines in your OpenShift Container Platform cluster. You scale the worker machines by increasing or decreasing the number of replicas that are defined in the worker machine set.
When scaling up the cluster to higher node counts:
- Spread nodes across all of the available zones for higher availability.
- Scale up by no more than 25 to 50 machines at once.
- Consider creating new compute machine sets in each available zone with alternative instance types of similar size to help mitigate any periodic provider capacity constraints. For example, on AWS, use m5.large and m5d.large.
Cloud providers might implement a quota for API services. Therefore, gradually scale the cluster.
If you set the replicas in the compute machine sets to much higher numbers all at one time, the controller might not be able to create the machines. While it creates, checks, and updates the machines, the controller sends more queries to the cloud platform on which OpenShift Container Platform is deployed. That platform enforces API request limits, and excessive queries might lead to machine creation failures.
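The gradual scale-up guidance above can be sketched as a small planning helper. This is an illustration only, not part of any OpenShift tooling: the function names `scale_steps` and `spread_across_zones`, the default step of 50, and the zone names are assumptions made for the example.

```python
def scale_steps(current: int, target: int, max_step: int = 50) -> list[int]:
    """Return successive replica counts, changing by at most max_step per step."""
    if max_step <= 0:
        raise ValueError("max_step must be positive")
    steps = []
    replicas = current
    while replicas != target:
        # Clamp the change to the allowed step size in either direction.
        delta = max(-max_step, min(max_step, target - replicas))
        replicas += delta
        steps.append(replicas)
    return steps


def spread_across_zones(replicas: int, zones: list[str]) -> dict[str, int]:
    """Split a replica count as evenly as possible across the given zones."""
    base, extra = divmod(replicas, len(zones))
    return {zone: base + (1 if i < extra else 0) for i, zone in enumerate(zones)}


# Hypothetical example: grow from 24 to 120 workers across three zones.
for total in scale_steps(24, 120, max_step=50):
    print(total, spread_across_zones(total, ["zone-a", "zone-b", "zone-c"]))
```

Each printed line is one intermediate target that you would apply to the compute machine sets, waiting for the machines to become ready before moving to the next step.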
Enable machine health checks when scaling to large node counts. In case of failures, the health checks monitor the condition and automatically repair unhealthy machines.
When scaling large and dense clusters to lower node counts, the process might take a long time because it involves draining or evicting the objects running on the nodes being terminated in parallel. Also, the client might start to throttle the requests if there are too many objects to evict. The default client queries per second (QPS) and burst rates are currently set to 5 and 10 respectively. These values cannot be modified in OpenShift Container Platform.
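To get a feel for how the client-side limit affects scale-down time, the following sketch computes a lower bound on the time needed to send a number of requests through a token-bucket limiter. It assumes one eviction request per object, a bucket that starts full, and no retries or drain waits; the function name `min_seconds_to_send` is hypothetical, and real scale-down time is typically longer.

```python
def min_seconds_to_send(requests: int, qps: float = 5.0, burst: int = 10) -> float:
    """Lower bound on the time a token-bucket-limited client needs for `requests` calls.

    The first `burst` requests can be sent immediately; the rest are
    released at `qps` requests per second.
    """
    if requests <= burst:
        return 0.0
    return (requests - burst) / qps


# Hypothetical example: evicting 3000 objects with the default limits.
seconds = min_seconds_to_send(3000)
print(f"at least {seconds:.0f} s (~{seconds / 60:.1f} min)")
```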
2.1.2. Control plane node sizing
The control plane node resource requirements depend on the number and type of nodes and objects in the cluster. The following control plane node size recommendations are based on the results of control plane density focused testing, or Cluster-density. This test creates the following objects across a given number of namespaces:
- 1 image stream
- 1 build
- 5 deployments, with 2 pod replicas in a sleep state, mounting 4 secrets, 4 config maps, and 1 downward API volume each
- 5 services, each one pointing to the TCP/8080 and TCP/8443 ports of one of the previous deployments
- 1 route pointing to the first of the previous services
- 10 secrets containing 2048 random string characters
- 10 config maps containing 2048 random string characters
Number of worker nodes | Cluster-density (namespaces) | CPU cores | Memory (GB) |
---|---|---|---|
24 | 500 | 4 | 16 |
120 | 1000 | 8 | 32 |
252 | 4000 | 16, but 24 if using the OVN-Kubernetes network plug-in | 64, but 128 if using the OVN-Kubernetes network plug-in |
501, but untested with the OVN-Kubernetes network plug-in | 4000 | 16 | 96 |
The data in the preceding table is based on an OpenShift Container Platform cluster running on AWS, using r5.4xlarge instances as control plane nodes and m5.2xlarge instances as worker nodes.
On a large and dense cluster with three control plane nodes, the CPU and memory usage spikes when one of the nodes is stopped, rebooted, or fails. The failures can be due to unexpected issues with power, network, or underlying infrastructure, or to intentional cases where the cluster is restarted after shutting it down to save costs. The remaining two control plane nodes must handle the load in order to be highly available, which leads to an increase in resource usage. This is also expected during upgrades because the control plane nodes are cordoned, drained, and rebooted serially to apply the operating system updates, as well as the control plane Operators update. To avoid cascading failures, keep the overall CPU and memory resource usage on the control plane nodes to at most 60% of all available capacity to handle the resource usage spikes. Increase the CPU and memory on the control plane nodes accordingly to avoid potential downtime due to lack of resources.
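The reasoning behind the 60% ceiling can be illustrated with simple arithmetic. The sketch below assumes, as a simplification, that the load of a failed node is redistributed evenly over the remaining nodes; the function name `utilization_after_failure` is hypothetical.

```python
def utilization_after_failure(utilization: float, nodes: int = 3, failed: int = 1) -> float:
    """Per-node utilization after `failed` nodes go away and load is spread evenly."""
    remaining = nodes - failed
    if remaining <= 0:
        raise ValueError("at least one node must remain")
    return utilization * nodes / remaining


for before in (0.50, 0.60, 0.70):
    after = utilization_after_failure(before)
    status = "ok" if after <= 1.0 else "over capacity"
    print(f"{before:.0%} on three nodes -> {after:.0%} on two nodes ({status})")
```

Under this assumption, nodes running at 60% end up near 90% when one of three control plane nodes is unavailable, which leaves little room for additional spikes, and nodes running at 70% would exceed their capacity.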
The node sizing varies depending on the number of nodes and object counts in the cluster. It also depends on whether the objects are actively being created on the cluster. During object creation, the control plane is more active in terms of resource usage compared to when the objects are in the running phase.
Operator Lifecycle Manager (OLM) runs on the control plane nodes and its memory footprint depends on the number of namespaces and user-installed Operators that OLM needs to manage on the cluster. Size the control plane nodes accordingly to avoid out-of-memory (OOM) kills. The following data points are based on the results from cluster maximums testing.
Number of namespaces | OLM memory at idle state (GB) | OLM memory with 5 user operators installed (GB) |
---|---|---|
500 | 0.823 | 1.7 |
1000 | 1.2 | 2.5 |
1500 | 1.7 | 3.2 |
2000 | 2 | 4.4 |
3000 | 2.7 | 5.6 |
4000 | 3.8 | 7.6 |
5000 | 4.2 | 9.02 |
6000 | 5.8 | 11.3 |
7000 | 6.6 | 12.9 |
8000 | 6.9 | 14.8 |
9000 | 8 | 17.7 |
10,000 | 9.9 | 21.6 |
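For namespace counts that fall between the tested rows, one rough approach is to interpolate linearly between the neighboring data points. The sketch below does that with the values from the preceding table. Linear interpolation is an assumption made for illustration, not a documented sizing method, and the function name `estimate_olm_memory_gb` is hypothetical.

```python
from bisect import bisect_left

# (namespaces, OLM memory at idle in GB, OLM memory with 5 user Operators in GB)
OLM_MEMORY_GB = [
    (500, 0.823, 1.7), (1000, 1.2, 2.5), (1500, 1.7, 3.2), (2000, 2.0, 4.4),
    (3000, 2.7, 5.6), (4000, 3.8, 7.6), (5000, 4.2, 9.02), (6000, 5.8, 11.3),
    (7000, 6.6, 12.9), (8000, 6.9, 14.8), (9000, 8.0, 17.7), (10000, 9.9, 21.6),
]


def estimate_olm_memory_gb(namespaces: int, with_operators: bool = False) -> float:
    """Linearly interpolate OLM memory between the tested data points."""
    column = 2 if with_operators else 1
    counts = [row[0] for row in OLM_MEMORY_GB]
    if not counts[0] <= namespaces <= counts[-1]:
        raise ValueError("namespace count is outside the tested range")
    i = bisect_left(counts, namespaces)
    if counts[i] == namespaces:
        return OLM_MEMORY_GB[i][column]  # exact table row
    x0, x1 = counts[i - 1], counts[i]
    y0, y1 = OLM_MEMORY_GB[i - 1][column], OLM_MEMORY_GB[i][column]
    return y0 + (y1 - y0) * (namespaces - x0) / (x1 - x0)


print(round(estimate_olm_memory_gb(2500), 2))
print(round(estimate_olm_memory_gb(2500, with_operators=True), 2))
```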
You can modify the control plane node size in a running OpenShift Container Platform 4.13 cluster for the following configurations only:
- Clusters installed with a user-provisioned installation method.
- AWS clusters installed with an installer-provisioned infrastructure installation method.
- Clusters that use a control plane machine set to manage control plane machines.
For all other configurations, you must estimate your total node count and use the suggested control plane node size during installation.
The recommendations are based on the data points captured on OpenShift Container Platform clusters with OpenShift SDN as the network plugin.
In OpenShift Container Platform 4.13, half of a CPU core (500 millicore) is now reserved by the system by default compared to OpenShift Container Platform 3.11 and previous versions. The sizes are determined taking that into consideration.
2.1.2.1. Selecting a larger Amazon Web Services instance type for control plane machines
If the control plane machines in an Amazon Web Services (AWS) cluster require more resources, you can select a larger AWS instance type for the control plane machines to use.
The procedure for clusters that use a control plane machine set is different from the procedure for clusters that do not use a control plane machine set.
If you are uncertain about the state of the ControlPlaneMachineSet CR in your cluster, you can verify the CR status.
2.1.2.1.1. Changing the Amazon Web Services instance type by using a control plane machine set
You can change the Amazon Web Services (AWS) instance type that your control plane machines use by updating the specification in the control plane machine set custom resource (CR).
Prerequisites
- Your AWS cluster uses a control plane machine set.
Procedure
Edit your control plane machine set CR by running the following command:
$ oc --namespace openshift-machine-api edit controlplanemachineset.machine.openshift.io cluster
Edit the following line under the providerSpec field:
    providerSpec:
      value:
        ...
        instanceType: <compatible_aws_instance_type> # 1
- 1: Specify a larger AWS instance type with the same base as the previous selection. For example, you can change m6i.xlarge to m6i.2xlarge or m6i.4xlarge.
Save your changes.
- For clusters that use the default RollingUpdate update strategy, the Operator automatically propagates the changes to your control plane configuration.
- For clusters that are configured to use the OnDelete update strategy, you must replace your control plane machines manually.
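Before editing the instance type, it can help to check that the new value really is a larger type with the same base. The sketch below interprets "same base" as the same prefix before the dot (for example, m6i), which is an assumption. The helper names are hypothetical, and only the common size names are handled (other sizes, such as metal, raise an error).

```python
import re

# Common AWS size suffixes in increasing order; "<N>xlarge" is ranked after "xlarge".
_SIZE_ORDER = ["nano", "micro", "small", "medium", "large", "xlarge"]


def _size_rank(size: str) -> int:
    """Map a size suffix such as 'large' or '4xlarge' to a sortable rank."""
    match = re.fullmatch(r"(\d+)xlarge", size)
    if match:
        return len(_SIZE_ORDER) - 1 + int(match.group(1))
    return _SIZE_ORDER.index(size)  # raises ValueError for unknown sizes


def is_larger_same_family(current: str, candidate: str) -> bool:
    """True if candidate has the same family prefix as current and a larger size."""
    current_family, current_size = current.split(".", 1)
    candidate_family, candidate_size = candidate.split(".", 1)
    return (current_family == candidate_family
            and _size_rank(candidate_size) > _size_rank(current_size))


print(is_larger_same_family("m6i.xlarge", "m6i.2xlarge"))
print(is_larger_same_family("m6i.xlarge", "r5.4xlarge"))
```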
2.1.2.1.2. Changing the Amazon Web Services instance type by using the AWS console
You can change the Amazon Web Services (AWS) instance type that your control plane machines use by updating the instance type in the AWS console.
Prerequisites
- You have access to the AWS console with the permissions required to modify the EC2 Instance for your cluster.
- You have access to the OpenShift Container Platform cluster as a user with the cluster-admin role.
Procedure
- Open the AWS console and fetch the instances for the control plane machines.
- Choose one control plane machine instance.
  - For the selected control plane machine, back up the etcd data by creating an etcd snapshot. For more information, see "Backing up etcd".
  - In the AWS console, stop the control plane machine instance.
  - Select the stopped instance, and click Actions → Instance Settings → Change instance type.
  - Change the instance to a larger type, ensuring that the type is the same base as the previous selection, and apply changes. For example, you can change m6i.xlarge to m6i.2xlarge or m6i.4xlarge.
  - Start the instance.
  - If your OpenShift Container Platform cluster has a corresponding Machine object for the instance, update the instance type of the object to match the instance type set in the AWS console.
- Repeat this process for each control plane machine.
2.2. Recommended infrastructure practices
This topic provides recommended performance and scalability practices for infrastructure in OpenShift Container Platform.
2.2.1. Infrastructure node sizing
Infrastructure nodes are nodes that are labeled to run pieces of the OpenShift Container Platform environment. The infrastructure node resource requirements depend on the cluster age, nodes, and objects in the cluster, as these factors can lead to an increase in the number of metrics or time series in Prometheus. The following infrastructure node size recommendations are based on the results observed in cluster-density testing detailed in the Control plane node sizing section, where the monitoring stack and the default ingress-controller were moved to these nodes.
Number of worker nodes | Cluster density, or number of namespaces | CPU cores | Memory (GB) |
---|---|---|---|
27 | 500 | 4 | 24 |
120 | 1000 | 8 | 48 |
252 | 4000 | 16 | 128 |
501 | 4000 | 32 | 128 |
In general, three infrastructure nodes are recommended per cluster.
These sizing recommendations should be used as a guideline. Prometheus is a highly memory intensive application; the resource usage depends on various factors including the number of nodes, objects, the Prometheus metrics scraping interval, metrics or time series, and the age of the cluster. In addition, the router resource usage can also be affected by the number of routes and the amount/type of inbound requests.
These recommendations apply only to infrastructure nodes hosting Monitoring, Ingress and Registry infrastructure components installed during cluster creation.
In OpenShift Container Platform 4.13, half of a CPU core (500 millicore) is now reserved by the system by default compared to OpenShift Container Platform 3.11 and previous versions. This influences the stated sizing recommendations.
2.2.2. Scaling the Cluster Monitoring Operator
OpenShift Container Platform exposes metrics that the Cluster Monitoring Operator (CMO) collects and stores in the Prometheus-based monitoring stack. As an administrator, you can view dashboards for system resources, containers, and components metrics in the OpenShift Container Platform web console by navigating to Observe → Dashboards.
2.2.3. Prometheus database storage requirements
Red Hat performed various tests for different scale sizes.
- The following Prometheus storage requirements are not prescriptive and should be used as a reference. Higher resource consumption might be observed in your cluster depending on workload activity and resource density, including the number of pods, containers, routes, or other resources exposing metrics collected by Prometheus.
- You can configure the size-based data retention policy to suit your storage requirements.
Number of nodes | Number of pods (2 containers per pod) | Prometheus storage growth per day | Prometheus storage growth per 15 days | Network (per tsdb chunk) |
---|---|---|---|---|
50 | 1800 | 6.3 GB | 94 GB | 16 MB |
100 | 3600 | 13 GB | 195 GB | 26 MB |
150 | 5400 | 19 GB | 283 GB | 36 MB |
200 | 7200 | 25 GB | 375 GB | 46 MB |
Approximately 20 percent of the expected size was added as overhead to ensure that the storage requirements do not exceed the calculated value.
The above calculation is for the default OpenShift Container Platform Cluster Monitoring Operator.
CPU utilization has minor impact. The ratio is approximately 1 core out of 40 per 50 nodes and 1800 pods.
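As a rough planning aid, the daily growth figures in the preceding table can be interpolated for other node counts and multiplied by a retention period. The sketch below assumes the same pod density as the table (about 36 pods per node, 2 containers per pod) and linear growth between rows; both are simplifications, and the function names are hypothetical.

```python
# (number of nodes, Prometheus storage growth per day in GB), from the table above
DAILY_GROWTH_GB = [(50, 6.3), (100, 13.0), (150, 19.0), (200, 25.0)]


def daily_growth_gb(nodes: int) -> float:
    """Linearly interpolate daily storage growth between tested node counts."""
    points = DAILY_GROWTH_GB
    if not points[0][0] <= nodes <= points[-1][0]:
        raise ValueError("node count is outside the tested range")
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= nodes <= x1:
            return y0 + (y1 - y0) * (nodes - x0) / (x1 - x0)
    raise AssertionError("unreachable for in-range input")


def estimate_prometheus_storage_gb(nodes: int, retention_days: int = 15) -> float:
    """Estimate total storage for the given retention period."""
    return daily_growth_gb(nodes) * retention_days


print(round(estimate_prometheus_storage_gb(75, retention_days=15), 1))
```

Treat the result as a reference value only; as noted above, actual consumption depends on workload activity and resource density.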
Recommendations for OpenShift Container Platform
- Use at least two infrastructure (infra) nodes.
- Use at least three openshift-container-storage nodes with solid-state (SSD) or non-volatile memory express (NVMe) drives.
2.2.4. Configuring cluster monitoring
You can increase the storage capacity for the Prometheus component in the cluster monitoring stack.
Procedure
To increase the storage capacity for Prometheus:
Create a YAML configuration file, cluster-monitoring-config.yaml. For example:
    apiVersion: v1
    kind: ConfigMap
    data:
      config.yaml: |
        prometheusK8s:
          retention: {{PROMETHEUS_RETENTION_PERIOD}} # 1
          nodeSelector:
            node-role.kubernetes.io/infra: ""
          volumeClaimTemplate:
            spec:
              storageClassName: {{STORAGE_CLASS}} # 2
              resources:
                requests:
                  storage: {{PROMETHEUS_STORAGE_SIZE}} # 3
        alertmanagerMain:
          nodeSelector:
            node-role.kubernetes.io/infra: ""
          volumeClaimTemplate:
            spec:
              storageClassName: {{STORAGE_CLASS}} # 4
              resources:
                requests:
                  storage: {{ALERTMANAGER_STORAGE_SIZE}} # 5
    metadata:
      name: cluster-monitoring-config
      namespace: openshift-monitoring
- 1: The default value of Prometheus retention is PROMETHEUS_RETENTION_PERIOD=15d. Units are measured in time using one of these suffixes: s, m, h, d.
- 2, 4: The storage class for your cluster.
- 3: A typical value is PROMETHEUS_STORAGE_SIZE=2000Gi. Storage values can be a plain integer or a fixed-point integer using one of these suffixes: E, P, T, G, M, K. You can also use the power-of-two equivalents: Ei, Pi, Ti, Gi, Mi, Ki.
- 5: A typical value is ALERTMANAGER_STORAGE_SIZE=20Gi. Storage values can be a plain integer or a fixed-point integer using one of these suffixes: E, P, T, G, M, K. You can also use the power-of-two equivalents: Ei, Pi, Ti, Gi, Mi, Ki.
- Add values for the retention period, storage class, and storage sizes.
- Save the file.
Apply the changes by running:
$ oc create -f cluster-monitoring-config.yaml
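If you prefer to fill in the placeholders programmatically rather than by hand, the substitution can be sketched as follows. The sizes and retention period are the typical values mentioned in this procedure; the storage class name `example-storage-class` and the `render` helper are hypothetical.

```python
import re

TEMPLATE = """\
apiVersion: v1
kind: ConfigMap
data:
  config.yaml: |
    prometheusK8s:
      retention: {{PROMETHEUS_RETENTION_PERIOD}}
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      volumeClaimTemplate:
        spec:
          storageClassName: {{STORAGE_CLASS}}
          resources:
            requests:
              storage: {{PROMETHEUS_STORAGE_SIZE}}
    alertmanagerMain:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      volumeClaimTemplate:
        spec:
          storageClassName: {{STORAGE_CLASS}}
          resources:
            requests:
              storage: {{ALERTMANAGER_STORAGE_SIZE}}
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
"""

VALUES = {
    "PROMETHEUS_RETENTION_PERIOD": "15d",
    "STORAGE_CLASS": "example-storage-class",  # hypothetical; use your cluster's class
    "PROMETHEUS_STORAGE_SIZE": "2000Gi",
    "ALERTMANAGER_STORAGE_SIZE": "20Gi",
}


def render(template: str, values: dict[str, str]) -> str:
    """Replace every {{NAME}} placeholder, failing loudly if a value is missing."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder {name}")
        return values[name]
    return re.sub(r"\{\{(\w+)\}\}", substitute, template)


rendered = render(TEMPLATE, VALUES)
print(rendered)
```

Failing on a missing value avoids applying a file that still contains an unreplaced placeholder.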
2.2.5. Additional resources
2.3. Recommended etcd practices
This topic provides recommended performance and scalability practices for etcd in OpenShift Container Platform.
2.3.1. Recommended etcd practices
Because etcd writes data to disk and persists proposals on disk, its performance depends on disk performance. Although etcd is not particularly I/O intensive, it requires a low latency block device for optimal performance and stability. Because etcd’s consensus protocol depends on persistently storing metadata to a log (WAL), etcd is sensitive to disk-write latency. Slow disks and disk activity from other processes can cause long fsync latencies.
Those latencies can cause etcd to miss heartbeats, not commit new proposals to the disk on time, and ultimately experience request timeouts and temporary leader loss. High write latencies also lead to OpenShift API slowness, which affects cluster performance. For these reasons, avoid colocating other workloads on the control plane nodes that are I/O sensitive or intensive and share the same underlying I/O infrastructure.
In terms of latency, run etcd on top of a block device that can write at least 50 IOPS of 8000 bytes long sequentially, that is, with a latency of 10 ms, keeping in mind that etcd uses fdatasync to synchronize each write in the WAL. For heavily loaded clusters, sequential 500 IOPS of 8000 bytes (2 ms) are recommended. To measure those numbers, you can use a benchmarking tool, such as fio.
To achieve such performance, run etcd on machines that are backed by SSD or NVMe disks with low latency and high throughput. Consider single-level cell (SLC) solid-state drives (SSDs), which provide 1 bit per memory cell, are durable and reliable, and are ideal for write-intensive workloads.
The load on etcd arises from static factors, such as the number of nodes and pods, and dynamic factors, including changes in endpoints due to pod autoscaling, pod restarts, job executions, and other workload-related events. To accurately size your etcd setup, you must analyze the specific requirements of your workload. Consider the number of nodes, pods, and other relevant factors that impact the load on etcd.
The following hard drive practices provide optimal etcd performance:
- Use dedicated etcd drives. Avoid drives that communicate over the network, such as iSCSI. Do not place log files or other heavy workloads on etcd drives.
- Prefer drives with low latency to support fast read and write operations.
- Prefer high-bandwidth writes for faster compactions and defragmentation.
- Prefer high-bandwidth reads for faster recovery from failures.
- Use solid state drives as a minimum selection. Prefer NVMe drives for production environments.
- Use server-grade hardware for increased reliability.
Avoid NAS or SAN setups and spinning drives. Ceph Rados Block Device (RBD) and other types of network-attached storage can result in unpredictable network latency. To provide fast storage to etcd nodes at scale, use PCI passthrough to pass NVM devices directly to the nodes.
Always benchmark by using utilities such as fio. You can use such utilities to continuously monitor the cluster performance as the cluster grows.
Avoid using the Network File System (NFS) protocol or other network based file systems.
Some key metrics to monitor on a deployed OpenShift Container Platform cluster are p99 of etcd disk write ahead log duration and the number of etcd leader changes. Use Prometheus to track these metrics.
The etcd member database sizes can vary in a cluster during normal operations. This difference does not affect cluster upgrades, even if the leader size is different from the other members.
To validate the hardware for etcd before or after you create the OpenShift Container Platform cluster, you can use fio.
Prerequisites
- Container runtimes such as Podman or Docker are installed on the machine that you’re testing.
- Data is written to the /var/lib/etcd path.
Procedure
Run fio and analyze the results:
If you use Podman, run this command:
$ sudo podman run --volume /var/lib/etcd:/var/lib/etcd:Z quay.io/cloud-bulldozer/etcd-perf
If you use Docker, run this command:
$ sudo docker run --volume /var/lib/etcd:/var/lib/etcd:Z quay.io/cloud-bulldozer/etcd-perf
The output reports whether the disk is fast enough to host etcd by checking whether the 99th percentile of the fsync metric captured from the run is less than 10 ms. A few of the most important etcd metrics that might be affected by I/O performance are as follows:
- The etcd_disk_wal_fsync_duration_seconds_bucket metric reports the etcd WAL fsync duration.
- The etcd_disk_backend_commit_duration_seconds_bucket metric reports the etcd backend commit latency duration.
- The etcd_server_leader_changes_seen_total metric reports the leader changes.
Because etcd replicates the requests among all the members, its performance strongly depends on network input/output (I/O) latency. High network latencies result in etcd heartbeats taking longer than the election timeout, which results in leader elections that are disruptive to the cluster. A key metric to monitor on a deployed OpenShift Container Platform cluster is the 99th percentile of etcd network peer latency on each etcd cluster member. Use Prometheus to track the metric.
The histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[2m])) expression reports the round trip time for etcd to finish replicating the client requests between the members. Ensure that it is less than 50 ms.
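To make the 99th percentile thresholds above more concrete, the following sketch estimates a quantile from cumulative histogram buckets by linear interpolation within the bucket that contains the target rank, which is the general approach that Prometheus documents for histogram_quantile. It is a simplified illustration, not the Prometheus implementation, and the bucket counts are made-up sample data.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) buckets.

    The buckets must include a final (math.inf, total_count) entry.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        raise ValueError("need at least two buckets, ending with +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                return prev_bound  # fall back to the highest finite bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    raise AssertionError("unreachable: the +Inf bucket always matches")


# Made-up WAL fsync observations (bucket upper bounds in seconds).
sample = [(0.001, 400), (0.002, 800), (0.004, 950), (0.008, 990),
          (0.016, 1000), (math.inf, 1000)]
p99 = histogram_quantile(0.99, sample)
print(f"p99 fsync: {p99 * 1000:.1f} ms, under 10 ms: {p99 < 0.010}")
```

The same calculation applies to the peer round trip time histogram, with 50 ms as the threshold.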
2.3.2. Moving etcd to a different disk
You can move etcd from a shared disk to a separate disk to prevent or resolve performance issues.
The Machine Config Operator (MCO) is responsible for mounting a secondary disk for OpenShift Container Platform 4.13 container storage.
This encoded script only supports device names for the following device types:
- SCSI or SATA: /dev/sd*
- Virtual device: /dev/vd*
- NVMe: /dev/nvme*[0-9]*n*
Limitations
- When the new disk is attached to the cluster, the etcd database is part of the root mount. It is not part of the secondary disk or the intended disk when the primary node is recreated. As a result, the primary node will not create a separate /var/lib/etcd mount.
Prerequisites
- You have a backup of your cluster’s etcd data.
- You have installed the OpenShift CLI (oc).
- You have access to the cluster with cluster-admin privileges.
- Add additional disks before uploading the machine configuration.
- The MachineConfigPool must match metadata.labels[machineconfiguration.openshift.io/role]. This applies to a controller, worker, or a custom pool.
This procedure does not move parts of the root file system, such as /var/, to another disk or partition on an installed node.
This procedure is not supported when using control plane machine sets.
Procedure
Attach the new disk to the cluster and verify that the disk is detected in the node by running the lsblk command in a debug shell:
$ oc debug node/<node_name>
# lsblk
Note the device name of the new disk reported by the lsblk command.
Create the following script and name it etcd-find-secondary-device.sh:
    #!/bin/bash
    set -uo pipefail
    for device in <device_type_glob>; do # 1
      /usr/sbin/blkid "${device}" &> /dev/null
      if [ $? == 2 ]; then
        echo "secondary device found ${device}"
        echo "creating filesystem for etcd mount"
        mkfs.xfs -L var-lib-etcd -f "${device}" &> /dev/null
        udevadm settle
        touch /etc/var-lib-etcd-mount
        exit
      fi
    done
    echo "Couldn't find secondary block device!" >&2
    exit 77
- 1: Replace <device_type_glob> with a shell glob for your block device type. For SCSI or SATA drives, use /dev/sd*; for virtual drives, use /dev/vd*; for NVMe drives, use /dev/nvme*[0-9]*n*.
Create a base64-encoded string from the etcd-find-secondary-device.sh script and note its contents:
$ base64 -w0 etcd-find-secondary-device.sh
Create a MachineConfig YAML file named etcd-mc.yml with contents such as the following:
    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    metadata:
      labels:
        machineconfiguration.openshift.io/role: master
      name: 98-var-lib-etcd
    spec:
      config:
        ignition:
          version: 3.2.0
        storage:
          files:
            - path: /etc/find-secondary-device
              mode: 0755
              contents:
                source: data:text/plain;charset=utf-8;base64,<encoded_etcd_find_secondary_device_script> # 1
        systemd:
          units:
            - name: find-secondary-device.service
              enabled: true
              contents: |
                [Unit]
                Description=Find secondary device
                DefaultDependencies=false
                After=systemd-udev-settle.service
                Before=local-fs-pre.target
                ConditionPathExists=!/etc/var-lib-etcd-mount
                [Service]
                RemainAfterExit=yes
                ExecStart=/etc/find-secondary-device
                RestartForceExitStatus=77
                [Install]
                WantedBy=multi-user.target
            - name: var-lib-etcd.mount
              enabled: true
              contents: |
                [Unit]
                Before=local-fs.target
                [Mount]
                What=/dev/disk/by-label/var-lib-etcd
                Where=/var/lib/etcd
                Type=xfs
                TimeoutSec=120s
                [Install]
                RequiredBy=local-fs.target
            - name: sync-var-lib-etcd-to-etcd.service
              enabled: true
              contents: |
                [Unit]
                Description=Sync etcd data if new mount is empty
                DefaultDependencies=no
                After=var-lib-etcd.mount var.mount
                Before=crio.service
                [Service]
                Type=oneshot
                RemainAfterExit=yes
                ExecCondition=/usr/bin/test ! -d /var/lib/etcd/member
                ExecStart=/usr/sbin/setsebool -P rsync_full_access 1
                ExecStart=/bin/rsync -ar /sysroot/ostree/deploy/rhcos/var/lib/etcd/ /var/lib/etcd/
                ExecStart=/usr/sbin/semanage fcontext -a -t container_var_lib_t '/var/lib/etcd(/.*)?'
                ExecStart=/usr/sbin/setsebool -P rsync_full_access 0
                TimeoutSec=0
                [Install]
                WantedBy=multi-user.target graphical.target
            - name: restorecon-var-lib-etcd.service
              enabled: true
              contents: |
                [Unit]
                Description=Restore recursive SELinux security contexts
                DefaultDependencies=no
                After=var-lib-etcd.mount
                Before=crio.service
                [Service]
                Type=oneshot
                RemainAfterExit=yes
                ExecStart=/sbin/restorecon -R /var/lib/etcd/
                TimeoutSec=0
                [Install]
                WantedBy=multi-user.target graphical.target
- 1: Replace <encoded_etcd_find_secondary_device_script> with the encoded script contents that you noted.
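The encoding step in this procedure can also be sketched in Python, which shows how the base64 string fits into the data URL used in the source field. The helper name `to_ignition_data_url` and the sample script text are hypothetical; in practice you would read the contents of etcd-find-secondary-device.sh.

```python
import base64

DATA_URL_PREFIX = "data:text/plain;charset=utf-8;base64,"


def to_ignition_data_url(script_text: str) -> str:
    """Encode script text as a single-line base64 data URL (no line wrapping)."""
    encoded = base64.b64encode(script_text.encode("utf-8")).decode("ascii")
    return DATA_URL_PREFIX + encoded


# Stand-in for the real script contents.
sample_script = '#!/bin/bash\necho "example"\n'
url = to_ignition_data_url(sample_script)
print(url)

# Round trip: decoding the payload returns the original script text.
payload = url[len(DATA_URL_PREFIX):]
print(base64.b64decode(payload).decode("utf-8") == sample_script)
```

Like `base64 -w0`, `base64.b64encode` produces output without line breaks, so the result can be pasted on a single YAML line.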
Verification steps
Run the grep -w "/var/lib/etcd" /proc/mounts command in a debug shell for the node to ensure that the disk is mounted:
$ oc debug node/<node_name>
# grep -w "/var/lib/etcd" /proc/mounts
Example output
/dev/sdb /var/lib/etcd xfs rw,seclabel,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
2.3.3. Defragmenting etcd data
For large and dense clusters, etcd can suffer from poor performance if the keyspace grows too large and exceeds the space quota. Periodically maintain and defragment etcd to free up space in the data store. Monitor Prometheus for etcd metrics and defragment it when required; otherwise, etcd can raise a cluster-wide alarm that puts the cluster into a maintenance mode that accepts only key reads and deletes.
Monitor these key metrics:
- etcd_server_quota_backend_bytes, which is the current quota limit
- etcd_mvcc_db_total_size_in_use_in_bytes, which indicates the actual database usage after a history compaction
- etcd_mvcc_db_total_size_in_bytes, which shows the database size, including free space waiting for defragmentation
Defragment etcd data to reclaim disk space after events that cause disk fragmentation, such as etcd history compaction.
History compaction is performed automatically every five minutes and leaves gaps in the back-end database. This fragmented space is available for use by etcd, but is not available to the host file system. You must defragment etcd to make this space available to the host file system.
Defragmentation occurs automatically, but you can also trigger it manually.
Automatic defragmentation is good for most cases, because the etcd Operator uses cluster information to determine the most efficient operation for the user.
2.3.3.1. Automatic defragmentation
The etcd Operator automatically defragments disks. No manual intervention is needed.
Verify that the defragmentation process is successful by viewing one of these logs:
- etcd logs
- cluster-etcd-operator pod
- operator status error log
Automatic defragmentation can cause leader election failure in various OpenShift core components, such as the Kubernetes controller manager, which triggers a restart of the failing component. The restart is harmless and either triggers failover to the next running instance or the component resumes work again after the restart.
Example log output for successful defragmentation
etcd member has been defragmented: <member_name>, memberID: <member_id>
Example log output for unsuccessful defragmentation
failed defrag on member: <member_name>, memberID: <member_id>: <error_message>
2.3.3.2. Manual defragmentation
A Prometheus alert indicates when you need to use manual defragmentation. The alert is displayed in two cases:
- When etcd uses more than 50% of its available space for more than 10 minutes
- When etcd is actively using less than 50% of its total database size for more than 10 minutes
You can also determine whether defragmentation is needed by checking the etcd database size in MB that will be freed by defragmentation with the PromQL expression: (etcd_mvcc_db_total_size_in_bytes - etcd_mvcc_db_total_size_in_use_in_bytes)/1024/1024
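The PromQL expression above is plain arithmetic on two metric values, which the following sketch reproduces together with the second alert condition (less than 50% of the total database size actively in use). The byte values are made up for illustration and the function names are hypothetical.

```python
def reclaimable_mb(total_bytes: int, in_use_bytes: int) -> float:
    """Space in MB that defragmentation would free, as in the PromQL expression."""
    return (total_bytes - in_use_bytes) / 1024 / 1024


def in_use_ratio(total_bytes: int, in_use_bytes: int) -> float:
    """Fraction of the database file that etcd is actively using."""
    return in_use_bytes / total_bytes


# Made-up sample values for the two metrics.
total = 104 * 1024 * 1024    # etcd_mvcc_db_total_size_in_bytes
in_use = 41 * 1024 * 1024    # etcd_mvcc_db_total_size_in_use_in_bytes

print(f"reclaimable: {reclaimable_mb(total, in_use):.0f} MB")
print(f"in use: {in_use_ratio(total, in_use):.0%}")
```

Note that the alert conditions also require the state to persist for more than 10 minutes, which a single sample cannot show.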
Defragmenting etcd is a blocking action. The etcd member will not respond until defragmentation is complete. For this reason, wait at least one minute between defragmentation actions on each of the pods to allow the cluster to recover.
Follow this procedure to defragment etcd data on each etcd member.
Prerequisites
- You have access to the cluster as a user with the cluster-admin role.
Procedure
Determine which etcd member is the leader, because the leader should be defragmented last.
Get the list of etcd pods:
$ oc -n openshift-etcd get pods -l k8s-app=etcd -o wide
Example output
    etcd-ip-10-0-159-225.example.redhat.com   3/3   Running   0   175m   10.0.159.225   ip-10-0-159-225.example.redhat.com   <none>   <none>
    etcd-ip-10-0-191-37.example.redhat.com    3/3   Running   0   173m   10.0.191.37    ip-10-0-191-37.example.redhat.com    <none>   <none>
    etcd-ip-10-0-199-170.example.redhat.com   3/3   Running   0   176m   10.0.199.170   ip-10-0-199-170.example.redhat.com   <none>   <none>
Choose a pod and run the following command to determine which etcd member is the leader:
$ oc rsh -n openshift-etcd etcd-ip-10-0-159-225.example.redhat.com etcdctl endpoint status --cluster -w table
Example output
    Defaulting container name to etcdctl.
    Use 'oc describe pod/etcd-ip-10-0-159-225.example.redhat.com -n openshift-etcd' to see all of the containers in this pod.
    +---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
    |         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
    +---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
    |  https://10.0.191.37:2379 | 251cd44483d811c3 |   3.5.9 |  104 MB |     false |      false |         7 |      91624 |              91624 |        |
    | https://10.0.159.225:2379 | 264c7c58ecbdabee |   3.5.9 |  104 MB |     false |      false |         7 |      91624 |              91624 |        |
    | https://10.0.199.170:2379 | 9ac311f93915cc79 |   3.5.9 |  104 MB |      true |      false |         7 |      91624 |              91624 |        |
    +---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Based on the IS LEADER column of this output, the https://10.0.199.170:2379 endpoint is the leader. Matching this endpoint with the output of the previous step, the pod name of the leader is etcd-ip-10-0-199-170.example.redhat.com.
Defragment an etcd member.
Connect to the running etcd container, passing in the name of a pod that is not the leader:
$ oc rsh -n openshift-etcd etcd-ip-10-0-159-225.example.redhat.com
Unset the ETCDCTL_ENDPOINTS environment variable:
sh-4.4# unset ETCDCTL_ENDPOINTS
Defragment the etcd member:
sh-4.4# etcdctl --command-timeout=30s --endpoints=https://localhost:2379 defrag
Example output
Finished defragmenting etcd member[https://localhost:2379]
If a timeout error occurs, increase the value for --command-timeout until the command succeeds.
Verify that the database size was reduced:
sh-4.4# etcdctl endpoint status -w table --cluster
Example output
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|  https://10.0.191.37:2379 | 251cd44483d811c3 |   3.5.9 |  104 MB |     false |      false |         7 |      91624 |              91624 |        |
| https://10.0.159.225:2379 | 264c7c58ecbdabee |   3.5.9 |   41 MB |     false |      false |         7 |      91624 |              91624 |        |
| https://10.0.199.170:2379 | 9ac311f93915cc79 |   3.5.9 |  104 MB |      true |      false |         7 |      91624 |              91624 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
This example shows that the database size for the defragmented etcd member (https://10.0.159.225:2379) is now 41 MB as opposed to the starting size of 104 MB.
Repeat these steps to connect to each of the other etcd members and defragment them. Always defragment the leader last.
Wait at least one minute between defragmentation actions to allow the etcd pod to recover. Until the etcd pod recovers, the etcd member will not respond.
If any NOSPACE alarms were triggered due to the space quota being exceeded, clear them.
Check if there are any NOSPACE alarms:
sh-4.4# etcdctl alarm list
Example output
memberID:12345678912345678912 alarm:NOSPACE
Clear the alarms:
sh-4.4# etcdctl alarm disarm
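The ordering rule in the procedure above (defragment every non-leader member first, and the leader last) can be sketched in Python. This is a minimal illustration, not part of the product tooling: the function names parse_status_table and defrag_order are hypothetical, and the sample table is abbreviated to three of the columns that etcdctl endpoint status --cluster -w table prints. Because rows are keyed by column name, the same parser should also handle the full table.

```python
from __future__ import annotations

# Abbreviated sample of the table output shown in the procedure (assumption:
# only three of the real columns are reproduced here).
SAMPLE_TABLE = """\
+---------------------------+------------------+-----------+
|         ENDPOINT          |        ID        | IS LEADER |
+---------------------------+------------------+-----------+
|  https://10.0.191.37:2379 | 251cd44483d811c3 |     false |
| https://10.0.159.225:2379 | 264c7c58ecbdabee |     false |
| https://10.0.199.170:2379 | 9ac311f93915cc79 |      true |
+---------------------------+------------------+-----------+
"""


def parse_status_table(table: str) -> list[dict[str, str]]:
    """Return one dict per etcd member, keyed by the table's column names."""
    header = None
    rows = []
    for line in table.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines and informational messages
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if header is None:
            header = cells  # the first "|" line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def defrag_order(rows: list[dict[str, str]]) -> list[str]:
    """List endpoints so that non-leaders come first and the leader comes last."""
    followers = [r["ENDPOINT"] for r in rows if r["IS LEADER"] != "true"]
    leaders = [r["ENDPOINT"] for r in rows if r["IS LEADER"] == "true"]
    return followers + leaders


if __name__ == "__main__":
    for endpoint in defrag_order(parse_status_table(SAMPLE_TABLE)):
        print(endpoint)
```

The sketch only computes the order. Running the defrag command and waiting at least one minute between members remain manual steps.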
Chapter 3. Planning your environment according to object maximums
Consider the following tested object maximums when you plan your OpenShift Container Platform cluster.
These guidelines are based on the largest possible cluster. For smaller clusters, the maximums are lower. There are many factors that influence the stated thresholds, including the etcd version or storage data format.
In most cases, exceeding these numbers results in lower overall performance. It does not necessarily mean that the cluster will fail.
Clusters that experience rapid change, such as those with many starting and stopping pods, can have a lower practical maximum size than documented.
3.1. OpenShift Container Platform tested cluster maximums for major releases
Red Hat does not provide direct guidance on sizing your OpenShift Container Platform cluster. This is because determining whether your cluster is within the supported bounds of OpenShift Container Platform requires careful consideration of all the multidimensional factors that limit the cluster scale.
OpenShift Container Platform supports tested cluster maximums rather than absolute cluster maximums. Not every combination of OpenShift Container Platform version, control plane workload, and network plugin is tested, so the following table does not represent an absolute expectation of scale for all deployments. It might not be possible to scale to a maximum on all dimensions simultaneously. The table contains tested maximums for specific workload and deployment configurations, and serves as a scale guide as to what can be expected with similar deployments.
Maximum type | 4.x tested maximum |
---|---|
Number of nodes | 2,000 [1] |
Number of pods [2] | 150,000 |
Number of pods per node | 2,500 [3][4] |
Number of pods per core | There is no default value. |
Number of namespaces [5] | 10,000 |
Number of builds | 10,000 (Default pod RAM 512 Mi) - Source-to-Image (S2I) build strategy |
Number of pods per namespace [6] | 25,000 |
Number of routes and back ends per Ingress Controller | 2,000 per router |
Number of secrets | 80,000 |
Number of config maps | 90,000 |
Number of services [7] | 10,000 |
Number of services per namespace | 5,000 |
Number of back-ends per service | 5,000 |
Number of deployments per namespace [6] | 2,000 |
Number of build configs | 12,000 |
Number of custom resource definitions (CRD) | 512 [8] |
- Pause pods were deployed to stress the control plane components of OpenShift Container Platform at 2000 node scale. The ability to scale to similar numbers will vary depending upon specific deployment and workload parameters.
- The pod count displayed here is the number of test pods. The actual number of pods depends on the application’s memory, CPU, and storage requirements.
- This was tested on a cluster with 31 servers: 3 control planes, 2 infrastructure nodes, and 26 worker nodes. If you need 2,500 user pods, you need both a hostPrefix of 20, which allocates a network large enough for each node to contain more than 2000 pods, and a custom kubelet config with maxPods set to 2500. For more information, see Running 2500 pods per node on OCP 4.13.
- The maximum tested pods per node is 2,500 for clusters using the OVNKubernetes network plugin. The maximum tested pods per node for the OpenShiftSDN network plugin is 500 pods.
- When there are a large number of active projects, etcd might suffer from poor performance if the keyspace grows excessively large and exceeds the space quota. Periodic maintenance of etcd, including defragmentation, is highly recommended to free etcd storage.
- There are several control loops in the system that must iterate over all objects in a given namespace as a reaction to some changes in state. Having a large number of objects of a given type in a single namespace can make those loops expensive and slow down processing given state changes. The limit assumes that the system has enough CPU, memory, and disk to satisfy the application requirements.
- Each service port and each service back-end has a corresponding entry in iptables. The number of back-ends of a given service impacts the size of the Endpoints objects, which impacts the size of data that is being sent all over the system.
- OpenShift Container Platform has a limit of 512 total custom resource definitions (CRD), including those installed by OpenShift Container Platform, products integrating with OpenShift Container Platform and user-created CRDs. If there are more than 512 CRDs created, then there is a possibility that oc command requests might be throttled.
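The note about hostPrefix and maxPods comes down to address arithmetic: a per-node subnet of prefix length N contains 2^(32-N) IPv4 addresses. The following Python sketch shows that check. The function names are hypothetical, the 10.128.0.0 base address is only a placeholder, and the /23 comparison value is an assumption added for illustration. The sketch also ignores any addresses that the network plugin reserves on each node subnet.

```python
import ipaddress


def pod_addresses_per_node(host_prefix: int) -> int:
    """Number of IPv4 addresses in a per-node subnet of the given prefix length."""
    # The base address is a placeholder; only the prefix length matters here.
    network = ipaddress.ip_network(f"10.128.0.0/{host_prefix}", strict=False)
    return network.num_addresses


def host_prefix_fits(host_prefix: int, max_pods: int) -> bool:
    """True if the per-node subnet has at least max_pods addresses."""
    return pod_addresses_per_node(host_prefix) >= max_pods


if __name__ == "__main__":
    for prefix in (23, 20):
        print(prefix, pod_addresses_per_node(prefix), host_prefix_fits(prefix, 2500))
```

A /20 subnet yields 4096 addresses, which is consistent with the statement that a hostPrefix of 20 leaves room for more than 2000 pods per node.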
3.1.1. Example scenario
As an example, 500 worker nodes (m5.2xl) were tested, and are supported, using OpenShift Container Platform 4.13, the OVN-Kubernetes network plugin, and the following workload objects:
- 200 namespaces, in addition to the defaults
- 60 pods per node; 30 server and 30 client pods (30k total)
- 57 image streams/ns (11.4k total)
- 15 services/ns backed by the server pods (3k total)
- 15 routes/ns backed by the previous services (3k total)
- 20 secrets/ns (4k total)
- 10 config maps/ns (2k total)
- 6 network policies/ns, including deny-all, allow-from ingress and intra-namespace rules
- 57 builds/ns
The following factors are known to affect cluster workload scaling, positively or negatively, and should be factored into the scale numbers when planning a deployment. For additional information and guidance, contact your sales representative or Red Hat support.
- Number of pods per node
- Number of containers per pod
- Type of probes used (for example, liveness/readiness, exec/http)
- Number of network policies
- Number of projects, or namespaces
- Number of image streams per project
- Number of builds per project
- Number of services/endpoints and type
- Number of routes
- Number of shards
- Number of secrets
- Number of config maps
- Rate of API calls, or the cluster “churn”, which is an estimation of how quickly things change in the cluster configuration.
  - Prometheus query for pod creation requests per second over 5 minute windows:
    sum(irate(apiserver_request_count{resource="pods",verb="POST"}[5m]))
  - Prometheus query for all API requests per second over 5 minute windows:
    sum(irate(apiserver_request_count{}[5m]))
- Cluster node resource consumption of CPU
- Cluster node resource consumption of memory
3.2. OpenShift Container Platform environment and configuration on which the cluster maximums are tested
3.2.1. AWS cloud platform
Node | Flavor | vCPU | RAM (GiB) | Disk type | Disk size (GiB)/IOPS | Count | Region |
---|---|---|---|---|---|---|---|
Control plane/etcd [1] | r5.4xlarge | 16 | 128 | gp3 | 220 | 3 | us-west-2 |
Infra [2] | m5.12xlarge | 48 | 192 | gp3 | 100 | 3 | us-west-2 |
Workload [3] | m5.4xlarge | 16 | 64 | gp3 | 500 [4] | 1 | us-west-2 |
Compute | m5.2xlarge | 8 | 32 | gp3 | 100 | 3/25/250/500 [5] | us-west-2 |
- gp3 disks with a baseline performance of 3000 IOPS and 125 MiB per second are used for control plane/etcd nodes because etcd is latency sensitive. gp3 volumes do not use burst performance.
- Infra nodes are used to host Monitoring, Ingress, and Registry components to ensure they have enough resources to run at large scale.
- Workload node is dedicated to run performance and scalability workload generators.
- Larger disk size is used so that there is enough space to store the large amounts of data that is collected during the performance and scalability test run.
- Cluster is scaled in iterations and performance and scalability tests are executed at the specified node counts.
3.2.2. IBM Power platform
Node | vCPU | RAM (GiB) | Disk type | Disk size (GiB)/IOPS | Count |
---|---|---|---|---|---|
Control plane/etcd [1] | 16 | 32 | io1 | 120 / 10 IOPS per GiB | 3 |
Infra [2] | 16 | 64 | gp2 | 120 | 2 |
Workload [3] | 16 | 256 | gp2 | 120 [4] | 1 |
Compute | 16 | 64 | gp2 | 120 | 2 to 100 [5] |
- io1 disks with 120 / 10 IOPS per GiB are used for control plane/etcd nodes as etcd is I/O intensive and latency sensitive.
- Infra nodes are used to host Monitoring, Ingress, and Registry components to ensure they have enough resources to run at large scale.
- Workload node is dedicated to run performance and scalability workload generators.
- Larger disk size is used so that there is enough space to store the large amounts of data that is collected during the performance and scalability test run.
- Cluster is scaled in iterations.
3.2.3. IBM Z platform
Node | vCPU [4] | RAM (GiB) [5] | Disk type | Disk size (GiB)/IOPS | Count |
---|---|---|---|---|---|
Control plane/etcd [1,2] | 8 | 32 | ds8k | 300 / LCU 1 | 3 |
Compute [1,3] | 8 | 32 | ds8k | 150 / LCU 2 | 4 nodes (scaled to 100/250/500 pods per node) |
- Nodes are distributed between two logical control units (LCUs) to optimize disk I/O load of the control plane/etcd nodes as etcd is I/O intensive and latency sensitive. Etcd I/O demand should not interfere with other workloads.
- Four compute nodes are used for the tests running several iterations with 100/250/500 pods at the same time. First, idling pods were used to evaluate whether pods can be instantiated. Next, a network and CPU demanding client/server workload was used to evaluate the stability of the system under stress. Client and server pods were deployed in pairs, and each pair was spread over two compute nodes.
- No separate workload node was used. The workload simulates a microservice workload between two compute nodes.
- Physical number of processors used is six Integrated Facilities for Linux (IFLs).
- Total physical memory used is 512 GiB.
3.3. How to plan your environment according to tested cluster maximums
Oversubscribing the physical resources on a node affects resource guarantees the Kubernetes scheduler makes during pod placement. Learn what measures you can take to avoid memory swapping.
Some of the tested maximums are stretched only in a single dimension. They will vary when many objects are running on the cluster.
The numbers noted in this documentation are based on Red Hat’s test methodology, setup, configuration, and tunings. These numbers can vary based on your own individual setup and environments.
While planning your environment, determine how many pods are expected to fit per node:
required pods per cluster / pods per node = total number of nodes needed
The default maximum number of pods per node is 250. However, the number of pods that fit on a node is dependent on the application itself. Consider the application’s memory, CPU, and storage requirements, as described in "How to plan your environment according to application requirements".
Example scenario
If you want to scope your cluster for 2200 pods per cluster, you would need at least five nodes, assuming that there are 500 maximum pods per node:
2200 / 500 = 4.4
If you increase the number of nodes to 20, then the pod distribution changes to 110 pods per node:
2200 / 20 = 110
Where:
required pods per cluster / total number of nodes = expected pods per node
OpenShift Container Platform comes with several system pods, such as SDN, DNS, Operators, and others, which run across every worker node by default. Therefore, the result of the above formula can vary.
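The two formulas in this section can be checked with a few lines of Python. This is a minimal sketch that uses the numbers from the example scenario. The function names are hypothetical, and the calculation ignores the system pods mentioned above.

```python
import math


def nodes_needed(required_pods: int, pods_per_node: int) -> int:
    """required pods per cluster / pods per node, rounded up to whole nodes."""
    return math.ceil(required_pods / pods_per_node)


def expected_pods_per_node(required_pods: int, total_nodes: int) -> float:
    """required pods per cluster / total number of nodes."""
    return required_pods / total_nodes


if __name__ == "__main__":
    print(nodes_needed(2200, 500))           # 2200 / 500 = 4.4, so 5 nodes
    print(expected_pods_per_node(2200, 20))  # 110.0 pods per node
```

Rounding up matters because a fractional result such as 4.4 still requires a fifth node.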
3.4. How to plan your environment according to application requirements
Consider an example application environment:
Pod type | Pod quantity | Max memory | CPU cores | Persistent storage |
---|---|---|---|---|
apache | 100 | 500 MB | 0.5 | 1 GB |
node.js | 200 | 1 GB | 1 | 1 GB |
postgresql | 100 | 1 GB | 2 | 10 GB |
JBoss EAP | 100 | 1 GB | 1 | 1 GB |
Extrapolated requirements: 550 CPU cores, 450GB RAM, and 1.4TB storage.
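The extrapolated totals follow from multiplying each pod quantity by its per-pod requirement and summing the results. The sketch below reproduces that arithmetic from the table. It assumes decimal units (500 MB treated as 0.5 GB, 1000 GB as 1 TB), and the variable and function names are illustrative only.

```python
# (pod type, quantity, max memory in GB, CPU cores, persistent storage in GB)
WORKLOADS = [
    ("apache", 100, 0.5, 0.5, 1),
    ("node.js", 200, 1.0, 1.0, 1),
    ("postgresql", 100, 1.0, 2.0, 10),
    ("JBoss EAP", 100, 1.0, 1.0, 1),
]


def extrapolate(workloads):
    """Sum per-pod requirements across all pods of every type."""
    cpu = sum(qty * cores for _, qty, _, cores, _ in workloads)
    memory_gb = sum(qty * mem for _, qty, mem, _, _ in workloads)
    storage_gb = sum(qty * disk for _, qty, _, _, disk in workloads)
    return cpu, memory_gb, storage_gb


if __name__ == "__main__":
    cpu, memory_gb, storage_gb = extrapolate(WORKLOADS)
    print(f"{cpu:g} CPU cores, {memory_gb:g} GB RAM, {storage_gb / 1000:g} TB storage")
```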
Instance size for nodes can be modulated up or down, depending on your preference. Nodes are often resource overcommitted. In this deployment scenario, you can choose to run additional smaller nodes or fewer larger nodes to provide the same amount of resources. Factors such as operational agility and cost-per-instance should be considered.
Node type | Quantity | CPUs | RAM (GB) |
---|---|---|---|
Nodes (option 1) | 100 | 4 | 16 |
Nodes (option 2) | 50 | 8 | 32 |
Nodes (option 3) | 25 | 16 | 64 |
Some applications lend themselves well to overcommitted environments, and some do not. Most Java applications and applications that use huge pages are examples of applications that would not allow for overcommitment. That memory cannot be used for other applications. In the example above, the environment would be roughly 30 percent overcommitted, a common ratio.
The application pods can access a service either by using environment variables or DNS. If using environment variables, for each active service the variables are injected by the kubelet when a pod is run on a node. A cluster-aware DNS server watches the Kubernetes API for new services and creates a set of DNS records for each one. If DNS is enabled throughout your cluster, then all pods should automatically be able to resolve services by their DNS name. Use DNS for service discovery if you must go beyond 5000 services. When environment variables are used for service discovery, the argument list exceeds the allowed length after 5000 services in a namespace, and the pods and deployments start failing. To overcome this, disable the service links in the deployment’s service specification file:
---
apiVersion: template.openshift.io/v1
kind: Template
metadata:
  name: deployment-config-template
  creationTimestamp:
  annotations:
    description: This template will create a deploymentConfig with 1 replica, 4 env vars and a service.
    tags: ''
objects:
- apiVersion: apps.openshift.io/v1
  kind: DeploymentConfig
  metadata:
    name: deploymentconfig${IDENTIFIER}
  spec:
    template:
      metadata:
        labels:
          name: replicationcontroller${IDENTIFIER}
      spec:
        enableServiceLinks: false
        containers:
        - name: pause${IDENTIFIER}
          image: "${IMAGE}"
          ports:
          - containerPort: 8080
            protocol: TCP
          env:
          - name: ENVVAR1_${IDENTIFIER}
            value: "${ENV_VALUE}"
          - name: ENVVAR2_${IDENTIFIER}
            value: "${ENV_VALUE}"
          - name: ENVVAR3_${IDENTIFIER}
            value: "${ENV_VALUE}"
          - name: ENVVAR4_${IDENTIFIER}
            value: "${ENV_VALUE}"
          resources: {}
          imagePullPolicy: IfNotPresent
          capabilities: {}
          securityContext:
            capabilities: {}
            privileged: false
        restartPolicy: Always
        serviceAccount: ''
    replicas: 1
    selector:
      name: replicationcontroller${IDENTIFIER}
    triggers:
    - type: ConfigChange
    strategy:
      type: Rolling
- apiVersion: v1
  kind: Service
  metadata:
    name: service${IDENTIFIER}
  spec:
    selector:
      name: replicationcontroller${IDENTIFIER}
    ports:
    - name: serviceport${IDENTIFIER}
      protocol: TCP
      port: 80
      targetPort: 8080
    clusterIP: ''
    type: ClusterIP
    sessionAffinity: None
  status:
    loadBalancer: {}
parameters:
- name: IDENTIFIER
  description: Number to append to the name of resources
  value: '1'
  required: true
- name: IMAGE
  description: Image to use for deploymentConfig
  value: gcr.io/google-containers/pause-amd64:3.0
  required: false
- name: ENV_VALUE
  description: Value to use for environment variables
  generate: expression
  from: "[A-Za-z0-9]{255}"
  required: false
labels:
  template: deployment-config-template
The number of application pods that can run in a namespace is dependent on the number of services and the length of the service name when the environment variables are used for service discovery. ARG_MAX on the system defines the maximum argument length for a new process and it is set to 2097152 bytes (2 MiB) by default. The kubelet injects environment variables into each pod scheduled to run in the namespace, including:
- <SERVICE_NAME>_SERVICE_HOST=<IP>
- <SERVICE_NAME>_SERVICE_PORT=<PORT>
- <SERVICE_NAME>_PORT=tcp://<IP>:<PORT>
- <SERVICE_NAME>_PORT_<PORT>_TCP=tcp://<IP>:<PORT>
- <SERVICE_NAME>_PORT_<PORT>_TCP_PROTO=tcp
- <SERVICE_NAME>_PORT_<PORT>_TCP_PORT=<PORT>
- <SERVICE_NAME>_PORT_<PORT>_TCP_ADDR=<ADDR>
The pods in the namespace will start to fail if the argument length exceeds the allowed value and the number of characters in a service name impacts it. For example, in a namespace with 5000 services, the limit on the service name is 33 characters, which enables you to run 5000 pods in the namespace.
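To see how service count and name length interact with ARG_MAX, the seven variables listed above can be generated and measured. The following Python sketch is a rough estimate under stated assumptions: it counts only the bytes of each NAME=value string plus a terminating NUL, it assumes one TCP port per service, and the default IP address and port are placeholders. The operating system's real accounting can include additional overhead, so treat the result as an approximation rather than an exact limit. The function names are hypothetical.

```python
ARG_MAX = 2097152  # default maximum argument length in bytes, as stated above


def service_env_vars(service_name: str, ip: str, port: int) -> list[str]:
    """Build the seven variables injected for one single-port TCP service."""
    prefix = service_name.upper().replace("-", "_")
    url = f"tcp://{ip}:{port}"
    return [
        f"{prefix}_SERVICE_HOST={ip}",
        f"{prefix}_SERVICE_PORT={port}",
        f"{prefix}_PORT={url}",
        f"{prefix}_PORT_{port}_TCP={url}",
        f"{prefix}_PORT_{port}_TCP_PROTO=tcp",
        f"{prefix}_PORT_{port}_TCP_PORT={port}",
        f"{prefix}_PORT_{port}_TCP_ADDR={ip}",
    ]


def service_env_bytes(service_name: str, ip: str, port: int) -> int:
    """Bytes used by one service's variables, counting a NUL after each string."""
    return sum(len(v.encode()) + 1 for v in service_env_vars(service_name, ip, port))


def estimated_total_bytes(service_count: int, name_length: int,
                          ip: str = "172.30.0.1", port: int = 80) -> int:
    """Estimate for service_count services whose names all have name_length characters."""
    return service_count * service_env_bytes("s" * name_length, ip, port)


if __name__ == "__main__":
    total = estimated_total_bytes(5000, 33)
    print(total, "bytes estimated;", "over" if total > ARG_MAX else "within", "ARG_MAX")
```

Because every variable name repeats the service name, each extra character in the name costs seven bytes per service, which is why the name length limit tightens as the number of services grows.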
Chapter 4. Using quotas and limit ranges
A resource quota, defined by a ResourceQuota object, provides constraints that limit aggregate resource consumption per project. It can limit the quantity of objects that can be created in a project by type, as well as the total amount of compute resources and storage that may be consumed by resources in that project.
Using quotas and limit ranges, cluster administrators can set constraints to limit the number of objects or amount of compute resources that are used in your project. This helps cluster administrators better manage and allocate resources across all projects, and ensure that no projects are using more than is appropriate for the cluster size.
Quotas are set by cluster administrators and are scoped to a given project. OpenShift Container Platform project owners can change quotas for their project, but not limit ranges. OpenShift Container Platform users cannot modify quotas or limit ranges.
The following sections help you understand how to check on your quota and limit range settings, what sorts of things they can constrain, and how you can request or limit compute resources in your own pods and containers.
4.1. Resources managed by quota
A resource quota, defined by a ResourceQuota object, provides constraints that limit aggregate resource consumption per project. It can limit the quantity of objects that can be created in a project by type, as well as the total amount of compute resources and storage that may be consumed by resources in that project.
The following describes the set of compute resources and object types that may be managed by a quota.
A pod is in a terminal state if status.phase is Failed or Succeeded.
Resource Name | Description |
---|---|
cpu | The sum of CPU requests across all pods in a non-terminal state cannot exceed this value. |
memory | The sum of memory requests across all pods in a non-terminal state cannot exceed this value. |
ephemeral-storage | The sum of local ephemeral storage requests across all pods in a non-terminal state cannot exceed this value. |
requests.cpu | The sum of CPU requests across all pods in a non-terminal state cannot exceed this value. |
requests.memory | The sum of memory requests across all pods in a non-terminal state cannot exceed this value. |
requests.ephemeral-storage | The sum of ephemeral storage requests across all pods in a non-terminal state cannot exceed this value. |
limits.cpu | The sum of CPU limits across all pods in a non-terminal state cannot exceed this value. |
limits.memory | The sum of memory limits across all pods in a non-terminal state cannot exceed this value. |
limits.ephemeral-storage | The sum of ephemeral storage limits across all pods in a non-terminal state cannot exceed this value. This resource is available only if you enabled the ephemeral storage technology preview. This feature is disabled by default. |
Resource Name | Description |
---|---|
requests.storage | The sum of storage requests across all persistent volume claims in any state cannot exceed this value. |
persistentvolumeclaims | The total number of persistent volume claims that can exist in the project. |
<storage-class-name>.storageclass.storage.k8s.io/requests.storage | The sum of storage requests across all persistent volume claims in any state that have a matching storage class cannot exceed this value. |
<storage-class-name>.storageclass.storage.k8s.io/persistentvolumeclaims | The total number of persistent volume claims with a matching storage class that can exist in the project. |
Resource Name | Description |
---|---|
pods | The total number of pods in a non-terminal state that can exist in the project. |
replicationcontrollers | The total number of replication controllers that can exist in the project. |
resourcequotas | The total number of resource quotas that can exist in the project. |
services | The total number of services that can exist in the project. |
secrets | The total number of secrets that can exist in the project. |
configmaps | The total number of ConfigMap objects that can exist in the project. |
persistentvolumeclaims | The total number of persistent volume claims that can exist in the project. |
openshift.io/imagestreams | The total number of image streams that can exist in the project. |
You can configure an object count quota for these standard namespaced resource types using the count/<resource>.<group> syntax.
$ oc create quota <name> --hard=count/<resource>.<group>=<quota>
4.1.1. Setting resource quota for extended resources
Overcommitment of resources is not allowed for extended resources, so you must specify requests and limits for the same extended resource in a quota. Currently, only quota items with the prefix requests. are allowed for extended resources. The following is an example scenario of how to set resource quota for the GPU resource nvidia.com/gpu.
Procedure
To determine how many GPUs are available on a node in your cluster, use the following command:
$ oc describe node ip-172-31-27-209.us-west-2.compute.internal | egrep 'Capacity|Allocatable|gpu'
Example output
                    openshift.com/gpu-accelerator=true
Capacity:
 nvidia.com/gpu:  2
Allocatable:
 nvidia.com/gpu:  2
 nvidia.com/gpu:  0           0
In this example, 2 GPUs are available.
Define a quota for the nvidia namespace in a file, and display the file with the following command. In this example, the quota is 1:
$ cat gpu-quota.yaml
Example output
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: nvidia
spec:
  hard:
    requests.nvidia.com/gpu: 1
Create the quota with the following command:
$ oc create -f gpu-quota.yaml
Example output
resourcequota/gpu-quota created
Verify that the namespace has the correct quota set using the following command:
$ oc describe quota gpu-quota -n nvidia
Example output
Name:                    gpu-quota
Namespace:               nvidia
Resource                 Used  Hard
--------                 ----  ----
requests.nvidia.com/gpu  0     1
Run a pod that asks for a single GPU with the following command:
$ oc create -f gpu-pod.yaml
Example output
apiVersion: v1
kind: Pod
metadata:
  generateName: gpu-pod-s46h7
  namespace: nvidia
spec:
  restartPolicy: OnFailure
  containers:
  - name: rhel7-gpu-pod
    image: rhel7
    env:
      - name: NVIDIA_VISIBLE_DEVICES
        value: all
      - name: NVIDIA_DRIVER_CAPABILITIES
        value: "compute,utility"
      - name: NVIDIA_REQUIRE_CUDA
        value: "cuda>=5.0"
    command: ["sleep"]
    args: ["infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1
Verify that the pod is running with the following command:
$ oc get pods
Example output
NAME              READY     STATUS      RESTARTS   AGE
gpu-pod-s46h7     1/1       Running     0          1m
Verify that the quota Used counter is correct by running the following command:
$ oc describe quota gpu-quota -n nvidia
Example output
Name:                    gpu-quota
Namespace:               nvidia
Resource                 Used  Hard
--------                 ----  ----
requests.nvidia.com/gpu  1     1
Using the following command, attempt to create a second GPU pod in the nvidia namespace. A second GPU is technically available on the node because it has 2 GPUs:
$ oc create -f gpu-pod.yaml
Example output
Error from server (Forbidden): error when creating "gpu-pod.yaml": pods "gpu-pod-f7z2w" is forbidden: exceeded quota: gpu-quota, requested: requests.nvidia.com/gpu=1, used: requests.nvidia.com/gpu=1, limited: requests.nvidia.com/gpu=1
This Forbidden error message occurs because you have a quota of 1 GPU and this pod tried to allocate a second GPU, which exceeds its quota.
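The outcome of this scenario can be modeled as a simple comparison of used plus requested against the hard limit. The Python sketch below is a simplified illustration of that bookkeeping, not the actual admission logic of the platform. The function name admit and the dictionary layout are assumptions made for the example.

```python
def admit(hard: dict, used: dict, requested: dict):
    """Return (admitted, new_used) for a request checked against a quota.

    A request is rejected if any quota-tracked resource would exceed its hard
    limit. Usage is only incremented when the whole request is admitted.
    """
    for resource, amount in requested.items():
        if resource in hard and used.get(resource, 0) + amount > hard[resource]:
            return False, used
    new_used = dict(used)
    for resource, amount in requested.items():
        if resource in hard:
            new_used[resource] = new_used.get(resource, 0) + amount
    return True, new_used


if __name__ == "__main__":
    hard = {"requests.nvidia.com/gpu": 1}
    used = {"requests.nvidia.com/gpu": 0}
    gpu_pod = {"requests.nvidia.com/gpu": 1}

    ok, used = admit(hard, used, gpu_pod)
    print("first pod admitted:", ok, used)
    ok, used = admit(hard, used, gpu_pod)
    print("second pod admitted:", ok, used)
```

The second request is rejected even though the node still has a free GPU, because the quota is evaluated per project and not per node.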
4.1.2. Quota scopes
Each quota can have an associated set of scopes. A quota only measures usage for a resource if it matches the intersection of enumerated scopes.
Adding a scope to a quota restricts the set of resources to which that quota can apply. Specifying a resource outside of the allowed set results in a validation error.
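Scope | Description |
---|---|
Terminating | Match pods where spec.activeDeadlineSeconds >= 0. |
NotTerminating | Match pods where spec.activeDeadlineSeconds is nil. |
BestEffort | Match pods that have best effort quality of service for either cpu or memory. |
NotBestEffort | Match pods that do not have best effort quality of service for cpu and memory. |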
A BestEffort scope restricts a quota to limiting the following resources:
- pods
A Terminating, NotTerminating, and NotBestEffort scope restricts a quota to tracking the following resources:
- pods
- memory
- requests.memory
- limits.memory
- cpu
- requests.cpu
- limits.cpu
- ephemeral-storage
- requests.ephemeral-storage
- limits.ephemeral-storage
Ephemeral storage requests and limits apply only if you enabled the ephemeral storage technology preview. This feature is disabled by default.
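As an illustration of how the four scopes partition pods, the following Python sketch classifies a pod specification that is given as a plain dictionary. It is a simplified model under stated assumptions: a pod counts as best effort only when no container sets a CPU or memory request or limit, and a pod counts as terminating when spec.activeDeadlineSeconds is set to a value of 0 or more. The function name matching_scopes is hypothetical.

```python
def matching_scopes(pod_spec: dict) -> set:
    """Return the quota scopes that a pod spec (as a plain dict) would match."""
    scopes = set()

    # Terminating versus NotTerminating depends on activeDeadlineSeconds.
    deadline = pod_spec.get("activeDeadlineSeconds")
    if deadline is not None and deadline >= 0:
        scopes.add("Terminating")
    else:
        scopes.add("NotTerminating")

    # Best effort here means: no container sets any cpu or memory request or limit.
    best_effort = all(
        not (container.get("resources", {}).get(kind, {}).keys() & {"cpu", "memory"})
        for container in pod_spec.get("containers", [])
        for kind in ("requests", "limits")
    )
    scopes.add("BestEffort" if best_effort else "NotBestEffort")
    return scopes


if __name__ == "__main__":
    web = {"containers": [{"name": "web", "resources": {"limits": {"cpu": "500m"}}}]}
    job = {"activeDeadlineSeconds": 300, "containers": [{"name": "job"}]}
    print(sorted(matching_scopes(web)))
    print(sorted(matching_scopes(job)))
```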
Additional resources
See Resources managed by quotas for more on compute resources.
See Quality of Service Classes for more on committing compute resources.
4.2. Admin quota usage
4.2.1. Quota enforcement
After a resource quota for a project is first created, the project restricts the ability to create any new resources that can violate a quota constraint until it has calculated updated usage statistics.
After a quota is created and usage statistics are updated, the project accepts the creation of new content. When you create or modify resources, your quota usage is incremented immediately upon the request to create or modify the resource.
When you delete a resource, your quota use is decremented during the next full recalculation of quota statistics for the project.
A configurable amount of time determines how long it takes to reduce quota usage statistics to their current observed system value.
If project modifications exceed a quota usage limit, the server denies the action, and an appropriate error message is returned to the user explaining the quota constraint violated, and what their currently observed usage stats are in the system.
4.2.2. Requests compared to limits
When allocating compute resources by quota, each container can specify a request and a limit value each for CPU, memory, and ephemeral storage. Quotas can restrict any of these values.
If the quota has a value specified for requests.cpu or requests.memory, then it requires that every incoming container make an explicit request for those resources. If the quota has a value specified for limits.cpu or limits.memory, then it requires that every incoming container specify an explicit limit for those resources.
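The rule above can be sketched as a check that, for every requests.* or limits.* entry for CPU or memory in the quota, each container sets the matching field. This Python example is an illustrative model only. The function name missing_fields is hypothetical, the pod is represented as a plain dictionary, and the sketch does not account for default values that a limit range might supply.

```python
def missing_fields(quota_hard: dict, pod_spec: dict) -> list:
    """List the container fields that the quota requires but the pod omits."""
    missing = []
    for key in quota_hard:
        kind, _, resource = key.partition(".")
        if kind not in ("requests", "limits") or resource not in ("cpu", "memory"):
            continue  # only cpu and memory requests or limits are checked here
        for container in pod_spec.get("containers", []):
            if resource not in container.get("resources", {}).get(kind, {}):
                missing.append(f"{container['name']}: {kind}.{resource}")
    return missing


if __name__ == "__main__":
    quota = {"pods": "4", "requests.cpu": "1", "limits.memory": "2Gi"}
    pod = {"containers": [{"name": "web", "resources": {"requests": {"cpu": "100m"}}}]}
    print(missing_fields(quota, pod))
```

In this example the container sets a CPU request but no memory limit, so only the missing memory limit is reported.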
4.2.3. Sample resource quota definitions
Example core-object-counts.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: core-object-counts
spec:
  hard:
    configmaps: "10" # 1
    persistentvolumeclaims: "4" # 2
    replicationcontrollers: "20" # 3
    secrets: "10" # 4
    services: "10" # 5
- 1: The total number of ConfigMap objects that can exist in the project.
- 2: The total number of persistent volume claims (PVCs) that can exist in the project.
- 3: The total number of replication controllers that can exist in the project.
- 4: The total number of secrets that can exist in the project.
- 5: The total number of services that can exist in the project.
Example openshift-object-counts.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: openshift-object-counts
spec:
  hard:
    openshift.io/imagestreams: "10" # 1
- 1: The total number of image streams that can exist in the project.
Example compute-resources.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    pods: "4" # 1
    requests.cpu: "1" # 2
    requests.memory: 1Gi # 3
    requests.ephemeral-storage: 2Gi # 4
    limits.cpu: "2" # 5
    limits.memory: 2Gi # 6
    limits.ephemeral-storage: 4Gi # 7
- 1: The total number of pods in a non-terminal state that can exist in the project.
- 2: Across all pods in a non-terminal state, the sum of CPU requests cannot exceed 1 core.
- 3: Across all pods in a non-terminal state, the sum of memory requests cannot exceed 1Gi.
- 4: Across all pods in a non-terminal state, the sum of ephemeral storage requests cannot exceed 2Gi.
- 5: Across all pods in a non-terminal state, the sum of CPU limits cannot exceed 2 cores.
- 6: Across all pods in a non-terminal state, the sum of memory limits cannot exceed 2Gi.
- 7: Across all pods in a non-terminal state, the sum of ephemeral storage limits cannot exceed 4Gi.
Example besteffort.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: besteffort
spec:
  hard:
    pods: "1"
  scopes:
  - BestEffort
Example compute-resources-long-running.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources-long-running
spec:
  hard:
    pods: "4" # 1
    limits.cpu: "4" # 2
    limits.memory: "2Gi" # 3
    limits.ephemeral-storage: "4Gi" # 4
  scopes:
  - NotTerminating # 5
- 1: The total number of pods in a non-terminal state.
- 2: Across all pods in a non-terminal state, the sum of CPU limits cannot exceed this value.
- 3: Across all pods in a non-terminal state, the sum of memory limits cannot exceed this value.
- 4: Across all pods in a non-terminal state, the sum of ephemeral storage limits cannot exceed this value.
- 5: Restricts the quota to only matching pods where spec.activeDeadlineSeconds is set to nil. Build pods will fall under NotTerminating unless the RestartNever policy is applied.
Example compute-resources-time-bound.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources-time-bound
spec:
  hard:
    pods: "2" # 1
    limits.cpu: "1" # 2
    limits.memory: "1Gi" # 3
    limits.ephemeral-storage: "1Gi" # 4
  scopes:
  - Terminating # 5
- 1: The total number of pods in a non-terminal state.
- 2: Across all pods in a non-terminal state, the sum of CPU limits cannot exceed this value.
- 3: Across all pods in a non-terminal state, the sum of memory limits cannot exceed this value.
- 4: Across all pods in a non-terminal state, the sum of ephemeral storage limits cannot exceed this value.
- 5: Restricts the quota to only matching pods where spec.activeDeadlineSeconds >= 0. For example, this quota would charge for build pods, but not long running pods such as a web server or database.
Example storage-consumption.yaml
apiVersion: v1 kind: ResourceQuota metadata: name: storage-consumption spec: hard: persistentvolumeclaims: "10" 1 requests.storage: "50Gi" 2 gold.storageclass.storage.k8s.io/requests.storage: "10Gi" 3 silver.storageclass.storage.k8s.io/requests.storage: "20Gi" 4 silver.storageclass.storage.k8s.io/persistentvolumeclaims: "5" 5 bronze.storageclass.storage.k8s.io/requests.storage: "0" 6 bronze.storageclass.storage.k8s.io/persistentvolumeclaims: "0" 7
- 1
- The total number of persistent volume claims in a project
- 2
- Across all persistent volume claims in a project, the sum of storage requested cannot exceed this value.
- 3
- Across all persistent volume claims in a project, the sum of storage requested in the gold storage class cannot exceed this value.
- 4
- Across all persistent volume claims in a project, the sum of storage requested in the silver storage class cannot exceed this value.
- 5
- Across all persistent volume claims in a project, the total number of claims in the silver storage class cannot exceed this value.
- 6
- Across all persistent volume claims in a project, the sum of storage requested in the bronze storage class cannot exceed this value. When this is set to 0, the bronze storage class cannot request storage.
- 7
- Across all persistent volume claims in a project, the total number of claims in the bronze storage class cannot exceed this value. When this is set to 0, the bronze storage class cannot create claims.
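As a rough illustration of how the storage request limits in this example combine, the following sketch checks a new claim against both the project-wide and the per-class requests.storage values. It is a simplification under stated assumptions: only binary suffixes (Ki, Mi, Gi, Ti) are parsed, the claim-count keys are ignored, and the names to_bytes and claim_fits are hypothetical.

```python
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}

# Storage-related hard limits copied from storage-consumption.yaml.
HARD = {
    "requests.storage": "50Gi",
    "gold.storageclass.storage.k8s.io/requests.storage": "10Gi",
    "silver.storageclass.storage.k8s.io/requests.storage": "20Gi",
    "bronze.storageclass.storage.k8s.io/requests.storage": "0",
}


def to_bytes(quantity: str) -> int:
    """Convert a quantity such as "10Gi" or "0" to bytes (binary suffixes only)."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def claim_fits(storage_class: str, size: str, used: dict) -> bool:
    """Check a new claim against the project-wide and per-class storage limits.

    `used` maps a quota key to the number of bytes already requested.
    """
    keys = [
        "requests.storage",
        f"{storage_class}.storageclass.storage.k8s.io/requests.storage",
    ]
    for key in keys:
        if key in HARD and used.get(key, 0) + to_bytes(size) > to_bytes(HARD[key]):
            return False
    return True


print(claim_fits("gold", "5Gi", {}))
print(claim_fits("bronze", "1Gi", {}))
```

Under these assumptions, any non-zero bronze request is rejected because its hard limit is 0, while a gold request is accepted only while the gold total stays within 10Gi.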
4.2.4. Creating a quota
To create a quota, first define the quota in a file. Then use that file to apply it to a project. See the Additional resources section for a link describing this.
$ oc create -f <resource_quota_definition> [-n <project_name>]
Here is an example using the core-object-counts.yaml resource quota definition and the demoproject project name:
$ oc create -f core-object-counts.yaml -n demoproject
4.2.5. Creating object count quotas
You can create an object count quota for all OpenShift Container Platform standard namespaced resource types, such as BuildConfig and DeploymentConfig. An object count quota places a defined quota on all standard namespaced resource types.
When using a resource quota, an object is charged against the quota if it exists in server storage. These types of quotas are useful to protect against exhaustion of storage resources.
To configure an object count quota for a resource, run the following command:
$ oc create quota <name> --hard=count/<resource>.<group>=<quota>,count/<resource>.<group>=<quota>
Example showing object count quota:
$ oc create quota test --hard=count/deployments.extensions=2,count/replicasets.extensions=4,count/pods=3,count/secrets=4
resourcequota "test" created

$ oc describe quota test
Name:                         test
Namespace:                    quota
Resource                      Used  Hard
--------                      ----  ----
count/deployments.extensions  0     2
count/pods                    0     3
count/replicasets.extensions  0     4
count/secrets                 0     4
This example limits the listed resources to the hard limit in each project in the cluster.
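When several counts are set at once, the --hard argument can be assembled from a mapping instead of being typed by hand. The following is a minimal sketch; the helper name hard_flag is hypothetical, and the resource names are the ones used in the example above.

```python
def hard_flag(counts: dict) -> str:
    """Build the --hard argument for `oc create quota` from resource -> count."""
    return "--hard=" + ",".join(f"count/{res}={n}" for res, n in counts.items())


counts = {
    "deployments.extensions": 2,
    "replicasets.extensions": 4,
    "pods": 3,
    "secrets": 4,
}
print("oc create quota test " + hard_flag(counts))
```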
4.2.6. Viewing a quota
You can view usage statistics related to any hard limits defined in a project’s quota by navigating in the web console to the project’s Quota page.
You can also use the CLI to view quota details:
First, get the list of quotas defined in the project. For example, for a project called demoproject:
$ oc get quota -n demoproject
NAME                 AGE
besteffort           11m
compute-resources    2m
core-object-counts   29m
Describe the quota you are interested in, for example the core-object-counts quota:
$ oc describe quota core-object-counts -n demoproject
Name:                   core-object-counts
Namespace:              demoproject
Resource                Used  Hard
--------                ----  ----
configmaps              3     10
persistentvolumeclaims  0     4
replicationcontrollers  3     20
secrets                 9     10
services                2     10
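The Used and Hard columns lend themselves to simple automation. The sketch below parses text shaped like the example output and reports resources that are close to their hard limit. It assumes plain integer counts (quantities such as 1Gi are not handled), and the function names and the 80% threshold are illustrative choices, not product defaults.

```python
DESCRIBE_OUTPUT = """\
Name:                   core-object-counts
Namespace:              demoproject
Resource                Used  Hard
--------                ----  ----
configmaps              3     10
persistentvolumeclaims  0     4
replicationcontrollers  3     20
secrets                 9     10
services                2     10
"""


def parse_usage(text: str) -> dict:
    """Map each resource to a (used, hard) tuple from `oc describe quota` text."""
    usage = {}
    in_table = False
    for line in text.splitlines():
        if line.startswith("--------"):
            in_table = True  # rows follow the dashed separator line
            continue
        if in_table and line.strip():
            resource, used, hard = line.split()
            usage[resource] = (int(used), int(hard))
    return usage


def near_limit(usage: dict, threshold: float = 0.8) -> list:
    """Return resources whose used/hard ratio is at or above the threshold."""
    return [r for r, (used, hard) in usage.items() if hard and used / hard >= threshold]


print(near_limit(parse_usage(DESCRIBE_OUTPUT)))
```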
4.2.7. Configuring quota synchronization period
When a set of resources is deleted, the synchronization time frame of resources is determined by the resource-quota-sync-period setting in the /etc/origin/master/master-config.yaml file.
Before quota usage is restored, a user can encounter problems when attempting to reuse the resources. You can change the resource-quota-sync-period setting so that the set of resources regenerates in the needed amount of time (in seconds) and the resources are once again available:
Example resource-quota-sync-period setting
kubernetesMasterConfig:
  apiLevels:
  - v1beta3
  - v1
  apiServerArguments: null
  controllerArguments:
    resource-quota-sync-period:
      - "10s"
After making any changes, restart the controller services to apply them.
$ master-restart api
$ master-restart controllers
Adjusting the regeneration time can be helpful for creating resources and determining resource usage when automation is used.
The resource-quota-sync-period setting balances system performance. Reducing the sync period can result in a heavy load on the controller.
4.2.8. Explicit quota to consume a resource
If a resource is not managed by quota, a user has no restriction on the amount of resource that can be consumed. For example, if there is no quota on storage related to the gold storage class, the amount of gold storage a project can create is unbounded.
For high-cost compute or storage resources, administrators can require an explicit quota be granted to consume a resource. For example, if a project was not explicitly given quota for storage related to the gold storage class, users of that project would not be able to create any storage of that type.
To require explicit quota to consume a particular resource, add the following stanza to the master-config.yaml file:
admissionConfig:
  pluginConfig:
    ResourceQuota:
      configuration:
        apiVersion: resourcequota.admission.k8s.io/v1alpha1
        kind: Configuration
        limitedResources:
        - resource: persistentvolumeclaims
          matchContains:
          - gold.storageclass.storage.k8s.io/requests.storage
In the above example, the quota system intercepts every operation that creates or updates a PersistentVolumeClaim. It checks what resources controlled by quota would be consumed. If there is no covering quota for those resources in the project, the request is denied. In this example, if a user creates a PersistentVolumeClaim that uses storage associated with the gold storage class and there is no matching quota in the project, the request is denied.
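The decision described above can be sketched as follows. This is a simplified model of the behavior, not the admission plugin's implementation: it assumes that the resource names a request would consume are already known, and the function name admit and its parameters are hypothetical.

```python
# Mirrors the limitedResources stanza from the example configuration.
LIMITED_RESOURCES = [
    {
        "resource": "persistentvolumeclaims",
        "matchContains": ["gold.storageclass.storage.k8s.io/requests.storage"],
    }
]


def admit(resource: str, consumed: list, quota_keys: set) -> bool:
    """Deny a request that consumes a limited resource without a covering quota.

    `consumed` lists the quota resource names the request would consume.
    `quota_keys` holds the resource names covered by quotas in the project.
    """
    for rule in LIMITED_RESOURCES:
        if rule["resource"] != resource:
            continue
        for name in consumed:
            limited = any(fragment in name for fragment in rule["matchContains"])
            if limited and name not in quota_keys:
                return False
    return True


gold_claim = ["requests.storage", "gold.storageclass.storage.k8s.io/requests.storage"]
print(admit("persistentvolumeclaims", gold_claim, {"requests.storage"}))
print(admit("persistentvolumeclaims", gold_claim, set(gold_claim)))
```

In this model, the gold claim is denied in a project whose quota covers only requests.storage, and admitted once a quota also covers the gold storage class key.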
Additional resources
For examples of how to create the file needed to set quotas, see Resources managed by quotas.
A description of how to allocate compute resources managed by quota.
For information on managing limits and quota on project resources, see Working with projects.
If a quota has been defined for your project, see Understanding deployments for considerations in cluster configurations.
4.3. Setting limit ranges
A limit range, defined by a LimitRange object, defines compute resource constraints at the pod, container, image, image stream, and persistent volume claim level. The limit range specifies the amount of resources that a pod, container, image, image stream, or persistent volume claim can consume.
All requests to create and modify resources are evaluated against each LimitRange object in the project. If the resource violates any of the enumerated constraints, the resource is rejected. If the resource does not set an explicit value, and if the constraint supports a default value, the default value is applied to the resource.
For CPU and memory limits, if you specify a maximum value but do not specify a minimum limit, the resource can consume more CPU and memory resources than the maximum value.
Core limit range object definition
apiVersion: "v1"
kind: "LimitRange"
metadata:
  name: "core-resource-limits" # 1
spec:
  limits:
    - type: "Pod"
      max:
        cpu: "2" # 2
        memory: "1Gi" # 3
      min:
        cpu: "200m" # 4
        memory: "6Mi" # 5
    - type: "Container"
      max:
        cpu: "2" # 6
        memory: "1Gi" # 7
      min:
        cpu: "100m" # 8
        memory: "4Mi" # 9
      default:
        cpu: "300m" # 10
        memory: "200Mi" # 11
      defaultRequest:
        cpu: "200m" # 12
        memory: "100Mi" # 13
      maxLimitRequestRatio:
        cpu: "10" # 14
- 1
- The name of the limit range object.
- 2
- The maximum amount of CPU that a pod can request on a node across all containers.
- 3
- The maximum amount of memory that a pod can request on a node across all containers.
- 4
- The minimum amount of CPU that a pod can request on a node across all containers. If you do not set a min value or you set min to 0, the result is no limit and the pod can consume more than the max CPU value.
- 5
- The minimum amount of memory that a pod can request on a node across all containers. If you do not set a min value or you set min to 0, the result is no limit and the pod can consume more than the max memory value.
- 6
- The maximum amount of CPU that a single container in a pod can request.
- 7
- The maximum amount of memory that a single container in a pod can request.
- 8
- The minimum amount of CPU that a single container in a pod can request. If you do not set a min value or you set min to 0, the result is no limit and the pod can consume more than the max CPU value.
- 9
- The minimum amount of memory that a single container in a pod can request. If you do not set a min value or you set min to 0, the result is no limit and the pod can consume more than the max memory value.
- 10
- The default CPU limit for a container if you do not specify a limit in the pod specification.
- 11
- The default memory limit for a container if you do not specify a limit in the pod specification.
- 12
- The default CPU request for a container if you do not specify a request in the pod specification.
- 13
- The default memory request for a container if you do not specify a request in the pod specification.
- 14
- The maximum limit-to-request ratio for a container.
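To make the container CPU values in this definition concrete, the following sketch evaluates a request and limit pair against them. It is a simplified model: defaults are filled in independently whenever a value is omitted, which might not match the platform's exact defaulting order, and the names cpu_millis and check_container_cpu are hypothetical.

```python
# Container CPU values copied from the core-resource-limits example.
CONTAINER_CPU = {
    "min": "100m",
    "max": "2",
    "default": "300m",
    "defaultRequest": "200m",
    "maxLimitRequestRatio": 10,
}


def cpu_millis(quantity: str) -> int:
    """Convert a CPU quantity such as "200m" or "2" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def check_container_cpu(request=None, limit=None) -> list:
    """Return the constraint violations for one container's CPU settings."""
    request = request or CONTAINER_CPU["defaultRequest"]  # simplified defaulting
    limit = limit or CONTAINER_CPU["default"]
    req, lim = cpu_millis(request), cpu_millis(limit)
    problems = []
    if req < cpu_millis(CONTAINER_CPU["min"]):
        problems.append("request below min")
    if lim > cpu_millis(CONTAINER_CPU["max"]):
        problems.append("limit above max")
    if req > lim:
        problems.append("request above limit")
    if lim > req * CONTAINER_CPU["maxLimitRequestRatio"]:
        problems.append("limit-to-request ratio above maxLimitRequestRatio")
    return problems


print(check_container_cpu())
print(check_container_cpu(request="100m", limit="2"))
```

With the defaults applied (200m request, 300m limit), no constraint is violated. A 100m request with a 2 CPU limit stays within min and max but has a limit-to-request ratio of 20, which exceeds the configured ratio of 10.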
OpenShift Container Platform Limit range object definition
apiVersion: "v1"
kind: "LimitRange"
metadata:
  name: "openshift-resource-limits"
spec:
  limits:
    - type: openshift.io/Image
      max:
        storage: 1Gi # 1
    - type: openshift.io/ImageStream
      max:
        openshift.io/image-tags: 20 # 2
        openshift.io/images: 30 # 3
    - type: "Pod"
      max:
        cpu: "2" # 4
        memory: "1Gi" # 5
        ephemeral-storage: "1Gi" # 6
      min:
        cpu: "1" # 7
        memory: "1Gi" # 8
- 1
- The maximum size of an image that can be pushed to an internal registry.
- 2
- The maximum number of unique image tags as defined in the specification for the image stream.
- 3
- The maximum number of unique image references as defined in the specification for the image stream status.
- 4
- The maximum amount of CPU that a pod can request on a node across all containers.
- 5
- The maximum amount of memory that a pod can request on a node across all containers.
- 6
- The maximum amount of ephemeral storage that a pod can request on a node across all containers.
- 7
- The minimum amount of CPU that a pod can request on a node across all containers. See the Supported Constraints table for important information.
- 8
- The minimum amount of memory that a pod can request on a node across all containers. If you do not set a min value or you set min to 0, the result is no limit and the pod can consume more than the max memory value.
You can specify both core and OpenShift Container Platform resources in one limit range object.
4.3.1. Container limits
Supported Resources:
- CPU
- Memory
Supported Constraints
Per container, the following must hold true if specified:
Container
Constraint | Behavior |
---|---|
|
If the configuration defines a |
|
If the configuration defines a |
|
If the limit range defines a
For example, if a container has |
Supported Defaults:
Default[<resource>]
- Defaults container.resources.limit[<resource>] to the specified value if none.
Default Requests[<resource>]
- Defaults container.resources.requests[<resource>] to the specified value if none.
4.3.2. Pod limits
Supported Resources:
- CPU
- Memory
Supported Constraints:
Across all containers in a pod, the following must hold true:
Constraint | Enforced Behavior |
---|---|
|
|
|
|
|
|
4.3.3. Image limits
Supported Resources:
- Storage
Resource type name:
-
openshift.io/Image
Per image, the following must hold true if specified:
Constraint | Behavior |
---|---|
|
|
To prevent blobs that exceed the limit from being uploaded to the registry, the registry must be configured to enforce quota. The REGISTRY_MIDDLEWARE_REPOSITORY_OPENSHIFT_ENFORCEQUOTA environment variable must be set to true. By default, the environment variable is set to true for new deployments.
4.3.4. Image stream limits
Supported Resources:
-
openshift.io/image-tags
-
openshift.io/images
Resource type name:
-
openshift.io/ImageStream
Per image stream, the following must hold true if specified:
Constraint | Behavior |
---|---|
|
|
|
|
4.3.5. Counting of image references
The openshift.io/image-tags resource represents unique image references. Possible references are an ImageStreamTag, an ImageStreamImage, or a DockerImage. Tags can be created by using the oc tag and oc import-image commands or by using image streams. No distinction is made between internal and external references. However, each unique reference that is tagged in an image stream specification is counted just once. It does not restrict pushes to an internal container image registry in any way, but is useful for tag restriction.
The openshift.io/images resource represents unique image names that are recorded in image stream status. It helps to restrict the number of images that can be pushed to the internal registry. Internal and external references are not distinguished.
4.3.6. PersistentVolumeClaim limits
Supported Resources:
- Storage
Supported Constraints:
Across all persistent volume claims in a project, the following must hold true:
Constraint | Enforced Behavior |
---|---|
| Min[<resource>] <= claim.spec.resources.requests[<resource>] (required) |
| claim.spec.resources.requests[<resource>] (required) <= Max[<resource>] |
Limit Range Object Definition
{
  "apiVersion": "v1",
  "kind": "LimitRange",
  "metadata": {
    "name": "pvcs"
  },
  "spec": {
    "limits": [
      {
        "type": "PersistentVolumeClaim",
        "min": {
          "storage": "2Gi"
        },
        "max": {
          "storage": "50Gi"
        }
      }
    ]
  }
}
Additional resources
For information on stream limits, see Managing image streams.
For more information on compute resource constraints.
For more information on how CPU and memory are measured, see Recommended control plane practices.
You can specify limits and requests for ephemeral storage. For more information on this feature, see Understanding ephemeral storage.
4.4. Limit range operations
4.4.1. Creating a limit range
Shown here is an example procedure to follow for creating a limit range.
Procedure
Create the object:
$ oc create -f <limit_range_file> -n <project>
4.4.2. View the limit
You can view any limit ranges that are defined in a project by navigating in the web console to the Quota page for the project. You can also use the CLI to view limit range details by performing the following steps:
Procedure
Get the list of limit range objects that are defined in the project. For example, for a project called demoproject:
$ oc get limits -n demoproject
Example Output
NAME              AGE
resource-limits   6d
Describe the limit range. For example, for a limit range called resource-limits:
$ oc describe limits resource-limits -n demoproject
Example Output
Name:        resource-limits
Namespace:   demoproject
Type                      Resource                 Min   Max  Default Request  Default Limit  Max Limit/Request Ratio
----                      --------                 ---   ---  ---------------  -------------  -----------------------
Pod                       cpu                      200m  2    -                -              -
Pod                       memory                   6Mi   1Gi  -                -              -
Container                 cpu                      100m  2    200m             300m           10
Container                 memory                   4Mi   1Gi  100Mi            200Mi          -
openshift.io/Image        storage                  -     1Gi  -                -              -
openshift.io/ImageStream  openshift.io/image       -     12   -                -              -
openshift.io/ImageStream  openshift.io/image-tags  -     10   -                -              -
4.4.3. Deleting a limit range
To remove a limit range, run the following command:
$ oc delete limits <limit_name>
Additional resources
For information about enforcing different limits on the number of projects that your users can create, managing limits, and quota on project resources, see Resource quotas per projects.
Chapter 5. Recommended host practices for IBM Z and IBM® LinuxONE environments
This topic provides recommended host practices for OpenShift Container Platform on IBM Z and IBM® LinuxONE.
The s390x architecture is unique in many aspects. Therefore, some recommendations made here might not apply to other platforms.
Unless stated otherwise, these practices apply to both z/VM and Red Hat Enterprise Linux (RHEL) KVM installations on IBM Z and IBM® LinuxONE.
5.1. Managing CPU overcommitment
In a highly virtualized IBM Z environment, you must carefully plan the infrastructure setup and sizing. One of the most important features of virtualization is the capability to overcommit resources, that is, to allocate more resources to the virtual machines than are actually available at the hypervisor level. How well this works depends heavily on the workload, and there is no golden rule that can be applied to all setups.
Depending on your setup, consider these best practices regarding CPU overcommitment:
- At LPAR level (PR/SM hypervisor), avoid assigning all available physical cores (IFLs) to each LPAR. For example, with four physical IFLs available, you should not define three LPARs with four logical IFLs each.
- Check and understand LPAR shares and weights.
- An excessive number of virtual CPUs can adversely affect performance. Do not define more virtual processors to a guest than logical processors are defined to the LPAR.
- Configure the number of virtual processors per guest for peak workload, not more.
- Start small and monitor the workload. Increase the vCPU number incrementally if necessary.
- Not all workloads are suitable for high overcommitment ratios. If the workload is CPU intensive, you will probably not be able to achieve high ratios without performance problems. Workloads that are more I/O intensive can keep consistent performance even with high overcommitment ratios.
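The LPAR example in the list above can be expressed as a simple ratio of logical to physical cores. The following sketch only illustrates that arithmetic; the function name overcommit_ratio is hypothetical, and what counts as an acceptable ratio depends on the workload.

```python
def overcommit_ratio(logical_per_lpar: list, physical_cores: int) -> float:
    """Ratio of all logical cores defined across LPARs to the physical cores."""
    return sum(logical_per_lpar) / physical_cores


# Three LPARs with four logical IFLs each on four physical IFLs,
# which is the setup the guidance above advises against.
print(overcommit_ratio([4, 4, 4], 4))
```

In that setup, twelve logical IFLs compete for four physical IFLs, a ratio of 3.0.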
5.2. Disable Transparent Huge Pages
Transparent Huge Pages (THP) attempt to automate most aspects of creating, managing, and using huge pages. Since THP automatically manages the huge pages, this is not always handled optimally for all types of workloads. THP can lead to performance regressions, since many applications handle huge pages on their own. Therefore, consider disabling THP.
5.3. Boost networking performance with Receive Flow Steering
Receive Flow Steering (RFS) extends Receive Packet Steering (RPS) by further reducing network latency. RFS is technically based on RPS, and improves the efficiency of packet processing by increasing the CPU cache hit rate. RFS achieves this by determining the most convenient CPU for computation, also taking queue length into account, so that cache hits are more likely to occur within the CPU. As a result, the CPU cache is invalidated less often and fewer cycles are needed to rebuild it. This can help reduce packet processing run time.
5.3.1. Use the Machine Config Operator (MCO) to activate RFS
Procedure
Copy the following MCO sample profile into a YAML file. For example, enable-rfs.yaml:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 50-enable-rfs
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=US-ASCII,%23%20turn%20on%20Receive%20Flow%20Steering%20%28RFS%29%20for%20all%20network%20interfaces%0ASUBSYSTEM%3D%3D%22net%22%2C%20ACTION%3D%3D%22add%22%2C%20RUN%7Bprogram%7D%2B%3D%22/bin/bash%20-c%20%27for%20x%20in%20/sys/%24DEVPATH/queues/rx-%2A%3B%20do%20echo%208192%20%3E%20%24x/rps_flow_cnt%3B%20%20done%27%22%0A
        filesystem: root
        mode: 0644
        path: /etc/udev/rules.d/70-persistent-net.rules
      - contents:
          source: data:text/plain;charset=US-ASCII,%23%20define%20sock%20flow%20enbtried%20for%20%20Receive%20Flow%20Steering%20%28RFS%29%0Anet.core.rps_sock_flow_entries%3D8192%0A
        filesystem: root
        mode: 0644
        path: /etc/sysctl.d/95-enable-rps.conf
Create the MCO profile:
$ oc create -f enable-rfs.yaml
Verify that an entry named 50-enable-rfs is listed:
$ oc get mc
To deactivate, enter:
$ oc delete mc 50-enable-rfs
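The source fields in the sample profile are percent-encoded data URLs, which makes the resulting file contents hard to read. The following sketch decodes them with the Python standard library so that you can review what the profile writes to each node. The two strings are copied from the sample profile; the helper name decode_data_url is hypothetical.

```python
from urllib.parse import unquote

UDEV_SOURCE = (
    "data:text/plain;charset=US-ASCII,%23%20turn%20on%20Receive%20Flow%20Steering"
    "%20%28RFS%29%20for%20all%20network%20interfaces%0ASUBSYSTEM%3D%3D%22net%22%2C"
    "%20ACTION%3D%3D%22add%22%2C%20RUN%7Bprogram%7D%2B%3D%22/bin/bash%20-c%20%27for"
    "%20x%20in%20/sys/%24DEVPATH/queues/rx-%2A%3B%20do%20echo%208192%20%3E%20%24x/"
    "rps_flow_cnt%3B%20%20done%27%22%0A"
)
SYSCTL_SOURCE = (
    "data:text/plain;charset=US-ASCII,%23%20define%20sock%20flow%20enbtried%20for"
    "%20%20Receive%20Flow%20Steering%20%28RFS%29%0Anet.core.rps_sock_flow_entries"
    "%3D8192%0A"
)


def decode_data_url(source: str) -> str:
    """Return the decoded payload of a percent-encoded text data URL."""
    # Everything after the first comma is the payload; commas inside the
    # payload are percent-encoded as %2C, so the split is unambiguous here.
    _header, _sep, payload = source.partition(",")
    return unquote(payload)


print(decode_data_url(UDEV_SOURCE))
print(decode_data_url(SYSCTL_SOURCE))
```

The first file is a udev rule that writes 8192 to rps_flow_cnt for every receive queue of each network interface that is added, and the second sets net.core.rps_sock_flow_entries to 8192.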
5.4. Choose your networking setup
The networking stack is one of the most important components for a Kubernetes-based product like OpenShift Container Platform. For IBM Z setups, the networking setup depends on the hypervisor of your choice. Depending on the workload and the application, the best fit usually changes with the use case and the traffic pattern.
Depending on your setup, consider these best practices:
- Consider all options regarding networking devices to optimize your traffic pattern. Explore the advantages of OSA-Express, RoCE Express, HiperSockets, z/VM VSwitch, Linux Bridge (KVM), and others to decide which option leads to the greatest benefit for your setup.
- Always use the latest available NIC version. For example, OSA Express 7S 10 GbE shows great improvement compared to OSA Express 6S 10 GbE with transactional workload types, although both are 10 GbE adapters.
- Each virtual switch adds an additional layer of latency.
- The load balancer plays an important role for network communication outside the cluster. Consider using a production-grade hardware load balancer if this is critical for your application.
- OpenShift Container Platform SDN introduces flows and rules, which impact the networking performance. Make sure to consider pod affinities and placements, to benefit from the locality of services where communication is critical.
- Balance the trade-off between performance and functionality.
5.5. Ensure high disk performance with HyperPAV on z/VM
DASD and ECKD devices are commonly used disk types in IBM Z environments. In a typical OpenShift Container Platform setup in z/VM environments, DASD disks are commonly used to support the local storage for the nodes. You can set up HyperPAV alias devices to provide more throughput and overall better I/O performance for the DASD disks that support the z/VM guests.
Using HyperPAV for the local storage devices leads to a significant performance benefit. However, you must be aware that there is a trade-off between throughput and CPU costs.
5.5.1. Use the Machine Config Operator (MCO) to activate HyperPAV aliases in nodes using z/VM full-pack minidisks
For z/VM-based OpenShift Container Platform setups that use full-pack minidisks, you can leverage the advantage of MCO profiles by activating HyperPAV aliases in all of the nodes. You must add YAML configurations for both control plane and compute nodes.
Procedure
Copy the following MCO sample profile into a YAML file for the control plane node. For example, 05-master-kernelarg-hpav.yaml:
$ cat 05-master-kernelarg-hpav.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 05-master-kernelarg-hpav
spec:
  config:
    ignition:
      version: 3.1.0
  kernelArguments:
    - rd.dasd=800-805
Copy the following MCO sample profile into a YAML file for the compute node. For example, 05-worker-kernelarg-hpav.yaml:
$ cat 05-worker-kernelarg-hpav.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 05-worker-kernelarg-hpav
spec:
  config:
    ignition:
      version: 3.1.0
  kernelArguments:
    - rd.dasd=800-805
Note
You must modify the rd.dasd arguments to fit the device IDs.
Create the MCO profiles:
$ oc create -f 05-master-kernelarg-hpav.yaml
$ oc create -f 05-worker-kernelarg-hpav.yaml
To deactivate, enter:
$ oc delete -f 05-master-kernelarg-hpav.yaml
$ oc delete -f 05-worker-kernelarg-hpav.yaml
5.6. RHEL KVM on IBM Z host recommendations
Optimizing a KVM virtual server environment strongly depends on the workloads of the virtual servers and on the available resources. The same action that enhances performance in one environment can have adverse effects in another. Finding the best balance for a particular setting can be a challenge and often involves experimentation.
The following section introduces some best practices when using OpenShift Container Platform with RHEL KVM on IBM Z and IBM® LinuxONE environments.
5.6.1. Use I/O threads for your virtual block devices
To make virtual block devices use I/O threads, you must configure one or more I/O threads for the virtual server and configure each virtual block device to use one of these I/O threads.
The following example specifies <iothreads>3</iothreads> to configure three I/O threads, with consecutive decimal thread IDs 1, 2, and 3. The iothread="2" parameter specifies the driver element of the disk device to use the I/O thread with ID 2.
Sample I/O thread specification
...
<domain>
  <iothreads>3</iothreads>
  ...
  <devices>
    ...
    <disk type="block" device="disk">
      <driver ... iothread="2"/>
    </disk>
    ...
  </devices>
  ...
</domain>
Threads can increase the performance of I/O operations for disk devices, but they also use memory and CPU resources. You can configure multiple devices to use the same thread. The best mapping of threads to devices depends on the available resources and the workload.
Start with a small number of I/O threads. Often, a single I/O thread for all disk devices is sufficient. Do not configure more threads than the number of virtual CPUs, and do not configure idle threads.
You can use the virsh iothreadadd command to add I/O threads with specific thread IDs to a running virtual server.
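Before applying a domain configuration, it can be useful to confirm that every disk refers to an I/O thread that actually exists. The following sketch checks this with the Python standard library. It assumes that thread IDs are the consecutive values 1 to N implied by the iothreads element, as in the sample above; the XML snippet is an illustrative, complete variant of that sample, and the function name invalid_iothread_refs is hypothetical.

```python
import xml.etree.ElementTree as ET

DOMAIN_XML = """
<domain type="kvm">
  <iothreads>3</iothreads>
  <devices>
    <disk type="block" device="disk">
      <driver name="qemu" type="raw" iothread="2"/>
    </disk>
  </devices>
</domain>
"""


def invalid_iothread_refs(xml_text: str) -> list:
    """Return iothread IDs used by driver elements that fall outside 1..N."""
    root = ET.fromstring(xml_text)
    count = int(root.findtext("iothreads", default="0"))
    bad = []
    for driver in root.iter("driver"):
        ref = driver.get("iothread")
        if ref is not None and not 1 <= int(ref) <= count:
            bad.append(int(ref))
    return bad


print(invalid_iothread_refs(DOMAIN_XML))
```

An empty list means that all referenced thread IDs are within the configured range.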
5.6.2. Avoid virtual SCSI devices
Configure virtual SCSI devices only if you need to address the device through SCSI-specific interfaces. Configure disk space as virtual block devices rather than virtual SCSI devices, regardless of the backing on the host.
However, you might need SCSI-specific interfaces for:
- A LUN for a SCSI-attached tape drive on the host.
- A DVD ISO file on the host file system that is mounted on a virtual DVD drive.
5.6.3. Configure guest caching for disk
Configure your disk devices to do caching by the guest and not by the host.
Ensure that the driver element of the disk device includes the cache="none" and io="native" parameters.
<disk type="block" device="disk">
  <driver name="qemu" type="raw" cache="none" io="native" iothread="1"/>
  ...
</disk>
5.6.4. Exclude the memory balloon device
Unless you need a dynamic memory size, do not define a memory balloon device and ensure that libvirt does not create one for you. Include the memballoon parameter as a child of the devices element in your domain configuration XML file, with the model set to none:
<memballoon model="none"/>
5.6.5. Tune the CPU migration algorithm of the host scheduler
Do not change the scheduler settings unless you are an expert who understands the implications. Do not apply changes to production systems without testing them and confirming that they have the intended effect.
The kernel.sched_migration_cost_ns parameter specifies a time interval in nanoseconds. After the last execution of a task, the CPU cache is considered to have useful content until this interval expires. Increasing this interval results in fewer task migrations. The default value is 500000 ns.
If the CPU idle time is higher than expected when there are runnable processes, try reducing this interval. If tasks bounce between CPUs or nodes too often, try increasing it.
To dynamically set the interval to 60000 ns, enter the following command:
# sysctl kernel.sched_migration_cost_ns=60000
To persistently change the value to 60000 ns, add the following entry to /etc/sysctl.conf:
kernel.sched_migration_cost_ns=60000
5.6.6. Disable the cpuset cgroup controller
This setting applies only to KVM hosts with cgroups version 1. To enable CPU hotplug on the host, disable the cpuset cgroup controller.
Procedure
- Open /etc/libvirt/qemu.conf with an editor of your choice.
- Go to the cgroup_controllers line.
- Duplicate the entire line and remove the leading number sign (#) from the copy.
- Remove the cpuset entry, as follows:
  cgroup_controllers = [ "cpu", "devices", "memory", "blkio", "cpuacct" ]
- For the new setting to take effect, you must restart the libvirtd daemon:
  - Stop all virtual machines.
  - Run the following command:
    # systemctl restart libvirtd
  - Restart the virtual machines.
This setting persists across host reboots.
5.6.7. Tune the polling period for idle virtual CPUs
When a virtual CPU becomes idle, KVM polls for wakeup conditions for the virtual CPU before allocating the host resource. You can specify the time interval during which polling takes place in sysfs at /sys/module/kvm/parameters/halt_poll_ns. During the specified time, polling reduces the wakeup latency for the virtual CPU at the expense of resource usage. Depending on the workload, a longer or shorter time for polling can be beneficial. The time interval is specified in nanoseconds. The default is 50000 ns.
To optimize for low CPU consumption, enter a small value or write 0 to disable polling:
# echo 0 > /sys/module/kvm/parameters/halt_poll_ns
To optimize for low latency, for example for transactional workloads, enter a large value:
# echo 80000 > /sys/module/kvm/parameters/halt_poll_ns
Chapter 6. Using the Node Tuning Operator
Learn about the Node Tuning Operator and how you can use it to manage node-level tuning by orchestrating the TuneD daemon.
6.1. About the Node Tuning Operator
The Node Tuning Operator helps you manage node-level tuning by orchestrating the TuneD daemon and achieves low latency performance by using the Performance Profile controller. The majority of high-performance applications require some level of kernel tuning. The Node Tuning Operator provides a unified management interface to users of node-level sysctls and more flexibility to add custom tuning specified by user needs.
The Operator manages the containerized TuneD daemon for OpenShift Container Platform as a Kubernetes daemon set. It ensures the custom tuning specification is passed to all containerized TuneD daemons running in the cluster in the format that the daemons understand. The daemons run on all nodes in the cluster, one per node.
Node-level settings applied by the containerized TuneD daemon are rolled back on an event that triggers a profile change or when the containerized TuneD daemon is terminated gracefully by receiving and handling a termination signal.
The Node Tuning Operator uses the Performance Profile controller to implement automatic tuning to achieve low latency performance for OpenShift Container Platform applications.
The cluster administrator configures a performance profile to define node-level settings such as the following:
- Updating the kernel to kernel-rt.
- Choosing CPUs for housekeeping.
- Choosing CPUs for running workloads.
Currently, disabling CPU load balancing is not supported by cgroup v2. As a result, you might not get the desired behavior from performance profiles if you have cgroup v2 enabled. Enabling cgroup v2 is not recommended if you are using performance profiles.
The Node Tuning Operator is part of a standard OpenShift Container Platform installation in version 4.1 and later.
In earlier versions of OpenShift Container Platform, the Performance Addon Operator was used to implement automatic tuning to achieve low latency performance for OpenShift applications. In OpenShift Container Platform 4.11 and later, this functionality is part of the Node Tuning Operator.
6.2. Accessing an example Node Tuning Operator specification
Use this process to access an example Node Tuning Operator specification.
Procedure
Run the following command to access an example Node Tuning Operator specification:
$ oc get tuned.tuned.openshift.io/default -o yaml -n openshift-cluster-node-tuning-operator
The default CR is meant for delivering standard node-level tuning for OpenShift Container Platform, and it can only be modified to set the Operator Management state. Any other custom changes to the default CR will be overwritten by the Operator. For custom tuning, create your own Tuned CRs. Newly created CRs will be combined with the default CR, and custom tuning is applied to OpenShift Container Platform nodes based on node or pod labels and profile priorities.
While in certain situations the support for pod labels can be a convenient way of automatically delivering required tuning, this practice is strongly discouraged, especially in large-scale clusters. The default Tuned CR ships without pod label matching. If a custom profile is created with pod label matching, then the functionality will be enabled at that time. The pod label functionality will be deprecated in future versions of the Node Tuning Operator.
6.3. Default profiles set on a cluster
The following are the default profiles set on a cluster.
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: default
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Optimize systems running OpenShift (provider specific parent profile)
      include=-provider-${f:exec:cat:/var/lib/tuned/provider},openshift
    name: openshift
  recommend:
  - profile: openshift-control-plane
    priority: 30
    match:
    - label: node-role.kubernetes.io/master
    - label: node-role.kubernetes.io/infra
  - profile: openshift-node
    priority: 40
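The recommend section of the default CR can be read as an ordered rule list. The following sketch models how a profile might be chosen for a node, under these assumptions: a lower priority number takes precedence, the items in a match list are alternatives (any one label is enough), and a rule without a match section applies to every node. It is an illustration of that reading rather than the Operator's implementation, and the function name select_profile is hypothetical.

```python
# Rules copied from the recommend section of the default Tuned CR.
RECOMMEND = [
    {
        "profile": "openshift-control-plane",
        "priority": 30,
        "match": [
            {"label": "node-role.kubernetes.io/master"},
            {"label": "node-role.kubernetes.io/infra"},
        ],
    },
    {"profile": "openshift-node", "priority": 40},
]


def select_profile(node_labels: set):
    """Return the first matching profile, evaluated in ascending priority order."""
    for rule in sorted(RECOMMEND, key=lambda r: r["priority"]):
        match = rule.get("match")
        # Assumption: a rule without a match section matches every node.
        if match is None or any(m["label"] in node_labels for m in match):
            return rule["profile"]
    return None


print(select_profile({"node-role.kubernetes.io/master"}))
print(select_profile({"node-role.kubernetes.io/worker"}))
```

Under these assumptions, control plane and infra nodes receive the openshift-control-plane profile, and all other nodes fall through to openshift-node.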
Starting with OpenShift Container Platform 4.9, all OpenShift TuneD profiles are shipped with the TuneD package. You can use the oc exec command to view the contents of these profiles:
$ oc exec $tuned_pod -n openshift-cluster-node-tuning-operator -- find /usr/lib/tuned/openshift{,-control-plane,-node} -name tuned.conf -exec grep -H ^ {} \;
6.4. Verifying that the TuneD profiles are applied
Verify the TuneD profiles that are applied to your cluster node.
$ oc get profile.tuned.openshift.io -n openshift-cluster-node-tuning-operator
Example output
NAME       TUNED                     APPLIED   DEGRADED   AGE
master-0   openshift-control-plane   True      False      6h33m
master-1   openshift-control-plane   True      False      6h33m
master-2   openshift-control-plane   True      False      6h33m
worker-a   openshift-node            True      False      6h28m
worker-b   openshift-node            True      False      6h28m
- NAME: Name of the Profile object. There is one Profile object per node and their names match.
- TUNED: Name of the desired TuneD profile to apply.
- APPLIED: True if the TuneD daemon applied the desired profile (True/False/Unknown).
- DEGRADED: True if any errors were reported during application of the TuneD profile (True/False/Unknown).
- AGE: Time elapsed since the creation of the Profile object.
The ClusterOperator/node-tuning object also contains useful information about the Operator and its node agents' health. For example, Operator misconfiguration is reported by ClusterOperator/node-tuning status messages.
To get status information about the ClusterOperator/node-tuning object, run the following command:
$ oc get co/node-tuning -n openshift-cluster-node-tuning-operator
Example output
NAME          VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
node-tuning   4.13.1    True        False         True       60m     1/5 Profiles with bootcmdline conflict
If either the ClusterOperator/node-tuning or a profile object’s status is DEGRADED, additional information is provided in the Operator or operand logs.
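The status columns lend themselves to a quick scripted check. The following Python sketch is illustrative only: it assumes you have captured the plain-text output of the oc get profile.tuned.openshift.io command shown above, and the helper name find_unhealthy_profiles and the sample rows are hypothetical.

```python
def find_unhealthy_profiles(table_text):
    """Return names of Profile objects that are not applied or are degraded.

    Assumes the five-column layout shown above:
    NAME TUNED APPLIED DEGRADED AGE
    """
    unhealthy = []
    lines = [line for line in table_text.strip().splitlines() if line.strip()]
    for line in lines[1:]:  # skip the header row
        name, _tuned, applied, degraded, _age = line.split()
        # Treat anything other than APPLIED=True and DEGRADED=False as unhealthy,
        # which also covers the Unknown state.
        if applied != "True" or degraded != "False":
            unhealthy.append(name)
    return unhealthy


# Hypothetical sample input, not real cluster output.
sample = """\
NAME       TUNED                     APPLIED   DEGRADED   AGE
master-0   openshift-control-plane   True      False      6h33m
worker-a   openshift-node            False     Unknown    6h28m
"""
print(find_unhealthy_profiles(sample))
```

For any node that this check reports, the Operator or operand logs remain the place to look for the underlying cause.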
6.5. Custom tuning specification
The custom resource (CR) for the Operator has two major sections. The first section, profile:, is a list of TuneD profiles and their names. The second, recommend:, defines the profile selection logic.
Multiple custom tuning specifications can co-exist as multiple CRs in the Operator’s namespace. The existence of new CRs or the deletion of old CRs is detected by the Operator. All existing custom tuning specifications are merged and appropriate objects for the containerized TuneD daemons are updated.
Management state
The Operator Management state is set by adjusting the default Tuned CR. By default, the Operator is in the Managed state and the spec.managementState
field is not present in the default Tuned CR. Valid values for the Operator Management state are as follows:
- Managed: the Operator will update its operands as configuration resources are updated
- Unmanaged: the Operator will ignore changes to the configuration resources
- Removed: the Operator will remove its operands and resources the Operator provisioned
Profile data
The profile: section lists TuneD profiles and their names.
profile:
- name: tuned_profile_1
  data: |
    # TuneD profile specification
    [main]
    summary=Description of tuned_profile_1 profile
    [sysctl]
    net.ipv4.ip_forward=1
    # ... other sysctl's or other TuneD daemon plugins supported by the containerized TuneD
# ...
- name: tuned_profile_n
  data: |
    # TuneD profile specification
    [main]
    summary=Description of tuned_profile_n profile
    # tuned_profile_n profile settings
Recommended profiles
The profile selection logic is defined by the recommend: section of the CR. The recommend: section is a list of items that recommend profiles based on selection criteria.
recommend:
<recommend-item-1>
# ...
<recommend-item-n>
The individual items of the list:
- machineConfigLabels: 1
    <mcLabels> 2
  match: 3
    <match> 4
  priority: <priority> 5
  profile: <tuned_profile_name> 6
  operand: 7
    debug: <bool> 8
    tunedConfig:
      reapply_sysctl: <bool> 9
- 1: Optional.
- 2: A dictionary of key/value MachineConfig labels. The keys must be unique.
- 3: If omitted, profile match is assumed unless a profile with a higher priority matches first or machineConfigLabels is set.
- 4: An optional list.
- 5: Profile ordering priority. Lower numbers mean higher priority (0 is the highest priority).
- 6: A TuneD profile to apply on a match. For example tuned_profile_1.
- 7: Optional operand configuration.
- 8: Turn debugging on or off for the TuneD daemon. Options are true for on or false for off. The default is false.
- 9: Turn reapply_sysctl functionality on or off for the TuneD daemon. Options are true for on and false for off.
<match> is an optional list recursively defined as follows:
- label: <label_name>
  value: <label_value>
  type: <label_type>
    <match>
If <match> is not omitted, all nested <match> sections must also evaluate to true. Otherwise, false is assumed and the profile with the respective <match> section will not be applied or recommended. Therefore, the nesting (child <match> sections) works as a logical AND operator. Conversely, if any item of the <match> list matches, the entire <match> list evaluates to true. Therefore, the list acts as a logical OR operator.
If machineConfigLabels is defined, machine config pool based matching is turned on for the given recommend: list item. <mcLabels> specifies the labels for a machine config. The machine config is created automatically to apply host settings, such as kernel boot parameters, for the profile <tuned_profile_name>. This involves finding all machine config pools with machine config selector matching <mcLabels> and setting the profile <tuned_profile_name> on all nodes that are assigned the found machine config pools. To target nodes that have both master and worker roles, you must use the master role.
The list items match and machineConfigLabels are connected by the logical OR operator. The match item is evaluated first in a short-circuit manner. Therefore, if it evaluates to true, the machineConfigLabels item is not considered.
When using machine config pool based matching, it is advised to group nodes with the same hardware configuration into the same machine config pool. Not following this practice might result in TuneD operands calculating conflicting kernel parameters for two or more nodes sharing the same machine config pool.
Example: node or pod label based matching
- match:
  - label: tuned.openshift.io/elasticsearch
    match:
    - label: node-role.kubernetes.io/master
    - label: node-role.kubernetes.io/infra
    type: pod
  priority: 10
  profile: openshift-control-plane-es
- match:
  - label: node-role.kubernetes.io/master
  - label: node-role.kubernetes.io/infra
  priority: 20
  profile: openshift-control-plane
- priority: 30
  profile: openshift-node
The CR above is translated for the containerized TuneD daemon into its recommend.conf file based on the profile priorities. The profile with the highest priority (10) is openshift-control-plane-es and, therefore, it is considered first. The containerized TuneD daemon running on a given node looks to see if there is a pod running on the same node with the tuned.openshift.io/elasticsearch label set. If not, the entire <match> section evaluates as false. If there is such a pod with the label, in order for the <match> section to evaluate to true, the node label also needs to be node-role.kubernetes.io/master or node-role.kubernetes.io/infra.
If the labels for the profile with priority 10 matched, the openshift-control-plane-es profile is applied and no other profile is considered. If the node/pod label combination did not match, the second highest priority profile (openshift-control-plane) is considered. This profile is applied if the containerized TuneD pod runs on a node with labels node-role.kubernetes.io/master or node-role.kubernetes.io/infra.
Finally, the profile openshift-node has the lowest priority of 30. It lacks the <match> section and, therefore, will always match. It acts as a profile catch-all to set the openshift-node profile, if no other profile with higher priority matches on a given node.
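The evaluation order described above can be sketched in a few lines of Python. This is an illustrative model only, not the Operator's implementation: the function names are hypothetical, pod labels are simplified to a single dictionary of labels found on pods running on the node, a missing type is assumed to mean a node label, a missing value is assumed to match any label value, and machineConfigLabels matching is left out.

```python
def match_item(item, node_labels, pod_labels):
    """Evaluate one <match> item; a nested <match> is combined with logical AND."""
    # Assumption: "type: pod" selects pod labels, anything else selects node labels.
    labels = pod_labels if item.get("type") == "pod" else node_labels
    name = item["label"]
    if name not in labels:
        return False
    # Assumption: when no value is given, the presence of the label is enough.
    if "value" in item and labels[name] != item["value"]:
        return False
    nested = item.get("match")
    return True if not nested else match_list(nested, node_labels, pod_labels)


def match_list(items, node_labels, pod_labels):
    """A <match> list is true if any of its items matches (logical OR)."""
    return any(match_item(i, node_labels, pod_labels) for i in items)


def recommend_profile(recommend, node_labels, pod_labels):
    """Return the first matching profile, lowest priority number first."""
    for rule in sorted(recommend, key=lambda r: r["priority"]):
        # A rule without a match section always matches.
        if "match" not in rule or match_list(rule["match"], node_labels, pod_labels):
            return rule["profile"]
    return None


# The recommend list from the example above, written as Python data.
RECOMMEND = [
    {"match": [{"label": "tuned.openshift.io/elasticsearch", "type": "pod",
                "match": [{"label": "node-role.kubernetes.io/master"},
                          {"label": "node-role.kubernetes.io/infra"}]}],
     "priority": 10, "profile": "openshift-control-plane-es"},
    {"match": [{"label": "node-role.kubernetes.io/master"},
               {"label": "node-role.kubernetes.io/infra"}],
     "priority": 20, "profile": "openshift-control-plane"},
    {"priority": 30, "profile": "openshift-node"},
]

infra = {"node-role.kubernetes.io/infra": ""}
worker = {"node-role.kubernetes.io/worker": ""}
print(recommend_profile(RECOMMEND, infra, {"tuned.openshift.io/elasticsearch": ""}))
print(recommend_profile(RECOMMEND, infra, {}))
print(recommend_profile(RECOMMEND, worker, {}))
```

The three calls print openshift-control-plane-es, openshift-control-plane, and openshift-node, which mirrors the walk through the priorities 10, 20, and 30 in the text.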
Example: machine config pool based matching
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-node-custom
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Custom OpenShift node profile with an additional kernel parameter
      include=openshift-node
      [bootloader]
      cmdline_openshift_node_custom=+skew_tick=1
    name: openshift-node-custom
  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: "worker-custom"
    priority: 20
    profile: openshift-node-custom
To minimize node reboots, label the target nodes with a label the machine config pool’s node selector will match, then create the Tuned CR above and finally create the custom machine config pool itself.
Cloud provider-specific TuneD profiles
With this functionality, all nodes that run on a given cloud provider can be assigned a TuneD profile specifically tailored to that cloud provider on an OpenShift Container Platform cluster. This can be accomplished without adding additional node labels or grouping nodes into machine config pools.
This functionality takes advantage of spec.providerID node object values in the form of <cloud-provider>://<cloud-provider-specific-id> and writes the file /var/lib/tuned/provider with the value <cloud-provider> in NTO operand containers. The content of this file is then used by TuneD to load the provider-<cloud-provider> profile if such a profile exists.
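As a rough illustration of that naming scheme, the following Python sketch derives the profile name from a spec.providerID value. The function name and the sample provider ID are hypothetical, and the sketch only models the string handling, not how the operand writes or reads /var/lib/tuned/provider.

```python
def provider_profile_name(provider_id):
    """Derive the provider-<cloud-provider> profile name from spec.providerID."""
    cloud_provider, separator, _specific_id = provider_id.partition("://")
    if not separator or not cloud_provider:
        # The value is not in the <cloud-provider>://<cloud-provider-specific-id> form.
        return None
    return f"provider-{cloud_provider}"


# Hypothetical provider ID, for illustration only.
print(provider_profile_name("gce://example-project/example-zone/example-node"))
print(provider_profile_name(""))
```

The first call prints provider-gce and the second prints None, which corresponds to a node without a usable provider ID.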
The openshift profile, from which both the openshift-control-plane and openshift-node profiles inherit settings, is now updated to use this functionality through conditional profile loading. Neither NTO nor TuneD currently includes any cloud provider-specific profiles. However, it is possible to create a custom profile provider-<cloud-provider> that will be applied to all nodes of that cloud provider in the cluster.
Example GCE Cloud provider profile
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: provider-gce
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=GCE Cloud provider-specific profile
      # Your tuning for GCE Cloud provider goes here.
    name: provider-gce
Due to profile inheritance, any setting specified in the provider-<cloud-provider> profile will be overwritten by the openshift profile and its child profiles.
6.6. Custom tuning examples
Using TuneD profiles from the default CR
The following CR applies custom node-level tuning for OpenShift Container Platform nodes with label tuned.openshift.io/ingress-node-label set to any value.
Example: custom tuning using the openshift-control-plane TuneD profile
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: ingress
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=A custom OpenShift ingress profile
      include=openshift-control-plane
      [sysctl]
      net.ipv4.ip_local_port_range="1024 65535"
      net.ipv4.tcp_tw_reuse=1
    name: openshift-ingress
  recommend:
  - match:
    - label: tuned.openshift.io/ingress-node-label
    priority: 10
    profile: openshift-ingress
Custom profile writers are strongly encouraged to include the default TuneD daemon profiles shipped within the default Tuned CR. The example above uses the default openshift-control-plane profile to accomplish this.
Using built-in TuneD profiles
Given the successful rollout of the NTO-managed daemon set, the TuneD operands all manage the same version of the TuneD daemon. To list the built-in TuneD profiles supported by the daemon, query any TuneD pod in the following way:
$ oc exec $tuned_pod -n openshift-cluster-node-tuning-operator -- find /usr/lib/tuned/ -name tuned.conf -printf '%h\n' | sed 's|^.*/||'
You can use the profile names retrieved by this in your custom tuning specification.
Example: using built-in hpc-compute TuneD profile
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-node-hpc-compute
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Custom OpenShift node profile for HPC compute workloads
      include=openshift-node,hpc-compute
    name: openshift-node-hpc-compute
  recommend:
  - match:
    - label: tuned.openshift.io/openshift-node-hpc-compute
    priority: 20
    profile: openshift-node-hpc-compute
In addition to the built-in hpc-compute profile, the example above includes the openshift-node TuneD daemon profile shipped within the default Tuned CR to use OpenShift-specific tuning for compute nodes.
Overriding host-level sysctls
Various kernel parameters can be changed at runtime by using /run/sysctl.d/, /etc/sysctl.d/, and /etc/sysctl.conf host configuration files. OpenShift Container Platform adds several host configuration files which set kernel parameters at runtime; for example, net.ipv[4-6]., fs.inotify., and vm.max_map_count. These runtime parameters provide basic functional tuning for the system prior to the kubelet and the Operator start.
The Operator does not override these settings unless the reapply_sysctl option is set to false. Setting this option to false results in TuneD not applying the settings from the host configuration files after it applies its custom profile.
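The effect of the option can be modeled as an ordering of dictionary updates. The following Python sketch is a simplified illustration, not the TuneD implementation: the function name is hypothetical, the parameter values are made up, and the default of True for reapply_sysctl is an assumption drawn from the statement that the Operator does not override host settings unless the option is set to false.

```python
def effective_sysctls(host_files, profile, reapply_sysctl=True):
    """Model which value wins for each kernel parameter.

    host_files: parameters set by the host configuration files
    profile:    parameters set in the [sysctl] section of the TuneD profile
    """
    result = dict(host_files)       # host files are applied before TuneD starts
    result.update(profile)          # TuneD then applies its custom profile
    if reapply_sysctl:
        result.update(host_files)   # host files are reapplied and win any conflict
    return result


# Illustrative values only.
host = {"vm.max_map_count": "262144"}
custom = {"vm.max_map_count": "524288"}
print(effective_sysctls(host, custom, reapply_sysctl=True))
print(effective_sysctls(host, custom, reapply_sysctl=False))
```

With reapply_sysctl enabled, the host value 262144 is kept. With it disabled, the profile value 524288 stays in effect.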
Example: overriding host-level sysctls
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-no-reapply-sysctl
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Custom OpenShift profile
      include=openshift-node
      [sysctl]
      vm.max_map_count=>524288
    name: openshift-no-reapply-sysctl
  recommend:
  - match:
    - label: tuned.openshift.io/openshift-no-reapply-sysctl
    priority: 15
    profile: openshift-no-reapply-sysctl
    operand:
      tunedConfig:
        reapply_sysctl: false
6.7. Supported TuneD daemon plugins
Excluding the [main] section, the following TuneD plugins are supported when using custom profiles defined in the profile: section of the Tuned CR:
- audio
- cpu
- disk
- eeepc_she
- modules
- mounts
- net
- scheduler
- scsi_host
- selinux
- sysctl
- sysfs
- usb
- video
- vm
- bootloader
There is some dynamic tuning functionality provided by some of these plugins that is not supported. The following TuneD plugins are currently not supported:
- script
- systemd
The TuneD bootloader plugin only supports Red Hat Enterprise Linux CoreOS (RHCOS) worker nodes.
6.8. Configuring node tuning in a hosted cluster
Hosted control planes is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
To set node-level tuning on the nodes in your hosted cluster, you can use the Node Tuning Operator. In hosted control planes, you can configure node tuning by creating config maps that contain Tuned
objects and referencing those config maps in your node pools.
Procedure
Create a config map that contains a valid tuned manifest, and reference the manifest in a node pool. In the following example, a Tuned manifest defines a profile that sets vm.dirty_ratio to 55 on nodes that contain the tuned-1-node-label node label with any value. Save the following ConfigMap manifest in a file named tuned-1.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: tuned-1
  namespace: clusters
data:
  tuning: |
    apiVersion: tuned.openshift.io/v1
    kind: Tuned
    metadata:
      name: tuned-1
      namespace: openshift-cluster-node-tuning-operator
    spec:
      profile:
      - data: |
          [main]
          summary=Custom OpenShift profile
          include=openshift-node
          [sysctl]
          vm.dirty_ratio="55"
        name: tuned-1-profile
      recommend:
      - priority: 20
        profile: tuned-1-profile
Note: If you do not add any labels to an entry in the spec.recommend section of the Tuned spec, node-pool-based matching is assumed, so the highest priority profile in the spec.recommend section is applied to nodes in the pool. Although you can achieve more fine-grained node-label-based matching by setting a label value in the Tuned .spec.recommend.match section, node labels will not persist during an upgrade unless you set the .spec.management.upgradeType value of the node pool to InPlace.
Create the ConfigMap object in the management cluster:
$ oc --kubeconfig="$MGMT_KUBECONFIG" create -f tuned-1.yaml
Reference the ConfigMap object in the spec.tuningConfig field of the node pool, either by editing a node pool or creating one. In this example, assume that you have only one NodePool, named nodepool-1, which contains 2 nodes.
apiVersion: hypershift.openshift.io/v1alpha1
kind: NodePool
metadata:
  ...
  name: nodepool-1
  namespace: clusters
...
spec:
  ...
  tuningConfig:
  - name: tuned-1
status:
...
Note: You can reference the same config map in multiple node pools. In hosted control planes, the Node Tuning Operator appends a hash of the node pool name and namespace to the name of the Tuned CRs to distinguish them. Outside of this case, do not create multiple TuneD profiles of the same name in different Tuned CRs for the same hosted cluster.
Verification
Now that you have created the ConfigMap object that contains a Tuned manifest and referenced it in a NodePool, the Node Tuning Operator syncs the Tuned objects into the hosted cluster. You can verify which Tuned objects are defined and which TuneD profiles are applied to each node.
List the Tuned objects in the hosted cluster:
$ oc --kubeconfig="$HC_KUBECONFIG" get tuned.tuned.openshift.io -n openshift-cluster-node-tuning-operator
Example output
NAME       AGE
default    7m36s
rendered   7m36s
tuned-1    65s
List the Profile objects in the hosted cluster:
$ oc --kubeconfig="$HC_KUBECONFIG" get profile.tuned.openshift.io -n openshift-cluster-node-tuning-operator
Example output
NAME                  TUNED             APPLIED   DEGRADED   AGE
nodepool-1-worker-1   tuned-1-profile   True      False      7m43s
nodepool-1-worker-2   tuned-1-profile   True      False      7m14s
Note: If no custom profiles are created, the openshift-node profile is applied by default.
To confirm that the tuning was applied correctly, start a debug shell on a node and check the sysctl values:
$ oc --kubeconfig="$HC_KUBECONFIG" debug node/nodepool-1-worker-1 -- chroot /host sysctl vm.dirty_ratio
Example output
vm.dirty_ratio = 55
6.9. Advanced node tuning for hosted clusters by setting kernel boot parameters
Hosted control planes is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
For more advanced tuning in hosted control planes, which requires setting kernel boot parameters, you can also use the Node Tuning Operator. The following example shows how you can create a node pool with huge pages reserved.
Procedure
Create a ConfigMap object that contains a Tuned object manifest for creating 50 huge pages that are 2 MB in size. Save this ConfigMap manifest in a file named tuned-hugepages.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: tuned-hugepages
  namespace: clusters
data:
  tuning: |
    apiVersion: tuned.openshift.io/v1
    kind: Tuned
    metadata:
      name: hugepages
      namespace: openshift-cluster-node-tuning-operator
    spec:
      profile:
      - data: |
          [main]
          summary=Boot time configuration for hugepages
          include=openshift-node
          [bootloader]
          cmdline_openshift_node_hugepages=hugepagesz=2M hugepages=50
        name: openshift-node-hugepages
      recommend:
      - priority: 20
        profile: openshift-node-hugepages
Note: The .spec.recommend.match field is intentionally left blank. In this case, this Tuned object is applied to all nodes in the node pool where this ConfigMap object is referenced. Group nodes with the same hardware configuration into the same node pool. Otherwise, TuneD operands can calculate conflicting kernel parameters for two or more nodes that share the same node pool.
Create the ConfigMap object in the management cluster:
$ oc --kubeconfig="<management_cluster_kubeconfig>" create -f tuned-hugepages.yaml 1
- 1: Replace <management_cluster_kubeconfig> with the name of your management cluster kubeconfig file.
Create a NodePool manifest YAML file, customize the upgrade type of the NodePool, and reference the ConfigMap object that you created in the spec.tuningConfig section. Create the NodePool manifest and save it in a file named hugepages-nodepool.yaml by using the hcp CLI:
$ hcp create nodepool aws \
  --cluster-name <hosted_cluster_name> \
  --name <nodepool_name> \
  --node-count <nodepool_replicas> \
  --instance-type <instance_type> \
  --render > hugepages-nodepool.yaml
Note: The --render flag in the hcp create command does not render the secrets. To render the secrets, you must use both the --render and the --render-sensitive flags in the hcp create command.
In the hugepages-nodepool.yaml file, set .spec.management.upgradeType to InPlace, and set .spec.tuningConfig to reference the tuned-hugepages ConfigMap object that you created.
apiVersion: hypershift.openshift.io/v1alpha1
kind: NodePool
metadata:
  name: hugepages-nodepool
  namespace: clusters
  ...
spec:
  management:
    ...
    upgradeType: InPlace
  ...
  tuningConfig:
  - name: tuned-hugepages
Note: To avoid the unnecessary re-creation of nodes when you apply the new MachineConfig objects, set .spec.management.upgradeType to InPlace. If you use the Replace upgrade type, nodes are fully deleted and new nodes can replace them when you apply the new kernel boot parameters that the TuneD operand calculated.
Create the NodePool in the management cluster:
$ oc --kubeconfig="<management_cluster_kubeconfig>" create -f hugepages-nodepool.yaml
Verification
After the nodes are available, the containerized TuneD daemon calculates the required kernel boot parameters based on the applied TuneD profile. After the nodes are ready and reboot once to apply the generated MachineConfig object, you can verify that the TuneD profile is applied and that the kernel boot parameters are set.
List the Tuned objects in the hosted cluster:
$ oc --kubeconfig="<hosted_cluster_kubeconfig>" get tuned.tuned.openshift.io -n openshift-cluster-node-tuning-operator
Example output
NAME                 AGE
default              123m
hugepages-8dfb1fed   1m23s
rendered             123m
List the Profile objects in the hosted cluster:
$ oc --kubeconfig="<hosted_cluster_kubeconfig>" get profile.tuned.openshift.io -n openshift-cluster-node-tuning-operator
Example output
NAME                          TUNED                      APPLIED   DEGRADED   AGE
nodepool-1-worker-1           openshift-node             True      False      132m
nodepool-1-worker-2           openshift-node             True      False      131m
hugepages-nodepool-worker-1   openshift-node-hugepages   True      False      4m8s
hugepages-nodepool-worker-2   openshift-node-hugepages   True      False      3m57s
Both of the worker nodes in the new NodePool have the openshift-node-hugepages profile applied.
To confirm that the tuning was applied correctly, start a debug shell on a node in the new NodePool and check /proc/cmdline.
$ oc --kubeconfig="<hosted_cluster_kubeconfig>" debug node/hugepages-nodepool-worker-1 -- chroot /host cat /proc/cmdline
Example output
BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-... hugepagesz=2M hugepages=50
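To sanity-check how much memory such a command line reserves, you can multiply the page count by the page size. The following Python sketch is a hypothetical helper, not part of any OpenShift tooling. It assumes that each hugepages= argument follows the hugepagesz= argument it belongs to, and it ignores hugepages= arguments that appear without a preceding size.

```python
def reserved_hugepages_mib(cmdline):
    """Sum the memory reserved by hugepagesz=/hugepages= pairs, in MiB."""
    units = {"K": 1 / 1024, "M": 1, "G": 1024}  # size suffix to MiB factor
    total = 0
    page_size_mib = None
    for token in cmdline.split():
        key, _, value = token.partition("=")
        if key == "hugepagesz":
            page_size_mib = int(value[:-1]) * units[value[-1].upper()]
        elif key == "hugepages" and page_size_mib is not None:
            total += int(value) * page_size_mib
    return total


cmdline = "BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-... hugepagesz=2M hugepages=50"
print(reserved_hugepages_mib(cmdline))
```

For the example output above, 50 pages of 2 MB each give 100 MiB of reserved huge page memory.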
Additional resources
For more information about hosted control planes, see Hosted control planes (Technology Preview).
Chapter 7. Using CPU Manager and Topology Manager
CPU Manager manages groups of CPUs and constrains workloads to specific CPUs.
CPU Manager is useful for workloads that have some of these attributes:
- Require as much CPU time as possible.
- Are sensitive to processor cache misses.
- Are low-latency network applications.
- Coordinate with other processes and benefit from sharing a single processor cache.
Topology Manager collects hints from the CPU Manager, Device Manager, and other Hint Providers to align pod resources, such as CPU, SR-IOV VFs, and other device resources, for all Quality of Service (QoS) classes on the same non-uniform memory access (NUMA) node.
Topology Manager uses topology information from the collected hints to decide if a pod can be accepted or rejected on a node, based on the configured Topology Manager policy and pod resources requested.
Topology Manager is useful for workloads that use hardware accelerators to support latency-critical execution and high throughput parallel computation.
To use Topology Manager, you must configure CPU Manager with the static policy.
7.1. Setting up CPU Manager
Procedure
Optional: Label a node:
# oc label node perf-node.example.com cpumanager=true
Edit the MachineConfigPool of the nodes where CPU Manager should be enabled. In this example, all workers have CPU Manager enabled:
# oc edit machineconfigpool worker
Add a label to the worker machine config pool:
metadata:
  creationTimestamp: 2020-xx-xxx
  generation: 3
  labels:
    custom-kubelet: cpumanager-enabled
Create a KubeletConfig custom resource (CR), cpumanager-kubeletconfig.yaml. Refer to the label created in the previous step to have the correct nodes updated with the new kubelet config. See the machineConfigPoolSelector section:
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: cpumanager-enabled
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: cpumanager-enabled
  kubeletConfig:
    cpuManagerPolicy: static 1
    cpuManagerReconcilePeriod: 5s 2
- 1: Specify a policy:
  - none. This policy explicitly enables the existing default CPU affinity scheme, providing no affinity beyond what the scheduler does automatically. This is the default policy.
  - static. This policy gives containers in guaranteed pods with integer CPU requests access to exclusive CPUs on the node. If you specify static, you must use a lowercase s.
- 2: Optional. Specify the CPU Manager reconcile frequency. The default is 5s.
Create the dynamic kubelet config:
# oc create -f cpumanager-kubeletconfig.yaml
This adds the CPU Manager feature to the kubelet config and, if needed, the Machine Config Operator (MCO) reboots the node. To enable CPU Manager, a reboot is not needed.
Check for the merged kubelet config:
# oc get machineconfig 99-worker-XXXXXX-XXXXX-XXXX-XXXXX-kubelet -o json | grep ownerReference -A7
Example output
"ownerReferences": [
    {
        "apiVersion": "machineconfiguration.openshift.io/v1",
        "kind": "KubeletConfig",
        "name": "cpumanager-enabled",
        "uid": "7ed5616d-6b72-11e9-aae1-021e1ce18878"
    }
]
Check the worker for the updated kubelet.conf:
# oc debug node/perf-node.example.com
sh-4.2# cat /host/etc/kubernetes/kubelet.conf | grep cpuManager
Example output
cpuManagerPolicy: static
cpuManagerReconcilePeriod: 5s
Create a pod that requests a core or multiple cores. Both limits and requests must have their CPU value set to a whole integer. That is the number of cores that will be dedicated to this pod:
# cat cpumanager-pod.yaml
Example output
apiVersion: v1
kind: Pod
metadata:
  generateName: cpumanager-
spec:
  containers:
  - name: cpumanager
    image: gcr.io/google_containers/pause-amd64:3.0
    resources:
      requests:
        cpu: 1
        memory: "1G"
      limits:
        cpu: 1
        memory: "1G"
  nodeSelector:
    cpumanager: "true"
Create the pod:
# oc create -f cpumanager-pod.yaml
Verify that the pod is scheduled to the node that you labeled:
# oc describe pod cpumanager
Example output
Name:               cpumanager-6cqz7
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               perf-node.example.com/xxx.xx.xx.xxx
...
    Limits:
      cpu:     1
      memory:  1G
    Requests:
      cpu:     1
      memory:  1G
...
QoS Class:          Guaranteed
Node-Selectors:     cpumanager=true
Verify that the cgroups are set up correctly. Get the process ID (PID) of the pause process:
# ├─init.scope
│ └─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 17
└─kubepods.slice
  ├─kubepods-pod69c01f8e_6b74_11e9_ac0f_0a2b62178a22.slice
  │ ├─crio-b5437308f1a574c542bdf08563b865c0345c8f8c0b0a655612c.scope
  │ └─32706 /pause
Pods of quality of service (QoS) tier Guaranteed are placed within the kubepods.slice. Pods of other QoS tiers end up in child cgroups of kubepods:
# cd /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-pod69c01f8e_6b74_11e9_ac0f_0a2b62178a22.slice/crio-b5437308f1ad1a7db0574c542bdf08563b865c0345c86e9585f8c0b0a655612c.scope
# for i in `ls cpuset.cpus tasks` ; do echo -n "$i "; cat $i ; done
Example output
cpuset.cpus 1
tasks 32706
Check the allowed CPU list for the task:
# grep ^Cpus_allowed_list /proc/32706/status
Example output
Cpus_allowed_list: 1
Verify that another pod (in this case, a pod in the besteffort QoS tier) on the system cannot run on the core allocated for the Guaranteed pod:
# cat /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podc494a073_6b77_11e9_98c0_06bba5c387ea.slice/crio-c56982f57b75a2420947f0afc6cafe7534c5734efc34157525fa9abbf99e3849.scope/cpuset.cpus
0
# oc describe node perf-node.example.com
Example output
...
Capacity:
 attachable-volumes-aws-ebs:  39
 cpu:                         2
 ephemeral-storage:           124768236Ki
 hugepages-1Gi:               0
 hugepages-2Mi:               0
 memory:                      8162900Ki
 pods:                        250
Allocatable:
 attachable-volumes-aws-ebs:  39
 cpu:                         1500m
 ephemeral-storage:           124768236Ki
 hugepages-1Gi:               0
 hugepages-2Mi:               0
 memory:                      7548500Ki
 pods:                        250
-------    ----               ------------  ----------  ---------------  -------------  ---
default    cpumanager-6cqz7   1 (66%)       1 (66%)     1G (12%)         1G (12%)       29m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests     Limits
  --------  --------     ------
  cpu       1440m (96%)  1 (66%)
This VM has two CPU cores. The system-reserved setting reserves 500 millicores, meaning that half of one core is subtracted from the total capacity of the node to arrive at the Node Allocatable amount. You can see that Allocatable CPU is 1500 millicores. This means you can run one of the CPU Manager pods since each will take one whole core. A whole core is equivalent to 1000 millicores. If you try to schedule a second pod, the system will accept the pod, but it will never be scheduled:
NAME               READY   STATUS    RESTARTS   AGE
cpumanager-6cqz7   1/1     Running   0          33m
cpumanager-7qc2t   0/1     Pending   0          11s
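The arithmetic behind that result can be written out as a short Python sketch. It is a deliberately simplified model with a hypothetical function name: it considers only the node capacity and the system-reserved amount, and ignores CPU requests from any other pods on the node.

```python
def max_exclusive_pods(capacity_millicores, reserved_millicores, pod_millicores=1000):
    """Return (allocatable millicores, number of whole-core pods that fit)."""
    allocatable = capacity_millicores - reserved_millicores
    # Each CPU Manager pod in this example needs one whole core (1000 millicores).
    return allocatable, allocatable // pod_millicores


allocatable, pods = max_exclusive_pods(capacity_millicores=2000, reserved_millicores=500)
print(allocatable, pods)
```

For the two-core VM with 500 millicores reserved, this gives 1500 allocatable millicores and room for one whole-core pod, which is why the second pod stays in the Pending state.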
7.2. Topology Manager policies
Topology Manager aligns Pod resources of all Quality of Service (QoS) classes by collecting topology hints from Hint Providers, such as CPU Manager and Device Manager, and using the collected hints to align the Pod resources.
Topology Manager supports four allocation policies, which you assign in the KubeletConfig custom resource (CR) named cpumanager-enabled:
- none policy: This is the default policy and does not perform any topology alignment.
- best-effort policy: For each container in a pod with the best-effort topology management policy, kubelet calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager stores the preferred NUMA Node affinity for that container. If the affinity is not preferred, Topology Manager stores this and admits the pod to the node.
- restricted policy: For each container in a pod with the restricted topology management policy, kubelet calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager stores the preferred NUMA Node affinity for that container. If the affinity is not preferred, Topology Manager rejects this pod from the node, resulting in a pod in a Terminated state with a pod admission failure.
- single-numa-node policy: For each container in a pod with the single-numa-node topology management policy, kubelet calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager determines if a single NUMA Node affinity is possible. If it is, the pod is admitted to the node. If a single NUMA Node affinity is not possible, the Topology Manager rejects the pod from the node. This results in a pod in a Terminated state with a pod admission failure.
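The admission outcome of the four policies can be summarized as a small decision function. The following Python sketch is a simplified model under stated assumptions, not the kubelet's logic: the function name is hypothetical, and the merged hint is reduced to two booleans, whether the affinity is preferred and whether a single NUMA node affinity is possible.

```python
def admit_pod(policy, hint_preferred, single_numa_possible):
    """Return True if the pod is admitted to the node under the given policy."""
    if policy in ("none", "best-effort"):
        # none performs no alignment; best-effort stores the hint but always admits.
        return True
    if policy == "restricted":
        # A non-preferred affinity leads to a pod admission failure.
        return hint_preferred
    if policy == "single-numa-node":
        # The pod is admitted only if a single NUMA node can satisfy it.
        return single_numa_possible
    raise ValueError(f"unknown Topology Manager policy: {policy}")


for policy in ("none", "best-effort", "restricted", "single-numa-node"):
    print(policy, admit_pod(policy, hint_preferred=False, single_numa_possible=False))
```

With an unfavorable hint, only the none and best-effort policies admit the pod. The restricted and single-numa-node policies reject it.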
7.3. Setting up Topology Manager
To use Topology Manager, you must configure an allocation policy in the KubeletConfig custom resource (CR) named cpumanager-enabled. This file might exist if you have set up CPU Manager. If the file does not exist, you can create the file.
Prerequisites
- Configure the CPU Manager policy to be static.
Procedure
To activate Topology Manager:
Configure the Topology Manager allocation policy in the custom resource.
$ oc edit KubeletConfig cpumanager-enabled
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: cpumanager-enabled
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: cpumanager-enabled
  kubeletConfig:
    cpuManagerPolicy: static
    cpuManagerReconcilePeriod: 5s
    topologyManagerPolicy: single-numa-node
7.4. Pod interactions with Topology Manager policies
The example Pod specs below help illustrate pod interactions with Topology Manager.
The following pod runs in the BestEffort QoS class because no resource requests or limits are specified.
spec:
  containers:
  - name: nginx
    image: nginx
The next pod runs in the Burstable QoS class because requests are less than limits.
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
      requests:
        memory: "100Mi"
If the selected policy is anything other than none, Topology Manager would not consider either of these Pod specifications.
The last example pod below runs in the Guaranteed QoS class because requests are equal to limits.
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
        example.com/device: "1"
      requests:
        memory: "200Mi"
        cpu: "2"
        example.com/device: "1"
Topology Manager would consider this pod. The Topology Manager would consult the hint providers, which are CPU Manager and Device Manager, to get topology hints for the pod.
Topology Manager will use this information to store the best topology for this container. In the case of this pod, CPU Manager and Device Manager will use this stored information at the resource allocation stage.
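The QoS classes of the three example pods follow from the standard Kubernetes rules, which can be sketched as below. This is a simplified model under stated assumptions: the `qos_class` function is a hypothetical helper, quantities are compared as plain strings without unit normalization, and the input is a list of per-container `resources` maps rather than a full Pod object.

```python
def qos_class(containers):
    """Classify a pod from its containers' resources (simplified sketch of the Kubernetes rules).

    containers: list of dicts with optional "requests" and "limits" maps.
    """
    nothing_set = True
    guaranteed = True
    for resources in containers:
        limits = resources.get("limits", {})
        # Kubernetes defaults a missing request to the limit of the same resource.
        requests = {**limits, **resources.get("requests", {})}
        if requests or limits:
            nothing_set = False
        for name in ("cpu", "memory"):
            # Guaranteed needs a CPU and memory limit on every container, with request == limit.
            if name not in limits or requests.get(name) != limits[name]:
                guaranteed = False
    if nothing_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


if __name__ == "__main__":
    first = [{}]
    second = [{"limits": {"memory": "200Mi"}, "requests": {"memory": "100Mi"}}]
    third = [{
        "limits": {"memory": "200Mi", "cpu": "2", "example.com/device": "1"},
        "requests": {"memory": "200Mi", "cpu": "2", "example.com/device": "1"},
    }]
    for pod in (first, second, third):
        print(qos_class(pod))
```

Run against the three specs in this section, the sketch yields BestEffort, Burstable, and Guaranteed, in that order.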
Chapter 8. Scheduling NUMA-aware workloads
Learn about NUMA-aware scheduling and how you can use it to deploy high performance workloads in an OpenShift Container Platform cluster.
The NUMA Resources Operator allows you to schedule high-performance workloads in the same NUMA zone. It deploys a node resources exporting agent that reports on available cluster node NUMA resources, and a secondary scheduler that manages the workloads.
8.1. About NUMA-aware scheduling
Introduction to NUMA
Non-Uniform Memory Access (NUMA) is a compute platform architecture that allows different CPUs to access different regions of memory at different speeds. NUMA resource topology refers to the locations of CPUs, memory, and PCI devices relative to each other in the compute node. Colocated resources are said to be in the same NUMA zone. For high-performance applications, the cluster needs to process pod workloads in a single NUMA zone.
Performance considerations
NUMA architecture allows a CPU with multiple memory controllers to use any available memory across CPU complexes, regardless of where the memory is located. This allows for increased flexibility at the expense of performance. A CPU processing a workload using memory that is outside its NUMA zone is slower than a workload processed in a single NUMA zone. Also, for I/O-constrained workloads, the network interface on a distant NUMA zone slows down how quickly information can reach the application. High-performance workloads, such as telecommunications workloads, cannot operate to specification under these conditions.
NUMA-aware scheduling
NUMA-aware scheduling aligns the requested cluster compute resources (CPUs, memory, devices) in the same NUMA zone to process latency-sensitive or high-performance workloads efficiently. NUMA-aware scheduling also improves pod density per compute node for greater resource efficiency.
Integration with Node Tuning Operator
By integrating the Node Tuning Operator’s performance profile with NUMA-aware scheduling, you can further configure CPU affinity to optimize performance for latency-sensitive workloads.
Default scheduling logic
The scheduling logic of the default OpenShift Container Platform pod scheduler considers the available resources of the entire compute node, not individual NUMA zones. If the most restrictive resource alignment is requested in the kubelet topology manager, error conditions can occur when admitting the pod to a node. Conversely, if the most restrictive resource alignment is not requested, the pod can be admitted to the node without proper resource alignment, leading to worse or unpredictable performance. For example, runaway pod creation with Topology Affinity Error statuses can occur when the pod scheduler makes suboptimal scheduling decisions for guaranteed pod workloads without knowing if the pod's requested resources are available. Scheduling mismatch decisions can cause indefinite pod startup delays. Also, depending on the cluster state and resource allocation, poor pod scheduling decisions can cause extra load on the cluster because of failed startup attempts.
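The mismatch between node-level and zone-level accounting can be shown with a small sketch. The numbers and function names here are hypothetical and chosen only for illustration: a node with two NUMA zones that each have 6 free CPUs, and a guaranteed pod that requests 8 CPUs.

```python
def fits_node_total(zone_free_cpus, request):
    """Node-level check: only the sum of free CPUs across all zones is considered."""
    return sum(zone_free_cpus) >= request


def fits_single_zone(zone_free_cpus, request):
    """Zone-level check: at least one NUMA zone must satisfy the whole request."""
    return any(free >= request for free in zone_free_cpus)


if __name__ == "__main__":
    zone_free_cpus = [6, 6]  # hypothetical free CPUs in NUMA zone 0 and zone 1
    request = 8              # hypothetical CPU request of a guaranteed pod

    print(fits_node_total(zone_free_cpus, request))   # True: 12 free CPUs in total
    print(fits_single_zone(zone_free_cpus, request))  # False: no zone has 8 free CPUs
```

A scheduler that applies only the first check places the pod on the node, and a kubelet that enforces single NUMA node alignment then rejects it, which is the kind of Topology Affinity Error described above.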
NUMA-aware pod scheduling diagram
The NUMA Resources Operator deploys a custom NUMA resources secondary scheduler and other resources to mitigate against the shortcomings of the default OpenShift Container Platform pod scheduler. The following diagram provides a high-level overview of NUMA-aware pod scheduling.
Figure 8.1. NUMA-aware scheduling overview
- NodeResourceTopology API
  The NodeResourceTopology API describes the available NUMA zone resources in each compute node.
- NUMA-aware scheduler
  The NUMA-aware secondary scheduler receives information about the available NUMA zones from the NodeResourceTopology API and schedules high-performance workloads on a node where it can be optimally processed.
- Node topology exporter
  The node topology exporter exposes the available NUMA zone resources for each compute node to the NodeResourceTopology API. The node topology exporter daemon tracks the resource allocation from the kubelet by using the PodResources API.
- PodResources API
  The PodResources API is local to each node and exposes the resource topology and available resources to the kubelet.
Note: The List endpoint of the PodResources API exposes exclusive CPUs allocated to a particular container. The API does not expose CPUs that belong to a shared pool. The GetAllocatableResources endpoint exposes allocatable resources available on a node.
Additional resources
- For more information about running secondary pod schedulers in your cluster and how to deploy pods with a secondary pod scheduler, see Scheduling pods using a secondary scheduler.
8.2. Installing the NUMA Resources Operator
NUMA Resources Operator deploys resources that allow you to schedule NUMA-aware workloads and deployments. You can install the NUMA Resources Operator using the OpenShift Container Platform CLI or the web console.
8.2.1. Installing the NUMA Resources Operator using the CLI
As a cluster administrator, you can install the Operator using the CLI.
Prerequisites
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
Procedure
Create a namespace for the NUMA Resources Operator:
Save the following YAML in the nro-namespace.yaml file:
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-numaresources
Create the Namespace CR by running the following command:
$ oc create -f nro-namespace.yaml
Create the Operator group for the NUMA Resources Operator:
Save the following YAML in the nro-operatorgroup.yaml file:
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: numaresources-operator
  namespace: openshift-numaresources
spec:
  targetNamespaces:
  - openshift-numaresources
Create the OperatorGroup CR by running the following command:
$ oc create -f nro-operatorgroup.yaml
Create the subscription for the NUMA Resources Operator:
Save the following YAML in the nro-sub.yaml file:
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: numaresources-operator
  namespace: openshift-numaresources
spec:
  channel: "4.13"
  name: numaresources-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
Create the Subscription CR by running the following command:
$ oc create -f nro-sub.yaml
Verification
Verify that the installation succeeded by inspecting the CSV resource in the openshift-numaresources namespace. Run the following command:
$ oc get csv -n openshift-numaresources
Example output
NAME                             DISPLAY                  VERSION   REPLACES   PHASE
numaresources-operator.v4.13.2   numaresources-operator   4.13.2               Succeeded
8.2.2. Installing the NUMA Resources Operator using the web console
As a cluster administrator, you can install the NUMA Resources Operator using the web console.
Procedure
Create a namespace for the NUMA Resources Operator:
- In the OpenShift Container Platform web console, click Administration → Namespaces.
- Click Create Namespace, enter openshift-numaresources in the Name field, and then click Create.
Install the NUMA Resources Operator:
- In the OpenShift Container Platform web console, click Operators → OperatorHub.
- Choose numaresources-operator from the list of available Operators, and then click Install.
- In the Installed Namespaces field, select the openshift-numaresources namespace, and then click Install.
Optional: Verify that the NUMA Resources Operator installed successfully:
- Switch to the Operators → Installed Operators page.
Ensure that NUMA Resources Operator is listed in the openshift-numaresources namespace with a Status of InstallSucceeded.
Note: During installation an Operator might display a Failed status. If the installation later succeeds with an InstallSucceeded message, you can ignore the Failed message.
If the Operator does not appear as installed, to troubleshoot further:
- Go to the Operators → Installed Operators page and inspect the Operator Subscriptions and Install Plans tabs for any failure or errors under Status.
- Go to the Workloads → Pods page and check the logs for pods in the openshift-numaresources project, which is the namespace that the Operator is installed in.
8.3. Scheduling NUMA-aware workloads
Clusters running latency-sensitive workloads typically feature performance profiles that help to minimize workload latency and optimize performance. The NUMA-aware scheduler deploys workloads based on available node NUMA resources and with respect to any performance profile settings applied to the node. The combination of NUMA-aware deployments, and the performance profile of the workload, ensures that workloads are scheduled in a way that maximizes performance.
For the NUMA Resources Operator to be fully operational, you must deploy the NUMAResourcesOperator custom resource and the NUMA-aware secondary pod scheduler.
8.3.1. Creating the NUMAResourcesOperator custom resource
After you install the NUMA Resources Operator, create the NUMAResourcesOperator custom resource (CR). This CR instructs the NUMA Resources Operator to install all the cluster infrastructure needed to support the NUMA-aware scheduler, including daemon sets and APIs.
Prerequisites
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
- Install the NUMA Resources Operator.
Procedure
Create the NUMAResourcesOperator custom resource:
Save the following minimal required YAML file example as nrop.yaml:
apiVersion: nodetopology.openshift.io/v1
kind: NUMAResourcesOperator
metadata:
  name: numaresourcesoperator
spec:
  nodeGroups:
  - machineConfigPoolSelector:
      matchLabels:
        pools.operator.machineconfiguration.openshift.io/worker: "" # (1)
(1) This should match the MachineConfigPool that you want to configure the NUMA Resources Operator on. For example, you might have created a MachineConfigPool named worker-cnf that designates a set of nodes expected to run telecommunications workloads.
Create the NUMAResourcesOperator CR by running the following command:
$ oc create -f nrop.yaml
Note: Creating the NUMAResourcesOperator triggers a reboot on the corresponding machine config pool and therefore the affected node.
Verification
Verify that the NUMA Resources Operator deployed successfully by running the following command:
$ oc get numaresourcesoperators.nodetopology.openshift.io
Example output
NAME AGE numaresourcesoperator 27s
After a few minutes, run the following command to verify that the required resources deployed successfully:
$ oc get all -n openshift-numaresources
Example output
NAME                                                    READY   STATUS    RESTARTS   AGE
pod/numaresources-controller-manager-7d9d84c58d-qk2mr   1/1     Running   0          12m
pod/numaresourcesoperator-worker-7d96r                  2/2     Running   0          97s
pod/numaresourcesoperator-worker-crsht                  2/2     Running   0          97s
pod/numaresourcesoperator-worker-jp9mw                  2/2     Running   0          97s
8.3.2. Deploying the NUMA-aware secondary pod scheduler
After you install the NUMA Resources Operator, do the following to deploy the NUMA-aware secondary pod scheduler:
Procedure
Create the NUMAResourcesScheduler custom resource that deploys the NUMA-aware custom pod scheduler:
Save the following minimal required YAML in the nro-scheduler.yaml file:
apiVersion: nodetopology.openshift.io/v1
kind: NUMAResourcesScheduler
metadata:
  name: numaresourcesscheduler
spec:
  imageSpec: "registry.redhat.io/openshift4/noderesourcetopology-scheduler-container-rhel8:v4.13"
Create the NUMAResourcesScheduler CR by running the following command:
$ oc create -f nro-scheduler.yaml
After a few seconds, run the following command to confirm the successful deployment of the required resources:
$ oc get all -n openshift-numaresources
Example output
NAME                                                    READY   STATUS    RESTARTS   AGE
pod/numaresources-controller-manager-7d9d84c58d-qk2mr   1/1     Running   0          12m
pod/numaresourcesoperator-worker-7d96r                  2/2     Running   0          97s
pod/numaresourcesoperator-worker-crsht                  2/2     Running   0          97s
pod/numaresourcesoperator-worker-jp9mw                  2/2     Running   0          97s
pod/secondary-scheduler-847cb74f84-9whlm                1/1     Running   0          10m

NAME                                          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
daemonset.apps/numaresourcesoperator-worker   3         3         3       3            3           node-role.kubernetes.io/worker=   98s

NAME                                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/numaresources-controller-manager   1/1     1            1           12m
deployment.apps/secondary-scheduler                1/1     1            1           10m

NAME                                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/numaresources-controller-manager-7d9d84c58d   1         1         1       12m
replicaset.apps/secondary-scheduler-847cb74f84                1         1         1       10m
8.3.3. Configuring a single NUMA node policy
The NUMA Resources Operator requires a single NUMA node policy to be configured on the cluster. This can be achieved in two ways: by creating and applying a performance profile, or by configuring a KubeletConfig.
The preferred way to configure a single NUMA node policy is to apply a performance profile. You can use the Performance Profile Creator (PPC) tool to create the performance profile. If a performance profile is created on the cluster, it automatically creates other tuning components like KubeletConfig and the tuned profile.
For more information about creating a performance profile, see "About the Performance Profile Creator" in the "Additional resources" section.
Additional resources
8.3.4. Sample performance profile
This example YAML shows a performance profile created by using the Performance Profile Creator (PPC) tool:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  cpu:
    isolated: "3"
    reserved: 0-2
  machineConfigPoolSelector:
    pools.operator.machineconfiguration.openshift.io/worker: "" # (1)
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  numa:
    topologyPolicy: single-numa-node # (2)
  realTimeKernel:
    enabled: true
  workloadHints:
    highPowerConsumption: true
    perPodPowerManagement: false
    realTime: true
(1) This should match the MachineConfigPool that you want to configure the NUMA Resources Operator on. For example, you might have created a MachineConfigPool named worker-cnf that designates a set of nodes that run telecommunications workloads.
(2) The topologyPolicy must be set to single-numa-node. Ensure that this is the case by setting the topology-manager-policy argument to single-numa-node when running the PPC tool.
8.3.5. Creating a KubeletConfig CRD
The recommended way to configure a single NUMA node policy is to apply a performance profile. Another way is by creating and applying a KubeletConfig custom resource (CR), as shown in the following procedure.
Procedure
Create the KubeletConfig custom resource (CR) that configures the pod admittance policy for the machine profile:
Save the following YAML in the nro-kubeletconfig.yaml file:
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: worker-tuning
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: "" # (1)
  kubeletConfig:
    cpuManagerPolicy: "static" # (2)
    cpuManagerReconcilePeriod: "5s"
    reservedSystemCPUs: "0,1" # (3)
    memoryManagerPolicy: "Static" # (4)
    evictionHard:
      memory.available: "100Mi"
    kubeReserved:
      memory: "512Mi"
    reservedMemory:
      - numaNode: 0
        limits:
          memory: "1124Mi"
    systemReserved:
      memory: "512Mi"
    topologyManagerPolicy: "single-numa-node" # (5)
(1) Adjust this label to match the machineConfigPoolSelector in the NUMAResourcesOperator CR.
(2) For cpuManagerPolicy, static must use a lowercase s.
(3) Adjust this based on the CPU on your nodes.
(4) For memoryManagerPolicy, Static must use an uppercase S.
(5) topologyManagerPolicy must be set to single-numa-node.
Create the KubeletConfig CR by running the following command:
$ oc create -f nro-kubeletconfig.yaml
Note: Applying a performance profile or KubeletConfig automatically triggers rebooting of the nodes. If no reboot is triggered, you can troubleshoot the issue by looking at the labels in KubeletConfig that address the node group.
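The reservedMemory value of 1124Mi in the example KubeletConfig is not arbitrary. According to the upstream Kubernetes Memory Manager documentation, the memory reserved across NUMA nodes must equal the sum of kubeReserved, systemReserved, and the hard eviction threshold for memory. The following sketch reproduces that sum; the helper names are hypothetical and the parser handles only the Mi suffix used in the example.

```python
def mebibytes(quantity: str) -> int:
    """Parse a quantity such as "512Mi" into an integer number of MiB (Mi suffix only)."""
    if not quantity.endswith("Mi"):
        raise ValueError(f"this sketch only handles Mi quantities, got {quantity!r}")
    return int(quantity[:-2])


def required_reserved_memory_mi(kube_reserved: str, system_reserved: str, eviction_hard: str) -> int:
    """Sum the three reservations that reservedMemory has to cover."""
    return mebibytes(kube_reserved) + mebibytes(system_reserved) + mebibytes(eviction_hard)


if __name__ == "__main__":
    # Values taken from the example KubeletConfig in this section.
    total = required_reserved_memory_mi("512Mi", "512Mi", "100Mi")
    print(f"{total}Mi")  # 1124Mi
```

If you change any of the three reservations, adjust the reservedMemory limits so that the totals still match.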
8.3.6. Scheduling workloads with the NUMA-aware scheduler
Now that topo-aware-scheduler is installed, the NUMAResourcesOperator and NUMAResourcesScheduler CRs are applied, and your cluster has a matching performance profile or KubeletConfig, you can schedule workloads with the NUMA-aware scheduler by using deployment CRs that specify the minimum required resources to process the workload.
The following example deployment uses NUMA-aware scheduling for a sample workload.
Prerequisites
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
Procedure
Get the name of the NUMA-aware scheduler that is deployed in the cluster by running the following command:
$ oc get numaresourcesschedulers.nodetopology.openshift.io numaresourcesscheduler -o json | jq '.status.schedulerName'
Example output
"topo-aware-scheduler"
Create a Deployment CR that uses the scheduler named topo-aware-scheduler, for example:
Save the following YAML in the nro-deployment.yaml file:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: numa-deployment-1
  namespace: openshift-numaresources
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
    spec:
      schedulerName: topo-aware-scheduler # (1)
      containers:
      - name: ctnr
        image: quay.io/openshifttest/hello-openshift:openshift
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            memory: "100Mi"
            cpu: "10"
          requests:
            memory: "100Mi"
            cpu: "10"
      - name: ctnr2
        image: registry.access.redhat.com/rhel:latest
        imagePullPolicy: IfNotPresent
        command: ["/bin/sh", "-c"]
        args: [ "while true; do sleep 1h; done;" ]
        resources:
          limits:
            memory: "100Mi"
            cpu: "8"
          requests:
            memory: "100Mi"
            cpu: "8"
(1) schedulerName must match the name of the NUMA-aware scheduler that is deployed in your cluster, for example topo-aware-scheduler.
Create the Deployment CR by running the following command:
$ oc create -f nro-deployment.yaml
Verification
Verify that the deployment was successful:
$ oc get pods -n openshift-numaresources
Example output
NAME                                                READY   STATUS    RESTARTS   AGE
numa-deployment-1-6c4f5bdb84-wgn6g                  2/2     Running   0          5m2s
numaresources-controller-manager-7d9d84c58d-4v65j   1/1     Running   0          18m
numaresourcesoperator-worker-7d96r                  2/2     Running   4          43m
numaresourcesoperator-worker-crsht                  2/2     Running   2          43m
numaresourcesoperator-worker-jp9mw                  2/2     Running   2          43m
secondary-scheduler-847cb74f84-fpncj                1/1     Running   0          18m
Verify that the topo-aware-scheduler is scheduling the deployed pod by running the following command:
$ oc describe pod numa-deployment-1-6c4f5bdb84-wgn6g -n openshift-numaresources
Example output
Events:
  Type    Reason     Age    From                  Message
  ----    ------     ----   ----                  -------
  Normal  Scheduled  4m45s  topo-aware-scheduler  Successfully assigned openshift-numaresources/numa-deployment-1-6c4f5bdb84-wgn6g to worker-1
Note: Deployments that request more resources than are available for scheduling fail with a MinimumReplicasUnavailable error. The deployment succeeds when the required resources become available. Pods remain in the Pending state until the required resources are available.
Verify that the expected allocated resources are listed for the node.
Identify the node that is running the deployment pod by running the following command:
$ oc get pods -n openshift-numaresources -o wide
Example output
NAME                                 READY   STATUS    RESTARTS   AGE   IP            NODE       NOMINATED NODE   READINESS GATES
numa-deployment-1-6c4f5bdb84-wgn6g   0/2     Running   0          82m   10.128.2.50   worker-1   <none>           <none>
Run the following command with the name of the node that is running the deployment pod:
$ oc describe noderesourcetopologies.topology.node.k8s.io worker-1
Example output
...
Zones:
  Costs:
    Name:   node-0
    Value:  10
    Name:   node-1
    Value:  21
  Name:     node-0
  Resources:
    Allocatable:  39
    Available:    21 (1)
    Capacity:     40
    Name:         cpu
    Allocatable:  6442450944
    Available:    6442450944
    Capacity:     6442450944
    Name:         hugepages-1Gi
    Allocatable:  134217728
    Available:    134217728
    Capacity:     134217728
    Name:         hugepages-2Mi
    Allocatable:  262415904768
    Available:    262206189568
    Capacity:     270146007040
    Name:         memory
  Type:           Node
(1) The Available capacity is reduced because of the resources that have been allocated to the guaranteed pod.
Resources consumed by guaranteed pods are subtracted from the available node resources listed under noderesourcetopologies.topology.node.k8s.io.
Resource allocations for pods with a BestEffort or Burstable quality of service (qosClass) are not reflected in the NUMA node resources under noderesourcetopologies.topology.node.k8s.io. If a pod's consumed resources are not reflected in the node resource calculation, verify that the pod has a qosClass of Guaranteed and that the CPU request is an integer value, not a decimal value. You can verify that the pod has a qosClass of Guaranteed by running the following command:
$ oc get pod numa-deployment-1-6c4f5bdb84-wgn6g -n openshift-numaresources -o jsonpath="{ .status.qosClass }"
Example output
Guaranteed
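The Available values in the node-0 example output can be reproduced from the deployment's requests. The sketch below is only a worked subtraction, assuming that both containers of the guaranteed pod were placed in NUMA zone node-0, which is what the example output suggests. The function name is hypothetical.

```python
MI = 1024 * 1024  # bytes in one mebibyte


def available_after_guaranteed(allocatable: int, guaranteed_requests) -> int:
    """Subtract the requests of guaranteed containers from a zone's allocatable amount."""
    return allocatable - sum(guaranteed_requests)


if __name__ == "__main__":
    # CPU: 39 allocatable, containers ctnr and ctnr2 request 10 and 8 CPUs.
    print(available_after_guaranteed(39, [10, 8]))  # 21

    # Memory: each container requests 100Mi.
    print(available_after_guaranteed(262415904768, [100 * MI, 100 * MI]))  # 262206189568
```

Both results match the Available fields for cpu and memory in the example output. The hugepages values are unchanged because the deployment does not request huge pages.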
8.4. Optional: Configuring polling operations for NUMA resources updates
The daemons controlled by the NUMA Resources Operator in their nodeGroup poll resources to retrieve updates about available NUMA resources. You can fine-tune polling operations for these daemons by configuring the spec.nodeGroups specification in the NUMAResourcesOperator custom resource (CR). This provides advanced control of polling operations. Configure these specifications to improve scheduling behavior and troubleshoot suboptimal scheduling decisions.
The configuration options are the following:
- infoRefreshMode: Determines the trigger condition for polling the kubelet. The NUMA Resources Operator reports the resulting information to the API server.
- infoRefreshPeriod: Determines the duration between polling updates.
- podsFingerprinting: Determines if point-in-time information for the current set of pods running on a node is exposed in polling updates.
Note: podsFingerprinting is enabled by default. podsFingerprinting is a requirement for the cacheResyncPeriod specification in the NUMAResourcesScheduler CR. The cacheResyncPeriod specification helps to report more exact resource availability by monitoring pending resources on nodes.
Prerequisites
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
- Install the NUMA Resources Operator.
Procedure
Configure the spec.nodeGroups specification in your NUMAResourcesOperator CR:
apiVersion: nodetopology.openshift.io/v1
kind: NUMAResourcesOperator
metadata:
  name: numaresourcesoperator
spec:
  nodeGroups:
  - config:
      infoRefreshMode: Periodic # (1)
      infoRefreshPeriod: 10s # (2)
      podsFingerprinting: Enabled # (3)
    name: worker
(1) Valid values are Periodic, Events, and PeriodicAndEvents. Use Periodic to poll the kubelet at intervals that you define in infoRefreshPeriod. Use Events to poll the kubelet at every pod lifecycle event. Use PeriodicAndEvents to enable both methods.
(2) Define the polling interval for Periodic or PeriodicAndEvents refresh modes. The field is ignored if the refresh mode is Events.
(3) Valid values are Enabled, Disabled, and EnabledExclusiveResources. Setting to Enabled is a requirement for the cacheResyncPeriod specification in the NUMAResourcesScheduler.
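The interaction between infoRefreshMode and infoRefreshPeriod can be summarized with a short sketch. It is a model of the documented semantics only, not Operator code: the `polling_triggers` function and its return strings are hypothetical.

```python
VALID_MODES = ("Periodic", "Events", "PeriodicAndEvents")


def polling_triggers(mode: str, period: str = "10s"):
    """Return the conditions that trigger a kubelet poll for a given refresh mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"infoRefreshMode must be one of {VALID_MODES}, got {mode!r}")
    triggers = []
    if mode in ("Periodic", "PeriodicAndEvents"):
        # infoRefreshPeriod applies only to the two periodic modes.
        triggers.append(f"every {period}")
    if mode in ("Events", "PeriodicAndEvents"):
        triggers.append("on every pod lifecycle event")
    return triggers


if __name__ == "__main__":
    for mode in VALID_MODES:
        print(mode, polling_triggers(mode, "10s"))
```

In Events mode the period does not appear in the result, which mirrors the statement that the field is ignored for that mode.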
Verification
After you deploy the NUMA Resources Operator, verify that the node group configurations were applied by running the following command:
$ oc get numaresop numaresourcesoperator -o json | jq '.status'
Example output
...
        "config": {
          "infoRefreshMode": "Periodic",
          "infoRefreshPeriod": "10s",
          "podsFingerprinting": "Enabled"
        },
        "name": "worker"
...
8.5. Troubleshooting NUMA-aware scheduling
To troubleshoot common problems with NUMA-aware pod scheduling, perform the following steps.
Prerequisites
- Install the OpenShift Container Platform CLI (oc).
- Log in as a user with cluster-admin privileges.
- Install the NUMA Resources Operator and deploy the NUMA-aware secondary scheduler.
Procedure
Verify that the noderesourcetopologies CRD is deployed in the cluster by running the following command:
$ oc get crd | grep noderesourcetopologies
Example output
NAME                                          CREATED AT
noderesourcetopologies.topology.node.k8s.io   2022-01-18T08:28:06Z
Check that the NUMA-aware scheduler name matches the name specified in your NUMA-aware workloads by running the following command:
$ oc get numaresourcesschedulers.nodetopology.openshift.io numaresourcesscheduler -o json | jq '.status.schedulerName'
Example output
"topo-aware-scheduler"
Verify that NUMA-aware schedulable nodes have the noderesourcetopologies CR applied to them. Run the following command:
$ oc get noderesourcetopologies.topology.node.k8s.io
Example output
NAME                    AGE
compute-0.example.com   17h
compute-1.example.com   17h
Note: The number of nodes should equal the number of worker nodes that are configured by the machine config pool (mcp) worker definition.
Verify the NUMA zone granularity for all schedulable nodes by running the following command:
$ oc get noderesourcetopologies.topology.node.k8s.io -o yaml
Example output
apiVersion: v1
items:
- apiVersion: topology.node.k8s.io/v1
  kind: NodeResourceTopology
  metadata:
    annotations:
      k8stopoawareschedwg/rte-update: periodic
    creationTimestamp: "2022-06-16T08:55:38Z"
    generation: 63760
    name: worker-0
    resourceVersion: "8450223"
    uid: 8b77be46-08c0-4074-927b-d49361471590
  topologyPolicies:
  - SingleNUMANodeContainerLevel
  zones:
  - costs:
    - name: node-0
      value: 10
    - name: node-1
      value: 21
    name: node-0
    resources:
    - allocatable: "38"
      available: "38"
      capacity: "40"
      name: cpu
    - allocatable: "134217728"
      available: "134217728"
      capacity: "134217728"
      name: hugepages-2Mi
    - allocatable: "262352048128"
      available: "262352048128"
      capacity: "270107316224"
      name: memory
    - allocatable: "6442450944"
      available: "6442450944"
      capacity: "6442450944"
      name: hugepages-1Gi
    type: Node
  - costs:
    - name: node-0
      value: 21
    - name: node-1
      value: 10
    name: node-1
    resources:
    - allocatable: "268435456"
      available: "268435456"
      capacity: "268435456"
      name: hugepages-2Mi
    - allocatable: "269231067136"
      available: "269231067136"
      capacity: "270573244416"
      name: memory
    - allocatable: "40"
      available: "40"
      capacity: "40"
      name: cpu
    - allocatable: "1073741824"
      available: "1073741824"
      capacity: "1073741824"
      name: hugepages-1Gi
    type: Node
- apiVersion: topology.node.k8s.io/v1
  kind: NodeResourceTopology
  metadata:
    annotations:
      k8stopoawareschedwg/rte-update: periodic
    creationTimestamp: "2022-06-16T08:55:37Z"
    generation: 62061
    name: worker-1
    resourceVersion: "8450129"
    uid: e8659390-6f8d-4e67-9a51-1ea34bba1cc3
  topologyPolicies:
  - SingleNUMANodeContainerLevel
  zones:
  - costs:
    - name: node-0
      value: 10
    - name: node-1
      value: 21
    name: node-0
    resources:
    - allocatable: "38"
      available: "38"
      capacity: "40"
      name: cpu
    - allocatable: "6442450944"
      available: "6442450944"
      capacity: "6442450944"
      name: hugepages-1Gi
    - allocatable: "134217728"
      available: "134217728"
      capacity: "134217728"
      name: hugepages-2Mi
    - allocatable: "262391033856"
      available: "262391033856"
      capacity: "270146301952"
      name: memory
    type: Node
  - costs:
    - name: node-0
      value: 21
    - name: node-1
      value: 10
    name: node-1
    resources:
    - allocatable: "40"
      available: "40"
      capacity: "40"
      name: cpu
    - allocatable: "1073741824"
      available: "1073741824"
      capacity: "1073741824"
      name: hugepages-1Gi
    - allocatable: "268435456"
      available: "268435456"
      capacity: "268435456"
      name: hugepages-2Mi
    - allocatable: "269192085504"
      available: "269192085504"
      capacity: "270534262784"
      name: memory
    type: Node
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
8.5.1. Reporting more exact resource availability
Enable the cacheResyncPeriod specification to help the NUMA Resources Operator report more exact resource availability by monitoring pending resources on nodes and synchronizing this information in the scheduler cache at a defined interval. This also helps to minimize Topology Affinity Error errors because of sub-optimal scheduling decisions. The lower the interval, the greater the network load. The cacheResyncPeriod specification is disabled by default.
Prerequisites
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
Procedure
Delete the currently running NUMAResourcesScheduler resource:
Get the active NUMAResourcesScheduler by running the following command:
$ oc get NUMAResourcesScheduler
Example output
NAME                     AGE
numaresourcesscheduler   92m
Delete the secondary scheduler resource by running the following command:
$ oc delete NUMAResourcesScheduler numaresourcesscheduler
Example output
numaresourcesscheduler.nodetopology.openshift.io "numaresourcesscheduler" deleted
Save the following YAML in the file nro-scheduler-cacheresync.yaml. This example sets the cacheResyncPeriod specification to 5s:
apiVersion: nodetopology.openshift.io/v1
kind: NUMAResourcesScheduler
metadata:
  name: numaresourcesscheduler
spec:
  imageSpec: "registry.redhat.io/openshift4/noderesourcetopology-scheduler-container-rhel8:v4.13"
  cacheResyncPeriod: "5s" # (1)
(1) Enter an interval value in seconds for synchronization of the scheduler cache. A value of 5s is typical for most implementations.
Create the updated NUMAResourcesScheduler resource by running the following command:
$ oc create -f nro-scheduler-cacheresync.yaml
Example output
numaresourcesscheduler.nodetopology.openshift.io/numaresourcesscheduler created
Verification steps
Check that the NUMA-aware scheduler was successfully deployed:
Run the following command to check that the CRD is created successfully:
$ oc get crd | grep numaresourcesschedulers
Example output
NAME                                                CREATED AT
numaresourcesschedulers.nodetopology.openshift.io   2022-02-25T11:57:03Z
Check that the new custom scheduler is available by running the following command:
$ oc get numaresourcesschedulers.nodetopology.openshift.io
Example output
NAME                     AGE
numaresourcesscheduler   3h26m
Check the logs for the secondary scheduler:
Get the list of pods running in the openshift-numaresources namespace by running the following command:
$ oc get pods -n openshift-numaresources
Example output
NAME                                               READY   STATUS    RESTARTS   AGE
numaresources-controller-manager-d87d79587-76mrm   1/1     Running   0          46h
numaresourcesoperator-worker-5wm2k                 2/2     Running   0          45h
numaresourcesoperator-worker-pb75c                 2/2     Running   0          45h
secondary-scheduler-7976c4d466-qm4sc               1/1     Running   0          21m
Get the logs for the secondary scheduler pod by running the following command:
$ oc logs secondary-scheduler-7976c4d466-qm4sc -n openshift-numaresources
Example output
...
I0223 11:04:55.614788       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Namespace total 11 items received
I0223 11:04:56.609114       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.ReplicationController total 10 items received
I0223 11:05:22.626818       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.StorageClass total 7 items received
I0223 11:05:31.610356       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.PodDisruptionBudget total 7 items received
I0223 11:05:31.713032       1 eventhandlers.go:186] "Add event for scheduled pod" pod="openshift-marketplace/certified-operators-thtvq"
I0223 11:05:53.461016       1 eventhandlers.go:244] "Delete event for scheduled pod" pod="openshift-marketplace/certified-operators-thtvq"
8.5.2. Checking the NUMA-aware scheduler logs
Troubleshoot problems with the NUMA-aware scheduler by reviewing the logs. If required, you can increase the scheduler log level by modifying the spec.logLevel field of the NUMAResourcesScheduler resource. Acceptable values are Normal, Debug, and Trace, with Trace being the most verbose option.
To change the log level of the secondary scheduler, delete the running scheduler resource and re-deploy it with the changed log level. The scheduler is unavailable for scheduling new workloads during this downtime.
Prerequisites
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
Procedure
Delete the currently running NUMAResourcesScheduler resource:
Get the active NUMAResourcesScheduler by running the following command:
$ oc get NUMAResourcesScheduler
Example output
NAME                     AGE
numaresourcesscheduler   90m
Delete the secondary scheduler resource by running the following command:
$ oc delete NUMAResourcesScheduler numaresourcesscheduler
Example output
numaresourcesscheduler.nodetopology.openshift.io "numaresourcesscheduler" deleted
Save the following YAML in the file nro-scheduler-debug.yaml. This example changes the log level to Debug:
apiVersion: nodetopology.openshift.io/v1
kind: NUMAResourcesScheduler
metadata:
  name: numaresourcesscheduler
spec:
  imageSpec: "registry.redhat.io/openshift4/noderesourcetopology-scheduler-container-rhel8:v4.13"
  logLevel: Debug
Create the updated Debug logging NUMAResourcesScheduler resource by running the following command:
$ oc create -f nro-scheduler-debug.yaml
Example output
numaresourcesscheduler.nodetopology.openshift.io/numaresourcesscheduler created
Verification steps
Check that the NUMA-aware scheduler was successfully deployed:
Run the following command to check that the CRD is created successfully:
$ oc get crd | grep numaresourcesschedulers
Example output
NAME                                                CREATED AT
numaresourcesschedulers.nodetopology.openshift.io   2022-02-25T11:57:03Z
Check that the new custom scheduler is available by running the following command:
$ oc get numaresourcesschedulers.nodetopology.openshift.io
Example output
NAME                     AGE
numaresourcesscheduler   3h26m
Check that the logs for the scheduler show the increased log level:
Get the list of pods running in the openshift-numaresources namespace by running the following command:
$ oc get pods -n openshift-numaresources
Example output
NAME                                               READY   STATUS    RESTARTS   AGE
numaresources-controller-manager-d87d79587-76mrm   1/1     Running   0          46h
numaresourcesoperator-worker-5wm2k                 2/2     Running   0          45h
numaresourcesoperator-worker-pb75c                 2/2     Running   0          45h
secondary-scheduler-7976c4d466-qm4sc               1/1     Running   0          21m
Get the logs for the secondary scheduler pod by running the following command:
$ oc logs secondary-scheduler-7976c4d466-qm4sc -n openshift-numaresources
Example output
...
I0223 11:04:55.614788 1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Namespace total 11 items received
I0223 11:04:56.609114 1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.ReplicationController total 10 items received
I0223 11:05:22.626818 1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.StorageClass total 7 items received
I0223 11:05:31.610356 1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.PodDisruptionBudget total 7 items received
I0223 11:05:31.713032 1 eventhandlers.go:186] "Add event for scheduled pod" pod="openshift-marketplace/certified-operators-thtvq"
I0223 11:05:53.461016 1 eventhandlers.go:244] "Delete event for scheduled pod" pod="openshift-marketplace/certified-operators-thtvq"
8.5.3. Troubleshooting the resource topology exporter
Troubleshoot noderesourcetopologies objects where unexpected results are occurring by inspecting the corresponding resource-topology-exporter logs.
It is recommended that NUMA resource topology exporter instances in the cluster are named for the nodes they refer to. For example, a worker node with the name worker should have a corresponding noderesourcetopologies object called worker.
Prerequisites
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
Procedure
Get the daemonsets managed by the NUMA Resources Operator. Each daemonset has a corresponding nodeGroup in the NUMAResourcesOperator CR. Run the following command:
$ oc get numaresourcesoperators.nodetopology.openshift.io numaresourcesoperator -o jsonpath="{.status.daemonsets[0]}"
Example output
{"name":"numaresourcesoperator-worker","namespace":"openshift-numaresources"}
Get the label for the daemonset of interest using the value for name from the previous step:
$ oc get ds -n openshift-numaresources numaresourcesoperator-worker -o jsonpath="{.spec.selector.matchLabels}"
Example output
{"name":"resource-topology"}
Get the pods using the resource-topology label by running the following command:
$ oc get pods -n openshift-numaresources -l name=resource-topology -o wide
Example output
NAME                                 READY   STATUS    RESTARTS   AGE    IP            NODE
numaresourcesoperator-worker-5wm2k   2/2     Running   0          2d1h   10.135.0.64   compute-0.example.com
numaresourcesoperator-worker-pb75c   2/2     Running   0          2d1h   10.132.2.33   compute-1.example.com
Examine the logs of the resource-topology-exporter container running on the worker pod that corresponds to the node you are troubleshooting. Run the following command:
$ oc logs -n openshift-numaresources -c resource-topology-exporter numaresourcesoperator-worker-pb75c
Example output
I0221 13:38:18.334140 1 main.go:206] using sysinfo:
reservedCpus: 0,1
reservedMemory:
  "0": 1178599424
I0221 13:38:18.334370 1 main.go:67] === System information ===
I0221 13:38:18.334381 1 sysinfo.go:231] cpus: reserved "0-1"
I0221 13:38:18.334493 1 sysinfo.go:237] cpus: online "0-103"
I0221 13:38:18.546750 1 main.go:72]
cpus: allocatable "2-103"
hugepages-1Gi:
  numa cell 0 -> 6
  numa cell 1 -> 1
hugepages-2Mi:
  numa cell 0 -> 64
  numa cell 1 -> 128
memory:
  numa cell 0 -> 45758Mi
  numa cell 1 -> 48372Mi
8.5.4. Correcting a missing resource topology exporter config map
If you install the NUMA Resources Operator in a cluster with misconfigured cluster settings, in some circumstances, the Operator is shown as active but the logs of the resource topology exporter (RTE) daemon set pods show that the configuration for the RTE is missing, for example:
Info: couldn't find configuration in "/etc/resource-topology-exporter/config.yaml"
This log message indicates that the kubeletconfig with the required configuration was not properly applied in the cluster, resulting in a missing RTE configmap. For example, the following cluster is missing a numaresourcesoperator-worker configmap custom resource (CR):
$ oc get configmap
Example output
NAME                          DATA   AGE
0e2a6bd3.openshift-kni.io     0      6d21h
kube-root-ca.crt              1      6d21h
openshift-service-ca.crt      1      6d21h
topo-aware-scheduler-config   1      6d18h
In a correctly configured cluster, oc get configmap also returns a numaresourcesoperator-worker configmap CR.
Prerequisites
- Install the OpenShift Container Platform CLI (oc).
- Log in as a user with cluster-admin privileges.
- Install the NUMA Resources Operator and deploy the NUMA-aware secondary scheduler.
Procedure
Compare the values for spec.machineConfigPoolSelector.matchLabels in kubeletconfig and metadata.labels in the MachineConfigPool (mcp) worker CR using the following commands:
Check the kubeletconfig labels by running the following command:
$ oc get kubeletconfig -o yaml
Example output
machineConfigPoolSelector:
  matchLabels:
    cnf-worker-tuning: enabled
Check the mcp labels by running the following command:
$ oc get mcp worker -o yaml
Example output
labels:
  machineconfiguration.openshift.io/mco-built-in: ""
  pools.operator.machineconfiguration.openshift.io/worker: ""
The cnf-worker-tuning: enabled label is not present in the MachineConfigPool object.
Edit the MachineConfigPool CR to include the missing label, for example:
$ oc edit mcp worker -o yaml
Example output
labels:
  machineconfiguration.openshift.io/mco-built-in: ""
  pools.operator.machineconfiguration.openshift.io/worker: ""
  cnf-worker-tuning: enabled
- Apply the label changes and wait for the cluster to apply the updated configuration. Run the following command:
Verification
Check that the missing numaresourcesoperator-worker configmap CR is applied:
$ oc get configmap
Example output
NAME                           DATA   AGE
0e2a6bd3.openshift-kni.io      0      6d21h
kube-root-ca.crt               1      6d21h
numaresourcesoperator-worker   1      5m
openshift-service-ca.crt       1      6d21h
topo-aware-scheduler-config    1      6d18h
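The label comparison in the procedure above follows the usual Kubernetes matchLabels rule: every key and value in the selector must appear in the target object's labels. The following Python sketch illustrates that rule with the label values from the example output. The function name selector_matches is hypothetical and is not part of oc or any OpenShift library.

```python
def selector_matches(match_labels: dict, pool_labels: dict) -> bool:
    """Return True if every selector key/value pair is present in pool_labels."""
    return all(pool_labels.get(key) == value for key, value in match_labels.items())


# Selector taken from the kubeletconfig example output.
kubeletconfig_selector = {"cnf-worker-tuning": "enabled"}

# Labels of the worker MachineConfigPool before the fix.
mcp_labels_before = {
    "machineconfiguration.openshift.io/mco-built-in": "",
    "pools.operator.machineconfiguration.openshift.io/worker": "",
}

# Labels after adding the missing label.
mcp_labels_after = dict(mcp_labels_before, **{"cnf-worker-tuning": "enabled"})

print(selector_matches(kubeletconfig_selector, mcp_labels_before))
print(selector_matches(kubeletconfig_selector, mcp_labels_after))
```

The first check fails because the pool lacks the cnf-worker-tuning label, which corresponds to the missing config map situation described in this section. The second check succeeds once the label is added.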
Chapter 9. Scalability and performance optimization
9.1. Optimizing storage
Optimizing storage helps to minimize storage use across all resources. By optimizing storage, administrators help ensure that existing storage resources are working in an efficient manner.
9.1.1. Available persistent storage options
Understand your persistent storage options so that you can optimize your OpenShift Container Platform environment.
Storage type | Description | Examples |
---|---|---|
Block |
| AWS EBS and VMware vSphere support dynamic persistent volume (PV) provisioning natively in the OpenShift Container Platform. |
File |
| RHEL NFS, NetApp NFS [1], and Vendor NFS |
Object |
| AWS S3 |
- NetApp NFS supports dynamic PV provisioning when using the Trident plugin.
9.1.2. Recommended configurable storage technology
The following table summarizes the recommended and configurable storage technologies for the given OpenShift Container Platform cluster application.
Storage type | Block | File | Object |
---|---|---|---|
ROX [1] | Yes [4] | Yes [4] | Yes |
RWX [2] | No | Yes | Yes |
Registry | Configurable | Configurable | Recommended |
Scaled registry | Not configurable | Configurable | Recommended |
Metrics [3] | Recommended | Configurable [5] | Not configurable |
Elasticsearch Logging | Recommended | Configurable [6] | Not supported [6] |
Loki Logging | Not configurable | Not configurable | Recommended |
Apps | Recommended | Recommended | Not configurable [7] |
- [1] ReadOnlyMany
- [2] ReadWriteMany
- [3] Prometheus is the underlying technology used for metrics.
- [4] This does not apply to physical disk, VM physical disk, VMDK, loopback over NFS, AWS EBS, and Azure Disk.
- [5] For metrics, using file storage with the
- [6] For logging, review the recommended storage solution in Configuring persistent storage for the log store section. Using NFS storage as a persistent volume or through NAS, such as Gluster, can corrupt the data. Hence, NFS is not supported for Elasticsearch storage and LokiStack log store in OpenShift Container Platform Logging. You must use one persistent volume type per log store.
- [7] Object storage is not consumed through OpenShift Container Platform's PVs or PVCs. Apps must integrate with the object storage REST API.
A scaled registry is an OpenShift image registry where two or more pod replicas are running.
9.1.2.1. Specific application storage recommendations
Testing shows issues with using the NFS server on Red Hat Enterprise Linux (RHEL) as a storage backend for core services. This includes the OpenShift Container Registry and Quay, Prometheus for monitoring storage, and Elasticsearch for logging storage. Therefore, using RHEL NFS to back PVs used by core services is not recommended.
Other NFS implementations in the marketplace might not have these issues. Contact the individual NFS implementation vendor for more information on any testing that was possibly completed against these OpenShift Container Platform core components.
9.1.2.1.1. Registry
In a non-scaled/high-availability (HA) OpenShift image registry cluster deployment:
- The storage technology does not have to support RWX access mode.
- The storage technology must ensure read-after-write consistency.
- The preferred storage technology is object storage followed by block storage.
- File storage is not recommended for OpenShift image registry cluster deployment with production workloads.
9.1.2.1.2. Scaled registry
In a scaled/HA OpenShift image registry cluster deployment:
- The storage technology must support RWX access mode.
- The storage technology must ensure read-after-write consistency.
- The preferred storage technology is object storage.
- Red Hat OpenShift Data Foundation (ODF), Amazon Simple Storage Service (Amazon S3), Google Cloud Storage (GCS), Microsoft Azure Blob Storage, and OpenStack Swift are supported.
- Object storage should be S3 or Swift compliant.
- For non-cloud platforms, such as vSphere and bare metal installations, the only configurable technology is file storage.
- Block storage is not configurable.
- The use of Network File System (NFS) storage with OpenShift Container Platform is supported. However, the use of NFS storage with a scaled registry can cause known issues. For more information, see the Red Hat Knowledgebase solution, Is NFS supported for OpenShift cluster internal components in Production?.
9.1.2.1.3. Metrics
In an OpenShift Container Platform hosted metrics cluster deployment:
- The preferred storage technology is block storage.
- Object storage is not configurable.
It is not recommended to use file storage for a hosted metrics cluster deployment with production workloads.
9.1.2.1.4. Logging
In an OpenShift Container Platform hosted logging cluster deployment:
Loki Operator:
- The preferred storage technology is S3 compatible Object storage.
- Block storage is not configurable.
OpenShift Elasticsearch Operator:
- The preferred storage technology is block storage.
- Object storage is not supported.
As of logging version 5.4.3 the OpenShift Elasticsearch Operator is deprecated and is planned to be removed in a future release. Red Hat will provide bug fixes and support for this feature during the current release lifecycle, but this feature will no longer receive enhancements and will be removed. As an alternative to using the OpenShift Elasticsearch Operator to manage the default log storage, you can use the Loki Operator.
9.1.2.1.5. Applications
Application use cases vary from application to application, as described in the following examples:
- Storage technologies that support dynamic PV provisioning have low mount time latencies, and are not tied to nodes to support a healthy cluster.
- Application developers are responsible for knowing and understanding the storage requirements for their application, and how it works with the provided storage to ensure that issues do not occur when an application scales or interacts with the storage layer.
9.1.2.2. Other specific application storage recommendations
It is not recommended to use RAID configurations on write-intensive workloads, such as etcd. If you are running etcd with a RAID configuration, you might be at risk of encountering performance issues with your workloads.
- Red Hat OpenStack Platform (RHOSP) Cinder: RHOSP Cinder tends to be adept in ROX access mode use cases.
- Databases: Databases (RDBMSs, NoSQL DBs, etc.) tend to perform best with dedicated block storage.
- The etcd database must have enough storage and adequate performance capacity to enable a large cluster. Information about monitoring and benchmarking tools to establish ample storage and a high-performance environment is described in Recommended etcd practices.
9.1.3. Data storage management
The following table summarizes the main directories that OpenShift Container Platform components write data to.
Directory | Notes | Sizing | Expected growth |
---|---|---|---|
/var/log | Log files for all components. | 10 to 30 GB. | Log files can grow quickly; size can be managed by growing disks or by using log rotate. |
/var/lib/etcd | Used for etcd storage when storing the database. | Less than 20 GB. Database can grow up to 8 GB. | Will grow slowly with the environment. Only storing metadata. Additional 20-25 GB for every additional 8 GB of memory. |
/var/lib/containers | This is the mount point for the CRI-O runtime. Storage used for active container runtimes, including pods, and storage of local images. Not used for registry storage. | 50 GB for a node with 16 GB memory. Note that this sizing should not be used to determine minimum cluster requirements. Additional 20-25 GB for every additional 8 GB of memory. | Growth is limited by capacity for running containers. |
/var/lib/kubelet | Ephemeral volume storage for pods. This includes anything external that is mounted into a container at runtime. Includes environment variables, kube secrets, and data volumes not backed by persistent volumes. | Varies | Minimal if pods requiring storage are using persistent volumes. If using ephemeral storage, this can grow quickly. |
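The /var/lib/containers guidance in the table (50 GB for a node with 16 GB of memory, plus 20-25 GB for every additional 8 GB) can be turned into a rough estimate. The following Python sketch is only an illustration: the function name is hypothetical, and rounding partial 8 GB increments up is an assumption that the table does not state. As the table notes, this sizing should not be used to determine minimum cluster requirements.

```python
import math
from typing import Tuple


def containers_storage_estimate_gb(node_memory_gb: float) -> Tuple[int, int]:
    """Return a (low, high) estimate in GB for /var/lib/containers.

    Baseline: 50 GB for a node with 16 GB of memory.
    Each additional 8 GB of memory adds 20 GB (low) to 25 GB (high).
    Assumption: partial 8 GB increments are rounded up.
    """
    extra_increments = max(0, math.ceil((node_memory_gb - 16) / 8))
    return 50 + 20 * extra_increments, 50 + 25 * extra_increments


for memory_gb in (16, 32, 64):
    low, high = containers_storage_estimate_gb(memory_gb)
    print(f"{memory_gb} GB memory: {low} to {high} GB")
```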
9.1.4. Optimizing storage performance for Microsoft Azure
OpenShift Container Platform and Kubernetes are sensitive to disk performance, and faster storage is recommended, particularly for etcd on the control plane nodes.
For production Azure clusters and clusters with intensive workloads, the virtual machine operating system disk for control plane machines should be able to sustain a tested and recommended minimum throughput of 5000 IOPS / 200 MBps. This throughput can be provided by having a minimum of 1 TiB Premium SSD (P30). In Azure and Azure Stack Hub, disk performance is directly dependent on SSD disk sizes. To achieve the throughput supported by a Standard_D8s_v3 virtual machine, or other similar machine types, and the target of 5000 IOPS, at least a P30 disk is required.
Host caching must be set to ReadOnly for low latency and high IOPS and throughput when reading data. Reading data from the cache, which is present either in the VM memory or in the local SSD disk, is much faster than reading from the disk, which is in the blob storage.
9.1.5. Additional resources
9.2. Optimizing routing
The OpenShift Container Platform HAProxy router can be scaled or configured to optimize performance.
9.2.1. Baseline Ingress Controller (router) performance
The OpenShift Container Platform Ingress Controller, or router, is the ingress point for ingress traffic for applications and services that are configured using routes and ingresses.
When evaluating a single HAProxy router performance in terms of HTTP requests handled per second, the performance varies depending on many factors. In particular:
- HTTP keep-alive/close mode
- Route type
- TLS session resumption client support
- Number of concurrent connections per target route
- Number of target routes
- Back end server page size
- Underlying infrastructure (network/SDN solution, CPU, and so on)
While performance in your specific environment will vary, Red Hat lab tests were run on a public cloud instance of size 4 vCPU/16 GB RAM. In these tests, a single HAProxy router handling 100 routes terminated by backends serving 1 kB static pages was able to handle the following number of transactions per second.
In HTTP keep-alive mode scenarios:
Encryption | LoadBalancerService | HostNetwork |
---|---|---|
none | 21515 | 29622 |
edge | 16743 | 22913 |
passthrough | 36786 | 53295 |
re-encrypt | 21583 | 25198 |
In HTTP close (no keep-alive) scenarios:
Encryption | LoadBalancerService | HostNetwork |
---|---|---|
none | 5719 | 8273 |
edge | 2729 | 4069 |
passthrough | 4121 | 5344 |
re-encrypt | 2320 | 2941 |
The default Ingress Controller configuration was used with the spec.tuningOptions.threadCount field set to 4. Two different endpoint publishing strategies were tested: Load Balancer Service and Host Network. TLS session resumption was used for encrypted routes. With HTTP keep-alive, a single HAProxy router is capable of saturating a 1 Gbit NIC at page sizes as small as 8 kB.
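As a rough plausibility check of the 1 Gbit statement above, the following Python sketch estimates how many responses per second are needed to fill a link at a given page size. It assumes 1 kB = 1024 bytes and ignores HTTP headers, TCP/IP framing, and TLS overhead. The function name is hypothetical and is not part of any OpenShift tooling.

```python
def responses_to_saturate(link_bits_per_second: float, page_bytes: int) -> float:
    """Responses per second whose payload alone fills the link."""
    return link_bits_per_second / (page_bytes * 8)


one_gbit = 1_000_000_000  # 1 Gbit/s link
page_8kb = 8 * 1024       # assumption: 1 kB = 1024 bytes

print(round(responses_to_saturate(one_gbit, page_8kb)))
```

For an 8 kB page this gives roughly 15,259 responses per second. The keep-alive figures in the tables above were measured with 1 kB pages, so they are not directly comparable, but they are of the same order of magnitude.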
When running on bare metal with modern processors, you can expect roughly twice the performance of the public cloud instance above. This overhead is introduced by the virtualization layer in place on public clouds and holds mostly true for private cloud-based virtualization as well. The following table is a guide to how many applications to use behind the router:
Number of applications | Application type |
---|---|
5-10 | static file/web server or caching proxy |
100-1000 | applications generating dynamic content |
In general, HAProxy can support routes for up to 1000 applications, depending on the technology in use. Ingress Controller performance might be limited by the capabilities and performance of the applications behind it, such as language or static versus dynamic content.
Ingress, or router, sharding should be used to serve more routes towards applications and help horizontally scale the routing tier.
For more information on Ingress sharding, see Configuring Ingress Controller sharding by using route labels and Configuring Ingress Controller sharding by using namespace labels.
You can modify the Ingress Controller deployment by using the information provided in Setting Ingress Controller thread count for threads and Ingress Controller configuration parameters for timeouts, and other tuning configurations in the Ingress Controller specification.
9.2.2. Configuring Ingress Controller liveness, readiness, and startup probes
Cluster administrators can configure the timeout values for the kubelet’s liveness, readiness, and startup probes for router deployments that are managed by the OpenShift Container Platform Ingress Controller (router). The liveness and readiness probes of the router use the default timeout value of 1 second, which is too brief when networking or runtime performance is severely degraded. Probe timeouts can cause unwanted router restarts that interrupt application connections. The ability to set larger timeout values can reduce the risk of unnecessary and unwanted restarts.
You can update the timeoutSeconds value on the livenessProbe, readinessProbe, and startupProbe parameters of the router container.
Parameter | Description |
---|---|
|
The |
|
The |
|
The |
The timeout configuration option is an advanced tuning technique that can be used to work around issues. However, any issue that causes probes to time out should eventually be diagnosed, and a support case or Jira issue should possibly be opened for it.
The following example demonstrates how you can directly patch the default router deployment to set a 5-second timeout for the liveness and readiness probes:
$ oc -n openshift-ingress patch deploy/router-default --type=strategic --patch='{"spec":{"template":{"spec":{"containers":[{"name":"router","livenessProbe":{"timeoutSeconds":5},"readinessProbe":{"timeoutSeconds":5}}]}}}}'
Verification
$ oc -n openshift-ingress describe deploy/router-default | grep -e Liveness: -e Readiness:
Liveness: http-get http://:1936/healthz delay=0s timeout=5s period=10s #success=1 #failure=3
Readiness: http-get http://:1936/healthz/ready delay=0s timeout=5s period=10s #success=1 #failure=3
9.2.3. Configuring HAProxy reload interval
When you update a route or an endpoint associated with a route, the OpenShift Container Platform router updates the configuration for HAProxy. Then, HAProxy reloads the updated configuration for those changes to take effect. When HAProxy reloads, it generates a new process that handles new connections using the updated configuration.
HAProxy keeps the old process running to handle existing connections until those connections are all closed. When old processes have long-lived connections, these processes can accumulate and consume resources.
The default minimum HAProxy reload interval is five seconds. You can configure an Ingress Controller using its spec.tuningOptions.reloadInterval field to set a longer minimum reload interval.
Setting a large value for the minimum HAProxy reload interval can cause latency in observing updates to routes and their endpoints. To lessen the risk, avoid setting a value larger than the tolerable latency for updates.
Procedure
Change the minimum HAProxy reload interval of the default Ingress Controller to 15 seconds by running the following command:
$ oc -n openshift-ingress-operator patch ingresscontrollers/default --type=merge --patch='{"spec":{"tuningOptions":{"reloadInterval":"15s"}}}'
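The --patch argument in the command above is a JSON merge patch whose nesting mirrors the spec.tuningOptions.reloadInterval field path. The following Python sketch generates the same patch string for other interval values. The helper name reload_interval_patch is hypothetical.

```python
import json


def reload_interval_patch(interval: str) -> str:
    """Build a JSON merge patch that sets spec.tuningOptions.reloadInterval."""
    patch = {"spec": {"tuningOptions": {"reloadInterval": interval}}}
    # Compact separators produce the same form as the command shown above.
    return json.dumps(patch, separators=(",", ":"))


print(reload_interval_patch("15s"))
```

The resulting string can be passed to the --patch option of the oc patch command shown in the procedure.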
9.3. Optimizing networking
The OpenShift SDN uses OpenvSwitch, virtual extensible LAN (VXLAN) tunnels, OpenFlow rules, and iptables. This network can be tuned by using jumbo frames, multi-queue, and ethtool settings.
OVN-Kubernetes uses Generic Network Virtualization Encapsulation (Geneve) instead of VXLAN as the tunnel protocol. This network can be tuned by using network interface controller (NIC) offloads.
VXLAN provides benefits over VLANs, such as an increase in networks from 4096 to over 16 million, and layer 2 connectivity across physical networks. This allows for all pods behind a service to communicate with each other, even if they are running on different systems.
VXLAN encapsulates all tunneled traffic in user datagram protocol (UDP) packets. However, this leads to increased CPU utilization. Both these outer- and inner-packets are subject to normal checksumming rules to guarantee data is not corrupted during transit. Depending on CPU performance, this additional processing overhead can cause a reduction in throughput and increased latency when compared to traditional, non-overlay networks.
Cloud, VM, and bare metal CPU performance can be capable of handling much more than one Gbps network throughput. When using higher bandwidth links such as 10 or 40 Gbps, reduced performance can occur. This is a known issue in VXLAN-based environments and is not specific to containers or OpenShift Container Platform. Any network that relies on VXLAN tunnels will perform similarly because of the VXLAN implementation.
If you are looking to push beyond one Gbps, you can:
- Evaluate network plugins that implement different routing techniques, such as border gateway protocol (BGP).
- Use VXLAN-offload capable network adapters. VXLAN-offload moves the packet checksum calculation and associated CPU overhead off of the system CPU and onto dedicated hardware on the network adapter. This frees up CPU cycles for use by pods and applications, and allows users to utilize the full bandwidth of their network infrastructure.
VXLAN-offload does not reduce latency. However, CPU utilization is reduced even in latency tests.
9.3.1. Optimizing the MTU for your network
There are two important maximum transmission units (MTUs): the network interface controller (NIC) MTU and the cluster network MTU.
The NIC MTU is configured at the time of OpenShift Container Platform installation, and you can also change the cluster’s MTU as a Day 2 operation. See "Changing cluster network MTU" for more information. The MTU must be less than or equal to the maximum supported value of the NIC of your network. If you are optimizing for throughput, choose the largest possible value. If you are optimizing for lowest latency, choose a lower value.
The OpenShift SDN network plugin overlay MTU must be less than the NIC MTU by 50 bytes at a minimum. This accounts for the SDN overlay header. So, on a normal ethernet network, this should be set to 1450. On a jumbo frame ethernet network, this should be set to 8950. These values should be set automatically by the Cluster Network Operator based on the NIC's configured MTU. Therefore, cluster administrators do not typically update these values. Amazon Web Services (AWS) and bare-metal environments support jumbo frame ethernet networks. This setting will help throughput, especially with transmission control protocol (TCP).
For OVN and Geneve, the MTU must be less than the NIC MTU by 100 bytes at a minimum.
This 50 byte overlay header is relevant to the OpenShift SDN network plugin. Other SDN solutions might require the value to be more or less.
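The overlay arithmetic described above can be sketched in a few lines of Python. The overhead values (50 bytes for OpenShift SDN, 100 bytes for OVN-Kubernetes with Geneve) come from the text. The function and dictionary names are hypothetical, and the sketch does not account for additional overhead such as IPsec.

```python
# Minimum overlay overhead in bytes, as stated in this section.
OVERLAY_OVERHEAD_BYTES = {
    "OpenShiftSDN": 50,    # VXLAN overlay header
    "OVNKubernetes": 100,  # Geneve overlay header
}


def max_cluster_network_mtu(nic_mtu: int, network_type: str) -> int:
    """Largest cluster network MTU that fits inside the NIC MTU."""
    return nic_mtu - OVERLAY_OVERHEAD_BYTES[network_type]


for nic_mtu in (1500, 9000):
    for network_type in OVERLAY_OVERHEAD_BYTES:
        print(nic_mtu, network_type, max_cluster_network_mtu(nic_mtu, network_type))
```

For OpenShift SDN this reproduces the 1450 and 8950 values given above for normal and jumbo frame ethernet networks.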
Additional resources
9.3.2. Recommended practices for installing large scale clusters
When installing large clusters or scaling the cluster to larger node counts, set the cluster network cidr accordingly in your install-config.yaml file before you install the cluster:
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 10.0.0.0/16
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16
The default cluster network cidr 10.128.0.0/14 cannot be used if the cluster size is more than 500 nodes. It must be set to 10.128.0.0/12 or 10.128.0.0/10 to get to larger node counts beyond 500 nodes.
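The node limit follows from how many per-node subnets fit into the cluster network. The following Python sketch shows the arithmetic, assuming each node receives exactly one subnet of size hostPrefix (23 in the example above). The function name is hypothetical, and the practical limit can be lower than the computed value.

```python
import ipaddress


def max_node_subnets(cluster_cidr: str, host_prefix: int) -> int:
    """Number of /host_prefix subnets that fit inside cluster_cidr."""
    network = ipaddress.ip_network(cluster_cidr, strict=False)
    if host_prefix < network.prefixlen:
        raise ValueError("hostPrefix must not be shorter than the cluster network prefix")
    return 2 ** (host_prefix - network.prefixlen)


for cidr in ("10.128.0.0/14", "10.128.0.0/12", "10.128.0.0/10"):
    print(cidr, max_node_subnets(cidr, 23))
```

With a host prefix of 23, a /14 network yields 512 subnets, which is consistent with the limit of about 500 nodes stated above. A /12 network yields 2048 subnets and a /10 network yields 8192.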
9.3.3. Impact of IPsec
Because encrypting and decrypting node hosts uses CPU power, performance is affected both in throughput and CPU usage on the nodes when encryption is enabled, regardless of the IP security system being used.
IPSec encrypts traffic at the IP payload level, before it hits the NIC, protecting fields that would otherwise be used for NIC offloading. This means that some NIC acceleration features might not be usable when IPSec is enabled and will lead to decreased throughput and increased CPU usage.
9.3.4. Additional resources
9.4. Optimizing CPU usage with mount namespace encapsulation
You can optimize CPU usage in OpenShift Container Platform clusters by using mount namespace encapsulation to provide a private namespace for kubelet and CRI-O processes. This reduces the cluster CPU resources used by systemd with no difference in functionality.
Mount namespace encapsulation is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
9.4.1. Encapsulating mount namespaces
Mount namespaces are used to isolate mount points so that processes in different namespaces cannot view each others' files. Encapsulation is the process of moving Kubernetes mount namespaces to an alternative location where they will not be constantly scanned by the host operating system.
The host operating system uses systemd to constantly scan all mount namespaces: both the standard Linux mounts and the numerous mounts that Kubernetes uses to operate. The current implementation of kubelet and CRI-O both use the top-level namespace for all container runtime and kubelet mount points. However, encapsulating these container-specific mount points in a private namespace reduces systemd overhead with no difference in functionality. Using a separate mount namespace for both CRI-O and kubelet can encapsulate container-specific mounts from any systemd or other host operating system interaction.
This ability to potentially achieve major CPU optimization is now available to all OpenShift Container Platform administrators. Encapsulation can also improve security by storing Kubernetes-specific mount points in a location safe from inspection by unprivileged users.
The following diagrams illustrate a Kubernetes installation before and after encapsulation. Both scenarios show example containers which have mount propagation settings of bidirectional, host-to-container, and none.
Here we see systemd, host operating system processes, kubelet, and the container runtime sharing a single mount namespace.
- systemd, host operating system processes, kubelet, and the container runtime each have access to and visibility of all mount points.
- Container 1, configured with bidirectional mount propagation, can access systemd and host mounts, kubelet and CRI-O mounts. A mount originating in Container 1, such as /run/a, is visible to systemd, host operating system processes, kubelet, container runtime, and other containers with host-to-container or bidirectional mount propagation configured (as in Container 2).
- Container 2, configured with host-to-container mount propagation, can access systemd and host mounts, kubelet and CRI-O mounts. A mount originating in Container 2, such as /run/b, is not visible to any other context.
- Container 3, configured with no mount propagation, has no visibility of external mount points. A mount originating in Container 3, such as /run/c, is not visible to any other context.
The following diagram illustrates the system state after encapsulation.
- The main systemd process is no longer devoted to unnecessary scanning of Kubernetes-specific mount points. It only monitors systemd-specific and host mount points.
- The host operating system processes can access only the systemd and host mount points.
- Using a separate mount namespace for both CRI-O and kubelet completely separates all container-specific mounts away from any systemd or other host operating system interaction whatsoever.
- The behavior of Container 1 is unchanged, except a mount it creates such as /run/a is no longer visible to systemd or host operating system processes. It is still visible to kubelet, CRI-O, and other containers with host-to-container or bidirectional mount propagation configured (like Container 2).
- The behavior of Container 2 and Container 3 is unchanged.
9.4.2. Configuring mount namespace encapsulation
You can configure mount namespace encapsulation so that a cluster runs with less resource overhead.
Mount namespace encapsulation is a Technology Preview feature and it is disabled by default. To use it, you must enable the feature manually.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in as a user with cluster-admin privileges.
Procedure
Create a file called mount_namespace_config.yaml with the following YAML:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 99-kubens-master
spec:
  config:
    ignition:
      version: 3.2.0
    systemd:
      units:
      - enabled: true
        name: kubens.service
---
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-kubens-worker
spec:
  config:
    ignition:
      version: 3.2.0
    systemd:
      units:
      - enabled: true
        name: kubens.service
Apply the mount namespace MachineConfig CR by running the following command:
$ oc apply -f mount_namespace_config.yaml
Example output
machineconfig.machineconfiguration.openshift.io/99-kubens-master created
machineconfig.machineconfiguration.openshift.io/99-kubens-worker created
The MachineConfig CR can take up to 30 minutes to finish being applied in the cluster. You can check the status of the MachineConfig CR by running the following command:
$ oc get mcp
Example output
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-03d4bc4befb0f4ed3566a2c8f7636751   False     True       False      3              0                   0                     0                      45m
worker   rendered-worker-10577f6ab0117ed1825f8af2ac687ddf   False     True       False      3              1                   1
Wait for the MachineConfig CR to be applied successfully across all control plane and worker nodes by running the following command:
$ oc wait --for=condition=Updated mcp --all --timeout=30m
Example output
machineconfigpool.machineconfiguration.openshift.io/master condition met
machineconfigpool.machineconfiguration.openshift.io/worker condition met
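If you prefer to poll rather than block on oc wait, the table printed by oc get mcp can be filtered for pools that are still rolling out. The following is a minimal sketch, not part of the product: the pending_pools function name is hypothetical, and the sample table uses placeholder configuration names. It assumes that UPDATED is the third column, as in the example output of oc get mcp.

```shell
# pending_pools: read an `oc get mcp` table on stdin and print the name of
# every pool whose UPDATED column (third field) is not "True".
# Hypothetical helper for illustration only.
pending_pools() {
  awk 'NR > 1 && $3 != "True" { print $1 }'
}

# Sample table with placeholder configuration names.
sample='NAME     CONFIG                    UPDATED   UPDATING   DEGRADED
master   rendered-master-example   True      False      False
worker   rendered-worker-example   False     True       False'

printf '%s\n' "$sample" | pending_pools
```

On a live cluster the same filter could be fed directly, for example with oc get mcp | pending_pools. An empty result suggests that every pool reports UPDATED as True.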
Verification
To verify encapsulation for a cluster host, run the following commands:
Open a debug shell to the cluster host:
$ oc debug node/<node_name>
Open a chroot session:
sh-4.4# chroot /host
Check the systemd mount namespace:
sh-4.4# readlink /proc/1/ns/mnt
Example output
mnt:[4026531953]
Check the kubelet mount namespace:
sh-4.4# readlink /proc/$(pgrep kubelet)/ns/mnt
Example output
mnt:[4026531840]
Check the CRI-O mount namespace:
sh-4.4# readlink /proc/$(pgrep crio)/ns/mnt
Example output
mnt:[4026531840]
These commands return the mount namespaces associated with systemd, kubelet, and the container runtime. In OpenShift Container Platform, the container runtime is CRI-O.
Encapsulation is in effect if systemd is in a different mount namespace from kubelet and CRI-O, as in the preceding example. Encapsulation is not in effect if all three processes are in the same mount namespace.
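The comparison described above can be scripted. The following is a minimal sketch, assuming the three namespace identifiers have already been read with readlink. The encapsulation_state function name and the "inconsistent" result are hypothetical additions for illustration; the documentation itself only describes the enabled and disabled cases.

```shell
# encapsulation_state SYSTEMD_NS KUBELET_NS CRIO_NS
# Prints "enabled", "disabled", or "inconsistent" for three namespace IDs.
# Hypothetical helper for illustration only.
encapsulation_state() {
  if [ "$2" != "$3" ]; then
    # Assumption: kubelet and CRI-O are expected to share one mount
    # namespace. The documentation does not describe this case.
    echo "inconsistent"
  elif [ "$1" != "$2" ]; then
    echo "enabled"
  else
    echo "disabled"
  fi
}

# Identifiers copied from the example output in the verification steps.
encapsulation_state 'mnt:[4026531953]' 'mnt:[4026531840]' 'mnt:[4026531840]'

# On a node, after `chroot /host`, the arguments could instead be:
#   "$(readlink /proc/1/ns/mnt)" \
#   "$(readlink /proc/$(pgrep kubelet)/ns/mnt)" \
#   "$(readlink /proc/$(pgrep crio)/ns/mnt)"
```

With the example identifiers, the function reports that encapsulation is enabled, because systemd differs from the namespace shared by kubelet and CRI-O.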
9.4.3. Inspecting encapsulated namespaces
You can inspect Kubernetes-specific mount points in the cluster host operating system for debugging or auditing purposes by using the kubensenter script that is available in Red Hat Enterprise Linux CoreOS (RHCOS).
SSH shell sessions to the cluster host are in the default namespace. To inspect Kubernetes-specific mount points in an SSH shell prompt, you need to run the kubensenter script as root. The kubensenter script is aware of the state of the mount encapsulation, and is safe to run even if encapsulation is not enabled.
oc debug remote shell sessions start inside the Kubernetes namespace by default. You do not need to run kubensenter to inspect mount points when you use oc debug.
If the encapsulation feature is not enabled, the kubensenter findmnt and findmnt commands return the same output, regardless of whether they are run in an oc debug session or in an SSH shell prompt.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in as a user with cluster-admin privileges.
- You have configured SSH access to the cluster host.
Procedure
Open a remote SSH shell to the cluster host. For example:
$ ssh core@<node_name>
Run commands using the provided kubensenter script as the root user. To run a single command inside the Kubernetes namespace, provide the command and any arguments to the kubensenter script. For example, to run the findmnt command inside the Kubernetes namespace, run the following command:
[core@control-plane-1 ~]$ sudo kubensenter findmnt
Example output
kubensenter: Autodetect: kubens.service namespace found at /run/kubens/mnt
TARGET   SOURCE   FSTYPE   OPTIONS
/        /dev/sda4[/ostree/deploy/rhcos/deploy/32074f0e8e5ec453e56f5a8a7bc9347eaa4172349ceab9c22b709d9d71a3f4b0.0]
|                 xfs      rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,prjquota
         shm      tmpfs
...
To start a new interactive shell inside the Kubernetes namespace, run the kubensenter script without any arguments:
[core@control-plane-1 ~]$ sudo kubensenter
Example output
kubensenter: Autodetect: kubens.service namespace found at /run/kubens/mnt
9.4.4. Running additional services in the encapsulated namespace
Any monitoring tool that relies on the ability to run in the host operating system and have visibility of mount points created by kubelet, CRI-O, or containers themselves, must enter the container mount namespace to see these mount points. The kubensenter script that is provided with OpenShift Container Platform executes another command inside the Kubernetes mount namespace and can be used to adapt any existing tools.
The kubensenter script is aware of the state of the mount encapsulation feature, and is safe to run even if encapsulation is not enabled. In that case, the script executes the provided command in the default mount namespace.
For example, if a systemd service needs to run inside the new Kubernetes mount namespace, edit the service file and use the ExecStart= command line with kubensenter.
[Unit]
Description=Example service
[Service]
ExecStart=/usr/bin/kubensenter /path/to/original/command arg1 arg2
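Instead of editing a service file directly, the same change can be expressed as a systemd drop-in. The following is a minimal sketch that only prints the drop-in text. The kubens_dropin function name is hypothetical, and the approach relies on standard systemd behavior in which an empty ExecStart= line clears the command inherited from the original unit. How the drop-in is delivered to RHCOS nodes is outside the scope of this sketch.

```shell
# kubens_dropin COMMAND [ARGS...]
# Print a systemd drop-in that reruns COMMAND through kubensenter.
# Hypothetical helper for illustration only.
kubens_dropin() {
  # The empty ExecStart= clears the original command before it is replaced.
  # "$*" joins the command and its arguments with single spaces.
  printf '[Service]\nExecStart=\nExecStart=/usr/bin/kubensenter %s\n' "$*"
}

kubens_dropin /path/to/original/command arg1 arg2
```

The printed text could be saved as a .conf file in the unit's drop-in directory, for example /etc/systemd/system/<unit_name>.service.d/, where <unit_name> is a placeholder for the service being adapted.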
Chapter 10. Managing bare metal hosts
When you install OpenShift Container Platform on a bare metal cluster, you can provision and manage bare metal nodes using machine and machineset custom resources (CRs) for bare metal hosts that exist in the cluster.
10.1. About bare metal hosts and nodes
To provision a Red Hat Enterprise Linux CoreOS (RHCOS) bare metal host as a node in your cluster, first create a MachineSet custom resource (CR) object that corresponds to the bare metal host hardware. Bare metal host compute machine sets describe infrastructure components specific to your configuration. You apply specific Kubernetes labels to these compute machine sets and then update the infrastructure components to run on only those machines.
Machine CRs are created automatically when you scale up the relevant MachineSet containing a metal3.io/autoscale-to-hosts annotation. OpenShift Container Platform uses Machine CRs to provision the bare metal node that corresponds to the host as specified in the MachineSet CR.
10.2. Maintaining bare metal hosts
You can maintain the details of the bare metal hosts in your cluster from the OpenShift Container Platform web console. Navigate to Compute → Bare Metal Hosts, and select a task from the Actions drop-down menu. Here you can manage items such as the BMC details, the boot MAC address for the host, and power management. You can also review the details of the network interfaces and drives for the host.
You can move a bare metal host into maintenance mode. When you move a host into maintenance mode, the scheduler moves all managed workloads off the corresponding bare metal node. No new workloads are scheduled while in maintenance mode.
You can deprovision a bare metal host in the web console. Deprovisioning a host performs the following actions:
- Annotates the bare metal host CR with cluster.k8s.io/delete-machine: true
- Scales down the related compute machine set
Powering off the host without first moving the daemon set and unmanaged static pods to another node can cause service disruption and loss of data.
10.2.1. Adding a bare metal host to the cluster using the web console
You can add bare metal hosts to the cluster in the web console.
Prerequisites
- Install an RHCOS cluster on bare metal.
- Log in as a user with cluster-admin privileges.
Procedure
- In the web console, navigate to Compute → Bare Metal Hosts.
- Select Add Host → New with Dialog.
- Specify a unique name for the new bare metal host.
- Set the Boot MAC address.
- Set the baseboard management controller (BMC) Address.
- Enter the user credentials for the host’s baseboard management controller (BMC).
- Select to power on the host after creation, and select Create.
- Scale up the number of replicas to match the number of available bare metal hosts. Navigate to Compute → MachineSets, and increase the number of machine replicas in the cluster by selecting Edit Machine count from the Actions drop-down menu.
You can also manage the number of bare metal nodes using the oc scale command and the appropriate bare metal compute machine set.
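As a hedged illustration of the oc scale alternative, the following sketch builds the command line without running it, so that it can be reviewed first. The scale_cmd function name and the example-worker-0 machine set name are hypothetical. The openshift-machine-api namespace is taken from the oc annotate examples elsewhere in this chapter.

```shell
# scale_cmd MACHINESET REPLICAS
# Print (without executing) an `oc scale` command for a compute machine set.
# Hypothetical helper for illustration only.
scale_cmd() {
  case $2 in
    # Reject empty or non-numeric replica counts.
    ''|*[!0-9]*) echo "replicas must be a non-negative integer" >&2; return 1 ;;
  esac
  printf 'oc scale machineset %s -n openshift-machine-api --replicas=%s\n' "$1" "$2"
}

# example-worker-0 is a placeholder machine set name.
scale_cmd example-worker-0 3
```

Printing the command first is a deliberate choice for this sketch: it lets you confirm that the replica count matches the number of available bare metal hosts before you run the command yourself.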
10.2.2. Adding a bare metal host to the cluster using YAML in the web console
You can add bare metal hosts to the cluster in the web console using a YAML file that describes the bare metal host.
Prerequisites
- Install an RHCOS compute machine on bare metal infrastructure for use in the cluster.
- Log in as a user with cluster-admin privileges.
- Create a Secret CR for the bare metal host.
Procedure
- In the web console, navigate to Compute → Bare Metal Hosts.
- Select Add Host → New from YAML.
Copy and paste the following YAML, modifying the relevant fields with the details of your host:
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: <bare_metal_host_name>
spec:
  online: true
  bmc:
    address: <bmc_address>
    credentialsName: <secret_credentials_name> # 1
    disableCertificateVerification: True # 2
  bootMACAddress: <host_boot_mac_address>
1 credentialsName must reference a valid Secret CR. The baremetal-operator cannot manage the bare metal host without a valid Secret referenced in the credentialsName. For more information about secrets and how to create them, see Understanding secrets.
2 Setting disableCertificateVerification to true disables TLS host validation between the cluster and the baseboard management controller (BMC).
- Select Create to save the YAML and create the new bare metal host.
Scale up the number of replicas to match the number of available bare metal hosts. Navigate to Compute → MachineSets, and increase the number of machines in the cluster by selecting Edit Machine count from the Actions drop-down menu.
Note: You can also manage the number of bare metal nodes using the oc scale command and the appropriate bare metal compute machine set.
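The prerequisites for this procedure include a Secret CR that holds the BMC credentials. The following is a minimal sketch of how such a manifest might be generated. The bmc_secret function name, the example-bmc-secret name, and the sample credentials are hypothetical, and the username and password keys and the openshift-machine-api namespace are assumptions to check against the Understanding secrets documentation for your release.

```shell
# bmc_secret NAME USERNAME PASSWORD
# Print a Secret manifest with base64-encoded BMC credentials.
# Hypothetical helper; key names and namespace are assumptions.
bmc_secret() {
  # Encode without a trailing newline and strip any line wrapping.
  user_b64=$(printf '%s' "$2" | base64 | tr -d '\n')
  pass_b64=$(printf '%s' "$3" | base64 | tr -d '\n')
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: $1
  namespace: openshift-machine-api
type: Opaque
data:
  username: $user_b64
  password: $pass_b64
EOF
}

# Placeholder name and credentials.
bmc_secret example-bmc-secret admin 'example-password'
```

The generated manifest could be piped to oc apply -f -, and the Secret name would then be the value used for credentialsName in the BareMetalHost YAML.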
10.2.3. Automatically scaling machines to the number of available bare metal hosts
To automatically create the number of Machine objects that matches the number of available BareMetalHost objects, add a metal3.io/autoscale-to-hosts annotation to the MachineSet object.
Prerequisites
- Install RHCOS bare metal compute machines for use in the cluster, and create corresponding BareMetalHost objects.
- Install the OpenShift Container Platform CLI (oc).
- Log in as a user with cluster-admin privileges.
Procedure
Annotate the compute machine set that you want to configure for automatic scaling by adding the metal3.io/autoscale-to-hosts annotation. Replace <machineset> with the name of the compute machine set.
$ oc annotate machineset <machineset> -n openshift-machine-api 'metal3.io/autoscale-to-hosts=<any_value>'
Wait for the new scaled machines to start.
When you use a BareMetalHost object to create a machine in the cluster and labels or selectors are subsequently changed on the BareMetalHost, the BareMetalHost object continues to be counted against the MachineSet that the Machine object was created from.
10.2.4. Removing bare metal hosts from the provisioner node
In certain circumstances, you might want to temporarily remove bare metal hosts from the provisioner node. For example, during provisioning when a bare metal host reboot is triggered by using the OpenShift Container Platform administration console or as a result of a Machine Config Pool update, OpenShift Container Platform logs into the integrated Dell Remote Access Controller (iDRAC) and issues a delete of the job queue.
To stop the cluster from managing the number of Machine objects so that it matches the number of available BareMetalHost objects, add a baremetalhost.metal3.io/detached annotation to the MachineSet object.
This annotation has an effect only for BareMetalHost objects that are in the Provisioned, ExternallyProvisioned, or Ready/Available state.
Prerequisites
- Install RHCOS bare metal compute machines for use in the cluster and create corresponding BareMetalHost objects.
- Install the OpenShift Container Platform CLI (oc).
- Log in as a user with cluster-admin privileges.
Procedure
Annotate the compute machine set that you want to remove from the provisioner node by adding the baremetalhost.metal3.io/detached annotation. The trailing = sets an empty value so that the argument is parsed as an annotation rather than as a resource name.
$ oc annotate machineset <machineset> -n openshift-machine-api 'baremetalhost.metal3.io/detached='
Wait for the new machines to start.
Note: When you use a BareMetalHost object to create a machine in the cluster and labels or selectors are subsequently changed on the BareMetalHost, the BareMetalHost object continues to be counted against the MachineSet that the Machine object was created from.
In the provisioning use case, remove the annotation after the reboot is complete by using the following command:
$ oc annotate machineset <machineset> -n openshift-machine-api 'baremetalhost.metal3.io/detached-'
Chapter 11. Monitoring bare-metal events with the Bare Metal Event Relay
Bare Metal Event Relay is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
11.1. About bare-metal events
Use the Bare Metal Event Relay to subscribe applications that run in your OpenShift Container Platform cluster to events that are generated on the underlying bare-metal host. The Redfish service publishes events on a node and transmits them on an advanced message queue to subscribed applications.
Bare-metal events are based on the open Redfish standard that is developed under the guidance of the Distributed Management Task Force (DMTF). Redfish provides a secure industry-standard protocol with a REST API. The protocol is used for the management of distributed, converged or software-defined resources and infrastructure.
Hardware-related events published through Redfish include:
- Breaches of temperature limits
- Server status
- Fan status
Begin using bare-metal events by deploying the Bare Metal Event Relay Operator and subscribing your application to the service. The Bare Metal Event Relay Operator installs and manages the lifecycle of the Redfish bare-metal event service.
The Bare Metal Event Relay works only with Redfish-capable devices on single-node clusters provisioned on bare-metal infrastructure.
11.2. How bare-metal events work
The Bare Metal Event Relay enables applications running on bare-metal clusters to respond quickly to Redfish hardware changes and failures such as breaches of temperature thresholds, fan failure, disk loss, power outages, and memory failure. These hardware events are delivered using an HTTP transport or AMQP mechanism. The latency of the messaging service is between 10 and 20 milliseconds.
The Bare Metal Event Relay provides a publish-subscribe service for the hardware events. Applications can use a REST API to subscribe to the events. The Bare Metal Event Relay supports hardware that complies with Redfish OpenAPI v1.8 or later.
11.2.1. Bare Metal Event Relay data flow
The following figure illustrates an example bare-metal events data flow:
Figure 11.1. Bare Metal Event Relay data flow
11.2.1.1. Operator-managed pod
The Operator uses the HardwareEvent CR to manage the pod that contains the Bare Metal Event Relay and its components.
11.2.1.2. Bare Metal Event Relay
At startup, the Bare Metal Event Relay queries the Redfish API and downloads all the message registries, including custom registries. The Bare Metal Event Relay then begins to receive subscribed events from the Redfish hardware.
The Bare Metal Event Relay enables applications running on bare-metal clusters to respond quickly to Redfish hardware changes and failures such as breaches of temperature thresholds, fan failure, disk loss, power outages, and memory failure. The events are reported using the HardwareEvent CR.
11.2.1.3. Cloud native event
Cloud native events (CNE) is a REST API specification for defining the format of event data.
11.2.1.4. CNCF CloudEvents
CloudEvents is a vendor-neutral specification developed by the Cloud Native Computing Foundation (CNCF) for defining the format of event data.
11.2.1.5. HTTP transport or AMQP dispatch router
The HTTP transport or AMQP dispatch router is responsible for the message delivery service between publisher and subscriber.
HTTP transport is the default transport for PTP and bare-metal events. Use HTTP transport instead of AMQP for PTP and bare-metal events where possible. AMQ Interconnect is EOL from 30 June 2024. Extended life cycle support (ELS) for AMQ Interconnect ends 29 November 2029. For more information, see Red Hat AMQ Interconnect support status.
11.2.1.6. Cloud event proxy sidecar
The cloud event proxy sidecar container image is based on the O-RAN API specification and provides a publish-subscribe event framework for hardware events.
11.2.2. Redfish message parsing service
In addition to handling Redfish events, the Bare Metal Event Relay provides message parsing for events without a Message property. The proxy downloads all the Redfish message registries, including vendor-specific registries, from the hardware when it starts. If an event does not contain a Message property, the proxy uses the Redfish message registries to construct the Message and Resolution properties and adds them to the event before passing the event to the cloud events framework. This service allows Redfish events to have a smaller message size and lower transmission latency.
11.2.3. Installing the Bare Metal Event Relay using the CLI
As a cluster administrator, you can install the Bare Metal Event Relay Operator by using the CLI.
Prerequisites
- A cluster that is installed on bare-metal hardware with nodes that have a Redfish-enabled Baseboard Management Controller (BMC).
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
Procedure
Create a namespace for the Bare Metal Event Relay.
Save the following YAML in the bare-metal-events-namespace.yaml file:
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-bare-metal-events
  labels:
    name: openshift-bare-metal-events
    openshift.io/cluster-monitoring: "true"
Create the Namespace CR:
$ oc create -f bare-metal-events-namespace.yaml
Create an Operator group for the Bare Metal Event Relay Operator.
Save the following YAML in the bare-metal-events-operatorgroup.yaml file:
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: bare-metal-event-relay-group
  namespace: openshift-bare-metal-events
spec:
  targetNamespaces:
  - openshift-bare-metal-events
Create the OperatorGroup CR:
$ oc create -f bare-metal-events-operatorgroup.yaml
Subscribe to the Bare Metal Event Relay.
Save the following YAML in the bare-metal-events-sub.yaml file:
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: bare-metal-event-relay-subscription
  namespace: openshift-bare-metal-events
spec:
  channel: "stable"
  name: bare-metal-event-relay
  source: redhat-operators
  sourceNamespace: openshift-marketplace
Create the Subscription CR:
$ oc create -f bare-metal-events-sub.yaml
Verification
To verify that the Bare Metal Event Relay Operator is installed, run the following command:
$ oc get csv -n openshift-bare-metal-events -o custom-columns=Name:.metadata.name,Phase:.status.phase
11.2.4. Installing the Bare Metal Event Relay using the web console
As a cluster administrator, you can install the Bare Metal Event Relay Operator using the web console.
Prerequisites
- A cluster that is installed on bare-metal hardware with nodes that have a Redfish-enabled Baseboard Management Controller (BMC).
- Log in as a user with cluster-admin privileges.
Procedure
Install the Bare Metal Event Relay using the OpenShift Container Platform web console:
- In the OpenShift Container Platform web console, click Operators → OperatorHub.
- Choose Bare Metal Event Relay from the list of available Operators, and then click Install.
- On the Install Operator page, select or create a Namespace, select openshift-bare-metal-events, and then click Install.
Verification
Optional: You can verify that the Operator installed successfully by performing the following check:
- Switch to the Operators → Installed Operators page.
Ensure that Bare Metal Event Relay is listed in the project with a Status of InstallSucceeded.
Note: During installation an Operator might display a Failed status. If the installation later succeeds with an InstallSucceeded message, you can ignore the Failed message.
If the Operator does not appear as installed, to troubleshoot further:
- Go to the Operators → Installed Operators page and inspect the Operator Subscriptions and Install Plans tabs for any failure or errors under Status.
- Go to the Workloads → Pods page and check the logs for pods in the project namespace.
11.3. Installing the AMQ messaging bus
To pass Redfish bare-metal event notifications between publisher and subscriber on a node, you can install and configure an AMQ messaging bus to run locally on the node. You do this by installing the AMQ Interconnect Operator for use in the cluster.
HTTP transport is the default transport for PTP and bare-metal events. Use HTTP transport instead of AMQP for PTP and bare-metal events where possible. AMQ Interconnect is EOL from 30 June 2024. Extended life cycle support (ELS) for AMQ Interconnect ends 29 November 2029. For more information, see Red Hat AMQ Interconnect support status.
Prerequisites
- Install the OpenShift Container Platform CLI (oc).
- Log in as a user with cluster-admin privileges.
Procedure
- Install the AMQ Interconnect Operator to its own amq-interconnect namespace. See Installing the AMQ Interconnect Operator.
Verification
Verify that the AMQ Interconnect Operator is available and the required pods are running:
$ oc get pods -n amq-interconnect
Example output
NAME                                    READY   STATUS    RESTARTS   AGE
amq-interconnect-645db76c76-k8ghs       1/1     Running   0          23h
interconnect-operator-5cb5fc7cc-4v7qm   1/1     Running   0          23h
Verify that the required bare-metal-event-relay bare-metal event producer pod is running in the openshift-bare-metal-events namespace:
$ oc get pods -n openshift-bare-metal-events
Example output
NAME                                                          READY   STATUS    RESTARTS   AGE
hw-event-proxy-operator-controller-manager-74d5649b7c-dzgtl   2/2     Running   0          25s
11.4. Subscribing to Redfish BMC bare-metal events for a cluster node
You can subscribe to Redfish BMC events generated on a node in your cluster by creating a BMCEventSubscription custom resource (CR) for the node, creating a HardwareEvent CR for the event, and creating a Secret CR for the BMC.
11.4.1. Subscribing to bare-metal events
You can configure the baseboard management controller (BMC) to send bare-metal events to subscribed applications running in an OpenShift Container Platform cluster. Example Redfish bare-metal events include an increase in device temperature, or removal of a device. You subscribe applications to bare-metal events using a REST API.
You can only create a BMCEventSubscription custom resource (CR) for physical hardware that supports Redfish and has a vendor interface set to redfish or idrac-redfish.
Use the BMCEventSubscription CR to subscribe to predefined Redfish events. The Redfish standard does not provide an option to create specific alerts and thresholds. For example, to receive an alert event when an enclosure’s temperature exceeds 40° Celsius, you must manually configure the event according to the vendor’s recommendations.
Perform the following procedure to subscribe to bare-metal events for the node using a BMCEventSubscription CR.
Prerequisites
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
- Get the user name and password for the BMC.
- Deploy a bare-metal node with a Redfish-enabled Baseboard Management Controller (BMC) in your cluster, and enable Redfish events on the BMC.
Note: Enabling Redfish events on specific hardware is outside the scope of this information. For more information about enabling Redfish events for your specific hardware, consult the BMC manufacturer documentation.
Procedure
Confirm that the node hardware has the Redfish EventService enabled by running the following curl command:
$ curl https://<bmc_ip_address>/redfish/v1/EventService --insecure -H 'Content-Type: application/json' -u "<bmc_username>:<password>"
where: