Scalability and performance
Scaling your OpenShift Container Platform cluster and tuning performance in production environments
Abstract
Chapter 1. OpenShift Container Platform scalability and performance overview
OpenShift Container Platform provides best practices and tools to help you optimize the performance and scale of your clusters. The following documentation provides information on recommended performance and scalability practices, reference design specifications, optimization, and low latency tuning.
To contact Red Hat support, see Getting support.
Some performance and scalability Operators have release cycles that are independent from OpenShift Container Platform release cycles. For more information, see OpenShift Operators.
1.1. Recommended performance and scalability practices
Recommended control plane practices
1.2. Telco reference design specifications
1.3. Planning, optimization, and measurement
Planning your environment according to object maximums
Recommended practices for IBM Z and IBM LinuxONE
Using the Node Tuning Operator
Using CPU Manager and Topology Manager
Scheduling NUMA-aware workloads
Optimizing storage, routing, networking and CPU usage
Managing bare metal hosts and events
What are huge pages and how are they used by apps
Low latency tuning for improving cluster stability and partitioning workload
Improving cluster stability in high latency environments using worker latency profiles
Chapter 2. Recommended performance and scalability practices
2.1. Recommended control plane practices
This topic provides recommended performance and scalability practices for control planes in OpenShift Container Platform.
2.1.1. Recommended practices for scaling the cluster
The guidance in this section is only relevant for installations with cloud provider integration.
Apply the following best practices to scale the number of worker machines in your OpenShift Container Platform cluster. You scale the worker machines by increasing or decreasing the number of replicas that are defined in the worker machine set.
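For example, assuming a compute machine set named <machine_set_name> in the openshift-machine-api namespace, you can adjust the replica count with a command such as the following:

$ oc scale --replicas=<new_replica_count> machineset <machine_set_name> -n openshift-machine-api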
When scaling up the cluster to higher node counts:
- Spread nodes across all of the available zones for higher availability.
- Scale up by no more than 25 to 50 machines at once.
- Consider creating new compute machine sets in each available zone with alternative instance types of similar size to help mitigate any periodic provider capacity constraints. For example, on AWS, use m5.large and m5d.large.
Cloud providers might implement a quota for API services. Therefore, gradually scale the cluster.
The controller might not be able to create machines if the replica count in the compute machine sets is increased by a large amount all at once. The process is constrained by the number of API requests that the underlying cloud platform can handle: the controller issues more queries as it creates, checks, and updates machine status, and because the cloud platform enforces API request limits, excessive queries can cause machine creation to fail.
Enable machine health checks when scaling to large node counts. In case of failures, the health checks monitor the machine conditions and automatically repair unhealthy machines.
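A minimal sketch of a MachineHealthCheck resource for worker machines follows; the resource name, timeouts, and maxUnhealthy threshold are illustrative and should be tuned for your cluster:

apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: worker-health-check        # hypothetical name
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker
      machine.openshift.io/cluster-api-machine-type: worker
  unhealthyConditions:
  - type: Ready
    status: "False"
    timeout: 300s
  - type: Ready
    status: Unknown
    timeout: 300s
  maxUnhealthy: 40%                # illustrative threshold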
Scaling large and dense clusters down to lower node counts can take significant time because the process drains or evicts the objects running on the nodes being terminated in parallel. Also, the client might throttle requests if there are too many objects to evict. The default client queries per second (QPS) and burst rates are currently set to 50 and 100, respectively. These values cannot be modified in OpenShift Container Platform.
2.1.2. Control plane node sizing
The control plane node resource requirements depend on the number and type of nodes and objects in the cluster. The following control plane node size recommendations are based on the results of a control plane density focused test called cluster-density. This test creates the following objects across a given number of namespaces:
- 1 image stream
- 1 build
- 5 deployments, with 2 pod replicas in a sleep state, mounting 4 secrets, 4 config maps, and 1 downward API volume each
- 5 services, each one pointing to the TCP/8080 and TCP/8443 ports of one of the previous deployments
- 1 route pointing to the first of the previous services
- 10 secrets containing 2048 random string characters
- 10 config maps containing 2048 random string characters
| Number of worker nodes | Cluster-density (namespaces) | CPU cores | Memory (GB) |
|---|---|---|---|
| 24 | 500 | 4 | 16 |
| 120 | 1000 | 8 | 32 |
| 252 | 4000 | 16, but 24 if using the OVN-Kubernetes network plug-in | 64, but 128 if using the OVN-Kubernetes network plug-in |
| 501, but untested with the OVN-Kubernetes network plug-in | 4000 | 16 | 96 |
The data from the table above is based on an OpenShift Container Platform cluster running on AWS, using r5.4xlarge instances as control plane nodes and m5.2xlarge instances as worker nodes.
On a large and dense cluster with three control plane nodes, CPU and memory usage spike when one of the nodes is stopped, rebooted, or fails. The failures can be due to unexpected issues with power, the network, or the underlying infrastructure, or to intentional cases where the cluster is restarted after being shut down to save costs. The remaining two control plane nodes must handle the load to remain highly available, which increases resource usage. This is also expected during upgrades because the control plane nodes are cordoned, drained, and rebooted serially to apply the operating system updates as well as the control plane Operator updates. To avoid cascading failures, keep the overall CPU and memory resource usage on the control plane nodes to at most 60% of all available capacity so that they can absorb the resource usage spikes. Increase the CPU and memory on the control plane nodes accordingly to avoid potential downtime due to lack of resources.
The node sizing varies depending on the number of nodes and object counts in the cluster. It also depends on whether the objects are actively being created on the cluster. During object creation, the control plane is more active in terms of resource usage compared to when the objects are in the running phase.
Operator Lifecycle Manager (OLM) runs on the control plane nodes and its memory footprint depends on the number of namespaces and user-installed Operators that OLM needs to manage on the cluster. Control plane nodes need to be sized accordingly to avoid OOM kills. The following data points are based on the results from cluster maximums testing.
| Number of namespaces | OLM memory at idle state (GB) | OLM memory with 5 user operators installed (GB) |
|---|---|---|
| 500 | 0.823 | 1.7 |
| 1000 | 1.2 | 2.5 |
| 1500 | 1.7 | 3.2 |
| 2000 | 2 | 4.4 |
| 3000 | 2.7 | 5.6 |
| 4000 | 3.8 | 7.6 |
| 5000 | 4.2 | 9.02 |
| 6000 | 5.8 | 11.3 |
| 7000 | 6.6 | 12.9 |
| 8000 | 6.9 | 14.8 |
| 9000 | 8 | 17.7 |
| 10,000 | 9.9 | 21.6 |
You can modify the control plane node size in a running OpenShift Container Platform 4.14 cluster for the following configurations only:
- Clusters installed with a user-provisioned installation method.
- AWS clusters installed with an installer-provisioned infrastructure installation method.
- Clusters that use a control plane machine set to manage control plane machines.
For all other configurations, you must estimate your total node count and use the suggested control plane node size during installation.
The recommendations are based on the data points captured on OpenShift Container Platform clusters with OpenShift SDN as the network plugin.
In OpenShift Container Platform 4.14, half of a CPU core (500 millicores) is reserved by the system by default, in contrast to OpenShift Container Platform 3.11 and previous versions. The sizes are determined taking that into consideration.
2.1.2.1. Selecting a larger Amazon Web Services instance type for control plane machines
If the control plane machines in an Amazon Web Services (AWS) cluster require more resources, you can select a larger AWS instance type for the control plane machines to use.
The procedure for clusters that use a control plane machine set is different from the procedure for clusters that do not use a control plane machine set.
If you are uncertain about the state of the ControlPlaneMachineSet CR in your cluster, you can verify the CR status.
2.1.2.1.1. Changing the Amazon Web Services instance type by using a control plane machine set
You can change the Amazon Web Services (AWS) instance type that your control plane machines use by updating the specification in the control plane machine set custom resource (CR).
Prerequisites
- Your AWS cluster uses a control plane machine set.
Procedure
Edit your control plane machine set CR by running the following command:
$ oc --namespace openshift-machine-api edit controlplanemachineset.machine.openshift.io cluster

Edit the following line under the providerSpec field:

providerSpec:
  value:
    ...
    instanceType: <compatible_aws_instance_type>

Specify a larger AWS instance type with the same base as the previous selection. For example, you can change m6i.xlarge to m6i.2xlarge or m6i.4xlarge.
Save your changes.
- For clusters that use the default RollingUpdate update strategy, the Operator automatically propagates the changes to your control plane configuration.
- For clusters that are configured to use the OnDelete update strategy, you must replace your control plane machines manually.
2.1.2.1.2. Changing the Amazon Web Services instance type by using the AWS console
You can change the Amazon Web Services (AWS) instance type that your control plane machines use by updating the instance type in the AWS console.
Prerequisites
- You have access to the AWS console with the permissions required to modify the EC2 instances for your cluster.
- You have access to the OpenShift Container Platform cluster as a user with the cluster-admin role.
Procedure
- Open the AWS console and fetch the instances for the control plane machines.
- Choose one control plane machine instance.
- For the selected control plane machine, back up the etcd data by creating an etcd snapshot. For more information, see "Backing up etcd".
- In the AWS console, stop the control plane machine instance.
- Select the stopped instance, and click Actions → Instance Settings → Change instance type.
- Change the instance to a larger type, ensuring that the type is the same base as the previous selection, and apply the changes. For example, you can change m6i.xlarge to m6i.2xlarge or m6i.4xlarge.
- Start the instance.
- If your OpenShift Container Platform cluster has a corresponding Machine object for the instance, update the instance type of the object to match the instance type set in the AWS console.
- Repeat this process for each control plane machine.
2.2. Recommended infrastructure practices
This topic provides recommended performance and scalability practices for infrastructure in OpenShift Container Platform.
2.2.1. Infrastructure node sizing
Infrastructure nodes are nodes that are labeled to run pieces of the OpenShift Container Platform environment. The infrastructure node resource requirements depend on the cluster age, nodes, and objects in the cluster, as these factors can lead to an increase in the number of metrics or time series in Prometheus. The following infrastructure node size recommendations are based on the results observed in cluster-density testing detailed in the Control plane node sizing section, where the monitoring stack and the default ingress-controller were moved to these nodes.
| Number of worker nodes | Cluster density, or number of namespaces | CPU cores | Memory (GB) |
|---|---|---|---|
| 27 | 500 | 4 | 24 |
| 120 | 1000 | 8 | 48 |
| 252 | 4000 | 16 | 128 |
| 501 | 4000 | 32 | 128 |
In general, three infrastructure nodes are recommended per cluster.
These sizing recommendations should be used as a guideline. Prometheus is a highly memory intensive application; the resource usage depends on various factors including the number of nodes, objects, the Prometheus metrics scraping interval, metrics or time series, and the age of the cluster. In addition, the router resource usage can also be affected by the number of routes and the amount and type of inbound requests.
These recommendations apply only to infrastructure nodes hosting Monitoring, Ingress and Registry infrastructure components installed during cluster creation.
In OpenShift Container Platform 4.14, half of a CPU core (500 millicores) is reserved by the system by default, in contrast to OpenShift Container Platform 3.11 and previous versions. This influences the stated sizing recommendations.
2.2.2. Scaling the Cluster Monitoring Operator
OpenShift Container Platform exposes metrics that the Cluster Monitoring Operator (CMO) collects and stores in the Prometheus-based monitoring stack. As an administrator, you can view dashboards for system resources, containers, and components metrics in the OpenShift Container Platform web console by navigating to Observe → Dashboards.
2.2.3. Prometheus database storage requirements
Red Hat performed various tests for different scale sizes.
- The following Prometheus storage requirements are not prescriptive and should be used as a reference. Higher resource consumption might be observed in your cluster depending on workload activity and resource density, including the number of pods, containers, routes, or other resources exposing metrics collected by Prometheus.
- You can configure the size-based data retention policy to suit your storage requirements.
| Number of nodes | Number of pods (2 containers per pod) | Prometheus storage growth per day | Prometheus storage growth per 15 days | Network (per tsdb chunk) |
|---|---|---|---|---|
| 50 | 1800 | 6.3 GB | 94 GB | 16 MB |
| 100 | 3600 | 13 GB | 195 GB | 26 MB |
| 150 | 5400 | 19 GB | 283 GB | 36 MB |
| 200 | 7200 | 25 GB | 375 GB | 46 MB |
Approximately 20 percent of the expected size was added as overhead to ensure that the storage requirements do not exceed the calculated value.
The above calculation is for the default OpenShift Container Platform Cluster Monitoring Operator.
CPU utilization has minor impact. The ratio is approximately 1 core out of 40 per 50 nodes and 1800 pods.
Recommendations for OpenShift Container Platform
- Use at least two infrastructure (infra) nodes.
- Use at least three openshift-container-storage nodes with SSD or NVMe (non-volatile memory express) drives.
2.2.4. Configuring cluster monitoring
You can increase the storage capacity for the Prometheus component in the cluster monitoring stack.
Procedure
To increase the storage capacity for Prometheus:
Create a YAML configuration file named cluster-monitoring-config.yaml. In the file, set the following values, as shown in the example after this list:
- The default value of Prometheus retention is PROMETHEUS_RETENTION_PERIOD=15d. Units are measured in time using one of these suffixes: s, m, h, d.
- Specify a storage class that is available in your cluster.
- A typical Prometheus storage value is PROMETHEUS_STORAGE_SIZE=2000Gi. Storage values can be a plain integer or a fixed-point integer using one of these suffixes: E, P, T, G, M, K. You can also use the power-of-two equivalents: Ei, Pi, Ti, Gi, Mi, Ki.
- A typical Alertmanager storage value is ALERTMANAGER_STORAGE_SIZE=20Gi. Storage values can be a plain integer or a fixed-point integer using one of these suffixes: E, P, T, G, M, K. You can also use the power-of-two equivalents: Ei, Pi, Ti, Gi, Mi, Ki.
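A minimal sketch of such a file, assuming the standard cluster-monitoring-config ConfigMap format in the openshift-monitoring namespace; the placeholder values shown here should be replaced with your own:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      retention: 15d                         # PROMETHEUS_RETENTION_PERIOD
      volumeClaimTemplate:
        spec:
          storageClassName: <storage_class>  # a storage class available in your cluster
          resources:
            requests:
              storage: 2000Gi                # PROMETHEUS_STORAGE_SIZE
    alertmanagerMain:
      volumeClaimTemplate:
        spec:
          storageClassName: <storage_class>
          resources:
            requests:
              storage: 20Gi                  # ALERTMANAGER_STORAGE_SIZE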
- Add values for the retention period, storage class, and storage sizes.
- Save the file.
Apply the changes by running:
$ oc create -f cluster-monitoring-config.yaml
2.3. Recommended etcd practices
This topic provides recommended performance and scalability practices for etcd in OpenShift Container Platform.
2.3.1. Recommended etcd practices
Because etcd writes data to disk and persists proposals on disk, its performance depends on disk performance. Although etcd is not particularly I/O intensive, it requires a low latency block device for optimal performance and stability. Because etcd’s consensus protocol depends on persistently storing metadata to a log (WAL), etcd is sensitive to disk-write latency. Slow disks and disk activity from other processes can cause long fsync latencies.
Those latencies can cause etcd to miss heartbeats, not commit new proposals to the disk on time, and ultimately experience request timeouts and temporary leader loss. High write latencies also lead to OpenShift API slowness, which affects cluster performance. For these reasons, avoid colocating other workloads on the control-plane nodes that are I/O sensitive or intensive and share the same underlying I/O infrastructure.
In terms of latency, run etcd on top of a block device that can write at least 50 sequential IOPS of 8000 bytes each, which corresponds to a latency of about 10 ms. Keep in mind that etcd uses fdatasync to synchronize each write in the WAL. For heavily loaded clusters, 500 sequential IOPS of 8000 bytes (about 2 ms) are recommended. To measure those numbers, you can use a benchmarking tool, such as fio.
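For example, a common fio invocation that approximates etcd's fdatasync-per-write pattern looks similar to the following; the directory, block size, and total size shown here are illustrative, and the etcd-perf container described later in this section wraps a comparable test:

$ fio --rw=write --ioengine=sync --fdatasync=1 --directory=/var/lib/etcd --size=22m --bs=2300 --name=etcd-io-test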
To achieve such performance, run etcd on machines that are backed by SSD or NVMe disks with low latency and high throughput. Consider single-level cell (SLC) solid-state drives (SSDs), which provide 1 bit per memory cell, are durable and reliable, and are ideal for write-intensive workloads.
The load on etcd arises from static factors, such as the number of nodes and pods, and dynamic factors, including changes in endpoints due to pod autoscaling, pod restarts, job executions, and other workload-related events. To accurately size your etcd setup, you must analyze the specific requirements of your workload. Consider the number of nodes, pods, and other relevant factors that impact the load on etcd.
The following hard drive practices provide optimal etcd performance:
- Use dedicated etcd drives. Avoid drives that communicate over the network, such as iSCSI. Do not place log files or other heavy workloads on etcd drives.
- Prefer drives with low latency to support fast read and write operations.
- Prefer high-bandwidth writes for faster compactions and defragmentation.
- Prefer high-bandwidth reads for faster recovery from failures.
- Use solid state drives as a minimum selection. Prefer NVMe drives for production environments.
- Use server-grade hardware for increased reliability.
Avoid NAS or SAN setups and spinning drives. Ceph Rados Block Device (RBD) and other types of network-attached storage can result in unpredictable network latency. To provide fast storage to etcd nodes at scale, use PCI passthrough to pass NVM devices directly to the nodes.
Always benchmark by using utilities such as fio. You can use such utilities to continuously monitor the cluster performance as it increases.
Avoid using the Network File System (NFS) protocol or other network based file systems.
Some key metrics to monitor on a deployed OpenShift Container Platform cluster are p99 of etcd disk write ahead log duration and the number of etcd leader changes. Use Prometheus to track these metrics.
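For example, assuming the default metric names listed later in this section, PromQL queries along the following lines track both signals:

histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
increase(etcd_server_leader_changes_seen_total[1h])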
The etcd member database sizes can vary in a cluster during normal operations. This difference does not affect cluster upgrades, even if the leader size is different from the other members.
To validate the hardware for etcd before or after you create the OpenShift Container Platform cluster, you can use fio.
Prerequisites
- Container runtimes such as Podman or Docker are installed on the machine that you’re testing.
- Data is written to the /var/lib/etcd path.
Procedure
Run fio and analyze the results:
If you use Podman, run this command:

$ sudo podman run --volume /var/lib/etcd:/var/lib/etcd:Z quay.io/cloud-bulldozer/etcd-perf

If you use Docker, run this command:

$ sudo docker run --volume /var/lib/etcd:/var/lib/etcd:Z quay.io/cloud-bulldozer/etcd-perf
The output reports whether the disk is fast enough to host etcd by comparing the 99th percentile of the fsync metric captured from the run to see if it is less than 10 ms. A few of the most important etcd metrics that might be affected by I/O performance are as follows:
- The etcd_disk_wal_fsync_duration_seconds_bucket metric reports the etcd WAL fsync duration.
- The etcd_disk_backend_commit_duration_seconds_bucket metric reports the etcd backend commit latency duration.
- The etcd_server_leader_changes_seen_total metric reports the leader changes.
Because etcd replicates the requests among all the members, its performance strongly depends on network input/output (I/O) latency. High network latencies result in etcd heartbeats taking longer than the election timeout, which results in leader elections that are disruptive to the cluster. A key metric to monitor on a deployed OpenShift Container Platform cluster is the 99th percentile of etcd network peer latency on each etcd cluster member. Use Prometheus to track the metric.
The histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[2m])) metric reports the round trip time for etcd to finish replicating the client requests between the members. Ensure that it is less than 50 ms.
2.3.2. Moving etcd to a different disk
You can move etcd from a shared disk to a separate disk to prevent or resolve performance issues.
The Machine Config Operator (MCO) is responsible for mounting a secondary disk for OpenShift Container Platform 4.14 container storage.
This encoded script only supports device names for the following device types:
- SCSI or SATA: /dev/sd*
- Virtual device: /dev/vd*
- NVMe: /dev/nvme*[0-9]*n*
Limitations
- When the new disk is attached to the cluster, the etcd database is part of the root mount. It is not part of the secondary disk or the intended disk when the primary node is recreated. As a result, the primary node will not create a separate /var/lib/etcd mount.
Prerequisites
- You have a backup of your cluster’s etcd data.
- You have installed the OpenShift CLI (oc).
- You have access to the cluster with cluster-admin privileges.
- Add additional disks before uploading the machine configuration.
- The MachineConfigPool must match metadata.labels[machineconfiguration.openshift.io/role]. This applies to a controller, worker, or a custom pool.
This procedure does not move parts of the root file system, such as /var/, to another disk or partition on an installed node.
This procedure is not supported when using control plane machine sets.
Procedure
Attach the new disk to the cluster and verify that the disk is detected in the node by running the lsblk command in a debug shell:

$ oc debug node/<node_name>
# lsblk

Note the device name of the new disk reported by the lsblk command.

Create the following script and name it etcd-find-secondary-device.sh. In the script, replace <device_type_glob> with a shell glob for your block device type. For SCSI or SATA drives, use /dev/sd*; for virtual drives, use /dev/vd*; for NVMe drives, use /dev/nvme*[0-9]*n*.
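A hypothetical sketch of such a script, which finds an unformatted secondary block device that matches the glob and prepares it for the /var/lib/etcd mount, could look like the following; the filesystem label and marker file are illustrative:

#!/bin/bash
set -uo pipefail

for device in <device_type_glob>; do
  # Skip devices that already contain a filesystem or partition signature.
  if ! blkid "${device}" &> /dev/null; then
    echo "secondary device found: ${device}"
    # Create an XFS filesystem with a label that a mount unit can reference.
    mkfs.xfs -L var-lib-etcd -f "${device}" &> /dev/null
    udevadm settle
    # Marker file that signals the device has been prepared.
    touch /etc/var-lib-etcd-mount
    exit 0
  fi
done

echo "Could not find a secondary block device" >&2
exit 77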
Create a base64-encoded string from the etcd-find-secondary-device.sh script and note its contents:

$ base64 -w0 etcd-find-secondary-device.sh

Create a MachineConfig YAML file named etcd-mc.yml and replace <encoded_etcd_find_secondary_device_script> with the encoded script contents that you noted.
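A heavily abbreviated, hypothetical sketch of the shape such a MachineConfig can take, assuming an Ignition file entry for the encoded script plus systemd units that run it and mount the labeled device at /var/lib/etcd; the file paths, unit names, and unit contents are illustrative, not the exact supported manifest:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 98-var-lib-etcd
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - path: /usr/local/bin/find-secondary-device   # illustrative path for the decoded script
        mode: 493
        contents:
          source: data:text/plain;charset=utf-8;base64,<encoded_etcd_find_secondary_device_script>
    systemd:
      units:
      - name: find-secondary-device.service          # runs the script once at boot
        enabled: true
        contents: |
          [Unit]
          Description=Find and prepare a secondary device for etcd
          ConditionPathExists=!/etc/var-lib-etcd-mount
          [Service]
          Type=oneshot
          RemainAfterExit=yes
          ExecStart=/usr/local/bin/find-secondary-device
          [Install]
          WantedBy=multi-user.target
      - name: var-lib-etcd.mount                     # mounts the labeled filesystem
        enabled: true
        contents: |
          [Unit]
          Before=local-fs.target
          [Mount]
          What=/dev/disk/by-label/var-lib-etcd
          Where=/var/lib/etcd
          Type=xfs
          [Install]
          RequiredBy=local-fs.target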
Verification steps
Run the grep /var/lib/etcd /proc/mounts command in a debug shell for the node to ensure that the disk is mounted:

$ oc debug node/<node_name>
# grep -w "/var/lib/etcd" /proc/mounts

Example output
/dev/sdb /var/lib/etcd xfs rw,seclabel,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
2.3.3. Defragmenting etcd data
For large and dense clusters, etcd can suffer from poor performance if the keyspace grows too large and exceeds the space quota. Periodically maintain and defragment etcd to free up space in the data store. Monitor Prometheus for etcd metrics and defragment it when required; otherwise, etcd can raise a cluster-wide alarm that puts the cluster into a maintenance mode that accepts only key reads and deletes.
Monitor these key metrics:
- etcd_server_quota_backend_bytes, which is the current quota limit
- etcd_mvcc_db_total_size_in_use_in_bytes, which indicates the actual database usage after a history compaction
- etcd_mvcc_db_total_size_in_bytes, which shows the database size, including free space waiting for defragmentation
Defragment etcd data to reclaim disk space after events that cause disk fragmentation, such as etcd history compaction.
History compaction is performed automatically every five minutes and leaves gaps in the back-end database. This fragmented space is available for use by etcd, but is not available to the host file system. You must defragment etcd to make this space available to the host file system.
Defragmentation occurs automatically, but you can also trigger it manually.
Automatic defragmentation is good for most cases, because the etcd operator uses cluster information to determine the most efficient operation for the user.
2.3.3.1. Automatic defragmentation
The etcd Operator automatically defragments disks. No manual intervention is needed.
Verify that the defragmentation process is successful by viewing one of these logs:
- etcd logs
- cluster-etcd-operator pod
- operator status error log
Automatic defragmentation can cause leader election failure in various OpenShift core components, such as the Kubernetes controller manager, which triggers a restart of the failing component. The restart is harmless and either triggers failover to the next running instance or the component resumes work again after the restart.
Example log output for successful defragmentation
etcd member has been defragmented: <member_name>, memberID: <member_id>

Example log output for unsuccessful defragmentation

failed defrag on member: <member_name>, memberID: <member_id>: <error_message>
2.3.3.2. Manual defragmentation
A Prometheus alert indicates when you need to use manual defragmentation. The alert is displayed in two cases:
- When etcd uses more than 50% of its available space for more than 10 minutes
- When etcd is actively using less than 50% of its total database size for more than 10 minutes
You can also determine whether defragmentation is needed by checking the etcd database size in MB that will be freed by defragmentation with the PromQL expression: (etcd_mvcc_db_total_size_in_bytes - etcd_mvcc_db_total_size_in_use_in_bytes)/1024/1024
Defragmenting etcd is a blocking action. The etcd member will not respond until defragmentation is complete. For this reason, wait at least one minute between defragmentation actions on each of the pods to allow the cluster to recover.
Follow this procedure to defragment etcd data on each etcd member.
Prerequisites
- You have access to the cluster as a user with the cluster-admin role.
Procedure
Determine which etcd member is the leader, because the leader should be defragmented last.
Get the list of etcd pods:
$ oc -n openshift-etcd get pods -l k8s-app=etcd -o wide

Example output
etcd-ip-10-0-159-225.example.redhat.com   3/3   Running   0   175m   10.0.159.225   ip-10-0-159-225.example.redhat.com   <none>   <none>
etcd-ip-10-0-191-37.example.redhat.com    3/3   Running   0   173m   10.0.191.37    ip-10-0-191-37.example.redhat.com    <none>   <none>
etcd-ip-10-0-199-170.example.redhat.com   3/3   Running   0   176m   10.0.199.170   ip-10-0-199-170.example.redhat.com   <none>   <none>

Choose a pod and run the following command to determine which etcd member is the leader:
$ oc rsh -n openshift-etcd etcd-ip-10-0-159-225.example.redhat.com etcdctl endpoint status --cluster -w table

Based on the IS LEADER column of the command output, the https://10.0.199.170:2379 endpoint is the leader. Matching this endpoint with the output of the previous step, the pod name of the leader is etcd-ip-10-0-199-170.example.redhat.com.
Defragment an etcd member.
Connect to the running etcd container, passing in the name of a pod that is not the leader:
$ oc rsh -n openshift-etcd etcd-ip-10-0-159-225.example.redhat.com

Unset the ETCDCTL_ENDPOINTS environment variable:

sh-4.4# unset ETCDCTL_ENDPOINTS

Defragment the etcd member:
sh-4.4# etcdctl --command-timeout=30s --endpoints=https://localhost:2379 defrag

Example output

Finished defragmenting etcd member[https://localhost:2379]

If a timeout error occurs, increase the value for --command-timeout until the command succeeds.

Verify that the database size was reduced:

sh-4.4# etcdctl endpoint status -w table --cluster

The command output shows that the database size for this etcd member is now 41 MB as opposed to the starting size of 104 MB.
Repeat these steps to connect to each of the other etcd members and defragment them. Always defragment the leader last.
Wait at least one minute between defragmentation actions to allow the etcd pod to recover. Until the etcd pod recovers, the etcd member will not respond.
If any NOSPACE alarms were triggered due to the space quota being exceeded, clear them.

Check if there are any NOSPACE alarms:

sh-4.4# etcdctl alarm list

Example output

memberID:12345678912345678912 alarm:NOSPACE

Clear the alarms:

sh-4.4# etcdctl alarm disarm
2.3.4. Setting tuning parameters for etcd
You can set the control plane hardware speed to "Standard", "Slower", or the default, which is "".
The default setting allows the system to decide which speed to use. This value enables upgrades from versions where this feature does not exist, as the system can select values from previous versions.
By selecting one of the other values, you are overriding the default. If you see many leader elections due to timeouts or missed heartbeats and your system is set to "" or "Standard", set the hardware speed to "Slower" to make the system more tolerant to the increased latency.
Tuning etcd latency tolerances is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
2.3.4.1. Changing hardware speed tolerance
To change the hardware speed tolerance for etcd, complete the following steps.
Prerequisites
- You have edited the cluster instance to enable TechPreviewNoUpgrade features. For more information, see "Understanding feature gates" in the Additional resources.
Procedure
Check to see what the current value is by entering the following command:
$ oc describe etcd/cluster | grep "Control Plane Hardware Speed"

Example output

Control Plane Hardware Speed: <VALUE>

Note: If the output is empty, the field has not been set and should be considered as the default ("").
Change the value by entering the following command. Replace <value> with one of the valid values: "", "Standard", or "Slower":

$ oc patch etcd/cluster --type=merge -p '{"spec": {"controlPlaneHardwareSpeed": "<value>"}}'

The following table indicates the heartbeat interval and leader election timeout for each profile. These values are subject to change.
| Profile | ETCD_HEARTBEAT_INTERVAL | ETCD_LEADER_ELECTION_TIMEOUT |
|---|---|---|
| "" | Varies depending on platform | Varies depending on platform |
| Standard | 100 | 1000 |
| Slower | 500 | 2500 |
Review the output:
Example output
etcd.operator.openshift.io/cluster patched

If you enter any value besides the valid values, error output is displayed. For example, if you entered "Faster" as the value, the output is as follows:

Example output

The Etcd "cluster" is invalid: spec.controlPlaneHardwareSpeed: Unsupported value: "Faster": supported values: "", "Standard", "Slower"

Verify that the value was changed by entering the following command:
$ oc describe etcd/cluster | grep "Control Plane Hardware Speed"

Example output

Control Plane Hardware Speed: ""

Wait for etcd pods to roll out:
$ oc get pods -n openshift-etcd -w

The following output shows the expected entries for master-0. Before you continue, wait until all masters show a status of 4/4 Running.

Enter the following command to review the values:
$ oc describe -n openshift-etcd pod/<ETCD_PODNAME> | grep -e HEARTBEAT_INTERVAL -e ELECTION_TIMEOUT

Note: These values might not have changed from the default.
Chapter 3. Reference design specifications
3.1. Telco core and RAN DU reference design specifications
The telco core reference design specification (RDS) describes OpenShift Container Platform 4.14 clusters running on commodity hardware that can support large scale telco applications including control plane and some centralized data plane functions.
The telco RAN RDS describes the configuration for clusters running on commodity hardware to host 5G workloads in the Radio Access Network (RAN).
3.1.1. Reference design specifications for telco 5G deployments
Red Hat and certified partners offer deep technical expertise and support for networking and operational capabilities required to run telco applications on OpenShift Container Platform 4.14 clusters.
Red Hat’s telco partners require a well-integrated, well-tested, and stable environment that can be replicated at scale for enterprise 5G solutions. The telco core and RAN DU reference design specifications (RDS) outline the recommended solution architecture based on a specific version of OpenShift Container Platform. Each RDS describes a tested and validated platform configuration for telco core and RAN DU use models. The RDS ensures an optimal experience when running your applications by defining the set of critical KPIs for telco 5G core and RAN DU. Following the RDS minimizes high severity escalations and improves application stability.
5G use cases are evolving and your workloads are continually changing. Red Hat is committed to iterating over the telco core and RAN DU RDS to support evolving requirements based on customer and partner feedback.
3.1.2. Reference design scope
The telco core and telco RAN reference design specifications (RDS) capture the recommended, tested, and supported configurations to get reliable and repeatable performance for clusters running the telco core and telco RAN profiles.
Each RDS includes the released features and supported configurations that are engineered and validated for clusters to run the individual profiles. The configurations provide a baseline OpenShift Container Platform installation that meets feature and KPI targets. Each RDS also describes expected variations for each individual configuration. Validation of each RDS includes many long duration and at-scale tests.
The validated reference configurations are updated for each major Y-stream release of OpenShift Container Platform. Z-stream patch releases are periodically re-tested against the reference configurations.
3.1.3. Deviations from the reference design
Deviating from the validated telco core and telco RAN DU reference design specifications (RDS) can have significant impact beyond the specific component or feature that you change. Deviations require analysis and engineering in the context of the complete solution.
All deviations from the RDS should be analyzed and documented with clear action tracking information. Due diligence is expected from partners to understand how to bring deviations into line with the reference design. This might require partners to provide additional resources to engage with Red Hat to work towards enabling their use case to achieve a best in class outcome with the platform. This is critical for the supportability of the solution and ensuring alignment across Red Hat and with partners.
Deviation from the RDS can have some or all of the following consequences:
- It can take longer to resolve issues.
- There is a risk of missing project service-level agreements (SLAs), project deadlines, end provider performance requirements, and so on.
Unapproved deviations may require escalation at executive levels.
Note: Red Hat prioritizes the servicing of requests for deviations based on partner engagement priorities.
3.2. Telco RAN DU reference design specification
3.2.1. Telco RAN DU 4.14 reference design overview
The Telco RAN distributed unit (DU) 4.14 reference design configures an OpenShift Container Platform 4.14 cluster running on commodity hardware to host telco RAN DU workloads. It captures the recommended, tested, and supported configurations to get reliable and repeatable performance for a cluster running the telco RAN DU profile.
3.2.1.1. OpenShift Container Platform 4.14 features for telco RAN DU
The following features, which are included in OpenShift Container Platform 4.14 and leveraged by the telco RAN DU reference design specification (RDS), have been added or updated.
| Feature | Description |
|---|---|
| GitOps ZTP independence from managed cluster version | You can now use GitOps ZTP to manage clusters that are running different versions of OpenShift Container Platform compared to the version that is running on the hub cluster. You can also have a mix of OpenShift Container Platform versions in the deployed fleet of clusters. |
| Using custom CRs alongside the reference CRs in GitOps ZTP | You can now use custom CRs alongside the provided reference configuration CRs in GitOps ZTP. |
| Using custom node labels in the SiteConfig CR | You can now use custom node labels for nodes in managed clusters that you install with the SiteConfig CR. |
| Intel Westport Channel e810 NIC as PTP Grandmaster clock (Technology Preview) | You can use the Intel Westport Channel E810-XXVDA4T as a GNSS-sourced grandmaster clock. The NIC is automatically configured by the PTP Operator with the E810 hardware plugin. |
| PTP Operator hardware specific functionality plugin (Technology Preview) | A new E810 NIC hardware plugin is now available in the PTP Operator. You can use the E810 plugin to configure the NIC directly. |
| PTP events and metrics | PTP events and metrics for the grandmaster clock (T-GM) are new in OpenShift Container Platform 4.14 (Technology Preview). |
| Precaching user-specified images | You can now precache application workload images before upgrading your applications on single-node OpenShift clusters with Topology Aware Lifecycle Manager. |
| Using OpenShift capabilities to further reduce the single-node OpenShift DU footprint | Use cluster capabilities to enable or disable optional components before you install the cluster. Single-node OpenShift clusters that run DU workloads require logging and log forwarding. |
3.2.1.2. Deployment architecture overview
You deploy the telco RAN DU 4.14 reference configuration to managed clusters from a centrally managed RHACM hub cluster. The reference design specification (RDS) includes configuration of the managed clusters and the hub cluster components.
Figure 3.1. Telco RAN DU deployment architecture overview
3.2.2. Telco RAN DU use model overview
Use the following information to plan telco RAN DU workloads, cluster resources, and hardware specifications for the hub cluster and managed single-node OpenShift clusters.
3.2.2.1. Telco RAN DU application workloads
DU worker nodes must have 3rd Generation Xeon (Ice Lake) 2.20 GHz or better CPUs with firmware tuned for maximum performance.
5G RAN DU user applications and workloads should conform to the following best practices and application limits:
- Develop cloud-native network functions (CNFs) that conform to the latest version of the CNF best practices guide.
- Use SR-IOV for high performance networking.
- Use exec probes sparingly and only when no other suitable options are available.
  - Do not use exec probes if a CNF uses CPU pinning. Use other probe implementations, for example, httpGet or tcpSocket.
  - When you need to use exec probes, limit the exec probe frequency and quantity. The maximum number of exec probes must be kept below 10, and the frequency must not be set to less than 10 seconds.
Startup probes require minimal resources during steady-state operation. The limitation on exec probes applies primarily to liveness and readiness probes.
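For reference, a readiness probe that uses httpGet instead of exec might be declared as in the following sketch; the pod name, container name, image, port, and path are hypothetical:

apiVersion: v1
kind: Pod
metadata:
  name: cnf-example                # hypothetical pod name
spec:
  containers:
  - name: cnf-app                  # hypothetical container name
    image: registry.example.com/cnf-app:latest
    ports:
    - containerPort: 8080
    readinessProbe:
      httpGet:                     # avoids spawning a process inside the pinned container
        path: /healthz
        port: 8080
      periodSeconds: 10            # keep the probe frequency at 10 seconds or more
      failureThreshold: 3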
3.2.2.2. Telco RAN DU representative reference application workload characteristics
The representative reference application workload has the following characteristics:
- Has a maximum of 15 pods and 30 containers for the vRAN application including its management and control functions
- Uses a maximum of 2 ConfigMap and 4 Secret CRs per pod
- Uses a maximum of 10 exec probes with a frequency of not less than 10 seconds
- Incremental application load on the kube-apiserver is less than 10% of the cluster platform usage.

  Note: You can extract the CPU load from the platform metrics. For example:

  query=avg_over_time(pod:container_cpu_usage:sum{namespace="openshift-kube-apiserver"}[30m])
- Application logs are not collected by the platform log collector
- Aggregate traffic on the primary CNI is less than 1 MBps
3.2.2.3. Telco RAN DU worker node cluster resource utilization
The maximum number of running pods in the system, inclusive of application workloads and OpenShift Container Platform pods, is 120.
- Resource utilization
OpenShift Container Platform resource utilization varies depending on many factors including application workload characteristics such as:
- Pod count
- Type and frequency of probes
- Messaging rates on primary CNI or secondary CNI with kernel networking
- API access rate
- Logging rates
- Storage IOPS
Cluster resource requirements are applicable under the following conditions:
- The cluster is running the described representative application workload.
- The cluster is managed with the constraints described in "Telco RAN DU worker node cluster resource utilization".
- Components noted as optional in the RAN DU use model configuration are not applied.
You will need to do additional analysis to determine the impact on resource utilization and ability to meet KPI targets for configurations outside the scope of the Telco RAN DU reference design. You might have to allocate additional resources in the cluster depending on your requirements.
3.2.2.4. Hub cluster management characteristics
Red Hat Advanced Cluster Management (RHACM) is the recommended cluster management solution. Configure it with the following limits on the hub cluster:
- Configure a maximum of 5 RHACM policies with a compliant evaluation interval of at least 10 minutes.
- Use a maximum of 10 managed cluster templates in policies. Where possible, use hub-side templating.
- Disable all RHACM add-ons except for the policy-controller and observability-controller add-ons. Set Observability to the default configuration.

  Important: Configuring optional components or enabling additional features will result in additional resource usage and can reduce overall system performance.
For more information, see Reference design deployment components.
| Metric | Limit | Notes |
|---|---|---|
| CPU usage | Less than 4000 mc – 2 cores (4 hyperthreads) | Platform CPU is pinned to reserved cores, including both hyperthreads in each reserved core. The system is engineered to use 3 CPUs (3000mc) at steady-state to allow for periodic system tasks and spikes. |
| Memory used | Less than 16G | |
3.2.2.5. Telco RAN DU RDS components
The following sections describe the various OpenShift Container Platform components and configurations that you use to configure and deploy clusters to run telco RAN DU workloads.
Figure 3.2. Telco RAN DU reference design components
Ensure that components that are not included in the telco RAN DU profile do not affect the CPU resources allocated to workload applications.
Out of tree drivers are not supported.
3.2.3. Telco RAN DU 4.14 reference design components
The following sections describe the various OpenShift Container Platform components and configurations that you use to configure and deploy clusters to run RAN DU workloads.
3.2.3.1. Host firmware tuning
- New in this release
- No reference design updates in this release
- Description
Configure system level performance. See Configuring host firmware for low latency and high performance for recommended settings.
If Ironic inspection is enabled, the firmware setting values are available from the per-cluster BareMetalHost CR on the hub cluster. You enable Ironic inspection with a label in the spec.clusters.nodes field in the SiteConfig CR that you use to install the cluster. For example:

nodes:
  - hostName: "example-node1.example.com"
    ironicInspect: "enabled"

Note: The telco RAN DU reference SiteConfig does not enable the ironicInspect field by default.
- Limits and requirements
- Hyperthreading must be enabled
- Engineering considerations
- Tune all settings for maximum performance.

  Note: You can tune firmware selections for power savings at the expense of performance as required.
3.2.3.2. Node Tuning Operator
- New in this release
- No reference design updates in this release
- Description
You tune the cluster performance by creating a performance profile. Settings that you configure with a performance profile include:
- Selecting the realtime or non-realtime kernel.
- Allocating cores to a reserved or isolated cpuset. OpenShift Container Platform processes allocated to the management workload partition are pinned to the reserved set.
- Configuring huge pages.
- Setting additional kernel arguments.
- Setting per-core power tuning and max CPU frequency.
- Limits and requirements
The Node Tuning Operator uses the PerformanceProfile CR to configure the cluster. You need to configure the following settings in the RAN DU profile PerformanceProfile CR, as illustrated in the sketch after this list:
- Select reserved and isolated cores and ensure that you allocate at least 4 hyperthreads (equivalent to 2 cores) on Intel 3rd Generation Xeon (Ice Lake) 2.20 GHz CPUs or better with firmware tuned for maximum performance.
- Set the reserved cpuset to include both hyperthread siblings for each included core. Unreserved cores are available as allocatable CPU for scheduling workloads. Ensure that hyperthread siblings are not split across reserved and isolated cores.
- Configure reserved and isolated CPUs to include all threads in all cores based on what you have set as reserved and isolated CPUs.
- Set core 0 of each NUMA node to be included in the reserved CPU set.
- Set the huge page size to 1G.
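A minimal, hypothetical sketch of such a PerformanceProfile follows; the profile name, CPU ranges, node selector, and hugepage count are illustrative and depend on your hardware:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: openshift-node-performance-profile   # illustrative name
spec:
  cpu:
    reserved: "0-1,32-33"    # illustrative: cores 0 and 1 plus their hyperthread siblings
    isolated: "2-31,34-63"   # remaining threads are allocatable for workloads
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - size: 1G
      count: 32              # illustrative count; size this for your workloads
  realTimeKernel:
    enabled: true
  nodeSelector:
    node-role.kubernetes.io/master: ""
  numa:
    topologyPolicy: "restricted"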
You should not add additional workloads to the management partition. Only those pods which are part of the OpenShift management platform should be annotated into the management partition.
- Engineering considerations
- Use the RT kernel to meet performance requirements.

  Note: You can use the non-RT kernel if required.
- The number of huge pages that you configure depends on the application workload requirements. Variation in this parameter is expected and allowed.
- Variation is expected in the configuration of reserved and isolated CPU sets based on selected hardware and additional components in use on the system. Variation must still meet the specified limits.
- Hardware without IRQ affinity support impacts isolated CPUs. To ensure that pods with guaranteed whole CPU QoS have full use of the allocated CPU, all hardware in the server must support IRQ affinity. For more information, see About support of IRQ affinity setting.
In OpenShift Container Platform 4.14, any PerformanceProfile CR configured on the cluster causes the Node Tuning Operator to automatically set all cluster nodes to use cgroup v1.
For more information about cgroups, see Configuring Linux cgroup.
3.2.3.3. PTP Operator
- New in this release
- PTP grandmaster clock (T-GM) GPS timing with Intel E810-XXV-4T Westport Channel NIC – minimum firmware version 4.30 (Technology Preview)
- PTP events and metrics for grandmaster (T-GM) are new in OpenShift Container Platform 4.14 (Technology Preview)
- Description
Configure PTP timing support for cluster nodes. The DU node can run in the following modes:
- As an ordinary clock synced to a T-GM or boundary clock (T-BC)
- As dual boundary clocks, one per NIC (high availability is not supported)
- As grandmaster clock with support for E810 Westport Channel NICs (Technology Preview)
- Optionally as a boundary clock for radio units (RUs)
Optional: subscribe applications to PTP events that happen on the node where the application is running. You subscribe the application to events over HTTP.
- Limits and requirements
- High availability is not supported with dual NIC configurations.
- Westport Channel NICs configured as T-GM do not support DPLL with the current ice driver version.
- GPS offsets are not reported. Use a default offset of less than or equal to 5.
- DPLL offsets are not reported. Use a default offset of less than or equal to 5.
- Engineering considerations
- Configurations are provided for ordinary clock, boundary clock, or grandmaster clock
- PTP fast event notifications use ConfigMap CRs to store PTP event subscriptions
- Use Intel E810-XXV-4T Westport Channel NICs for PTP grandmaster clocks with GPS timing, minimum firmware version 4.40
3.2.3.4. SR-IOV Operator
- New in this release
- No reference design updates in this release
- Description
- The SR-IOV Operator provisions and configures the SR-IOV CNI and device plugins. Both netdevice (kernel VFs) and vfio (DPDK) devices are supported.
-
Customer variation on the configuration and number of
SriovNetworkandSriovNetworkNodePolicycustom resources (CRs) is expected. -
IOMMU kernel command-line settings are applied with a
MachineConfigCR at install time. This ensures that theSriovOperatorCR does not cause a reboot of the node when adding them.
-
Customer variation on the configuration and number of
3.2.3.5. Logging
- New in this release
- Vector is now the recommended log collector.
- Description
- Use logging to collect logs from the far edge node for remote analysis.
- Engineering considerations
- Handling logs beyond the infrastructure and audit logs, for example, from the application workload requires additional CPU and network bandwidth based on additional logging rate.
As of OpenShift Container Platform 4.14, vector is the reference log collector.
Note: Use of fluentd in the RAN use model is deprecated.
3.2.3.6. SRIOV-FEC Operator
- New in this release
- No reference design updates in this release
- Description
- SRIOV-FEC Operator is an optional 3rd party Certified Operator supporting FEC accelerator hardware.
- Limits and requirements
Starting with FEC Operator v2.7.0:
- SecureBoot is supported
- The vfio driver for the PF requires the usage of vfio-token that is injected into Pods. The VF token can be passed to DPDK by using the EAL parameter --vfio-vf-token.
- Engineering considerations
- The SRIOV-FEC Operator uses CPU cores from the isolated CPU set.
- You can validate FEC readiness as part of the pre-checks for application deployment, for example, by extending the validation policy.
3.2.3.7. Local Storage Operator
- New in this release
- No reference design updates in this release
- Description
- You can create persistent volumes that can be used as PVC resources by applications with the Local Storage Operator. The number and type of PV resources that you create depends on your requirements.
- Engineering considerations
- Create backing storage for PV CRs before creating the PV. This can be a partition, a local volume, an LVM volume, or a full disk.
- Refer to the device listing in LocalVolume CRs by the hardware path used to access each device to ensure correct allocation of disks and partitions. Logical names (for example, /dev/sda) are not guaranteed to be consistent across node reboots. For more information, see the RHEL 9 documentation on device identifiers.
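The reference CR is StorageLV.yaml in the YAML reference section. As an illustration only, a minimal LocalVolume sketch that references devices by hardware path rather than by logical name follows; the storage class name and device path are assumptions.

```yaml
apiVersion: local.storage.openshift.io/v1
kind: LocalVolume
metadata:
  name: local-disks               # hypothetical name
  namespace: openshift-local-storage
spec:
  storageClassDevices:
    - storageClassName: example-storage-class   # assumed name
      volumeMode: Filesystem
      fsType: xfs
      devicePaths:
        # Stable by-path identifiers instead of /dev/sdX logical names
        - /dev/disk/by-path/pci-0000:00:17.0-ata-1   # assumed path
```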
3.2.3.8. LVMS Operator
- New in this release
- Simplified LVMS deviceSelector logic
- LVM Storage with ext4 and PV resources
LVMS Operator is an optional component.
- Description
The LVMS Operator provides dynamic provisioning of block and file storage. The LVMS Operator creates logical volumes from local devices that can be used as PVC resources by applications. Volume expansion and snapshots are also possible.
The following example configuration (StorageLVMCluster.yaml) creates a vg1 volume group that leverages all available disks on the node except the installation disk, as illustrated in the sketch below:
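The full reference CR is listed in the YAML reference section. The following is a minimal sketch along the lines of StorageLVMCluster.yaml; the thin pool sizing values are assumptions. Because no deviceSelector is set, LVMS builds the volume group from all unused disks on the node.

```yaml
apiVersion: lvm.topolvm.io/v1alpha1
kind: LVMCluster
metadata:
  name: lvmcluster                # hypothetical name
  namespace: openshift-storage
spec:
  storage:
    deviceClasses:
      - name: vg1                 # volume group built from all unused disks
        default: true
        thinPoolConfig:
          name: thin-pool-1
          sizePercent: 90         # assumed sizing
          overprovisionRatio: 10  # assumed overprovisioning factor
```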
- Limits and requirements
- In single-node OpenShift clusters, persistent storage must be provided by either LVMS or Local Storage, not both.
- Engineering considerations
- The LVMS Operator is not the reference storage solution for the DU use case. If you require LVMS Operator for application workloads, the resource use is accounted for against the application cores.
- Ensure that sufficient disks or partitions are available for storage requirements.
3.2.3.9. Workload partitioning
- New in this release
- No reference design updates in this release
- Description
Workload partitioning pins OpenShift platform and Day 2 Operator pods that are part of the DU profile to the reserved cpuset and removes the reserved CPU from node accounting. This leaves all unreserved CPU cores available for user workloads.
The method of enabling and configuring workload partitioning changed in OpenShift Container Platform 4.14.
- 4.14 and later
Configure partitions by setting installation parameters:
cpuPartitioningMode: AllNodes
- Configure management partition cores with the reserved CPU set in the PerformanceProfile CR (see the PerformanceProfile sketch at the end of this section)
- 4.13 and earlier
- Configure partitions with extra MachineConfiguration CRs applied at install time
- Limits and requirements
- Namespace and Pod CRs must be annotated to allow the pod to be applied to the management partition
- Pods with CPU limits cannot be allocated to the partition. This is because mutation can change the pod QoS.
- For more information about the minimum number of CPUs that can be allocated to the management partition, see Node Tuning Operator.
- Engineering considerations
- Workload Partitioning pins all management pods to reserved cores. A sufficient number of cores must be allocated to the reserved set to account for operating system, management pods, and expected spikes in CPU use that occur when the workload starts, the node reboots, or other system events happen.
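The reserved CPU set that backs the management partition comes from the PerformanceProfile CR, as in the following minimal sketch; the profile name and CPU ranges are assumptions and must match the actual hardware topology. The full reference CR is PerformanceProfile.yaml in the YAML reference section.

```yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: openshift-node-performance-profile   # hypothetical name
spec:
  cpu:
    reserved: "0-3,32-35"         # assumed cores for the management partition
    isolated: "4-31,36-63"        # assumed cores left for application workloads
  nodeSelector:
    node-role.kubernetes.io/master: ""
```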
3.2.3.10. Cluster tuning
- New in this release
You can remove the Image Registry Operator by using the cluster capabilities feature.
Note: You configure cluster capabilities by using the spec.clusters.installConfigOverrides field in the SiteConfig CR that you use to install the cluster.
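A minimal sketch of how the installConfigOverrides field might look in a SiteConfig CR follows; the cluster name is hypothetical, and the capability set shown is an assumption based on the guidance to keep only the Marketplace and Node Tuning Operators, not the literal reference value.

```yaml
apiVersion: ran.openshift.io/v1
kind: SiteConfig
metadata:
  name: example-sno               # hypothetical cluster name
  namespace: example-sno
spec:
  clusters:
    - clusterName: example-sno
      installConfigOverrides: |
        {
          "capabilities": {
            "baselineCapabilitySet": "None",
            "additionalEnabledCapabilities": ["NodeTuning", "marketplace"]
          }
        }
```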
- Description
The cluster capabilities feature now includes a MachineAPI component which, when excluded, disables the following Operators and their resources in the cluster:
- openshift/cluster-autoscaler-operator
- openshift/cluster-control-plane-machine-set-operator
- openshift/machine-api-operator
- Limits and requirements
- Cluster capabilities are not available for installer-provisioned installation methods.
You must apply all platform tuning configurations. The following list describes the required platform tuning configurations (Table 3.3. Cluster capabilities configurations):
- Remove optional cluster capabilities: Reduce the OpenShift Container Platform footprint by disabling optional cluster Operators on single-node OpenShift clusters only. Remove all optional Operators except the Marketplace and Node Tuning Operators.
- Configure cluster monitoring: Configure the monitoring stack for reduced footprint by doing the following:
  - Disable the local alertmanager and telemeter components.
  - If you use RHACM observability, the CR must be augmented with appropriate additionalAlertManagerConfigs CRs to forward alerts to the hub cluster.
  - Reduce the Prometheus retention period to 24h.
  Note: The RHACM hub cluster aggregates managed cluster metrics.
- Disable networking diagnostics: Disable networking diagnostics for single-node OpenShift because they are not required.
- Configure a single Operator Hub catalog source: Configure the cluster to use a single catalog source that contains only the Operators required for a RAN DU deployment. Each catalog source increases the CPU use on the cluster. Using a single CatalogSource fits within the platform CPU budget (see the CatalogSource sketch after this list).
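The reference CR for the single catalog source is DefaultCatsrc.yaml in the YAML reference section. As an illustration only, a minimal CatalogSource sketch follows; the name and mirrored index image are assumptions for a disconnected registry.

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: default-cat-source        # hypothetical name
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  displayName: default-cat-source
  image: registry.example.com:5000/redhat-operator-index:v4.14   # assumed mirrored index image
  updateStrategy:
    registryPoll:
      interval: 1h
```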
3.2.3.11. Machine configuration
- New in this release
- Set rcu_normal after node recovery
- Limits and requirements
The CRI-O wipe disable MachineConfig assumes that images on disk are static other than during scheduled maintenance in defined maintenance windows. To ensure the images are static, do not set the pod imagePullPolicy field to Always.

Table 3.4. Machine configuration options

| Feature | Description |
|---|---|
| Container runtime | Sets the container runtime to crun for all node roles. |
| kubelet config and container mount hiding | Reduces the frequency of kubelet housekeeping and eviction monitoring to reduce CPU usage. Creates a container mount namespace, visible to kubelet and CRI-O, to reduce system mount scanning resource usage. |
| SCTP | Optional configuration (enabled by default). Enables SCTP. SCTP is required by RAN applications but disabled by default in RHCOS. |
| kdump | Optional configuration (enabled by default). Enables kdump to capture debug information when a kernel panic occurs. |
| CRI-O wipe disable | Disables automatic wiping of the CRI-O image cache after unclean shutdown. |
| SR-IOV-related kernel arguments | Includes additional SR-IOV related arguments in the kernel command line. |
| RCU Normal systemd service | Sets rcu_normal after the system is fully started. |
| One-shot time sync | Runs a one-time system time synchronization job for control plane or worker nodes. |
3.2.3.12. Reference design deployment components
The following sections describe the various OpenShift Container Platform components and configurations that you use to configure the hub cluster with Red Hat Advanced Cluster Management (RHACM).
3.2.3.12.1. Red Hat Advanced Cluster Management (RHACM)
- New in this release
- Additional node labels can be configured during installation.
- Description
RHACM provides Multi Cluster Engine (MCE) installation and ongoing lifecycle management functionality for deployed clusters. You declaratively specify configurations and upgrades with Policy CRs and apply the policies to clusters with the RHACM policy controller as managed by Topology Aware Lifecycle Manager.
- GitOps Zero Touch Provisioning (ZTP) uses the MCE feature of RHACM
- Configuration, upgrades, and cluster status are managed with the RHACM policy controller
- Limits and requirements
- A single hub cluster supports up to 3500 deployed single-node OpenShift clusters with 5 Policy CRs bound to each cluster.
- Engineering considerations
- Cluster-specific configuration: managed clusters typically have some number of configuration values that are specific to the individual cluster. These configurations should be managed using RHACM policy hub-side templating with values pulled from ConfigMap CRs based on the cluster name.
- To save CPU resources on managed clusters, policies that apply static configurations should be unbound from managed clusters after GitOps ZTP installation of the cluster. For more information, see Release a persistent volume.
3.2.3.12.2. Topology Aware Lifecycle Manager (TALM)
- New in this release
- Added support for pre-caching additional user-specified images
- Description
- Managed updates
TALM is an Operator that runs only on the hub cluster for managing how changes (including cluster and Operator upgrades, configuration, and so on) are rolled out to the network. TALM does the following:
- Progressively applies updates to fleets of clusters in user-configurable batches by using Policy CRs.
- Adds ztp-done labels or other user-configurable labels on a per-cluster basis
- Precaching for single-node OpenShift clusters
TALM supports optional precaching of OpenShift Container Platform, OLM Operator, and additional user images to single-node OpenShift clusters before initiating an upgrade.
A new PreCachingConfig custom resource is available for specifying optional pre-caching configurations. For example:
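The reference example is not reproduced here. The following is a minimal sketch of a PreCachingConfig CR, written under the assumption of the ran.openshift.io/v1alpha1 API; the image references, namespace, and space value are placeholders, not reference values.

```yaml
apiVersion: ran.openshift.io/v1alpha1
kind: PreCachingConfig
metadata:
  name: example-precachingconfig  # hypothetical name
  namespace: default              # assumed namespace
spec:
  additionalImages:
    - quay.io/example/application-image:v1   # assumed user-specified image to precache
  spaceRequired: "45 GiB"                    # assumed free-space requirement on the node
  overrides:
    platformImage: quay.io/openshift-release-dev/ocp-release:4.14.0-x86_64   # assumed platform image
```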
- Backup and restore for single-node OpenShift
- TALM supports taking a snapshot of the cluster operating system and configuration to a dedicated partition on a local disk. A restore script is provided that returns the cluster to the backed up state.
- Limits and requirements
- TALM supports concurrent cluster deployment in batches of 400
- Precaching and backup features are for single-node OpenShift clusters only.
- Engineering considerations
- The PreCachingConfig CR is optional and does not need to be created if you only want to precache platform-related (OpenShift and OLM Operator) images. The PreCachingConfig CR must be applied before referencing it in the ClusterGroupUpgrade CR.
- Create a recovery partition during installation if you opt to use the TALM backup and restore feature.
3.2.3.12.3. GitOps and GitOps ZTP plugins
- New in this release
- GA support for inclusion of user-provided CRs in Git for GitOps ZTP deployments
- GitOps ZTP independence from the deployed cluster version
- Description
GitOps and GitOps ZTP plugins provide a GitOps-based infrastructure for managing cluster deployment and configuration. Cluster definitions and configurations are maintained as a declarative state in Git. ZTP plugins provide support for generating installation CRs from the SiteConfig CR and automatic wrapping of configuration CRs in policies based on PolicyGenTemplate CRs.
You can deploy and manage multiple versions of OpenShift Container Platform on managed clusters with the baseline reference configuration CRs in a /source-crs subdirectory, provided that subdirectory also contains the kustomization.yaml file. You add user-provided CRs to this subdirectory that you use with the predefined CRs that are specified in the PolicyGenTemplate CRs. This allows you to tailor your configurations to suit your specific requirements and provides GitOps ZTP version independence between managed clusters and the hub cluster.
For more information, see the following:
- Limits
- 300 SiteConfig CRs per ArgoCD application. You can use multiple applications to achieve the maximum number of clusters supported by a single hub cluster.
- Content in the /source-crs folder in Git overrides content provided in the GitOps ZTP plugin container. Git takes precedence in the search path.
- Add the /source-crs folder in the same directory as the kustomization.yaml file, which includes the PolicyGenTemplate as a generator.
  Note: Alternative locations for the /source-crs directory are not supported in this context.
- Engineering considerations
- To avoid confusion or unintentional overwriting of files when updating content, use unique and distinguishable names for user-provided CRs in the /source-crs folder and extra manifests in Git.
- The SiteConfig CR allows multiple extra-manifest paths. When files with the same name are found in multiple directory paths, the last file found takes precedence. This allows the full set of version-specific Day 0 manifests (extra-manifests) to be placed in Git and referenced from the SiteConfig. With this feature, you can deploy multiple OpenShift Container Platform versions to managed clusters simultaneously.
- The extraManifestPath field of the SiteConfig CR is deprecated in OpenShift Container Platform 4.15 and later. Use the new extraManifests.searchPaths field instead.
3.2.3.12.4. Agent-based installer
- New in this release
- No reference design updates in this release
- Description
Agent-based installer (ABI) provides installation capabilities without centralized infrastructure. The installation program creates an ISO image that you mount to the server. When the server boots, it installs OpenShift Container Platform and the supplied extra manifests.
Note: You can also use ABI to install OpenShift Container Platform clusters without a hub cluster. An image registry is still required when you use ABI in this manner.
Agent-based installer (ABI) is an optional component.
- Limits and requirements
- You can supply a limited set of additional manifests at installation time.
- You must include MachineConfiguration CRs that are required by the RAN DU use case.
- Engineering considerations
- ABI provides a baseline OpenShift Container Platform installation.
- You install Day 2 Operators and the remainder of the RAN DU use case configurations after installation.
3.2.3.13. Additional components
3.2.3.13.1. Bare Metal Event Relay
The Bare Metal Event Relay is an optional Operator that runs exclusively on the managed spoke cluster. It relays Redfish hardware events to cluster applications.
The Bare Metal Event Relay is not included in the RAN DU use model reference configuration and is an optional feature. If you want to use the Bare Metal Event Relay, assign additional CPU resources from the application CPU budget.
3.2.4. Telco RAN distributed unit (DU) reference configuration CRs
Use the following custom resources (CRs) to configure and deploy OpenShift Container Platform clusters with the telco RAN DU profile. Some of the CRs are optional depending on your requirements. CR fields you can change are annotated in the CR with YAML comments.
You can extract the complete set of RAN DU CRs from the ztp-site-generate container image. See Preparing the GitOps ZTP site configuration repository for more information.
3.2.4.1. Day 2 Operators reference CRs
| Component | Reference CR | Optional | New in this release |
|---|---|---|---|
| Cluster logging | | No | No |
| Cluster logging | | No | No |
| Cluster logging | | No | No |
| Cluster logging | | No | No |
| Cluster logging | | No | No |
| Local Storage Operator | | Yes | No |
| Local Storage Operator | | Yes | No |
| Local Storage Operator | | Yes | No |
| Local Storage Operator | | Yes | No |
| Local Storage Operator | | Yes | No |
| Node Tuning Operator | | No | No |
| Node Tuning Operator | | No | No |
| PTP fast event notifications | | Yes | No |
| PTP Operator | | No | No |
| PTP Operator | | No | Yes |
| PTP Operator | | No | No |
| PTP Operator | | No | No |
| PTP Operator | | No | No |
| PTP Operator | | No | No |
| PTP Operator | | No | No |
| SR-IOV FEC Operator | | Yes | No |
| SR-IOV FEC Operator | | Yes | No |
| SR-IOV FEC Operator | | Yes | No |
| SR-IOV FEC Operator | | Yes | No |
| SR-IOV Operator | | No | No |
| SR-IOV Operator | | No | No |
| SR-IOV Operator | | No | No |
| SR-IOV Operator | | No | No |
| SR-IOV Operator | | No | No |
| SR-IOV Operator | | No | No |
3.2.4.2. Cluster tuning reference CRs
| Component | Reference CR | Optional | New in this release |
|---|---|---|---|
| Cluster capabilities | | No | No |
| Disabling network diagnostics | | No | No |
| Monitoring configuration | | No | No |
| OperatorHub | | No | No |
| OperatorHub | | No | No |
| OperatorHub | | No | No |
| OperatorHub | | No | No |
3.2.4.3. Machine configuration reference CRs
| Component | Reference CR | Optional | New in this release |
|---|---|---|---|
| Container runtime (crun) | | No | No |
| Container runtime (crun) | | No | No |
| Disabling CRI-O wipe | | No | No |
| Disabling CRI-O wipe | | No | No |
| Enabling kdump | | No | Yes |
| Enabling kdump | | No | Yes |
| Enabling kdump | | No | No |
| Enabling kdump | | No | No |
| Kubelet configuration and container mount hiding | | No | No |
| Kubelet configuration and container mount hiding | | No | No |
| One-shot time sync | | No | Yes |
| One-shot time sync | | No | Yes |
| SCTP | | No | No |
| SCTP | | No | No |
| Set RCU Normal | | No | No |
| Set RCU Normal | | No | No |
| SR-IOV related kernel arguments | | No | Yes |
| SR-IOV related kernel arguments | | No | No |
3.2.4.4. YAML reference
The following is a complete reference for all the custom resources (CRs) that make up the telco RAN DU 4.14 reference configuration.
3.2.4.4.1. Day 2 Operators reference YAML
ClusterLogForwarder.yaml
ClusterLogging.yaml
ClusterLogNS.yaml
ClusterLogOperGroup.yaml
ClusterLogSubscription.yaml
StorageClass.yaml
StorageLV.yaml
StorageNS.yaml
StorageOperGroup.yaml
StorageSubscription.yaml
PerformanceProfile.yaml
TunedPerformancePatch.yaml
PtpOperatorConfigForEvent.yaml
PtpConfigBoundary.yaml
PtpConfigGmWpc.yaml
PtpConfigSlave.yaml
PtpOperatorConfig.yaml
PtpSubscription.yaml
PtpSubscriptionNS.yaml
PtpSubscriptionOperGroup.yaml
AcceleratorsNS.yaml
apiVersion: v1
kind: Namespace
metadata:
name: vran-acceleration-operators
annotations: {}
AcceleratorsOperGroup.yaml
AcceleratorsSubscription.yaml
SriovFecClusterConfig.yaml
SriovNetwork.yaml
SriovNetworkNodePolicy.yaml
SriovOperatorConfig.yaml
SriovSubscription.yaml
SriovSubscriptionNS.yaml
SriovSubscriptionOperGroup.yaml
3.2.4.4.2. Cluster tuning reference YAML
example-sno.yaml
DisableSnoNetworkDiag.yaml
ReduceMonitoringFootprint.yaml
DisableOLMPprof.yaml
DefaultCatsrc.yaml
DisconnectedICSP.yaml
OperatorHub.yaml
3.2.4.4.3. Machine configuration reference YAML
enable-crun-master.yaml
enable-crun-worker.yaml
99-crio-disable-wipe-master.yaml
99-crio-disable-wipe-worker.yaml
05-kdump-config-master.yaml
05-kdump-config-worker.yaml
06-kdump-master.yaml
06-kdump-worker.yaml
01-container-mount-ns-and-kubelet-conf-master.yaml
01-container-mount-ns-and-kubelet-conf-worker.yaml
99-sync-time-once-master.yaml
99-sync-time-once-worker.yaml
03-sctp-machine-config-master.yaml
03-sctp-machine-config-worker.yaml
08-set-rcu-normal-master.yaml
08-set-rcu-normal-worker.yaml
3.2.5. Telco RAN DU reference configuration software specifications
The following information describes the telco RAN DU reference design specification (RDS) validated software versions.
3.2.5.1. Telco RAN DU 4.14 validated software components
The Red Hat telco RAN DU 4.14 solution has been validated using the following Red Hat software products for OpenShift Container Platform managed clusters and hub clusters.
| Component | Software version |
|---|---|
| Managed cluster version | 4.14 |
| Cluster Logging Operator | 5.8 |
| Local Storage Operator | 4.14 |
| PTP Operator | 4.14 |
| SRIOV Operator | 4.14 |
| Node Tuning Operator | 4.14 |
| Logging Operator | 4.14 |
| SRIOV-FEC Operator | 2.7 |
| Component | Software version |
|---|---|
| Hub cluster version | 4.14 |
| GitOps ZTP plugin | 4.14 |
| Red Hat Advanced Cluster Management (RHACM) | 2.9, 2.10 |
| Red Hat OpenShift GitOps | 1.16 |
| Topology Aware Lifecycle Manager (TALM) | 4.14 |
3.3. Telco core reference design specification
3.3.1. Telco core 4.14 reference design overview
The telco core reference design specification (RDS) configures an OpenShift Container Platform cluster running on commodity hardware to host telco core workloads.
3.3.1.1. OpenShift Container Platform 4.14 features for telco core
The following features, which are included in OpenShift Container Platform 4.14 and are leveraged by the telco core reference design specification (RDS), have been added or updated.
| Feature | Description |
|---|---|
| Support for running rootless Data Plane Development Kit (DPDK) workloads with kernel access by using the TAP CNI plugin | DPDK applications that inject traffic into the kernel can run in non-privileged pods with the help of the TAP CNI plugin. |
| Dynamic use of non-reserved CPUs for OVS |
With this release, the Open vSwitch (OVS) networking stack can dynamically use non-reserved CPUs. The dynamic use of non-reserved CPUs occurs by default in performance-tuned clusters with a CPU manager policy set to static. |
| Enabling more control over the C-states for each pod |
The |
| Exclude SR-IOV network topology for NUMA-aware scheduling | You can exclude advertising Non-Uniform Memory Access (NUMA) nodes for the SR-IOV network to the Topology Manager. By not advertising NUMA nodes for the SR-IOV network, you can permit more flexible SR-IOV network deployments during NUMA-aware pod scheduling. For example, in some scenarios, you want flexibility for how a pod is deployed. By not providing a NUMA node hint to the Topology Manager for the pod’s SR-IOV network resource, the Topology Manager can deploy the SR-IOV network resource and the pod CPU and memory resources to different NUMA nodes. In previous OpenShift Container Platform releases, the Topology Manager attempted to place all resources on the same NUMA node. |
| Egress service resource to manage egress traffic for pods behind a load balancer (Technology Preview) |
With this update, you can use an
You can use the
|
3.3.2. Telco core 4.14 use model overview
The telco core reference design specification (RDS) describes a platform that supports large-scale telco applications, including control plane functions such as signaling and aggregation. It also includes some centralized data plane functions, for example, user plane functions (UPF). These functions generally require scalability, complex networking support, resilient software-defined storage, and performance requirements that are less stringent and constrained than those of far-edge deployments like RAN.
Telco core use model architecture
The networking prerequisites for telco core functions are diverse and encompass an array of networking attributes and performance benchmarks. IPv6 is mandatory, with dual-stack configurations being prevalent. Certain functions demand maximum throughput and transaction rates, necessitating user plane networking support such as DPDK. Other functions adhere to conventional cloud-native patterns and can use solutions such as OVN-K, kernel networking, and load balancing.
Telco core clusters are configured as standard three control plane clusters with worker nodes configured with the stock non real-time (RT) kernel. To support workloads with varying networking and performance requirements, worker nodes are segmented using MachineConfigPool CRs. For example, this is done to separate non-user data plane nodes from high-throughput nodes. To support the required telco operational features, the clusters have a standard set of Operator Lifecycle Manager (OLM) Day 2 Operators installed.
3.3.2.1. Common baseline model
The following configurations and use model description are applicable to all telco core use cases.
- Cluster
The cluster conforms to these requirements:
- High-availability (3+ supervisor nodes) control plane
- Non-schedulable supervisor nodes
- Storage
- Core use cases require persistent storage as provided by external OpenShift Data Foundation. For more information, see the "Storage" subsection in "Reference core design components".
- Networking
Telco core clusters networking conforms to these requirements:
- Dual stack IPv4/IPv6
- Fully disconnected: Clusters do not have access to public networking at any point in their lifecycle.
- Multiple networks: Segmented networking provides isolation between OAM, signaling, and storage traffic.
- Cluster network type: OVN-Kubernetes is required for IPv6 support.
Core clusters have multiple layers of networking supported by underlying RHCOS, SR-IOV Operator, Load Balancer, and other components detailed in the following "Networking" section. At a high level these layers include:
Cluster networking: The cluster network configuration is defined and applied through the installation configuration. Updates to the configuration can be done at day-2 through the NMState Operator. Initial configuration can be used to establish:
- Host interface configuration
- A/A Bonding (Link Aggregation Control Protocol (LACP))
Secondary or additional networks: OpenShift CNI is configured through the Network additionalNetworks or NetworkAttachmentDefinition CRs.
- MACVLAN
- Application Workload: User plane networking is running in cloud-native network functions (CNFs).
- Service Mesh
- Use of Service Mesh by telco CNFs is very common. It is expected that all core clusters will include a Service Mesh implementation. Service Mesh implementation and configuration is outside the scope of this specification.
3.3.2.1.1. Engineering Considerations common use model
The following engineering considerations are relevant for the common use model.
- Worker nodes
- Worker nodes run on Intel 3rd Generation Xeon (IceLake) processors or newer. Alternatively, if using Skylake or earlier processors, the mitigations for silicon security vulnerabilities such as Spectre must be disabled; failure to do so may result in a significant 40 percent decrease in transaction performance.
- IRQ Balancing is enabled on worker nodes. The PerformanceProfile sets globallyDisableIrqLoadBalancing: false. Guaranteed QoS Pods are annotated to ensure isolation as described in the "CPU partitioning and performance tuning" subsection in the "Reference core design components" section.
- All nodes
- Hyper-Threading is enabled on all nodes
- CPU architecture is x86_64 only
- Nodes are running the stock (non-RT) kernel
- Nodes are not configured for workload partitioning
The balance of node configuration between power management and maximum performance varies between MachineConfigPools in the cluster. This configuration is consistent for all nodes within a MachineConfigPool.
- CPU partitioning
- CPU partitioning is configured using the PerformanceProfile and applied on a per-MachineConfigPool basis. See the "CPU partitioning and performance tuning" subsection in "Reference core design components".
3.3.2.1.2. Application workloads
Application workloads running on core clusters might include a mix of high-performance networking CNFs and traditional best-effort or burstable pod workloads.
Guaranteed QoS scheduling is available to pods that require exclusive or dedicated use of CPUs due to performance or security requirements. Typically pods hosting high-performance and low-latency-sensitive Cloud Native Functions (CNFs) utilizing user plane networking with DPDK necessitate the exclusive utilization of entire CPUs. This is accomplished through node tuning and guaranteed Quality of Service (QoS) scheduling. For pods that require exclusive use of CPUs, be aware of the potential implications of hyperthreaded systems and configure them to request multiples of 2 CPUs when the entire core (2 hyperthreads) must be allocated to the pod.
Pods running network functions that do not require the high throughput and low latency networking are typically scheduled with best-effort or burstable QoS and do not require dedicated or isolated CPU cores.
- Description of limits
- CNF applications should conform to the latest version of the Red Hat Best Practices for Kubernetes guide.
For a mix of best-effort and burstable QoS pods.
- Guaranteed QoS pods might be used but require correct configuration of reserved and isolated CPUs in the PerformanceProfile.
- Guaranteed QoS Pods must include annotations for fully isolating CPUs.
- Best effort and burstable pods are not guaranteed exclusive use of a CPU. Workloads might be preempted by other workloads, operating system daemons, or kernel tasks.
Exec probes should be avoided unless there is no viable alternative.
- Do not use exec probes if a CNF is using CPU pinning.
- Other probe implementations, for example httpGet/tcpSocket, should be used.
Note: Startup probes require minimal resources during steady-state operation. The limitation on exec probes applies primarily to liveness and readiness probes.
- Signaling workload
- Signaling workloads typically use SCTP, REST, gRPC, or similar TCP or UDP protocols.
- The transactions per second (TPS) rate is on the order of hundreds of thousands, using a secondary CNI (multus) configured as MACVLAN or SR-IOV.
- Signaling workloads run in pods with either guaranteed or burstable QoS.
3.3.3. Telco core reference design components
The following sections describe the various OpenShift Container Platform components and configurations that you use to configure and deploy clusters to run telco core workloads.
3.3.3.1. CPU partitioning and performance tuning
- New in this release
- Open vSwitch (OVS) is removed from CPU partitioning. OVS manages its cpuset dynamically to automatically adapt to network traffic needs. Users no longer need to reserve additional CPUs for handling high network throughput on the primary container network interface (CNI). There is no impact on the configuration needed to benefit from this change.
- Description
CPU partitioning allows for the separation of sensitive workloads from general purpose workloads, auxiliary processes, interrupts, and driver work queues to achieve improved performance and latency. The CPUs allocated to those auxiliary processes are referred to as reserved in the following sections. In hyperthreaded systems, a CPU is one hyperthread.
For more information, see Restricting CPUs for infra and application containers.
Configure system level performance. For recommended settings, see Configuring host firmware for low latency and high performance.
- Limits and requirements
The operating system needs a certain amount of CPU to perform all the support tasks including kernel networking.
- A system with just user plane networking applications (DPDK) needs at least one Core (2 hyperthreads when enabled) reserved for the operating system and the infrastructure components.
- A system with Hyper-Threading enabled must always put all core sibling threads to the same pool of CPUs.
- The set of reserved and isolated cores must include all CPU cores.
- Core 0 of each NUMA node must be included in the reserved CPU set.
Isolated cores might be impacted by interrupts. The following annotations must be attached to the pod if guaranteed QoS pods require full use of the CPU:
cpu-load-balancing.crio.io: "disable"
cpu-quota.crio.io: "disable"
irq-load-balancing.crio.io: "disable"
When per-pod power management is enabled with PerformanceProfile.workloadHints.perPodPowerManagement, the following annotations must also be attached to the pod if guaranteed QoS pods require full use of the CPU:
cpu-c-states.crio.io: "disable"
cpu-freq-governor.crio.io: "performance"
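For illustration only, the following minimal sketch shows a guaranteed QoS pod that carries the annotations listed above; the pod name, image, CPU count, and runtime class name are assumptions.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dpdk-workload             # hypothetical name
  annotations:
    cpu-load-balancing.crio.io: "disable"
    cpu-quota.crio.io: "disable"
    irq-load-balancing.crio.io: "disable"
spec:
  runtimeClassName: performance-openshift-node-performance-profile   # assumed, derived from the PerformanceProfile name
  containers:
    - name: app
      image: registry.example.com/app:latest   # assumed image
      resources:
        # Requests equal limits with whole CPUs so the pod gets guaranteed QoS
        requests:
          cpu: "4"
          memory: 4Gi
        limits:
          cpu: "4"
          memory: 4Gi
```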
- Engineering considerations
- The minimum reserved capacity (systemReserved) required can be found by following the guidance in "Which amount of CPU and memory are recommended to reserve for the system in OCP 4 nodes?"
- This reserved CPU value must be rounded up to a full core (2 hyper-thread) alignment.
- Changes to the CPU partitioning will drain and reboot the nodes in the MCP.
- The reserved CPUs reduce the pod density, as the reserved CPUs are removed from the allocatable capacity of the OpenShift node.
- The real-time workload hint should be enabled if the workload is real-time capable.
- Hardware without Interrupt Request (IRQ) affinity support will impact isolated CPUs. To ensure that pods with guaranteed CPU QoS have full use of allocated CPU, all hardware in the server must support IRQ affinity.
3.3.3.2. Service Mesh
- Description
- Telco core CNFs typically require a service mesh implementation. The specific features and performance required are dependent on the application. The selection of service mesh implementation and configuration is outside the scope of this documentation. The impact of service mesh on cluster resource utilization and performance, including additional latency introduced into pod networking, must be accounted for in the overall solution engineering.
3.3.3.3. Networking
OpenShift Container Platform networking is an ecosystem of features, plugins, and advanced networking capabilities that extend Kubernetes networking with the advanced networking-related features that your cluster needs to manage its network traffic for one or multiple hybrid clusters.
3.3.3.3.1. Cluster Network Operator (CNO)
- New in this release
- Not applicable.
- Description
The CNO deploys and manages the cluster network components including the default OVN-Kubernetes network plugin during OpenShift Container Platform cluster installation. It allows configuring primary interface MTU settings, OVN gateway modes to use node routing tables for pod egress, and additional secondary networks such as MACVLAN.
In support of network traffic segregation, multiple network interfaces are configured through the CNO. Traffic steering to these interfaces is configured through static routes applied by using the NMState Operator. To ensure that pod traffic is properly routed, OVN-K is configured with the routingViaHost option enabled. This setting uses the kernel routing table and the applied static routes rather than OVN for pod egress traffic.
The Whereabouts CNI plugin is used to provide dynamic IPv4 and IPv6 addressing for additional pod network interfaces without the use of a DHCP server.
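For illustration, enabling routingViaHost is a small change to the cluster Network operator CR, as in the following minimal sketch (other fields of the reference Network.yaml are omitted here):

```yaml
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  defaultNetwork:
    type: OVNKubernetes
    ovnKubernetesConfig:
      gatewayConfig:
        routingViaHost: true      # use the kernel routing table for pod egress traffic
```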
- Limits and requirements
- OVN-Kubernetes is required for IPv6 support.
- Large MTU cluster support requires connected network equipment to be set to the same or larger value.
- Engineering considerations
- Pod egress traffic is handled by the kernel routing table with the routingViaHost option. Appropriate static routes must be configured in the host.
3.3.3.3.2. Load Balancer
- New in this release
- Not applicable.
- Description
MetalLB is a load-balancer implementation for bare metal Kubernetes clusters using standard routing protocols. It enables a Kubernetes service to get an external IP address which is also added to the host network for the cluster.
Some use cases might require features not available in MetalLB, for example stateful load balancing. Where necessary, you can use an external third party load balancer. Selection and configuration of an external load balancer is outside the scope of this specification. When an external third party load balancer is used, the integration effort must include enough analysis to ensure all performance and resource utilization requirements are met.
- Limits and requirements
- Stateful load balancing is not supported by MetalLB. An alternate load balancer implementation must be used if this is a requirement for workload CNFs.
- The networking infrastructure must ensure that the external IP address is routable from clients to the host network for the cluster.
- Engineering considerations
- MetalLB is used in BGP mode only for core use case models.
- For core use models, MetalLB is supported with only the OVN-Kubernetes network provider used in local gateway mode. See routingViaHost in the "Cluster Network Operator" section.
- Address pools can be configured as needed, allowing variation in addresses, aggregation length, auto assignment, and other relevant parameters.
- The values of parameters in the Bi-Directional Forwarding Detection (BFD) profile should remain close to the defaults. Shorter values might lead to false negatives and impact performance.
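The reference MetalLB CRs (addr-pool.yaml, bgp-peer.yaml, bgp-advr.yaml, and so on) are listed in the YAML reference section. As an illustration only, a minimal BGP-mode sketch follows; the address range, ASNs, and peer address are assumptions that vary per network.

```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: example-pool              # hypothetical name
  namespace: metallb-system
spec:
  addresses:
    - 192.0.2.0/24                # assumed service address range
---
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: example-peer
  namespace: metallb-system
spec:
  peerAddress: 192.0.2.1          # assumed peer router address
  peerASN: 64501                  # assumed peer ASN
  myASN: 64500                    # assumed local ASN
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: example-bgp-adv
  namespace: metallb-system
spec:
  ipAddressPools:
    - example-pool
```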
3.3.3.3.3. SR-IOV
- New in this release
- Not applicable
- Description
- SR-IOV enables physical network interfaces (PFs) to be divided into multiple virtual functions (VFs). VFs can then be assigned to multiple pods to achieve higher throughput performance while keeping the pods isolated. The SR-IOV Network Operator provisions and manages SR-IOV CNI, network device plugin, and other components of the SR-IOV stack.
- Limits and requirements
- The network interface controllers supported are listed in OCP supported SR-IOV devices
- SR-IOV and IOMMU enablement in BIOS: The SR-IOV Network Operator automatically enables IOMMU on the kernel command line.
- SR-IOV VFs do not receive link state updates from PF. If link down detection is needed, it must be done at the protocol level.
- Engineering considerations
- SR-IOV interfaces in vfio mode are typically used to enable additional secondary networks for applications that require high throughput or low latency.
3.3.3.3.4. NMState Operator
- New in this release
- Not applicable
- Description
- The NMState Operator provides a Kubernetes API for performing network configurations across the cluster’s nodes. It enables network interface configurations, static IPs and DNS, VLANs, trunks, bonding, static routes, MTU, and enabling promiscuous mode on the secondary interfaces. The cluster nodes periodically report on the state of each node’s network interfaces to the API server.
- Limits and requirements
- Not applicable
- Engineering considerations
- The initial networking configuration is applied using NMStateConfig content in the installation CRs. The NMState Operator is used only when needed for network updates.
- When SR-IOV virtual functions are used for host networking, the NMState Operator, using NodeNetworkConfigurationPolicy, is used to configure those VF interfaces, for example, VLANs and the MTU.
3.3.3.4. Logging
- New in this release
- Not applicable
- Description
- The ClusterLogging Operator enables collection and shipping of logs off the node for remote archival and analysis. The reference configuration ships audit and infrastructure logs to a remote archive by using Kafka.
- Limits and requirements
- Not applicable
- Engineering considerations
- The impact on cluster CPU use is based on the number or size of logs generated and the amount of log filtering configured.
- The reference configuration does not include shipping of application logs. Inclusion of application logs in the configuration requires evaluation of the application logging rate and sufficient additional CPU resources allocated to the reserved set.
3.3.3.5. Power Management
- New in this release
- You can specify a maximum latency, that is, a C-state limit, for a low latency pod when using per-pod power management. Previously, C-states could only be disabled completely on a per-pod basis.
- Description
- The Performance Profile can be used to configure a cluster in a high power, low power or mixed (per-pod power management) mode. The choice of power mode depends on the characteristics of the workloads running on the cluster particularly how sensitive they are to latency.
- Limits and requirements
- Power configuration relies on appropriate BIOS configuration, for example, enabling C-states and P-states. Configuration varies between hardware vendors.
- Engineering considerations
- Latency: To ensure that latency-sensitive workloads meet their requirements, you need either a high-power configuration or a per-pod power management configuration. Per-pod power management is only available for Guaranteed QoS Pods with dedicated pinned CPUs.
3.3.3.6. Storage
- Overview
Cloud native storage services can be provided by multiple solutions including OpenShift Data Foundation from Red Hat or third parties.
OpenShift Data Foundation is a Ceph based software-defined storage solution for containers. It provides block storage, file system storage, and on-premises object storage, which can be dynamically provisioned for both persistent and non-persistent data requirements. Telco core applications require persistent storage.
Note: All storage data might not be encrypted in flight. To reduce risk, isolate the storage network from other cluster networks. The storage network must not be reachable, or routable, from other cluster networks. Only nodes directly attached to the storage network should be allowed to gain access to it.
3.3.3.6.1. OpenShift Data Foundation
- New in this release
- Not applicable
- Description
- Red Hat OpenShift Data Foundation is a software-defined storage service for containers. For Telco core clusters, storage support is provided by OpenShift Data Foundation storage services running externally to the application workload cluster. OpenShift Data Foundation supports separation of storage traffic using secondary CNI networks.
- Limits and requirements
- In an IPv4/IPv6 dual-stack networking environment, OpenShift Data Foundation uses IPv4 addressing. For more information, see Support OpenShift dual stack with ODF using IPv4.
- Engineering considerations
- OpenShift Data Foundation network traffic should be isolated from other traffic on a dedicated network, for example, by using VLAN isolation.
3.3.3.6.2. Other Storage
Other storage solutions can be used to provide persistent storage for core clusters. The configuration and integration of these solutions is outside the scope of the telco core RDS. Integration of the storage solution into the core cluster must include correct sizing and performance analysis to ensure the storage meets overall performance and resource utilization requirements.
3.3.3.7. Monitoring
- New in this release
- Not applicable
- Description
The Cluster Monitoring Operator (CMO) is included by default on all OpenShift clusters and provides monitoring (metrics, dashboards, and alerting) for the platform components and optionally user projects as well.
Configuration of the monitoring operator allows for customization, including:
- Default retention period
- Custom alert rules
The default handling of pod CPU and memory metrics is based on upstream Kubernetes cAdvisor and makes a tradeoff that prefers handling of stale data over metric accuracy. This leads to spiky data that can create false triggers of alerts over user-specified thresholds. OpenShift supports an opt-in dedicated service monitor feature that creates an additional set of pod CPU and memory metrics that do not suffer from the spiky behavior. For additional information, see this solution guide.
In addition to the default configuration, the following metrics are expected to be configured for telco core clusters:
- Pod CPU and memory metrics and alerts for user workloads
- Limits and requirements
- Monitoring configuration must enable the dedicated service monitor feature for accurate representation of pod metrics
- Engineering considerations
- The Prometheus retention period is specified by the user. The value used is a tradeoff between operational requirements for maintaining historical data on the cluster against CPU and storage resources. Longer retention periods increase the need for storage and require additional CPU to manage the indexing of data.
3.3.3.8. Scheduling
- New in this release
- NUMA-aware scheduling with the NUMA Resources Operator is now generally available in OpenShift Container Platform 4.14.
- With this release, you can exclude advertising the Non-Uniform Memory Access (NUMA) node for the SR-IOV network to the Topology Manager. By not advertising the NUMA node for the SR-IOV network, you can permit more flexible SR-IOV network deployments during NUMA-aware pod scheduling. To exclude advertising the NUMA node for the SR-IOV network resource to the Topology Manager, set the value excludeTopology to true in the SriovNetworkNodePolicy CR. For more information, see Exclude the SR-IOV network topology for NUMA-aware scheduling.
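For illustration only, the following minimal sketch shows where excludeTopology sits in a SriovNetworkNodePolicy CR; the other field values are assumptions, not reference values.

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: sriov-numa-excluded       # hypothetical name
  namespace: openshift-sriov-network-operator
spec:
  resourceName: excludedNic
  deviceType: netdevice
  numVfs: 8                       # assumed VF count
  nicSelector:
    pfNames: ["ens2f0"]           # assumed PF name
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  excludeTopology: true           # do not advertise this NUMA node to the Topology Manager
```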
- Description
- The scheduler is a cluster-wide component responsible for selecting the right node for a given workload. It is a core part of the platform and does not require any specific configuration in the common deployment scenarios. However, there are few specific use cases described in the following section.
- Limits and requirements
The default scheduler does not understand the NUMA locality of workloads. It only knows about the sum of all free resources on a worker node. This might cause workloads to be rejected when scheduled to a node with the Topology Manager policy set to single-numa-node or restricted.
single-numa-nodeorrestricted.- For example, consider a pod requesting 6 CPUs and being scheduled to an empty node that has 4 CPUs per NUMA node. The total allocatable capacity of the node is 8 CPUs and the scheduler will place the pod there. The node local admission will fail, however, as there are only 4 CPUs available in each of the NUMA nodes.
- All clusters with multi-NUMA nodes are required to use the NUMA Resources Operator. The machineConfigPoolSelector of the NUMA Resources Operator must select all nodes where NUMA-aligned scheduling is needed.
- All machine config pools must have consistent hardware configuration; for example, all nodes are expected to have the same NUMA zone count.
- Engineering considerations
- Pods might require annotations for correct scheduling and isolation. For more information on annotations, see the "CPU Partitioning and performance tuning" section.
3.3.3.9. Installation
- New in this release
- Description
Telco core clusters can be installed using the Agent Based Installer (ABI). This method allows users to install OpenShift Container Platform on bare metal servers without requiring additional servers or VMs for managing the installation. The ABI installer can be run on any system, for example a laptop, to generate an ISO installation image. This ISO is used as the installation media for the cluster supervisor nodes. Progress can be monitored using the ABI tool from any system with network connectivity to the supervisor node’s API interfaces.
- Installation from declarative CRs
- Does not require additional servers to support installation
- Supports install in disconnected environment
- Limits and requirements
- Disconnected installation requires a reachable registry with all required content mirrored.
- Engineering considerations
- Networking configuration should be applied as NMState configuration during installation in preference to day-2 configuration by using the NMState Operator.
3.3.3.10. Security
- New in this release
- DPDK applications that need to inject traffic to the kernel can run in non-privileged pods with the help of the TAP CNI plugin. Furthermore, in this 4.14 release, the ability to create MAC-VLAN, IP-VLAN, and VLAN subinterfaces based on a master interface in a container namespace is generally available.
- Description
Telco operators are security conscious and require clusters to be hardened against multiple attack vectors. Within OpenShift Container Platform, there is no single component or feature responsible for securing a cluster. This section provides details of security-oriented features and configuration for the use models covered in this specification.
- SecurityContextConstraints: All workload pods should be run with restricted-v2 or restricted SCC.
- Seccomp: All pods should be run with the RuntimeDefault (or stronger) seccomp profile.
- Rootless DPDK pods: Many user-plane networking (DPDK) CNFs require pods to run with root privileges. With this feature, a conformant DPDK pod can be run without requiring root privileges.
- Storage: The storage network should be isolated and non-routable to other cluster networks. See the "Storage" section for additional details.
- Limits and requirements
Rootless DPDK pods requires the following additional configuration steps:
- Configure the TAP plugin with the container_t SELinux context.
- Enable the container_use_devices SELinux boolean on the hosts.
- Engineering considerations
- For rootless DPDK pod support, the SELinux boolean container_use_devices must be enabled on the host for the TAP device to be created. This introduces a security risk that is acceptable for short to mid-term use. Other solutions will be explored.
3.3.3.11. Scalability
- New in this release
- Not applicable
- Description
Clusters will scale to the sizing listed in the limits and requirements section.
Scaling of workloads is described in the use model section.
- Limits and requirements
- Cluster scales to at least 120 nodes
- Engineering considerations
- Not applicable
3.3.3.12. Additional configuration
3.3.3.12.1. Disconnected environment
- Description
Telco core clusters are expected to be installed in networks without direct access to the internet. All container images needed to install, configure, and operate the cluster must be available in a disconnected registry. This includes OpenShift Container Platform images, day-2 Operator Lifecycle Manager (OLM) Operator images, and application workload images. The use of a disconnected environment provides multiple benefits, for example:
- Limiting access to the cluster for security
- Curated content: The registry is populated based on curated and approved updates for the clusters
- Limits and requirements
- A unique name is required for all custom CatalogSources. Do not reuse the default catalog names.
- A valid time source must be configured as part of cluster installation.
- Engineering considerations
- Not applicable
3.3.3.12.2. Kernel
- New in this release
- Not applicable
- Description
The user can install the following kernel modules by using MachineConfig to provide extended kernel functionality to CNFs (see the sketch after this list):
- sctp
- ip_gre
- ip6_tables
- ip6t_REJECT
- ip6table_filter
- ip6table_mangle
- iptable_filter
- iptable_mangle
- iptable_nat
- xt_multiport
- xt_owner
- xt_REDIRECT
- xt_statistic
- xt_TCPMSS
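As an illustration, loading one of these modules is typically done with a MachineConfig CR that drops a modules-load.d file onto the node, as in the following minimal sketch for sctp; the CR name and role label are assumptions, and the reference CRs (for example, sctp_module_mc.yaml) are listed in the YAML reference section.

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: load-sctp-module          # hypothetical name
  labels:
    machineconfiguration.openshift.io/role: worker   # assumed role
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/modules-load.d/sctp-load.conf
          mode: 0644
          overwrite: true
          contents:
            # systemd-modules-load reads this file and loads sctp at boot
            source: data:,sctp
```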
- Limits and requirements
- Use of functionality available through these kernel modules must be analyzed by the user to determine the impact on CPU load, system performance, and ability to sustain KPI.
Note: Out of tree drivers are not supported.
- Engineering considerations
- Not applicable
3.3.4. Telco core 4.14 reference configuration CRs
Use the following custom resources (CRs) to configure and deploy OpenShift Container Platform clusters with the telco core profile. Use the CRs to form the common baseline used in all the specific use models unless otherwise indicated.
3.3.4.1. Resource Tuning reference CRs
| Component | Reference CR | Optional | New in this release |
|---|---|---|---|
| System reserved capacity | | Yes | No |
| System reserved capacity | | Yes | No |
3.3.4.2. Storage reference CRs
| Component | Reference CR | Optional | New in this release |
|---|---|---|---|
| External ODF configuration | | No | Yes |
| External ODF configuration | | No | No |
| External ODF configuration | | No | No |
| External ODF configuration | | No | No |
3.3.4.3. Networking reference CRs
| Component | Reference CR | Optional | New in this release |
|---|---|---|---|
| Baseline | | No | No |
| Baseline | | Yes | Yes |
| Load balancer | | No | No |
| Load balancer | | No | No |
| Load balancer | | No | No |
| Load balancer | | No | No |
| Load balancer | | No | No |
| Load balancer | | Yes | No |
| Load balancer | | Yes | No |
| Load balancer | | No | No |
| Multus - Tap CNI for rootless DPDK pod | | No | No |
| SR-IOV Network Operator | | Yes | No |
| SR-IOV Network Operator | | No | Yes |
| SR-IOV Network Operator | | No | Yes |
| SR-IOV Network Operator | | No | No |
| SR-IOV Network Operator | | No | No |
| SR-IOV Network Operator | | No | No |
3.3.4.4. Scheduling reference CRs
| Component | Reference CR | Optional | New in this release |
|---|---|---|---|
| NUMA-aware scheduler | | No | No |
| NUMA-aware scheduler | | No | No |
| NUMA-aware scheduler | | No | No |
| NUMA-aware scheduler | | No | No |
| NUMA-aware scheduler | | No | No |
3.3.4.5. Other reference CRs
| Component | Reference CR | Optional | New in this release |
|---|---|---|---|
| Additional kernel modules | | Yes | No |
| Additional kernel modules | | Yes | No |
| Additional kernel modules | | Yes | No |
| Cluster logging | | No | No |
| Cluster logging | | No | No |
| Cluster logging | | No | No |
| Cluster logging | | No | No |
| Cluster logging | | No | Yes |
| Disconnected configuration | | No | No |
| Disconnected configuration | | No | No |
| Disconnected configuration | | No | No |
| Monitoring and observability | | Yes | No |
| Power management | | No | No |
3.3.4.6. YAML reference
3.3.4.6.1. Resource tuning reference YAML
control-plane-system-reserved.yaml
pid-limits-cr.yaml
3.3.4.6.2. Storage reference YAML
01-rook-ceph-external-cluster-details.secret.yaml
02-ocs-external-storagecluster.yaml
odfNS.yaml
odfOperGroup.yaml
3.3.4.6.3. Networking reference YAML
Network.yaml
networkAttachmentDefinition.yaml
addr-pool.yaml
bfd-profile.yaml
bgp-advr.yaml
bgp-peer.yaml
metallb.yaml
metallbNS.yaml
metallbOperGroup.yaml
metallbSubscription.yaml
mc_rootless_pods_selinux.yaml
sriovNetwork.yaml
sriovNetworkNodePolicy.yaml
SriovOperatorConfig.yaml
SriovSubscription.yaml
SriovSubscriptionNS.yaml
SriovSubscriptionOperGroup.yaml
3.3.4.6.4. Scheduling reference YAML Copy linkLink copied to clipboard!
nrop.yaml
NROPSubscription.yaml
NROPSubscriptionNS.yaml
NROPSubscriptionOperGroup.yaml
sched.yaml
3.3.4.6.5. Other reference YAML Copy linkLink copied to clipboard!
control-plane-load-kernel-modules.yaml
sctp_module_mc.yaml
worker-load-kernel-modules.yaml
ClusterLogForwarder.yaml
ClusterLogging.yaml
ClusterLogNS.yaml
ClusterLogOperGroup.yaml
ClusterLogSubscription.yaml
catalog-source.yaml
icsp.yaml
operator-hub.yaml
monitoring-config-cm.yaml
PerformanceProfile.yaml
Chapter 4. Planning your environment according to object maximums Copy linkLink copied to clipboard!
Consider the following tested object maximums when you plan your OpenShift Container Platform cluster.
These guidelines are based on the largest possible cluster. For smaller clusters, the maximums are lower. There are many factors that influence the stated thresholds, including the etcd version or storage data format.
In most cases, exceeding these numbers results in lower overall performance. It does not necessarily mean that the cluster will fail.
Clusters that experience rapid change, such as those with many starting and stopping pods, can have a lower practical maximum size than documented.
4.1. OpenShift Container Platform tested cluster maximums for major releases Copy linkLink copied to clipboard!
Red Hat does not provide direct guidance on sizing your OpenShift Container Platform cluster. This is because determining whether your cluster is within the supported bounds of OpenShift Container Platform requires careful consideration of all the multidimensional factors that limit the cluster scale.
OpenShift Container Platform supports tested cluster maximums rather than absolute cluster maximums. Not every combination of OpenShift Container Platform version, control plane workload, and network plugin are tested, so the following table does not represent an absolute expectation of scale for all deployments. It might not be possible to scale to a maximum on all dimensions simultaneously. The table contains tested maximums for specific workload and deployment configurations, and serves as a scale guide as to what can be expected with similar deployments.
| Maximum type | 4.x tested maximum |
|---|---|
| Number of nodes | 2,000 [1] |
| Number of pods [2] | 150,000 |
| Number of pods per node | 2,500 [3][4] |
| Number of pods per core | There is no default value. |
| Number of namespaces [5] | 10,000 |
| Number of builds | 10,000 (Default pod RAM 512 Mi) - Source-to-Image (S2I) build strategy |
| Number of pods per namespace [6] | 25,000 |
| Number of routes and back ends per Ingress Controller | 2,000 per router |
| Number of secrets | 80,000 |
| Number of config maps | 90,000 |
| Number of services [7] | 10,000 |
| Number of services per namespace | 5,000 |
| Number of back-ends per service | 5,000 |
| Number of deployments per namespace [6] | 2,000 |
| Number of build configs | 12,000 |
| Number of custom resource definitions (CRD) | 1,024 [8] |
- Pause pods were deployed to stress the control plane components of OpenShift Container Platform at 2000 node scale. The ability to scale to similar numbers will vary depending upon specific deployment and workload parameters.
- The pod count displayed here is the number of test pods. The actual number of pods depends on the application’s memory, CPU, and storage requirements.
- This was tested on a cluster with 31 servers: 3 control planes, 2 infrastructure nodes, and 26 worker nodes. If you need 2,500 user pods, you need both a hostPrefix of 20, which allocates a network large enough for each node to contain more than 2000 pods, and a custom kubelet config with maxPods set to 2500. For more information, see Running 2500 pods per node on OCP 4.13.
- The maximum tested pods per node is 2,500 for clusters using the OVNKubernetes network plugin. The maximum tested pods per node for the OpenShiftSDN network plugin is 500 pods.
- When there are a large number of active projects, etcd might suffer from poor performance if the keyspace grows excessively large and exceeds the space quota. Periodic maintenance of etcd, including defragmentation, is highly recommended to free etcd storage.
- There are several control loops in the system that must iterate over all objects in a given namespace as a reaction to some changes in state. Having a large number of objects of a given type in a single namespace can make those loops expensive and slow down processing given state changes. The limit assumes that the system has enough CPU, memory, and disk to satisfy the application requirements.
- Each service port and each service back-end has a corresponding entry in iptables. The number of back-ends of a given service impacts the size of the Endpoints objects, which impacts the size of data that is being sent all over the system.
- Tested on a cluster with 29 servers: 3 control planes, 2 infrastructure nodes, and 24 worker nodes. The cluster had 500 namespaces. OpenShift Container Platform has a limit of 1,024 total custom resource definitions (CRDs), including those installed by OpenShift Container Platform, products integrating with OpenShift Container Platform, and user-created CRDs. If more than 1,024 CRDs are created, oc command requests might be throttled.
4.1.1. Example scenario Copy linkLink copied to clipboard!
As an example, 500 worker nodes (m5.2xl) were tested, and are supported, using OpenShift Container Platform 4.14, the OVN-Kubernetes network plugin, and the following workload objects:
- 200 namespaces, in addition to the defaults
- 60 pods per node; 30 server and 30 client pods (30k total)
- 57 image streams/ns (11.4k total)
- 15 services/ns backed by the server pods (3k total)
- 15 routes/ns backed by the previous services (3k total)
- 20 secrets/ns (4k total)
- 10 config maps/ns (2k total)
- 6 network policies/ns, including deny-all, allow-from ingress and intra-namespace rules
- 57 builds/ns
The following factors are known to affect cluster workload scaling, positively or negatively, and should be factored into the scale numbers when planning a deployment. For additional information and guidance, contact your sales representative or Red Hat support.
- Number of pods per node
- Number of containers per pod
- Type of probes used (for example, liveness/readiness, exec/http)
- Number of network policies
- Number of projects, or namespaces
- Number of image streams per project
- Number of builds per project
- Number of services/endpoints and type
- Number of routes
- Number of shards
- Number of secrets
- Number of config maps
- Rate of API calls, or the cluster “churn”, which is an estimation of how quickly things change in the cluster configuration.
- Prometheus query for pod creation requests per second over 5 minute windows: sum(irate(apiserver_request_count{resource="pods",verb="POST"}[5m]))
- Prometheus query for all API requests per second over 5 minute windows: sum(irate(apiserver_request_count{}[5m]))
- Cluster node resource consumption of CPU
- Cluster node resource consumption of memory
4.2. OpenShift Container Platform environment and configuration on which the cluster maximums are tested Copy linkLink copied to clipboard!
4.2.1. AWS cloud platform Copy linkLink copied to clipboard!
| Node | Flavor | vCPU | RAM (GiB) | Disk type | Disk size (GiB)/IOPS | Count | Region |
|---|---|---|---|---|---|---|---|
| Control plane/etcd [1] | r5.4xlarge | 16 | 128 | gp3 | 220 | 3 | us-west-2 |
| Infra [2] | m5.12xlarge | 48 | 192 | gp3 | 100 | 3 | us-west-2 |
| Workload [3] | m5.4xlarge | 16 | 64 | gp3 | 500 [4] | 1 | us-west-2 |
| Compute | m5.2xlarge | 8 | 32 | gp3 | 100 | 3/25/250/500 [5] | us-west-2 |
- gp3 disks with a baseline performance of 3000 IOPS and 125 MiB per second are used for control plane/etcd nodes because etcd is latency sensitive. gp3 volumes do not use burst performance.
- Infra nodes are used to host Monitoring, Ingress, and Registry components to ensure they have enough resources to run at large scale.
- Workload node is dedicated to run performance and scalability workload generators.
- Larger disk size is used so that there is enough space to store the large amounts of data that is collected during the performance and scalability test run.
- Cluster is scaled in iterations and performance and scalability tests are executed at the specified node counts.
4.2.2. IBM Power platform Copy linkLink copied to clipboard!
| Node | vCPU | RAM (GiB) | Disk type | Disk size (GiB)/IOPS | Count |
|---|---|---|---|---|---|
| Control plane/etcd [1] | 16 | 32 | io1 | 120 / 10 IOPS per GiB | 3 |
| Infra [2] | 16 | 64 | gp2 | 120 | 2 |
| Workload [3] | 16 | 256 | gp2 | 120 [4] | 1 |
| Compute | 16 | 64 | gp2 | 120 | 2 to 100 [5] |
- io1 disks with 120 / 10 IOPS per GiB are used for control plane/etcd nodes as etcd is I/O intensive and latency sensitive.
- Infra nodes are used to host Monitoring, Ingress, and Registry components to ensure they have enough resources to run at large scale.
- Workload node is dedicated to run performance and scalability workload generators.
- Larger disk size is used so that there is enough space to store the large amounts of data that is collected during the performance and scalability test run.
- Cluster is scaled in iterations.
4.2.3. IBM Z platform Copy linkLink copied to clipboard!
| Node | vCPU [4] | RAM (GiB) [5] | Disk type | Disk size (GiB)/IOPS | Count |
|---|---|---|---|---|---|
| Control plane/etcd [1,2] | 8 | 32 | ds8k | 300 / LCU 1 | 3 |
| Compute [1,3] | 8 | 32 | ds8k | 150 / LCU 2 | 4 nodes (scaled to 100/250/500 pods per node) |
- Nodes are distributed between two logical control units (LCUs) to optimize disk I/O load of the control plane/etcd nodes as etcd is I/O intensive and latency sensitive. Etcd I/O demand should not interfere with other workloads.
- Four compute nodes are used for the tests running several iterations with 100/250/500 pods at the same time. First, idling pods were used to evaluate if pods can be instanced. Next, a network and CPU demanding client/server workload were used to evaluate the stability of the system under stress. Client and server pods were pairwise deployed and each pair was spread over two compute nodes.
- No separate workload node was used. The workload simulates a microservice workload between two compute nodes.
- Physical number of processors used is six Integrated Facilities for Linux (IFLs).
- Total physical memory used is 512 GiB.
4.3. How to plan your environment according to tested cluster maximums Copy linkLink copied to clipboard!
Oversubscribing the physical resources on a node affects resource guarantees the Kubernetes scheduler makes during pod placement. Learn what measures you can take to avoid memory swapping.
Some of the tested maximums are stretched only in a single dimension. They will vary when many objects are running on the cluster.
The numbers noted in this documentation are based on Red Hat’s test methodology, setup, configuration, and tunings. These numbers can vary based on your own individual setup and environments.
While planning your environment, determine how many pods are expected to fit per node:
required pods per cluster / pods per node = total number of nodes needed
The default maximum number of pods per node is 250. However, the number of pods that fit on a node is dependent on the application itself. Consider the application’s memory, CPU, and storage requirements, as described in "How to plan your environment according to application requirements".
Example scenario
If you want to scope your cluster for 2200 pods per cluster, you would need at least five nodes, assuming that there are 500 maximum pods per node:
2200 / 500 = 4.4
If you increase the number of nodes to 20, then the pod distribution changes to 110 pods per node:
2200 / 20 = 110
Where:
required pods per cluster / total number of nodes = expected pods per node
OpenShift Container Platform comes with several system pods, such as SDN, DNS, Operators, and others, which run across every worker node by default. Therefore, the result of the above formula can vary.
4.4. How to plan your environment according to application requirements Copy linkLink copied to clipboard!
Consider an example application environment:
| Pod type | Pod quantity | Max memory | CPU cores | Persistent storage |
|---|---|---|---|---|
| apache | 100 | 500 MB | 0.5 | 1 GB |
| node.js | 200 | 1 GB | 1 | 1 GB |
| postgresql | 100 | 1 GB | 2 | 10 GB |
| JBoss EAP | 100 | 1 GB | 1 | 1 GB |
Extrapolated requirements: 550 CPU cores, 450GB RAM, and 1.4TB storage.
Instance size for nodes can be modulated up or down, depending on your preference. Nodes are often resource overcommitted. In this deployment scenario, you can choose to run additional smaller nodes or fewer larger nodes to provide the same amount of resources. Factors such as operational agility and cost-per-instance should be considered.
| Node type | Quantity | CPUs | RAM (GB) |
|---|---|---|---|
| Nodes (option 1) | 100 | 4 | 16 |
| Nodes (option 2) | 50 | 8 | 32 |
| Nodes (option 3) | 25 | 16 | 64 |
Some applications lend themselves well to overcommitted environments, and some do not. Most Java applications and applications that use huge pages are examples of applications that do not allow for overcommitment; that memory cannot be used for other applications. In the example above, the environment is roughly 30 percent overcommitted, a common ratio.
The application pods can access a service either by using environment variables or DNS. If using environment variables, the kubelet injects the variables for each active service when a pod is run on a node. A cluster-aware DNS server watches the Kubernetes API for new services and creates a set of DNS records for each one. If DNS is enabled throughout your cluster, then all pods should automatically be able to resolve services by their DNS name. Use DNS-based service discovery if you must go beyond 5000 services. When environment variables are used for service discovery, the argument list exceeds the allowed length after 5000 services in a namespace, and the pods and deployments start failing. Disable the service links in the deployment's pod specification to overcome this:
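The example manifest is not preserved in this extract. A minimal sketch of a deployment pod template with service links disabled follows; the deployment name and image are placeholders:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app                # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      enableServiceLinks: false    # do not inject per-service environment variables into the pod
      containers:
      - name: app
        image: registry.example.com/example/app:latest   # placeholder image

With enableServiceLinks set to false, the kubelet no longer injects the per-service environment variables listed below, so the ARG_MAX limit does not constrain the number of services.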
The number of application pods that can run in a namespace is dependent on the number of services and the length of the service name when the environment variables are used for service discovery. ARG_MAX on the system defines the maximum argument length for a new process and it is set to 2097152 bytes (2 MiB) by default. The Kubelet injects environment variables in to each pod scheduled to run in the namespace including:
- <SERVICE_NAME>_SERVICE_HOST=<IP>
- <SERVICE_NAME>_SERVICE_PORT=<PORT>
- <SERVICE_NAME>_PORT=tcp://<IP>:<PORT>
- <SERVICE_NAME>_PORT_<PORT>_TCP=tcp://<IP>:<PORT>
- <SERVICE_NAME>_PORT_<PORT>_TCP_PROTO=tcp
- <SERVICE_NAME>_PORT_<PORT>_TCP_PORT=<PORT>
- <SERVICE_NAME>_PORT_<PORT>_TCP_ADDR=<ADDR>
The pods in the namespace start to fail if the argument length exceeds the allowed value, and the number of characters in each service name affects when that happens. For example, in a namespace with 5000 services, the limit on the service name is 33 characters, which enables you to run 5000 pods in the namespace.
Chapter 5. Using quotas and limit ranges Copy linkLink copied to clipboard!
A resource quota, defined by a ResourceQuota object, provides constraints that limit aggregate resource consumption per project. It can limit the quantity of objects that can be created in a project by type, as well as the total amount of compute resources and storage that may be consumed by resources in that project.
Using quotas and limit ranges, cluster administrators can set constraints to limit the number of objects or amount of compute resources that are used in your project. This helps cluster administrators better manage and allocate resources across all projects, and ensure that no projects are using more than is appropriate for the cluster size.
Quotas are set by cluster administrators and are scoped to a given project. OpenShift Container Platform project owners can change quotas for their project, but not limit ranges. OpenShift Container Platform users cannot modify quotas or limit ranges.
The following sections help you understand how to check on your quota and limit range settings, what sorts of things they can constrain, and how you can request or limit compute resources in your own pods and containers.
5.1. Resources managed by quota Copy linkLink copied to clipboard!
A resource quota, defined by a ResourceQuota object, provides constraints that limit aggregate resource consumption per project. It can limit the quantity of objects that can be created in a project by type, as well as the total amount of compute resources and storage that may be consumed by resources in that project.
The following describes the set of compute resources and object types that may be managed by a quota.
A pod is in a terminal state if status.phase is Failed or Succeeded.
| Resource Name | Description |
|---|---|
| cpu | The sum of CPU requests across all pods in a non-terminal state cannot exceed this value. |
| memory | The sum of memory requests across all pods in a non-terminal state cannot exceed this value. |
| ephemeral-storage | The sum of local ephemeral storage requests across all pods in a non-terminal state cannot exceed this value. |
| requests.cpu | The sum of CPU requests across all pods in a non-terminal state cannot exceed this value. |
| requests.memory | The sum of memory requests across all pods in a non-terminal state cannot exceed this value. |
| requests.ephemeral-storage | The sum of ephemeral storage requests across all pods in a non-terminal state cannot exceed this value. |
| limits.cpu | The sum of CPU limits across all pods in a non-terminal state cannot exceed this value. |
| limits.memory | The sum of memory limits across all pods in a non-terminal state cannot exceed this value. |
| limits.ephemeral-storage | The sum of ephemeral storage limits across all pods in a non-terminal state cannot exceed this value. This resource is available only if you enabled the ephemeral storage technology preview. This feature is disabled by default. |
| Resource Name | Description |
|---|---|
| requests.storage | The sum of storage requests across all persistent volume claims in any state cannot exceed this value. |
| persistentvolumeclaims | The total number of persistent volume claims that can exist in the project. |
| <storage-class-name>.storageclass.storage.k8s.io/requests.storage | The sum of storage requests across all persistent volume claims in any state that have a matching storage class cannot exceed this value. |
| <storage-class-name>.storageclass.storage.k8s.io/persistentvolumeclaims | The total number of persistent volume claims with a matching storage class that can exist in the project. |
| Resource Name | Description |
|---|---|
| pods | The total number of pods in a non-terminal state that can exist in the project. |
| replicationcontrollers | The total number of replication controllers that can exist in the project. |
| resourcequotas | The total number of resource quotas that can exist in the project. |
| services | The total number of services that can exist in the project. |
| secrets | The total number of secrets that can exist in the project. |
| configmaps | The total number of ConfigMap objects that can exist in the project. |
| persistentvolumeclaims | The total number of persistent volume claims that can exist in the project. |
| openshift.io/imagestreams | The total number of image streams that can exist in the project. |
You can configure an object count quota for these standard namespaced resource types using the count/<resource>.<group> syntax.
$ oc create quota <name> --hard=count/<resource>.<group>=<quota>
- 1
- <resource> is the name of the resource, and <group> is the API group, if applicable. Use the kubectl api-resources command for a list of resources and their associated API groups.
5.1.1. Setting resource quota for extended resources Copy linkLink copied to clipboard!
Overcommitment of resources is not allowed for extended resources, so you must specify requests and limits for the same extended resource in a quota. Currently, only quota items with the prefix requests. are allowed for extended resources. The following is an example scenario of how to set resource quota for the GPU resource nvidia.com/gpu.
Procedure
To determine how many GPUs are available on a node in your cluster, use the following command:

$ oc describe node ip-172-31-27-209.us-west-2.compute.internal | egrep 'Capacity|Allocatable|gpu'

In this example, 2 GPUs are available.

Use this command to set a quota in the nvidia namespace. In this example, the quota is 1:

$ cat gpu-quota.yaml
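The contents of gpu-quota.yaml are not preserved in this extract. A minimal sketch of a quota that limits GPU requests to 1, using the requests.nvidia.com/gpu resource name shown in the quota output later in this procedure, might look like the following:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: nvidia
spec:
  hard:
    requests.nvidia.com/gpu: 1   # only one GPU request is allowed in this namespace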
Create the quota with the following command:

$ oc create -f gpu-quota.yaml

Example output
resourcequota/gpu-quota created

Verify that the namespace has the correct quota set using the following command:
$ oc describe quota gpu-quota -n nvidia

Example output

Name:                    gpu-quota
Namespace:               nvidia
Resource                 Used  Hard
--------                 ----  ----
requests.nvidia.com/gpu  0     1

Run a pod that asks for a single GPU with the following command:
$ oc create -f gpu-pod.yaml
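The gpu-pod.yaml manifest is not preserved in this extract. A minimal sketch of a pod that requests a single GPU might look like the following; the image is a placeholder, and generateName matches the gpu-pod- prefix in the output shown below:

apiVersion: v1
kind: Pod
metadata:
  generateName: gpu-pod-
  namespace: nvidia
spec:
  restartPolicy: OnFailure
  containers:
  - name: gpu-workload
    image: registry.example.com/example/gpu-workload:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1   # one GPU; counted against requests.nvidia.com/gpu in the quota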
Verify that the pod is running with the following command:
$ oc get pods

Example output

NAME            READY   STATUS    RESTARTS   AGE
gpu-pod-s46h7   1/1     Running   0          1m

Verify that the quota Used counter is correct by running the following command:
$ oc describe quota gpu-quota -n nvidia

Example output

Name:                    gpu-quota
Namespace:               nvidia
Resource                 Used  Hard
--------                 ----  ----
requests.nvidia.com/gpu  1     1

Using the following command, attempt to create a second GPU pod in the nvidia namespace. This is technically available on the node because it has 2 GPUs:

$ oc create -f gpu-pod.yaml
Example output

Error from server (Forbidden): error when creating "gpu-pod.yaml": pods "gpu-pod-f7z2w" is forbidden: exceeded quota: gpu-quota, requested: requests.nvidia.com/gpu=1, used: requests.nvidia.com/gpu=1, limited: requests.nvidia.com/gpu=1

This Forbidden error message occurs because you have a quota of 1 GPU and this pod tried to allocate a second GPU, which exceeds its quota.
5.1.2. Quota scopes Copy linkLink copied to clipboard!
Each quota can have an associated set of scopes. A quota only measures usage for a resource if it matches the intersection of enumerated scopes.
Adding a scope to a quota restricts the set of resources to which that quota can apply. Specifying a resource outside of the allowed set results in a validation error.
| Scope | Description |
|---|---|
| Terminating | Match pods where spec.activeDeadlineSeconds >= 0. |
| NotTerminating | Match pods where spec.activeDeadlineSeconds is nil. |
| BestEffort | Match pods that have best effort quality of service for either cpu or memory. |
| NotBestEffort | Match pods that do not have best effort quality of service for cpu and memory. |
A BestEffort scope restricts a quota to limiting the following resources:
- pods
A Terminating, NotTerminating, and NotBestEffort scope restricts a quota to tracking the following resources:
- pods
- memory
- requests.memory
- limits.memory
- cpu
- requests.cpu
- limits.cpu
- ephemeral-storage
- requests.ephemeral-storage
- limits.ephemeral-storage
Ephemeral storage requests and limits apply only if you enabled the ephemeral storage technology preview. This feature is disabled by default.
Additional resources
See Resources managed by quotas for more on compute resources.
See Quality of Service Classes for more on committing compute resources.
5.2. Admin quota usage Copy linkLink copied to clipboard!
5.2.1. Quota enforcement Copy linkLink copied to clipboard!
After a resource quota for a project is first created, the project restricts the ability to create any new resources that can violate a quota constraint until it has calculated updated usage statistics.
After a quota is created and usage statistics are updated, the project accepts the creation of new content. When you create or modify resources, your quota usage is incremented immediately upon the request to create or modify the resource.
When you delete a resource, your quota use is decremented during the next full recalculation of quota statistics for the project.
A configurable amount of time determines how long it takes to reduce quota usage statistics to their current observed system value.
If project modifications exceed a quota usage limit, the server denies the action, and an appropriate error message is returned to the user explaining the quota constraint violated, and what their currently observed usage stats are in the system.
5.2.2. Requests compared to limits Copy linkLink copied to clipboard!
When allocating compute resources by quota, each container can specify a request and a limit value each for CPU, memory, and ephemeral storage. Quotas can restrict any of these values.
If the quota has a value specified for requests.cpu or requests.memory, then it requires that every incoming container make an explicit request for those resources. If the quota has a value specified for limits.cpu or limits.memory, then it requires that every incoming container specify an explicit limit for those resources.
5.2.3. Sample resource quota definitions Copy linkLink copied to clipboard!
Example core-object-counts.yaml
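The YAML for this example is not preserved in this extract. A minimal sketch that matches the callouts below, with illustrative hard limits, might look like the following:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: core-object-counts
spec:
  hard:
    configmaps: "10"               # 1
    persistentvolumeclaims: "4"    # 2
    replicationcontrollers: "20"   # 3
    secrets: "10"                  # 4
    services: "10"                 # 5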
- 1
- The total number of ConfigMap objects that can exist in the project.
- 2
- The total number of persistent volume claims (PVCs) that can exist in the project.
- 3
- The total number of replication controllers that can exist in the project.
- 4
- The total number of secrets that can exist in the project.
- 5
- The total number of services that can exist in the project.
Example openshift-object-counts.yaml
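The YAML for this example is not preserved in this extract. A minimal sketch matching the callout below, with an illustrative limit, might be:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: openshift-object-counts
spec:
  hard:
    openshift.io/imagestreams: "10"   # 1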
- 1
- The total number of image streams that can exist in the project.
Example compute-resources.yaml
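The YAML for this example is not preserved in this extract. A sketch matching the callouts below might look like the following; the pod count is illustrative and the compute values follow the callout text:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    pods: "4"                         # 1
    requests.cpu: "1"                 # 2
    requests.memory: 1Gi              # 3
    requests.ephemeral-storage: 2Gi   # 4
    limits.cpu: "2"                   # 5
    limits.memory: 2Gi                # 6
    limits.ephemeral-storage: 4Gi     # 7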
- 1
- The total number of pods in a non-terminal state that can exist in the project.
- 2
- Across all pods in a non-terminal state, the sum of CPU requests cannot exceed 1 core.
- 3
- Across all pods in a non-terminal state, the sum of memory requests cannot exceed 1Gi.
- 4
- Across all pods in a non-terminal state, the sum of ephemeral storage requests cannot exceed 2Gi.
- 5
- Across all pods in a non-terminal state, the sum of CPU limits cannot exceed 2 cores.
- 6
- Across all pods in a non-terminal state, the sum of memory limits cannot exceed 2Gi.
- 7
- Across all pods in a non-terminal state, the sum of ephemeral storage limits cannot exceed 4Gi.
Example besteffort.yaml
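The besteffort.yaml example is not preserved in this extract. A minimal sketch of a quota that limits the number of BestEffort pods, with an illustrative pod count, might be:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: besteffort
spec:
  hard:
    pods: "1"          # illustrative limit on BestEffort pods
  scopes:
  - BestEffort         # quota applies only to pods with best effort quality of service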
Example compute-resources-long-running.yaml
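The compute-resources-long-running.yaml example is not preserved in this extract. A sketch matching the callouts below might look like the following; resource values are illustrative, and the NotTerminating scope corresponds to callout 5:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources-long-running
spec:
  hard:
    pods: "4"                       # 1
    limits.cpu: "4"                 # 2
    limits.memory: 2Gi              # 3
    limits.ephemeral-storage: 4Gi   # 4
  scopes:
  - NotTerminating                  # 5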
- 1
- The total number of pods in a non-terminal state.
- 2
- Across all pods in a non-terminal state, the sum of CPU limits cannot exceed this value.
- 3
- Across all pods in a non-terminal state, the sum of memory limits cannot exceed this value.
- 4
- Across all pods in a non-terminal state, the sum of ephemeral storage limits cannot exceed this value.
- 5
- Restricts the quota to only matching pods where spec.activeDeadlineSeconds is set to nil. Build pods fall under NotTerminating unless the RestartNever policy is applied.
Example compute-resources-time-bound.yaml
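The compute-resources-time-bound.yaml example is not preserved in this extract. A sketch matching the callouts below might look like the following; resource values are illustrative, and the Terminating scope corresponds to callout 5:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources-time-bound
spec:
  hard:
    pods: "2"                       # 1
    limits.cpu: "1"                 # 2
    limits.memory: 1Gi              # 3
    limits.ephemeral-storage: 1Gi   # 4
  scopes:
  - Terminating                     # 5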
- 1
- The total number of pods in a non-terminal state.
- 2
- Across all pods in a non-terminal state, the sum of CPU limits cannot exceed this value.
- 3
- Across all pods in a non-terminal state, the sum of memory limits cannot exceed this value.
- 4
- Across all pods in a non-terminal state, the sum of ephemeral storage limits cannot exceed this value.
- 5
- Restricts the quota to only matching pods where spec.activeDeadlineSeconds >= 0. For example, this quota would charge for build pods, but not long-running pods such as a web server or database.
Example storage-consumption.yaml
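The storage-consumption.yaml example is not preserved in this extract. A sketch matching the callouts below might look like the following; the non-zero values are illustrative:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage-consumption
spec:
  hard:
    persistentvolumeclaims: "10"                                     # 1
    requests.storage: 50Gi                                           # 2
    gold.storageclass.storage.k8s.io/requests.storage: 10Gi          # 3
    silver.storageclass.storage.k8s.io/requests.storage: 20Gi        # 4
    silver.storageclass.storage.k8s.io/persistentvolumeclaims: "5"   # 5
    bronze.storageclass.storage.k8s.io/requests.storage: "0"         # 6
    bronze.storageclass.storage.k8s.io/persistentvolumeclaims: "0"   # 7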
- 1
- The total number of persistent volume claims in a project
- 2
- Across all persistent volume claims in a project, the sum of storage requested cannot exceed this value.
- 3
- Across all persistent volume claims in a project, the sum of storage requested in the gold storage class cannot exceed this value.
- 4
- Across all persistent volume claims in a project, the sum of storage requested in the silver storage class cannot exceed this value.
- 5
- Across all persistent volume claims in a project, the total number of claims in the silver storage class cannot exceed this value.
- 6
- Across all persistent volume claims in a project, the sum of storage requested in the bronze storage class cannot exceed this value. When this is set to 0, the bronze storage class cannot request storage.
- 7
- Across all persistent volume claims in a project, the total number of claims in the bronze storage class cannot exceed this value. When this is set to 0, the bronze storage class cannot create claims.
5.2.4. Creating a quota Copy linkLink copied to clipboard!
To create a quota, first define the quota in a file. Then use that file to apply it to a project. See the Additional resources section for a link describing this.
$ oc create -f <resource_quota_definition> [-n <project_name>]
Here is an example using the core-object-counts.yaml resource quota definition and the demoproject project name:
$ oc create -f core-object-counts.yaml -n demoproject
5.2.5. Creating object count quotas Copy linkLink copied to clipboard!
You can create an object count quota for all OpenShift Container Platform standard namespaced resource types, such as BuildConfig, and DeploymentConfig. An object quota count places a defined quota on all standard namespaced resource types.
When using a resource quota, an object is charged against the quota if it exists in server storage. These types of quotas are useful to protect against exhaustion of storage resources.
To configure an object count quota for a resource, run the following command:
$ oc create quota <name> --hard=count/<resource>.<group>=<quota>,count/<resource>.<group>=<quota>
Example showing object count quota:
This example limits the listed resources to the hard limit in each project in the cluster.
5.2.6. Viewing a quota Copy linkLink copied to clipboard!
You can view usage statistics related to any hard limits defined in a project’s quota by navigating in the web console to the project’s Quota page.
You can also use the CLI to view quota details:
First, get the list of quotas defined in the project. For example, for a project called demoproject:

$ oc get quota -n demoproject

Example output

NAME                 AGE
besteffort           11m
compute-resources    2m
core-object-counts   29m

Describe the quota you are interested in, for example the core-object-counts quota:

$ oc describe quota core-object-counts -n demoproject
5.2.7. Configuring quota synchronization period Copy linkLink copied to clipboard!
When a set of resources are deleted, the synchronization time frame of resources is determined by the resource-quota-sync-period setting in the /etc/origin/master/master-config.yaml file.
Before quota usage is restored, a user can encounter problems when attempting to reuse the resources. You can change the resource-quota-sync-period setting to have the set of resources regenerate in the needed amount of time (in seconds) for the resources to be once again available:
Example resource-quota-sync-period setting
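The example setting is not preserved in this extract. A hedged sketch of the relevant stanza in master-config.yaml, assuming a 10 second sync period (the value and the surrounding keys are illustrative), might look like the following:

kubernetesMasterConfig:
  apiServerArguments: null
  controllerArguments:
    resource-quota-sync-period:
    - "10s"              # time to wait before regenerating deleted resources in quota statistics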
After making any changes, restart the controller services to apply them.
$ master-restart api
$ master-restart controllers
Adjusting the regeneration time can be helpful for creating resources and determining resource usage when automation is used.
The resource-quota-sync-period setting balances system performance. Reducing the sync period can result in a heavy load on the controller.
5.2.8. Explicit quota to consume a resource Copy linkLink copied to clipboard!
If a resource is not managed by quota, a user has no restriction on the amount of resource that can be consumed. For example, if there is no quota on storage related to the gold storage class, the amount of gold storage a project can create is unbounded.
For high-cost compute or storage resources, administrators can require an explicit quota be granted to consume a resource. For example, if a project was not explicitly given quota for storage related to the gold storage class, users of that project would not be able to create any storage of that type.
To require explicit quota to consume a particular resource, add the following stanza to the master-config.yaml file.
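The stanza is not preserved in this extract. A hedged sketch that matches the PersistentVolumeClaim and gold storage class example described in the next paragraph might look like the following:

admissionConfig:
  pluginConfig:
    ResourceQuota:
      configuration:
        apiVersion: resourcequota.admission.k8s.io/v1alpha1
        kind: Configuration
        limitedResources:
        - resource: persistentvolumeclaims               # intercept PVC create and update operations
          matchContains:
          - gold.storageclass.storage.k8s.io/requests.storage   # deny unless a covering quota exists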
In the above example, the quota system intercepts every operation that creates or updates a PersistentVolumeClaim. It checks what resources controlled by quota would be consumed. If there is no covering quota for those resources in the project, the request is denied. In this example, if a user creates a PersistentVolumeClaim that uses storage associated with the gold storage class and there is no matching quota in the project, the request is denied.
Additional resources
For examples of how to create the file needed to set quotas, see Resources managed by quotas.
A description of how to allocate compute resources managed by quota.
For information on managing limits and quota on project resources, see Working with projects.
If a quota has been defined for your project, see Understanding deployments for considerations in cluster configurations.
5.3. Setting limit ranges Copy linkLink copied to clipboard!
A limit range, defined by a LimitRange object, defines compute resource constraints at the pod, container, image, image stream, and persistent volume claim level. The limit range specifies the amount of resources that a pod, container, image, image stream, or persistent volume claim can consume.
All requests to create and modify resources are evaluated against each LimitRange object in the project. If the resource violates any of the enumerated constraints, the resource is rejected. If the resource does not set an explicit value, and if the constraint supports a default value, the default value is applied to the resource.
For CPU and memory limits, if you specify a maximum value but do not specify a minimum limit, the resource can consume more CPU and memory resources than the maximum value.
Core limit range object definition
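The definition is not preserved in this extract. A sketch matching the callouts below, with illustrative resource values, might look like the following:

apiVersion: "v1"
kind: "LimitRange"
metadata:
  name: "core-resource-limits"   # 1
spec:
  limits:
  - type: "Pod"
    max:
      cpu: "2"                   # 2
      memory: "1Gi"              # 3
    min:
      cpu: "200m"                # 4
      memory: "6Mi"              # 5
  - type: "Container"
    max:
      cpu: "2"                   # 6
      memory: "1Gi"              # 7
    min:
      cpu: "100m"                # 8
      memory: "4Mi"              # 9
    default:
      cpu: "300m"                # 10
      memory: "200Mi"            # 11
    defaultRequest:
      cpu: "200m"                # 12
      memory: "100Mi"            # 13
    maxLimitRequestRatio:
      cpu: "10"                  # 14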
- 1
- The name of the limit range object.
- 2
- The maximum amount of CPU that a pod can request on a node across all containers.
- 3
- The maximum amount of memory that a pod can request on a node across all containers.
- 4
- The minimum amount of CPU that a pod can request on a node across all containers. If you do not set a min value, or if you set min to 0, the result is no limit and the pod can consume more than the max CPU value.
- 5
- The minimum amount of memory that a pod can request on a node across all containers. If you do not set a min value, or if you set min to 0, the result is no limit and the pod can consume more than the max memory value.
- 6
- The maximum amount of CPU that a single container in a pod can request.
- 7
- The maximum amount of memory that a single container in a pod can request.
- 8
- The minimum amount of CPU that a single container in a pod can request. If you do not set a min value, or if you set min to 0, the result is no limit and the pod can consume more than the max CPU value.
- 9
- The minimum amount of memory that a single container in a pod can request. If you do not set a min value, or if you set min to 0, the result is no limit and the pod can consume more than the max memory value.
- 10
- The default CPU limit for a container if you do not specify a limit in the pod specification.
- 11
- The default memory limit for a container if you do not specify a limit in the pod specification.
- 12
- The default CPU request for a container if you do not specify a request in the pod specification.
- 13
- The default memory request for a container if you do not specify a request in the pod specification.
- 14
- The maximum limit-to-request ratio for a container.
OpenShift Container Platform Limit range object definition
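The definition is not preserved in this extract. A sketch matching the callouts below, with illustrative values, might look like the following:

apiVersion: "v1"
kind: "LimitRange"
metadata:
  name: "openshift-resource-limits"
spec:
  limits:
  - type: openshift.io/Image
    max:
      storage: 1Gi                   # 1
  - type: openshift.io/ImageStream
    max:
      openshift.io/image-tags: 20    # 2
      openshift.io/images: 30        # 3
  - type: "Pod"
    max:
      cpu: "2"                       # 4
      memory: "1Gi"                  # 5
      ephemeral-storage: "1Gi"       # 6
    min:
      cpu: "200m"                    # 7
      memory: "6Mi"                  # 8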
- 1
- The maximum size of an image that can be pushed to an internal registry.
- 2
- The maximum number of unique image tags as defined in the specification for the image stream.
- 3
- The maximum number of unique image references as defined in the specification for the image stream status.
- 4
- The maximum amount of CPU that a pod can request on a node across all containers.
- 5
- The maximum amount of memory that a pod can request on a node across all containers.
- 6
- The maximum amount of ephemeral storage that a pod can request on a node across all containers.
- 7
- The minimum amount of CPU that a pod can request on a node across all containers. See the Supported Constraints table for important information.
- 8
- The minimum amount of memory that a pod can request on a node across all containers. If you do not set a min value, or if you set min to 0, the result is no limit and the pod can consume more than the max memory value.
You can specify both core and OpenShift Container Platform resources in one limit range object.
5.3.1. Container limits Copy linkLink copied to clipboard!
Supported Resources:
- CPU
- Memory
Supported Constraints
Per container, the following must hold true if specified:
Container
| Constraint | Behavior |
|---|---|
| Min[<resource>] <= container.resources.requests[<resource>] (required) <= container.resources.limits[<resource>] (optional) | If the configuration defines a min CPU, the request value must be greater than the min CPU value. A limit value does not need to be specified. |
| container.resources.limits[<resource>] (required) <= Max[<resource>] | If the configuration defines a max CPU, you do not need to define a CPU request value. However, a limit must be set that satisfies the maximum CPU constraint. |
| MaxLimitRequestRatio[<resource>] <= (container.resources.limits[<resource>] / container.resources.requests[<resource>]) | If the limit range defines a maxLimitRequestRatio constraint, any new containers must have both a request and a limit value, and the limit divided by the request must be less than or equal to this ratio. For example, if a container has cpu: 500 in the limit value and cpu: 100 in the request value, its limit-to-request ratio for cpu is 5. |
Supported Defaults:
- Default[<resource>]: Defaults container.resources.limit[<resource>] to the specified value if none is set.
- Default Requests[<resource>]: Defaults container.resources.requests[<resource>] to the specified value if none is set.
5.3.2. Pod limits Copy linkLink copied to clipboard!
Supported Resources:
- CPU
- Memory
Supported Constraints:
Across all containers in a pod, the following must hold true:
| Constraint | Enforced Behavior |
|---|---|
| Min[<resource>] | Min[<resource>] <= container.resources.requests[<resource>] (required) <= container.resources.limits[<resource>] (optional) |
| Max[<resource>] | container.resources.limits[<resource>] (required) <= Max[<resource>] |
| MaxLimitRequestRatio[<resource>] | MaxLimitRequestRatio[<resource>] <= (container.resources.limits[<resource>] / container.resources.requests[<resource>]) |
5.3.3. Image limits Copy linkLink copied to clipboard!
Supported Resources:
- Storage
Resource type name:
- openshift.io/Image
Per image, the following must hold true if specified:
| Constraint | Behavior |
|---|---|
| Max[storage] | The size of an image that can be pushed to the internal registry cannot exceed this value. |
To prevent blobs that exceed the limit from being uploaded to the registry, the registry must be configured to enforce quota. The REGISTRY_MIDDLEWARE_REPOSITORY_OPENSHIFT_ENFORCEQUOTA environment variable must be set to true. By default, the environment variable is set to true for new deployments.
5.3.4. Image stream limits Copy linkLink copied to clipboard!
Supported Resources:
- openshift.io/image-tags
- openshift.io/images
Resource type name:
- openshift.io/ImageStream
Per image stream, the following must hold true if specified:
| Constraint | Behavior |
|---|---|
| Max[openshift.io/image-tags] | The number of unique image tags in the image stream specification cannot exceed this value. |
| Max[openshift.io/images] | The number of unique image references in the image stream status cannot exceed this value. |
5.3.5. Counting of image references Copy linkLink copied to clipboard!
The openshift.io/image-tags resource represents unique stream limits. Possible references are an ImageStreamTag, an ImageStreamImage, or a DockerImage. Tags can be created by using the oc tag and oc import-image commands or by using image streams. No distinction is made between internal and external references. However, each unique reference that is tagged in an image stream specification is counted just once. It does not restrict pushes to an internal container image registry in any way, but is useful for tag restriction.
The openshift.io/images resource represents unique image names that are recorded in image stream status. It helps to restrict several images that can be pushed to the internal registry. Internal and external references are not distinguished.
5.3.6. PersistentVolumeClaim limits Copy linkLink copied to clipboard!
Supported Resources:
- Storage
Supported Constraints:
Across all persistent volume claims in a project, the following must hold true:
| Constraint | Enforced Behavior |
|---|---|
| Min[<resource>] | Min[<resource>] <= claim.spec.resources.requests[<resource>] (required) |
| Max[<resource>] | claim.spec.resources.requests[<resource>] (required) <= Max[<resource>] |
Limit Range Object Definition
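The definition is not preserved in this extract. A minimal sketch of a limit range that constrains persistent volume claim sizes, with illustrative name and values, might look like the following:

apiVersion: "v1"
kind: "LimitRange"
metadata:
  name: "pvc-limit-range"
spec:
  limits:
  - type: "PersistentVolumeClaim"
    min:
      storage: "2Gi"    # minimum storage a claim can request
    max:
      storage: "50Gi"   # maximum storage a claim can request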
Additional resources
For information on stream limits, see Managing image streams.
For more information on compute resource constraints, see Resources managed by quotas.
For more information on how CPU and memory are measured, see Recommended control plane practices.
You can specify limits and requests for ephemeral storage. For more information on this feature, see Understanding ephemeral storage.
5.4. Limit range operations Copy linkLink copied to clipboard!
5.4.1. Creating a limit range Copy linkLink copied to clipboard!
Shown here is an example procedure to follow for creating a limit range.
Procedure
Create the object:
$ oc create -f <limit_range_file> -n <project>
5.4.2. View the limit Copy linkLink copied to clipboard!
You can view any limit ranges that are defined in a project by navigating in the web console to the Quota page for the project. You can also use the CLI to view limit range details by performing the following steps:
Procedure
Get the list of limit range objects that are defined in the project. For example, a project called demoproject:

$ oc get limits -n demoproject

Example output

NAME              AGE
resource-limits   6d

Describe the limit range. For example, for a limit range called resource-limits:

$ oc describe limits resource-limits -n demoproject
5.4.3. Deleting a limit range Copy linkLink copied to clipboard!
To remove a limit range, run the following command:
$ oc delete limits <limit_name>
Additional resources
For information about enforcing different limits on the number of projects that your users can create, and for managing limits and quota on project resources, see Resource quotas per project.
Chapter 6. Recommended host practices for IBM Z & IBM LinuxONE environments Copy linkLink copied to clipboard!
This topic provides recommended host practices for OpenShift Container Platform on IBM Z® and IBM® LinuxONE.
The s390x architecture is unique in many aspects. Therefore, some recommendations made here might not apply to other platforms.
Unless stated otherwise, these practices apply to both z/VM and Red Hat Enterprise Linux (RHEL) KVM installations on IBM Z® and IBM® LinuxONE.
6.1. Managing CPU overcommitment Copy linkLink copied to clipboard!
In a highly virtualized IBM Z® environment, you must carefully plan the infrastructure setup and sizing. One of the most important features of virtualization is the capability to do resource overcommitment, allocating more resources to the virtual machines than actually available at the hypervisor level. This is very workload dependent and there is no golden rule that can be applied to all setups.
Depending on your setup, consider these best practices regarding CPU overcommitment:
- At LPAR level (PR/SM hypervisor), avoid assigning all available physical cores (IFLs) to each LPAR. For example, with four physical IFLs available, you should not define three LPARs with four logical IFLs each.
- Check and understand LPAR shares and weights.
- An excessive number of virtual CPUs can adversely affect performance. Do not define more virtual processors to a guest than logical processors are defined to the LPAR.
- Configure the number of virtual processors per guest for peak workload, not more.
- Start small and monitor the workload. Increase the vCPU number incrementally if necessary.
- Not all workloads are suitable for high overcommitment ratios. If the workload is CPU intensive, you will probably not be able to achieve high ratios without performance problems. Workloads that are more I/O intensive can keep consistent performance even with high overcommitment ratios.
6.2. Disable Transparent Huge Pages Copy linkLink copied to clipboard!
Transparent Huge Pages (THP) attempt to automate most aspects of creating, managing, and using huge pages. Since THP automatically manages the huge pages, this is not always handled optimally for all types of workloads. THP can lead to performance regressions, since many applications handle huge pages on their own. Therefore, consider disabling THP.
6.3. Boost networking performance with Receive Flow Steering Copy linkLink copied to clipboard!
Receive Flow Steering (RFS) extends Receive Packet Steering (RPS) by further reducing network latency. RFS is technically based on RPS, and improves the efficiency of packet processing by increasing the CPU cache hit rate. RFS achieves this, and in addition considers queue length, by determining the most convenient CPU for computation so that cache hits are more likely to occur within the CPU. Thus, the CPU cache is invalidated less and requires fewer cycles to rebuild the cache. This can help reduce packet processing run time.
6.3.1. Use the Machine Config Operator (MCO) to activate RFS Copy linkLink copied to clipboard!
Procedure
Copy the following MCO sample profile into a YAML file. For example, enable-rfs.yaml:
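The sample profile is not preserved in this extract. A simplified sketch of a MachineConfig named 50-enable-rfs is shown below; the sysctl value and the use of a sysctl.d drop-in file are assumptions, and the full reference profile may also set per-queue rps_flow_cnt values through udev rules:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 50-enable-rfs
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - path: /etc/sysctl.d/50-enable-rfs.conf
        mode: 0644
        overwrite: true
        contents:
          # URL-encoded contents: "net.core.rps_sock_flow_entries=8192\n" (value is illustrative)
          source: data:,net.core.rps_sock_flow_entries%3D8192%0A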
Create the MCO profile:

$ oc create -f enable-rfs.yaml

Verify that an entry named 50-enable-rfs is listed:

$ oc get mc

To deactivate, enter:

$ oc delete mc 50-enable-rfs
6.4. Choose your networking setup Copy linkLink copied to clipboard!
The networking stack is one of the most important components for a Kubernetes-based product like OpenShift Container Platform. For IBM Z® setups, the networking setup depends on the hypervisor of your choice. Depending on the workload and the application, the best fit usually changes with the use case and the traffic pattern.
Depending on your setup, consider these best practices:
- Consider all options regarding networking devices to optimize your traffic pattern. Explore the advantages of OSA-Express, RoCE Express, HiperSockets, z/VM VSwitch, Linux Bridge (KVM), and others to decide which option leads to the greatest benefit for your setup.
- Always use the latest available NIC version. For example, OSA Express 7S 10 GbE shows great improvement compared to OSA Express 6S 10 GbE with transactional workload types, although both are 10 GbE adapters.
- Each virtual switch adds an additional layer of latency.
- The load balancer plays an important role for network communication outside the cluster. Consider using a production-grade hardware load balancer if this is critical for your application.
- OpenShift Container Platform SDN introduces flows and rules, which impact the networking performance. Make sure to consider pod affinities and placements, to benefit from the locality of services where communication is critical.
- Balance the trade-off between performance and functionality.
6.5. Ensure high disk performance with HyperPAV on z/VM Copy linkLink copied to clipboard!
DASD and ECKD devices are commonly used disk types in IBM Z® environments. In a typical OpenShift Container Platform setup in z/VM environments, DASD disks are commonly used to support the local storage for the nodes. You can set up HyperPAV alias devices to provide more throughput and overall better I/O performance for the DASD disks that support the z/VM guests.
Using HyperPAV for the local storage devices leads to a significant performance benefit. However, you must be aware that there is a trade-off between throughput and CPU costs.
6.5.1. Use the Machine Config Operator (MCO) to activate HyperPAV aliases in nodes using z/VM full-pack minidisks Copy linkLink copied to clipboard!
For z/VM-based OpenShift Container Platform setups that use full-pack minidisks, you can leverage the advantage of MCO profiles by activating HyperPAV aliases in all of the nodes. You must add YAML configurations for both control plane and compute nodes.
Procedure
Copy the following MCO sample profile into a YAML file for the control plane node. For example, 05-master-kernelarg-hpav.yaml:
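The control plane sample profile is not preserved in this extract. A sketch might look like the following; the rd.dasd device IDs are placeholders that you must replace:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 05-master-kernelarg-hpav
spec:
  config:
    ignition:
      version: 3.1.0
  kernelArguments:
  - rd.dasd=800-805    # placeholder device IDs; adjust to your environment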
Copy the following MCO sample profile into a YAML file for the compute node. For example, 05-worker-kernelarg-hpav.yaml:
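The compute node sample profile is not preserved in this extract. A sketch might look like the following; the rd.dasd device IDs are placeholders that you must replace:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 05-worker-kernelarg-hpav
spec:
  config:
    ignition:
      version: 3.1.0
  kernelArguments:
  - rd.dasd=800-805    # placeholder device IDs; adjust to your environment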
Note: You must modify the rd.dasd arguments to fit the device IDs.

Create the MCO profiles:
$ oc create -f 05-master-kernelarg-hpav.yaml

$ oc create -f 05-worker-kernelarg-hpav.yaml

To deactivate, enter:

$ oc delete -f 05-master-kernelarg-hpav.yaml

$ oc delete -f 05-worker-kernelarg-hpav.yaml
6.6. RHEL KVM on IBM Z host recommendations Copy linkLink copied to clipboard!
Optimizing a KVM virtual server environment strongly depends on the workloads of the virtual servers and on the available resources. The same action that enhances performance in one environment can have adverse effects in another. Finding the best balance for a particular setting can be a challenge and often involves experimentation.
The following section introduces some best practices when using OpenShift Container Platform with RHEL KVM on IBM Z® and IBM® LinuxONE environments.
6.6.1. Use I/O threads for your virtual block devices Copy linkLink copied to clipboard!
To make virtual block devices use I/O threads, you must configure one or more I/O threads for the virtual server and each virtual block device to use one of these I/O threads.
The following example specifies <iothreads>3</iothreads> to configure three I/O threads, with consecutive decimal thread IDs 1, 2, and 3. The iothread="2" parameter specifies the driver element of the disk device to use the I/O thread with ID 2.
Sample I/O thread specification
Threads can increase the performance of I/O operations for disk devices, but they also use memory and CPU resources. You can configure multiple devices to use the same thread. The best mapping of threads to devices depends on the available resources and the workload.
Start with a small number of I/O threads. Often, a single I/O thread for all disk devices is sufficient. Do not configure more threads than the number of virtual CPUs, and do not configure idle threads.
You can use the virsh iothreadadd command to add I/O threads with specific thread IDs to a running virtual server.
6.6.2. Avoid virtual SCSI devices Copy linkLink copied to clipboard!
Configure virtual SCSI devices only if you need to address the device through SCSI-specific interfaces. Configure disk space as virtual block devices rather than virtual SCSI devices, regardless of the backing on the host.
However, you might need SCSI-specific interfaces for:
- A LUN for a SCSI-attached tape drive on the host.
- A DVD ISO file on the host file system that is mounted on a virtual DVD drive.
6.6.3. Configure guest caching for disk Copy linkLink copied to clipboard!
Configure your disk devices to do caching by the guest and not by the host.
Ensure that the driver element of the disk device includes the cache="none" and io="native" parameters.
<disk type="block" device="disk">
<driver name="qemu" type="raw" cache="none" io="native" iothread="1"/>
...
</disk>
6.6.4. Exclude the memory balloon device Copy linkLink copied to clipboard!
Unless you need a dynamic memory size, do not define a memory balloon device and ensure that libvirt does not create one for you. Include the memballoon parameter as a child of the devices element in your domain configuration XML file.
Example:

<memballoon model="none"/>
6.6.5. Tune the CPU migration algorithm of the host scheduler Copy linkLink copied to clipboard!
Do not change the scheduler settings unless you are an expert who understands the implications. Do not apply changes to production systems without testing them and confirming that they have the intended effect.
The kernel.sched_migration_cost_ns parameter specifies a time interval in nanoseconds. After the last execution of a task, the CPU cache is considered to have useful content until this interval expires. Increasing this interval results in fewer task migrations. The default value is 500000 ns.
If the CPU idle time is higher than expected when there are runnable processes, try reducing this interval. If tasks bounce between CPUs or nodes too often, try increasing it.
To dynamically set the interval to 60000 ns, enter the following command:
# sysctl kernel.sched_migration_cost_ns=60000
To persistently change the value to 60000 ns, add the following entry to /etc/sysctl.conf:
kernel.sched_migration_cost_ns=60000
6.6.6. Disable the cpuset cgroup controller Copy linkLink copied to clipboard!
This setting applies only to KVM hosts with cgroups version 1. To enable CPU hotplug on the host, disable the cpuset cgroup controller.
Procedure
- Open /etc/libvirt/qemu.conf with an editor of your choice.
- Go to the cgroup_controllers line.
- Duplicate the entire line and remove the leading number sign (#) from the copy.
- Remove the cpuset entry, as follows:

cgroup_controllers = [ "cpu", "devices", "memory", "blkio", "cpuacct" ]

For the new setting to take effect, you must restart the libvirtd daemon:
- Stop all virtual machines.
Run the following command:
# systemctl restart libvirtd

- Restart the virtual machines.
This setting persists across host reboots.
6.6.7. Tune the polling period for idle virtual CPUs Copy linkLink copied to clipboard!
When a virtual CPU becomes idle, KVM polls for wakeup conditions for the virtual CPU before allocating the host resource. You can specify the time interval, during which polling takes place in sysfs at /sys/module/kvm/parameters/halt_poll_ns. During the specified time, polling reduces the wakeup latency for the virtual CPU at the expense of resource usage. Depending on the workload, a longer or shorter time for polling can be beneficial. The time interval is specified in nanoseconds. The default is 50000 ns.
To optimize for low CPU consumption, enter a small value or write 0 to disable polling:
# echo 0 > /sys/module/kvm/parameters/halt_poll_ns
To optimize for low latency, for example for transactional workloads, enter a large value:
# echo 80000 > /sys/module/kvm/parameters/halt_poll_ns
Chapter 7. Using the Node Tuning Operator
Learn about the Node Tuning Operator and how you can use it to manage node-level tuning by orchestrating the tuned daemon.
7.1. About the Node Tuning Operator
The Node Tuning Operator helps you manage node-level tuning by orchestrating the TuneD daemon and achieves low latency performance by using the Performance Profile controller. The majority of high-performance applications require some level of kernel tuning. The Node Tuning Operator provides a unified management interface to users of node-level sysctls and more flexibility to add custom tuning specified by user needs.
The Operator manages the containerized TuneD daemon for OpenShift Container Platform as a Kubernetes daemon set. It ensures the custom tuning specification is passed to all containerized TuneD daemons running in the cluster in the format that the daemons understand. The daemons run on all nodes in the cluster, one per node.
Node-level settings applied by the containerized TuneD daemon are rolled back on an event that triggers a profile change or when the containerized TuneD daemon is terminated gracefully by receiving and handling a termination signal.
The Node Tuning Operator uses the Performance Profile controller to implement automatic tuning to achieve low latency performance for OpenShift Container Platform applications.
The cluster administrator configures a performance profile to define node-level settings such as the following:
- Updating the kernel to kernel-rt.
- Choosing CPUs for housekeeping.
- Choosing CPUs for running workloads.
Currently, disabling CPU load balancing is not supported by cgroup v2. As a result, you might not get the desired behavior from performance profiles if you have cgroup v2 enabled. Enabling cgroup v2 is not recommended if you are using performance profiles.
The Node Tuning Operator is part of a standard OpenShift Container Platform installation in version 4.1 and later.
In earlier versions of OpenShift Container Platform, the Performance Addon Operator was used to implement automatic tuning to achieve low latency performance for OpenShift applications. In OpenShift Container Platform 4.11 and later, this functionality is part of the Node Tuning Operator.
7.2. Accessing an example Node Tuning Operator specification
Use this process to access an example Node Tuning Operator specification.
Procedure
Run the following command to access an example Node Tuning Operator specification:
$ oc get tuned.tuned.openshift.io/default -o yaml -n openshift-cluster-node-tuning-operator
The default CR is meant for delivering standard node-level tuning for OpenShift Container Platform and it can only be modified to set the Operator Management state. Any other custom changes to the default CR will be overwritten by the Operator. For custom tuning, create your own Tuned CRs. Newly created CRs are combined with the default CR, and custom tuning is applied to OpenShift Container Platform nodes based on node or pod labels and profile priorities.
While in certain situations the support for pod labels can be a convenient way of automatically delivering required tuning, this practice is discouraged, especially in large-scale clusters. The default Tuned CR ships without pod label matching. If a custom profile is created with pod label matching, the functionality is enabled at that time. The pod label functionality will be deprecated in future versions of the Node Tuning Operator.
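For orientation, the following is a minimal hedged sketch of a custom Tuned CR; it is not taken from this document, and the name custom-example, the node label example/custom-tuning, and the vm.swappiness value are hypothetical placeholders:
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: custom-example
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - name: custom-example
    data: |
      [main]
      summary=Example custom profile that inherits openshift-node
      include=openshift-node
      [sysctl]
      vm.swappiness=10
  recommend:
  - match:
    - label: example/custom-tuning
    priority: 20
    profile: custom-example
The recommend entry ties the profile to nodes carrying the hypothetical label; labels and priorities in a real CR should follow the guidance in the sections below.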
7.3. Default profiles set on a cluster
The following are the default profiles set on a cluster.
Starting with OpenShift Container Platform 4.9, all OpenShift TuneD profiles are shipped with the TuneD package. You can use the oc exec command to view the contents of these profiles:
$ oc exec $tuned_pod -n openshift-cluster-node-tuning-operator -- find /usr/lib/tuned/openshift{,-control-plane,-node} -name tuned.conf -exec grep -H ^ {} \;
7.4. Verifying that the TuneD profiles are applied
Verify the TuneD profiles that are applied to your cluster node.
$ oc get profile.tuned.openshift.io -n openshift-cluster-node-tuning-operator
Example output
- NAME: Name of the Profile object. There is one Profile object per node and their names match.
- TUNED: Name of the desired TuneD profile to apply.
- APPLIED: True if the TuneD daemon applied the desired profile (True/False/Unknown).
- DEGRADED: True if any errors were reported during application of the TuneD profile (True/False/Unknown).
- AGE: Time elapsed since the creation of the Profile object.
The ClusterOperator/node-tuning object also contains useful information about the Operator and its node agents' health. For example, Operator misconfiguration is reported by ClusterOperator/node-tuning status messages.
To get status information about the ClusterOperator/node-tuning object, run the following command:
$ oc get co/node-tuning -n openshift-cluster-node-tuning-operator
Example output
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
node-tuning 4.14.1 True False True 60m 1/5 Profiles with bootcmdline conflict
If either the ClusterOperator/node-tuning or a profile object’s status is DEGRADED, additional information is provided in the Operator or operand logs.
7.5. Custom tuning specification
The custom resource (CR) for the Operator has two major sections. The first section, profile:, is a list of TuneD profiles and their names. The second, recommend:, defines the profile selection logic.
Multiple custom tuning specifications can co-exist as multiple CRs in the Operator’s namespace. The existence of new CRs or the deletion of old CRs is detected by the Operator. All existing custom tuning specifications are merged and appropriate objects for the containerized TuneD daemons are updated.
Management state
The Operator Management state is set by adjusting the default Tuned CR. By default, the Operator is in the Managed state and the spec.managementState field is not present in the default Tuned CR. Valid values for the Operator Management state are as follows:
- Managed: the Operator will update its operands as configuration resources are updated
- Unmanaged: the Operator will ignore changes to the configuration resources
- Removed: the Operator will remove its operands and resources the Operator provisioned
Profile data
The profile: section lists TuneD profiles and their names.
Recommended profiles
The profile: selection logic is defined by the recommend: section of the CR. The recommend: section is a list of items that recommend profiles based on selection criteria.
recommend:
<recommend-item-1>
# ...
<recommend-item-n>
The individual items of the list:
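The original YAML snippet for a recommend: item is not reproduced in this extract. The following schematic sketch uses the document's angle-bracket placeholders and numbers each field to match the callouts below; treat it as an illustration of the field layout rather than the exact original example:
recommend:
- machineConfigLabels:          # 1
    <mcLabels>                  # 2
  match:                        # 3
    <match>                     # 4
  priority: <priority>          # 5
  profile: <tuned_profile_name> # 6
  operand:                      # 7
    debug: false                # 8
    tunedConfig:
      reapply_sysctl: true      # 9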
1. Optional.
2. A dictionary of key/value MachineConfig labels. The keys must be unique.
3. If omitted, profile match is assumed unless a profile with a higher priority matches first or machineConfigLabels is set.
4. An optional list.
5. Profile ordering priority. Lower numbers mean higher priority (0 is the highest priority).
6. A TuneD profile to apply on a match. For example tuned_profile_1.
7. Optional operand configuration.
8. Turn debugging on or off for the TuneD daemon. Options are true for on or false for off. The default is false.
9. Turn reapply_sysctl functionality on or off for the TuneD daemon. Options are true for on and false for off.
<match> is an optional list recursively defined as follows:
- label: <label_name>
  value: <label_value>
  type: <label_type>
  <match>
If <match> is not omitted, all nested <match> sections must also evaluate to true. Otherwise, false is assumed and the profile with the respective <match> section is not applied or recommended. Therefore, the nesting (child <match> sections) works as a logical AND operator. Conversely, if any item of the <match> list matches, the entire <match> list evaluates to true. Therefore, the list acts as a logical OR operator.
If machineConfigLabels is defined, machine config pool based matching is turned on for the given recommend: list item. <mcLabels> specifies the labels for a machine config. The machine config is created automatically to apply host settings, such as kernel boot parameters, for the profile <tuned_profile_name>. This involves finding all machine config pools with machine config selector matching <mcLabels> and setting the profile <tuned_profile_name> on all nodes that are assigned the found machine config pools. To target nodes that have both master and worker roles, you must use the master role.
The list items match and machineConfigLabels are connected by the logical OR operator. The match item is evaluated first in a short-circuit manner. Therefore, if it evaluates to true, the machineConfigLabels item is not considered.
When using machine config pool based matching, it is advised to group nodes with the same hardware configuration into the same machine config pool. Not following this practice might result in TuneD operands calculating conflicting kernel parameters for two or more nodes sharing the same machine config pool.
Example: Node or pod label based matching
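The example CR itself is not included in this extract. The following is a hedged reconstruction assembled from the description that follows: the profile names, labels, and priorities are taken from that description, while the metadata name and profile data are illustrative:
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-node-es
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Optimize systems running Elasticsearch on the control plane
      include=openshift-control-plane
    name: openshift-control-plane-es
  recommend:
  - match:
    - label: tuned.openshift.io/elasticsearch
      type: pod
      match:
      - label: node-role.kubernetes.io/master
      - label: node-role.kubernetes.io/infra
    priority: 10
    profile: openshift-control-plane-es
  - match:
    - label: node-role.kubernetes.io/master
    - label: node-role.kubernetes.io/infra
    priority: 20
    profile: openshift-control-plane
  - priority: 30
    profile: openshift-node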
The CR above is translated for the containerized TuneD daemon into its recommend.conf file based on the profile priorities. The profile with the highest priority (10) is openshift-control-plane-es and, therefore, it is considered first. The containerized TuneD daemon running on a given node looks to see if there is a pod running on the same node with the tuned.openshift.io/elasticsearch label set. If not, the entire <match> section evaluates as false. If there is such a pod with the label, in order for the <match> section to evaluate to true, the node label also needs to be node-role.kubernetes.io/master or node-role.kubernetes.io/infra.
If the labels for the profile with priority 10 matched, openshift-control-plane-es profile is applied and no other profile is considered. If the node/pod label combination did not match, the second highest priority profile (openshift-control-plane) is considered. This profile is applied if the containerized TuneD pod runs on a node with labels node-role.kubernetes.io/master or node-role.kubernetes.io/infra.
Finally, the profile openshift-node has the lowest priority of 30. It lacks the <match> section and, therefore, will always match. It acts as a profile catch-all to set openshift-node profile, if no other profile with higher priority matches on a given node.
Example: Machine config pool based matching
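The example CR is likewise not included in this extract. The following hedged sketch illustrates machine config pool based matching, assuming a hypothetical custom machine config pool whose machine config selector matches the worker-custom role label; the kernel parameter is illustrative:
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-node-custom
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Custom OpenShift node profile with an additional kernel parameter
      include=openshift-node
      [bootloader]
      cmdline_openshift_node_custom=+skew_tick=1
    name: openshift-node-custom
  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: "worker-custom"
    priority: 20
    profile: openshift-node-custom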
To minimize node reboots, label the target nodes with a label the machine config pool’s node selector will match, then create the Tuned CR above and finally create the custom machine config pool itself.
Cloud provider-specific TuneD profiles
With this functionality, all Cloud provider-specific nodes can conveniently be assigned a TuneD profile specifically tailored to a given Cloud provider on an OpenShift Container Platform cluster. This can be accomplished without adding additional node labels or grouping nodes into machine config pools.
This functionality takes advantage of spec.providerID node object values in the form of <cloud-provider>://<cloud-provider-specific-id> and writes the file /var/lib/tuned/provider with the value <cloud-provider> in NTO operand containers. The content of this file is then used by TuneD to load provider-<cloud-provider> profile if such profile exists.
The openshift profile that both openshift-control-plane and openshift-node profiles inherit settings from is now updated to use this functionality through the use of conditional profile loading. Neither NTO nor TuneD currently include any Cloud provider-specific profiles. However, it is possible to create a custom profile provider-<cloud-provider> that will be applied to all Cloud provider-specific cluster nodes.
Example GCE Cloud provider profile
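The example itself is not reproduced in this extract. A hedged sketch of what a provider-gce profile could look like, following the naming convention described above; the profile body is a placeholder:
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: provider-gce
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=GCE Cloud provider-specific profile
      # Your tuning for GCE Cloud provider goes here.
    name: provider-gce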
Due to profile inheritance, any setting specified in the provider-<cloud-provider> profile will be overwritten by the openshift profile and its child profiles.
7.6. Custom tuning examples
Using TuneD profiles from the default CR
The following CR applies custom node-level tuning for OpenShift Container Platform nodes with label tuned.openshift.io/ingress-node-label set to any value.
Example: custom tuning using the openshift-control-plane TuneD profile
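The example CR is not reproduced in this extract. The following hedged sketch shows one plausible shape, assuming a hypothetical profile name openshift-ingress and illustrative sysctl values; it uses the tuned.openshift.io/ingress-node-label label and the openshift-control-plane include described above:
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: ingress
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=A custom OpenShift ingress profile
      include=openshift-control-plane
      [sysctl]
      net.ipv4.ip_local_port_range="1024 65535"
      net.ipv4.tcp_tw_reuse=1
    name: openshift-ingress
  recommend:
  - match:
    - label: tuned.openshift.io/ingress-node-label
    priority: 10
    profile: openshift-ingress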
Custom profile writers are strongly encouraged to include the default TuneD daemon profiles shipped within the default Tuned CR. The example above uses the default openshift-control-plane profile to accomplish this.
Using built-in TuneD profiles
Given the successful rollout of the NTO-managed daemon set, the TuneD operands all manage the same version of the TuneD daemon. To list the built-in TuneD profiles supported by the daemon, query any TuneD pod in the following way:
$ oc exec $tuned_pod -n openshift-cluster-node-tuning-operator -- find /usr/lib/tuned/ -name tuned.conf -printf '%h\n' | sed 's|^.*/||'
You can use the profile names retrieved by this command in your custom tuning specification.
Example: using built-in hpc-compute TuneD profile
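A hedged sketch of such a CR, assuming a hypothetical profile name openshift-node-hpc-compute and an illustrative node label:
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-node-hpc-compute
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Custom OpenShift node profile for HPC compute workloads
      include=openshift-node,hpc-compute
    name: openshift-node-hpc-compute
  recommend:
  - match:
    - label: tuned.openshift.io/openshift-node-hpc-compute
    priority: 20
    profile: openshift-node-hpc-compute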
In addition to the built-in hpc-compute profile, the example above includes the openshift-node TuneD daemon profile shipped within the default Tuned CR to use OpenShift-specific tuning for compute nodes.
Overriding host-level sysctls
Various kernel parameters can be changed at runtime by using /run/sysctl.d/, /etc/sysctl.d/, and /etc/sysctl.conf host configuration files. OpenShift Container Platform adds several host configuration files which set kernel parameters at runtime; for example, net.ipv[4-6]., fs.inotify., and vm.max_map_count. These runtime parameters provide basic functional tuning for the system prior to the kubelet and the Operator start.
The Operator does not override these settings unless the reapply_sysctl option is set to false. Setting this option to false results in TuneD not applying the settings from the host configuration files after it applies its custom profile.
Example: overriding host-level sysctls
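A hedged sketch of such a CR, assuming a hypothetical profile name and node label; the vm.max_map_count value is illustrative, and reapply_sysctl is turned off in the operand configuration as described above so the host configuration files are not reapplied after the custom profile:
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-no-reapply-sysctl
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Custom profile that overrides a host-level sysctl
      include=openshift-node
      [sysctl]
      vm.max_map_count=524288
    name: openshift-no-reapply-sysctl
  recommend:
  - match:
    - label: tuned.openshift.io/openshift-no-reapply-sysctl
    priority: 15
    profile: openshift-no-reapply-sysctl
    operand:
      tunedConfig:
        reapply_sysctl: false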
7.7. Supported TuneD daemon plugins
Excluding the [main] section, the following TuneD plugins are supported when using custom profiles defined in the profile: section of the Tuned CR:
- audio
- cpu
- disk
- eeepc_she
- modules
- mounts
- net
- scheduler
- scsi_host
- selinux
- sysctl
- sysfs
- usb
- video
- vm
- bootloader
Some of these plugins provide dynamic tuning functionality that is not supported. The following TuneD plugins are currently not supported:
- script
- systemd
The TuneD bootloader plugin only supports Red Hat Enterprise Linux CoreOS (RHCOS) worker nodes.
Additional resources
7.8. Configuring node tuning in a hosted cluster
To set node-level tuning on the nodes in your hosted cluster, you can use the Node Tuning Operator. In hosted control planes, you can configure node tuning by creating config maps that contain Tuned objects and referencing those config maps in your node pools.
Procedure
- Create a config map that contains a valid tuned manifest, and reference the manifest in a node pool. In the following example, a Tuned manifest defines a profile that sets vm.dirty_ratio to 55 on nodes that contain the tuned-1-node-label node label with any value. Save the ConfigMap manifest in a file named tuned-1.yaml (a hedged sketch of this manifest appears after this procedure).
  Note: If you do not add any labels to an entry in the spec.recommend section of the Tuned spec, node-pool-based matching is assumed, so the highest priority profile in the spec.recommend section is applied to nodes in the pool. Although you can achieve more fine-grained node-label-based matching by setting a label value in the Tuned .spec.recommend.match section, node labels will not persist during an upgrade unless you set the .spec.management.upgradeType value of the node pool to InPlace.
- Create the ConfigMap object in the management cluster:
  $ oc --kubeconfig="$MGMT_KUBECONFIG" create -f tuned-1.yaml
- Reference the ConfigMap object in the spec.tuningConfig field of the node pool, either by editing a node pool or creating one. In this example, assume that you have only one NodePool, named nodepool-1, which contains 2 nodes.
  Note: You can reference the same config map in multiple node pools. In hosted control planes, the Node Tuning Operator appends a hash of the node pool name and namespace to the name of the Tuned CRs to distinguish them. Outside of this case, do not create multiple TuneD profiles of the same name in different Tuned CRs for the same hosted cluster.
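The tuned-1.yaml manifest referenced in the first step is not reproduced in this extract. The following is a hedged sketch; the clusters namespace and the profile name tuned-1-profile are assumptions, and only the vm.dirty_ratio value, the node label, and the ConfigMap name come from the surrounding text:
apiVersion: v1
kind: ConfigMap
metadata:
  name: tuned-1
  namespace: clusters
data:
  tuning: |
    apiVersion: tuned.openshift.io/v1
    kind: Tuned
    metadata:
      name: tuned-1
      namespace: openshift-cluster-node-tuning-operator
    spec:
      profile:
      - data: |
          [main]
          summary=Custom OpenShift profile
          include=openshift-node
          [sysctl]
          vm.dirty_ratio="55"
        name: tuned-1-profile
      recommend:
      - match:
        - label: tuned-1-node-label
        priority: 20
        profile: tuned-1-profile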
Verification
Now that you have created the ConfigMap object that contains a Tuned manifest and referenced it in a NodePool, the Node Tuning Operator syncs the Tuned objects into the hosted cluster. You can verify which Tuned objects are defined and which TuneD profiles are applied to each node.
- List the Tuned objects in the hosted cluster:
  $ oc --kubeconfig="$HC_KUBECONFIG" get tuned.tuned.openshift.io -n openshift-cluster-node-tuning-operator
  Example output
  NAME       AGE
  default    7m36s
  rendered   7m36s
  tuned-1    65s
- List the Profile objects in the hosted cluster:
  $ oc --kubeconfig="$HC_KUBECONFIG" get profile.tuned.openshift.io -n openshift-cluster-node-tuning-operator
  Example output
  NAME                  TUNED             APPLIED   DEGRADED   AGE
  nodepool-1-worker-1   tuned-1-profile   True      False      7m43s
  nodepool-1-worker-2   tuned-1-profile   True      False      7m14s
  Note: If no custom profiles are created, the openshift-node profile is applied by default.
- To confirm that the tuning was applied correctly, start a debug shell on a node and check the sysctl values:
  $ oc --kubeconfig="$HC_KUBECONFIG" debug node/nodepool-1-worker-1 -- chroot /host sysctl vm.dirty_ratio
  Example output
  vm.dirty_ratio = 55
7.9. Advanced node tuning for hosted clusters by setting kernel boot parameters
For more advanced tuning in hosted control planes, which requires setting kernel boot parameters, you can also use the Node Tuning Operator. The following example shows how you can create a node pool with huge pages reserved.
Procedure
- Create a ConfigMap object that contains a Tuned object manifest for creating 10 huge pages that are 2 MB in size. Save this ConfigMap manifest in a file named tuned-hugepages.yaml (a hedged sketch of this manifest appears after this procedure).
  Note: The .spec.recommend.match field is intentionally left blank. In this case, this Tuned object is applied to all nodes in the node pool where this ConfigMap object is referenced. Group nodes with the same hardware configuration into the same node pool. Otherwise, TuneD operands can calculate conflicting kernel parameters for two or more nodes that share the same node pool.
- Create the ConfigMap object in the management cluster:
  $ oc --kubeconfig="<management_cluster_kubeconfig>" create -f tuned-hugepages.yaml
  Replace <management_cluster_kubeconfig> with the name of your management cluster kubeconfig file.
- Create a NodePool manifest YAML file, customize the upgrade type of the NodePool, and reference the ConfigMap object that you created in the spec.tuningConfig section. Create the NodePool manifest and save it in a file named hugepages-nodepool.yaml by using the hcp CLI.
  Note: The --render flag in the hcp create command does not render the secrets. To render the secrets, you must use both the --render and the --render-sensitive flags in the hcp create command.
- In the hugepages-nodepool.yaml file, set .spec.management.upgradeType to InPlace, and set .spec.tuningConfig to reference the tuned-hugepages ConfigMap object that you created.
  Note: To avoid the unnecessary re-creation of nodes when you apply the new MachineConfig objects, set .spec.management.upgradeType to InPlace. If you use the Replace upgrade type, nodes are fully deleted and new nodes can replace them when you apply the new kernel boot parameters that the TuneD operand calculated.
- Create the NodePool in the management cluster:
  $ oc --kubeconfig="<management_cluster_kubeconfig>" create -f hugepages-nodepool.yaml
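The tuned-hugepages.yaml manifest referenced in the first step is not reproduced in this extract. The following hedged sketch assumes the clusters namespace and the profile name openshift-node-hugepages; the huge page size and count mirror the kernel boot parameters shown in the verification output later in this section, and the match field is intentionally left blank as described in the note above:
apiVersion: v1
kind: ConfigMap
metadata:
  name: tuned-hugepages
  namespace: clusters
data:
  tuning: |
    apiVersion: tuned.openshift.io/v1
    kind: Tuned
    metadata:
      name: hugepages
      namespace: openshift-cluster-node-tuning-operator
    spec:
      profile:
      - data: |
          [main]
          summary=Boot time configuration for hugepages
          include=openshift-node
          [bootloader]
          cmdline_openshift_node_hugepages=hugepagesz=2M hugepages=50
        name: openshift-node-hugepages
      recommend:
      - priority: 20
        profile: openshift-node-hugepages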
Verification
After the nodes are available, the containerized TuneD daemon calculates the required kernel boot parameters based on the applied TuneD profile. After the nodes are ready and reboot once to apply the generated MachineConfig object, you can verify that the TuneD profile is applied and that the kernel boot parameters are set.
- List the Tuned objects in the hosted cluster:
  $ oc --kubeconfig="<hosted_cluster_kubeconfig>" get tuned.tuned.openshift.io -n openshift-cluster-node-tuning-operator
  Example output
  NAME                 AGE
  default              123m
  hugepages-8dfb1fed   1m23s
  rendered             123m
- List the Profile objects in the hosted cluster:
  $ oc --kubeconfig="<hosted_cluster_kubeconfig>" get profile.tuned.openshift.io -n openshift-cluster-node-tuning-operator
  Example output
  NAME                          TUNED                      APPLIED   DEGRADED   AGE
  nodepool-1-worker-1           openshift-node             True      False      132m
  nodepool-1-worker-2           openshift-node             True      False      131m
  hugepages-nodepool-worker-1   openshift-node-hugepages   True      False      4m8s
  hugepages-nodepool-worker-2   openshift-node-hugepages   True      False      3m57s
  Both of the worker nodes in the new NodePool have the openshift-node-hugepages profile applied.
- To confirm that the tuning was applied correctly, start a debug shell on a node and check /proc/cmdline:
  $ oc --kubeconfig="<hosted_cluster_kubeconfig>" debug node/nodepool-1-worker-1 -- chroot /host cat /proc/cmdline
  Example output
  BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-... hugepagesz=2M hugepages=50
Chapter 8. Using CPU Manager and Topology Manager
CPU Manager manages groups of CPUs and constrains workloads to specific CPUs.
CPU Manager is useful for workloads that have some of these attributes:
- Require as much CPU time as possible.
- Are sensitive to processor cache misses.
- Are low-latency network applications.
- Coordinate with other processes and benefit from sharing a single processor cache.
Topology Manager collects hints from the CPU Manager, Device Manager, and other Hint Providers to align pod resources, such as CPU, SR-IOV VFs, and other device resources, for all Quality of Service (QoS) classes on the same non-uniform memory access (NUMA) node.
Topology Manager uses topology information from the collected hints to decide if a pod can be accepted or rejected on a node, based on the configured Topology Manager policy and pod resources requested.
Topology Manager is useful for workloads that use hardware accelerators to support latency-critical execution and high throughput parallel computation.
To use Topology Manager you must configure CPU Manager with the static policy.
8.1. Setting up CPU Manager
Procedure
- Optional: Label a node:
  # oc label node perf-node.example.com cpumanager=true
- Edit the MachineConfigPool of the nodes where CPU Manager should be enabled. In this example, all workers have CPU Manager enabled:
  # oc edit machineconfigpool worker
- Add a label to the worker machine config pool:
  metadata:
    creationTimestamp: 2020-xx-xxx
    generation: 3
    labels:
      custom-kubelet: cpumanager-enabled
- Create a KubeletConfig custom resource (CR), cpumanager-kubeletconfig.yaml. Refer to the label created in the previous step to have the correct nodes updated with the new kubelet config. See the machineConfigPoolSelector section; a hedged sketch of such a CR appears after this procedure.
  1. Specify a policy:
     - none. This policy explicitly enables the existing default CPU affinity scheme, providing no affinity beyond what the scheduler does automatically. This is the default policy.
     - static. This policy allows containers in guaranteed pods with integer CPU requests. It also limits access to exclusive CPUs on the node. If static, you must use a lowercase s.
  2. Optional: Specify the CPU Manager reconcile frequency. The default is 5s.
- Create the dynamic kubelet config:
  # oc create -f cpumanager-kubeletconfig.yaml
  This adds the CPU Manager feature to the kubelet config and, if needed, the Machine Config Operator (MCO) reboots the node. To enable CPU Manager, a reboot is not needed.
- Check for the merged kubelet config:
  # oc get machineconfig 99-worker-XXXXXX-XXXXX-XXXX-XXXXX-kubelet -o json | grep ownerReference -A7
- Check the worker for the updated kubelet.conf:
  # oc debug node/perf-node.example.com
  sh-4.2# cat /host/etc/kubernetes/kubelet.conf | grep cpuManager
  Example output
  cpuManagerPolicy: static
  cpuManagerReconcilePeriod: 5s
- Create a pod that requests a core or multiple cores. Both limits and requests must have their CPU value set to a whole integer. That is the number of cores that will be dedicated to this pod. An illustrative pod spec also appears after this procedure.
  # cat cpumanager-pod.yaml
- Create the pod:
  # oc create -f cpumanager-pod.yaml
- Verify that the pod is scheduled to the node that you labeled:
  # oc describe pod cpumanager
- Verify that the cgroups are set up correctly. Get the process ID (PID) of the pause process.
  Pods of quality of service (QoS) tier Guaranteed are placed within the kubepods.slice. Pods of other QoS tiers end up in child cgroups of kubepods:
  # cd /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-pod69c01f8e_6b74_11e9_ac0f_0a2b62178a22.slice/crio-b5437308f1ad1a7db0574c542bdf08563b865c0345c86e9585f8c0b0a655612c.scope
  # for i in `ls cpuset.cpus tasks` ; do echo -n "$i "; cat $i ; done
  Example output
  cpuset.cpus 1
  tasks 32706
- Check the allowed CPU list for the task:
  # grep ^Cpus_allowed_list /proc/32706/status
  Example output
  Cpus_allowed_list: 1
- Verify that another pod (in this case, the pod in the burstable QoS tier) on the system cannot run on the core allocated for the Guaranteed pod:
  # cat /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podc494a073_6b77_11e9_98c0_06bba5c387ea.slice/crio-c56982f57b75a2420947f0afc6cafe7534c5734efc34157525fa9abbf99e3849.scope/cpuset.cpus
  0
  # oc describe node perf-node.example.com
  This VM has two CPU cores. The system-reserved setting reserves 500 millicores, meaning that half of one core is subtracted from the total capacity of the node to arrive at the Node Allocatable amount. You can see that Allocatable CPU is 1500 millicores. This means you can run one of the CPU Manager pods since each will take one whole core. A whole core is equivalent to 1000 millicores. If you try to schedule a second pod, the system will accept the pod, but it will never be scheduled:
  NAME               READY   STATUS    RESTARTS   AGE
  cpumanager-6cqz7   1/1     Running   0          33m
  cpumanager-7qc2t   0/1     Pending   0          11s
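The cpumanager-kubeletconfig.yaml CR referenced in this procedure is not reproduced in this extract. A minimal hedged sketch, assuming the custom-kubelet: cpumanager-enabled label that was added to the worker machine config pool earlier:
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: cpumanager-enabled
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: cpumanager-enabled
  kubeletConfig:
    cpuManagerPolicy: static
    cpuManagerReconcilePeriod: 5s
Similarly, a hedged sketch of a cpumanager-pod.yaml pod that requests one whole core; the container image is a placeholder, and the nodeSelector matches the cpumanager=true label applied in the first step:
apiVersion: v1
kind: Pod
metadata:
  name: cpumanager
spec:
  nodeSelector:
    cpumanager: "true"
  containers:
  - name: cpumanager
    image: <stress_or_pause_image>   # placeholder image
    resources:
      requests:
        cpu: 1
        memory: "1G"
      limits:
        cpu: 1
        memory: "1G"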
8.2. Topology Manager policies
Topology Manager aligns Pod resources of all Quality of Service (QoS) classes by collecting topology hints from Hint Providers, such as CPU Manager and Device Manager, and using the collected hints to align the Pod resources.
Topology Manager supports four allocation policies, which you assign in the KubeletConfig custom resource (CR) named cpumanager-enabled:
- none policy: This is the default policy and does not perform any topology alignment.
- best-effort policy: For each container in a pod with the best-effort topology management policy, kubelet tries to align all the required resources on a NUMA node according to the preferred NUMA node affinity for that container. Even if the allocation is not possible due to insufficient resources, the Topology Manager still admits the pod, but the allocation is shared with other NUMA nodes.
- restricted policy: For each container in a pod with the restricted topology management policy, kubelet determines the theoretical minimum number of NUMA nodes that can fulfill the request. If the actual allocation requires more than that number of NUMA nodes, the Topology Manager rejects the admission, placing the pod in a Terminated state. If the number of NUMA nodes can fulfill the request, the Topology Manager admits the pod and the pod starts running.
- single-numa-node policy: For each container in a pod with the single-numa-node topology management policy, kubelet admits the pod if all the resources required by the pod can be allocated on the same NUMA node. If a single NUMA node affinity is not possible, the Topology Manager rejects the pod from the node. This results in a pod in a Terminated state with a pod admission failure.
8.3. Setting up Topology Manager
To use Topology Manager, you must configure an allocation policy in the KubeletConfig custom resource (CR) named cpumanager-enabled. This file might exist if you have set up CPU Manager. If the file does not exist, you can create the file.
Prerequisites
- Configure the CPU Manager policy to be static.
Procedure
To activate Topology Manager:
Configure the Topology Manager allocation policy in the custom resource.
$ oc edit KubeletConfig cpumanager-enabled
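A hedged sketch of the edited KubeletConfig, assuming the cpumanager-enabled CR created during CPU Manager setup and using the single-numa-node policy described in the previous section; choose the policy value that matches your workload requirements:
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: cpumanager-enabled
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: cpumanager-enabled
  kubeletConfig:
    cpuManagerPolicy: static
    cpuManagerReconcilePeriod: 5s
    topologyManagerPolicy: single-numa-node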
8.4. Pod interactions with Topology Manager policies
The example Pod specs illustrate pod interactions with Topology Manager.
The following pod runs in the BestEffort QoS class because no resource requests or limits are specified.
spec:
containers:
- name: nginx
image: nginx
The next pod runs in the Burstable QoS class because requests are less than limits.
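The Burstable example spec is not shown in this extract; a minimal sketch with illustrative request and limit values, where requests are lower than limits:
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        memory: "100Mi"
        cpu: "250m"
      limits:
        memory: "200Mi"
        cpu: "500m"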
If the selected policy is anything other than none, Topology Manager processes all pods, but it enforces resource alignment only for pods in the Guaranteed QoS class. When the Topology Manager policy is set to none, the relevant containers are pinned to any available CPU without considering NUMA affinity. This is the default behavior and it does not optimize for performance-sensitive workloads. Other values enable the use of topology awareness information from device plugins and core resources, such as CPU and memory. The Topology Manager attempts to align the CPU, memory, and device allocations according to the topology of the node when the policy is set to a value other than none. For more information about the available values, see Topology Manager policies.
The following example pod runs in the Guaranteed QoS class because requests are equal to limits.
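The Guaranteed example spec is likewise not shown; a minimal sketch with illustrative values, where requests equal limits and the CPU request is a whole integer:
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        memory: "200Mi"
        cpu: "2"
      limits:
        memory: "200Mi"
        cpu: "2"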
Topology Manager would consider this pod. The Topology Manager would consult the Hint Providers, which are the CPU Manager, the Device Manager, and the Memory Manager, to get topology hints for the pod.
Topology Manager will use this information to store the best topology for this container. In the case of this pod, CPU Manager and Device Manager will use this stored information at the resource allocation stage.
Chapter 9. Scheduling NUMA-aware workloads
Learn about NUMA-aware scheduling and how you can use it to deploy high performance workloads in an OpenShift Container Platform cluster.
The NUMA Resources Operator allows you to schedule high-performance workloads in the same NUMA zone. It deploys a node resources exporting agent that reports on available cluster node NUMA resources, and a secondary scheduler that manages the workloads.
9.1. About NUMA
Non-uniform memory access (NUMA) architecture is a multiprocessor architecture model where CPUs do not access all memory in all locations at the same speed. Instead, CPUs can gain faster access to memory that is in closer proximity to them, or local to them, but slower access to memory that is further away.
A CPU with multiple memory controllers can use any available memory across CPU complexes, regardless of where the memory is located. However, this increased flexibility comes at the expense of performance.
NUMA resource topology refers to the physical locations of CPUs, memory, and PCI devices relative to each other in a NUMA zone. In a NUMA architecture, a NUMA zone is a group of CPUs that has its own processors and memory. Colocated resources are said to be in the same NUMA zone, and CPUs in a zone have faster access to the same local memory than CPUs outside of that zone. A CPU processing a workload using memory that is outside its NUMA zone is slower than a workload processed in a single NUMA zone. For I/O-constrained workloads, the network interface on a distant NUMA zone slows down how quickly information can reach the application.
Applications can achieve better performance by containing data and processing within the same NUMA zone. For high-performance workloads and applications, such as telecommunications workloads, the cluster must process pod workloads in a single NUMA zone so that the workload can operate to specification.
9.2. About NUMA-aware scheduling
NUMA-aware scheduling aligns the requested cluster compute resources (CPUs, memory, devices) in the same NUMA zone to process latency-sensitive or high-performance workloads efficiently. NUMA-aware scheduling also improves pod density per compute node for greater resource efficiency.
9.2.1. Integration with Node Tuning Operator
By integrating the Node Tuning Operator’s performance profile with NUMA-aware scheduling, you can further configure CPU affinity to optimize performance for latency-sensitive workloads.
9.2.2. Default scheduling logic
The default OpenShift Container Platform pod scheduler scheduling logic considers the available resources of the entire compute node, not individual NUMA zones. If the most restrictive resource alignment is requested in the kubelet topology manager, error conditions can occur when admitting the pod to a node. Conversely, if the most restrictive resource alignment is not requested, the pod can be admitted to the node without proper resource alignment, leading to worse or unpredictable performance. For example, runaway pod creation with Topology Affinity Error statuses can occur when the pod scheduler makes suboptimal scheduling decisions for guaranteed pod workloads without knowing if the pod’s requested resources are available. Scheduling mismatch decisions can cause indefinite pod startup delays. Also, depending on the cluster state and resource allocation, poor pod scheduling decisions can cause extra load on the cluster because of failed startup attempts.
9.2.3. NUMA-aware pod scheduling diagram
The NUMA Resources Operator deploys a custom NUMA resources secondary scheduler and other resources to mitigate against the shortcomings of the default OpenShift Container Platform pod scheduler. The following diagram provides a high-level overview of NUMA-aware pod scheduling.
Figure 9.1. NUMA-aware scheduling overview
- NodeResourceTopology API: The NodeResourceTopology API describes the available NUMA zone resources in each compute node.
- NUMA-aware scheduler: The NUMA-aware secondary scheduler receives information about the available NUMA zones from the NodeResourceTopology API and schedules high-performance workloads on a node where they can be optimally processed.
- Node topology exporter: The node topology exporter exposes the available NUMA zone resources for each compute node to the NodeResourceTopology API. The node topology exporter daemon tracks the resource allocation from the kubelet by using the PodResources API.
- PodResources API: The PodResources API is local to each node and exposes the resource topology and available resources to the kubelet.
  Note: The List endpoint of the PodResources API exposes exclusive CPUs allocated to a particular container. The API does not expose CPUs that belong to a shared pool. The GetAllocatableResources endpoint exposes allocatable resources available on a node.
Additional resources
- For more information about running secondary pod schedulers in your cluster and how to deploy pods with a secondary pod scheduler, see Scheduling pods using a secondary scheduler.
9.3. Installing the NUMA Resources Operator
The NUMA Resources Operator deploys resources that allow you to schedule NUMA-aware workloads and deployments. You can install the NUMA Resources Operator using the OpenShift Container Platform CLI or the web console.
9.3.1. Installing the NUMA Resources Operator using the CLI
As a cluster administrator, you can install the Operator using the CLI.
Prerequisites
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
Procedure
Create a namespace for the NUMA Resources Operator:
- Save the following YAML in the nro-namespace.yaml file:
  apiVersion: v1
  kind: Namespace
  metadata:
    name: openshift-numaresources
- Create the Namespace CR by running the following command:
  $ oc create -f nro-namespace.yaml
Create the Operator group for the NUMA Resources Operator:
- Save the following YAML in the nro-operatorgroup.yaml file:
- Create the OperatorGroup CR by running the following command:
  $ oc create -f nro-operatorgroup.yaml
Create the subscription for the NUMA Resources Operator:
- Save the following YAML in the nro-sub.yaml file:
- Create the Subscription CR by running the following command:
  $ oc create -f nro-sub.yaml
Verification
- Verify that the installation succeeded by inspecting the CSV resource in the openshift-numaresources namespace. Run the following command:
  $ oc get csv -n openshift-numaresources
  Example output
  NAME                             DISPLAY                  VERSION   REPLACES   PHASE
  numaresources-operator.v4.14.2   numaresources-operator   4.14.2               Succeeded
9.3.2. Installing the NUMA Resources Operator using the web console
As a cluster administrator, you can install the NUMA Resources Operator using the web console.
Procedure
Create a namespace for the NUMA Resources Operator:
- In the OpenShift Container Platform web console, click Administration → Namespaces.
- Click Create Namespace, enter openshift-numaresources in the Name field, and then click Create.
Install the NUMA Resources Operator:
- In the OpenShift Container Platform web console, click Operators → OperatorHub.
- Choose numaresources-operator from the list of available Operators, and then click Install.
- In the Installed Namespaces field, select the openshift-numaresources namespace, and then click Install.
Optional: Verify that the NUMA Resources Operator installed successfully:
- Switch to the Operators → Installed Operators page.
- Ensure that NUMA Resources Operator is listed in the openshift-numaresources namespace with a Status of InstallSucceeded.
  Note: During installation an Operator might display a Failed status. If the installation later succeeds with an InstallSucceeded message, you can ignore the Failed message.
If the Operator does not appear as installed, to troubleshoot further:
- Go to the Operators → Installed Operators page and inspect the Operator Subscriptions and Install Plans tabs for any failure or errors under Status.
- Go to the Workloads → Pods page and check the logs for pods in the default project.
9.4. Scheduling NUMA-aware workloads
Clusters running latency-sensitive workloads typically feature performance profiles that help to minimize workload latency and optimize performance. The NUMA-aware scheduler deploys workloads based on available node NUMA resources and with respect to any performance profile settings applied to the node. The combination of NUMA-aware deployments, and the performance profile of the workload, ensures that workloads are scheduled in a way that maximizes performance.
For the NUMA Resources Operator to be fully operational, you must deploy the NUMAResourcesOperator custom resource and the NUMA-aware secondary pod scheduler.
9.4.1. Creating the NUMAResourcesOperator custom resource
After you have installed the NUMA Resources Operator, create the NUMAResourcesOperator custom resource (CR) that instructs the NUMA Resources Operator to install all the cluster infrastructure needed to support the NUMA-aware scheduler, including daemon sets and APIs.
Prerequisites
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
- Install the NUMA Resources Operator.
Procedure
- Create the NUMAResourcesOperator custom resource:
  - Save the following minimal required YAML file example as nrop.yaml (a hedged sketch appears after this procedure):
    1. This must match the MachineConfigPool resource that you want to configure the NUMA Resources Operator on. For example, you might have created a MachineConfigPool resource named worker-cnf that designates a set of nodes expected to run telecommunications workloads. Each NodeGroup must match exactly one MachineConfigPool. Configurations where NodeGroup matches more than one MachineConfigPool are not supported.
  - Create the NUMAResourcesOperator CR by running the following command:
    $ oc create -f nrop.yaml
    Note: Creating the NUMAResourcesOperator triggers a reboot on the corresponding machine config pool and therefore the affected node.
- Optional: To enable NUMA-aware scheduling for multiple machine config pools (MCPs), define a separate NodeGroup for each pool. For example, define three NodeGroups for worker-cnf, worker-ht, and worker-other in the NUMAResourcesOperator CR as shown in the following example:
  Example YAML definition for a NUMAResourcesOperator CR with multiple NodeGroups
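The minimal nrop.yaml example is not reproduced in this extract. A hedged sketch, assuming the worker-cnf machine config pool mentioned in the callout; the API version and pool selector label follow common conventions and may differ between releases:
apiVersion: nodetopology.openshift.io/v1
kind: NUMAResourcesOperator
metadata:
  name: numaresourcesoperator
spec:
  nodeGroups:
  - machineConfigPoolSelector:
      matchLabels:
        pools.operator.machineconfiguration.openshift.io/worker-cnf: ""   # 1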
Verification
Verify that the NUMA Resources Operator deployed successfully by running the following command:
$ oc get numaresourcesoperators.nodetopology.openshift.io
Example output
NAME                    AGE
numaresourcesoperator   27s
After a few minutes, run the following command to verify that the required resources deployed successfully:
$ oc get all -n openshift-numaresources
Example output
NAME                                                     READY   STATUS    RESTARTS   AGE
pod/numaresources-controller-manager-7d9d84c58d-qk2mr    1/1     Running   0          12m
pod/numaresourcesoperator-worker-7d96r                   2/2     Running   0          97s
pod/numaresourcesoperator-worker-crsht                   2/2     Running   0          97s
pod/numaresourcesoperator-worker-jp9mw                   2/2     Running   0          97s
9.4.2. Deploying the NUMA-aware secondary pod scheduler
After installing the NUMA Resources Operator, deploy the NUMA-aware secondary pod scheduler to optimize pod placement for improved performance and reduced latency in NUMA-based systems.
Procedure
- Create the NUMAResourcesScheduler custom resource that deploys the NUMA-aware custom pod scheduler. Save the following minimal required YAML in the nro-scheduler.yaml file (a hedged sketch appears at the end of this section):
  1. In a disconnected environment, make sure to configure the resolution of this image by completing one of the following actions:
     - Creating an ImageTagMirrorSet custom resource (CR). For more information, see "Configuring image registry repository mirroring" in the "Additional resources" section.
     - Setting the URL to the disconnected registry.
- Create the NUMAResourcesScheduler CR by running the following command:
  $ oc create -f nro-scheduler.yaml
- After a few seconds, run the following command to confirm the successful deployment of the required resources:
  $ oc get all -n openshift-numaresources
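The minimal nro-scheduler.yaml example is not reproduced in this extract. A hedged sketch; the scheduler image reference is a placeholder that must point to an image resolvable in your environment, as the callout above describes:
apiVersion: nodetopology.openshift.io/v1
kind: NUMAResourcesScheduler
metadata:
  name: numaresourcesscheduler
spec:
  imageSpec: "<noderesourcetopology_scheduler_image>"   # 1: placeholder image reference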
Additional resources
9.4.3. Configuring a single NUMA node policy
The NUMA Resources Operator requires a single NUMA node policy to be configured on the cluster. This can be achieved in two ways: by creating and applying a performance profile, or by configuring a KubeletConfig.
The preferred way to configure a single NUMA node policy is to apply a performance profile. You can use the Performance Profile Creator (PPC) tool to create the performance profile. If a performance profile is created on the cluster, it automatically creates other tuning components like KubeletConfig and the tuned profile.
For more information about creating a performance profile, see "About the Performance Profile Creator" in the "Additional resources" section.
Additional resources
9.4.4. Sample performance profile
This example YAML shows a performance profile created by using the performance profile creator (PPC) tool:
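The profile YAML itself is not reproduced in this extract. The following hedged sketch illustrates the general shape of a PPC-generated profile; the CPU ranges are placeholders, and the numbered comments correspond to the callouts below:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  cpu:
    isolated: "4-15"   # placeholder CPU layout
    reserved: "0-3"
  machineConfigPoolSelector:
    pools.operator.machineconfiguration.openshift.io/worker-cnf: ""   # 1
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  numa:
    topologyPolicy: single-numa-node   # 2
  realTimeKernel:
    enabled: false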
1. This should match the MachineConfigPool that you want to configure the NUMA Resources Operator on. For example, you might have created a MachineConfigPool named worker-cnf that designates a set of nodes that run telecommunications workloads.
2. The topologyPolicy must be set to single-numa-node. Ensure that this is the case by setting the topology-manager-policy argument to single-numa-node when running the PPC tool.
9.4.5. Creating a KubeletConfig CR
The recommended way to configure a single NUMA node policy is to apply a performance profile. Another way is by creating and applying a KubeletConfig custom resource (CR), as shown in the following procedure.
Procedure
- Create the KubeletConfig custom resource (CR) that configures the pod admittance policy for the machine profile. Save the following YAML in the nro-kubeletconfig.yaml file:
  1. Adjust this label to match the machineConfigPoolSelector in the NUMAResourcesOperator CR.
  2. For cpuManagerPolicy, static must use a lowercase s.
  3. Adjust this based on the CPU on your nodes.
  4. For memoryManagerPolicy, Static must use an uppercase S.
  5. topologyManagerPolicy must be set to single-numa-node.
- Create the KubeletConfig CR by running the following command:
  $ oc create -f nro-kubeletconfig.yaml
  Note: Applying a performance profile or KubeletConfig automatically triggers rebooting of the nodes. If no reboot is triggered, you can troubleshoot the issue by looking at the labels in KubeletConfig that address the node group.
9.4.6. Scheduling workloads with the NUMA-aware scheduler
Now that topo-aware-scheduler is installed, the NUMAResourcesOperator and NUMAResourcesScheduler CRs are applied, and your cluster has a matching performance profile or kubeletconfig, you can schedule workloads with the NUMA-aware scheduler by using deployment CRs that specify the minimum required resources to process the workload.
The following example deployment uses NUMA-aware scheduling for a sample workload.
Prerequisites
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
Procedure
Get the name of the NUMA-aware scheduler that is deployed in the cluster by running the following command:
$ oc get numaresourcesschedulers.nodetopology.openshift.io numaresourcesscheduler -o json | jq '.status.schedulerName'
Example output
"topo-aware-scheduler"
Create a Deployment CR that uses the scheduler named topo-aware-scheduler, for example:
- Save the following YAML in the nro-deployment.yaml file:
  1. schedulerName must match the name of the NUMA-aware scheduler that is deployed in your cluster, for example topo-aware-scheduler.
- Create the Deployment CR by running the following command:
  $ oc create -f nro-deployment.yaml
Verification
Verify that the deployment was successful:
oc get pods -n openshift-numaresources
$ oc get pods -n openshift-numaresourcesCopy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Verify that the
topo-aware-scheduleris scheduling the deployed pod by running the following command:oc describe pod numa-deployment-1-6c4f5bdb84-wgn6g -n openshift-numaresources
$ oc describe pod numa-deployment-1-6c4f5bdb84-wgn6g -n openshift-numaresourcesCopy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Scheduled 4m45s topo-aware-scheduler Successfully assigned openshift-numaresources/numa-deployment-1-6c4f5bdb84-wgn6g to worker-1
Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Scheduled 4m45s topo-aware-scheduler Successfully assigned openshift-numaresources/numa-deployment-1-6c4f5bdb84-wgn6g to worker-1Copy to Clipboard Copied! Toggle word wrap Toggle overflow NoteDeployments that request more resources than is available for scheduling will fail with a
MinimumReplicasUnavailableerror. The deployment succeeds when the required resources become available. Pods remain in thePendingstate until the required resources are available.Verify that the expected allocated resources are listed for the node.
Identify the node that is running the deployment pod by running the following command:
$ oc get pods -n openshift-numaresources -o wide
Example output
NAME                                 READY   STATUS    RESTARTS   AGE   IP            NODE       NOMINATED NODE   READINESS GATES
numa-deployment-1-6c4f5bdb84-wgn6g   0/2     Running   0          82m   10.128.2.50   worker-1   <none>           <none>
Run the following command, specifying the name of the node that is running the deployment pod:
$ oc describe noderesourcetopologies.topology.node.k8s.io worker-1
In the output, the Available capacity is reduced because of the resources that have been allocated to the guaranteed pod.
Resources consumed by guaranteed pods are subtracted from the available node resources listed under noderesourcetopologies.topology.node.k8s.io.
Resource allocations for pods with a Best-effort or Burstable quality of service (qosClass) are not reflected in the NUMA node resources under noderesourcetopologies.topology.node.k8s.io. If a pod's consumed resources are not reflected in the node resource calculation, verify that the pod has a qosClass of Guaranteed and that the CPU request is an integer value, not a decimal value. You can verify that the pod has a qosClass of Guaranteed by running the following command:
$ oc get pod numa-deployment-1-6c4f5bdb84-wgn6g -n openshift-numaresources -o jsonpath="{ .status.qosClass }"
Example output
Guaranteed
9.5. Optional: Configuring polling operations for NUMA resources updates
The daemons controlled by the NUMA Resources Operator in their nodeGroup poll resources to retrieve updates about available NUMA resources. You can fine-tune polling operations for these daemons by configuring the spec.nodeGroups specification in the NUMAResourcesOperator custom resource (CR). This provides advanced control of polling operations. Configure these specifications to improve scheduling behavior and to troubleshoot suboptimal scheduling decisions.
The configuration options are the following:
- infoRefreshMode: Determines the trigger condition for polling the kubelet. The NUMA Resources Operator reports the resulting information to the API server.
- infoRefreshPeriod: Determines the duration between polling updates.
- podsFingerprinting: Determines if point-in-time information for the current set of pods running on a node is exposed in polling updates.
Note: podsFingerprinting is enabled by default. podsFingerprinting is a requirement for the cacheResyncPeriod specification in the NUMAResourcesScheduler CR. The cacheResyncPeriod specification helps to report more exact resource availability by monitoring pending resources on nodes.
Prerequisites
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
- Install the NUMA Resources Operator.
Procedure
Configure the spec.nodeGroups specification in your NUMAResourcesOperator CR:
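A minimal sketch of such a NUMAResourcesOperator CR follows; the machine config pool selector and the API version are assumptions that might differ by release. The numbered comments correspond to the callouts below:
apiVersion: nodetopology.openshift.io/v1
kind: NUMAResourcesOperator
metadata:
  name: numaresourcesoperator
spec:
  nodeGroups:
  - config:
      infoRefreshMode: Periodic        # 1
      infoRefreshPeriod: 10s           # 2
      podsFingerprinting: Enabled      # 3
    machineConfigPoolSelector:
      matchLabels:
        pools.operator.machineconfiguration.openshift.io/worker: ""   # assumed worker pool selector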
- 1
- Valid values are Periodic, Events, and PeriodicAndEvents. Use Periodic to poll the kubelet at intervals that you define in infoRefreshPeriod. Use Events to poll the kubelet at every pod lifecycle event. Use PeriodicAndEvents to enable both methods.
- 2
- Define the polling interval for the Periodic or PeriodicAndEvents refresh modes. The field is ignored if the refresh mode is Events.
- 3
- Valid values are Enabled, Disabled, and EnabledExclusiveResources. Setting podsFingerprinting to Enabled is a requirement for the cacheResyncPeriod specification in the NUMAResourcesScheduler.
Verification
After you deploy the NUMA Resources Operator, verify that the node group configurations were applied by running the following command:
$ oc get numaresop numaresourcesoperator -o json | jq '.status'
9.6. Troubleshooting NUMA-aware scheduling
To troubleshoot common problems with NUMA-aware pod scheduling, perform the following steps.
Prerequisites
- Install the OpenShift Container Platform CLI (oc).
- Log in as a user with cluster-admin privileges.
- Install the NUMA Resources Operator and deploy the NUMA-aware secondary scheduler.
Procedure
Verify that the noderesourcetopologies CRD is deployed in the cluster by running the following command:
$ oc get crd | grep noderesourcetopologies
Example output
NAME                                          CREATED AT
noderesourcetopologies.topology.node.k8s.io   2022-01-18T08:28:06Z
Check that the NUMA-aware scheduler name matches the name specified in your NUMA-aware workloads by running the following command:
$ oc get numaresourcesschedulers.nodetopology.openshift.io numaresourcesscheduler -o json | jq '.status.schedulerName'
Example output
topo-aware-scheduler
Verify that NUMA-aware schedulable nodes have the noderesourcetopologies CR applied to them. Run the following command:
$ oc get noderesourcetopologies.topology.node.k8s.io
Example output
NAME                    AGE
compute-0.example.com   17h
compute-1.example.com   17h
Note: The number of nodes should equal the number of worker nodes that are configured by the machine config pool (mcp) worker definition.
Verify the NUMA zone granularity for all schedulable nodes by running the following command:
$ oc get noderesourcetopologies.topology.node.k8s.io -o yaml
9.6.1. Reporting more exact resource availability
Enable the cacheResyncPeriod specification to help the NUMA Resources Operator report more exact resource availability by monitoring pending resources on nodes and synchronizing this information in the scheduler cache at a defined interval. This also helps to minimize Topology Affinity Error failures caused by suboptimal scheduling decisions. The lower the interval, the greater the network load. The cacheResyncPeriod specification is disabled by default.
Prerequisites
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
Procedure
Delete the currently running NUMAResourcesScheduler resource:
Get the active NUMAResourcesScheduler by running the following command:
$ oc get NUMAResourcesScheduler
Example output
NAME                     AGE
numaresourcesscheduler   92m
Delete the secondary scheduler resource by running the following command:
$ oc delete NUMAResourcesScheduler numaresourcesscheduler
Example output
numaresourcesscheduler.nodetopology.openshift.io "numaresourcesscheduler" deleted
Save the following YAML in the file nro-scheduler-cacheresync.yaml. This example changes the log level to Debug:
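A minimal sketch of such a NUMAResourcesScheduler CR follows; the imageSpec value is a placeholder for the scheduler image of your release, and the API version might differ. The numbered comment corresponds to the callout below:
apiVersion: nodetopology.openshift.io/v1
kind: NUMAResourcesScheduler
metadata:
  name: numaresourcesscheduler
spec:
  imageSpec: "registry.redhat.io/openshift4/noderesourcetopology-scheduler-rhel9:v4.14"   # placeholder; use the scheduler image for your release
  cacheResyncPeriod: "5s"   # 1
  logLevel: Debug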
- 1
- Enter an interval value in seconds for synchronization of the scheduler cache. A value of 5s is typical for most implementations.
Create the updated NUMAResourcesScheduler resource by running the following command:
$ oc create -f nro-scheduler-cacheresync.yaml
Example output
numaresourcesscheduler.nodetopology.openshift.io/numaresourcesscheduler created
Verification steps
Check that the NUMA-aware scheduler was successfully deployed:
Run the following command to check that the CRD is created successfully:
$ oc get crd | grep numaresourcesschedulers
Example output
NAME                                                CREATED AT
numaresourcesschedulers.nodetopology.openshift.io   2022-02-25T11:57:03Z
Check that the new custom scheduler is available by running the following command:
$ oc get numaresourcesschedulers.nodetopology.openshift.io
Example output
NAME                     AGE
numaresourcesscheduler   3h26m
Check that the logs for the scheduler show the increased log level:
Get the list of pods running in the openshift-numaresources namespace by running the following command:
$ oc get pods -n openshift-numaresources
Example output
NAME                                                READY   STATUS    RESTARTS   AGE
numaresources-controller-manager-d87d79587-76mrm   1/1     Running   0          46h
numaresourcesoperator-worker-5wm2k                 2/2     Running   0          45h
numaresourcesoperator-worker-pb75c                 2/2     Running   0          45h
secondary-scheduler-7976c4d466-qm4sc               1/1     Running   0          21m
Get the logs for the secondary scheduler pod by running the following command:
$ oc logs secondary-scheduler-7976c4d466-qm4sc -n openshift-numaresources
9.6.2. Checking the NUMA-aware scheduler logs
Troubleshoot problems with the NUMA-aware scheduler by reviewing the logs. If required, you can increase the scheduler log level by modifying the spec.logLevel field of the NUMAResourcesScheduler resource. Acceptable values are Normal, Debug, and Trace, with Trace being the most verbose option.
To change the log level of the secondary scheduler, delete the running scheduler resource and re-deploy it with the changed log level. The scheduler is unavailable for scheduling new workloads during this downtime.
Prerequisites
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
Procedure
Delete the currently running NUMAResourcesScheduler resource:
Get the active NUMAResourcesScheduler by running the following command:
$ oc get NUMAResourcesScheduler
Example output
NAME                     AGE
numaresourcesscheduler   90m
Delete the secondary scheduler resource by running the following command:
$ oc delete NUMAResourcesScheduler numaresourcesscheduler
Example output
numaresourcesscheduler.nodetopology.openshift.io "numaresourcesscheduler" deleted
Save the following YAML in the file nro-scheduler-debug.yaml. This example changes the log level to Debug:
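A minimal sketch of such a NUMAResourcesScheduler CR with Debug logging follows; the imageSpec value is a placeholder for the scheduler image of your release, and the API version might differ:
apiVersion: nodetopology.openshift.io/v1
kind: NUMAResourcesScheduler
metadata:
  name: numaresourcesscheduler
spec:
  imageSpec: "registry.redhat.io/openshift4/noderesourcetopology-scheduler-rhel9:v4.14"   # placeholder; use the scheduler image for your release
  logLevel: Debug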
Create the updated Debug logging NUMAResourcesScheduler resource by running the following command:
$ oc create -f nro-scheduler-debug.yaml
Example output
numaresourcesscheduler.nodetopology.openshift.io/numaresourcesscheduler created
Verification steps
Check that the NUMA-aware scheduler was successfully deployed:
Run the following command to check that the CRD is created successfully:
$ oc get crd | grep numaresourcesschedulers
Example output
NAME                                                CREATED AT
numaresourcesschedulers.nodetopology.openshift.io   2022-02-25T11:57:03Z
Check that the new custom scheduler is available by running the following command:
$ oc get numaresourcesschedulers.nodetopology.openshift.io
Example output
NAME                     AGE
numaresourcesscheduler   3h26m
Check that the logs for the scheduler show the increased log level:
Get the list of pods running in the openshift-numaresources namespace by running the following command:
$ oc get pods -n openshift-numaresources
Example output
NAME                                                READY   STATUS    RESTARTS   AGE
numaresources-controller-manager-d87d79587-76mrm   1/1     Running   0          46h
numaresourcesoperator-worker-5wm2k                 2/2     Running   0          45h
numaresourcesoperator-worker-pb75c                 2/2     Running   0          45h
secondary-scheduler-7976c4d466-qm4sc               1/1     Running   0          21m
Get the logs for the secondary scheduler pod by running the following command:
$ oc logs secondary-scheduler-7976c4d466-qm4sc -n openshift-numaresources
9.6.3. Troubleshooting the resource topology exporter
Troubleshoot noderesourcetopologies objects where unexpected results are occurring by inspecting the corresponding resource-topology-exporter logs.
It is recommended that NUMA resource topology exporter instances in the cluster are named for the nodes they refer to. For example, a worker node with the name worker should have a corresponding noderesourcetopologies object called worker.
Prerequisites
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
Procedure
Get the daemonsets managed by the NUMA Resources Operator. Each daemonset has a corresponding nodeGroup in the NUMAResourcesOperator CR. Run the following command:
$ oc get numaresourcesoperators.nodetopology.openshift.io numaresourcesoperator -o jsonpath="{.status.daemonsets[0]}"
Example output
{"name":"numaresourcesoperator-worker","namespace":"openshift-numaresources"}
Get the label for the daemonset of interest using the value for name from the previous step:
$ oc get ds -n openshift-numaresources numaresourcesoperator-worker -o jsonpath="{.spec.selector.matchLabels}"
Example output
{"name":"resource-topology"}
Get the pods using the resource-topology label by running the following command:
$ oc get pods -n openshift-numaresources -l name=resource-topology -o wide
Example output
NAME                                 READY   STATUS    RESTARTS   AGE    IP            NODE
numaresourcesoperator-worker-5wm2k   2/2     Running   0          2d1h   10.135.0.64   compute-0.example.com
numaresourcesoperator-worker-pb75c   2/2     Running   0          2d1h   10.132.2.33   compute-1.example.com
Examine the logs of the resource-topology-exporter container running on the worker pod that corresponds to the node you are troubleshooting. Run the following command:
$ oc logs -n openshift-numaresources -c resource-topology-exporter numaresourcesoperator-worker-pb75c
9.6.4. Correcting a missing resource topology exporter config map
If you install the NUMA Resources Operator in a cluster with misconfigured cluster settings, in some circumstances, the Operator is shown as active but the logs of the resource topology exporter (RTE) daemon set pods show that the configuration for the RTE is missing, for example:
Info: couldn't find configuration in "/etc/resource-topology-exporter/config.yaml"
This log message indicates that the kubeletconfig with the required configuration was not properly applied in the cluster, resulting in a missing RTE configmap. For example, the following cluster is missing a numaresourcesoperator-worker configmap custom resource (CR):
$ oc get configmap
Example output
NAME DATA AGE
0e2a6bd3.openshift-kni.io 0 6d21h
kube-root-ca.crt 1 6d21h
openshift-service-ca.crt 1 6d21h
topo-aware-scheduler-config 1 6d18h
In a correctly configured cluster, oc get configmap also returns a numaresourcesoperator-worker configmap CR.
Prerequisites
- Install the OpenShift Container Platform CLI (oc).
- Log in as a user with cluster-admin privileges.
- Install the NUMA Resources Operator and deploy the NUMA-aware secondary scheduler.
Procedure
Compare the values for spec.machineConfigPoolSelector.matchLabels in kubeletconfig and metadata.labels in the MachineConfigPool (mcp) worker CR using the following commands:
Check the kubeletconfig labels by running the following command:
$ oc get kubeletconfig -o yaml
Example output
machineConfigPoolSelector:
  matchLabels:
    cnf-worker-tuning: enabled
Check the mcp labels by running the following command:
$ oc get mcp worker -o yaml
Example output
labels:
  machineconfiguration.openshift.io/mco-built-in: ""
  pools.operator.machineconfiguration.openshift.io/worker: ""
The cnf-worker-tuning: enabled label is not present in the MachineConfigPool object.
Edit the MachineConfigPool CR to include the missing label, for example:
$ oc edit mcp worker -o yaml
Example output
labels:
  machineconfiguration.openshift.io/mco-built-in: ""
  pools.operator.machineconfiguration.openshift.io/worker: ""
  cnf-worker-tuning: enabled
- Apply the label changes and wait for the cluster to apply the updated configuration.
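For example, as a sketch that reuses the wait command shown later in this document for machine config pools, you can wait for the pools to report the Updated condition:
$ oc wait --for=condition=Updated mcp --all --timeout=30m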
Verification
Check that the missing numaresourcesoperator-worker configmap CR is applied:
$ oc get configmap
The output now includes a numaresourcesoperator-worker configmap in addition to the config maps listed previously.
9.6.5. Collecting NUMA Resources Operator data
You can use the oc adm must-gather CLI command to collect information about your cluster, including features and objects associated with the NUMA Resources Operator.
Prerequisites
- You have access to the cluster as a user with the cluster-admin role.
- You have installed the OpenShift CLI (oc).
Procedure
To collect NUMA Resources Operator data with must-gather, you must specify the NUMA Resources Operator must-gather image:
$ oc adm must-gather --image=registry.redhat.io/openshift4/numaresources-must-gather-rhel9:v4.14
Chapter 10. Scalability and performance optimization
10.1. Optimizing storage
Optimizing storage helps to minimize storage use across all resources. By optimizing storage, administrators help ensure that existing storage resources are working in an efficient manner.
10.1.1. Available persistent storage options
Understand your persistent storage options so that you can optimize your OpenShift Container Platform environment.
| Storage type | Description | Examples |
|---|---|---|
| Block | Presented to the operating system as a block device, such as a disk or LUN. Suited to applications that manage their own storage layout and require low-latency access. | AWS EBS and VMware vSphere support dynamic persistent volume (PV) provisioning natively in OpenShift Container Platform. |
| File | Presented as a file system that can be mounted and shared by multiple clients at the same time. | RHEL NFS, NetApp NFS [1], and Vendor NFS |
| Object | Accessed through a REST API endpoint. Applications store data as objects rather than on a mounted file system. | AWS S3 |
[1] NetApp NFS supports dynamic PV provisioning when using the Trident plugin.
10.1.2. Recommended configurable storage technology
The following table summarizes the recommended and configurable storage technologies for the given OpenShift Container Platform cluster application.
| Storage type | Block | File | Object |
|---|---|---|---|
| ROX [1] | Yes [4] | Yes [4] | Yes |
| RWX [2] | No | Yes | Yes |
| Registry | Configurable | Configurable | Recommended |
| Scaled registry | Not configurable | Configurable | Recommended |
| Metrics [3] | Recommended | Configurable [5] | Not configurable |
| Elasticsearch Logging | Recommended | Configurable [6] | Not supported [6] |
| Loki Logging | Not configurable | Not configurable | Recommended |
| Apps | Recommended | Recommended | Not configurable [7] |
[1] ReadOnlyMany
[2] ReadWriteMany
[3] Prometheus is the underlying technology used for metrics.
[4] This does not apply to physical disk, VM physical disk, VMDK, loopback over NFS, AWS EBS, and Azure Disk.
[5] For metrics, using file storage with the ReadWriteMany (RWX) access mode is unreliable.
[6] For logging, review the recommended storage solution in the Configuring persistent storage for the log store section. Using NFS storage as a persistent volume or through NAS, such as Gluster, can corrupt the data. Hence, NFS is not supported for Elasticsearch storage and the LokiStack log store in OpenShift Container Platform Logging. You must use one persistent volume type per log store.
[7] Object storage is not consumed through OpenShift Container Platform's PVs or PVCs. Apps must integrate with the object storage REST API.
A scaled registry is an OpenShift image registry where two or more pod replicas are running.
10.1.2.1. Specific application storage recommendations
Testing shows issues with using the NFS server on Red Hat Enterprise Linux (RHEL) as a storage backend for core services. This includes the OpenShift Container Registry and Quay, Prometheus for monitoring storage, and Elasticsearch for logging storage. Therefore, using RHEL NFS to back PVs used by core services is not recommended.
Other NFS implementations in the marketplace might not have these issues. Contact the individual NFS implementation vendor for more information on any testing that was possibly completed against these OpenShift Container Platform core components.
10.1.2.1.1. Registry
In a non-scaled/high-availability (HA) OpenShift image registry cluster deployment:
- The storage technology does not have to support RWX access mode.
- The storage technology must ensure read-after-write consistency.
- The preferred storage technology is object storage followed by block storage.
- File storage is not recommended for OpenShift image registry cluster deployment with production workloads.
10.1.2.1.2. Scaled registry
In a scaled/HA OpenShift image registry cluster deployment:
- The storage technology must support RWX access mode.
- The storage technology must ensure read-after-write consistency.
- The preferred storage technology is object storage.
- Red Hat OpenShift Data Foundation (ODF), Amazon Simple Storage Service (Amazon S3), Google Cloud Storage (GCS), Microsoft Azure Blob Storage, and OpenStack Swift are supported.
- Object storage should be S3 or Swift compliant.
- For non-cloud platforms, such as vSphere and bare metal installations, the only configurable technology is file storage.
- Block storage is not configurable.
- The use of Network File System (NFS) storage with OpenShift Container Platform is supported. However, the use of NFS storage with a scaled registry can cause known issues. For more information, see the Red Hat Knowledgebase solution, Is NFS supported for OpenShift cluster internal components in Production?.
10.1.2.1.3. Metrics
In an OpenShift Container Platform hosted metrics cluster deployment:
- The preferred storage technology is block storage.
- Object storage is not configurable.
It is not recommended to use file storage for a hosted metrics cluster deployment with production workloads.
10.1.2.1.4. Logging
In an OpenShift Container Platform hosted logging cluster deployment:
Loki Operator:
- The preferred storage technology is S3 compatible Object storage.
- Block storage is not configurable.
OpenShift Elasticsearch Operator:
- The preferred storage technology is block storage.
- Object storage is not supported.
As of logging version 5.4.3 the OpenShift Elasticsearch Operator is deprecated and is planned to be removed in a future release. Red Hat will provide bug fixes and support for this feature during the current release lifecycle, but this feature will no longer receive enhancements and will be removed. As an alternative to using the OpenShift Elasticsearch Operator to manage the default log storage, you can use the Loki Operator.
10.1.2.1.5. Applications
Application use cases vary from application to application, as described in the following examples:
- Storage technologies that support dynamic PV provisioning have low mount time latencies, and are not tied to nodes to support a healthy cluster.
- Application developers are responsible for knowing and understanding the storage requirements for their application, and how it works with the provided storage to ensure that issues do not occur when an application scales or interacts with the storage layer.
10.1.2.2. Other specific application storage recommendations
It is not recommended to use RAID configurations for write-intensive workloads, such as etcd. If you are running etcd with a RAID configuration, you might be at risk of encountering performance issues with your workloads.
- Red Hat OpenStack Platform (RHOSP) Cinder: RHOSP Cinder tends to be adept in ROX access mode use cases.
- Databases: Databases (RDBMSs, NoSQL DBs, etc.) tend to perform best with dedicated block storage.
- The etcd database must have enough storage and adequate performance capacity to enable a large cluster. Information about monitoring and benchmarking tools to establish ample storage and a high-performance environment is described in Recommended etcd practices.
10.1.3. Data storage management
The following table summarizes the main directories that OpenShift Container Platform components write data to.
| Directory | Notes | Sizing | Expected growth |
|---|---|---|---|
| /var/log | Log files for all components. | 10 to 30 GB. | Log files can grow quickly; size can be managed by growing disks or by using log rotate. |
| /var/lib/etcd | Used for etcd storage when storing the database. | Less than 20 GB. Database can grow up to 8 GB. | Will grow slowly with the environment. Only storing metadata. Additional 20-25 GB for every additional 8 GB of memory. |
| /var/lib/containers | This is the mount point for the CRI-O runtime. Storage used for active container runtimes, including pods, and storage of local images. Not used for registry storage. | 50 GB for a node with 16 GB memory. Note that this sizing should not be used to determine minimum cluster requirements. Additional 20-25 GB for every additional 8 GB of memory. | Growth is limited by capacity for running containers. |
| /var/lib/kubelet | Ephemeral volume storage for pods. This includes anything external that is mounted into a container at runtime. Includes environment variables, kube secrets, and data volumes not backed by persistent volumes. | Varies | Minimal if pods requiring storage are using persistent volumes. If using ephemeral storage, this can grow quickly. |
10.1.4. Optimizing storage performance for Microsoft Azure
OpenShift Container Platform and Kubernetes are sensitive to disk performance, and faster storage is recommended, particularly for etcd on the control plane nodes.
For production Azure clusters and clusters with intensive workloads, the virtual machine operating system disk for control plane machines should be able to sustain a tested and recommended minimum throughput of 5000 IOPS / 200MBps. This throughput can be provided by having a minimum of 1 TiB Premium SSD (P30). In Azure and Azure Stack Hub, disk performance is directly dependent on SSD disk sizes. To achieve the throughput supported by a Standard_D8s_v3 virtual machine, or other similar machine types, and the target of 5000 IOPS, at least a P30 disk is required.
Host caching must be set to ReadOnly for low latency and high IOPS and throughput when reading data. Reading data from the cache, which is present either in the VM memory or in the local SSD disk, is much faster than reading from the disk, which is in the blob storage.
10.2. Optimizing routing
The OpenShift Container Platform HAProxy router can be scaled or configured to optimize performance.
10.2.1. Baseline Ingress Controller (router) performance
The OpenShift Container Platform Ingress Controller, or router, is the ingress point for ingress traffic for applications and services that are configured using routes and ingresses.
When evaluating a single HAProxy router performance in terms of HTTP requests handled per second, the performance varies depending on many factors. In particular:
- HTTP keep-alive/close mode
- Route type
- TLS session resumption client support
- Number of concurrent connections per target route
- Number of target routes
- Back end server page size
- Underlying infrastructure (network/SDN solution, CPU, and so on)
While performance in your specific environment will vary, Red Hat lab tests were performed on a public cloud instance of size 4 vCPU/16GB RAM. A single HAProxy router handling 100 routes terminated by backends serving 1 kB static pages is able to handle the following number of transactions per second.
In HTTP keep-alive mode scenarios:
| Encryption | LoadBalancerService | HostNetwork |
|---|---|---|
| none | 21515 | 29622 |
| edge | 16743 | 22913 |
| passthrough | 36786 | 53295 |
| re-encrypt | 21583 | 25198 |
In HTTP close (no keep-alive) scenarios:
| Encryption | LoadBalancerService | HostNetwork |
|---|---|---|
| none | 5719 | 8273 |
| edge | 2729 | 4069 |
| passthrough | 4121 | 5344 |
| re-encrypt | 2320 | 2941 |
The default Ingress Controller configuration was used with the spec.tuningOptions.threadCount field set to 4. Two different endpoint publishing strategies were tested: Load Balancer Service and Host Network. TLS session resumption was used for encrypted routes. With HTTP keep-alive, a single HAProxy router is capable of saturating a 1 Gbit NIC at page sizes as small as 8 kB.
When running on bare metal with modern processors, you can expect roughly twice the performance of the public cloud instance above. This overhead is introduced by the virtualization layer in place on public clouds and holds mostly true for private cloud-based virtualization as well. The following table is a guide to how many applications to use behind the router:
| Number of applications | Application type |
|---|---|
| 5-10 | static file/web server or caching proxy |
| 100-1000 | applications generating dynamic content |
In general, HAProxy can support routes for up to 1000 applications, depending on the technology in use. Ingress Controller performance might be limited by the capabilities and performance of the applications behind it, such as language or static versus dynamic content.
Ingress, or router, sharding should be used to serve more routes towards applications and help horizontally scale the routing tier.
For more information on Ingress sharding, see Configuring Ingress Controller sharding by using route labels and Configuring Ingress Controller sharding by using namespace labels.
You can modify the Ingress Controller deployment by using the information provided in Setting Ingress Controller thread count for threads and Ingress Controller configuration parameters for timeouts, and other tuning configurations in the Ingress Controller specification.
10.2.2. Configuring Ingress Controller liveness, readiness, and startup probes
Cluster administrators can configure the timeout values for the kubelet’s liveness, readiness, and startup probes for router deployments that are managed by the OpenShift Container Platform Ingress Controller (router). The liveness and readiness probes of the router use the default timeout value of 1 second, which is too brief when networking or runtime performance is severely degraded. Probe timeouts can cause unwanted router restarts that interrupt application connections. The ability to set larger timeout values can reduce the risk of unnecessary and unwanted restarts.
You can update the timeoutSeconds value on the livenessProbe, readinessProbe, and startupProbe parameters of the router container.
| Parameter | Description |
|---|---|
| livenessProbe.timeoutSeconds | The timeout, in seconds, for the liveness probe of the router container. The kubelet restarts the container if the probe does not respond within this period. |
| readinessProbe.timeoutSeconds | The timeout, in seconds, for the readiness probe of the router container. The kubelet marks the container as not ready if the probe does not respond within this period. |
| startupProbe.timeoutSeconds | The timeout, in seconds, for the startup probe of the router container. The liveness and readiness probes are not run until the startup probe succeeds. |
The timeout configuration option is an advanced tuning technique that can be used to work around issues. However, these issues should eventually be diagnosed, and possibly a support case or Jira issue opened, for any issue that causes probes to time out.
The following example demonstrates how you can directly patch the default router deployment to set a 5-second timeout for the liveness and readiness probes:
$ oc -n openshift-ingress patch deploy/router-default --type=strategic --patch='{"spec":{"template":{"spec":{"containers":[{"name":"router","livenessProbe":{"timeoutSeconds":5},"readinessProbe":{"timeoutSeconds":5}}]}}}}'
Verification
$ oc -n openshift-ingress describe deploy/router-default | grep -e Liveness: -e Readiness:
Liveness: http-get http://:1936/healthz delay=0s timeout=5s period=10s #success=1 #failure=3
Readiness: http-get http://:1936/healthz/ready delay=0s timeout=5s period=10s #success=1 #failure=3
10.2.3. Configuring HAProxy reload interval
When you update a route or an endpoint associated with a route, the OpenShift Container Platform router updates the configuration for HAProxy. Then, HAProxy reloads the updated configuration for those changes to take effect. When HAProxy reloads, it generates a new process that handles new connections using the updated configuration.
HAProxy keeps the old process running to handle existing connections until those connections are all closed. When old processes have long-lived connections, these processes can accumulate and consume resources.
The default minimum HAProxy reload interval is five seconds. You can configure an Ingress Controller using its spec.tuningOptions.reloadInterval field to set a longer minimum reload interval.
Setting a large value for the minimum HAProxy reload interval can cause latency in observing updates to routes and their endpoints. To lessen the risk, avoid setting a value larger than the tolerable latency for updates. The maximum value for HAProxy reload interval is 120 seconds.
Procedure
Change the minimum HAProxy reload interval of the default Ingress Controller to 15 seconds by running the following command:
$ oc -n openshift-ingress-operator patch ingresscontrollers/default --type=merge --patch='{"spec":{"tuningOptions":{"reloadInterval":"15s"}}}'
10.3. Optimizing networking
The OpenShift SDN uses OpenvSwitch, virtual extensible LAN (VXLAN) tunnels, OpenFlow rules, and iptables. This network can be tuned by using jumbo frames, multi-queue, and ethtool settings.
OVN-Kubernetes uses Generic Network Virtualization Encapsulation (Geneve) instead of VXLAN as the tunnel protocol. This network can be tuned by using network interface controller (NIC) offloads.
Cloud, virtual, and bare-metal environments running OpenShift Container Platform can use a high percentage of a NIC’s capabilities with minimal tuning. Production clusters using OVN-Kubernetes with Geneve tunneling can handle high-throughput traffic effectively and scale up (for example, utilizing 100 Gbps NICs) and scale out (for example, adding more NICs) without requiring special configuration.
In some high-performance scenarios where maximum efficiency is critical, targeted performance tuning can help optimize CPU usage, reduce overhead, and ensure that you are making full use of the NIC’s capabilities.
For environments where maximum throughput and CPU efficiency are critical, you can further optimize performance with the following strategies:
- Validate network performance using tools such as iPerf3 and k8s-netperf. These tools allow you to benchmark throughput, latency, and packets-per-second (PPS) across pod and node interfaces.
- Evaluate OVN-Kubernetes User Defined Networking (UDN) routing techniques, such as border gateway protocol (BGP).
- Use Geneve-offload capable network adapters. Geneve-offload moves the packet checksum calculation and associated CPU overhead off of the system CPU and onto dedicated hardware on the network adapter. This frees up CPU cycles for use by pods and applications, and allows users to use the full bandwidth of their network infrastructure.
10.3.1. Optimizing the MTU for your network
There are two important maximum transmission units (MTUs): the network interface controller (NIC) MTU and the cluster network MTU.
The NIC MTU is configured at the time of OpenShift Container Platform installation, and you can also change the MTU of a cluster as a postinstallation task. For more information, see "Changing cluster network MTU".
For a cluster that uses the OVN-Kubernetes plugin, the MTU must be at least 100 bytes less than the maximum supported value of the NIC of your network. If you are optimizing for throughput, choose the largest possible value, such as 8900. If you are optimizing for lowest latency, choose a lower value.
If your cluster uses the OVN-Kubernetes plugin and the network uses a NIC to send and receive unfragmented jumbo frame packets over the network, you must specify 9000 bytes as the MTU value for the NIC so that pods do not fail.
The OpenShift SDN network plugin overlay MTU must be less than the NIC MTU by 50 bytes at a minimum. This accounts for the SDN overlay header. So, on a normal ethernet network, this should be set to 1450. On a jumbo frame ethernet network, this should be set to 8950. These values should be set automatically by the Cluster Network Operator based on the NIC’s configured MTU. Therefore, cluster administrators do not typically update these values. Amazon Web Services (AWS) and bare-metal environments support jumbo frame ethernet networks. This setting will help throughput, especially with transmission control protocol (TCP).
For OVN and Geneve, the MTU must be less than the NIC MTU by 100 bytes at a minimum.
This 50 byte overlay header is relevant to the OpenShift SDN network plugin. Other SDN solutions might require the value to be more or less.
10.3.2. Recommended practices for installing large scale clusters
When installing large clusters or scaling the cluster to larger node counts, set the cluster network cidr accordingly in your install-config.yaml file before you install the cluster.
Example install-config.yaml file with a network configuration for a cluster with a large node count
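A sketch of the networking stanza of such an install-config.yaml file follows; the hostPrefix, machineNetwork, and serviceNetwork values are typical defaults and are assumptions that you must adapt to your environment:
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/12   # larger cluster network CIDR for more than 500 nodes
    hostPrefix: 23        # assumed default host prefix
  machineNetwork:
  - cidr: 10.0.0.0/16     # illustrative machine network
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16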
The default cluster network cidr 10.128.0.0/14 cannot be used if the cluster size is more than 500 nodes. The cidr must be set to 10.128.0.0/12 or 10.128.0.0/10 to get to larger node counts beyond 500 nodes.
10.3.3. Impact of IPsec
Because encrypting and decrypting traffic between node hosts uses CPU power, performance is affected both in throughput and in CPU usage on the nodes when encryption is enabled, regardless of the IP security system being used.
IPsec encrypts traffic at the IP payload level, before it hits the NIC, protecting fields that would otherwise be used for NIC offloading. This means that some NIC acceleration features might not be usable when IPsec is enabled, which leads to decreased throughput and increased CPU usage.
10.4. Optimizing CPU usage with mount namespace encapsulation
You can optimize CPU usage in OpenShift Container Platform clusters by using mount namespace encapsulation to provide a private namespace for kubelet and CRI-O processes. This reduces the cluster CPU resources used by systemd with no difference in functionality.
Mount namespace encapsulation is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
10.4.1. Encapsulating mount namespaces
Mount namespaces are used to isolate mount points so that processes in different namespaces cannot view each other's files. Encapsulation is the process of moving Kubernetes mount namespaces to an alternative location where they will not be constantly scanned by the host operating system.
The host operating system uses systemd to constantly scan all mount namespaces: both the standard Linux mounts and the numerous mounts that Kubernetes uses to operate. The current implementation of kubelet and CRI-O both use the top-level namespace for all container runtime and kubelet mount points. However, encapsulating these container-specific mount points in a private namespace reduces systemd overhead with no difference in functionality. Using a separate mount namespace for both CRI-O and kubelet can encapsulate container-specific mounts from any systemd or other host operating system interaction.
This ability to potentially achieve major CPU optimization is now available to all OpenShift Container Platform administrators. Encapsulation can also improve security by storing Kubernetes-specific mount points in a location safe from inspection by unprivileged users.
The following diagrams illustrate a Kubernetes installation before and after encapsulation. Both scenarios show example containers which have mount propagation settings of bidirectional, host-to-container, and none.
Before encapsulation, systemd, host operating system processes, kubelet, and the container runtime share a single mount namespace.
- systemd, host operating system processes, kubelet, and the container runtime each have access to and visibility of all mount points.
- Container 1, configured with bidirectional mount propagation, can access systemd and host mounts, kubelet and CRI-O mounts. A mount originating in Container 1, such as /run/a, is visible to systemd, host operating system processes, kubelet, the container runtime, and other containers with host-to-container or bidirectional mount propagation configured (as in Container 2).
- Container 2, configured with host-to-container mount propagation, can access systemd and host mounts, kubelet and CRI-O mounts. A mount originating in Container 2, such as /run/b, is not visible to any other context.
- Container 3, configured with no mount propagation, has no visibility of external mount points. A mount originating in Container 3, such as /run/c, is not visible to any other context.
The following diagram illustrates the system state after encapsulation.
- The main systemd process is no longer devoted to unnecessary scanning of Kubernetes-specific mount points. It only monitors systemd-specific and host mount points.
- The host operating system processes can access only the systemd and host mount points.
- Using a separate mount namespace for both CRI-O and kubelet completely separates all container-specific mounts away from any systemd or other host operating system interaction whatsoever.
- The behavior of Container 1 is unchanged, except that a mount it creates, such as /run/a, is no longer visible to systemd or host operating system processes. It is still visible to kubelet, CRI-O, and other containers with host-to-container or bidirectional mount propagation configured (like Container 2).
- The behavior of Container 2 and Container 3 is unchanged.
10.4.2. Configuring mount namespace encapsulation
You can configure mount namespace encapsulation so that a cluster runs with less resource overhead.
Mount namespace encapsulation is a Technology Preview feature and it is disabled by default. To use it, you must enable the feature manually.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in as a user with cluster-admin privileges.
Procedure
Create a file called mount_namespace_config.yaml with the following YAML:
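A sketch of such a file follows, assuming, based on the names in the output below, that it defines one MachineConfig per role that enables the kubens.service systemd unit; verify the exact contents for your release:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 99-kubens-master
spec:
  config:
    ignition:
      version: 3.2.0
    systemd:
      units:
      - enabled: true
        name: kubens.service
---
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-kubens-worker
spec:
  config:
    ignition:
      version: 3.2.0
    systemd:
      units:
      - enabled: true
        name: kubens.service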
Apply the mount namespace MachineConfig CR by running the following command:
$ oc apply -f mount_namespace_config.yaml
Example output
machineconfig.machineconfiguration.openshift.io/99-kubens-master created
machineconfig.machineconfiguration.openshift.io/99-kubens-worker created
The MachineConfig CR can take up to 30 minutes to finish being applied in the cluster. You can check the status of the MachineConfig CR by running the following command:
$ oc get mcp
Example output
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-03d4bc4befb0f4ed3566a2c8f7636751   False     True       False      3              0                   0                     0                      45m
worker   rendered-worker-10577f6ab0117ed1825f8af2ac687ddf   False     True       False      3              1                   1
Wait for the MachineConfig CR to be applied successfully across all control plane and worker nodes by running the following command:
$ oc wait --for=condition=Updated mcp --all --timeout=30m
Example output
machineconfigpool.machineconfiguration.openshift.io/master condition met
machineconfigpool.machineconfiguration.openshift.io/worker condition met
Verification
To verify encapsulation for a cluster host, run the following commands:
Open a debug shell to the cluster host:
$ oc debug node/<node_name>
Open a chroot session:
sh-4.4# chroot /host
Check the systemd mount namespace:
sh-4.4# readlink /proc/1/ns/mnt
Example output
mnt:[4026531953]
Check the kubelet mount namespace:
sh-4.4# readlink /proc/$(pgrep kubelet)/ns/mnt
Example output
mnt:[4026531840]
Check the CRI-O mount namespace:
sh-4.4# readlink /proc/$(pgrep crio)/ns/mnt
Example output
mnt:[4026531840]
These commands return the mount namespaces associated with systemd, kubelet, and the container runtime. In OpenShift Container Platform, the container runtime is CRI-O.
Encapsulation is in effect if systemd is in a different mount namespace to kubelet and CRI-O as in the above example. Encapsulation is not in effect if all three processes are in the same mount namespace.
10.4.3. Inspecting encapsulated namespaces
You can inspect Kubernetes-specific mount points in the cluster host operating system for debugging or auditing purposes by using the kubensenter script that is available in Red Hat Enterprise Linux CoreOS (RHCOS).
SSH shell sessions to the cluster host are in the default namespace. To inspect Kubernetes-specific mount points in an SSH shell prompt, you need to run the kubensenter script as root. The kubensenter script is aware of the state of the mount encapsulation, and is safe to run even if encapsulation is not enabled.
oc debug remote shell sessions start inside the Kubernetes namespace by default. You do not need to run kubensenter to inspect mount points when you use oc debug.
If the encapsulation feature is not enabled, the kubensenter findmnt and findmnt commands return the same output, regardless of whether they are run in an oc debug session or in an SSH shell prompt.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in as a user with cluster-admin privileges.
- You have configured SSH access to the cluster host.
Procedure
Open a remote SSH shell to the cluster host. For example:
$ ssh core@<node_name>
Run commands using the provided kubensenter script as the root user. To run a single command inside the Kubernetes namespace, provide the command and any arguments to the kubensenter script. For example, to run the findmnt command inside the Kubernetes namespace, run the following command:
[core@control-plane-1 ~]$ sudo kubensenter findmnt
To start a new interactive shell inside the Kubernetes namespace, run the kubensenter script without any arguments:
[core@control-plane-1 ~]$ sudo kubensenter
Example output
kubensenter: Autodetect: kubens.service namespace found at /run/kubens/mnt
10.4.4. Running additional services in the encapsulated namespace
Any monitoring tool that relies on the ability to run in the host operating system and have visibility of mount points created by kubelet, CRI-O, or containers themselves, must enter the container mount namespace to see these mount points. The kubensenter script that is provided with OpenShift Container Platform executes another command inside the Kubernetes mount point and can be used to adapt any existing tools.
The kubensenter script is aware of the state of the mount encapsulation feature, and is safe to run even if encapsulation is not enabled. In that case, the script executes the provided command in the default mount namespace.
For example, if a systemd service needs to run inside the new Kubernetes mount namespace, edit the service file and use the ExecStart= command line with kubensenter.
[Unit]
Description=Example service
[Service]
ExecStart=/usr/bin/kubensenter /path/to/original/command arg1 arg2
Chapter 11. Managing bare-metal hosts
When you install OpenShift Container Platform on a bare-metal cluster, you can provision and manage bare-metal nodes by using machine and machineset custom resources (CRs) for bare-metal hosts that exist in the cluster.
11.1. About bare metal hosts and nodes
To provision a Red Hat Enterprise Linux CoreOS (RHCOS) bare metal host as a node in your cluster, first create a MachineSet custom resource (CR) object that corresponds to the bare metal host hardware. Bare metal host compute machine sets describe infrastructure components specific to your configuration. You apply specific Kubernetes labels to these compute machine sets and then update the infrastructure components to run on only those machines.
Machine CRs are created automatically when you scale up the relevant MachineSet containing a metal3.io/autoscale-to-hosts annotation. OpenShift Container Platform uses Machine CRs to provision the bare metal node that corresponds to the host as specified in the MachineSet CR.
11.2. Maintaining bare metal hosts Copy linkLink copied to clipboard!
You can maintain the details of the bare metal hosts in your cluster from the OpenShift Container Platform web console. Navigate to Compute → Bare Metal Hosts, and select a task from the Actions drop-down menu. Here you can manage items such as BMC details, the boot MAC address for the host, and power management settings. You can also review the details of the network interfaces and drives for the host.
You can move a bare metal host into maintenance mode. When you move a host into maintenance mode, the scheduler moves all managed workloads off the corresponding bare metal node. No new workloads are scheduled while in maintenance mode.
You can deprovision a bare metal host in the web console. Deprovisioning a host does the following actions:
- Annotates the bare metal host CR with cluster.k8s.io/delete-machine: true
- Scales down the related compute machine set
Powering off the host without first moving the daemon set and unmanaged static pods to another node can cause service disruption and loss of data.
11.2.1. Adding a bare metal host to the cluster using the web console Copy linkLink copied to clipboard!
You can add bare metal hosts to the cluster in the web console.
Prerequisites
- Install an RHCOS cluster on bare metal.
- Log in as a user with cluster-admin privileges.
Procedure
- In the web console, navigate to Compute → Bare Metal Hosts.
- Select Add Host → New with Dialog.
- Specify a unique name for the new bare metal host.
- Set the Boot MAC address.
- Set the Baseboard Management Console (BMC) Address.
- Enter the user credentials for the host’s baseboard management controller (BMC).
- Select to power on the host after creation, and select Create.
- Scale up the number of replicas to match the number of available bare metal hosts. Navigate to Compute → MachineSets, and increase the number of machine replicas in the cluster by selecting Edit Machine count from the Actions drop-down menu.
You can also manage the number of bare metal nodes using the oc scale command and the appropriate bare metal compute machine set.
11.2.2. Adding a bare metal host to the cluster using YAML in the web console Copy linkLink copied to clipboard!
You can add bare metal hosts to the cluster in the web console using a YAML file that describes the bare metal host.
Prerequisites
- Install a RHCOS compute machine on bare metal infrastructure for use in the cluster.
- Log in as a user with cluster-admin privileges.
- Create a Secret CR for the bare metal host.
Procedure
- In the web console, navigate to Compute → Bare Metal Hosts.
- Select Add Host → New from YAML.
Copy and paste the following YAML, modifying the relevant fields with the details of your host:
- 1: credentialsName must reference a valid Secret CR. The baremetal-operator cannot manage the bare metal host without a valid Secret referenced in the credentialsName. For more information about secrets and how to create them, see Understanding secrets.
- 2: Setting disableCertificateVerification to true disables TLS host validation between the cluster and the baseboard management controller (BMC).
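The original example manifest is not reproduced in this extract. The following is a minimal BareMetalHost sketch that is consistent with the callouts above; the host name, MAC address, BMC address, and Secret name are placeholders rather than values from the original example.
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: openshift-worker-<num>                           # placeholder host name
  namespace: openshift-machine-api
spec:
  online: true
  bootMACAddress: <nic_mac_address>                      # boot NIC MAC address of the host
  bmc:
    address: redfish://<bmc_ip>/redfish/v1/Systems/1     # BMC address; protocol and path depend on your hardware
    credentialsName: openshift-worker-<num>-bmc-secret   # 1: must reference a valid Secret CR
    disableCertificateVerification: true                 # 2: disables TLS host validation with the BMC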
- Select Create to save the YAML and create the new bare metal host.
Scale up the number of replicas to match the number of available bare metal hosts. Navigate to Compute → MachineSets, and increase the number of machines in the cluster by selecting Edit Machine count from the Actions drop-down menu.
Note: You can also manage the number of bare metal nodes using the oc scale command and the appropriate bare metal compute machine set.
11.2.3. Automatically scaling machines to the number of available bare metal hosts Copy linkLink copied to clipboard!
To automatically create the number of Machine objects that matches the number of available BareMetalHost objects, add a metal3.io/autoscale-to-hosts annotation to the MachineSet object.
Prerequisites
- Install RHCOS bare metal compute machines for use in the cluster, and create corresponding BareMetalHost objects.
- Install the OpenShift Container Platform CLI (oc).
- Log in as a user with cluster-admin privileges.
Procedure
Annotate the compute machine set that you want to configure for automatic scaling by adding the metal3.io/autoscale-to-hosts annotation. Replace <machineset> with the name of the compute machine set.
$ oc annotate machineset <machineset> -n openshift-machine-api 'metal3.io/autoscale-to-hosts=<any_value>'
Wait for the new scaled machines to start.
When you use a BareMetalHost object to create a machine in the cluster and labels or selectors are subsequently changed on the BareMetalHost, the BareMetalHost object continues to be counted against the MachineSet that the Machine object was created from.
11.2.4. Removing bare metal hosts from the provisioner node Copy linkLink copied to clipboard!
In certain circumstances, you might want to temporarily remove bare metal hosts from the provisioner node. For example, during provisioning when a bare metal host reboot is triggered by using the OpenShift Container Platform administration console or as a result of a Machine Config Pool update, OpenShift Container Platform logs in to the integrated Dell Remote Access Controller (iDRAC) and issues a delete of the job queue.
To prevent the management of the number of Machine objects that matches the number of available BareMetalHost objects, add a baremetalhost.metal3.io/detached annotation to the MachineSet object.
This annotation has an effect for only BareMetalHost objects that are in either Provisioned, ExternallyProvisioned or Ready/Available state.
Prerequisites
- Install RHCOS bare metal compute machines for use in the cluster and create corresponding BareMetalHost objects.
- Install the OpenShift Container Platform CLI (oc).
- Log in as a user with cluster-admin privileges.
Procedure
Annotate the compute machine set that you want to remove from the provisioner node by adding the baremetalhost.metal3.io/detached annotation.
$ oc annotate machineset <machineset> -n openshift-machine-api 'baremetalhost.metal3.io/detached'
Wait for the new machines to start.
Note: When you use a BareMetalHost object to create a machine in the cluster and labels or selectors are subsequently changed on the BareMetalHost, the BareMetalHost object continues to be counted against the MachineSet that the Machine object was created from.
In the provisioning use case, remove the annotation after the reboot is complete by using the following command:
$ oc annotate machineset <machineset> -n openshift-machine-api 'baremetalhost.metal3.io/detached-'
11.2.5. Powering off bare-metal hosts Copy linkLink copied to clipboard!
You can power off bare-metal cluster hosts in the web console or by applying a patch in the cluster by using the OpenShift CLI (oc). Before you power off a host, you should mark the node as unschedulable and drain all pods and workloads from the node.
Prerequisites
- You have installed a RHCOS compute machine on bare-metal infrastructure for use in the cluster.
- You have logged in as a user with cluster-admin privileges.
- You have configured the host to be managed and have added BMC credentials for the cluster host. You can add BMC credentials by applying a Secret custom resource (CR) in the cluster or by logging in to the web console and configuring the bare-metal host to be managed.
Procedure
In the web console, mark the node that you want to power off as unschedulable. Perform the following steps:
- Navigate to Nodes and select the node that you want to power off. Expand the Actions menu and select Mark as unschedulable.
- Manually delete or relocate running pods on the node by adjusting the pod deployments or scaling down workloads on the node to zero. Wait for the drain process to complete.
- Navigate to Compute → Bare Metal Hosts.
- Expand the Options menu for the bare-metal host that you want to power off, and select Power Off. Select Immediate power off.
Alternatively, you can patch the BareMetalHost resource for the host that you want to power off by using oc.
Get the name of the managed bare-metal host. Run the following command:
$ oc get baremetalhosts -n openshift-machine-api -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.provisioning.state}{"\n"}{end}'
Mark the node as unschedulable:
$ oc adm cordon <bare_metal_host>
- 1: <bare_metal_host> is the host that you want to shut down, for example, worker-2.example.com.
Drain all pods on the node:
$ oc adm drain <bare_metal_host> --force=true
Pods that are backed by replication controllers are rescheduled to other available nodes in the cluster.
Safely power off the bare-metal host. Run the following command:
$ oc patch <bare_metal_host> --type json -p '[{"op": "replace", "path": "/spec/online", "value": false}]'
After you power on the host, make the node schedulable for workloads. Run the following command:
$ oc adm uncordon <bare_metal_host>
Chapter 12. Monitoring bare-metal events with the Bare Metal Event Relay Copy linkLink copied to clipboard!
Bare Metal Event Relay is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
12.1. About bare-metal events Copy linkLink copied to clipboard!
Use the Bare Metal Event Relay to subscribe applications that run in your OpenShift Container Platform cluster to events that are generated on the underlying bare-metal host. The Redfish service publishes events on a node and transmits them on an advanced message queue to subscribed applications.
Bare-metal events are based on the open Redfish standard that is developed under the guidance of the Distributed Management Task Force (DMTF). Redfish provides a secure industry-standard protocol with a REST API. The protocol is used for the management of distributed, converged or software-defined resources and infrastructure.
Hardware-related events published through Redfish include:
- Breaches of temperature limits
- Server status
- Fan status
Begin using bare-metal events by deploying the Bare Metal Event Relay Operator and subscribing your application to the service. The Bare Metal Event Relay Operator installs and manages the lifecycle of the Redfish bare-metal event service.
The Bare Metal Event Relay works only with Redfish-capable devices on single-node clusters provisioned on bare-metal infrastructure.
12.2. How bare-metal events work Copy linkLink copied to clipboard!
The Bare Metal Event Relay enables applications running on bare-metal clusters to respond quickly to Redfish hardware changes and failures such as breaches of temperature thresholds, fan failure, disk loss, power outages, and memory failure. These hardware events are delivered using an HTTP transport or AMQP mechanism. The latency of the messaging service is between 10 and 20 milliseconds.
The Bare Metal Event Relay provides a publish-subscribe service for the hardware events. Applications can use a REST API to subscribe to the events. The Bare Metal Event Relay supports hardware that complies with Redfish OpenAPI v1.8 or later.
12.2.1. Bare Metal Event Relay data flow Copy linkLink copied to clipboard!
The following figure illustrates an example bare-metal events data flow:
Figure 12.1. Bare Metal Event Relay data flow
12.2.1.1. Operator-managed pod Copy linkLink copied to clipboard!
The Operator uses the HardwareEvent custom resource (CR) to manage the pod that contains the Bare Metal Event Relay and its components.
12.2.1.2. Bare Metal Event Relay Copy linkLink copied to clipboard!
At startup, the Bare Metal Event Relay queries the Redfish API and downloads all the message registries, including custom registries. The Bare Metal Event Relay then begins to receive subscribed events from the Redfish hardware.
The Bare Metal Event Relay enables applications running on bare-metal clusters to respond quickly to Redfish hardware changes and failures such as breaches of temperature thresholds, fan failure, disk loss, power outages, and memory failure. The events are reported using the HardwareEvent CR.
12.2.1.3. Cloud native event Copy linkLink copied to clipboard!
Cloud native events (CNE) is a REST API specification for defining the format of event data.
12.2.1.4. CNCF CloudEvents Copy linkLink copied to clipboard!
CloudEvents is a vendor-neutral specification developed by the Cloud Native Computing Foundation (CNCF) for defining the format of event data.
12.2.1.5. HTTP transport or AMQP dispatch router Copy linkLink copied to clipboard!
The HTTP transport or AMQP dispatch router is responsible for the message delivery service between publisher and subscriber.
HTTP transport is the default transport for PTP and bare-metal events. Use HTTP transport instead of AMQP for PTP and bare-metal events where possible. AMQ Interconnect is EOL from 30 June 2024. Extended life cycle support (ELS) for AMQ Interconnect ends 29 November 2029. For more information, see Red Hat AMQ Interconnect support status.
12.2.1.6. Cloud event proxy sidecar Copy linkLink copied to clipboard!
The cloud event proxy sidecar container image is based on the O-RAN API specification and provides a publish-subscribe event framework for hardware events.
12.2.2. Redfish message parsing service Copy linkLink copied to clipboard!
In addition to handling Redfish events, the Bare Metal Event Relay provides message parsing for events without a Message property. The proxy downloads all the Redfish message registries, including vendor-specific registries, from the hardware when it starts. If an event does not contain a Message property, the proxy uses the Redfish message registries to construct the Message and Resolution properties and add them to the event before passing the event to the cloud events framework. This service allows Redfish events to have a smaller message size and lower transmission latency.
12.2.3. Installing the Bare Metal Event Relay using the CLI Copy linkLink copied to clipboard!
As a cluster administrator, you can install the Bare Metal Event Relay Operator by using the CLI.
Prerequisites
- A cluster that is installed on bare-metal hardware with nodes that have a Redfish-enabled Baseboard Management Controller (BMC).
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
Procedure
Create a namespace for the Bare Metal Event Relay.
Save the following YAML in the bare-metal-events-namespace.yaml file:
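The original Namespace manifest is not included in this extract. A minimal sketch, assuming the openshift-bare-metal-events namespace used throughout this procedure, might look like the following:
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-bare-metal-events          # namespace used by the Bare Metal Event Relay Operator
  labels:
    name: openshift-bare-metal-events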
Create the Namespace CR:
$ oc create -f bare-metal-events-namespace.yaml
Create an Operator group for the Bare Metal Event Relay Operator.
Save the following YAML in the bare-metal-events-operatorgroup.yaml file:
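The original OperatorGroup manifest is not included in this extract. A minimal sketch is shown below; the OperatorGroup name is a placeholder.
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: bare-metal-event-relay-group         # placeholder name
  namespace: openshift-bare-metal-events
spec:
  targetNamespaces:
  - openshift-bare-metal-events              # scope the Operator to this namespace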
Create the OperatorGroup CR:
$ oc create -f bare-metal-events-operatorgroup.yaml
Subscribe to the Bare Metal Event Relay.
Save the following YAML in the bare-metal-events-sub.yaml file:
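The original Subscription manifest is not included in this extract. A minimal sketch is shown below; the Subscription name, channel, and package name are assumptions and should be checked against your catalog.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: bare-metal-event-relay-subscription  # placeholder name
  namespace: openshift-bare-metal-events
spec:
  channel: "stable"                          # assumed channel
  name: bare-metal-event-relay               # assumed package name in the redhat-operators catalog
  source: redhat-operators
  sourceNamespace: openshift-marketplace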
Create the Subscription CR:
$ oc create -f bare-metal-events-sub.yaml
Verification
To verify that the Bare Metal Event Relay Operator is installed, run the following command:
$ oc get csv -n openshift-bare-metal-events -o custom-columns=Name:.metadata.name,Phase:.status.phase
12.2.4. Installing the Bare Metal Event Relay using the web console Copy linkLink copied to clipboard!
As a cluster administrator, you can install the Bare Metal Event Relay Operator using the web console.
Prerequisites
- A cluster that is installed on bare-metal hardware with nodes that have a Redfish-enabled Baseboard Management Controller (BMC).
- Log in as a user with cluster-admin privileges.
Procedure
Install the Bare Metal Event Relay using the OpenShift Container Platform web console:
- In the OpenShift Container Platform web console, click Operators → OperatorHub.
- Choose Bare Metal Event Relay from the list of available Operators, and then click Install.
- On the Install Operator page, select or create a Namespace, select openshift-bare-metal-events, and then click Install.
Verification
Optional: You can verify that the Operator installed successfully by performing the following check:
- Switch to the Operators → Installed Operators page.
Ensure that Bare Metal Event Relay is listed in the project with a Status of InstallSucceeded.
Note: During installation an Operator might display a Failed status. If the installation later succeeds with an InstallSucceeded message, you can ignore the Failed message.
If the Operator does not appear as installed, to troubleshoot further:
- Go to the Operators → Installed Operators page and inspect the Operator Subscriptions and Install Plans tabs for any failure or errors under Status.
- Go to the Workloads → Pods page and check the logs for pods in the project namespace.
12.3. Installing the AMQ messaging bus Copy linkLink copied to clipboard!
To pass Redfish bare-metal event notifications between publisher and subscriber on a node, you can install and configure an AMQ messaging bus to run locally on the node. You do this by installing the AMQ Interconnect Operator for use in the cluster.
HTTP transport is the default transport for PTP and bare-metal events. Use HTTP transport instead of AMQP for PTP and bare-metal events where possible. AMQ Interconnect is EOL from 30 June 2024. Extended life cycle support (ELS) for AMQ Interconnect ends 29 November 2029. For more information, see Red Hat AMQ Interconnect support status.
Prerequisites
- Install the OpenShift Container Platform CLI (oc).
- Log in as a user with cluster-admin privileges.
Procedure
- Install the AMQ Interconnect Operator to its own amq-interconnect namespace. See Installing the AMQ Interconnect Operator.
Verification
Verify that the AMQ Interconnect Operator is available and the required pods are running:
$ oc get pods -n amq-interconnect
Example output
NAME                                    READY   STATUS    RESTARTS   AGE
amq-interconnect-645db76c76-k8ghs       1/1     Running   0          23h
interconnect-operator-5cb5fc7cc-4v7qm   1/1     Running   0          23h
Verify that the required bare-metal-event-relay bare-metal event producer pod is running in the openshift-bare-metal-events namespace:
$ oc get pods -n openshift-bare-metal-events
Example output
NAME                                                          READY   STATUS    RESTARTS   AGE
hw-event-proxy-operator-controller-manager-74d5649b7c-dzgtl   2/2     Running   0          25s
12.4. Subscribing to Redfish BMC bare-metal events for a cluster node Copy linkLink copied to clipboard!
You can subscribe to Redfish BMC events generated on a node in your cluster by creating a BMCEventSubscription custom resource (CR) for the node, creating a HardwareEvent CR for the event, and creating a Secret CR for the BMC.
12.4.1. Subscribing to bare-metal events Copy linkLink copied to clipboard!
You can configure the baseboard management controller (BMC) to send bare-metal events to subscribed applications running in an OpenShift Container Platform cluster. Example Redfish bare-metal events include an increase in device temperature, or removal of a device. You subscribe applications to bare-metal events using a REST API.
You can only create a BMCEventSubscription custom resource (CR) for physical hardware that supports Redfish and has a vendor interface set to redfish or idrac-redfish.
Use the BMCEventSubscription CR to subscribe to predefined Redfish events. The Redfish standard does not provide an option to create specific alerts and thresholds. For example, to receive an alert event when an enclosure’s temperature exceeds 40° Celsius, you must manually configure the event according to the vendor’s recommendations.
Perform the following procedure to subscribe to bare-metal events for the node using a BMCEventSubscription CR.
Prerequisites
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
- Get the user name and password for the BMC.
- Deploy a bare-metal node with a Redfish-enabled Baseboard Management Controller (BMC) in your cluster, and enable Redfish events on the BMC.
Note: Enabling Redfish events on specific hardware is outside the scope of this information. For more information about enabling Redfish events for your specific hardware, consult the BMC manufacturer documentation.
Procedure
Confirm that the node hardware has the Redfish EventService enabled by running the following curl command:
$ curl https://<bmc_ip_address>/redfish/v1/EventService --insecure -H 'Content-Type: application/json' -u "<bmc_username>:<password>"
where:
- bmc_ip_address is the IP address of the BMC where the Redfish events are generated.
Get the Bare Metal Event Relay service route for the cluster by running the following command:
$ oc get route -n openshift-bare-metal-events
Example output
NAME             HOST/PORT                                                                PATH   SERVICES                 PORT   TERMINATION   WILDCARD
hw-event-proxy   hw-event-proxy-openshift-bare-metal-events.apps.compute-1.example.com          hw-event-proxy-service   9087   edge          None
Create a BMCEventSubscription resource to subscribe to the Redfish events:
Save the following YAML in the bmc_sub.yaml file:
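The original example is not reproduced in this extract. A minimal BMCEventSubscription sketch that matches the callouts below might look like the following; the subscription name is a placeholder.
apiVersion: metal3.io/v1alpha1
kind: BMCEventSubscription
metadata:
  name: sub-01                               # placeholder name
  namespace: openshift-machine-api
spec:
  hostName: <hostname>                       # 1: name or UUID of the worker node generating the Redfish events
  destination: <proxy_service_url>           # 2: bare-metal event proxy service webhook URL
  context: ""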
- 1: Specifies the name or UUID of the worker node where the Redfish events are generated.
- 2: Specifies the bare-metal event proxy service, for example, https://hw-event-proxy-openshift-bare-metal-events.apps.compute-1.example.com/webhook.
Create the BMCEventSubscription CR:
$ oc create -f bmc_sub.yaml
Optional: To delete the BMC event subscription, run the following command:
$ oc delete -f bmc_sub.yaml
Optional: To manually create a Redfish event subscription without creating a BMCEventSubscription CR, run the following curl command, specifying the BMC username and password.
$ curl -i -k -X POST -H "Content-Type: application/json" -d '{"Destination": "https://<proxy_service_url>", "Protocol" : "Redfish", "EventTypes": ["Alert"], "Context": "root"}' -u <bmc_username>:<password> 'https://<bmc_ip_address>/redfish/v1/EventService/Subscriptions' -v
where:
- proxy_service_url is the bare-metal event proxy service, for example, https://hw-event-proxy-openshift-bare-metal-events.apps.compute-1.example.com/webhook.
- bmc_ip_address is the IP address of the BMC where the Redfish events are generated.
12.4.2. Querying Redfish bare-metal event subscriptions with curl Copy linkLink copied to clipboard!
Some hardware vendors limit the number of Redfish hardware event subscriptions. You can query the number of Redfish event subscriptions by using curl.
Prerequisites
- Get the user name and password for the BMC.
- Deploy a bare-metal node with a Redfish-enabled Baseboard Management Controller (BMC) in your cluster, and enable Redfish hardware events on the BMC.
Procedure
Check the current subscriptions for the BMC by running the following curl command:
$ curl --globoff -H "Content-Type: application/json" -k -X GET --user <bmc_username>:<password> https://<bmc_ip_address>/redfish/v1/EventService/Subscriptions
where:
- bmc_ip_address is the IP address of the BMC where the Redfish events are generated.
In this example, a single subscription is configured: /redfish/v1/EventService/Subscriptions/1.
Optional: To remove the /redfish/v1/EventService/Subscriptions/1 subscription with curl, run the following command, specifying the BMC username and password:
$ curl --globoff -L -w "%{http_code} %{url_effective}\n" -k -u <bmc_username>:<password> -H "Content-Type: application/json" -d '{}' -X DELETE https://<bmc_ip_address>/redfish/v1/EventService/Subscriptions/1
where:
- bmc_ip_address is the IP address of the BMC where the Redfish events are generated.
12.4.3. Creating the bare-metal event and Secret CRs Copy linkLink copied to clipboard!
To start using bare-metal events, create the HardwareEvent custom resource (CR) for the host where the Redfish hardware is present. Hardware events and faults are reported in the hw-event-proxy logs.
Prerequisites
- You have installed the OpenShift Container Platform CLI (oc).
- You have logged in as a user with cluster-admin privileges.
- You have installed the Bare Metal Event Relay.
- You have created a BMCEventSubscription CR for the BMC Redfish hardware.
Procedure
Create the HardwareEvent custom resource (CR):
Note: Multiple HardwareEvent resources are not permitted.
Save the following YAML in the hw-event.yaml file:
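The original example is not reproduced in this extract. A minimal HardwareEvent sketch that matches the callouts below might look like the following; the API group shown is an assumption based on the Bare Metal Event Relay Operator.
apiVersion: event.redhat-cne.org/v1alpha1    # assumed API group for the HardwareEvent CRD
kind: HardwareEvent
metadata:
  name: hardware-event
spec:
  nodeSelector:
    node-role.kubernetes.io/hw-event: ""     # 1: target nodes with this label
  logLevel: "debug"                          # 2: log level for the hw-event-proxy logs
  msgParserTimeout: "10"                     # 3: Message Parser timeout in milliseconds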
- 1: Required. Use the nodeSelector field to target nodes with the specified label, for example, node-role.kubernetes.io/hw-event: "".
  Note: In OpenShift Container Platform 4.13 or later, you do not need to set the spec.transportHost field in the HardwareEvent resource when you use HTTP transport for bare-metal events. Set transportHost only when you use AMQP transport for bare-metal events.
- 2: Optional. The default value is debug. Sets the log level in hw-event-proxy logs. The following log levels are available: fatal, error, warning, info, debug, trace.
- 3: Optional. Sets the timeout value in milliseconds for the Message Parser. If a message parsing request is not responded to within the timeout duration, the original hardware event message is passed to the cloud native event framework. The default value is 10.
Apply the HardwareEvent CR in the cluster:
$ oc create -f hw-event.yaml
Create a BMC username and password Secret CR that enables the hardware events proxy to access the Redfish message registry for the bare-metal host.
Save the following YAML in the hw-event-bmc-secret.yaml file:
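The original Secret manifest is not reproduced in this extract. A minimal sketch consistent with the callout below might look like the following; the Secret name and the hostaddr key are assumptions.
apiVersion: v1
kind: Secret
metadata:
  name: redfish-basic-auth                   # placeholder name
  namespace: openshift-bare-metal-events
type: Opaque
stringData:                                  # 1: enter plain text values under stringData
  username: <bmc_username>
  password: <bmc_password>
  hostaddr: <bmc_host_ip_address>            # assumed key for the BMC host address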
- 1: Enter plain text values for the various items under stringData.
Create the Secret CR:
$ oc create -f hw-event-bmc-secret.yaml
12.5. Subscribing applications to bare-metal events REST API reference Copy linkLink copied to clipboard!
Use the bare-metal events REST API to subscribe an application to the bare-metal events that are generated on the parent node.
Subscribe applications to Redfish events by using the resource address /cluster/node/<node_name>/redfish/event, where <node_name> is the cluster node running the application.
Deploy your cloud-event-consumer application container and cloud-event-proxy sidecar container in a separate application pod. The cloud-event-consumer application subscribes to the cloud-event-proxy container in the application pod.
Use the following API endpoints to subscribe the cloud-event-consumer application to Redfish events posted by the cloud-event-proxy container at http://localhost:8089/api/ocloudNotifications/v1/ in the application pod:
- /api/ocloudNotifications/v1/subscriptions
  - POST: Creates a new subscription
  - GET: Retrieves a list of subscriptions
- /api/ocloudNotifications/v1/subscriptions/<subscription_id>
  - PUT: Creates a new status ping request for the specified subscription ID
- /api/ocloudNotifications/v1/health
  - GET: Returns the health status of the ocloudNotifications API
- 9089 is the default port for the cloud-event-consumer container deployed in the application pod. You can configure a different port for your application as required.
api/ocloudNotifications/v1/subscriptions
HTTP method
GET api/ocloudNotifications/v1/subscriptions
Description
Returns a list of subscriptions. If subscriptions exist, a 200 OK status code is returned along with the list of subscriptions.
Example API response
HTTP method
POST api/ocloudNotifications/v1/subscriptions
Description
Creates a new subscription. If a subscription is successfully created, or if it already exists, a 201 Created status code is returned.
| Parameter | Type |
|---|---|
| subscription | data |
Example payload
{
"uriLocation": "http://localhost:8089/api/ocloudNotifications/v1/subscriptions",
"resource": "/cluster/node/openshift-worker-0.openshift.example.com/redfish/event"
}
api/ocloudNotifications/v1/subscriptions/<subscription_id>
HTTP method
GET api/ocloudNotifications/v1/subscriptions/<subscription_id>
Description
Returns details for the subscription with ID <subscription_id>
| Parameter | Type |
|---|---|
| <subscription_id> | string |
Example API response
api/ocloudNotifications/v1/health/
HTTP method
GET api/ocloudNotifications/v1/health/
Description
Returns the health status for the ocloudNotifications REST API.
Example API response
OK
12.6. Migrating consumer applications to use HTTP transport for PTP or bare-metal events Copy linkLink copied to clipboard!
If you have previously deployed PTP or bare-metal events consumer applications, you need to update the applications to use HTTP message transport.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in as a user with cluster-admin privileges.
- You have updated the PTP Operator or Bare Metal Event Relay to version 4.13+ which uses HTTP transport by default.
Procedure
Update your events consumer application to use HTTP transport. Set the http-event-publishers variable for the cloud event sidecar deployment.
For example, in a cluster with PTP events configured, the following YAML snippet illustrates a cloud event sidecar deployment:
- 1: The PTP Operator automatically resolves NODE_NAME to the host that is generating the PTP events. For example, compute-1.example.com.
In a cluster with bare-metal events configured, set the http-event-publishers field to hw-event-publisher-service.openshift-bare-metal-events.svc.cluster.local:9043 in the cloud event sidecar deployment CR.
Deploy the consumer-events-subscription-service service alongside the events consumer application. For example:
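The example Service manifest is not included in this extract. A minimal sketch might look like the following; the namespace and pod selector labels are assumptions about the consumer application deployment.
apiVersion: v1
kind: Service
metadata:
  name: consumer-events-subscription-service
  namespace: cloud-events                    # assumed namespace of the consumer application
  labels:
    app: consumer-service
spec:
  type: ClusterIP
  selector:
    app: consumer                            # assumed label on the events consumer pods
  ports:
  - name: sub-port
    port: 9043                               # port used for event subscription delivery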
Chapter 13. What huge pages do and how they are consumed by applications Copy linkLink copied to clipboard!
13.1. What huge pages do Copy linkLink copied to clipboard!
Memory is managed in blocks known as pages. On most systems, a page is 4Ki. 1Mi of memory is equal to 256 pages; 1Gi of memory is equal to 262,144 pages, and so on. CPUs have a built-in memory management unit that manages a list of these pages in hardware. The Translation Lookaside Buffer (TLB) is a small hardware cache of virtual-to-physical page mappings. If the virtual address passed in a hardware instruction can be found in the TLB, the mapping can be determined quickly. If not, a TLB miss occurs, and the system falls back to slower, software-based address translation, resulting in performance issues. Since the size of the TLB is fixed, the only way to reduce the chance of a TLB miss is to increase the page size.
A huge page is a memory page that is larger than 4Ki. On x86_64 architectures, there are two common huge page sizes: 2Mi and 1Gi. Sizes vary on other architectures. To use huge pages, code must be written so that applications are aware of them. Transparent Huge Pages (THP) attempt to automate the management of huge pages without application knowledge, but they have limitations. In particular, they are limited to 2Mi page sizes. THP can lead to performance degradation on nodes with high memory utilization or fragmentation due to defragmenting efforts of THP, which can lock memory pages. For this reason, some applications may be designed to use (or may recommend) pre-allocated huge pages instead of THP.
In OpenShift Container Platform, applications in a pod can allocate and consume pre-allocated huge pages.
13.2. How huge pages are consumed by apps Copy linkLink copied to clipboard!
Nodes must pre-allocate huge pages in order for the node to report its huge page capacity. A node can only pre-allocate huge pages for a single size.
Huge pages can be consumed through container-level resource requirements using the resource name hugepages-<size>, where size is the most compact binary notation using integer values supported on a particular node. For example, if a node supports 2048KiB page sizes, it exposes a schedulable resource hugepages-2Mi. Unlike CPU or memory, huge pages do not support over-commitment.
- 1: Specify the amount of memory for hugepages as the exact amount to be allocated. Do not specify this value as the amount of memory for hugepages multiplied by the size of the page. For example, given a huge page size of 2MB, if you want to use 100MB of huge-page-backed RAM for your application, then you would allocate 50 huge pages. OpenShift Container Platform handles the math for you. As in the above example, you can specify 100MB directly.
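The pod example that this callout refers to is not reproduced in this extract. A minimal sketch of a container consuming 2Mi huge pages might look like the following; the pod name and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: hugepages-example                    # placeholder name
spec:
  containers:
  - name: app
    image: registry.example.com/example-app:latest   # placeholder image
    volumeMounts:
    - mountPath: /dev/hugepages
      name: hugepage
    resources:
      limits:
        hugepages-2Mi: 100Mi                 # 1: exact amount of 2Mi huge-page-backed memory to allocate
        memory: "1Gi"
        cpu: "1"
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages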
Allocating huge pages of a specific size
Some platforms support multiple huge page sizes. To allocate huge pages of a specific size, precede the huge pages boot command parameters with a huge page size selection parameter hugepagesz=<size>. The <size> value must be specified in bytes with an optional scale suffix [kKmMgG]. The default huge page size can be defined with the default_hugepagesz=<size> boot parameter.
Huge page requirements
- Huge page requests must equal the limits. This is the default if limits are specified, but requests are not.
- Huge pages are isolated at a pod scope. Container isolation is planned in a future iteration.
- EmptyDir volumes backed by huge pages must not consume more huge page memory than the pod request.
- Applications that consume huge pages via shmget() with SHM_HUGETLB must run with a supplemental group that matches /proc/sys/vm/hugetlb_shm_group.
13.3. Consuming huge pages resources using the Downward API Copy linkLink copied to clipboard!
You can use the Downward API to inject information about the huge pages resources that are consumed by a container.
You can inject the resource allocation as environment variables, a volume plugin, or both. Applications that you develop and run in the container can determine the resources that are available by reading the environment variables or files in the specified volumes.
Procedure
Create a hugepages-volume-pod.yaml file that is similar to the following example:
<.> Specifies to read the resource use from requests.hugepages-1Gi and expose the value as the REQUESTS_HUGEPAGES_1GI environment variable.
<.> Specifies to read the resource use from requests.hugepages-1Gi and expose the value as the file /etc/podinfo/hugepages_1G_request.
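The example file itself is not reproduced in this extract. A sketch consistent with these callouts and with the verification output later in this procedure might look like the following; the image is a placeholder.
apiVersion: v1
kind: Pod
metadata:
  name: hugepages-volume-example
  labels:
    app: hugepages-example
spec:
  containers:
  - name: example
    image: registry.example.com/example-app:latest   # placeholder image
    env:
    - name: REQUESTS_HUGEPAGES_1GI           # <.> exposes requests.hugepages-1Gi as an environment variable
      valueFrom:
        resourceFieldRef:
          containerName: example
          resource: requests.hugepages-1Gi
    volumeMounts:
    - name: podinfo
      mountPath: /etc/podinfo
    resources:
      limits:
        hugepages-1Gi: 2Gi
        memory: "1Gi"
        cpu: "1"
  volumes:
  - name: podinfo
    downwardAPI:
      items:
      - path: "hugepages_1G_request"         # <.> exposes requests.hugepages-1Gi as /etc/podinfo/hugepages_1G_request
        resourceFieldRef:
          containerName: example
          resource: requests.hugepages-1Gi
          divisor: 1Gi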
Create the pod from the hugepages-volume-pod.yaml file:
$ oc create -f hugepages-volume-pod.yaml
Verification
Check the value of the REQUESTS_HUGEPAGES_1GI environment variable:
$ oc exec -it $(oc get pods -l app=hugepages-example -o jsonpath='{.items[0].metadata.name}') \
    -- env | grep REQUESTS_HUGEPAGES_1GI
Example output
REQUESTS_HUGEPAGES_1GI=2147483648
Check the value of the /etc/podinfo/hugepages_1G_request file:
$ oc exec -it $(oc get pods -l app=hugepages-example -o jsonpath='{.items[0].metadata.name}') \
    -- cat /etc/podinfo/hugepages_1G_request
Example output
2
13.4. Configuring huge pages at boot time Copy linkLink copied to clipboard!
Nodes must pre-allocate huge pages used in an OpenShift Container Platform cluster. There are two ways of reserving huge pages: at boot time and at run time. Reserving at boot time increases the possibility of success because the memory has not yet been significantly fragmented. The Node Tuning Operator currently supports boot time allocation of huge pages on specific nodes.
Procedure
To minimize node reboots, follow the steps below in order:
Apply a label to all nodes that need the same huge pages setting:
$ oc label node <node_using_hugepages> node-role.kubernetes.io/worker-hp=
Create a file with the following content and name it hugepages-tuned-boottime.yaml (see the sketch after this procedure).
Create the Tuned hugepages object:
$ oc create -f hugepages-tuned-boottime.yaml
Create a file with the following content and name it hugepages-mcp.yaml (see the sketch after this procedure).
Create the machine config pool:
$ oc create -f hugepages-mcp.yaml
Given enough non-fragmented memory, all the nodes in the worker-hp machine config pool should now have 50 2Mi huge pages allocated.
$ oc get node <node_using_hugepages> -o jsonpath="{.status.allocatable.hugepages-2Mi}"
100Mi
The TuneD bootloader plugin only supports Red Hat Enterprise Linux CoreOS (RHCOS) worker nodes.
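The hugepages-tuned-boottime.yaml and hugepages-mcp.yaml files referenced in the procedure above are not reproduced in this extract. The following is a minimal sketch of what they might contain, assuming 50 2Mi huge pages and the worker-hp label used in this procedure; the object names are placeholders.
# hugepages-tuned-boottime.yaml (sketch)
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: hugepages
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - name: openshift-node-hugepages
    data: |
      [main]
      summary=Boot time configuration for hugepages
      include=openshift-node
      [bootloader]
      cmdline_openshift_node_hugepages=hugepagesz=2M hugepages=50
  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: "worker-hp"
    priority: 30
    profile: openshift-node-hugepages
---
# hugepages-mcp.yaml (sketch)
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-hp
  labels:
    worker-hp: ""
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker, worker-hp]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-hp: ""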
13.5. Disabling Transparent Huge Pages Copy linkLink copied to clipboard!
Transparent Huge Pages (THP) attempt to automate most aspects of creating, managing, and using huge pages. Since THP automatically manages the huge pages, this is not always handled optimally for all types of workloads. THP can lead to performance regressions, since many applications handle huge pages on their own. Therefore, consider disabling THP. The following steps describe how to disable THP using the Node Tuning Operator (NTO).
Procedure
Create a file with the following content and name it thp-disable-tuned.yaml:
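The original file content is not reproduced in this extract. A minimal sketch of a TuneD profile that disables THP on worker nodes might look like the following; the profile name and node label match are assumptions.
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: thp-workers-profile
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - name: openshift-thp-never-worker
    data: |
      [main]
      summary=Custom tuned profile that disables Transparent Huge Pages
      include=openshift-node
      [vm]
      transparent_hugepages=never
  recommend:
  - match:
    - label: node-role.kubernetes.io/worker
    priority: 25
    profile: openshift-thp-never-worker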
Create the Tuned object:
$ oc create -f thp-disable-tuned.yaml
Check the list of active profiles:
$ oc get profile -n openshift-cluster-node-tuning-operator
Verification
Log in to one of the nodes and do a regular THP check to verify if the nodes applied the profile successfully:
$ cat /sys/kernel/mm/transparent_hugepage/enabled
Example output
always madvise [never]
Chapter 14. Understanding low latency tuning for cluster nodes Copy linkLink copied to clipboard!
Edge computing has a key role in reducing latency and congestion problems and improving application performance for telco and 5G network applications. Maintaining a network architecture with the lowest possible latency is key for meeting the network performance requirements of 5G. Compared to 4G technology, with an average latency of 50 ms, 5G is targeted to reach latency of 1 ms or less. This reduction in latency boosts wireless throughput by a factor of 10.
14.1. About low latency Copy linkLink copied to clipboard!
Many of the applications deployed in the Telco space require low latency and can tolerate only zero packet loss. Tuning for zero packet loss helps mitigate the inherent issues that degrade network performance. For more information, see Tuning for Zero Packet Loss in Red Hat OpenStack Platform (RHOSP).
The Edge computing initiative also comes into play for reducing latency rates. Think of it as being on the edge of the cloud and closer to the user. This greatly reduces the distance between the user and distant data centers, resulting in reduced application response times and performance latency.
Administrators must be able to manage their many Edge sites and local services in a centralized way so that all of the deployments can run at the lowest possible management cost. They also need an easy way to deploy and configure certain nodes of their cluster for real-time low latency and high-performance purposes. Low latency nodes are useful for applications such as Cloud-native Network Functions (CNF) and Data Plane Development Kit (DPDK).
OpenShift Container Platform currently provides mechanisms to tune software on an OpenShift Container Platform cluster for real-time running and low latency (around <20 microseconds reaction time). This includes tuning the kernel and OpenShift Container Platform settings, installing a kernel, and reconfiguring the machine. But this method requires setting up four different Operators and performing many configurations that, when done manually, are complex and prone to mistakes.
OpenShift Container Platform uses the Node Tuning Operator to implement automatic tuning to achieve low latency performance for OpenShift Container Platform applications. The cluster administrator uses a performance profile configuration to make these changes in a more reliable way. The administrator can specify whether to update the kernel to kernel-rt, reserve CPUs for cluster and operating system housekeeping duties, including pod infra containers, and isolate CPUs for application containers to run the workloads.
In OpenShift Container Platform 4.14, if you apply a performance profile to your cluster, all nodes in the cluster will reboot. This reboot includes control plane nodes and worker nodes that were not targeted by the performance profile. This is a known issue in OpenShift Container Platform 4.14 because this release uses Linux control group version 2 (cgroup v2) in alignment with RHEL 9. The low latency tuning features associated with the performance profile do not support cgroup v2, therefore the nodes reboot to switch back to the cgroup v1 configuration.
To revert all nodes in the cluster to the cgroups v2 configuration, you must edit the Node resource. (OCPBUGS-16976)
In Telco, clusters using PerformanceProfile for low latency, real-time, and Data Plane Development Kit (DPDK) workloads automatically revert to cgroups v1 due to the lack of cgroups v2 support. Enabling cgroup v2 is not supported if you are using PerformanceProfile.
OpenShift Container Platform also supports workload hints for the Node Tuning Operator that can tune the PerformanceProfile to meet the demands of different industry environments. Workload hints are available for highPowerConsumption (very low latency at the cost of increased power consumption) and realTime (priority given to optimum latency). A combination of true/false settings for these hints can be used to deal with application-specific workload profiles and requirements.
Workload hints simplify the fine-tuning of performance to industry sector settings. Instead of a “one size fits all” approach, workload hints can cater to usage patterns such as placing priority on:
- Low latency
- Real-time capability
- Efficient use of power
Ideally, all of the previously listed items are prioritized. However, some of these items come at the expense of others. The Node Tuning Operator is now aware of the workload expectations and is better able to meet the demands of the workload. The cluster administrator can now specify the use case into which the workload falls. The Node Tuning Operator uses the PerformanceProfile to fine-tune the performance settings for the workload.
The environment in which an application is operating influences its behavior. For a typical data center with no strict latency requirements, only minimal default tuning is needed that enables CPU partitioning for some high performance workload pods. For data centers and workloads where latency is a higher priority, measures are still taken to optimize power consumption. The most complicated cases are clusters close to latency-sensitive equipment such as manufacturing machinery and software-defined radios. This last class of deployment is often referred to as Far edge. For Far edge deployments, ultra-low latency is the ultimate priority, and is achieved at the expense of power management.
14.2. About Hyper-Threading for low latency and real-time applications Copy linkLink copied to clipboard!
Hyper-Threading is an Intel processor technology that allows a physical CPU processor core to function as two logical cores, executing two independent threads simultaneously. Hyper-Threading allows for better system throughput for certain workload types where parallel processing is beneficial. The default OpenShift Container Platform configuration expects Hyper-Threading to be enabled.
For telecommunications applications, it is important to design your application infrastructure to minimize latency as much as possible. Hyper-Threading can slow performance times and negatively affect throughput for compute-intensive workloads that require low latency. Disabling Hyper-Threading ensures predictable performance and can decrease processing times for these workloads.
Hyper-Threading implementation and configuration differs depending on the hardware you are running OpenShift Container Platform on. Consult the relevant host hardware tuning information for more details of the Hyper-Threading implementation specific to that hardware. Disabling Hyper-Threading can increase the cost per core of the cluster.
Chapter 15. Tuning nodes for low latency with the performance profile Copy linkLink copied to clipboard!
Tune nodes for low latency by using the cluster performance profile. You can restrict CPUs for infra and application containers, configure huge pages and Hyper-Threading, and configure CPU partitions for latency-sensitive processes.
15.1. Creating a performance profile Copy linkLink copied to clipboard!
You can create a cluster performance profile by using the Performance Profile Creator (PPC) tool. The PPC is a function of the Node Tuning Operator.
The PPC combines information about your cluster with user-supplied configurations to generate a performance profile that is appropriate to your hardware, topology, and use case.
Performance profiles are applicable only to bare-metal environments where the cluster has direct access to the underlying hardware resources. You can configure performance profiles for both single-node OpenShift and multi-node clusters.
The following is a high-level workflow for creating and applying a performance profile in your cluster:
- Create a machine config pool (MCP) for nodes that you want to target with performance configurations. In single-node OpenShift clusters, you must use the master MCP because there is only one node in the cluster.
- Gather information about your cluster using the must-gather command.
- Use the PPC tool to create a performance profile by using either of the following methods:
- Run the PPC tool by using Podman.
- Run the PPC tool by using a wrapper script.
- Configure the performance profile for your use case and apply the performance profile to your cluster.
In Telco, clusters using PerformanceProfile for low latency, real-time, and Data Plane Development Kit (DPDK) workloads automatically revert to cgroups v1 due to the lack of cgroups v2 support. Enabling cgroup v2 is not supported if you are using PerformanceProfile.
15.1.1. About the Performance Profile Creator Copy linkLink copied to clipboard!
The Performance Profile Creator (PPC) is a command-line tool, delivered with the Node Tuning Operator, that can help you to create a performance profile for your cluster.
Initially, you can use the PPC tool to process the must-gather data to display key performance configurations for your cluster, including the following information:
- NUMA cell partitioning with the allocated CPU IDs
- Hyper-Threading node configuration
You can use this information to help you configure the performance profile.
Running the PPC
Specify performance configuration arguments to the PPC tool to generate a proposed performance profile that is appropriate for your hardware, topology, and use-case.
You can run the PPC by using one of the following methods:
- Run the PPC by using Podman
- Run the PPC by using the wrapper script
Using the wrapper script abstracts some of the more granular Podman tasks into an executable script. For example, the wrapper script handles tasks such as pulling and running the required container image, mounting directories into the container, and providing parameters directly to the container through Podman. Both methods achieve the same result.
15.1.2. Creating a machine config pool to target nodes for performance tuning Copy linkLink copied to clipboard!
For multi-node clusters, you can define a machine config pool (MCP) to identify the target nodes that you want to configure with a performance profile.
In single-node OpenShift clusters, you must use the master MCP because there is only one node in the cluster. You do not need to create a separate MCP for single-node OpenShift clusters.
Prerequisites
- You have cluster-admin role access.
- You installed the OpenShift CLI (oc).
Procedure
Label the target nodes for configuration by running the following command:
$ oc label node <node_name> node-role.kubernetes.io/worker-cnf=""
- 1: Replace <node_name> with the name of your node. This example applies the worker-cnf label.
Create a MachineConfigPool resource containing the target nodes:
Create a YAML file that defines the MachineConfigPool resource:
Example mcp-worker-cnf.yaml file
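The original example file is not reproduced in this extract. A minimal sketch of an mcp-worker-cnf.yaml file for the worker-cnf label used in this procedure might look like the following:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-cnf
  labels:
    machineconfiguration.openshift.io/role: worker-cnf
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker, worker-cnf]}
  paused: false
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-cnf: ""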
Apply the MachineConfigPool resource by running the following command:
$ oc apply -f mcp-worker-cnf.yaml
Example output
machineconfigpool.machineconfiguration.openshift.io/worker-cnf created
Verification
Check the machine config pools in your cluster by running the following command:
$ oc get mcp
Example output
NAME         CONFIG                                                  UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master       rendered-master-58433c7c3c1b4ed5ffef95234d451490       True      False      False      3              3                   3                     0                      6h46m
worker       rendered-worker-168f52b168f151e4f853259729b6azc4       True      False      False      2              2                   2                     0                      6h46m
worker-cnf   rendered-worker-cnf-168f52b168f151e4f853259729b6azc4   True      False      False      1              1                   1                     0                      73s
15.1.3. Gathering data about your cluster for the PPC Copy linkLink copied to clipboard!
The Performance Profile Creator (PPC) tool requires must-gather data. As a cluster administrator, run the must-gather command to capture information about your cluster.
Prerequisites
- Access to the cluster as a user with the cluster-admin role.
- You installed the OpenShift CLI (oc).
- You identified a target MCP that you want to configure with a performance profile.
Procedure
-
Navigate to the directory where you want to store the
must-gatherdata. Collect cluster information by running the following command:
$ oc adm must-gather

The command creates a folder with the must-gather data in your local directory with a naming format similar to the following: must-gather.local.1971646453781853027.

Optional: Create a compressed file from the must-gather directory:

$ tar cvaf must-gather.tar.gz <must_gather_folder>

Replace <must_gather_folder> with the name of the must-gather data folder.
Note: Compressed output is required if you are running the Performance Profile Creator wrapper script.
15.1.4. Running the Performance Profile Creator using Podman
As a cluster administrator, you can use Podman with the Performance Profile Creator (PPC) to create a performance profile.
For more information about the PPC arguments, see the section "Performance Profile Creator arguments".
The PPC uses the must-gather data from your cluster to create the performance profile. If you make any changes to your cluster, such as relabeling a node targeted for performance configuration, you must re-create the must-gather data before running PPC again.
Prerequisites
- Access to the cluster as a user with the cluster-admin role.
- A cluster installed on bare-metal hardware.
- You installed podman and the OpenShift CLI (oc).
- Access to the Node Tuning Operator image.
- You identified a machine config pool containing target nodes for configuration.
- You have access to the must-gather data for your cluster.
Procedure
Check the machine config pool by running the following command:
$ oc get mcp

Example output

NAME         CONFIG                                                  UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master       rendered-master-58433c8c3c0b4ed5feef95434d455490       True      False      False      3              3                   3                     0                      8h
worker       rendered-worker-668f56a164f151e4a853229729b6adc4       True      False      False      2              2                   2                     0                      8h
worker-cnf   rendered-worker-cnf-668f56a164f151e4a853229729b6adc4   True      False      False      1              1                   1                     0                      79m

Use Podman to authenticate to registry.redhat.io by running the following command:
$ podman login registry.redhat.io

Username: <user_name>
Password: <password>

Optional: Display help for the PPC tool by running the following command:
$ podman run --rm --entrypoint performance-profile-creator registry.redhat.io/openshift4/ose-cluster-node-tuning-rhel9-operator:v4.14 -h

To display information about the cluster, run the PPC tool with the log argument by running the following command:

$ podman run --entrypoint performance-profile-creator -v <path_to_must_gather>:/must-gather:z registry.redhat.io/openshift4/ose-cluster-node-tuning-rhel9-operator:v4.14 --info log --must-gather-dir-path /must-gather
- --entrypoint performance-profile-creator defines the performance profile creator as a new entry point to podman.
- -v <path_to_must_gather> specifies the path to either of the following components:
  - The directory containing the must-gather data.
  - An existing directory containing the must-gather decompressed .tar file.
- --info log specifies a value for the output format.
Create a performance profile by running the following command. The example uses sample PPC arguments and values:

$ podman run --entrypoint performance-profile-creator -v <path_to_must_gather>:/must-gather:z registry.redhat.io/openshift4/ose-cluster-node-tuning-rhel9-operator:v4.14 --mcp-name=worker-cnf --reserved-cpu-count=1 --rt-kernel=true --split-reserved-cpus-across-numa=false --must-gather-dir-path /must-gather --power-consumption-mode=ultra-low-latency --offlined-cpu-count=1 > my-performance-profile.yaml

- -v <path_to_must_gather> specifies the path to either of the following components:
  - The directory containing the must-gather data.
  - The directory containing the must-gather decompressed .tar file.
- --mcp-name=worker-cnf specifies the worker-cnf machine config pool.
- --reserved-cpu-count=1 specifies one reserved CPU.
- --rt-kernel=true enables the real-time kernel.
- --split-reserved-cpus-across-numa=false disables splitting the reserved CPUs across NUMA nodes.
- --power-consumption-mode=ultra-low-latency specifies minimal latency at the cost of increased power consumption.
- --offlined-cpu-count=1 specifies one offlined CPU.

Note: The mcp-name argument in this example is set to worker-cnf based on the output of the command oc get mcp. For single-node OpenShift, use --mcp-name=master.
Review the created YAML file by running the following command:
$ cat my-performance-profile.yaml
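The generated profile is not reproduced in this extract; it resembles the following minimal sketch. The exact CPU sets, node selector, and other values are computed by the PPC from your must-gather data, so treat everything here as a placeholder:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  cpu:
    isolated: 2-7        # placeholder; computed by the PPC from the cluster topology
    offlined: "1"
    reserved: "0"
  machineConfigPoolSelector:
    machineconfiguration.openshift.io/role: worker-cnf
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  realTimeKernel:
    enabled: true
  workloadHints:
    highPowerConsumption: true
    realTime: true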
Apply the generated profile:
$ oc apply -f my-performance-profile.yaml

Example output

performanceprofile.performance.openshift.io/performance created
15.1.5. Running the Performance Profile Creator wrapper script
The wrapper script simplifies the process of creating a performance profile with the Performance Profile Creator (PPC) tool. The script handles tasks such as pulling and running the required container image, mounting directories into the container, and providing parameters directly to the container through Podman.
For more information about the Performance Profile Creator arguments, see the section "Performance Profile Creator arguments".
The PPC uses the must-gather data from your cluster to create the performance profile. If you make any changes to your cluster, such as relabeling a node targeted for performance configuration, you must re-create the must-gather data before running PPC again.
Prerequisites
- Access to the cluster as a user with the cluster-admin role.
- A cluster installed on bare-metal hardware.
- You installed podman and the OpenShift CLI (oc).
- Access to the Node Tuning Operator image.
- You identified a machine config pool containing target nodes for configuration.
- Access to the must-gather tarball.
Procedure
Create a file on your local machine named, for example, run-perf-profile-creator.sh:

$ vi run-perf-profile-creator.sh

Paste the following code into the file:
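The original script body is not reproduced in this extract. As an illustration only, a minimal wrapper might look like the following sketch; the variable names, option handling, and default image are assumptions, not the shipped script:

#!/bin/bash
# Minimal illustrative wrapper around the Performance Profile Creator (PPC).
# Assumes podman is installed and you are logged in to registry.redhat.io.

readonly DEFAULT_IMG="registry.redhat.io/openshift4/ose-cluster-node-tuning-rhel9-operator:v4.14"

usage() {
  echo "Usage: $0 -t <path_to_must_gather_tarball> [-p <ppc_image>] -- [PPC arguments]"
  exit 1
}

IMG="${DEFAULT_IMG}"
TARBALL=""

while getopts "t:p:h" opt; do
  case "${opt}" in
    t) TARBALL="${OPTARG}" ;;
    p) IMG="${OPTARG}" ;;
    *) usage ;;
  esac
done
shift $((OPTIND - 1))

[ -f "${TARBALL}" ] || usage

# Extract the must-gather tarball into a temporary directory, mount it into the
# PPC container, and forward the remaining arguments (after --) to the tool.
DATA_DIR="$(mktemp -d)"
tar -xzf "${TARBALL}" -C "${DATA_DIR}"

podman run --rm --entrypoint performance-profile-creator \
  -v "${DATA_DIR}:/must-gather:z" "${IMG}" \
  --must-gather-dir-path /must-gather "$@"

rm -rf "${DATA_DIR}"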
Add execute permissions for everyone on this script:
$ chmod a+x run-perf-profile-creator.sh

Use Podman to authenticate to registry.redhat.io by running the following command:

$ podman login registry.redhat.io

Username: <user_name>
Password: <password>

Optional: Display help for the PPC tool by running the following command:
$ ./run-perf-profile-creator.sh -h

Note: You can optionally set a path for the Node Tuning Operator image using the -p option. If you do not set a path, the wrapper script uses the default image: registry.redhat.io/openshift4/ose-cluster-node-tuning-rhel9-operator:v4.14.

To display information about the cluster, run the PPC tool with the log argument by running the following command:

$ ./run-perf-profile-creator.sh -t /<path_to_must_gather_dir>/must-gather.tar.gz -- --info=log

-t /<path_to_must_gather_dir>/must-gather.tar.gz specifies the path to the directory containing the must-gather tarball. This is a required argument for the wrapper script.
Create a performance profile by running the following command:

$ ./run-perf-profile-creator.sh -t /path-to-must-gather/must-gather.tar.gz -- --mcp-name=worker-cnf --reserved-cpu-count=1 --rt-kernel=true --split-reserved-cpus-across-numa=false --power-consumption-mode=ultra-low-latency --offlined-cpu-count=1 > my-performance-profile.yaml

This example uses sample PPC arguments and values.
- --mcp-name=worker-cnf specifies the worker-cnf machine config pool.
- --reserved-cpu-count=1 specifies one reserved CPU.
- --rt-kernel=true enables the real-time kernel.
- --split-reserved-cpus-across-numa=false disables splitting the reserved CPUs across NUMA nodes.
- --power-consumption-mode=ultra-low-latency specifies minimal latency at the cost of increased power consumption.
- --offlined-cpu-count=1 specifies one offlined CPU.

Note: The mcp-name argument in this example is set to worker-cnf based on the output of the command oc get mcp. For single-node OpenShift, use --mcp-name=master.
Review the created YAML file by running the following command:

$ cat my-performance-profile.yaml

Apply the generated profile:
$ oc apply -f my-performance-profile.yaml

Example output

performanceprofile.performance.openshift.io/performance created
15.1.6. Performance Profile Creator arguments
| Argument | Description |
|---|---|
| mcp-name | Name for MCP; for example, worker-cnf corresponds to the target machines. |
| must-gather-dir-path | The path of the must gather directory. This argument is only required if you run the PPC tool by using Podman. If you use the PPC with the wrapper script, do not use this argument. Instead, specify the directory path to the must-gather tarball by using the wrapper script -t option. |
| reserved-cpu-count | Number of reserved CPUs. Use a natural number greater than zero. |
| rt-kernel | Enables the real-time kernel. Possible values: true or false. |
| disable-ht | Disable Hyper-Threading. Possible values: true or false. Default: false. Warning: If this argument is set to true, you should not disable Hyper-Threading in the BIOS. Disabling Hyper-Threading is accomplished with a kernel argument. |
| info | This captures cluster information. This argument also requires the must-gather-dir-path argument. If any other arguments are set, they are ignored. Possible values: log, json. Default: log. |
| offlined-cpu-count | Number of offlined CPUs. Use a natural number greater than zero. If not enough logical processors are offlined, then error messages are logged. The messages are: "Error: failed to compute the reserved and isolated CPUs: please ensure that reserved-cpu-count plus offlined-cpu-count should be in the range [0,1]" and "Error: failed to compute the reserved and isolated CPUs: please specify the offlined CPU count in the range [0,1]". |
| power-consumption-mode | The power consumption mode. Possible values: default, low-latency, ultra-low-latency. Default: default. |
| per-pod-power-management | Enable per pod power management. You cannot use this argument if you configured ultra-low-latency as the power consumption mode. Possible values: true or false. Default: false. |
| profile-name | Name of the performance profile to create. Default: performance. |
| split-reserved-cpus-across-numa | Split the reserved CPUs across NUMA nodes. Possible values: true or false. Default: false. |
| topology-manager-policy | Kubelet Topology Manager policy of the performance profile to be created. Possible values: single-numa-node, best-effort, restricted. Default: restricted. |
| user-level-networking | Run with user level networking (DPDK) enabled. Possible values: true or false. Default: false. |
15.1.7. Reference performance profiles
Use the following reference performance profiles as the basis to develop your own custom profiles.
15.1.7.1. Performance profile template for clusters that use OVS-DPDK on OpenStack
To maximize machine performance in a cluster that uses Open vSwitch with the Data Plane Development Kit (OVS-DPDK) on Red Hat OpenStack Platform (RHOSP), you can use a performance profile.
You can use the following performance profile template to create a profile for your deployment.
Performance profile template for clusters that use OVS-DPDK
Insert values that are appropriate for your configuration for the CPU_ISOLATED, CPU_RESERVED, and HUGEPAGES_COUNT keys.
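The full template is not reproduced in this extract. The following minimal sketch shows the general shape of such a profile with the CPU_ISOLATED, CPU_RESERVED, and HUGEPAGES_COUNT placeholders mentioned above; the profile name, kernel arguments, and node selector are illustrative assumptions:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: cnf-performanceprofile
spec:
  additionalKernelArgs:
    - nmi_watchdog=0        # illustrative kernel arguments
    - audit=0
  cpu:
    isolated: <CPU_ISOLATED>
    reserved: <CPU_RESERVED>
  hugepages:
    defaultHugepagesSize: 1G
    pages:
      - count: <HUGEPAGES_COUNT>
        node: 0
        size: 1G
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  realTimeKernel:
    enabled: false
  globallyDisableIrqLoadBalancing: true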
15.1.7.2. Telco RAN DU reference design performance profile
The following performance profile configures node-level performance settings for OpenShift Container Platform clusters on commodity hardware to host telco RAN DU workloads.
Telco RAN DU reference design performance profile
15.1.7.3. Telco core reference design performance profile
The following performance profile configures node-level performance settings for OpenShift Container Platform clusters on commodity hardware to host telco core workloads.
Telco core reference design performance profile
15.2. Supported performance profile API versions
The Node Tuning Operator supports v2, v1, and v1alpha1 for the performance profile apiVersion field. The v1 and v1alpha1 APIs are identical. The v2 API includes an optional boolean field globallyDisableIrqLoadBalancing with a default value of false.
Upgrading the performance profile to use device interrupt processing
When you upgrade the Node Tuning Operator performance profile custom resource definition (CRD) from v1 or v1alpha1 to v2, globallyDisableIrqLoadBalancing is set to true on existing profiles.
globallyDisableIrqLoadBalancing toggles whether IRQ load balancing will be disabled for the Isolated CPU set. When the option is set to true it disables IRQ load balancing for the Isolated CPU set. Setting the option to false allows the IRQs to be balanced across all CPUs.
Upgrading Node Tuning Operator API from v1alpha1 to v1
When upgrading Node Tuning Operator API version from v1alpha1 to v1, the v1alpha1 performance profiles are converted on-the-fly using a "None" Conversion strategy and served to the Node Tuning Operator with API version v1.
Upgrading Node Tuning Operator API from v1alpha1 or v1 to v2
When upgrading from an older Node Tuning Operator API version, the existing v1 and v1alpha1 performance profiles are converted using a conversion webhook that injects the globallyDisableIrqLoadBalancing field with a value of true.
15.3. Configuring node power consumption and realtime processing with workload hints
Procedure
- Create a PerformanceProfile appropriate for the environment's hardware and topology by using the Performance Profile Creator (PPC) tool. The following table describes the possible values set for the power-consumption-mode flag associated with the PPC tool and the workload hint that is applied.
| Performance Profile creator setting | Hint | Environment | Description |
|---|---|---|---|
| Default | workloadHints: highPowerConsumption: false, realTime: false | High throughput cluster without latency requirements | Performance achieved through CPU partitioning only. |
| Low-latency | workloadHints: highPowerConsumption: false, realTime: true | Regional data centers | Both energy savings and low latency are desirable: a compromise between power management, latency, and throughput. |
| Ultra-low-latency | workloadHints: highPowerConsumption: true, realTime: true | Far edge clusters, latency critical workloads | Optimized for absolute minimal latency and maximum determinism at the cost of increased power consumption. |
| Per-pod power management | workloadHints: realTime: true, highPowerConsumption: false, perPodPowerManagement: true | Critical and non-critical workloads | Allows for power management per pod. |
Example

The following configuration is commonly used in a telco RAN DU deployment. The realTime: true setting disables some debugging and monitoring features that can affect system latency.
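The referenced configuration is not shown in this extract; a minimal sketch of the workloadHints fragment that matches this description might look like the following (treat it as an illustration, not the exact reference configuration):

  workloadHints:
    realTime: true               # disables some debugging and monitoring features that can affect system latency
    highPowerConsumption: false
    perPodPowerManagement: false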
When the realTime workload hint flag is set to true in a performance profile, add the cpu-quota.crio.io: disable annotation to every guaranteed pod with pinned CPUs. This annotation is necessary to prevent the degradation of the process performance within the pod. If the realTime workload hint is not explicitly set, it defaults to true.
For more information about how combinations of power consumption and real-time settings impact latency, see Understanding workload hints.
15.4. Configuring power saving for nodes that run colocated high and low priority workloads
You can enable power savings for a node that has low priority workloads that are colocated with high priority workloads without impacting the latency or throughput of the high priority workloads. Power saving is possible without modifications to the workloads themselves.
The feature is supported on Intel Ice Lake and later generations of Intel CPUs. The capabilities of the processor might impact the latency and throughput of the high priority workloads.
Prerequisites
- You enabled C-states and operating system-controlled P-states in the BIOS.
Procedure
Generate a PerformanceProfile with the per-pod-power-management argument set to true. The power-consumption-mode argument must be default or low-latency when the per-pod-power-management argument is set to true.
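The command itself is not shown in this extract. Reusing the wrapper script from the earlier section, one way to generate such a profile might be the following; the paths and the remaining argument values are assumptions you would adapt to your cluster:

$ ./run-perf-profile-creator.sh -t /path-to-must-gather/must-gather.tar.gz -- \
    --mcp-name=worker-cnf \
    --reserved-cpu-count=1 \
    --rt-kernel=true \
    --power-consumption-mode=low-latency \
    --per-pod-power-management=true \
    > my-performance-profile.yaml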
Example PerformanceProfile with perPodPowerManagement: see the combined sketch after this step.

Set the default cpufreq governor as an additional kernel argument in the PerformanceProfile custom resource (CR). Using the schedutil governor is recommended; however, you can use other governors such as the ondemand or powersave governors.
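Neither example is reproduced in this extract. The following is a minimal sketch of a PerformanceProfile that combines both settings: per-pod power management in workloadHints and the default cpufreq governor as an additional kernel argument. The profile name, CPU ranges, and node selector are illustrative assumptions:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance              # assumed profile name
spec:
  cpu:
    isolated: 2-7                # placeholder CPU ranges
    reserved: 0-1
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  workloadHints:
    realTime: true
    highPowerConsumption: false
    perPodPowerManagement: true                  # enables per-pod power management
  additionalKernelArgs:
    - cpufreq.default_governor=schedutil         # recommended default governor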
Set the maximum CPU frequency in the TunedPerformancePatch CR:

spec:
  profile:
    - data: |
        [sysfs]
        /sys/devices/system/cpu/intel_pstate/max_perf_pct = <x>

The max_perf_pct setting controls the maximum frequency that the cpufreq driver is allowed to set, as a percentage of the maximum supported CPU frequency. This value applies to all CPUs. You can check the maximum supported frequency in /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq. As a starting point, you can use a percentage that caps all CPUs at the All Cores Turbo frequency. The All Cores Turbo frequency is the frequency that all cores run at when they are all fully occupied.
15.5. Restricting CPUs for infra and application containers
Generic housekeeping and workload tasks use CPUs in a way that may impact latency-sensitive processes. By default, the container runtime uses all online CPUs to run all containers together, which can result in context switches and spikes in latency. Partitioning the CPUs prevents noisy processes from interfering with latency-sensitive processes by separating them from each other. The following table describes how processes run on a CPU after you have tuned the node using the Node Tuning Operator:
| Process type | Details |
|---|---|
| Burstable and BestEffort pods | Runs on any CPU except where low latency workload is running |
| Infrastructure pods | Runs on any CPU except where low latency workload is running |
| Interrupts | Redirects to reserved CPUs (optional in OpenShift Container Platform 4.7 and later) |
| Kernel processes | Pins to reserved CPUs |
| Latency-sensitive workload pods | Pins to a specific set of exclusive CPUs from the isolated pool |
| OS processes/systemd services | Pins to reserved CPUs |
The allocatable capacity of cores on a node for pods of all QoS process types, Burstable, BestEffort, or Guaranteed, is equal to the capacity of the isolated pool. The capacity of the reserved pool is removed from the node’s total core capacity for use by the cluster and operating system housekeeping duties.
Example 1
A node features a capacity of 100 cores. Using a performance profile, the cluster administrator allocates 50 cores to the isolated pool and 50 cores to the reserved pool. The cluster administrator assigns 25 cores to QoS Guaranteed pods and 25 cores for BestEffort or Burstable pods. This matches the capacity of the isolated pool.
Example 2
A node features a capacity of 100 cores. Using a performance profile, the cluster administrator allocates 50 cores to the isolated pool and 50 cores to the reserved pool. The cluster administrator assigns 50 cores to QoS Guaranteed pods and one core for BestEffort or Burstable pods. This exceeds the capacity of the isolated pool by one core. Pod scheduling fails because of insufficient CPU capacity.
The exact partitioning pattern to use depends on many factors like hardware, workload characteristics and the expected system load. Some sample use cases are as follows:
- If the latency-sensitive workload uses specific hardware, such as a network interface controller (NIC), ensure that the CPUs in the isolated pool are as close as possible to this hardware. At a minimum, you should place the workload in the same Non-Uniform Memory Access (NUMA) node.
- The reserved pool is used for handling all interrupts. When depending on system networking, allocate a sufficiently-sized reserve pool to handle all the incoming packet interrupts. In 4.14 and later versions, workloads can optionally be labeled as sensitive.
The decision regarding which specific CPUs should be used for reserved and isolated partitions requires detailed analysis and measurements. Factors like NUMA affinity of devices and memory play a role. The selection also depends on the workload architecture and the specific use case.
The reserved and isolated CPU pools must not overlap and together must span all available cores in the worker node.
To ensure that housekeeping tasks and workloads do not interfere with each other, specify two groups of CPUs in the spec section of the performance profile.
- isolated - Specifies the CPUs for the application container workloads. These CPUs have the lowest latency. Processes in this group have no interruptions and can, for example, reach much higher DPDK zero packet loss bandwidth.
- reserved - Specifies the CPUs for the cluster and operating system housekeeping duties. Threads in the reserved group are often busy. Do not run latency-sensitive applications in the reserved group. Latency-sensitive applications run in the isolated group.
Procedure
- Create a performance profile appropriate for the environment’s hardware and topology.
Add the reserved and isolated parameters with the CPUs you want reserved and isolated for the infra and application containers:
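The example resource is not shown in this extract; a minimal sketch of the relevant cpu section might look like the following, where the CPU ranges and profile name are placeholders you would adapt to your hardware:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: infra-cpus
spec:
  cpu:
    reserved: 0-4,9    # CPUs for cluster and operating system housekeeping duties
    isolated: 5-8      # CPUs for application container workloads
  nodeSelector:
    node-role.kubernetes.io/worker: ""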
15.6. Configuring Hyper-Threading for a cluster
To configure Hyper-Threading for an OpenShift Container Platform cluster, set the CPU threads in the performance profile to the same cores that are configured for the reserved or isolated CPU pools.
If you configure a performance profile, and subsequently change the Hyper-Threading configuration for the host, ensure that you update the CPU isolated and reserved fields in the PerformanceProfile YAML to match the new configuration.
Disabling a previously enabled host Hyper-Threading configuration can cause the CPU core IDs listed in the PerformanceProfile YAML to be incorrect. This incorrect configuration can cause the node to become unavailable because the listed CPUs can no longer be found.
Prerequisites
- Access to the cluster as a user with the cluster-admin role.
- Install the OpenShift CLI (oc).
Procedure
Ascertain which threads are running on what CPUs for the host you want to configure.
You can view which threads are running on the host CPUs by logging in to the cluster and running the following command:
$ lscpu --all --extended

In this example, there are eight logical CPU cores running on four physical CPU cores. CPU0 and CPU4 are running on physical Core 0, CPU1 and CPU5 are running on physical Core 1, and so on.
Alternatively, to view the threads that are set for a particular physical CPU core (cpu0 in the example below), open a shell prompt and run the following:

$ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list

Example output

0,4

Apply the isolated and reserved CPUs in the PerformanceProfile YAML. For example, you can set logical cores CPU0 and CPU4 as isolated, and logical cores CPU1 to CPU3 and CPU5 to CPU7 as reserved. When you configure reserved and isolated CPUs, the infra containers in pods use the reserved CPUs and the application containers use the isolated CPUs.

...
  cpu:
    isolated: 0,4
    reserved: 1-3,5-7
...

Note: The reserved and isolated CPU pools must not overlap and together must span all available cores in the worker node.
Hyper-Threading is enabled by default on most Intel processors. If you enable Hyper-Threading, all threads processed by a particular core must be isolated or processed on the same core.
When Hyper-Threading is enabled, all guaranteed pods must use multiples of the simultaneous multi-threading (SMT) level to avoid a "noisy neighbor" situation that can cause the pod to fail. See Static policy options for more information.
15.6.1. Disabling Hyper-Threading for low latency applications
When configuring clusters for low latency processing, consider whether you want to disable Hyper-Threading before you deploy the cluster. To disable Hyper-Threading, perform the following steps:
- Create a performance profile that is appropriate for your hardware and topology.
Set nosmt as an additional kernel argument. The example performance profile sketched after this note illustrates this setting.

Note: When you configure reserved and isolated CPUs, the infra containers in pods use the reserved CPUs and the application containers use the isolated CPUs.
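A minimal sketch of a profile with the nosmt kernel argument follows; the profile name and CPU ranges are illustrative assumptions:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: example-performanceprofile
spec:
  additionalKernelArgs:
    - nosmt              # disables Hyper-Threading (SMT) on the node
  cpu:
    isolated: 2-3
    reserved: 0-1
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  realTimeKernel:
    enabled: true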
15.7. Managing device interrupt processing for guaranteed pod isolated CPUs
The Node Tuning Operator can manage host CPUs by dividing them into reserved CPUs for cluster and operating system housekeeping duties, including pod infra containers, and isolated CPUs for application containers to run the workloads. This allows you to set CPUs for low latency workloads as isolated.
Device interrupts are load balanced between all isolated and reserved CPUs to avoid CPUs being overloaded, with the exception of CPUs where there is a guaranteed pod running. Guaranteed pod CPUs are prevented from processing device interrupts when the relevant annotations are set for the pod.
In the performance profile, globallyDisableIrqLoadBalancing is used to manage whether device interrupts are processed or not. For certain workloads, the reserved CPUs are not always sufficient for dealing with device interrupts, and for this reason, device interrupts are not globally disabled on the isolated CPUs. By default, Node Tuning Operator does not disable device interrupts on isolated CPUs.
15.7.1. Finding the effective IRQ affinity setting for a node
Some IRQ controllers lack support for IRQ affinity setting and will always expose all online CPUs as the IRQ mask. These IRQ controllers effectively run on CPU 0.
The following are examples of drivers and hardware that Red Hat is aware lack support for IRQ affinity setting. The list is by no means exhaustive:

- Some RAID controller drivers, such as megaraid_sas
- Many non-volatile memory express (NVMe) drivers
- Some LAN on motherboard (LOM) network controllers
- Drivers that use managed_irqs
The reason they do not support IRQ affinity setting might be associated with factors such as the type of processor, the IRQ controller, or the circuitry connections in the motherboard.
If the effective affinity of any IRQ is set to an isolated CPU, it might be a sign of some hardware or driver not supporting IRQ affinity setting. To find the effective affinity, log in to the host and run the following command:
$ find /proc/irq -name effective_affinity -printf "%p: " -exec cat {} \;
Some drivers use managed_irqs, whose affinity is managed internally by the kernel and userspace cannot change the affinity. In some cases, these IRQs might be assigned to isolated CPUs. For more information about managed_irqs, see Affinity of managed interrupts cannot be changed even if they target isolated CPU.
15.7.2. Configuring a node for IRQ dynamic load balancing
Configure a cluster node for IRQ dynamic load balancing to control which cores can receive device interrupt requests (IRQ).
Prerequisites
- For core isolation, all server hardware components must support IRQ affinity. To check if the hardware components of your server support IRQ affinity, view the server’s hardware specifications or contact your hardware provider.
Procedure
- Log in to the OpenShift Container Platform cluster as a user with cluster-admin privileges.
- Set the performance profile apiVersion to use performance.openshift.io/v2.
- Remove the globallyDisableIrqLoadBalancing field or set it to false.
- Set the appropriate isolated and reserved CPUs. The following snippet illustrates a profile that reserves 2 CPUs. IRQ load-balancing is enabled for pods running on the isolated CPU set (see the sketch after this note).

Note: When you configure reserved and isolated CPUs, the infra containers in pods use the reserved CPUs and the application containers use the isolated CPUs.
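The snippet itself is not reproduced in this extract; a minimal sketch of such a profile might look like the following, where the CPU ranges are illustrative and the profile name matches the runtime class used later in this procedure:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: dynamic-irq-profile
spec:
  cpu:
    isolated: 2-5        # placeholder isolated CPUs; IRQ load balancing stays enabled here
    reserved: 0-1        # two reserved CPUs
  globallyDisableIrqLoadBalancing: false
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""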
Create the pod that uses exclusive CPUs, and set the irq-load-balancing.crio.io and cpu-quota.crio.io annotations to disable. For example:
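A minimal sketch of such a pod follows, assuming the dynamic-irq-profile name from the previous step; the image is an illustrative placeholder:

apiVersion: v1
kind: Pod
metadata:
  name: dynamic-irq-pod
  annotations:
    irq-load-balancing.crio.io: "disable"
    cpu-quota.crio.io: "disable"
spec:
  runtimeClassName: performance-dynamic-irq-profile
  containers:
    - name: dynamic-irq-pod
      image: <workload-image>        # illustrative placeholder image
      command: ["sleep", "10h"]
      resources:
        requests:
          cpu: 2
          memory: "200M"
        limits:
          cpu: 2
          memory: "200M"
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""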
- Enter the pod runtimeClassName in the form performance-<profile_name>, where <profile_name> is the name from the PerformanceProfile YAML, in this example, performance-dynamic-irq-profile.
- Set the node selector to target a cnf-worker.
Ensure the pod is running correctly. The status should be running, and the correct cnf-worker node should be set:

$ oc get pod -o wide

Expected output

NAME              READY   STATUS    RESTARTS   AGE     IP             NODE          NOMINATED NODE   READINESS GATES
dynamic-irq-pod   1/1     Running   0          5h33m   <ip-address>   <node-name>   <none>           <none>

Get the CPUs that the pod configured for IRQ dynamic load balancing runs on:
$ oc exec -it dynamic-irq-pod -- /bin/bash -c "grep Cpus_allowed_list /proc/self/status | awk '{print $2}'"

Expected output

Cpus_allowed_list: 2-3

Ensure the node configuration is applied correctly. Log in to the node to verify the configuration.
$ oc debug node/<node-name>

Verify that you can use the node file system:
sh-4.4# chroot /host

Expected output

sh-4.4#

Ensure the default system CPU affinity mask does not include the dynamic-irq-pod CPUs, for example, CPUs 2 and 3:

sh-4.4# cat /proc/irq/default_smp_affinity

Example output

33

Ensure the system IRQs are not configured to run on the dynamic-irq-pod CPUs:

sh-4.4# find /proc/irq/ -name smp_affinity_list -exec sh -c 'i="$1"; mask=$(cat $i); file=$(echo $i); echo $file: $mask' _ {} \;
15.8. Configuring huge pages
Nodes must pre-allocate huge pages used in an OpenShift Container Platform cluster. Use the Node Tuning Operator to allocate huge pages on a specific node.
OpenShift Container Platform provides a method for creating and allocating huge pages. The Node Tuning Operator provides an easier way to do this by using the performance profile.
For example, in the hugepages pages section of the performance profile, you can specify multiple blocks of size, count, and, optionally, node:
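The example block is not shown in this extract; a minimal sketch of the hugepages section might look like the following, where the size, count, and node values are placeholders:

spec:
  hugepages:
    defaultHugepagesSize: "1G"
    pages:
      - size: "1G"
        count: 4
        node: 0   # NUMA node where the pages are allocated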
node is the NUMA node in which the huge pages are allocated. If you omit node, the pages are evenly spread across all NUMA nodes.
Wait for the relevant machine config pool status that indicates the update is finished.
These are the only configuration steps you need to do to allocate huge pages.
Verification
To verify the configuration, see the /proc/meminfo file on the node:

$ oc debug node/ip-10-0-141-105.ec2.internal

# grep -i huge /proc/meminfo

Use oc describe to report the new size:

$ oc describe node worker-0.ocp4poc.example.com | grep -i huge

Example output

hugepages-1g=true
hugepages-###: ###
hugepages-###: ###
15.8.1. Allocating multiple huge page sizes
You can request huge pages with different sizes under the same container. This allows you to define more complicated pods consisting of containers with different huge page size needs.
For example, you can define sizes 1G and 2M and the Node Tuning Operator will configure both sizes on the node, as shown here:
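The referenced example is missing from this extract; a minimal sketch of a hugepages section that defines both sizes might look like this, with placeholder counts:

spec:
  hugepages:
    defaultHugepagesSize: 1G
    pages:
      - size: 1G
        count: 16
      - size: 2M
        count: 128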
15.9. Reducing NIC queues using the Node Tuning Operator
The Node Tuning Operator facilitates reducing NIC queues for enhanced performance. Adjustments are made using the performance profile, allowing customization of queues for different network devices.
15.9.1. Adjusting the NIC queues with the performance profile
The performance profile lets you adjust the queue count for each network device.
Supported network devices:
- Non-virtual network devices
- Network devices that support multiple queues (channels)
Unsupported network devices:
- Pure software network interfaces
- Block devices
- Intel DPDK virtual functions
Prerequisites
- Access to the cluster as a user with the cluster-admin role.
- Install the OpenShift CLI (oc).
Procedure
- Log in to the OpenShift Container Platform cluster running the Node Tuning Operator as a user with cluster-admin privileges.
- Create and apply a performance profile appropriate for your hardware and topology. For guidance on creating a profile, see the "Creating a performance profile" section.
Edit this created performance profile:
$ oc edit -f <your_profile_name>.yaml

Populate the spec field with the net object. The object list can contain two fields:
- userLevelNetworking is a required field specified as a boolean flag. If userLevelNetworking is true, the queue count is set to the reserved CPU count for all supported devices. The default is false.
- devices is an optional field specifying a list of devices that will have the queues set to the reserved CPU count. If the device list is empty, the configuration applies to all network devices. The configuration is as follows:
  - interfaceName: This field specifies the interface name, and it supports shell-style wildcards, which can be positive or negative.
    - Example wildcard syntax is as follows: <string> .*
    - Negative rules are prefixed with an exclamation mark. To apply the net queue changes to all devices other than the excluded list, use !<device>, for example, !eno1.
  - vendorID: The network device vendor ID represented as a 16-bit hexadecimal number with a 0x prefix.
  - deviceID: The network device ID (model) represented as a 16-bit hexadecimal number with a 0x prefix.

Note: When a deviceID is specified, the vendorID must also be defined. A device that matches all of the device identifiers specified in a device entry (interfaceName, vendorID, or a pair of vendorID plus deviceID) qualifies as a network device. This network device then has its net queues count set to the reserved CPU count. When two or more devices are specified, the net queues count is set to any net device that matches one of them.
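The example profiles referenced in the list that follows are not reproduced in this extract. The general shape of the spec.net section is sketched below; the CPU ranges, interface names, and vendor and device IDs are illustrative:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: manual
spec:
  cpu:
    isolated: 2-7
    reserved: 0-1
  net:
    userLevelNetworking: true
    devices:
      - interfaceName: "eth0"     # match by interface name; "eth*" and "!eno1" style wildcards also work
      - vendorID: "0x1af4"
        deviceID: "0x1000"        # when deviceID is set, vendorID must also be set
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""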
- Set the queue count to the reserved CPU count for all devices by using this example performance profile.
- Set the queue count to the reserved CPU count for all devices matching any of the defined device identifiers by using this example performance profile.
- Set the queue count to the reserved CPU count for all devices starting with the interface name eth by using this example performance profile.
- Set the queue count to the reserved CPU count for all devices with an interface named anything other than eno1 by using this example performance profile.
- Set the queue count to the reserved CPU count for all devices that have an interface name eth0, vendorID of 0x1af4, and deviceID of 0x1000 by using this example performance profile.

Apply the updated performance profile:
$ oc apply -f <your_profile_name>.yaml
15.9.2. Verifying the queue status
In this section, a number of examples illustrate different performance profiles and how to verify the changes are applied.
Example 1
In this example, the net queue count is set to the reserved CPU count (2) for all supported devices.
The relevant section from the performance profile is:
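That section is not reproduced in this extract; it is essentially the following sketch, with two reserved CPUs and user-level networking enabled for all supported devices:

spec:
  cpu:
    reserved: 0-1
  net:
    userLevelNetworking: true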
Display the status of the queues associated with a device using the following command:
Note: Run this command on the node where the performance profile was applied.
$ ethtool -l <device>

Verify the queue status before the profile is applied:
$ ethtool -l ens4

Verify the queue status after the profile is applied:

$ ethtool -l ens4
The combined channel shows that the total count of reserved CPUs for all supported devices is 2. This matches what is configured in the performance profile.
Example 2
In this example, the net queue count is set to the reserved CPU count (2) for all supported network devices with a specific vendorID.
The relevant section from the performance profile is:
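That section is essentially the following sketch; the vendorID matches the one discussed below:

spec:
  cpu:
    reserved: 0-1
  net:
    userLevelNetworking: true
    devices:
      - vendorID: "0x1af4"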
Display the status of the queues associated with a device using the following command:
Note: Run this command on the node where the performance profile was applied.
$ ethtool -l <device>

Verify the queue status after the profile is applied:

$ ethtool -l ens4
The total count of reserved CPUs for all supported devices with vendorID=0x1af4 is 2. For example, if there is another network device ens2 with vendorID=0x1af4, it will also have total net queues of 2. This matches what is configured in the performance profile.
Example 3
In this example, the net queue count is set to the reserved CPU count (2) for all supported network devices that match any of the defined device identifiers.
The command udevadm info provides a detailed report on a device. In this example the devices are:
Set the net queues to 2 for a device with interfaceName equal to eth0 and any devices that have a vendorID=0x1af4 with the following performance profile:

Verify the queue status after the profile is applied:
$ ethtool -l ens4

The total count of reserved CPUs for all supported devices with vendorID=0x1af4 is set to 2. For example, if there is another network device ens2 with vendorID=0x1af4, it will also have the total net queues set to 2. Similarly, a device with interfaceName equal to eth0 will have total net queues set to 2.
15.9.3. Logging associated with adjusting NIC queues
Log messages detailing the assigned devices are recorded in the respective Tuned daemon logs. The following messages might be recorded to the /var/log/tuned/tuned.log file:
An INFO message is recorded detailing the successfully assigned devices:

INFO tuned.plugins.base: instance net_test (net): assigning devices ens1, ens2, ens3

A WARNING message is recorded if none of the devices can be assigned:

WARNING tuned.plugins.base: instance net_test: no matching devices available
Chapter 16. Provisioning real-time and low latency workloads
Many organizations need high performance computing and low, predictable latency, especially in the financial and telecommunications industries.
OpenShift Container Platform provides the Node Tuning Operator to implement automatic tuning to achieve low latency performance and consistent response time for OpenShift Container Platform applications. You use the performance profile configuration to make these changes. You can update the kernel to kernel-rt, reserve CPUs for cluster and operating system housekeeping duties, including pod infra containers, isolate CPUs for application containers to run the workloads, and disable unused CPUs to reduce power consumption.
When writing your applications, follow the general recommendations described in RHEL for Real Time processes and threads.
16.1. Scheduling a low latency workload onto a worker with real-time capabilities
You can schedule low latency workloads onto a worker node where a performance profile that configures real-time capabilities is applied.
To schedule the workload on specific nodes, use label selectors in the Pod custom resource (CR). The label selectors must match the nodes that are attached to the machine config pool that was configured for low latency by the Node Tuning Operator.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in as a user with cluster-admin privileges.
- You have applied a performance profile in the cluster that tunes worker nodes for low latency workloads.
Procedure
Create a Pod CR for the low latency workload and apply it in the cluster, for example:

Example Pod spec configured to use real-time processing
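The original example is not reproduced in this extract. The following minimal sketch matches the numbered callouts below; the pod name, image, node label, and profile name are illustrative assumptions:

apiVersion: v1
kind: Pod
metadata:
  name: dynamic-low-latency-pod
  annotations:
    cpu-quota.crio.io: "disable"            # 1
    cpu-load-balancing.crio.io: "disable"   # 2
    irq-load-balancing.crio.io: "disable"   # 3
spec:
  containers:
    - name: dynamic-low-latency-pod
      image: <workload-image>               # illustrative placeholder image
      command: ["sleep", "10h"]
      resources:
        requests:
          cpu: 2
          memory: "200M"
        limits:
          cpu: 2
          memory: "200M"
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""                       # 4
  runtimeClassName: performance-dynamic-low-latency-profile      # 5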
1. Disables the CPU completely fair scheduler (CFS) quota at the pod run time.
2. Disables CPU load balancing.
3. Opts the pod out of interrupt handling on the node.
4. The nodeSelector label must match the label that you specify in the Node CR.
5. runtimeClassName must match the name of the performance profile configured in the cluster.
Enter the pod runtimeClassName in the form performance-<profile_name>, where <profile_name> is the name from the PerformanceProfile YAML. In the previous example, the name is performance-dynamic-low-latency-profile.

Ensure the pod is running correctly. The status should be
running, and the correct cnf-worker node should be set:

$ oc get pod -o wide

Expected output

NAME                      READY   STATUS    RESTARTS   AGE     IP            NODE
dynamic-low-latency-pod   1/1     Running   0          5h33m   10.131.0.10   cnf-worker.example.com

Get the CPUs that the pod configured for IRQ dynamic load balancing runs on:
$ oc exec -it dynamic-low-latency-pod -- /bin/bash -c "grep Cpus_allowed_list /proc/self/status | awk '{print $2}'"

Expected output

Cpus_allowed_list: 2-3
Verification
Ensure the node configuration is applied correctly.
Log in to the node to verify the configuration.
$ oc debug node/<node-name>

Verify that you can use the node file system:
sh-4.4# chroot /host

Expected output

sh-4.4#

Ensure the default system CPU affinity mask does not include the dynamic-low-latency-pod CPUs, for example, CPUs 2 and 3:

sh-4.4# cat /proc/irq/default_smp_affinity

Example output

33

Ensure the system IRQs are not configured to run on the dynamic-low-latency-pod CPUs:

sh-4.4# find /proc/irq/ -name smp_affinity_list -exec sh -c 'i="$1"; mask=$(cat $i); file=$(echo $i); echo $file: $mask' _ {} \;
When you tune nodes for low latency, the usage of execution probes in conjunction with applications that require guaranteed CPUs can cause latency spikes. Use other probes, such as a properly configured set of network probes, as an alternative.
16.2. Creating a pod with a guaranteed QoS class
You can create a pod with a quality of service (QoS) class of Guaranteed for high-performance workloads. Configuring a pod with a QoS class of Guaranteed ensures that the pod has priority access to the specified CPU and memory resources.
To create a pod with a QoS class of Guaranteed, you must apply the following specifications:
- Set identical values for the memory limit and memory request fields for each container in the pod.
- Set identical values for CPU limit and CPU request fields for each container in the pod.
In general, a pod with a QoS class of Guaranteed will not be evicted from a node. One exception is during resource contention caused by system daemons exceeding reserved resources. In this scenario, the kubelet might evict pods to preserve node stability, starting with the lowest priority pods.
Prerequisites
- Access to the cluster as a user with the cluster-admin role
- The OpenShift CLI (oc)
Procedure
Create a namespace for the pod by running the following command:
$ oc create namespace qos-example

This example uses the qos-example namespace.
Example output
namespace/qos-example created
namespace/qos-example created

Create the Pod resource:

Create a YAML file that defines the Pod resource. Example qos-example.yaml file (see the annotated callouts that follow):
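The example file is not reproduced in this extract; a minimal sketch that matches the callouts below might look like this. The exact image path is an illustrative assumption:

apiVersion: v1
kind: Pod
metadata:
  name: qos-demo
  namespace: qos-example
spec:
  containers:
    - name: qos-demo-ctr
      image: <public-hello-openshift-image>   # 1: a public hello-openshift image (illustrative reference)
      resources:
        limits:
          memory: "200Mi"   # 2: memory limit of 200 MB
          cpu: "1"          # 3: CPU limit of 1 CPU
        requests:
          memory: "200Mi"   # 4: memory request of 200 MB
          cpu: "1"          # 5: CPU request of 1 CPU

Because the limits and requests are identical for both CPU and memory, the pod receives the Guaranteed QoS class.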
1. This example uses a public hello-openshift image.
2. Sets the memory limit to 200 MB.
3. Sets the CPU limit to 1 CPU.
4. Sets the memory request to 200 MB.
5. Sets the CPU request to 1 CPU.

Note:
If you specify a memory limit for a container, but do not specify a memory request, OpenShift Container Platform automatically assigns a memory request that matches the limit. Similarly, if you specify a CPU limit for a container, but do not specify a CPU request, OpenShift Container Platform automatically assigns a CPU request that matches the limit.
Create the Pod resource by running the following command:

$ oc apply -f qos-example.yaml --namespace=qos-example

Example output

pod/qos-demo created
Verification
View the qosClass value for the pod by running the following command:

$ oc get pod qos-demo --namespace=qos-example --output=yaml | grep qosClass

Example output

qosClass: Guaranteed
16.3. Disabling CPU load balancing in a Pod
Functionality to disable or enable CPU load balancing is implemented at the CRI-O level. The code under CRI-O disables or enables CPU load balancing only when the following requirements are met.
The pod must use the performance-<profile-name> runtime class. You can get the proper name by looking at the status of the performance profile, as shown here.
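The status excerpt is not reproduced in this extract; it looks roughly like the following sketch, where the profile name is an illustrative assumption:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: manual
status:
  runtimeClass: performance-manual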
The Node Tuning Operator is responsible for the creation of the high-performance runtime handler config snippet under relevant nodes and for creation of the high-performance runtime class under the cluster. It will have the same content as the default runtime handler except that it enables the CPU load balancing configuration functionality.
To disable the CPU load balancing for the pod, the Pod specification must include the following fields:
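The example fields are not shown in this extract; a minimal sketch of a pod that disables CPU load balancing might look like this, where the pod name, image, and profile name are illustrative assumptions:

apiVersion: v1
kind: Pod
metadata:
  name: example-low-latency-pod          # illustrative name
  annotations:
    cpu-load-balancing.crio.io: "disable"   # disables CPU load balancing for the pod's pinned CPUs
spec:
  runtimeClassName: performance-<profile_name>
  containers:
    - name: app
      image: <workload-image>
      resources:
        limits:
          cpu: "2"
          memory: "1Gi"
        requests:
          cpu: "2"
          memory: "1Gi"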
Only disable CPU load balancing when the CPU manager static policy is enabled and for pods with guaranteed QoS that use whole CPUs. Otherwise, disabling CPU load balancing can affect the performance of other containers in the cluster.
16.4. Disabling power saving mode for high priority pods
You can configure pods to ensure that high priority workloads are unaffected when you configure power saving for the node that the workloads run on.
When you configure a node with a power saving configuration, you must configure high priority workloads with performance configuration at the pod level, which means that the configuration applies to all the cores used by the pod.
By disabling P-states and C-states at the pod level, you can configure high priority workloads for best performance and lowest latency.
| Annotation | Possible Values | Description |
|---|---|---|
| cpu-c-states.crio.io | "enable", "disable", "max_latency:microseconds" | This annotation allows you to enable or disable C-states for each CPU. Alternatively, you can also specify a maximum latency in microseconds for the C-states. For example, enable C-states with a maximum latency of 10 microseconds with the setting cpu-c-states.crio.io: "max_latency:10". |
| cpu-freq-governor.crio.io | Any supported cpufreq governor. | Sets the cpufreq governor for each CPU. |
Prerequisites
- You have configured power saving in the performance profile for the node where the high priority workload pods are scheduled.
Procedure
Add the required annotations to your high priority workload pods. The annotations override the default settings.

Example high priority workload annotation
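The example is not reproduced in this extract; a minimal sketch of such a pod might look like the following, where the pod name, image, and profile name are illustrative assumptions:

apiVersion: v1
kind: Pod
metadata:
  name: high-priority-workload               # illustrative name
  annotations:
    cpu-c-states.crio.io: "disable"          # disable C-states for the pod's pinned CPUs
    cpu-freq-governor.crio.io: "performance" # use the performance cpufreq governor
spec:
  runtimeClassName: performance-<profile_name>
  containers:
    - name: app
      image: <high-priority-workload-image>
      resources:
        limits:
          cpu: "2"
          memory: "1Gi"
        requests:
          cpu: "2"
          memory: "1Gi"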
Restart the pods to apply the annotation.
16.5. Disabling CPU CFS quota
To eliminate CPU throttling for pinned pods, create a pod with the cpu-quota.crio.io: "disable" annotation. This annotation disables the CPU completely fair scheduler (CFS) quota when the pod runs.
Example pod specification with cpu-quota.crio.io disabled
Only disable CPU CFS quota when the CPU manager static policy is enabled and for pods with guaranteed QoS that use whole CPUs. For example, pods that contain CPU-pinned containers. Otherwise, disabling CPU CFS quota can affect the performance of other containers in the cluster.
16.6. Disabling interrupt processing for CPUs where pinned containers are running
To achieve low latency for workloads, some containers require that the CPUs they are pinned to do not process device interrupts. A pod annotation, irq-load-balancing.crio.io, is used to define whether device interrupts are processed or not on the CPUs where the pinned containers are running. When configured, CRI-O disables device interrupts where the pod containers are running.
To disable interrupt processing for CPUs where containers belonging to individual pods are pinned, ensure that globallyDisableIrqLoadBalancing is set to false in the performance profile. Then, in the pod specification, set the irq-load-balancing.crio.io pod annotation to disable.
The following pod specification contains this annotation:
Chapter 17. Debugging low latency node tuning status
Use the PerformanceProfile custom resource (CR) status fields for reporting tuning status and debugging latency issues in the cluster node.
17.1. Debugging low latency CNF tuning status
The PerformanceProfile custom resource (CR) contains status fields for reporting tuning status and debugging latency degradation issues. These fields report on conditions that describe the state of the operator’s reconciliation functionality.
A typical issue can arise when the status of machine config pools that are attached to the performance profile are in a degraded state, causing the PerformanceProfile status to degrade. In this case, the machine config pool issues a failure message.
The Node Tuning Operator contains the performanceProfile.spec.status.Conditions status field:
The Status field contains Conditions that specify Type values that indicate the status of the performance profile:
Available - All machine configs and Tuned profiles have been created successfully and are available for the cluster components that are responsible for processing them (Node Tuning Operator, Machine Config Operator, kubelet).
Upgradeable - Indicates whether the resources maintained by the Operator are in a state that is safe to upgrade.
Progressing - Indicates that the deployment process from the performance profile has started.
Degraded - Indicates an error if:
- Validation of the performance profile has failed.
- Creation of all relevant components did not complete successfully.
Each of these types contains the following fields:
- Status
- The state for the specific type (true or false).
- Timestamp
- The transaction timestamp.
- Reason string
- The machine readable reason.
- Message string
- The human readable reason describing the state and error details, if any.
17.1.1. Machine config pools Copy linkLink copied to clipboard!
A performance profile and its created products are applied to a node according to an associated machine config pool (MCP). The MCP holds valuable information about the progress of applying the machine configurations created by performance profiles that encompass kernel args, kube config, huge pages allocation, and deployment of rt-kernel. The Performance Profile controller monitors changes in the MCP and updates the performance profile status accordingly.
The only condition returned by the MCP to the performance profile status is when the MCP is Degraded, which leads to performanceProfile.status.condition.Degraded = true.
Example
The following example is for a performance profile with an associated machine config pool (worker-cnf) that was created for it:
The associated machine config pool is in a degraded state:
# oc get mcp

Example output

NAME         CONFIG                                                 UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master       rendered-master-2ee57a93fa6c9181b546ca46e1571d2d       True      False      False      3              3                   3                     0                      2d21h
worker       rendered-worker-d6b2bdc07d9f5a59a6b68950acf25e5f       True      False      False      2              2                   2                     0                      2d21h
worker-cnf   rendered-worker-cnf-6c838641b8a08fff08dbd8b02fb63f7c   False     True       True       2              1                   1                     1                      2d20h

The describe section of the MCP shows the reason:

# oc describe mcp worker-cnf

Example output

Message: Node node-worker-cnf is reporting: "prepping update: machineconfig.machineconfiguration.openshift.io \"rendered-worker-cnf-40b9996919c08e335f3ff230ce1d170\" not found"
Reason: 1 nodes are reporting degraded status on sync

The degraded state should also appear under the performance profile status field marked as degraded = true:

# oc describe performanceprofiles performance

Example output
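The literal output is not reproduced in this extract. As a sketch of the expected shape only, the Degraded condition carries the reason reported by the degraded machine config pool, for example:

Status:
  Conditions:
    Message:  Machine config pool worker-cnf Degraded Reason: 1 nodes are reporting degraded status on sync.
    Status:   "True"
    Type:     Degraded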
17.2. Collecting low latency tuning debugging data for Red Hat Support Copy linkLink copied to clipboard!
When opening a support case, it is helpful to provide debugging information about your cluster to Red Hat Support.
The must-gather tool enables you to collect diagnostic information about your OpenShift Container Platform cluster, including node tuning, NUMA topology, and other information needed to debug issues with low latency setup.
For prompt support, supply diagnostic information for both OpenShift Container Platform and low latency tuning.
17.2.1. About the must-gather tool Copy linkLink copied to clipboard!
The oc adm must-gather CLI command collects the information from your cluster that is most likely needed for debugging issues, such as:
- Resource definitions
- Audit logs
- Service logs
You can specify one or more images when you run the command by including the --image argument. When you specify an image, the tool collects data related to that feature or product. When you run oc adm must-gather, a new pod is created on the cluster. The data is collected on that pod and saved in a new directory that starts with must-gather.local. This directory is created in your current working directory.
17.2.2. Gathering low latency tuning data Copy linkLink copied to clipboard!
Use the oc adm must-gather CLI command to collect information about your cluster, including features and objects associated with low latency tuning, including:
- The Node Tuning Operator namespaces and child objects.
-
MachineConfigPooland associatedMachineConfigobjects. - The Node Tuning Operator and associated Tuned objects.
- Linux kernel command-line options.
- CPU and NUMA topology
- Basic PCI device information and NUMA locality.
Prerequisites
-
Access to the cluster as a user with the
cluster-adminrole. - The OpenShift Container Platform CLI (oc) installed.
Procedure
-
Navigate to the directory where you want to store the
must-gather data. Collect debugging information by running the following command:
oc adm must-gather
$ oc adm must-gatherCopy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create a compressed file from the
must-gatherdirectory that was created in your working directory. For example, on a computer that uses a Linux operating system, run the following command:tar cvaf must-gather.tar.gz must-gather-local.5421342344627712289
$ tar cvaf must-gather.tar.gz must-gather-local.54213423446277122891 Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Replace
must-gather-local.5421342344627712289/ with the directory name created by the must-gather tool.
NoteCreate a compressed file to attach the data to a support case or to use with the Performance Profile Creator wrapper script when you create a performance profile.
- Attach the compressed file to your support case on the Red Hat Customer Portal.
Chapter 18. Performing latency tests for platform verification Copy linkLink copied to clipboard!
You can use the Cloud-native Network Functions (CNF) tests image to run latency tests on a CNF-enabled OpenShift Container Platform cluster, where all the components required for running CNF workloads are installed. Run the latency tests to validate node tuning for your workload.
The cnf-tests container image is available at registry.redhat.io/openshift4/cnf-tests-rhel8:v4.14.
The cnf-tests image also includes several tests that are not supported by Red Hat at this time. Only the latency tests are supported by Red Hat.
18.1. Prerequisites for running latency tests Copy linkLink copied to clipboard!
Your cluster must meet the following requirements before you can run the latency tests:
-
You have applied all the required CNF configurations. This includes the PerformanceProfile custom resource (CR) and other configuration according to the reference design specifications (RDS) or your specific requirements.
-
You have logged in to
registry.redhat.iowith your Customer Portal credentials by using thepodman logincommand.
18.2. About discovery mode for latency tests Copy linkLink copied to clipboard!
Use discovery mode to validate the functionality of a cluster without altering its configuration. Existing environment configurations are used for the tests. The tests can find the configuration items needed and use those items to execute the tests. If resources needed to run a specific test are not found, the test is skipped, providing an appropriate message to the user. After the tests are finished, no cleanup of the preconfigured configuration items is done, and the test environment can be immediately used for another test run.
When running the latency tests, always run the tests with -e DISCOVERY_MODE=true and -ginkgo.focus set to the appropriate latency test. If you do not run the latency tests in discovery mode, your existing live cluster performance profile configuration will be modified by the test run.
Limiting the nodes used during tests
The nodes on which the tests are executed can be limited by specifying a NODES_SELECTOR environment variable, for example, -e NODES_SELECTOR=node-role.kubernetes.io/worker-cnf. Any resources created by the test are limited to nodes with matching labels.
If you want to override the default worker pool, pass the -e ROLE_WORKER_CNF=<custom_worker_pool> variable to the command specifying an appropriate label.
18.3. Measuring latency Copy linkLink copied to clipboard!
The cnf-tests image uses three tools to measure the latency of the system:
-
hwlatdetect -
cyclictest -
oslat
Each tool has a specific use. Use the tools in sequence to achieve reliable test results.
- hwlatdetect
-
Measures the baseline that the bare-metal hardware can achieve. Before proceeding with the next latency test, ensure that the latency reported by
hwlatdetectmeets the required threshold because you cannot fix hardware latency spikes by operating system tuning. - cyclictest
-
Verifies the real-time kernel scheduler latency after
hwlatdetectpasses validation. Thecyclictesttool schedules a repeated timer and measures the difference between the desired and the actual trigger times. The difference can uncover basic issues with the tuning caused by interrupts or process priorities. The tool must run on a real-time kernel. - oslat
- Behaves similarly to a CPU-intensive DPDK application and measures all the interruptions and disruptions to the busy loop that simulates CPU heavy data processing.
The tests introduce the following environment variables:
| Environment variables | Description |
|---|---|
| LATENCY_TEST_DELAY | Specifies the amount of time in seconds after which the test starts running. You can use the variable to allow the CPU manager reconcile loop to update the default CPU pool. The default value is 0. |
| LATENCY_TEST_CPUS | Specifies the number of CPUs that the pod running the latency tests uses. If you do not set the variable, the default configuration includes all isolated CPUs. |
| LATENCY_TEST_RUNTIME | Specifies the amount of time in seconds that the latency test must run. The default value is 300 seconds. To prevent the Ginkgo 2.0 test suite from timing out before the latency tests complete, set the --ginkgo.timeout flag to a value larger than LATENCY_TEST_RUNTIME. |
| HWLATDETECT_MAXIMUM_LATENCY | Specifies the maximum acceptable hardware latency in microseconds for the workload and operating system. If you do not set the value of HWLATDETECT_MAXIMUM_LATENCY, the unified MAXIMUM_LATENCY value, if set, is used as the threshold. |
| CYCLICTEST_MAXIMUM_LATENCY | Specifies the maximum latency in microseconds that all threads expect before waking up during the cyclictest run. |
| OSLAT_MAXIMUM_LATENCY | Specifies the maximum acceptable latency in microseconds for the oslat test results. |
| MAXIMUM_LATENCY | Unified variable that specifies the maximum acceptable latency in microseconds. Applicable for all available latency tools. |
| LATENCY_TEST_RUN | Boolean parameter that indicates whether the tests should run. |
Variables that are specific to a latency tool take precedence over unified variables. For example, if OSLAT_MAXIMUM_LATENCY is set to 30 microseconds and MAXIMUM_LATENCY is set to 10 microseconds, the oslat test will run with maximum acceptable latency of 30 microseconds.
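For example, the following sketch, using the image and test script already shown in this chapter, sets the unified limit to 10 microseconds but lets the oslat test accept up to 30 microseconds:

$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
  -e LATENCY_TEST_RUN=true -e DISCOVERY_MODE=true -e FEATURES=performance \
  -e LATENCY_TEST_RUNTIME=600 \
  -e MAXIMUM_LATENCY=10 \
  -e OSLAT_MAXIMUM_LATENCY=30 \
  registry.redhat.io/openshift4/cnf-tests-rhel8:v4.14 \
  /usr/bin/test-run.sh --ginkgo.v --ginkgo.timeout="24h"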
18.4. Running the latency tests Copy linkLink copied to clipboard!
Run the cluster latency tests to validate node tuning for your Cloud-native Network Functions (CNF) workload.
Always run the latency tests with DISCOVERY_MODE=true set. If you don’t, the test suite will make changes to the running cluster configuration.
When executing podman commands as a non-root or non-privileged user, mounting paths can fail with permission denied errors. Depending on your local operating system and SELinux configuration, you might also experience issues running these commands from your home directory. To make the podman commands work, run the commands from a folder that is not your home/<username> directory, and append :Z to the volumes creation. For example, -v $(pwd)/:/kubeconfig:Z. This allows podman to do the proper SELinux relabeling.
This procedure runs the three individual tests hwlatdetect, cyclictest, and oslat. For details on these individual tests, see their individual sections.
Procedure
Open a shell prompt in the directory containing the
kubeconfig file. You provide the test image with a kubeconfig file in the current directory and its related $KUBECONFIG environment variable, mounted through a volume. This allows the running container to use the kubeconfig file from inside the container.

Note

In the following command, your local kubeconfig is mounted to kubeconfig/kubeconfig in the cnf-tests container, which allows access to the cluster.

To run the latency tests, run the following command, substituting variable values as appropriate:
podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \ -e LATENCY_TEST_RUNTIME=600\ -e MAXIMUM_LATENCY=20 \ registry.redhat.io/openshift4/cnf-tests-rhel8:v4.14 /usr/bin/test-run.sh \ --ginkgo.v --ginkgo.timeout="24h"
$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \ -e LATENCY_TEST_RUNTIME=600\ -e MAXIMUM_LATENCY=20 \ registry.redhat.io/openshift4/cnf-tests-rhel8:v4.14 /usr/bin/test-run.sh \ --ginkgo.v --ginkgo.timeout="24h"Copy to Clipboard Copied! Toggle word wrap Toggle overflow The LATENCY_TEST_RUNTIME is shown in seconds, in this case 600 seconds (10 minutes). The test runs successfully when the maximum observed latency is lower than MAXIMUM_LATENCY (20 μs).
If the results exceed the latency threshold, the test fails.
-
Optional: Append
--ginkgo.dry-runflag to run the latency tests in dry-run mode. This is useful for checking what commands the tests run. -
Optional: Append
-ginkgo.vflag to run the tests with increased verbosity. Optional: Append
--ginkgo.timeout="24h" flag to ensure the Ginkgo 2.0 test suite does not time out before the latency tests complete.

Important

During testing, shorter time periods, as shown, can be used to run the tests. However, for final verification and valid results, the test should run for at least 12 hours (43200 seconds).
18.4.1. Running hwlatdetect Copy linkLink copied to clipboard!
The hwlatdetect tool is available in the rt-kernel package with a regular subscription of Red Hat Enterprise Linux (RHEL) 9.x.
Always run the latency tests with DISCOVERY_MODE=true set. If you don’t, the test suite will make changes to the running cluster configuration.
When executing podman commands as a non-root or non-privileged user, mounting paths can fail with permission denied errors. Depending on your local operating system and SELinux configuration, you might also experience issues running these commands from your home directory. To make the podman commands work, run the commands from a folder that is not your home/<username> directory, and append :Z to the volumes creation. For example, -v $(pwd)/:/kubeconfig:Z. This allows podman to do the proper SELinux relabeling.
Prerequisites
- You have reviewed the prerequisites for running latency tests.
Procedure
To run the
hwlatdetecttests, run the following command, substituting variable values as appropriate:podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \ -e LATENCY_TEST_RUN=true -e DISCOVERY_MODE=true -e FEATURES=performance -e ROLE_WORKER_CNF=worker-cnf \ -e LATENCY_TEST_RUNTIME=600 -e MAXIMUM_LATENCY=20 \ registry.redhat.io/openshift4/cnf-tests-rhel8:v4.14 \ /usr/bin/test-run.sh -ginkgo.v -ginkgo.focus="hwlatdetect" --ginkgo.timeout="24h"
$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \ -e LATENCY_TEST_RUN=true -e DISCOVERY_MODE=true -e FEATURES=performance -e ROLE_WORKER_CNF=worker-cnf \ -e LATENCY_TEST_RUNTIME=600 -e MAXIMUM_LATENCY=20 \ registry.redhat.io/openshift4/cnf-tests-rhel8:v4.14 \ /usr/bin/test-run.sh -ginkgo.v -ginkgo.focus="hwlatdetect" --ginkgo.timeout="24h"Copy to Clipboard Copied! Toggle word wrap Toggle overflow The
hwlatdetecttest runs for 10 minutes (600 seconds). The test runs successfully when the maximum observed latency is lower thanMAXIMUM_LATENCY(20 μs).If the results exceed the latency threshold, the test fails.
Important

During testing, shorter time periods, as shown, can be used to run the tests. However, for final verification and valid results, the test should run for at least 12 hours (43200 seconds).
Example failure output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
Example hwlatdetect test results
You can capture the following types of results:
- Rough results that are gathered after each run to create a history of the impact of any changes made throughout the test.
- The combined set of the rough tests with the best results and configuration settings.
Example of good results
The hwlatdetect tool only provides output if the sample exceeds the specified threshold.
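The captured example is not included in this extract. A representative shape of a clean run, with hypothetical parameter values and no samples recorded above the threshold, looks like the following:

hwlatdetect:  test duration 3600 seconds
   detector:  tracer
   parameters:
        Latency threshold: 10us
        Sample window:     1000000us
        Sample width:      950000us
     Non-sampling period:  50000us
        Output File:       None

Starting test
test finished
Max Latency: Below threshold
Samples recorded: 0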
Example of bad results
The output of hwlatdetect shows that multiple samples exceed the threshold. However, the same output can indicate different results based on the following factors:
- The duration of the test
- The number of CPU cores
- The host firmware settings
Before proceeding with the next latency test, ensure that the latency reported by hwlatdetect meets the required threshold. Fixing latencies introduced by hardware might require you to contact the system vendor support.
Not all latency spikes are hardware related. Ensure that you tune the host firmware to meet your workload requirements. For more information, see Setting firmware parameters for system tuning.
18.4.2. Running cyclictest Copy linkLink copied to clipboard!
The cyclictest tool measures the real-time kernel scheduler latency on the specified CPUs.
Always run the latency tests with DISCOVERY_MODE=true set. If you don’t, the test suite will make changes to the running cluster configuration.
When executing podman commands as a non-root or non-privileged user, mounting paths can fail with permission denied errors. Depending on your local operating system and SELinux configuration, you might also experience issues running these commands from your home directory. To make the podman commands work, run the commands from a folder that is not your home/<username> directory, and append :Z to the volumes creation. For example, -v $(pwd)/:/kubeconfig:Z. This allows podman to do the proper SELinux relabeling.
Prerequisites
- You have reviewed the prerequisites for running latency tests.
Procedure
To perform the
cyclictest, run the following command, substituting variable values as appropriate:podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \ -e LATENCY_TEST_RUN=true -e DISCOVERY_MODE=true -e FEATURES=performance -e ROLE_WORKER_CNF=worker-cnf \ -e LATENCY_TEST_CPUS=10 -e LATENCY_TEST_RUNTIME=600 -e MAXIMUM_LATENCY=20 \ registry.redhat.io/openshift4/cnf-tests-rhel8:v4.14 \ /usr/bin/test-run.sh -ginkgo.v -ginkgo.focus="cyclictest" --ginkgo.timeout="24h"
$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \ -e LATENCY_TEST_RUN=true -e DISCOVERY_MODE=true -e FEATURES=performance -e ROLE_WORKER_CNF=worker-cnf \ -e LATENCY_TEST_CPUS=10 -e LATENCY_TEST_RUNTIME=600 -e MAXIMUM_LATENCY=20 \ registry.redhat.io/openshift4/cnf-tests-rhel8:v4.14 \ /usr/bin/test-run.sh -ginkgo.v -ginkgo.focus="cyclictest" --ginkgo.timeout="24h"Copy to Clipboard Copied! Toggle word wrap Toggle overflow The command runs the
cyclictesttool for 10 minutes (600 seconds). The test runs successfully when the maximum observed latency is lower thanMAXIMUM_LATENCY(in this example, 20 μs). Latency spikes of 20 μs and above are generally not acceptable for telco RAN workloads.If the results exceed the latency threshold, the test fails.
Important

During testing, shorter time periods, as shown, can be used to run the tests. However, for final verification and valid results, the test should run for at least 12 hours (43200 seconds).
Example failure output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
Example cyclictest results
The same output can indicate different results for different workloads. For example, spikes up to 18μs are acceptable for 4G DU workloads, but not for 5G DU workloads.
Example of good results
Example of bad results
18.4.3. Running oslat Copy linkLink copied to clipboard!
The oslat test simulates a CPU-intensive DPDK application and measures all the interruptions and disruptions to test how the cluster handles CPU heavy data processing.
Always run the latency tests with DISCOVERY_MODE=true set. If you don’t, the test suite will make changes to the running cluster configuration.
When executing podman commands as a non-root or non-privileged user, mounting paths can fail with permission denied errors. Depending on your local operating system and SELinux configuration, you might also experience issues running these commands from your home directory. To make the podman commands work, run the commands from a folder that is not your home/<username> directory, and append :Z to the volumes creation. For example, -v $(pwd)/:/kubeconfig:Z. This allows podman to do the proper SELinux relabeling.
Prerequisites
- You have reviewed the prerequisites for running latency tests.
Procedure
To perform the
oslattest, run the following command, substituting variable values as appropriate:podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \ -e LATENCY_TEST_RUN=true -e DISCOVERY_MODE=true -e FEATURES=performance -e ROLE_WORKER_CNF=worker-cnf \ -e LATENCY_TEST_CPUS=10 -e LATENCY_TEST_RUNTIME=600 -e MAXIMUM_LATENCY=20 \ registry.redhat.io/openshift4/cnf-tests-rhel8:v4.14 \ /usr/bin/test-run.sh -ginkgo.v -ginkgo.focus="oslat" --ginkgo.timeout="24h"
$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \ -e LATENCY_TEST_RUN=true -e DISCOVERY_MODE=true -e FEATURES=performance -e ROLE_WORKER_CNF=worker-cnf \ -e LATENCY_TEST_CPUS=10 -e LATENCY_TEST_RUNTIME=600 -e MAXIMUM_LATENCY=20 \ registry.redhat.io/openshift4/cnf-tests-rhel8:v4.14 \ /usr/bin/test-run.sh -ginkgo.v -ginkgo.focus="oslat" --ginkgo.timeout="24h"Copy to Clipboard Copied! Toggle word wrap Toggle overflow LATENCY_TEST_CPUSspecifies the list of CPUs to test with theoslatcommand.The command runs the
oslattool for 10 minutes (600 seconds). The test runs successfully when the maximum observed latency is lower thanMAXIMUM_LATENCY(20 μs).If the results exceed the latency threshold, the test fails.
Important

During testing, shorter time periods, as shown, can be used to run the tests. However, for final verification and valid results, the test should run for at least 12 hours (43200 seconds).
Example failure output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- In this example, the measured latency is outside the maximum allowed value.
18.5. Generating a latency test failure report Copy linkLink copied to clipboard!
Use the following procedure to generate a latency test failure report.
Prerequisites
-
You have installed the OpenShift CLI (
oc). -
You have logged in as a user with
cluster-adminprivileges.
Procedure
Create a test failure report with information about the cluster state and resources for troubleshooting by passing the
--reportparameter with the path to where the report is dumped:podman run -v $(pwd)/:/kubeconfig:Z -v $(pwd)/reportdest:<report_folder_path> \ -e KUBECONFIG=/kubeconfig/kubeconfig -e DISCOVERY_MODE=true -e FEATURES=performance \ registry.redhat.io/openshift4/cnf-tests-rhel8:v4.14 \ /usr/bin/test-run.sh --report <report_folder_path> \ -ginkgo.focus="\[performance\]\ Latency\ Test"
$ podman run -v $(pwd)/:/kubeconfig:Z -v $(pwd)/reportdest:<report_folder_path> \ -e KUBECONFIG=/kubeconfig/kubeconfig -e DISCOVERY_MODE=true -e FEATURES=performance \ registry.redhat.io/openshift4/cnf-tests-rhel8:v4.14 \ /usr/bin/test-run.sh --report <report_folder_path> \ -ginkgo.focus="\[performance\]\ Latency\ Test"Copy to Clipboard Copied! Toggle word wrap Toggle overflow where:
- <report_folder_path>
- Is the path to the folder where the report is generated.
18.6. Generating a JUnit latency test report Copy linkLink copied to clipboard!
Use the following procedure to generate a JUnit latency test report.
Prerequisites
-
You have installed the OpenShift CLI (
oc). -
You have logged in as a user with
cluster-adminprivileges.
Procedure
Create a JUnit-compliant XML report by passing the
--junitparameter together with the path to where the report is dumped:podman run -v $(pwd)/:/kubeconfig:Z -v $(pwd)/junitdest:<junit_folder_path> \ -e KUBECONFIG=/kubeconfig/kubeconfig -e DISCOVERY_MODE=true -e FEATURES=performance \ registry.redhat.io/openshift4/cnf-tests-rhel8:v4.14 \ /usr/bin/test-run.sh --junit <junit_folder_path> \ -ginkgo.focus="\[performance\]\ Latency\ Test"
$ podman run -v $(pwd)/:/kubeconfig:Z -v $(pwd)/junitdest:<junit_folder_path> \ -e KUBECONFIG=/kubeconfig/kubeconfig -e DISCOVERY_MODE=true -e FEATURES=performance \ registry.redhat.io/openshift4/cnf-tests-rhel8:v4.14 \ /usr/bin/test-run.sh --junit <junit_folder_path> \ -ginkgo.focus="\[performance\]\ Latency\ Test"Copy to Clipboard Copied! Toggle word wrap Toggle overflow where:
- <junit_folder_path>
- Is the path to the folder where the JUnit report is generated.
18.7. Running latency tests on a single-node OpenShift cluster Copy linkLink copied to clipboard!
You can run latency tests on single-node OpenShift clusters.
Always run the latency tests with DISCOVERY_MODE=true set. If you don’t, the test suite will make changes to the running cluster configuration.
When executing podman commands as a non-root or non-privileged user, mounting paths can fail with permission denied errors. To make the podman command work, append :Z to the volumes creation; for example, -v $(pwd)/:/kubeconfig:Z. This allows podman to do the proper SELinux relabeling.
Prerequisites
-
You have installed the OpenShift CLI (
oc). -
You have logged in as a user with
cluster-adminprivileges.
Procedure
To run the latency tests on a single-node OpenShift cluster, run the following command:
podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \ -e DISCOVERY_MODE=true -e FEATURES=performance -e ROLE_WORKER_CNF=master \ registry.redhat.io/openshift4/cnf-tests-rhel8:v4.14 \ /usr/bin/test-run.sh -ginkgo.focus="\[performance\]\ Latency\ Test" --ginkgo.timeout="24h"
$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \ -e DISCOVERY_MODE=true -e FEATURES=performance -e ROLE_WORKER_CNF=master \ registry.redhat.io/openshift4/cnf-tests-rhel8:v4.14 \ /usr/bin/test-run.sh -ginkgo.focus="\[performance\]\ Latency\ Test" --ginkgo.timeout="24h"Copy to Clipboard Copied! Toggle word wrap Toggle overflow NoteROLE_WORKER_CNF=masteris required because master is the only machine pool to which the node belongs. For more information about setting the requiredMachineConfigPoolfor the latency tests, see "Prerequisites for running latency tests".After running the test suite, all the dangling resources are cleaned up.
18.8. Running latency tests in a disconnected cluster Copy linkLink copied to clipboard!
The CNF tests image can run tests in a disconnected cluster that is not able to reach external registries. This requires two steps:
-
Mirroring the
cnf-testsimage to the custom disconnected registry. - Instructing the tests to consume the images from the custom disconnected registry.
Mirroring the images to a custom registry accessible from the cluster
A mirror executable is shipped in the image to provide the input required by oc to mirror the test image to a local registry.
Run this command from an intermediate machine that has access to the cluster and registry.redhat.io:
podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \ registry.redhat.io/openshift4/cnf-tests-rhel8:v4.14 \ /usr/bin/mirror -registry <disconnected_registry> | oc image mirror -f -
$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \ registry.redhat.io/openshift4/cnf-tests-rhel8:v4.14 \ /usr/bin/mirror -registry <disconnected_registry> | oc image mirror -f -Copy to Clipboard Copied! Toggle word wrap Toggle overflow where:
- <disconnected_registry>
-
Is the disconnected mirror registry you have configured, for example,
my.local.registry:5000/.
When you have mirrored the
cnf-tests image into the disconnected registry, you must override the original registry used to fetch the images when running the tests, for example:

$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
-e DISCOVERY_MODE=true -e FEATURES=performance -e IMAGE_REGISTRY="<disconnected_registry>" \
-e CNF_TESTS_IMAGE="cnf-tests-rhel8:v4.14" \
<disconnected_registry>/cnf-tests-rhel8:v4.14 \
/usr/bin/test-run.sh -ginkgo.focus="\[performance\]\ Latency\ Test" --ginkgo.timeout="24h"
Configuring the tests to consume images from a custom registry
You can run the latency tests using a custom test image and image registry using CNF_TESTS_IMAGE and IMAGE_REGISTRY variables.
To configure the latency tests to use a custom test image and image registry, run the following command:
podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \ -e IMAGE_REGISTRY="<custom_image_registry>" \ -e CNF_TESTS_IMAGE="<custom_cnf-tests_image>" \ -e FEATURES=performance \ registry.redhat.io/openshift4/cnf-tests-rhel8:v4.14 /usr/bin/test-run.sh --ginkgo.timeout="24h"
$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \ -e IMAGE_REGISTRY="<custom_image_registry>" \ -e CNF_TESTS_IMAGE="<custom_cnf-tests_image>" \ -e FEATURES=performance \ registry.redhat.io/openshift4/cnf-tests-rhel8:v4.14 /usr/bin/test-run.sh --ginkgo.timeout="24h"Copy to Clipboard Copied! Toggle word wrap Toggle overflow where:
- <custom_image_registry>
-
is the custom image registry, for example,
custom.registry:5000/. - <custom_cnf-tests_image>
-
is the custom cnf-tests image, for example,
custom-cnf-tests-image:latest.
Mirroring images to the cluster OpenShift image registry
OpenShift Container Platform provides a built-in container image registry, which runs as a standard workload on the cluster.
Procedure
Gain external access to the registry by exposing it with a route:
oc patch configs.imageregistry.operator.openshift.io/cluster --patch '{"spec":{"defaultRoute":true}}' --type=merge$ oc patch configs.imageregistry.operator.openshift.io/cluster --patch '{"spec":{"defaultRoute":true}}' --type=mergeCopy to Clipboard Copied! Toggle word wrap Toggle overflow Fetch the registry endpoint by running the following command:
REGISTRY=$(oc get route default-route -n openshift-image-registry --template='{{ .spec.host }}')$ REGISTRY=$(oc get route default-route -n openshift-image-registry --template='{{ .spec.host }}')Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create a namespace for exposing the images:
oc create ns cnftests
$ oc create ns cnftestsCopy to Clipboard Copied! Toggle word wrap Toggle overflow Make the image stream available to all the namespaces used for tests. This is required to allow the tests namespaces to fetch the images from the
cnf-testsimage stream. Run the following commands:oc policy add-role-to-user system:image-puller system:serviceaccount:cnf-features-testing:default --namespace=cnftests
$ oc policy add-role-to-user system:image-puller system:serviceaccount:cnf-features-testing:default --namespace=cnftestsCopy to Clipboard Copied! Toggle word wrap Toggle overflow oc policy add-role-to-user system:image-puller system:serviceaccount:performance-addon-operators-testing:default --namespace=cnftests
$ oc policy add-role-to-user system:image-puller system:serviceaccount:performance-addon-operators-testing:default --namespace=cnftestsCopy to Clipboard Copied! Toggle word wrap Toggle overflow Retrieve the docker secret name and auth token by running the following commands:
$ SECRET=$(oc -n cnftests get secret | grep builder-docker | awk '{print $1}')

$ TOKEN=$(oc -n cnftests get secret $SECRET -o jsonpath="{.data['\.dockercfg']}" | base64 --decode | jq '.["image-registry.openshift-image-registry.svc:5000"].auth')

Create a dockerauth.json file, for example:

$ echo "{\"auths\": { \"$REGISTRY\": { \"auth\": $TOKEN } }}" > dockerauth.json

Do the image mirroring:
$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
registry.redhat.io/openshift4/cnf-tests-rhel8:v4.14 \
/usr/bin/mirror -registry $REGISTRY/cnftests | oc image mirror --insecure=true \
-a=$(pwd)/dockerauth.json -f -

Run the tests:
podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \ -e DISCOVERY_MODE=true -e FEATURES=performance -e IMAGE_REGISTRY=image-registry.openshift-image-registry.svc:5000/cnftests \ cnf-tests-local:latest /usr/bin/test-run.sh -ginkgo.focus="\[performance\]\ Latency\ Test" --ginkgo.timeout="24h"
$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \ -e DISCOVERY_MODE=true -e FEATURES=performance -e IMAGE_REGISTRY=image-registry.openshift-image-registry.svc:5000/cnftests \ cnf-tests-local:latest /usr/bin/test-run.sh -ginkgo.focus="\[performance\]\ Latency\ Test" --ginkgo.timeout="24h"Copy to Clipboard Copied! Toggle word wrap Toggle overflow
Mirroring a different set of test images
You can optionally change the default upstream images that are mirrored for the latency tests.
Procedure
The mirror command tries to mirror the upstream images by default. This can be overridden by passing a file with the following format to the image:
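The file content is not included in this extract. A JSON list of registry and image entries is a reasonable sketch; the exact key names and values shown here are assumptions:

[
    {
        "registry": "public.registry.io:5000",
        "image": "imageforcnftests:4.14"
    }
]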
Pass the file to the mirror command, for example, saving it locally as images.json. With the following command, the local path is mounted in /kubeconfig inside the container so that it can be passed to the mirror command.
$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \ registry.redhat.io/openshift4/cnf-tests-rhel8:v4.14 /usr/bin/mirror \ --registry "my.local.registry:5000/" --images "/kubeconfig/images.json" \ | oc image mirror -f -Copy to Clipboard Copied! Toggle word wrap Toggle overflow
18.9. Troubleshooting errors with the cnf-tests container Copy linkLink copied to clipboard!
To run latency tests, the cluster must be accessible from within the cnf-tests container.
Prerequisites
-
You have installed the OpenShift CLI (
oc). -
You have logged in as a user with
cluster-adminprivileges.
Procedure
Verify that the cluster is accessible from inside the
cnf-testscontainer by running the following command:podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \ registry.redhat.io/openshift4/cnf-tests-rhel8:v4.14 \ oc get nodes
$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \ registry.redhat.io/openshift4/cnf-tests-rhel8:v4.14 \ oc get nodesCopy to Clipboard Copied! Toggle word wrap Toggle overflow If this command does not work, an error related to spanning across DNS, MTU size, or firewall access might be occurring.
Chapter 19. Improving cluster stability in high latency environments using worker latency profiles Copy linkLink copied to clipboard!
If the cluster administrator has performed latency tests for platform verification, they might discover the need to adjust the operation of the cluster to ensure stability in cases of high latency. The cluster administrator needs to change only one parameter, recorded in a file, which controls four parameters affecting how supervisory processes read status and interpret the health of the cluster. Changing only the one parameter provides cluster tuning in an easy, supportable manner.
The Kubelet process provides the starting point for monitoring cluster health. The Kubelet sets status values for all nodes in the OpenShift Container Platform cluster. The Kubernetes Controller Manager (kube controller) reads the status values every 10 seconds, by default. If the kube controller cannot read a node status value, it loses contact with that node after a configured period. The default behavior is:
-
The node controller on the control plane updates the node health to Unhealthy and marks the node's Ready condition as Unknown.
- In response, the scheduler stops scheduling pods to that node.
-
The Node Lifecycle Controller adds a
node.kubernetes.io/unreachabletaint with aNoExecuteeffect to the node and schedules any pods on the node for eviction after five minutes, by default.
This behavior can cause problems if your network is prone to latency issues, especially if you have nodes at the network edge. In some cases, the Kubernetes Controller Manager might not receive an update from a healthy node due to network latency. Pods can then be evicted from the node even though the node is healthy.
To avoid this problem, you can use worker latency profiles to adjust the frequency that the Kubelet and the Kubernetes Controller Manager wait for status updates before taking action. These adjustments help to ensure that your cluster runs properly if network latency between the control plane and the worker nodes is not optimal.
These worker latency profiles contain three sets of parameters that are predefined with carefully tuned values to control the reaction of the cluster to increased latency, so you do not need to find the best values experimentally.
You can configure worker latency profiles when installing a cluster or at any time you notice increased latency in your cluster network.
19.1. Understanding worker latency profiles Copy linkLink copied to clipboard!
Worker latency profiles are sets of four carefully tuned parameters: node-status-update-frequency, node-monitor-grace-period, default-not-ready-toleration-seconds, and default-unreachable-toleration-seconds. These parameters use values that allow you to control the reaction of the cluster to latency issues without needing to determine the best values manually.
Setting these parameters manually is not supported. Incorrect parameter settings adversely affect cluster stability.
All worker latency profiles configure the following parameters:
- node-status-update-frequency
- Specifies how often the kubelet posts node status to the API server.
- node-monitor-grace-period
-
Specifies the amount of time in seconds that the Kubernetes Controller Manager waits for an update from a kubelet before marking the node unhealthy and adding the
node.kubernetes.io/not-readyornode.kubernetes.io/unreachabletaint to the node. - default-not-ready-toleration-seconds
- Specifies the amount of time in seconds after marking a node unhealthy that the Kube API Server Operator waits before evicting pods from that node.
- default-unreachable-toleration-seconds
- Specifies the amount of time in seconds after marking a node unreachable that the Kube API Server Operator waits before evicting pods from that node.
The following Operators monitor the changes to the worker latency profiles and respond accordingly:
-
The Machine Config Operator (MCO) updates the
node-status-update-frequencyparameter on the worker nodes. -
The Kubernetes Controller Manager updates the
node-monitor-grace-periodparameter on the control plane nodes. -
The Kubernetes API Server Operator updates the
default-not-ready-toleration-secondsanddefault-unreachable-toleration-secondsparameters on the control plane nodes.
Although the default configuration works in most cases, OpenShift Container Platform offers two other worker latency profiles for situations where the network is experiencing higher latency than usual. The three worker latency profiles are described in the following sections:
- Default worker latency profile
With the Default profile, each kubelet updates its status every 10 seconds (node-status-update-frequency). The Kubernetes Controller Manager checks the status of the kubelet every 5 seconds, and waits 40 seconds (node-monitor-grace-period) for a status update from the kubelet before considering the kubelet unhealthy. If no status is made available to the Kubernetes Controller Manager, it then marks the node with the node.kubernetes.io/not-ready or node.kubernetes.io/unreachable taint and evicts the pods on that node.

If a pod on that node has the NoExecute taint, the pod is run according to tolerationSeconds. If the pod has no taint, it will be evicted in 300 seconds (the default-not-ready-toleration-seconds and default-unreachable-toleration-seconds settings of the Kube API Server).

| Profile | Component | Parameter | Value |
|---|---|---|---|
| Default | kubelet | node-status-update-frequency | 10s |
| | Kubernetes Controller Manager | node-monitor-grace-period | 40s |
| | Kubernetes API Server Operator | default-not-ready-toleration-seconds | 300s |
| | Kubernetes API Server Operator | default-unreachable-toleration-seconds | 300s |
- Medium worker latency profile
Use the
MediumUpdateAverageReactionprofile if the network latency is slightly higher than usual.The
MediumUpdateAverageReactionprofile reduces the frequency of kubelet updates to 20 seconds and changes the period that the Kubernetes Controller Manager waits for those updates to 2 minutes. The pod eviction period for a pod on that node is reduced to 60 seconds. If the pod has thetolerationSecondsparameter, the eviction waits for the period specified by that parameter.The Kubernetes Controller Manager waits for 2 minutes to consider a node unhealthy. In another minute, the eviction process starts.
| Profile | Component | Parameter | Value |
|---|---|---|---|
| MediumUpdateAverageReaction | kubelet | node-status-update-frequency | 20s |
| | Kubernetes Controller Manager | node-monitor-grace-period | 2m |
| | Kubernetes API Server Operator | default-not-ready-toleration-seconds | 60s |
| | Kubernetes API Server Operator | default-unreachable-toleration-seconds | 60s |
- Low worker latency profile
Use the
LowUpdateSlowReactionprofile if the network latency is extremely high.The
LowUpdateSlowReactionprofile reduces the frequency of kubelet updates to 1 minute and changes the period that the Kubernetes Controller Manager waits for those updates to 5 minutes. The pod eviction period for a pod on that node is reduced to 60 seconds. If the pod has thetolerationSecondsparameter, the eviction waits for the period specified by that parameter.The Kubernetes Controller Manager waits for 5 minutes to consider a node unhealthy. In another minute, the eviction process starts.
| Profile | Component | Parameter | Value |
|---|---|---|---|
| LowUpdateSlowReaction | kubelet | node-status-update-frequency | 1m |
| | Kubernetes Controller Manager | node-monitor-grace-period | 5m |
| | Kubernetes API Server Operator | default-not-ready-toleration-seconds | 60s |
| | Kubernetes API Server Operator | default-unreachable-toleration-seconds | 60s |
The latency profiles do not support custom machine config pools, only the default worker machine config pools.
19.2. Implementing worker latency profiles at cluster creation Copy linkLink copied to clipboard!
To edit the configuration of the installer, first use the openshift-install create manifests command to create the default node manifest and other manifest YAML files. This file structure must exist before you can add workerLatencyProfile. The platform on which you are installing might have varying requirements. Refer to the Installing section of the documentation for your specific platform.
The workerLatencyProfile must be added to the manifest in the following sequence:
- Create the manifest needed to build the cluster, using a folder name appropriate for your installation.
-
Create a YAML file to define
config.node. The file must be in themanifestsdirectory. -
When defining
workerLatencyProfilein the manifest for the first time, specify any of the profiles at cluster creation time:Default,MediumUpdateAverageReactionorLowUpdateSlowReaction.
Verification
Here is an example manifest creation showing the spec.workerLatencyProfile Default value in the manifest file:

$ openshift-install create manifests --dir=<cluster-install-dir>

Edit the manifest and add the value. This example uses vi to show an example manifest file with the "Default" workerLatencyProfile value added:

$ vi <cluster-install-dir>/manifests/config-node-default-profile.yaml

Example output
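The example file content is not reproduced in this extract. A minimal sketch of such a manifest, assuming the config.openshift.io/v1 Node resource named cluster:

apiVersion: config.openshift.io/v1
kind: Node
metadata:
  name: cluster
spec:
  workerLatencyProfile: "Default"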
19.3. Using and changing worker latency profiles Copy linkLink copied to clipboard!
To change a worker latency profile to deal with network latency, edit the node.config object to add the name of the profile. You can change the profile at any time as latency increases or decreases.
You must move one worker latency profile at a time. For example, you cannot move directly from the Default profile to the LowUpdateSlowReaction worker latency profile. You must move from the Default worker latency profile to the MediumUpdateAverageReaction profile first, then to LowUpdateSlowReaction. Similarly, when returning to the Default profile, you must move from the low profile to the medium profile first, then to Default.
You can also configure worker latency profiles upon installing an OpenShift Container Platform cluster.
Procedure
To move from the default worker latency profile:
Move to the medium worker latency profile:
Edit the node.config object:

$ oc edit nodes.config/cluster

Add spec.workerLatencyProfile: MediumUpdateAverageReaction, which specifies the medium worker latency policy.

Example node.config object
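A minimal sketch of the edited object, assuming the config.openshift.io/v1 Node resource named cluster:

apiVersion: config.openshift.io/v1
kind: Node
metadata:
  name: cluster
spec:
  workerLatencyProfile: MediumUpdateAverageReaction   # the medium worker latency policy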
Scheduling on each worker node is disabled as the change is being applied.
Optional: Move to the low worker latency profile:
Edit the node.config object:

$ oc edit nodes.config/cluster

Change the spec.workerLatencyProfile value to LowUpdateSlowReaction, which specifies use of the low worker latency policy.

Example node.config object
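A minimal sketch of the edited object, assuming the same config.openshift.io/v1 Node resource:

apiVersion: config.openshift.io/v1
kind: Node
metadata:
  name: cluster
spec:
  workerLatencyProfile: LowUpdateSlowReaction   # the low worker latency policy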
Scheduling on each worker node is disabled as the change is being applied.
Verification
When all nodes return to the
Ready condition, you can use the following command to check that the Kubernetes Controller Manager applied the profile:

$ oc get KubeControllerManager -o yaml | grep -i workerlatency -A 5 -B 5

The output lists the worker latency profile parameters, which indicates that the profile is applied and active.
To change the medium profile to default or change the default to medium, edit the node.config object and set the spec.workerLatencyProfile parameter to the appropriate value.
19.4. Example steps for displaying resulting values of workerLatencyProfile Copy linkLink copied to clipboard!
You can display the values in the workerLatencyProfile with the following commands.
Verification
Check the
default-not-ready-toleration-secondsanddefault-unreachable-toleration-secondsfields output by the Kube API Server:oc get KubeAPIServer -o yaml | grep -A 1 default-
$ oc get KubeAPIServer -o yaml | grep -A 1 default-Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
default-not-ready-toleration-seconds: - "300" default-unreachable-toleration-seconds: - "300"
default-not-ready-toleration-seconds: - "300" default-unreachable-toleration-seconds: - "300"Copy to Clipboard Copied! Toggle word wrap Toggle overflow Check the values of the
node-monitor-grace-periodfield from the Kube Controller Manager:oc get KubeControllerManager -o yaml | grep -A 1 node-monitor
$ oc get KubeControllerManager -o yaml | grep -A 1 node-monitorCopy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
node-monitor-grace-period: - 40s
node-monitor-grace-period: - 40sCopy to Clipboard Copied! Toggle word wrap Toggle overflow Check the
nodeStatusUpdateFrequency value from the kubelet. Set the directory /host as the root directory within the debug shell. By changing the root directory to /host, you can run binaries contained in the host's executable paths:

$ oc debug node/<worker-node-name>
$ chroot /host
# cat /etc/kubernetes/kubelet.conf | grep nodeStatusUpdateFrequency

Example output

"nodeStatusUpdateFrequency": "10s"
These outputs validate the set of timing variables for the Worker Latency Profile.
Chapter 20. Workload partitioning Copy linkLink copied to clipboard!
In resource-constrained environments, you can use workload partitioning to isolate OpenShift Container Platform services, cluster management workloads, and infrastructure pods to run on a reserved set of CPUs.
The minimum number of reserved CPUs required for the cluster management is four CPU Hyper-Threads (HTs). With workload partitioning, you annotate the set of cluster management pods and a set of typical add-on Operators for inclusion in the cluster management workload partition. These pods operate normally within the minimum size CPU configuration. Additional Operators or workloads outside of the set of minimum cluster management pods require additional CPUs to be added to the workload partition.
Workload partitioning isolates user workloads from platform workloads using standard Kubernetes scheduling capabilities.
The following changes are required for workload partitioning:
In the install-config.yaml file, add the additional field, cpuPartitioningMode:
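The original snippet is not reproduced in this extract. As a sketch, the field is a top-level key in install-config.yaml; the other values shown here are placeholders:

apiVersion: v1
baseDomain: example.com            # placeholder
metadata:
  name: example-cluster            # placeholder
cpuPartitioningMode: AllNodes      # enables CPU partitioning at install time; the default is None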
- Sets up a cluster for CPU partitioning at install time. The default value is
None.
NoteWorkload partitioning can only be enabled during cluster installation. You cannot disable workload partitioning postinstallation.
In the performance profile, specify the isolated and reserved CPUs.

Recommended performance profile configuration
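The recommended configuration itself is not included in this extract. The following is a minimal sketch of a PerformanceProfile that sets the fields described in Table 20.1; the CPU ranges, huge page counts, node selector, and profile name are illustrative only:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: openshift-node-performance-profile    # illustrative name
spec:
  additionalKernelArgs:
  - "efi=runtime"                             # configures UEFI secure boot, as noted in Table 20.1
  cpu:
    isolated: "2-31,34-63"                    # illustrative isolated CPUs; keep Hyper-Threading pairs together
    reserved: "0-1,32-33"                     # illustrative reserved CPUs; must not overlap with isolated
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 32                               # illustrative count
      size: 1G
      node: 0                                 # NUMA node where the huge pages are allocated
  realTimeKernel:
    enabled: true
  workloadHints:
    realTime: true                            # low latency and high performance hints
    highPowerConsumption: false
    perPodPowerManagement: false
  nodeSelector:
    node-role.kubernetes.io/master: ""        # assumption for a single-node OpenShift cluster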
Table 20.1. PerformanceProfile CR options for single-node OpenShift clusters

- metadata.name
- Ensure that name matches the following fields set in related GitOps ZTP custom resources (CRs):
  - include=openshift-node-performance-${PerformanceProfile.metadata.name} in TunedPerformancePatch.yaml
  - name: 50-performance-${PerformanceProfile.metadata.name} in validatorCRs/informDuValidator.yaml
- spec.additionalKernelArgs
- "efi=runtime" configures UEFI secure boot for the cluster host.
- spec.cpu.isolated
- Set the isolated CPUs. Ensure all of the Hyper-Threading pairs match. Important: The reserved and isolated CPU pools must not overlap and together must span all available cores. CPU cores that are not accounted for cause undefined behaviour in the system.
- spec.cpu.reserved
- Set the reserved CPUs. When workload partitioning is enabled, system processes, kernel threads, and system container threads are restricted to these CPUs. All CPUs that are not isolated should be reserved.
- spec.hugepages.pages
- Set the number of huge pages (count), set the huge pages size (size), and set node to the NUMA node where the hugepages are allocated (node).
- spec.realTimeKernel
- Set enabled to true to use the realtime kernel.
- spec.workloadHints
- Use workloadHints to define the set of top level flags for different types of workloads. The example configuration configures the cluster for low latency and high performance.
Workload partitioning introduces an extended management.workload.openshift.io/cores resource type for platform pods. The kubelet advertises the resources and the CPU requests by pods allocated to the pool within the corresponding resource. When workload partitioning is enabled, the management.workload.openshift.io/cores resource allows the scheduler to correctly assign pods based on the cpushares capacity of the host, not just the default cpuset.
Chapter 21. Requesting CRI-O and Kubelet profiling data by using the Node Observability Operator Copy linkLink copied to clipboard!
The Node Observability Operator collects and stores the CRI-O and Kubelet profiling data of worker nodes. You can query the profiling data to analyze the CRI-O and Kubelet performance trends and debug the performance-related issues.
The Node Observability Operator is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
21.1. Workflow of the Node Observability Operator Copy linkLink copied to clipboard!
The following workflow outlines how to query the profiling data using the Node Observability Operator:
- Install the Node Observability Operator in the OpenShift Container Platform cluster.
- Create a NodeObservability custom resource to enable the CRI-O profiling on the worker nodes of your choice.
- Run the profiling query to generate the profiling data.
21.2. Installing the Node Observability Operator Copy linkLink copied to clipboard!
The Node Observability Operator is not installed in OpenShift Container Platform by default. You can install the Node Observability Operator by using the OpenShift Container Platform CLI or the web console.
21.2.1. Installing the Node Observability Operator using the CLI Copy linkLink copied to clipboard!
You can install the Node Observability Operator by using the OpenShift CLI (oc).
Prerequisites
- You have installed the OpenShift CLI (oc).
-
You have access to the cluster with
cluster-adminprivileges.
Procedure
Confirm that the Node Observability Operator is available by running the following command:
oc get packagemanifests -n openshift-marketplace node-observability-operator
$ oc get packagemanifests -n openshift-marketplace node-observability-operatorCopy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
NAME CATALOG AGE node-observability-operator Red Hat Operators 9h
NAME CATALOG AGE node-observability-operator Red Hat Operators 9hCopy to Clipboard Copied! Toggle word wrap Toggle overflow Create the
node-observability-operator namespace by running the following command:

$ oc new-project node-observability-operator

Create an OperatorGroup object YAML file.

Create a Subscription object YAML file to subscribe a namespace to an Operator. Sketches of both objects follow.
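The example manifests are not included in this extract. The following minimal sketches assume standard OLM conventions; the alpha channel matches the web console procedure later in this chapter, and the redhat-operators catalog source is an assumption.

Example OperatorGroup object (sketch):

apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: node-observability-operator
  namespace: node-observability-operator
spec:
  targetNamespaces:
  - node-observability-operator

Example Subscription object (sketch):

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: node-observability-operator
  namespace: node-observability-operator
spec:
  channel: alpha                        # matches the channel used in the web console procedure
  name: node-observability-operator
  source: redhat-operators              # assumption: the Red Hat Operators catalog source
  sourceNamespace: openshift-marketplace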
Verification
View the install plan name by running the following command:
$ oc -n node-observability-operator get sub node-observability-operator -o yaml | yq '.status.installplan.name'

Example output

install-dt54w

Verify the install plan status by running the following command:

$ oc -n node-observability-operator get ip <install_plan_name> -o yaml | yq '.status.phase'

where <install_plan_name> is the install plan name that you obtained from the output of the previous command.

Example output

COMPLETE

Verify that the Node Observability Operator is up and running:

$ oc get deploy -n node-observability-operator

Example output

NAME                                             READY   UP-TO-DATE   AVAILABLE   AGE
node-observability-operator-controller-manager   1/1     1            1           40h
21.2.2. Installing the Node Observability Operator using the web console
You can install the Node Observability Operator from the OpenShift Container Platform web console.
Prerequisites
- You have access to the cluster with cluster-admin privileges.
- You have access to the OpenShift Container Platform web console.
Procedure
- Log in to the OpenShift Container Platform web console.
- In the Administrator’s navigation panel, expand Operators → OperatorHub.
- In the All items field, enter Node Observability Operator and select the Node Observability Operator tile.
- Click Install.
On the Install Operator page, configure the following settings:
- In the Update channel area, click alpha.
- In the Installation mode area, click A specific namespace on the cluster.
- From the Installed Namespace list, select node-observability-operator.
- In the Update approval area, select Automatic.
- Click Install.
Verification
- In the Administrator’s navigation panel, expand Operators → Installed Operators.
- Verify that the Node Observability Operator is listed in the Operators list.
21.3. Creating the Node Observability custom resource
You must create and run the NodeObservability custom resource (CR) before you run the profiling query. When you run the NodeObservability CR, it creates the necessary machine config and machine config pool CRs to enable the CRI-O profiling on the worker nodes matching the nodeSelector.
If CRI-O profiling is not enabled on the worker nodes, the NodeObservabilityMachineConfig resource gets created. Worker nodes matching the nodeSelector specified in the NodeObservability CR restart. This might take 10 or more minutes to complete.
Kubelet profiling is enabled by default.
The CRI-O unix socket of the node is mounted on the agent pod, which allows the agent to communicate with CRI-O to run the pprof request. Similarly, the kubelet-serving-ca certificate chain is mounted on the agent pod, which allows secure communication between the agent and node’s kubelet endpoint.
Prerequisites
- You have installed the Node Observability Operator.
- You have installed the OpenShift CLI (oc).
- You have access to the cluster with cluster-admin privileges.
Procedure
Log in to the OpenShift Container Platform CLI by running the following command:
$ oc login -u kubeadmin https://<HOSTNAME>:6443

Switch back to the node-observability-operator namespace by running the following command:

$ oc project node-observability-operator

Create a CR file named nodeobservability.yaml that defines the NodeObservability CR.
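A minimal sketch of such a CR, based on the CR name (cluster), the node selector, and the CRI-O and Kubelet profiling behavior described in this section, might look like the following. The apiVersion and the type field are assumptions; verify them against the CRD installed by your Operator version:

  apiVersion: nodeobservability.olm.openshift.io/v1alpha2   # assumption: check the installed CRD for the exact version
  kind: NodeObservability
  metadata:
    name: cluster                          # the CR name used by the commands in this procedure
  spec:
    nodeSelector:
      node-role.kubernetes.io/worker: ""   # assumption: selects all worker nodes; narrow this to the nodes of your choice
    type: crio-kubelet                     # assumption: enables CRI-O and Kubelet profiling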
Run the NodeObservability CR:

$ oc apply -f nodeobservability.yaml

Example output
nodeobservability.olm.openshift.io/cluster created

Review the status of the NodeObservability CR by running the following command:

$ oc get nob/cluster -o yaml | yq '.status.conditions'

Example output

The NodeObservability CR run is completed when the reason is Ready and the status is True.
21.4. Running the profiling query
To run the profiling query, you must create a NodeObservabilityRun resource. The profiling query is a blocking operation that fetches CRI-O and Kubelet profiling data for a duration of 30 seconds. After the profiling query is complete, you must retrieve the profiling data from the /run/node-observability directory inside the container file system. The lifetime of the data is bound to the agent pod through the emptyDir volume, so you can access the profiling data only while the agent pod is in the running status.

You can request only one profiling query at any point in time.
Prerequisites
- You have installed the Node Observability Operator.
- You have created the NodeObservability custom resource (CR).
- You have access to the cluster with cluster-admin privileges.
Procedure
Create a NodeObservabilityRun resource file named nodeobservabilityrun.yaml that defines the profiling run.
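A minimal sketch of such a resource, using the resource name (nodeobservabilityrun) and the NodeObservability CR name (cluster) referenced elsewhere in this procedure, might look like the following. The apiVersion and the reference field name are assumptions; verify them against the installed CRD:

  apiVersion: nodeobservability.olm.openshift.io/v1alpha2   # assumption: check the installed CRD for the exact version
  kind: NodeObservabilityRun
  metadata:
    name: nodeobservabilityrun
  spec:
    nodeObservabilityRef:                # assumption: field that points at the NodeObservability CR
      name: cluster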
Trigger the profiling query by running the NodeObservabilityRun resource:

$ oc apply -f nodeobservabilityrun.yaml
Review the status of the NodeObservabilityRun by running the following command:

$ oc get nodeobservabilityrun nodeobservabilityrun -o yaml | yq '.status.conditions'

Example output

The profiling query is complete once the status is True and the type is Finished.

Retrieve the profiling data from the container's /run/node-observability path by running the following bash script:
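The exact script depends on your environment; the following is a minimal sketch that assumes the agent pods run in the node-observability-operator namespace, carry the label app.kubernetes.io/name=node-observability-agent, and store profiles under /run/node-observability. Verify the namespace, label, and paths in your cluster before using it:

  #!/bin/bash
  # Copy the profiling files from every agent pod to /tmp/<pod_name> on the local machine.
  # Namespace, label, and path are assumptions; adjust them for your environment.
  for pod in $(oc get pods -n node-observability-operator \
      -l app.kubernetes.io/name=node-observability-agent -o name); do
    name=$(basename "${pod}")
    mkdir -p "/tmp/${name}"
    oc cp -n node-observability-operator "${name}:/run/node-observability" "/tmp/${name}"
  done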
Chapter 22. Clusters at the network far edge
22.1. Challenges of the network far edge
Edge computing presents complex challenges when managing many sites in geographically displaced locations. Use GitOps Zero Touch Provisioning (ZTP) to provision and manage sites at the far edge of the network.
22.1.1. Overcoming the challenges of the network far edge
Today, service providers want to deploy their infrastructure at the edge of the network. This presents significant challenges:
- How do you handle deployments of many edge sites in parallel?
- What happens when you need to deploy sites in disconnected environments?
- How do you manage the lifecycle of large fleets of clusters?
GitOps Zero Touch Provisioning (ZTP) meets these challenges by allowing you to provision remote edge sites at scale with declarative site definitions and configurations for bare-metal equipment. Template or overlay configurations install OpenShift Container Platform features that are required for CNF workloads. The full lifecycle of installation and upgrades is handled through the GitOps ZTP pipeline.
GitOps ZTP uses GitOps for infrastructure deployments. With GitOps, you use declarative YAML files and other defined patterns stored in Git repositories. Red Hat Advanced Cluster Management (RHACM) uses your Git repositories to drive the deployment of your infrastructure.
GitOps provides traceability, role-based access control (RBAC), and a single source of truth for the desired state of each site. Scalability issues are addressed by Git methodologies and event driven operations through webhooks.
You start the GitOps ZTP workflow by creating declarative site definition and configuration custom resources (CRs) that the GitOps ZTP pipeline delivers to the edge nodes.
The following diagram shows how GitOps ZTP works within the far edge framework.
22.1.2. Using GitOps ZTP to provision clusters at the network far edge
Red Hat Advanced Cluster Management (RHACM) manages clusters in a hub-and-spoke architecture, where a single hub cluster manages many spoke clusters. Hub clusters running RHACM provision and deploy the managed clusters by using GitOps Zero Touch Provisioning (ZTP) and the assisted service that is deployed when you install RHACM.
The assisted service handles provisioning of OpenShift Container Platform on single node clusters, three-node clusters, or standard clusters running on bare metal.
A high-level overview of using GitOps ZTP to provision and maintain bare-metal hosts with OpenShift Container Platform is as follows:
- A hub cluster running RHACM manages an OpenShift image registry that mirrors the OpenShift Container Platform release images. RHACM uses the OpenShift image registry to provision the managed clusters.
- You manage the bare-metal hosts in a YAML format inventory file, versioned in a Git repository.
- You make the hosts ready for provisioning as managed clusters, and use RHACM and the assisted service to install the bare-metal hosts on site.
Installing and deploying the clusters is a two-stage process, involving an initial installation phase, and a subsequent configuration and deployment phase. The following diagram illustrates this workflow:
22.1.3. Installing managed clusters with SiteConfig resources and RHACM
GitOps Zero Touch Provisioning (ZTP) uses SiteConfig custom resources (CRs) in a Git repository to manage the processes that install OpenShift Container Platform clusters. The SiteConfig CR contains cluster-specific parameters required for installation. It has options for applying select configuration CRs during installation, including user-defined extra manifests.
The GitOps ZTP plugin processes SiteConfig CRs to generate a collection of CRs on the hub cluster. This triggers the assisted service in Red Hat Advanced Cluster Management (RHACM) to install OpenShift Container Platform on the bare-metal host. You can find installation status and error messages in these CRs on the hub cluster.
You can provision single clusters manually or in batches with GitOps ZTP:
- Provisioning a single cluster
  Create a single SiteConfig CR and related installation and configuration CRs for the cluster, and apply them in the hub cluster to begin cluster provisioning. This is a good way to test your CRs before deploying on a larger scale.
- Provisioning many clusters
  Install managed clusters in batches of up to 400 by defining SiteConfig and related CRs in a Git repository. ArgoCD uses the SiteConfig CRs to deploy the sites. The RHACM policy generator creates the manifests and applies them to the hub cluster. This starts the cluster provisioning process.
22.1.4. Configuring managed clusters with policies and PolicyGenTemplate resources
GitOps Zero Touch Provisioning (ZTP) uses Red Hat Advanced Cluster Management (RHACM) to configure clusters by using a policy-based governance approach to applying the configuration.
The policy generator or PolicyGen is a plugin for the GitOps Operator that enables the creation of RHACM policies from a concise template. The tool can combine multiple CRs into a single policy, and you can generate multiple policies that apply to various subsets of clusters in your fleet.
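As an illustrative sketch only, and not a reference configuration from this document, a PolicyGenTemplate CR combines one or more source CRs into a named policy and binds it to clusters through labels. The field values below are placeholders:

  apiVersion: ran.openshift.io/v1
  kind: PolicyGenTemplate
  metadata:
    name: "group-du-sno"               # placeholder name
    namespace: "ztp-group"             # placeholder; the namespace must be prefixed with ztp, as noted later in this chapter
  spec:
    bindingRules:
      group-du-sno: ""                 # clusters with this label receive the generated policies
    mcp: "master"                      # machine config pool targeted by the generated configuration
    sourceFiles:
      - fileName: PtpConfigSlave.yaml  # a source CR taken from the source-crs directory
        policyName: "config-policy"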
For scalability and to reduce the complexity of managing configurations across the fleet of clusters, use configuration CRs with as much commonality as possible.
- Where possible, apply configuration CRs using a fleet-wide common policy.
- The next preference is to create logical groupings of clusters to manage as much of the remaining configurations as possible under a group policy.
- When a configuration is unique to an individual site, use RHACM templating on the hub cluster to inject the site-specific data into a common or group policy. Alternatively, apply an individual site policy for the site.
The following diagram shows how the policy generator interacts with GitOps and RHACM in the configuration phase of cluster deployment.
For large fleets of clusters, it is typical for there to be a high-level of consistency in the configuration of those clusters.
The following recommended structuring of policies combines configuration CRs to meet several goals:
- Describe common configurations once and apply to the fleet.
- Minimize the number of maintained and managed policies.
- Support flexibility in common configurations for cluster variants.
| Policy category | Description |
|---|---|
| Common | A policy that exists in the common category is applied to all clusters in the fleet. Use common policies for configuration that is shared across the whole fleet. |
| Groups | A policy that exists in the groups category is applied to a group of clusters in the fleet. Use group policies to manage configuration that is shared within a logical grouping of clusters. |
| Sites | A policy that exists in the sites category is applied to a specific cluster site. Any cluster can have its own specific policies maintained. |
22.2. Preparing the hub cluster for ZTP
To use RHACM in a disconnected environment, create a mirror registry that mirrors the OpenShift Container Platform release images and Operator Lifecycle Manager (OLM) catalog that contains the required Operator images. OLM manages, installs, and upgrades Operators and their dependencies in the cluster. You can also use a disconnected mirror host to serve the RHCOS ISO and RootFS disk images that are used to provision the bare-metal hosts.
22.2.1. Telco RAN DU 4.14 validated software components
The Red Hat telco RAN DU 4.14 solution has been validated using the following Red Hat software products for OpenShift Container Platform managed clusters and hub clusters.
| Component | Software version |
|---|---|
| Managed cluster version | 4.14 |
| Cluster Logging Operator | 5.8 |
| Local Storage Operator | 4.14 |
| PTP Operator | 4.14 |
| SRIOV Operator | 4.14 |
| Node Tuning Operator | 4.14 |
| Logging Operator | 4.14 |
| SRIOV-FEC Operator | 2.7 |
| Component | Software version |
|---|---|
| Hub cluster version | 4.14 |
| GitOps ZTP plugin | 4.14 |
| Red Hat Advanced Cluster Management (RHACM) | 2.9, 2.10 |
| Red Hat OpenShift GitOps | 1.16 |
| Topology Aware Lifecycle Manager (TALM) | 4.14 |
22.2.2. Recommended hub cluster specifications and managed cluster limits for GitOps ZTP
With GitOps Zero Touch Provisioning (ZTP), you can manage thousands of clusters in geographically dispersed regions and networks. The Red Hat Performance and Scale lab successfully created and managed 3500 virtual single-node OpenShift clusters with a reduced DU profile from a single Red Hat Advanced Cluster Management (RHACM) hub cluster in a lab environment.
In real-world situations, the scaling limits for the number of clusters that you can manage will vary depending on various factors affecting the hub cluster. For example:
- Hub cluster resources
- Available hub cluster host resources (CPU, memory, storage) are an important factor in determining how many clusters the hub cluster can manage. The more resources allocated to the hub cluster, the more managed clusters it can accommodate.
- Hub cluster storage
- The hub cluster host storage IOPS rating and whether the hub cluster hosts use NVMe storage can affect hub cluster performance and the number of clusters it can manage.
- Network bandwidth and latency
- Slow or high-latency network connections between the hub cluster and managed clusters can impact how the hub cluster manages multiple clusters.
- Managed cluster size and complexity
- The size and complexity of the managed clusters also affects the capacity of the hub cluster. Larger managed clusters with more nodes, namespaces, and resources require additional processing and management resources. Similarly, clusters with complex configurations such as the RAN DU profile or diverse workloads can require more resources from the hub cluster.
- Number of managed policies
- The number of policies managed by the hub cluster scaled over the number of managed clusters bound to those policies is an important factor that determines how many clusters can be managed.
- Monitoring and management workloads
- RHACM continuously monitors and manages the managed clusters. The number and complexity of monitoring and management workloads running on the hub cluster can affect its capacity. Intensive monitoring or frequent reconciliation operations can require additional resources, potentially limiting the number of manageable clusters.
- RHACM version and configuration
- Different versions of RHACM can have varying performance characteristics and resource requirements. Additionally, the configuration settings of RHACM, such as the number of concurrent reconciliations or the frequency of health checks, can affect the managed cluster capacity of the hub cluster.
Use the following representative configuration and network specifications to develop your own Hub cluster and network specifications.
The following guidelines are based on internal lab benchmark testing only and do not represent complete bare-metal host specifications.
| Requirement | Description |
|---|---|
| Server hardware | 3 x Dell PowerEdge R650 rack servers |
| NVMe hard disks | |
| SSD hard disks | |
| Number of applied DU profile policies | 5 |
The following network specifications are representative of a typical real-world RAN network and were applied to the scale lab environment during testing.
| Specification | Description |
|---|---|
| Round-trip time (RTT) latency | 50 ms |
| Packet loss | 0.02% packet loss |
| Network bandwidth limit | 20 Mbps |
22.2.3. Installing GitOps ZTP in a disconnected environment
Use Red Hat Advanced Cluster Management (RHACM), Red Hat OpenShift GitOps, and Topology Aware Lifecycle Manager (TALM) on the hub cluster in the disconnected environment to manage the deployment of multiple managed clusters.
Prerequisites
- You have installed the OpenShift Container Platform CLI (oc).
- You have logged in as a user with cluster-admin privileges.
- You have configured a disconnected mirror registry for use in the cluster.

Note: The disconnected mirror registry that you create must contain a version of TALM backup and pre-cache images that matches the version of TALM running in the hub cluster. The spoke cluster must be able to resolve these images in the disconnected mirror registry.
Procedure
- Install RHACM in the hub cluster. See Installing RHACM in a disconnected environment.
- Install GitOps and TALM in the hub cluster.
22.2.4. Adding RHCOS ISO and RootFS images to the disconnected mirror host
Before you begin installing clusters in the disconnected environment with Red Hat Advanced Cluster Management (RHACM), you must first host Red Hat Enterprise Linux CoreOS (RHCOS) images for it to use. Use a disconnected mirror to host the RHCOS images.
Prerequisites
- Deploy and configure an HTTP server to host the RHCOS image resources on the network. You must be able to access the HTTP server from your computer, and from the machines that you create.
The RHCOS images might not change with every release of OpenShift Container Platform. You must download images with the highest version that is less than or equal to the version that you install. Use the image versions that match your OpenShift Container Platform version if they are available. You require ISO and RootFS images to install RHCOS on the hosts. RHCOS QCOW2 images are not supported for this installation type.
Procedure
- Log in to the mirror host.
Obtain the RHCOS ISO and RootFS images from mirror.openshift.com, for example:
Export the required image names and OpenShift Container Platform version as environment variables:
$ export ISO_IMAGE_NAME=<iso_image_name>

$ export ROOTFS_IMAGE_NAME=<rootfs_image_name>

$ export OCP_VERSION=<ocp_version>

Download the required images:

$ sudo wget https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/4.14/${OCP_VERSION}/${ISO_IMAGE_NAME} -O /var/www/html/${ISO_IMAGE_NAME}

$ sudo wget https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/4.14/${OCP_VERSION}/${ROOTFS_IMAGE_NAME} -O /var/www/html/${ROOTFS_IMAGE_NAME}
Verification steps
Verify that the images downloaded successfully and are being served on the disconnected mirror host, for example:
$ wget http://$(hostname)/${ISO_IMAGE_NAME}

Example output

Saving to: rhcos-4.14.1-x86_64-live.x86_64.iso
rhcos-4.14.1-x86_64-live.x86_64.iso-  11%[====>  ]  10.01M  4.71MB/s
22.2.5. Enabling the assisted service
Red Hat Advanced Cluster Management (RHACM) uses the assisted service to deploy OpenShift Container Platform clusters. The assisted service is deployed automatically when you enable the MultiClusterHub Operator on Red Hat Advanced Cluster Management (RHACM). After that, you need to configure the Provisioning resource to watch all namespaces and to update the AgentServiceConfig custom resource (CR) with references to the ISO and RootFS images that are hosted on the mirror registry HTTP server.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in to the hub cluster as a user with cluster-admin privileges.
- You have RHACM with MultiClusterHub enabled.
Procedure
- Enable the Provisioning resource to watch all namespaces and configure mirrors for disconnected environments. For more information, see Enabling the central infrastructure management service.
- Update the AgentServiceConfig CR by running the following command:

  $ oc edit AgentServiceConfig

  Add the following entry to the items.spec.osImages field in the CR:

  - cpuArchitecture: x86_64
    openshiftVersion: "4.14"
    rootFSUrl: https://<host>/<path>/rhcos-live-rootfs.x86_64.img
    url: https://<host>/<path>/rhcos-live.x86_64.iso

  where:
  - <host> is the fully qualified domain name (FQDN) for the target mirror registry HTTP server.
  - <path> is the path to the image on the target mirror registry.
Save and quit the editor to apply the changes.
22.2.6. Configuring the hub cluster to use a disconnected mirror registry
You can configure the hub cluster to use a disconnected mirror registry for a disconnected environment.
Prerequisites
- You have a disconnected hub cluster installation with Red Hat Advanced Cluster Management (RHACM) 2.8 installed.
- You have hosted the rootfs and iso images on an HTTP server. See the Additional resources section for guidance about Mirroring the OpenShift Container Platform image repository.
If you enable TLS for the HTTP server, you must confirm the root certificate is signed by an authority trusted by the client and verify the trusted certificate chain between your OpenShift Container Platform hub and managed clusters and the HTTP server. Using a server configured with an untrusted certificate prevents the images from being downloaded to the image creation service. Using untrusted HTTPS servers is not supported.
Procedure
Create a ConfigMap containing the mirror registry config. The numbered items below describe its required contents:
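A sketch of such a ConfigMap, reconstructed from the numbered descriptions that follow, is shown below. The metadata.name and the label are assumptions; the repository names match the example repositories referenced in item 5:

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: mirror-registry-config       # assumption: any name; reference it from AgentServiceConfig spec.mirrorRegistryRef.name
    namespace: multicluster-engine     # item 1: must be multicluster-engine
    labels:
      app: assisted-service            # assumption
  data:
    ca-bundle.crt: |                   # item 2: the mirror registry certificate
      -----BEGIN CERTIFICATE-----
      <certificate_contents>
      -----END CERTIFICATE-----
    registries.conf: |                 # item 3: the mirror registry configuration file
      unqualified-search-registries = ["registry.access.redhat.com", "docker.io"]
      [[registry]]
        prefix = ""
        location = "quay.io/example-repository"                          # item 5: scoped by repository, not by registry
        mirror-by-digest-only = true
        [[registry.mirror]]
          location = "mirror1.registry.corp.com:5000/example-repository" # item 4: the mirror registry URL from imageContentSources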
1. The ConfigMap namespace must be set to multicluster-engine.
2. The mirror registry's certificate that is used when creating the mirror registry.
3. The configuration file for the mirror registry. The mirror registry configuration adds mirror information to the /etc/containers/registries.conf file in the discovery image. The mirror information is stored in the imageContentSources section of the install-config.yaml file when the information is passed to the installation program. The Assisted Service pod that runs on the hub cluster fetches the container images from the configured mirror registry.
4. The URL of the mirror registry. You must use the URL from the imageContentSources section by running the oc adm release mirror command when you configure the mirror registry. For more information, see the Mirroring the OpenShift Container Platform image repository section.
5. The registries defined in the registries.conf file must be scoped by repository, not by registry. In this example, both the quay.io/example-repository and the mirror1.registry.corp.com:5000/example-repository repositories are scoped by the example-repository repository.
This updates mirrorRegistryRef in the AgentServiceConfig custom resource. The numbered items after the sketch below describe the relevant fields.
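A hedged sketch of the relevant parts of the AgentServiceConfig CR, following the numbered descriptions below, might look like this. The apiVersion is an assumption, and required storage fields are omitted:

  apiVersion: agent-install.openshift.io/v1beta1   # assumption: verify against the installed CRD
  kind: AgentServiceConfig
  metadata:
    name: agent
    namespace: multicluster-engine                 # item 1
  spec:
    mirrorRegistryRef:
      name: mirror-registry-config                 # item 2: matches the ConfigMap name
    osImages:
    - openshiftVersion: "4.14"                     # item 3: x.y or x.y.z format
      cpuArchitecture: x86_64
      url: https://<host>/<path>/rhcos-live.x86_64.iso              # item 4: ISO hosted on the httpd server
      rootFSUrl: https://<host>/<path>/rhcos-live-rootfs.x86_64.img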
1. Set the AgentServiceConfig namespace to multicluster-engine to match the ConfigMap namespace.
2. Set mirrorRegistryRef.name to match the definition specified in the related ConfigMap CR.
3. Set the OpenShift Container Platform version to either the x.y or x.y.z format.
4. Set the URL for the ISO hosted on the httpd server.
A valid NTP server is required during cluster installation. Ensure that a suitable NTP server is available and can be reached from the installed clusters through the disconnected network.
22.2.7. Configuring the hub cluster to use unauthenticated registries
You can configure the hub cluster to use unauthenticated registries. Unauthenticated registries do not require authentication to access and download images.
Prerequisites
- You have installed and configured a hub cluster and installed Red Hat Advanced Cluster Management (RHACM) on the hub cluster.
- You have installed the OpenShift Container Platform CLI (oc).
- You have logged in as a user with cluster-admin privileges.
- You have configured an unauthenticated registry for use with the hub cluster.
Procedure
Update the AgentServiceConfig custom resource (CR) by running the following command:

$ oc edit AgentServiceConfig agent

Add the unauthenticatedRegistries field in the CR:
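A minimal sketch of the field follows; the registry host name is a placeholder:

  spec:
    unauthenticatedRegistries:
    - <unauthenticated_registry_host>:<port>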
Unauthenticated registries are listed under spec.unauthenticatedRegistries in the AgentServiceConfig resource. Any registry on this list is not required to have an entry in the pull secret used for the spoke cluster installation. assisted-service validates the pull secret by making sure it contains the authentication information for every image registry used for installation.
Mirror registries are automatically added to the ignore list and do not need to be added under spec.unauthenticatedRegistries. Specifying the PUBLIC_CONTAINER_REGISTRIES environment variable in the ConfigMap overrides the default values with the specified value. The PUBLIC_CONTAINER_REGISTRIES defaults are quay.io and registry.svc.ci.openshift.org.
Verification
Verify that you can access the newly added registry from the hub cluster by running the following commands:
Open a debug shell prompt to the hub cluster:
$ oc debug node/<node_name>

Test access to the unauthenticated registry by running the following command:

sh-4.4# podman login -u kubeadmin -p $(oc whoami -t) <unauthenticated_registry>

where:
- <unauthenticated_registry> is the new registry, for example, unauthenticated-image-registry.openshift-image-registry.svc:5000.
Example output
Login Succeeded!
22.2.8. Configuring the hub cluster with ArgoCD
You can configure the hub cluster with a set of ArgoCD applications that generate the required installation and policy custom resources (CRs) for each site with GitOps Zero Touch Provisioning (ZTP).
Red Hat Advanced Cluster Management (RHACM) uses SiteConfig CRs to generate the Day 1 managed cluster installation CRs for ArgoCD. Each ArgoCD application can manage a maximum of 300 SiteConfig CRs.
Prerequisites

- You have an OpenShift Container Platform hub cluster with Red Hat Advanced Cluster Management (RHACM) and Red Hat OpenShift GitOps installed.
- You have extracted the reference deployment from the GitOps ZTP plugin container as described in the "Preparing the GitOps ZTP site configuration repository" section. Extracting the reference deployment creates the out/argocd/deployment directory referenced in the following procedure.

Procedure

Prepare the ArgoCD pipeline configuration:

- Create a Git repository with a directory structure similar to the example directory. For more information, see "Preparing the GitOps ZTP site configuration repository".
- Configure access to the repository using the ArgoCD UI. Under Settings, configure the following:
  - Repositories - Add the connection information. The URL must end in .git, for example, https://repo.example.com/repo.git, and the credentials.
  - Certificates - Add the public certificate for the repository, if needed.

Modify the two ArgoCD applications, out/argocd/deployment/clusters-app.yaml and out/argocd/deployment/policies-app.yaml, based on your Git repository:

- Update the URL to point to the Git repository. The URL ends with .git, for example, https://repo.example.com/repo.git.
- The targetRevision indicates which Git repository branch to monitor.
- path specifies the path to the SiteConfig and PolicyGenTemplate CRs, respectively.

To install the GitOps ZTP plugin, patch the ArgoCD instance in the hub cluster with the relevant multicluster engine (MCE) subscription image. Customize the patch file that you previously extracted into the out/argocd/deployment/ directory for your environment.

Select the multicluster-operators-subscription image that matches your RHACM version:

- For RHACM 2.8 and 2.9, use the registry.redhat.io/rhacm2/multicluster-operators-subscription-rhel8:v<rhacm_version> image.
- For RHACM 2.10 and later, use the registry.redhat.io/rhacm2/multicluster-operators-subscription-rhel9:v<rhacm_version> image.

Important: The version of the multicluster-operators-subscription image must match the RHACM version. Beginning with the MCE 2.10 release, RHEL 9 is the base image for multicluster-operators-subscription images. Click [Expand for Operator list] in the "Platform Aligned Operators" table in OpenShift Operator Life Cycles to view the complete supported Operators matrix for OpenShift Container Platform.

Add the following configuration to the out/argocd/deployment/argocd-openshift-gitops-patch.json file:

1. Optional: For RHEL 9 images, copy the required universal executable in the /policy-generator/PolicyGenerator-not-fips-compliant folder for the ArgoCD version.
2. Match the multicluster-operators-subscription image to the RHACM version.
3. In disconnected environments, replace the URL for the multicluster-operators-subscription image with the disconnected registry equivalent for your environment.

Patch the ArgoCD instance. Run the following command:

$ oc patch argocd openshift-gitops \
    -n openshift-gitops --type=merge \
    --patch-file out/argocd/deployment/argocd-openshift-gitops-patch.json

In RHACM 2.7 and later, the multicluster engine enables the cluster-proxy-addon feature by default. Apply the following patch to disable the cluster-proxy-addon feature and remove the relevant hub cluster and managed pods that are responsible for this add-on. Run the following command:

$ oc patch multiclusterengines.multicluster.openshift.io multiclusterengine --type=merge --patch-file out/argocd/deployment/disable-cluster-proxy-addon.json

Apply the pipeline configuration to your hub cluster by running the following command:

$ oc apply -k out/argocd/deployment

Optional: If you have existing ArgoCD applications, verify that the PrunePropagationPolicy=background policy is set in the Application resource by running the following command:

$ oc -n openshift-gitops get applications.argoproj.io \
    clusters -o jsonpath='{.spec.syncPolicy.syncOptions}' | jq

Example output for an existing policy

[
  "CreateNamespace=true",
  "PrunePropagationPolicy=background",
  "RespectIgnoreDifferences=true"
]

If the spec.syncPolicy.syncOption field does not contain a PrunePropagationPolicy parameter or PrunePropagationPolicy is set to the foreground value, set the policy to background in the Application resource. See the following example:

kind: Application
spec:
  syncPolicy:
    syncOptions:
    - PrunePropagationPolicy=background

Setting the background deletion policy ensures that the ManagedCluster CR and all its associated resources are deleted.
22.2.9. Preparing the GitOps ZTP site configuration repository
Before you can use the GitOps Zero Touch Provisioning (ZTP) pipeline, you need to prepare the Git repository to host the site configuration data.
Prerequisites
- You have configured the hub cluster GitOps applications for generating the required installation and policy custom resources (CRs).
- You have deployed the managed clusters using GitOps ZTP.
Procedure
Create a directory structure with separate paths for the SiteConfig and PolicyGenTemplate CRs.

Note: Keep SiteConfig and PolicyGenTemplate CRs in separate directories. Both the SiteConfig and PolicyGenTemplate directories must contain a kustomization.yaml file that explicitly includes the files in that directory.

Export the argocd directory from the ztp-site-generate container image using the following commands:

$ podman pull registry.redhat.io/openshift4/ztp-site-generate-rhel8:v4.14

$ mkdir -p ./out

$ podman run --log-driver=none --rm registry.redhat.io/openshift4/ztp-site-generate-rhel8:v4.14 extract /home/ztp --tar | tar x -C ./out

Check that the out directory contains the following subdirectories:

- out/extra-manifest contains the source CR files that SiteConfig uses to generate the extra manifest configMap.
- out/source-crs contains the source CR files that PolicyGenTemplate uses to generate the Red Hat Advanced Cluster Management (RHACM) policies.
- out/argocd/deployment contains patches and YAML files to apply on the hub cluster for use in the next step of this procedure.
- out/argocd/example contains the example SiteConfig and PolicyGenTemplate files that represent the recommended configuration.

Copy the out/source-crs folder and contents to the PolicyGenTemplate directory.

The out/extra-manifests directory contains the reference manifests for a RAN DU cluster. Copy the out/extra-manifests directory into the SiteConfig folder. This directory should contain CRs from the ztp-site-generate container only. Do not add user-provided CRs here. If you want to work with user-provided CRs, you must create another directory for that content.

Commit the directory structure and the kustomization.yaml files and push them to your Git repository. The initial push to Git should include the kustomization.yaml files.

You can use the directory structure under out/argocd/example as a reference for the structure and content of your Git repository. That structure includes SiteConfig and PolicyGenTemplate reference CRs for single-node, three-node, and standard clusters. Remove references to cluster types that you are not using.

For all cluster types, you must:

- Add the source-crs subdirectory to the policygentemplate directory.
- Add the extra-manifests directory to the siteconfig directory.

The following example describes a set of CRs for a network of single-node clusters:
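As an illustrative sketch, with file names that are examples rather than mandated names, the Git repository layout for such a network might look like the following. The optional custom-manifest directory shows where user-provided CRs can live alongside the reference extra-manifest directory:

  example-repo/
  ├── policygentemplates
  │   ├── kustomization.yaml
  │   ├── ns.yaml
  │   ├── common-ranGen.yaml
  │   ├── group-du-sno-ranGen.yaml
  │   ├── example-sno-site.yaml
  │   └── source-crs/
  └── siteconfig
      ├── kustomization.yaml
      ├── example-sno.yaml
      ├── extra-manifest/
      └── custom-manifest/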
22.2.9.1. Preparing the GitOps ZTP site configuration repository for version independence
You can use GitOps ZTP to manage source custom resources (CRs) for managed clusters that are running different versions of OpenShift Container Platform. This means that the version of OpenShift Container Platform running on the hub cluster can be independent of the version running on the managed clusters.
Procedure

- Create a directory structure with separate paths for the SiteConfig and PolicyGenTemplate CRs.
- Within the PolicyGenTemplate directory, create a directory for each OpenShift Container Platform version you want to make available. For each version, create the following resources:
  - A kustomization.yaml file that explicitly includes the files in that directory.
  - A source-crs directory to contain reference CR configuration files from the ztp-site-generate container.
  If you want to work with user-provided CRs, you must create a separate directory for them.
- In the /siteconfig directory, create a subdirectory for each OpenShift Container Platform version you want to make available. For each version, create at least one directory for reference CRs to be copied from the container. There is no restriction on the naming of directories or on the number of reference directories. If you want to work with custom manifests, you must create a separate directory for them.

  The following example describes a structure using user-provided manifests and CRs for different versions of OpenShift Container Platform:

  - Create a top-level kustomization YAML file.
  - Create the version-specific directories within the custom /policygentemplates directory.
  - Create a kustomization.yaml file for each version.
  - Create a source-crs directory for each version to contain reference CRs from the ztp-site-generate container.
  - Create the reference-crs directory for policy CRs that are extracted from the ZTP container.
  - Optional: Create a custom-crs directory for user-provided CRs.
  - Create a directory within the custom /siteconfig directory to contain extra manifests from the ztp-site-generate container.
  - Create a folder to hold user-provided manifests.

  Note: In the previous example, each version subdirectory in the custom /siteconfig directory contains two further subdirectories, one containing the reference manifests copied from the container, the other for custom manifests that you provide. The names assigned to those directories are examples. If you use user-provided CRs, the last directory listed under extraManifests.searchPaths in the SiteConfig CR must be the directory containing user-provided CRs.

- Edit the SiteConfig CR to include the search paths of any directories you have created. The first directory that is listed under extraManifests.searchPaths must be the directory containing the reference manifests. Consider the order in which the directories are listed. In cases where directories contain files with the same name, the file in the final directory takes precedence.

  Example SiteConfig CR

  extraManifests:
    searchPaths:
    - extra-manifest/
    - custom-manifest/

- Edit the top-level kustomization.yaml file to control which OpenShift Container Platform versions are active. The following is an example of a kustomization.yaml file at the top level:

  resources:
  - version_4.13
  #- version_4.14
22.3. Updating GitOps ZTP
You can update the GitOps Zero Touch Provisioning (ZTP) infrastructure independently from the hub cluster, Red Hat Advanced Cluster Management (RHACM), and the managed OpenShift Container Platform clusters.
You can update the Red Hat OpenShift GitOps Operator when new versions become available. When updating the GitOps ZTP plugin, review the updated files in the reference configuration and ensure that the changes meet your requirements.
22.3.1. Overview of the GitOps ZTP update process
You can update GitOps Zero Touch Provisioning (ZTP) for a fully operational hub cluster running an earlier version of the GitOps ZTP infrastructure. The update process avoids impact on managed clusters.
Any changes to policy settings, including adding recommended content, result in updated policies that must be rolled out to the managed clusters and reconciled.
At a high level, the strategy for updating the GitOps ZTP infrastructure is as follows:

- Label all existing clusters with the ztp-done label.
- Stop the ArgoCD applications.
- Install the new GitOps ZTP tools.
- Update required content and optional changes in the Git repository.
- Update and restart the application configuration.
22.3.2. Preparing for the upgrade
Use the following procedure to prepare your site for the GitOps Zero Touch Provisioning (ZTP) upgrade.
Procedure
- Get the latest version of the GitOps ZTP container that has the custom resources (CRs) used to configure Red Hat OpenShift GitOps for use with GitOps ZTP.
Extract the argocd/deployment directory by using the following commands:

$ mkdir -p ./update

$ podman run --log-driver=none --rm registry.redhat.io/openshift4/ztp-site-generate-rhel8:v4.14 extract /home/ztp --tar | tar x -C ./update

The /update directory contains the following subdirectories:

- update/extra-manifest: contains the source CR files that the SiteConfig CR uses to generate the extra manifest configMap.
- update/source-crs: contains the source CR files that the PolicyGenTemplate CR uses to generate the Red Hat Advanced Cluster Management (RHACM) policies.
- update/argocd/deployment: contains patches and YAML files to apply on the hub cluster for use in the next step of this procedure.
- update/argocd/example: contains example SiteConfig and PolicyGenTemplate files that represent the recommended configuration.

Update the clusters-app.yaml and policies-app.yaml files to reflect the name of your applications and the URL, branch, and path for your Git repository.

If the upgrade includes changes that result in obsolete policies, remove the obsolete policies before performing the upgrade.

Diff the changes between the configuration and deployment source CRs in the /update folder and the Git repository where you manage your fleet site CRs. Apply and push the required changes to your site repository.

Important: When you update GitOps ZTP to the latest version, you must apply the changes from the update/argocd/deployment directory to your site repository. Do not use older versions of the argocd/deployment/ files.
22.3.3. Labeling the existing clusters
To ensure that existing clusters remain untouched by the tool updates, label all existing managed clusters with the ztp-done label.
This procedure only applies when updating clusters that were not provisioned with Topology Aware Lifecycle Manager (TALM). Clusters that you provision with TALM are automatically labeled with ztp-done.
Procedure
Find a label selector that lists the managed clusters that were deployed with GitOps Zero Touch Provisioning (ZTP), such as local-cluster!=true:

$ oc get managedcluster -l 'local-cluster!=true'

Ensure that the resulting list contains all the managed clusters that were deployed with GitOps ZTP, and then use that selector to add the ztp-done label:

$ oc label managedcluster -l 'local-cluster!=true' ztp-done=
22.3.4. Stopping the existing GitOps ZTP applications
Removing the existing applications ensures that any changes to existing content in the Git repository are not rolled out until the new version of the tools is available.
Use the application files from the deployment directory. If you used custom names for the applications, update the names in these files first.
Procedure
Perform a non-cascaded delete on the clusters application to leave all generated resources in place:

$ oc delete -f update/argocd/deployment/clusters-app.yaml

Perform a cascaded delete on the policies application to remove all previous policies:

$ oc patch -f policies-app.yaml -p '{"metadata": {"finalizers": ["resources-finalizer.argocd.argoproj.io"]}}' --type merge

$ oc delete -f update/argocd/deployment/policies-app.yaml
22.3.5. Required changes to the Git repository
When upgrading the ztp-site-generate container from an earlier release of GitOps Zero Touch Provisioning (ZTP) to 4.10 or later, there are additional requirements for the contents of the Git repository. Existing content in the repository must be updated to reflect these changes.
Make required changes to PolicyGenTemplate files:

All PolicyGenTemplate files must be created in a Namespace prefixed with ztp. This ensures that the GitOps ZTP application is able to manage the policy CRs generated by GitOps ZTP without conflicting with the way Red Hat Advanced Cluster Management (RHACM) manages the policies internally.

Add the kustomization.yaml file to the repository:

All SiteConfig and PolicyGenTemplate CRs must be included in a kustomization.yaml file under their respective directory trees.

Note: The files listed in the generator sections must contain either SiteConfig or PolicyGenTemplate CRs only. If your existing YAML files contain other CRs, for example, Namespace, these other CRs must be pulled out into separate files and listed in the resources section.

The PolicyGenTemplate kustomization file must contain all PolicyGenTemplate YAML files in the generator section and Namespace CRs in the resources section.

The SiteConfig kustomization file must contain all SiteConfig YAML files in the generator section and any other CRs in the resources section. Sketches of both kustomization files are shown at the end of this section.

Remove the pre-sync.yaml and post-sync.yaml files.

In OpenShift Container Platform 4.10 and later, the pre-sync.yaml and post-sync.yaml files are no longer required. The update/deployment/kustomization.yaml CR manages the policies deployment on the hub cluster.

Note: There is a set of pre-sync.yaml and post-sync.yaml files under both the SiteConfig and PolicyGenTemplate trees.

Review and incorporate recommended changes:

Each release may include additional recommended changes to the configuration applied to deployed clusters. Typically these changes result in lower CPU use by the OpenShift platform, additional features, or improved tuning of the platform.

Review the reference SiteConfig and PolicyGenTemplate CRs applicable to the types of cluster in your network. These examples can be found in the argocd/example directory extracted from the GitOps ZTP container.
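Hedged sketches of the two kustomization.yaml files described above might look like the following; the file names listed under generators and resources are placeholders for your own SiteConfig, PolicyGenTemplate, and Namespace files:

  # policygentemplates/kustomization.yaml (sketch)
  apiVersion: kustomize.config.k8s.io/v1beta1
  kind: Kustomization
  generators:
  - common-ranGen.yaml
  - group-du-sno-ranGen.yaml
  - example-sno-site.yaml
  resources:
  - ns.yaml

  # siteconfig/kustomization.yaml (sketch)
  apiVersion: kustomize.config.k8s.io/v1beta1
  kind: Kustomization
  generators:
  - example-sno.yaml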
22.3.6. Installing the new GitOps ZTP applications
Using the extracted argocd/deployment directory, and after ensuring that the applications point to your site Git repository, apply the full contents of the deployment directory. Applying the full contents of the directory ensures that all necessary resources for the applications are correctly configured.
Procedure
To install the GitOps ZTP plugin, patch the ArgoCD instance in the hub cluster with the relevant multicluster engine (MCE) subscription image. Customize the patch file that you previously extracted into the out/argocd/deployment/ directory for your environment.

Select the multicluster-operators-subscription image that matches your RHACM version:

- For RHACM 2.8 and 2.9, use the registry.redhat.io/rhacm2/multicluster-operators-subscription-rhel8:v<rhacm_version> image.
- For RHACM 2.10 and later, use the registry.redhat.io/rhacm2/multicluster-operators-subscription-rhel9:v<rhacm_version> image.

Important: The version of the multicluster-operators-subscription image must match the RHACM version. Beginning with the MCE 2.10 release, RHEL 9 is the base image for multicluster-operators-subscription images. Click [Expand for Operator list] in the "Platform Aligned Operators" table in OpenShift Operator Life Cycles to view the complete supported Operators matrix for OpenShift Container Platform.

Add the following configuration to the out/argocd/deployment/argocd-openshift-gitops-patch.json file:

1. Optional: For RHEL 9 images, copy the required universal executable in the /policy-generator/PolicyGenerator-not-fips-compliant folder for the ArgoCD version.
2. Match the multicluster-operators-subscription image to the RHACM version.
3. In disconnected environments, replace the URL for the multicluster-operators-subscription image with the disconnected registry equivalent for your environment.

Patch the ArgoCD instance. Run the following command:

$ oc patch argocd openshift-gitops \
    -n openshift-gitops --type=merge \
    --patch-file out/argocd/deployment/argocd-openshift-gitops-patch.json

In RHACM 2.7 and later, the multicluster engine enables the cluster-proxy-addon feature by default. Apply the following patch to disable the cluster-proxy-addon feature and remove the relevant hub cluster and managed pods that are responsible for this add-on. Run the following command:

$ oc patch multiclusterengines.multicluster.openshift.io multiclusterengine --type=merge --patch-file out/argocd/deployment/disable-cluster-proxy-addon.json

Apply the pipeline configuration to your hub cluster by running the following command:

$ oc apply -k out/argocd/deployment
22.3.7. Rolling out the GitOps ZTP configuration changes
If any configuration changes were included in the upgrade due to implementing recommended changes, the upgrade process results in a set of policy CRs on the hub cluster in the Non-Compliant state. With the GitOps Zero Touch Provisioning (ZTP) version 4.10 and later ztp-site-generate container, these policies are set to inform mode and are not pushed to the managed clusters without an additional step by the user. This ensures that potentially disruptive changes to the clusters can be managed in terms of when the changes are made, for example, during a maintenance window, and how many clusters are updated concurrently.
To roll out the changes, create one or more ClusterGroupUpgrade CRs as detailed in the TALM documentation. The CR must contain the list of Non-Compliant policies that you want to push out to the managed clusters as well as a list or selector of which clusters should be included in the update.
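As an illustrative sketch, with placeholder names and values that you should verify against the TALM documentation, a ClusterGroupUpgrade CR that pushes a set of Non-Compliant policies to selected clusters might look like this:

  apiVersion: ran.openshift.io/v1alpha1   # assumption: TALM API group and version
  kind: ClusterGroupUpgrade
  metadata:
    name: example-rollout
    namespace: ztp-install                # namespace used by TALM for auto-created CRs, as described in the next section
  spec:
    clusters:                             # explicit list of managed clusters to update
    - spoke1
    - spoke2
    managedPolicies:                      # the Non-Compliant policies to enforce
    - example-common-policy
    enable: true
    remediationStrategy:
      maxConcurrency: 2                   # how many clusters are updated at the same time
      timeout: 240                        # minutes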
22.4. Installing managed clusters with RHACM and SiteConfig resources
You can provision OpenShift Container Platform clusters at scale with Red Hat Advanced Cluster Management (RHACM) using the assisted service and the GitOps plugin policy generator with core-reduction technology enabled. The GitOps Zero Touch Provisioning (ZTP) pipeline performs the cluster installations. GitOps ZTP can be used in a disconnected environment.
22.4.1. GitOps ZTP and Topology Aware Lifecycle Manager
GitOps Zero Touch Provisioning (ZTP) generates installation and configuration CRs from manifests stored in Git. These artifacts are applied to a centralized hub cluster where Red Hat Advanced Cluster Management (RHACM), the assisted service, and the Topology Aware Lifecycle Manager (TALM) use the CRs to install and configure the managed cluster. The configuration phase of the GitOps ZTP pipeline uses the TALM to orchestrate the application of the configuration CRs to the cluster. There are several key integration points between GitOps ZTP and the TALM.
- Inform policies
-
By default, GitOps ZTP creates all policies with a remediation action of
inform. These policies cause RHACM to report on compliance status of clusters relevant to the policies but does not apply the desired configuration. During the GitOps ZTP process, after OpenShift installation, the TALM steps through the createdinformpolicies and enforces them on the target managed cluster(s). This applies the configuration to the managed cluster. Outside of the GitOps ZTP phase of the cluster lifecycle, this allows you to change policies without the risk of immediately rolling those changes out to affected managed clusters. You can control the timing and the set of remediated clusters by using TALM. - Automatic creation of ClusterGroupUpgrade CRs
  To automate the initial configuration of newly deployed clusters, TALM monitors the state of all ManagedCluster CRs on the hub cluster. Any ManagedCluster CR that does not have a ztp-done label applied, including newly created ManagedCluster CRs, causes the TALM to automatically create a ClusterGroupUpgrade CR with the following characteristics:
  - The ClusterGroupUpgrade CR is created and enabled in the ztp-install namespace.
  - The ClusterGroupUpgrade CR has the same name as the ManagedCluster CR.
  - The cluster selector includes only the cluster associated with that ManagedCluster CR.
  - The set of managed policies includes all policies that RHACM has bound to the cluster at the time the ClusterGroupUpgrade is created.
  - Pre-caching is disabled.
  - Timeout is set to 4 hours (240 minutes).
  The automatic creation of an enabled ClusterGroupUpgrade ensures that initial zero-touch deployment of clusters proceeds without the need for user intervention. Additionally, the automatic creation of a ClusterGroupUpgrade CR for any ManagedCluster without the ztp-done label allows a failed GitOps ZTP installation to be restarted by simply deleting the ClusterGroupUpgrade CR for the cluster.
- Waves
  Each policy generated from a PolicyGenTemplate CR includes a ztp-deploy-wave annotation. This annotation is based on the same annotation from each CR which is included in that policy. The wave annotation is used to order the policies in the auto-generated ClusterGroupUpgrade CR. The wave annotation is not used other than for the auto-generated ClusterGroupUpgrade CR.
  Note
  All CRs in the same policy must have the same setting for the ztp-deploy-wave annotation. The default value of this annotation for each CR can be overridden in the PolicyGenTemplate. The wave annotation in the source CR is used for determining and setting the policy wave annotation. This annotation is removed from each built CR which is included in the generated policy at runtime.
  The TALM applies the configuration policies in the order specified by the wave annotations. The TALM waits for each policy to be compliant before moving to the next policy. It is important to ensure that the wave annotation for each CR takes into account any prerequisites for those CRs to be applied to the cluster. For example, an Operator must be installed before or concurrently with the configuration for the Operator. Similarly, the CatalogSource for an Operator must be installed in a wave before or concurrently with the Operator Subscription. The default wave value for each CR takes these prerequisites into account.
  Multiple CRs and policies can share the same wave number. Having fewer policies can result in faster deployments and lower CPU usage. It is a best practice to group many CRs into relatively few waves.
  To check the default wave value in each source CR, run the following command against the out/source-crs directory that is extracted from the ztp-site-generate container image:
  $ grep -r "ztp-deploy-wave" out/source-crs
- Phase labels
  The ClusterGroupUpgrade CR is automatically created and includes directives to annotate the ManagedCluster CR with labels at the start and end of the GitOps ZTP process.
  When GitOps ZTP configuration postinstallation commences, the ManagedCluster has the ztp-running label applied. When all policies are remediated to the cluster and are fully compliant, these directives cause the TALM to remove the ztp-running label and apply the ztp-done label.
  For deployments that make use of the informDuValidator policy, the ztp-done label is applied when the cluster is fully ready for deployment of applications. This includes all reconciliation and resulting effects of the GitOps ZTP applied configuration CRs. The ztp-done label affects automatic ClusterGroupUpgrade CR creation by TALM. Do not manipulate this label after the initial GitOps ZTP installation of the cluster.
- Linked CRs
  The automatically created ClusterGroupUpgrade CR has the owner reference set as the ManagedCluster from which it was derived. This reference ensures that deleting the ManagedCluster CR causes the instance of the ClusterGroupUpgrade to be deleted along with any supporting resources.
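  As a sketch of what this linkage looks like, the generated ClusterGroupUpgrade CR carries a standard Kubernetes owner reference that points at the ManagedCluster object. The name and UID below are placeholders:
  metadata:
    name: example-sno
    namespace: ztp-install
    ownerReferences:
    - apiVersion: cluster.open-cluster-management.io/v1
      kind: ManagedCluster
      name: example-sno                                  # the ManagedCluster the CR was derived from
      uid: 00000000-0000-0000-0000-000000000000          # placeholder UID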
22.4.2. Overview of deploying managed clusters with GitOps ZTP
Red Hat Advanced Cluster Management (RHACM) uses GitOps Zero Touch Provisioning (ZTP) to deploy single-node OpenShift Container Platform clusters, three-node clusters, and standard clusters. You manage site configuration data as OpenShift Container Platform custom resources (CRs) in a Git repository. GitOps ZTP uses a declarative GitOps approach for a develop once, deploy anywhere model to deploy the managed clusters.
The deployment of the clusters includes:
- Installing the host operating system (RHCOS) on a blank server
- Deploying OpenShift Container Platform
- Creating cluster policies and site subscriptions
- Making the necessary network configurations to the server operating system
- Deploying profile Operators and performing any needed software-related configuration, such as performance profile, PTP, and SR-IOV
22.4.2.1. Overview of the managed site installation process
After you apply the managed site custom resources (CRs) on the hub cluster, the following actions happen automatically:
- A Discovery image ISO file is generated and booted on the target host.
- When the ISO file successfully boots on the target host it reports the host hardware information to RHACM.
- After all hosts are discovered, OpenShift Container Platform is installed.
- When OpenShift Container Platform finishes installing, the hub installs the klusterlet service on the target cluster.
- The requested add-on services are installed on the target cluster.
The Discovery image ISO process is complete when the Agent CR for the managed cluster is created on the hub cluster.
The target bare-metal host must meet the networking, firmware, and hardware requirements listed in Recommended single-node OpenShift cluster configuration for vDU application workloads.
22.4.3. Creating the managed bare-metal host secrets
Add the required Secret custom resources (CRs) for the managed bare-metal host to the hub cluster. You need a secret for the GitOps Zero Touch Provisioning (ZTP) pipeline to access the Baseboard Management Controller (BMC) and a secret for the assisted installer service to pull cluster installation images from the registry.
The secrets are referenced from the SiteConfig CR by name. The namespace must match the SiteConfig namespace.
Procedure
Create a YAML secret file containing credentials for the host Baseboard Management Controller (BMC) and a pull secret required for installing OpenShift and all add-on cluster Operators:
Save the following YAML as the file example-sno-secret.yaml:
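A sketch of the two Secret CRs that such a file typically contains follows. The credentials are placeholders, and the example-sno namespace must match the SiteConfig namespace used elsewhere in this section:
apiVersion: v1
kind: Secret
metadata:
  name: example-sno-bmc-secret
  namespace: example-sno                       # must match the SiteConfig namespace
data:
  # base64-encoded BMC credentials (placeholders)
  username: <base64_encoded_bmc_username>
  password: <base64_encoded_bmc_password>
type: Opaque
---
apiVersion: v1
kind: Secret
metadata:
  name: pull-secret                            # placeholder name; reference it from the SiteConfig CR
  namespace: example-sno
data:
  .dockerconfigjson: <base64_encoded_pull_secret>   # registry pull secret contents
type: kubernetes.io/dockerconfigjson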
- Add the relative path to example-sno-secret.yaml to the kustomization.yaml file that you use to install the cluster.
22.4.4. Configuring Discovery ISO kernel arguments for installations using GitOps ZTP
The GitOps Zero Touch Provisioning (ZTP) workflow uses the Discovery ISO as part of the OpenShift Container Platform installation process on managed bare-metal hosts. You can edit the InfraEnv resource to specify kernel arguments for the Discovery ISO. This is useful for cluster installations with specific environmental requirements. For example, configure the rd.net.timeout.carrier kernel argument for the Discovery ISO to facilitate static networking for the cluster or to receive a DHCP address before downloading the root file system during installation.
In OpenShift Container Platform 4.14, you can only add kernel arguments. You cannot replace or delete kernel arguments.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in to the hub cluster as a user with cluster-admin privileges.
Procedure
Create the InfraEnv CR and edit the spec.kernelArguments specification to configure kernel arguments. Save the following YAML in an InfraEnv-example.yaml file:
Note
The InfraEnv CR in this example uses template syntax such as {{ .Cluster.ClusterName }} that is populated based on values in the SiteConfig CR. The SiteConfig CR automatically populates values for these templates during deployment. Do not edit the templates manually.
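A minimal sketch of such an InfraEnv-example.yaml, assuming a single rd.net.timeout.carrier argument; the argument values are illustrative and the cluster name is populated from the SiteConfig template:
apiVersion: agent-install.openshift.io/v1beta1
kind: InfraEnv
metadata:
  name: "{{ .Cluster.ClusterName }}"
  namespace: "{{ .Cluster.ClusterName }}"
spec:
  clusterRef:
    name: "{{ .Cluster.ClusterName }}"
    namespace: "{{ .Cluster.ClusterName }}"
  kernelArguments:
    - operation: append                      # only append is supported in OpenShift Container Platform 4.14
      value: rd.net.timeout.carrier=30       # illustrative value; set for your environment
    - operation: append
      value: audit=0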
Commit the InfraEnv-example.yaml CR to the same location in your Git repository that has the SiteConfig CR and push your changes. The following example shows a sample Git repository structure:
~/example-ztp/install
└── site-install
     ├── siteconfig-example.yaml
     ├── InfraEnv-example.yaml
     ...
Edit the spec.clusters.crTemplates specification in the SiteConfig CR to reference the InfraEnv-example.yaml CR in your Git repository:
clusters:
  crTemplates:
    InfraEnv: "InfraEnv-example.yaml"
When you are ready to deploy your cluster by committing and pushing the SiteConfig CR, the build pipeline uses the custom InfraEnv-example CR in your Git repository to configure the infrastructure environment, including the custom kernel arguments.
Verification
To verify that the kernel arguments are applied, after the Discovery image verifies that OpenShift Container Platform is ready for installation, you can SSH to the target host before the installation process begins. At that point, you can view the kernel arguments for the Discovery ISO in the /proc/cmdline file.
Begin an SSH session with the target host:
$ ssh -i /path/to/privatekey core@<host_name>
View the system’s kernel arguments by using the following command:
$ cat /proc/cmdline
22.4.5. Deploying a managed cluster with SiteConfig and GitOps ZTP
Use the following procedure to create a SiteConfig custom resource (CR) and related files and initiate the GitOps Zero Touch Provisioning (ZTP) cluster deployment.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in to the hub cluster as a user with cluster-admin privileges.
- You configured the hub cluster for generating the required installation and policy CRs.
- You created a Git repository where you manage your custom site configuration data. The repository must be accessible from the hub cluster and you must configure it as a source repository for the ArgoCD application. See "Preparing the GitOps ZTP site configuration repository" for more information.
  Note
  When you create the source repository, ensure that you patch the ArgoCD application with the argocd/deployment/argocd-openshift-gitops-patch.json patch file that you extract from the ztp-site-generate container. See "Configuring the hub cluster with ArgoCD".
- To be ready for provisioning managed clusters, you require the following for each bare-metal host:
- Network connectivity
- Your network requires DNS. Managed cluster hosts should be reachable from the hub cluster. Ensure that Layer 3 connectivity exists between the hub cluster and the managed cluster host.
- Baseboard Management Controller (BMC) details
  GitOps ZTP uses BMC username and password details to connect to the BMC during cluster installation. The GitOps ZTP plugin manages the ManagedCluster CRs on the hub cluster based on the SiteConfig CR in your site Git repo. You create individual BMCSecret CRs for each host manually.
Procedure
Create the required managed cluster secrets on the hub cluster. These resources must be in a namespace with a name matching the cluster name. For example, in out/argocd/example/siteconfig/example-sno.yaml, the cluster name and namespace is example-sno.
Export the cluster namespace by running the following command:
$ export CLUSTERNS=example-sno
Create the namespace:
$ oc create namespace $CLUSTERNS
Create pull secret and BMC Secret CRs for the managed cluster. The pull secret must contain all the credentials necessary for installing OpenShift Container Platform and all required Operators. See "Creating the managed bare-metal host secrets" for more information.
Note
The secrets are referenced from the SiteConfig custom resource (CR) by name. The namespace must match the SiteConfig namespace.
Create a SiteConfig CR for your cluster in your local clone of the Git repository:
Choose the appropriate example for your CR from the out/argocd/example/siteconfig/ folder. The folder includes example files for single node, three-node, and standard clusters:
- example-sno.yaml
- example-3node.yaml
- example-standard.yaml
Change the cluster and host details in the example file to match the type of cluster you want. For example:
Example single-node OpenShift SiteConfig CR
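The full reference file is out/argocd/example/siteconfig/example-sno.yaml. The following is an abbreviated sketch of the key fields for a single-node cluster; all values are placeholders, and the reference file contains additional fields such as installConfigOverrides and ignitionConfigOverride that are omitted here:
apiVersion: ran.openshift.io/v1
kind: SiteConfig
metadata:
  name: "example-sno"
  namespace: "example-sno"
spec:
  baseDomain: "example.com"
  pullSecretRef:
    name: "assisted-deployment-pull-secret"
  clusterImageSetNameRef: "openshift-4.14"
  sshPublicKey: "ssh-rsa AAAA..."
  clusters:
  - clusterName: "example-sno"
    networkType: "OVNKubernetes"
    clusterLabels:
      common: true
      group-du-sno: ""
      sites: "example-sno"
    clusterNetwork:
    - cidr: 1001:1::/48
      hostPrefix: 64
    machineNetwork:
    - cidr: 1111:2222:3333:4444::/64
    serviceNetwork:
    - 1001:2::/112
    nodes:
    - hostName: "example-node1.example.com"
      role: "master"
      bmcAddress: "idrac-virtualmedia+https://[1111:2222:3333:4444::bbbb:1]/redfish/v1/Systems/System.Embedded.1"
      bmcCredentialsName:
        name: "example-node1-bmh-secret"
      bootMACAddress: "AA:BB:CC:DD:EE:11"
      bootMode: "UEFI"
      rootDeviceHints:
        deviceName: "/dev/disk/by-path/pci-0000:01:00.0-scsi-0:2:0:0"
      nodeNetwork:
        interfaces:
        - name: eno1
          macAddress: "AA:BB:CC:DD:EE:11"
        config:
          interfaces:
          - name: eno1
            type: ethernet
            state: up
            ipv6:
              enabled: true
              address:
              - ip: 1111:2222:3333:4444::aaaa:1
                prefix-length: 64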
Note
For more information about BMC addressing, see the "Additional resources" section. The installConfigOverrides and ignitionConfigOverride fields are expanded in the reference example for ease of readability.
You can inspect the default set of extra-manifest
MachineConfigCRs inout/argocd/extra-manifest. It is automatically applied to the cluster when it is installed. Optional: To provision additional install-time manifests on the provisioned cluster, create a directory in your Git repository, for example,
sno-extra-manifest/, and add your custom manifest CRs to this directory. If yourSiteConfig.yamlrefers to this directory in theextraManifestPathfield, any CRs in this referenced directory are appended to the default set of extra manifests.Enabling the crun OCI container runtimeFor optimal cluster performance, enable crun for master and worker nodes in single-node OpenShift, single-node OpenShift with additional worker nodes, three-node OpenShift, and standard clusters.
Enable crun in a
ContainerRuntimeConfigCR as an additional Day 0 install-time manifest to avoid the cluster having to reboot.The
enable-crun-master.yamlandenable-crun-worker.yamlCR files are in theout/source-crs/optional-extra-manifest/folder that you can extract from theztp-site-generatecontainer. For more information, see "Customizing extra installation manifests in the GitOps ZTP pipeline".
- Add the SiteConfig CR to the kustomization.yaml file in the generators section, similar to the example shown in out/argocd/example/siteconfig/kustomization.yaml and sketched below.
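  A minimal sketch of that file, assuming the example-sno.yaml SiteConfig file name used in this procedure:
  apiVersion: kustomize.config.k8s.io/v1beta1
  kind: Kustomization
  generators:
  - example-sno.yaml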
- Commit the SiteConfig CR and associated kustomization.yaml changes in your Git repository and push the changes.
  The ArgoCD pipeline detects the changes and begins the managed cluster deployment.
Verification
Verify that the custom roles and labels are applied after the node is deployed:
$ oc describe node example-node.example.com
Example output
- 1
- The custom label is applied to the node.
22.4.5.1. Single-node OpenShift SiteConfig CR installation reference
| SiteConfig CR field | Description |
|---|---|
|
|
Configure workload partitioning by setting the value for Note
Configuring workload partitioning by using the |
|
|
Set |
|
|
Configure the image set available on the hub cluster for all the clusters in the site. To see the list of supported versions on your hub cluster, run |
|
|
Set the Important
Use the reference configuration as specified in the example |
|
|
Specifies the cluster image set used to deploy an individual cluster. If defined, it overrides the |
|
|
Configure cluster labels to correspond to the |
|
|
Optional. Set |
|
|
For single-node deployments, define a single host. For three-node deployments, define three hosts. For standard deployments, define three hosts with |
|
| Specify custom roles for your nodes in your managed clusters. These are additional roles that are not used by any OpenShift Container Platform components; they are used only by the user. When you add a custom role, it can be associated with a custom machine config pool that references a specific configuration for that role. Adding custom labels or roles during installation makes the deployment process more effective and prevents the need for additional reboots after the installation is complete. |
|
|
Optional. Uncomment and set the value to |
|
| BMC address that you use to access the host. Applies to all cluster types. GitOps ZTP supports iPXE and virtual media booting by using Redfish or IPMI protocols. To use iPXE booting, you must use RHACM 2.8 or later. For more information about BMC addressing, see the "Additional resources" section. |
|
| BMC address that you use to access the host. Applies to all cluster types. GitOps ZTP supports iPXE and virtual media booting by using Redfish or IPMI protocols. To use iPXE booting, you must use RHACM 2.8 or later. For more information about BMC addressing, see the "Additional resources" section. Note In far edge Telco use cases, only virtual media is supported for use with GitOps ZTP. |
|
|
Configure the |
|
|
Set the boot mode for the host to |
|
|
Specifies the device for deployment. Identifiers that are stable across reboots are recommended. For example, |
|
| Optional. Use this field to assign partitions for persistent storage. Adjust disk ID and size to the specific hardware. |
|
| Configure the network settings for the node. |
|
| Configure the IPv6 address for the host. For single-node OpenShift clusters with static IP addresses, the node-specific API and Ingress IPs should be the same. |
22.4.6. Monitoring managed cluster installation progress
The ArgoCD pipeline uses the SiteConfig CR to generate the cluster configuration CRs and syncs them to the hub cluster. You can monitor the progress of the synchronization in the ArgoCD dashboard.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in to the hub cluster as a user with cluster-admin privileges.
Procedure
When the synchronization is complete, the installation generally proceeds as follows:
The Assisted Service Operator installs OpenShift Container Platform on the cluster. You can monitor the progress of cluster installation from the RHACM dashboard or from the command line by running the following commands:
Export the cluster name:
$ export CLUSTER=<clusterName>
Query the AgentClusterInstall CR for the managed cluster:
$ oc get agentclusterinstall -n $CLUSTER $CLUSTER -o jsonpath='{.status.conditions[?(@.type=="Completed")]}' | jq
Get the installation events for the cluster:
$ curl -sk $(oc get agentclusterinstall -n $CLUSTER $CLUSTER -o jsonpath='{.status.debugInfo.eventsURL}') | jq '.[-2,-1]'
22.4.7. Troubleshooting GitOps ZTP by validating the installation CRs
The ArgoCD pipeline uses the SiteConfig and PolicyGenTemplate custom resources (CRs) to generate the cluster configuration CRs and Red Hat Advanced Cluster Management (RHACM) policies. Use the following steps to troubleshoot issues that might occur during this process.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in to the hub cluster as a user with cluster-admin privileges.
Procedure
Check that the installation CRs were created by using the following command:
$ oc get AgentClusterInstall -n <cluster_name>
If no object is returned, use the following steps to troubleshoot the ArgoCD pipeline flow from SiteConfig files to the installation CRs.
Verify that the ManagedCluster CR was generated using the SiteConfig CR on the hub cluster:
$ oc get managedcluster
If the ManagedCluster is missing, check if the clusters application failed to synchronize the files from the Git repository to the hub cluster:
$ oc get applications.argoproj.io -n openshift-gitops clusters -o yaml
To identify error logs for the managed cluster, inspect the status.operationState.syncResult.resources field. For example, an invalid value assigned to the extraManifestPath in the SiteConfig CR generates an error in that field.
To see a more detailed SiteConfig error, complete the following steps:
- In the Argo CD dashboard, click the SiteConfig resource that Argo CD is trying to sync.
- Check the DESIRED MANIFEST tab to find the siteConfigError field.
  siteConfigError: >-
    Error: could not build the entire SiteConfig defined by /tmp/kust-plugin-config-1081291903: stat sno-extra-manifest: no such file or directory
- Check the Status.Sync field. If there are log errors, the Status.Sync field could indicate an Unknown error.
22.4.8. Troubleshooting GitOps ZTP virtual media booting on Supermicro servers
SuperMicro X11 servers do not support virtual media installations when the image is served using the https protocol. As a result, single-node OpenShift deployments for this environment fail to boot on the target node. To avoid this issue, log in to the hub cluster and disable Transport Layer Security (TLS) in the Provisioning resource. This ensures the image is not served with TLS even though the image address uses the https scheme.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in to the hub cluster as a user with cluster-admin privileges.
Procedure
Disable TLS in the Provisioning resource by running the following command:
$ oc patch provisioning provisioning-configuration --type merge -p '{"spec":{"disableVirtualMediaTLS": true}}'
- Continue the steps to deploy your single-node OpenShift cluster.
22.4.9. Removing a managed cluster site from the GitOps ZTP pipeline
You can remove a managed site and the associated installation and configuration policy CRs from the GitOps Zero Touch Provisioning (ZTP) pipeline.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in to the hub cluster as a user with cluster-admin privileges.
Procedure
- Remove a site and the associated CRs by removing the associated SiteConfig and PolicyGenTemplate files from the kustomization.yaml file.
- Add the following syncOptions field to your SiteConfig application:
  kind: Application
  spec:
    syncPolicy:
      syncOptions:
      - PrunePropagationPolicy=background
  When you run the GitOps ZTP pipeline again, the generated CRs are removed.
- Optional: If you want to permanently remove a site, you should also remove the SiteConfig and site-specific PolicyGenTemplate files from the Git repository.
- Optional: If you want to remove a site temporarily, for example when redeploying a site, you can leave the SiteConfig and site-specific PolicyGenTemplate CRs in the Git repository.
22.4.10. Removing obsolete content from the GitOps ZTP pipeline
If a change to the PolicyGenTemplate configuration results in obsolete policies, for example, if you rename policies, use the following procedure to remove the obsolete policies.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in to the hub cluster as a user with cluster-admin privileges.
Procedure
- Remove the affected PolicyGenTemplate files from the Git repository, then commit and push to the remote repository.
- Wait for the changes to synchronize through the application and the affected policies to be removed from the hub cluster.
- Add the updated PolicyGenTemplate files back to the Git repository, and then commit and push to the remote repository.
  Note
  Removing GitOps Zero Touch Provisioning (ZTP) policies from the Git repository, and as a result also removing them from the hub cluster, does not affect the configuration of the managed cluster. The policy and CRs managed by that policy remain in place on the managed cluster.
- Optional: As an alternative, after making changes to PolicyGenTemplate CRs that result in obsolete policies, you can remove these policies from the hub cluster manually. You can delete policies from the RHACM console using the Governance tab or by running the following command:
  $ oc delete policy -n <namespace> <policy_name>
22.4.11. Tearing down the GitOps ZTP pipeline
You can remove the ArgoCD pipeline and all generated GitOps Zero Touch Provisioning (ZTP) artifacts.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in to the hub cluster as a user with cluster-admin privileges.
Procedure
- Detach all clusters from Red Hat Advanced Cluster Management (RHACM) on the hub cluster.
- Delete the kustomization.yaml file in the deployment directory using the following command:
  $ oc delete -k out/argocd/deployment
- Commit and push your changes to the site repository.
22.5. Configuring managed clusters with policies and PolicyGenTemplate resources
Applied policy custom resources (CRs) configure the managed clusters that you provision. You can customize how Red Hat Advanced Cluster Management (RHACM) uses PolicyGenTemplate CRs to generate the applied policy CRs.
22.5.1. About the PolicyGenTemplate CRD
The PolicyGenTemplate custom resource definition (CRD) tells the PolicyGen policy generator what custom resources (CRs) to include in the cluster configuration, how to combine the CRs into the generated policies, and what items in those CRs need to be updated with overlay content.
The following example shows a PolicyGenTemplate CR (common-du-ranGen.yaml) extracted from the ztp-site-generate reference container. The common-du-ranGen.yaml file defines two Red Hat Advanced Cluster Management (RHACM) policies. The policies manage a collection of configuration CRs, one for each unique value of policyName in the CR. common-du-ranGen.yaml creates a single placement binding and a placement rule to bind the policies to clusters based on the labels listed in the bindingRules section.
Example PolicyGenTemplate CR - common-du-ranGen.yaml
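The full file ships in the reference container. The following abbreviated sketch shows only the parts that the numbered callouts below refer to; the registry image is a placeholder and the real file lists many more sourceFiles entries:
apiVersion: ran.openshift.io/v1
kind: PolicyGenTemplate
metadata:
  name: "common"
  namespace: "ztp-common"
spec:
  bindingRules:
    common: "true"                                  # (1)
  sourceFiles:                                      # (2)
    - fileName: OperatorHub.yaml                    # (3)
      policyName: "config-policy"
    - fileName: DefaultCatsrc.yaml                  # (4)
      policyName: "config-policy"                   # (5)
      metadata:
        name: redhat-operators
      spec:
        displayName: disconnected-redhat-operators
        image: registry.example.com:5000/disconnected-redhat-operators-index:v4.14   # placeholder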
- 1: common: "true" applies the policies to all clusters with this label.
- 2: Files listed under sourceFiles create the Operator policies for installed clusters.
- 3: OperatorHub.yaml configures the OperatorHub for the disconnected registry.
- 4: DefaultCatsrc.yaml configures the catalog source for the disconnected registry.
- 5: policyName: "config-policy" configures Operator subscriptions. The OperatorHub CR disables the default and this CR replaces redhat-operators with a CatalogSource CR that points to the disconnected registry.
A PolicyGenTemplate CR can be constructed with any number of included CRs. Apply the following example CR in the hub cluster to generate a policy containing a single CR:
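A minimal sketch of such a single-CR PolicyGenTemplate, using the group-du-sno and du-ptp-slave names described in the next paragraph (the interface and PTP option values are placeholders):
apiVersion: ran.openshift.io/v1
kind: PolicyGenTemplate
metadata:
  name: "group-du-sno"
  namespace: "ztp-group"
spec:
  bindingRules:
    group-du-sno: ""
  mcp: "master"
  sourceFiles:
    - fileName: PtpConfigSlave.yaml
      policyName: "config-policy"
      metadata:
        name: "du-ptp-slave"
      spec:
        profile:
        - name: "slave"
          interface: "ens5f0"                        # placeholder interface
          ptp4lOpts: "-2 -s --summary_interval -4"
          phc2sysOpts: "-a -r -n 24"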
Using the source file PtpConfigSlave.yaml as an example, the file defines a PtpConfig CR. The generated policy for the PtpConfigSlave example is named group-du-sno-config-policy. The PtpConfig CR defined in the generated group-du-sno-config-policy is named du-ptp-slave. The spec defined in PtpConfigSlave.yaml is placed under du-ptp-slave along with the other spec items defined under the source file.
The following example shows the group-du-sno-config-policy CR:
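The generated policy follows the standard RHACM policy wrapping. An abbreviated sketch, with status and placement objects omitted and the same placeholder values as above, might look like this:
apiVersion: policy.open-cluster-management.io/v1
kind: Policy
metadata:
  name: group-du-sno-config-policy
  namespace: ztp-group
spec:
  remediationAction: inform
  disabled: false
  policy-templates:
  - objectDefinition:
      apiVersion: policy.open-cluster-management.io/v1
      kind: ConfigurationPolicy
      metadata:
        name: group-du-sno-config-policy-config
      spec:
        remediationAction: inform
        severity: low
        object-templates:
        - complianceType: musthave
          objectDefinition:
            apiVersion: ptp.openshift.io/v1
            kind: PtpConfig
            metadata:
              name: du-ptp-slave
              namespace: openshift-ptp
            spec:
              profile:
              - name: slave
                interface: ens5f0
                ptp4lOpts: "-2 -s --summary_interval -4"
                phc2sysOpts: "-a -r -n 24"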
22.5.2. Recommendations when customizing PolicyGenTemplate CRs
Consider the following best practices when customizing site configuration PolicyGenTemplate custom resources (CRs):
- Use as few policies as are necessary. Using fewer policies requires fewer resources. Each additional policy creates overhead for the hub cluster and the deployed managed cluster. CRs are combined into policies based on the policyName field in the PolicyGenTemplate CR. CRs in the same PolicyGenTemplate which have the same value for policyName are managed under a single policy.
- In disconnected environments, use a single catalog source for all Operators by configuring the registry as a single index containing all Operators. Each additional CatalogSource CR on the managed clusters increases CPU usage.
- MachineConfig CRs should be included as extraManifests in the SiteConfig CR so that they are applied during installation. This can reduce the overall time taken until the cluster is ready to deploy applications.
- PolicyGenTemplates should override the channel field to explicitly identify the desired version. This ensures that changes in the source CR during upgrades do not update the generated subscription. A sketch of such an override follows this list.
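For example, a sketch of pinning the Subscription channel in a PolicyGenTemplate sourceFiles entry; the file name, policy name, channel, and source values are illustrative:
  sourceFiles:
    - fileName: SriovSubscription.yaml
      policyName: "subscriptions-policy"
      spec:
        channel: "stable"
        source: redhat-operators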
When managing large numbers of spoke clusters on the hub cluster, minimize the number of policies to reduce resource consumption.
Grouping multiple configuration CRs into a single or limited number of policies is one way to reduce the overall number of policies on the hub cluster. When using the common, group, and site hierarchy of policies for managing site configuration, it is especially important to combine site-specific configuration into a single policy.
22.5.3. PolicyGenTemplate CRs for RAN deployments
Use PolicyGenTemplate (PGT) custom resources (CRs) to customize the configuration applied to the cluster by using the GitOps Zero Touch Provisioning (ZTP) pipeline. The PGT CR allows you to generate one or more policies to manage the set of configuration CRs on your fleet of clusters. The PGT identifies the set of managed CRs, bundles them into policies, builds the policy wrapping around those CRs, and associates the policies with clusters by using label binding rules.
The reference configuration, obtained from the GitOps ZTP container, is designed to provide a set of critical features and node tuning settings that ensure the cluster can support the stringent performance and resource utilization constraints typical of RAN (Radio Access Network) Distributed Unit (DU) applications. Changes or omissions from the baseline configuration can affect feature availability, performance, and resource utilization. Use the reference PolicyGenTemplate CRs as the basis to create a hierarchy of configuration files tailored to your specific site requirements.
The baseline PolicyGenTemplate CRs that are defined for RAN DU cluster configuration can be extracted from the GitOps ZTP ztp-site-generate container. See "Preparing the GitOps ZTP site configuration repository" for further details.
The PolicyGenTemplate CRs can be found in the ./out/argocd/example/policygentemplates folder. The reference architecture has common, group, and site-specific configuration CRs. Each PolicyGenTemplate CR refers to other CRs that can be found in the ./out/source-crs folder.
The PolicyGenTemplate CRs relevant to RAN cluster configuration are described below. Variants are provided for the group PolicyGenTemplate CRs to account for differences in single-node, three-node compact, and standard cluster configurations. Similarly, site-specific configuration variants are provided for single-node clusters and multi-node (compact or standard) clusters. Use the group and site-specific configuration variants that are relevant for your deployment.
| PolicyGenTemplate CR | Description |
|---|---|
| | Contains a set of CRs that get applied to multi-node clusters. These CRs configure SR-IOV features typical for RAN installations. |
| | Contains a set of CRs that get applied to single-node OpenShift clusters. These CRs configure SR-IOV features typical for RAN installations. |
| | Contains a set of common RAN CRs that get applied to all clusters. These CRs subscribe to a set of operators providing cluster features typical for RAN as well as baseline cluster tuning. |
| | Contains the RAN policies for three-node clusters only. |
| | Contains the RAN policies for single-node clusters only. |
| | Contains the RAN policies for standard three control-plane clusters. |
22.5.4. Customizing a managed cluster with PolicyGenTemplate CRs
Use the following procedure to customize the policies that get applied to the managed cluster that you provision using the GitOps Zero Touch Provisioning (ZTP) pipeline.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in to the hub cluster as a user with cluster-admin privileges.
- You configured the hub cluster for generating the required installation and policy CRs.
- You created a Git repository where you manage your custom site configuration data. The repository must be accessible from the hub cluster and be defined as a source repository for the Argo CD application.
Procedure
Create a PolicyGenTemplate CR for site-specific configuration CRs.
- Choose the appropriate example for your CR from the out/argocd/example/policygentemplates folder, for example, example-sno-site.yaml or example-multinode-site.yaml.
- Change the bindingRules field in the example file to match the site-specific label included in the SiteConfig CR. In the example SiteConfig file, the site-specific label is sites: example-sno.
  Note
  Ensure that the labels defined in your PolicyGenTemplate bindingRules field correspond to the labels that are defined in the related managed clusters SiteConfig CR.
- Change the content in the example file to match the desired configuration.
Optional: Create a PolicyGenTemplate CR for any common configuration CRs that apply to the entire fleet of clusters.
- Select the appropriate example for your CR from the out/argocd/example/policygentemplates folder, for example, common-ranGen.yaml.
- Change the content in the example file to match the desired configuration.
Optional: Create a PolicyGenTemplate CR for any group configuration CRs that apply to certain groups of clusters in the fleet.
Ensure that the content of the overlaid spec files matches your desired end state. As a reference, the out/source-crs directory contains the full list of source-crs available to be included and overlaid by your PolicyGenTemplate templates.
Note
Depending on the specific requirements of your clusters, you might need more than a single group policy per cluster type, especially considering that the example group policies each have a single PerformancePolicy.yaml file that can only be shared across a set of clusters if those clusters consist of identical hardware configurations.
- Select the appropriate example for your CR from the out/argocd/example/policygentemplates folder, for example, group-du-sno-ranGen.yaml.
- Change the content in the example file to match the desired configuration.
- Optional: Create a validator inform policy PolicyGenTemplate CR to signal when the GitOps ZTP installation and configuration of the deployed cluster is complete. For more information, see "Creating a validator inform policy".
- Define all the policy namespaces in a YAML file similar to the example out/argocd/example/policygentemplates/ns.yaml file.
  Important
  Do not include the Namespace CR in the same file with the PolicyGenTemplate CR.
- Add the PolicyGenTemplate CRs and Namespace CR to the kustomization.yaml file in the generators section, similar to the example shown in out/argocd/example/policygentemplates/kustomization.yaml.
- Commit the PolicyGenTemplate CRs, Namespace CR, and associated kustomization.yaml file in your Git repository and push the changes.
  The ArgoCD pipeline detects the changes and begins the managed cluster deployment. You can push the changes to the SiteConfig CR and the PolicyGenTemplate CR simultaneously.
22.5.5. Monitoring managed cluster policy deployment progress
The ArgoCD pipeline uses PolicyGenTemplate CRs in Git to generate the RHACM policies and then sync them to the hub cluster. You can monitor the progress of the managed cluster policy synchronization after the assisted service installs OpenShift Container Platform on the managed cluster.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in to the hub cluster as a user with cluster-admin privileges.
Procedure
The Topology Aware Lifecycle Manager (TALM) applies the configuration policies that are bound to the cluster.
After the cluster installation is complete and the cluster becomes Ready, a ClusterGroupUpgrade CR corresponding to this cluster, with a list of ordered policies defined by the ran.openshift.io/ztp-deploy-wave annotations, is automatically created by the TALM. The cluster’s policies are applied in the order listed in the ClusterGroupUpgrade CR.
You can monitor the high-level progress of configuration policy reconciliation by using the following commands:
$ export CLUSTER=<clusterName>
$ oc get clustergroupupgrades -n ztp-install $CLUSTER -o jsonpath='{.status.conditions[-1:]}' | jq
You can monitor the detailed cluster policy compliance status by using the RHACM dashboard or the command line.
To check policy compliance by using oc, run the following command:
$ oc get policies -n $CLUSTER
To check policy status from the RHACM web console, perform the following actions:
- Click Governance → Find policies.
- Click on a cluster policy to check its status.
When all of the cluster policies become compliant, GitOps ZTP installation and configuration for the cluster is complete. The ztp-done label is added to the cluster.
In the reference configuration, the final policy that becomes compliant is the one defined in the *-du-validator-policy policy. This policy, when compliant on a cluster, ensures that all cluster configuration, Operator installation, and Operator configuration is complete.
22.5.6. Validating the generation of configuration policy CRs
Policy custom resources (CRs) are generated in the same namespace as the PolicyGenTemplate from which they are created. The same troubleshooting flow applies to all policy CRs generated from a PolicyGenTemplate regardless of whether they are ztp-common, ztp-group, or ztp-site based, as shown using the following commands:
$ export NS=<namespace>
$ oc get policy -n $NS
The expected set of policy-wrapped CRs should be displayed.
If the policies failed synchronization, use the following troubleshooting steps.
Procedure
To display detailed information about the policies, run the following command:
$ oc describe -n openshift-gitops application policies
Check for Status: Conditions: to show the error logs. For example, setting an invalid sourceFile → fileName: generates the error shown below:
Status:
  Conditions:
    Last Transition Time:  2021-11-26T17:21:39Z
    Message:               rpc error: code = Unknown desc = `kustomize build /tmp/https___git.com/ran-sites/policies/ --enable-alpha-plugins` failed exit status 1: 2021/11/26 17:21:40 Error could not find test.yaml under source-crs/: no such file or directory Error: failure in plugin configured via /tmp/kust-plugin-config-52463179; exit status 1: exit status 1
    Type:                  ComparisonError
Check for Status: Sync:. If there are log errors at Status: Conditions:, the Status: Sync: shows Unknown or Error.
When Red Hat Advanced Cluster Management (RHACM) recognizes that policies apply to a ManagedCluster object, the policy CR objects are applied to the cluster namespace. Check to see if the policies were copied to the cluster namespace:
$ oc get policy -n $CLUSTER
RHACM copies all applicable policies into the cluster namespace. The copied policy names have the format: <policyGenTemplate.Namespace>.<policyGenTemplate.Name>-<policyName>.
Check the placement rule for any policies not copied to the cluster namespace. The matchSelector in the PlacementRule for those policies should match labels on the ManagedCluster object:
$ oc get placementrule -n $NS
Note the PlacementRule name appropriate for the missing policy, common, group, or site, using the following command:
$ oc get placementrule -n $NS <placementRuleName> -o yaml
- The status.decisions field should include your cluster name.
- The key-value pair of the matchSelector in the spec must match the labels on your managed cluster.
Check the labels on the ManagedCluster object using the following command:
$ oc get ManagedCluster $CLUSTER -o jsonpath='{.metadata.labels}' | jq
Check to see which policies are compliant using the following command:
$ oc get policy -n $CLUSTER
If the Namespace, OperatorGroup, and Subscription policies are compliant but the Operator configuration policies are not, it is likely that the Operators did not install on the managed cluster. This causes the Operator configuration policies to fail to apply because the CRD is not yet applied to the spoke.
22.5.7. Restarting policy reconciliation
You can restart policy reconciliation when unexpected compliance issues occur, for example, when the ClusterGroupUpgrade custom resource (CR) has timed out.
Procedure
A ClusterGroupUpgrade CR is generated in the namespace ztp-install by the Topology Aware Lifecycle Manager after the managed cluster becomes Ready:
$ export CLUSTER=<clusterName>
$ oc get clustergroupupgrades -n ztp-install $CLUSTER
If there are unexpected issues and the policies fail to become compliant within the configured timeout (the default is 4 hours), the status of the ClusterGroupUpgrade CR shows UpgradeTimedOut:
$ oc get clustergroupupgrades -n ztp-install $CLUSTER -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'
A ClusterGroupUpgrade CR in the UpgradeTimedOut state automatically restarts its policy reconciliation every hour. If you have changed your policies, you can start a retry immediately by deleting the existing ClusterGroupUpgrade CR. This triggers the automatic creation of a new ClusterGroupUpgrade CR that begins reconciling the policies immediately:
$ oc delete clustergroupupgrades -n ztp-install $CLUSTER
Note that when the ClusterGroupUpgrade CR completes with status UpgradeCompleted and the managed cluster has the label ztp-done applied, you can make additional configuration changes using PolicyGenTemplate. Deleting the existing ClusterGroupUpgrade CR will not make the TALM generate a new CR.
At this point, GitOps ZTP has completed its interaction with the cluster and any further interactions should be treated as an update and a new ClusterGroupUpgrade CR created for remediation of the policies.
22.5.8. Changing applied managed cluster CRs using policies
You can remove content from a custom resource (CR) that is deployed in a managed cluster through a policy.
By default, all Policy CRs created from a PolicyGenTemplate CR have the complianceType field set to musthave. A musthave policy without the removed content is still compliant because the CR on the managed cluster has all the specified content. With this configuration, when you remove content from a CR, TALM removes the content from the policy but the content is not removed from the CR on the managed cluster.
With the complianceType field set to mustonlyhave, the policy ensures that the CR on the cluster is an exact match of what is specified in the policy.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in to the hub cluster as a user with cluster-admin privileges.
- You have deployed a managed cluster from a hub cluster running RHACM.
- You have installed Topology Aware Lifecycle Manager on the hub cluster.
Procedure
Remove the content that you no longer need from the affected CRs. In this example, the disableDrain: false line was removed from the SriovOperatorConfig CR.
Example CR
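A sketch of the resulting SriovOperatorConfig source CR after the line is removed; the remaining field values shown here are illustrative:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: openshift-sriov-network-operator
spec:
  configDaemonNodeSelector:
    "node-role.kubernetes.io/master": ""
  # disableDrain: false   <- this line was removed in this example
  enableInjector: true
  enableOperatorWebhook: true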
Change the complianceType of the affected policies to mustonlyhave in the group-du-sno-ranGen.yaml file.
Example YAML
# ...
- fileName: SriovOperatorConfig.yaml
  policyName: "config-policy"
  complianceType: mustonlyhave
# ...
Create a ClusterGroupUpgrade CR and specify the clusters that must receive the CR changes:
Example ClusterGroupUpgrade CR
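A sketch of such a cgu-remove.yaml, with placeholder cluster names and enable initially set to false so that the change can be rolled out later during a maintenance window:
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: cgu-remove
  namespace: default
spec:
  managedPolicies:
  - ztp-group.group-du-sno-config-policy      # policy that wraps the changed CR
  enable: false
  clusters:
  - spoke1
  - spoke2
  remediationStrategy:
    maxConcurrency: 2
    timeout: 240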
Create the ClusterGroupUpgrade CR by running the following command:
$ oc create -f cgu-remove.yaml
When you are ready to apply the changes, for example, during an appropriate maintenance window, change the value of the spec.enable field to true by running the following command:
$ oc --namespace=default patch clustergroupupgrade.ran.openshift.io/cgu-remove \
  --patch '{"spec":{"enable":true}}' --type=merge
Verification
Check the status of the policies by running the following command:
$ oc get <kind> <changed_cr_name>
Example output
NAMESPACE   NAME                                        REMEDIATION ACTION   COMPLIANCE STATE   AGE
default     cgu-ztp-group.group-du-sno-config-policy    enforce                                 17m
default     ztp-group.group-du-sno-config-policy        inform               NonCompliant       15h
When the COMPLIANCE STATE of the policy is Compliant, it means that the CR is updated and the unwanted content is removed.
Check that the policies are removed from the targeted clusters by running the following command on the managed clusters:
$ oc get <kind> <changed_cr_name>
22.5.9. Indication of done for GitOps ZTP installations
GitOps Zero Touch Provisioning (ZTP) simplifies the process of checking the GitOps ZTP installation status for a cluster. The GitOps ZTP status moves through three phases: cluster installation, cluster configuration, and GitOps ZTP done.
- Cluster installation phase
  The cluster installation phase is shown by the ManagedClusterJoined and ManagedClusterAvailable conditions in the ManagedCluster CR. If the ManagedCluster CR does not have these conditions, or the condition is set to False, the cluster is still in the installation phase. Additional details about installation are available from the AgentClusterInstall and ClusterDeployment CRs. For more information, see "Troubleshooting GitOps ZTP".
- Cluster configuration phase
  The cluster configuration phase is shown by a ztp-running label applied to the ManagedCluster CR for the cluster.
- GitOps ZTP done
  Cluster installation and configuration is complete in the GitOps ZTP done phase. This is shown by the removal of the ztp-running label and addition of the ztp-done label to the ManagedCluster CR. The ztp-done label shows that the configuration has been applied and the baseline DU configuration has completed cluster tuning.
  The transition to the GitOps ZTP done state is conditional on the compliant state of a Red Hat Advanced Cluster Management (RHACM) validator inform policy. This policy captures the existing criteria for a completed installation and validates that it moves to a compliant state only when GitOps ZTP provisioning of the managed cluster is complete.
The validator inform policy ensures the configuration of the cluster is fully applied and Operators have completed their initialization. The policy validates the following:
- The target MachineConfigPool contains the expected entries and has finished updating. All nodes are available and not degraded.
- The SR-IOV Operator has completed initialization as indicated by at least one SriovNetworkNodeState with syncStatus: Succeeded.
- The PTP Operator daemon set exists.
22.6. Manually installing a single-node OpenShift cluster with ZTP
You can deploy a managed single-node OpenShift cluster by using Red Hat Advanced Cluster Management (RHACM) and the assisted service.
If you are creating multiple managed clusters, use the SiteConfig method described in Deploying far edge sites with ZTP.
The target bare-metal host must meet the networking, firmware, and hardware requirements listed in Recommended cluster configuration for vDU application workloads.
22.6.1. Generating GitOps ZTP installation and configuration CRs manually
Use the generator entrypoint for the ztp-site-generate container to generate the site installation and configuration custom resources (CRs) for a cluster based on SiteConfig and PolicyGenTemplate CRs.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in to the hub cluster as a user with cluster-admin privileges.
Procedure
Create an output folder by running the following command:
$ mkdir -p ./out
Export the argocd directory from the ztp-site-generate container image:
$ podman run --log-driver=none --rm registry.redhat.io/openshift4/ztp-site-generate-rhel8:v4.14 extract /home/ztp --tar | tar x -C ./out
The ./out directory has the reference PolicyGenTemplate and SiteConfig CRs in the out/argocd/example/ folder.
Create an output folder for the site installation CRs:
$ mkdir -p ./site-install
Modify the example SiteConfig CR for the cluster type that you want to install. Copy example-sno.yaml to site-1-sno.yaml and modify the CR to match the details of the site and bare-metal host that you want to install.
Note
Once you have extracted reference CR configuration files from the out/extra-manifest directory of the ztp-site-generate container, you can use extraManifests.searchPaths to include the path to the git directory containing those files. This allows the GitOps ZTP pipeline to apply those CR files during cluster installation. If you configure a searchPaths directory, the GitOps ZTP pipeline does not fetch manifests from the ztp-site-generate container during site installation.
Generate the Day 0 installation CRs by processing the modified SiteConfig CR site-1-sno.yaml by running the following command:
$ podman run -it --rm -v `pwd`/out/argocd/example/siteconfig:/resources:Z -v `pwd`/site-install:/output:Z,U registry.redhat.io/openshift4/ztp-site-generate-rhel8:v4.14 generator install site-1-sno.yaml /output
Optional: Generate just the Day 0 MachineConfig installation CRs for a particular cluster type by processing the reference SiteConfig CR with the -E option. For example, run the following commands:
Create an output folder for the MachineConfig CRs:
$ mkdir -p ./site-machineconfig
Generate the MachineConfig installation CRs:
$ podman run -it --rm -v `pwd`/out/argocd/example/siteconfig:/resources:Z -v `pwd`/site-machineconfig:/output:Z,U registry.redhat.io/openshift4/ztp-site-generate-rhel8:v4.14 generator install -E site-1-sno.yaml /output
Example output
site-machineconfig
└── site-1-sno
    ├── site-1-sno_machineconfig_02-master-workload-partitioning.yaml
    ├── site-1-sno_machineconfig_predefined-extra-manifests-master.yaml
    └── site-1-sno_machineconfig_predefined-extra-manifests-worker.yaml
Generate and export the Day 2 configuration CRs using the reference PolicyGenTemplate CRs from the previous step. Run the following commands:
Create an output folder for the Day 2 CRs:
$ mkdir -p ./ref
Generate and export the Day 2 configuration CRs:
$ podman run -it --rm -v `pwd`/out/argocd/example/policygentemplates:/resources:Z -v `pwd`/ref:/output:Z,U registry.redhat.io/openshift4/ztp-site-generate-rhel8:v4.14 generator config -N . /output
The command generates example group and site-specific PolicyGenTemplate CRs for single-node OpenShift, three-node clusters, and standard clusters in the ./ref folder.
- Use the generated CRs as the basis for the CRs that you use to install the cluster. You apply the installation CRs to the hub cluster as described in "Installing a single managed cluster". The configuration CRs can be applied to the cluster after cluster installation is complete.
Verification
Verify that the custom roles and labels are applied after the node is deployed:
$ oc describe node example-node.example.com

In the example output, verify that the custom label is applied to the node.

22.6.2. Creating the managed bare-metal host secrets
Add the required Secret custom resources (CRs) for the managed bare-metal host to the hub cluster. You need a secret for the GitOps Zero Touch Provisioning (ZTP) pipeline to access the Baseboard Management Controller (BMC) and a secret for the assisted installer service to pull cluster installation images from the registry.
The secrets are referenced from the SiteConfig CR by name. The namespace must match the SiteConfig namespace.
Procedure
Create a YAML secret file containing credentials for the host Baseboard Management Controller (BMC) and a pull secret required for installing OpenShift and all add-on cluster Operators:
Save the following YAML as the file example-sno-secret.yaml:
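The reference YAML is not reproduced above. The following is a minimal sketch of the two required Secret CRs, assuming a cluster and namespace named example-sno; the secret names and credential values are placeholders that must match the references in your SiteConfig CR.

apiVersion: v1
kind: Secret
metadata:
  name: example-sno-bmc-secret        # Placeholder name; must match the SiteConfig reference
  namespace: example-sno
type: Opaque
data:
  username: <base64_encoded_bmc_username>
  password: <base64_encoded_bmc_password>
---
apiVersion: v1
kind: Secret
metadata:
  name: assisted-deployment-pull-secret   # Placeholder name; used by the assisted installer service
  namespace: example-sno
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: <base64_encoded_pull_secret>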
- Add the relative path to example-sno-secret.yaml to the kustomization.yaml file that you use to install the cluster.

22.6.3. Configuring Discovery ISO kernel arguments for manual installations using GitOps ZTP
The GitOps Zero Touch Provisioning (ZTP) workflow uses the Discovery ISO as part of the OpenShift Container Platform installation process on managed bare-metal hosts. You can edit the InfraEnv resource to specify kernel arguments for the Discovery ISO. This is useful for cluster installations with specific environmental requirements. For example, configure the rd.net.timeout.carrier kernel argument for the Discovery ISO to facilitate static networking for the cluster or to receive a DHCP address before downloading the root file system during installation.
In OpenShift Container Platform 4.14, you can only add kernel arguments. You cannot replace or delete kernel arguments.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in to the hub cluster as a user with cluster-admin privileges.
- You have manually generated the installation and configuration custom resources (CRs).
Procedure
- Edit the spec.kernelArguments specification in the InfraEnv CR to configure kernel arguments:
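The reference example is not shown above. The following is a minimal sketch of an InfraEnv CR that appends kernel arguments, assuming a cluster named example-sno; the argument values are illustrative only.

apiVersion: agent-install.openshift.io/v1beta1
kind: InfraEnv
metadata:
  name: example-sno
  namespace: example-sno
spec:
  clusterRef:
    name: example-sno
    namespace: example-sno
  kernelArguments:
    - operation: append        # Only append is supported in 4.14
      value: rd.net.timeout.carrier=20
    - operation: append
      value: audit=0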
The SiteConfig CR generates the InfraEnv resource as part of the day-0 installation CRs.
Verification
To verify that the kernel arguments are applied, you can SSH to the target host after the Discovery image verifies that OpenShift Container Platform is ready for installation, but before the installation process begins. At that point, you can view the kernel arguments for the Discovery ISO in the /proc/cmdline file.
Begin an SSH session with the target host:

$ ssh -i /path/to/privatekey core@<host_name>

View the system's kernel arguments by using the following command:

$ cat /proc/cmdline

22.6.4. Installing a single managed cluster
You can manually deploy a single managed cluster using the assisted service and Red Hat Advanced Cluster Management (RHACM).
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in to the hub cluster as a user with cluster-admin privileges.
- You have created the baseboard management controller (BMC) Secret and the image pull-secret Secret custom resources (CRs). See "Creating the managed bare-metal host secrets" for details.
- Your target bare-metal host meets the networking and hardware requirements for managed clusters.
Procedure
Create a ClusterImageSet for each specific cluster version to be deployed, for example clusterImageSet-4.14.yaml. A ClusterImageSet has the following format:
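The format is not reproduced above; a minimal sketch follows, assuming the 4.14.0 release. Adjust the name and releaseImage to match the version that you deploy.

apiVersion: hive.openshift.io/v1
kind: ClusterImageSet
metadata:
  name: openshift-4.14.0
spec:
  # Release image for the OpenShift Container Platform version to deploy
  releaseImage: quay.io/openshift-release-dev/ocp-release:4.14.0-x86_64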
Apply the clusterImageSet CR:

$ oc apply -f clusterImageSet-4.14.yaml

Create the Namespace CR in the cluster-namespace.yaml file:
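The file contents are not reproduced above; a minimal sketch follows. The namespace name must match the cluster name that is used in the installation CRs.

apiVersion: v1
kind: Namespace
metadata:
  name: <cluster_name>
  labels:
    name: <cluster_name>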
Apply the Namespace CR by running the following command:

$ oc apply -f cluster-namespace.yaml

Apply the generated day-0 CRs that you extracted from the ztp-site-generate container and customized to meet your requirements:

$ oc apply -R ./site-install/site-sno-1

22.6.5. Monitoring the managed cluster installation status
Ensure that cluster provisioning was successful by checking the cluster status.
Prerequisites
- All of the custom resources have been configured and provisioned, and the Agent custom resource is created on the hub for the managed cluster.
Procedure
Check the status of the managed cluster:
$ oc get managedcluster

True indicates the managed cluster is ready.

Check the agent status:

$ oc get agent -n <cluster_name>

Use the describe command to provide an in-depth description of the agent's condition. Statuses to be aware of include BackendError, InputError, ValidationsFailing, InstallationFailed, and AgentIsConnected. These statuses are relevant to the Agent and AgentClusterInstall custom resources.

$ oc describe agent -n <cluster_name>

Check the cluster provisioning status:

$ oc get agentclusterinstall -n <cluster_name>

Use the describe command to provide an in-depth description of the cluster provisioning status:

$ oc describe agentclusterinstall -n <cluster_name>

Check the status of the managed cluster's add-on services:

$ oc get managedclusteraddon -n <cluster_name>

Retrieve the authentication information of the kubeconfig file for the managed cluster:

$ oc get secret -n <cluster_name> <cluster_name>-admin-kubeconfig -o jsonpath={.data.kubeconfig} | base64 -d > <directory>/<cluster_name>-kubeconfig

22.6.6. Troubleshooting the managed cluster
Use this procedure to diagnose any installation issues that might occur with the managed cluster.
Procedure
Check the status of the managed cluster:
$ oc get managedcluster

Example output

NAME          HUB ACCEPTED   MANAGED CLUSTER URLS   JOINED   AVAILABLE   AGE
SNO-cluster   true                                  True     True        2d19h

If the status in the AVAILABLE column is True, the managed cluster is being managed by the hub.

If the status in the AVAILABLE column is Unknown, the managed cluster is not being managed by the hub. Use the following steps to continue checking to get more information.

Check the AgentClusterInstall install status:

$ oc get clusterdeployment -n <cluster_name>

Example output

NAME      PLATFORM         REGION   CLUSTERTYPE   INSTALLED   INFRAID   VERSION   POWERSTATE    AGE
Sno0026   agent-baremetal                         false                           Initialized   2d14h

If the status in the INSTALLED column is false, the installation was unsuccessful.

If the installation failed, enter the following command to review the status of the AgentClusterInstall resource:

$ oc describe agentclusterinstall -n <cluster_name> <cluster_name>

Resolve the errors and reset the cluster:

Remove the cluster's managed cluster resource:

$ oc delete managedcluster <cluster_name>

Remove the cluster's namespace:

$ oc delete namespace <cluster_name>

This deletes all of the namespace-scoped custom resources created for this cluster. You must wait for the ManagedCluster CR deletion to complete before proceeding.

- Recreate the custom resources for the managed cluster.

22.6.7. RHACM generated cluster installation CRs reference
Red Hat Advanced Cluster Management (RHACM) supports deploying OpenShift Container Platform on single-node clusters, three-node clusters, and standard clusters with a specific set of installation custom resources (CRs) that you generate using SiteConfig CRs for each site.
Every managed cluster has its own namespace, and all of the installation CRs except for ManagedCluster and ClusterImageSet are under that namespace. ManagedCluster and ClusterImageSet are cluster-scoped, not namespace-scoped. The namespace and the CR names match the cluster name.
The following table lists the installation CRs that are automatically applied by the RHACM assisted service when it installs clusters using the SiteConfig CRs that you configure.
| CR | Description | Usage |
|---|---|---|
| BareMetalHost | Contains the connection information for the Baseboard Management Controller (BMC) of the target bare-metal host. | Provides access to the BMC to load and start the discovery image on the target server by using the Redfish protocol. |
| InfraEnv | Contains information for installing OpenShift Container Platform on the target bare-metal host. | Used with ClusterDeployment to generate the discovery ISO for the managed cluster. |
| AgentClusterInstall | Specifies details of the managed cluster configuration such as networking and the number of control plane nodes. Displays the cluster kubeconfig and credentials when the installation is complete. | Specifies the managed cluster configuration information and provides status during the installation of the cluster. |
| ClusterDeployment | References the AgentClusterInstall CR to use. | Used with InfraEnv to generate the discovery ISO for the managed cluster. |
| NMStateConfig | Provides network configuration information, such as the MAC address to interface mapping, DNS server, default route, and other network settings. | Sets up a static IP address for the managed cluster's Kube API server. |
| Agent | Contains hardware information about the target bare-metal host. | Created automatically on the hub when the target machine's discovery image boots. |
| ManagedCluster | When a cluster is managed by the hub, it must be imported and known. This Kubernetes object provides that interface. | The hub uses this resource to manage and show the status of managed clusters. |
| KlusterletAddonConfig | Contains the list of services provided by the hub to be deployed to the ManagedCluster resource. | Tells the hub which addon services to deploy to the ManagedCluster resource. |
| Namespace | Logical space for ManagedCluster resources existing on the hub. | Propagates resources to the ManagedCluster. |
| Secret | Two CRs are created: the BMC Secret and the image pull Secret. | The BMC Secret authenticates to the BMC of the target bare-metal host. The image pull Secret contains authentication information for the OpenShift Container Platform image that is installed on the target bare-metal host. |
| ClusterImageSet | Contains OpenShift Container Platform image information such as the repository and image name. | Passed into resources to provide OpenShift Container Platform images. |

22.7. Recommended single-node OpenShift cluster configuration for vDU application workloads
Use the following reference information to understand the single-node OpenShift configurations required to deploy virtual distributed unit (vDU) applications in the cluster. Configurations include cluster optimizations for high performance workloads, enabling workload partitioning, and minimizing the number of reboots required postinstallation.
22.7.1. Running low latency applications on OpenShift Container Platform
OpenShift Container Platform enables low latency processing for applications running on commercial off-the-shelf (COTS) hardware by using several technologies and specialized hardware devices:
- Real-time kernel for RHCOS
- Ensures workloads are handled with a high degree of process determinism.
- CPU isolation
- Avoids CPU scheduling delays and ensures CPU capacity is available consistently.
- NUMA-aware topology management
- Aligns memory and huge pages with CPU and PCI devices to pin guaranteed container memory and huge pages to the non-uniform memory access (NUMA) node. Pod resources for all Quality of Service (QoS) classes stay on the same NUMA node. This decreases latency and improves performance of the node.
- Huge pages memory management
- Using huge page sizes improves system performance by reducing the amount of system resources required to access page tables.
- Precision timing synchronization using PTP
- Allows synchronization between nodes in the network with sub-microsecond accuracy.
22.7.2. Recommended cluster host requirements for vDU application workloads
Running vDU application workloads requires a bare-metal host with sufficient resources to run OpenShift Container Platform services and production workloads.
| Profile | vCPU | Memory | Storage |
|---|---|---|---|
| Minimum | 4 to 8 vCPU | 32GB of RAM | 120GB |
One vCPU equals one physical core. However, if you enable simultaneous multithreading (SMT), or Hyper-Threading, use the following formula to calculate the number of vCPUs that represent one physical core:
- (threads per core × cores) × sockets = vCPUs
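For example, a single-socket server with 4 cores and SMT enabled (2 threads per core) provides (2 × 4) × 1 = 8 vCPUs.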
The server must have a Baseboard Management Controller (BMC) when booting with virtual media.
22.7.3. Configuring host firmware for low latency and high performance
Bare-metal hosts require the firmware to be configured before the host can be provisioned. The firmware configuration is dependent on the specific hardware and the particular requirements of your installation.
Procedure
- Set the UEFI/BIOS Boot Mode to UEFI.
- In the host boot sequence order, set Hard drive first.
Apply the specific firmware configuration for your hardware. The following table describes a representative firmware configuration for an Intel Xeon Skylake or Intel Cascade Lake server, based on the Intel FlexRAN 4G and 5G baseband PHY reference design.
Important
The exact firmware configuration depends on your specific hardware and network requirements. The following sample configuration is for illustrative purposes only.

Table 22.10. Sample firmware configuration for an Intel Xeon Skylake or Cascade Lake server

| Firmware setting | Configuration |
|---|---|
| CPU Power and Performance Policy | Performance |
| Uncore Frequency Scaling | Disabled |
| Performance P-limit | Disabled |
| Enhanced Intel SpeedStep ® Tech | Enabled |
| Intel Configurable TDP | Enabled |
| Configurable TDP Level | Level 2 |
| Intel® Turbo Boost Technology | Enabled |
| Energy Efficient Turbo | Disabled |
| Hardware P-States | Disabled |
| Package C-State | C0/C1 state |
| C1E | Disabled |
| Processor C6 | Disabled |
Enable global SR-IOV and VT-d settings in the firmware for the host. These settings are relevant to bare-metal environments.
22.7.4. Connectivity prerequisites for managed cluster networks
Before you can install and provision a managed cluster with the GitOps Zero Touch Provisioning (ZTP) pipeline, the managed cluster host must meet the following networking prerequisites:
- There must be bi-directional connectivity between the GitOps ZTP container in the hub cluster and the Baseboard Management Controller (BMC) of the target bare-metal host.
- The managed cluster must be able to resolve and reach the API hostname of the hub and the *.apps hostname of the hub. Here is an example of the API hostname of the hub and the *.apps hostname:
  - api.hub-cluster.internal.domain.com
  - console-openshift-console.apps.hub-cluster.internal.domain.com
- The hub cluster must be able to resolve and reach the API and *.apps hostname of the managed cluster. Here is an example of the API hostname of the managed cluster and the *.apps hostname:
  - api.sno-managed-cluster-1.internal.domain.com
  - console-openshift-console.apps.sno-managed-cluster-1.internal.domain.com

22.7.5. Workload partitioning in single-node OpenShift with GitOps ZTP
Workload partitioning configures OpenShift Container Platform services, cluster management workloads, and infrastructure pods to run on a reserved number of host CPUs.
To configure workload partitioning with GitOps Zero Touch Provisioning (ZTP), you configure a cpuPartitioningMode field in the SiteConfig custom resource (CR) that you use to install the cluster and you apply a PerformanceProfile CR that configures the isolated and reserved CPUs on the host.
Configuring the SiteConfig CR enables workload partitioning at cluster installation time and applying the PerformanceProfile CR configures the specific allocation of CPUs to reserved and isolated sets. Both of these steps happen at different points during cluster provisioning.
Configuring workload partitioning by using the cpuPartitioningMode field in the SiteConfig CR is a Tech Preview feature in OpenShift Container Platform 4.13.
Alternatively, you can specify cluster management CPU resources with the cpuset field of the SiteConfig custom resource (CR) and the reserved field of the group PolicyGenTemplate CR. The GitOps ZTP pipeline uses these values to populate the required fields in the workload partitioning MachineConfig CR (cpuset) and the PerformanceProfile CR (reserved) that configure the single-node OpenShift cluster. This method is a General Availability feature in OpenShift Container Platform 4.14.
The workload partitioning configuration pins the OpenShift Container Platform infrastructure pods to the reserved CPU set. Platform services such as systemd, CRI-O, and kubelet run on the reserved CPU set. The isolated CPU sets are exclusively allocated to your container workloads. Isolating CPUs ensures that the workload has guaranteed access to the specified CPUs without contention from other applications running on the same node. All CPUs that are not isolated should be reserved.
Ensure that reserved and isolated CPU sets do not overlap with each other.
22.7.6. Recommended cluster install manifests
The ZTP pipeline applies the following custom resources (CRs) during cluster installation. These configuration CRs ensure that the cluster meets the feature and performance requirements necessary for running a vDU application.
When using the GitOps ZTP plugin and SiteConfig CRs for cluster deployment, the following MachineConfig CRs are included by default.
Use the SiteConfig extraManifests filter to alter the CRs that are included by default. For more information, see Advanced managed cluster configuration with SiteConfig CRs.
22.7.6.1. Workload partitioning
Single-node OpenShift clusters that run DU workloads require workload partitioning. This limits the cores allowed to run platform services, maximizing the CPU cores available for application payloads.
Workload partitioning can be enabled during cluster installation only. You cannot disable workload partitioning postinstallation. You can, however, change the set of CPUs assigned to the isolated and reserved sets through the PerformanceProfile CR. Changes to CPU settings cause the node to reboot.
When transitioning to using cpuPartitioningMode for enabling workload partitioning, remove the workload partitioning MachineConfig CRs from the /extra-manifest folder that you use to provision the cluster.
Recommended SiteConfig CR configuration for workload partitioning
- Set the cpuPartitioningMode field to AllNodes to configure workload partitioning for all nodes in the cluster.
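The reference configuration is not reproduced above. The following is a minimal sketch of the relevant SiteConfig excerpt, assuming a single-node cluster named example-sno; only the fields related to CPU partitioning are shown.

apiVersion: ran.openshift.io/v1
kind: SiteConfig
metadata:
  name: example-sno
  namespace: example-sno
spec:
  clusters:
    - clusterName: example-sno
      # Enables workload partitioning for all nodes at install time
      cpuPartitioningMode: AllNodes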
Verification
Check that the applications and cluster system CPU pinning is correct. Run the following commands:
Open a remote shell prompt to the managed cluster:
$ oc debug node/example-sno-1

Check that the OpenShift infrastructure applications CPU pinning is correct:

sh-4.4# pgrep ovn | while read i; do taskset -cp $i; done

Check that the system applications CPU pinning is correct:

sh-4.4# pgrep systemd | while read i; do taskset -cp $i; done

Example output

pid 1's current affinity list: 0-1,52-53
pid 938's current affinity list: 0-1,52-53
pid 962's current affinity list: 0-1,52-53
pid 1197's current affinity list: 0-1,52-53

22.7.6.2. Reduced platform management footprint
To reduce the overall management footprint of the platform, a MachineConfig custom resource (CR) is required that places all Kubernetes-specific mount points in a new namespace separate from the host operating system. The following base64-encoded example MachineConfig CR illustrates this configuration.
Recommended container mount namespace configuration (01-container-mount-ns-and-kubelet-conf-master.yaml)
22.7.6.3. SCTP
Stream Control Transmission Protocol (SCTP) is a key protocol used in RAN applications. This MachineConfig object adds the SCTP kernel module to the node to enable this protocol.
Recommended control plane node SCTP configuration (03-sctp-machine-config-master.yaml)
Recommended worker node SCTP configuration (03-sctp-machine-config-worker.yaml)
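The referenced files are not reproduced here. The following is a minimal sketch of a MachineConfig CR that loads the SCTP kernel module on control plane nodes; the CR name is illustrative, and the worker variant differs only in the role label.

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: load-sctp-module-master      # Illustrative name
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/modules-load.d/sctp-load.conf
          mode: 0644
          overwrite: true
          contents:
            # URL-encoded file contents: the single word "sctp"
            source: data:,sctp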
22.7.6.4. Setting rcu_normal
The following MachineConfig CR configures the system to set rcu_normal to 1 after the system has finished startup. This improves kernel latency for vDU applications.
Recommended configuration for disabling rcu_expedited after the node has finished startup (08-set-rcu-normal-master.yaml)
22.7.6.5. Automatic kernel crash dumps with kdump
kdump is a Linux kernel feature that creates a kernel crash dump when the kernel crashes. kdump is enabled with the following MachineConfig CRs.
Recommended MachineConfig CR to remove ice driver from control plane kdump logs (05-kdump-config-master.yaml)
Recommended control plane node kdump configuration (06-kdump-master.yaml)
Recommended MachineConfig CR to remove ice driver from worker node kdump logs (05-kdump-config-worker.yaml)
Recommended kdump worker node configuration (06-kdump-worker.yaml)
22.7.6.6. Disable automatic CRI-O cache wipe
After an uncontrolled host shutdown or cluster reboot, CRI-O automatically deletes the entire CRI-O cache, causing all images to be pulled from the registry when the node reboots. This can result in unacceptably slow recovery times or recovery failures. To prevent this from happening in single-node OpenShift clusters that you install with GitOps ZTP, disable the CRI-O delete cache feature during cluster installation.
Recommended MachineConfig CR to disable CRI-O cache wipe on control plane nodes (99-crio-disable-wipe-master.yaml)
Recommended MachineConfig CR to disable CRI-O cache wipe on worker nodes (99-crio-disable-wipe-worker.yaml)
22.7.6.7. Configuring crun as the default container runtime
The following ContainerRuntimeConfig custom resources (CRs) configure crun as the default OCI container runtime for control plane and worker nodes. The crun container runtime is fast and lightweight and has a low memory footprint.
For optimal performance, enable crun for control plane and worker nodes in single-node OpenShift, three-node OpenShift, and standard clusters. To avoid the cluster rebooting when the CR is applied, apply the change as a GitOps ZTP additional Day 0 install-time manifest.
Recommended ContainerRuntimeConfig CR for control plane nodes (enable-crun-master.yaml)
Recommended ContainerRuntimeConfig CR for worker nodes (enable-crun-worker.yaml)
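The referenced files are not reproduced here. The following is a minimal sketch of the control plane variant; treat the CR name and selector as assumptions, and mirror the same structure with the worker pool selector for worker nodes.

apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: enable-crun-master       # Illustrative name
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/master: ""
  containerRuntimeConfig:
    # Sets crun as the default OCI container runtime
    defaultRuntime: crun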
22.7.7. Recommended postinstallation cluster configurations
When the cluster installation is complete, the ZTP pipeline applies the following custom resources (CRs) that are required to run DU workloads.
In GitOps ZTP v4.10 and earlier, you configure UEFI secure boot with a MachineConfig CR. This is no longer required in GitOps ZTP v4.11 and later. In v4.11, you configure UEFI secure boot for single-node OpenShift clusters by updating the spec.clusters.nodes.bootMode field in the SiteConfig CR that you use to install the cluster. For more information, see Deploying a managed cluster with SiteConfig and GitOps ZTP.
22.7.7.1. Operators
Single-node OpenShift clusters that run DU workloads require the following Operators to be installed:
- Local Storage Operator
- Logging Operator
- PTP Operator
- SR-IOV Network Operator
You also need to configure a custom CatalogSource CR, disable the default OperatorHub configuration, and configure an ImageContentSourcePolicy mirror registry that is accessible from the clusters that you install.
Recommended Storage Operator namespace and Operator group configuration (StorageNS.yaml, StorageOperGroup.yaml)
Recommended Cluster Logging Operator namespace and Operator group configuration (ClusterLogNS.yaml, ClusterLogOperGroup.yaml)
Recommended PTP Operator namespace and Operator group configuration (PtpSubscriptionNS.yaml, PtpSubscriptionOperGroup.yaml)
Recommended SR-IOV Operator namespace and Operator group configuration (SriovSubscriptionNS.yaml, SriovSubscriptionOperGroup.yaml)
Recommended CatalogSource configuration (DefaultCatsrc.yaml)
Recommended ImageContentSourcePolicy configuration (DisconnectedICSP.yaml)
Recommended OperatorHub configuration (OperatorHub.yaml)
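The referenced files are not reproduced here. As one example, a minimal sketch of an OperatorHub configuration that disables the default catalog sources follows; the custom disconnected CatalogSource then provides the required Operators.

apiVersion: config.openshift.io/v1
kind: OperatorHub
metadata:
  name: cluster
spec:
  # Disable the default OperatorHub catalog sources
  disableAllDefaultSources: true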
22.7.7.2. Operator subscriptions
Single-node OpenShift clusters that run DU workloads require the following Subscription CRs. The subscription provides the location to download the following Operators:
- Local Storage Operator
- Logging Operator
- PTP Operator
- SR-IOV Network Operator
- SRIOV-FEC Operator
For each Operator subscription, specify the channel to get the Operator from. The recommended channel is stable.
You can specify Manual or Automatic updates. In Automatic mode, the Operator automatically updates to the latest versions in the channel as they become available in the registry. In Manual mode, new Operator versions are installed only when they are explicitly approved.
Use Manual mode for subscriptions. This allows you to control the timing of Operator updates to fit within scheduled maintenance windows.
Recommended Local Storage Operator subscription (StorageSubscription.yaml)
Recommended SR-IOV Operator subscription (SriovSubscription.yaml)
Recommended PTP Operator subscription (PtpSubscription.yaml)
Recommended Cluster Logging Operator subscription (ClusterLogSubscription.yaml)
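The referenced subscription files are not reproduced here. The following minimal sketch shows the general shape, using the Local Storage Operator as an example; the catalog source name is an assumption and must match your CatalogSource CR.

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: local-storage-operator
  namespace: openshift-local-storage
spec:
  channel: stable
  name: local-storage-operator
  source: default-cat-source          # Assumed custom CatalogSource name
  sourceNamespace: openshift-marketplace
  installPlanApproval: Manual         # Manual approval keeps updates inside maintenance windows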
22.7.7.3. Cluster logging and log forwarding
Single-node OpenShift clusters that run DU workloads require logging and log forwarding for debugging. The following ClusterLogging and ClusterLogForwarder custom resources (CRs) are required.
Recommended cluster logging configuration (ClusterLogging.yaml)
Recommended log forwarding configuration (ClusterLogForwarder.yaml)
Set the spec.outputs.url field to the URL of the Kafka server to which the logs are forwarded.
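The referenced files are not reproduced here. The following is a minimal sketch of a ClusterLogForwarder that sends audit and infrastructure logs to Kafka; the output name, pipeline names, and Kafka URL are placeholders.

apiVersion: logging.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: instance
  namespace: openshift-logging
spec:
  outputs:
    - name: kafka-output
      type: kafka
      url: tcp://<kafka_server_url>:9092/<topic>
  pipelines:
    - name: audit-logs
      inputRefs:
        - audit
      outputRefs:
        - kafka-output
    - name: infrastructure-logs
      inputRefs:
        - infrastructure
      outputRefs:
        - kafka-output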
22.7.7.4. Performance profile
Single-node OpenShift clusters that run DU workloads require a Node Tuning Operator performance profile to use real-time host capabilities and services.
In earlier versions of OpenShift Container Platform, the Performance Addon Operator was used to implement automatic tuning to achieve low latency performance for OpenShift applications. In OpenShift Container Platform 4.11 and later, this functionality is part of the Node Tuning Operator.
The following example PerformanceProfile CR illustrates the required single-node OpenShift cluster configuration.
Recommended performance profile configuration (PerformanceProfile.yaml)
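The PerformanceProfile.yaml reference file is not reproduced here. The following is a minimal sketch of its general shape; the isolated and reserved CPU sets and the huge pages counts are hardware-dependent placeholders.

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: openshift-node-performance-profile
spec:
  cpu:
    isolated: "4-47"      # Placeholder; must match your hardware
    reserved: "0-3"       # Placeholder; must not overlap with isolated
  hugepages:
    defaultHugepagesSize: 1G
    pages:
      - size: 1G
        count: 32         # Placeholder; server and application dependent
  realTimeKernel:
    enabled: true
  nodeSelector:
    node-role.kubernetes.io/master: ""
  numa:
    topologyPolicy: restricted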
| PerformanceProfile CR field | Description |
|---|---|
| spec.cpu.isolated | Set the isolated CPUs. Ensure all of the Hyper-Threading pairs match. Important: The reserved and isolated CPU pools must not overlap and together must span all available cores. CPU cores that are not accounted for cause an undefined behavior in the system. |
| spec.cpu.reserved | Set the reserved CPUs. When workload partitioning is enabled, system processes, kernel threads, and system container threads are restricted to these CPUs. All CPUs that are not isolated should be reserved. |

22.7.7.5. Configuring cluster time synchronization
Run a one-time system time synchronization job for control plane or worker nodes.
Recommended one time time-sync for control plane nodes (99-sync-time-once-master.yaml)
Recommended one time time-sync for worker nodes (99-sync-time-once-worker.yaml)
22.7.7.6. PTP
Single-node OpenShift clusters use Precision Time Protocol (PTP) for network time synchronization. The following example PtpConfig CRs illustrate the required PTP configurations for ordinary clocks, boundary clocks, and grandmaster clocks. The exact configuration you apply will depend on the node hardware and specific use case.
Recommended PTP ordinary clock configuration (PtpConfigSlave.yaml)
Recommended boundary clock configuration (PtpConfigBoundary.yaml)
Recommended PTP Westport Channel e810 grandmaster clock configuration (PtpConfigGmWpc.yaml)
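The referenced PtpConfig files are not reproduced here. The following minimal sketch shows the general shape of an ordinary clock configuration, assuming an example interface ens5f0; boundary and grandmaster clocks follow the same structure with different profiles and options.

apiVersion: ptp.openshift.io/v1
kind: PtpConfig
metadata:
  name: ordinary-clock
  namespace: openshift-ptp
spec:
  profile:
    - name: ordinary-clock
      interface: ens5f0        # Placeholder; must match the PTP-capable NIC
      ptp4lOpts: "-2 -s"
      phc2sysOpts: "-a -r"
  recommend:
    - profile: ordinary-clock
      priority: 4
      match:
        - nodeLabel: node-role.kubernetes.io/master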
The following optional PtpOperatorConfig CR configures PTP events reporting for the node.
Recommended PTP events configuration (PtpOperatorConfigForEvent.yaml)
22.7.7.7. Extended Tuned profile
Single-node OpenShift clusters that run DU workloads require additional performance tuning configurations necessary for high-performance workloads. The following example Tuned CR extends the Tuned profile:
Recommended extended Tuned profile configuration (TunedPerformancePatch.yaml)
22.7.7.8. SR-IOV
Single root I/O virtualization (SR-IOV) is commonly used to enable fronthaul and midhaul networks. The following YAML example configures SR-IOV for a single-node OpenShift cluster.
The configuration of the SriovNetwork CR will vary depending on your specific network and infrastructure requirements.
Recommended SriovOperatorConfig CR configuration (SriovOperatorConfig.yaml)
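The SriovOperatorConfig.yaml reference is not reproduced here. A minimal sketch follows; disabling the drain, the injector, and the Operator webhook reflects the single-node DU use case, and the exact values are assumptions to validate against your requirements.

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: openshift-sriov-network-operator
spec:
  configDaemonNodeSelector:
    node-role.kubernetes.io/master: ""
  # On single-node clusters there is nowhere to drain workloads to
  disableDrain: true
  enableInjector: false
  enableOperatorWebhook: false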
Recommended SriovNetwork configuration (SriovNetwork.yaml)
Recommended SriovNetworkNodePolicy CR configuration (SriovNetworkNodePolicy.yaml)
| SriovNetworkNodePolicy CR field | Description |
|---|---|
| | Specifies the interface connected to the fronthaul network. |
| | Specifies the number of VFs for the fronthaul network. |
| | The exact name of the physical function must match the hardware. |
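The SriovNetworkNodePolicy.yaml reference is not reproduced here. The following is a minimal sketch for a fronthaul network, assuming an example interface ens7f0 and resource name du_fh; adjust the device type, interface, and VF count to match your hardware.

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: sriov-nnp-du-fh
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice       # Use vfio-pci for DPDK-based workloads
  isRdma: false
  nicSelector:
    pfNames:
      - ens7f0                # Physical function name must exactly match the hardware
  nodeSelector:
    node-role.kubernetes.io/master: ""
  numVfs: 8                   # Number of VFs created for the fronthaul network
  priority: 10
  resourceName: du_fh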
Recommended SR-IOV kernel configurations (07-sriov-related-kernel-args-master.yaml)
22.7.7.9. Console Operator
Use the cluster capabilities feature to prevent the Console Operator from being installed. When the node is centrally managed, it is not needed. Removing the Operator provides additional space and capacity for application workloads.
To disable the Console Operator during the installation of the managed cluster, set the following in the spec.clusters.0.installConfigOverrides field of the SiteConfig custom resource (CR):
installConfigOverrides: "{\"capabilities\":{\"baselineCapabilitySet\": \"None\" }}"

22.7.7.10. Alertmanager
Single-node OpenShift clusters that run DU workloads require reduced CPU resources consumed by the OpenShift Container Platform monitoring components. The following ConfigMap custom resource (CR) disables Alertmanager.
Recommended cluster monitoring configuration (ReduceMonitoringFootprint.yaml)
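The ReduceMonitoringFootprint.yaml file is not reproduced here. The following is a minimal sketch of a cluster monitoring ConfigMap that disables Alertmanager and Grafana and reduces the Prometheus retention period; treat the exact keys as assumptions to validate against the reference file.

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    alertmanagerMain:
      enabled: false
    grafana:
      enabled: false
    prometheusK8s:
      retention: 24h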
22.7.7.11. Operator Lifecycle Manager
Single-node OpenShift clusters that run distributed unit workloads require consistent access to CPU resources. Operator Lifecycle Manager (OLM) collects performance data from Operators at regular intervals, resulting in an increase in CPU utilization. The following ConfigMap custom resource (CR) disables the collection of Operator performance data by OLM.
Recommended cluster OLM configuration (ReduceOLMFootprint.yaml)
22.7.7.12. LVM Storage
You can dynamically provision local storage on single-node OpenShift clusters with Logical Volume Manager (LVM) Storage.
The recommended storage solution for single-node OpenShift is the Local Storage Operator. Alternatively, you can use LVM Storage, but it requires additional CPU resources to be allocated.
The following YAML example configures the storage of the node to be available to OpenShift Container Platform applications.
Recommended LVMCluster configuration (StorageLVMCluster.yaml)
| LVMCluster CR field | Description |
|---|---|
| | Configure the disks used for LVM storage. If no disks are specified, the LVM Storage uses all the unused disks in the specified thin pool. |
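The StorageLVMCluster.yaml file is not reproduced here. The following is a minimal sketch, assuming a single device class backed by a thin pool; device selection and sizing values are placeholders.

apiVersion: lvm.topolvm.io/v1alpha1
kind: LVMCluster
metadata:
  name: lvmcluster
  namespace: openshift-storage
spec:
  storage:
    deviceClasses:
      - name: vg1
        thinPoolConfig:
          name: thin-pool-1
          sizePercent: 90          # Placeholder sizing
          overprovisionRatio: 10   # Placeholder sizing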
22.7.7.13. Network diagnostics
Single-node OpenShift clusters that run DU workloads require fewer inter-pod network connectivity checks to reduce the additional load created by these pods. The following custom resource (CR) disables these checks.
Recommended network diagnostics configuration (DisableSnoNetworkDiag.yaml)
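The DisableSnoNetworkDiag.yaml file is not reproduced here. A minimal sketch of the Network operator configuration that disables the connectivity checks follows.

apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  # Disables the network diagnostics connectivity check pods
  disableNetworkDiagnostics: true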
22.8. Validating single-node OpenShift cluster tuning for vDU application workloads
Before you can deploy virtual distributed unit (vDU) applications, you need to tune and configure the cluster host firmware and various other cluster configuration settings. Use the following information to validate the cluster configuration to support vDU workloads.
22.8.1. Recommended firmware configuration for vDU cluster hosts
Use the following table as the basis to configure the cluster host firmware for vDU applications running on OpenShift Container Platform 4.14.
The following table is a general recommendation for vDU cluster host firmware configuration. Exact firmware settings will depend on your requirements and specific hardware platform. Automatic setting of firmware is not handled by the zero touch provisioning pipeline.
| Firmware setting | Configuration | Description |
|---|---|---|
| HyperTransport (HT) | Enabled | HyperTransport (HT) bus is a bus technology developed by AMD. HT provides a high-speed link between the components in the host memory and other system peripherals. |
| UEFI | Enabled | Enable booting from UEFI for the vDU host. |
| CPU Power and Performance Policy | Performance | Set CPU Power and Performance Policy to optimize the system for performance over energy efficiency. |
| Uncore Frequency Scaling | Disabled | Disable Uncore Frequency Scaling to prevent the voltage and frequency of non-core parts of the CPU from being set independently. |
| Uncore Frequency | Maximum | Sets the non-core parts of the CPU such as cache and memory controller to their maximum possible frequency of operation. |
| Performance P-limit | Disabled | Disable Performance P-limit to prevent the Uncore frequency coordination of processors. |
| Enhanced Intel® SpeedStep Tech | Enabled | Enable Enhanced Intel SpeedStep to allow the system to dynamically adjust processor voltage and core frequency that decreases power consumption and heat production in the host. |
| Intel® Turbo Boost Technology | Enabled | Enable Turbo Boost Technology for Intel-based CPUs to automatically allow processor cores to run faster than the rated operating frequency if they are operating below power, current, and temperature specification limits. |
| Intel Configurable TDP | Enabled | Enables Thermal Design Power (TDP) for the CPU. |
| Configurable TDP Level | Level 2 | TDP level sets the CPU power consumption required for a particular performance rating. TDP level 2 sets the CPU to the most stable performance level at the cost of power consumption. |
| Energy Efficient Turbo | Disabled | Disable Energy Efficient Turbo to prevent the processor from using an energy-efficiency based policy. |
| Hardware P-States | Enabled or Disabled | Enable OS-controlled P-States to allow power saving configurations. |
| Package C-State | C0/C1 state | Use C0 or C1 states to set the processor to a fully active state (C0) or to stop CPU internal clocks running in software (C1). |
| C1E | Disabled | CPU Enhanced Halt (C1E) is a power saving feature in Intel chips. Disabling C1E prevents the operating system from sending a halt command to the CPU when inactive. |
| Processor C6 | Disabled | C6 power-saving is a CPU feature that automatically disables idle CPU cores and cache. Disabling C6 improves system performance. |
| Sub-NUMA Clustering | Disabled | Sub-NUMA clustering divides the processor cores, cache, and memory into multiple NUMA domains. Disabling this option can increase performance for latency-sensitive workloads. |
Enable global SR-IOV and VT-d settings in the firmware for the host. These settings are relevant to bare-metal environments.
Enable both C-states and OS-controlled P-States to allow per pod power management.
22.8.2. Recommended cluster configurations to run vDU applications
Clusters running virtualized distributed unit (vDU) applications require a highly tuned and optimized configuration. The following information describes the various elements that you require to support vDU workloads in OpenShift Container Platform 4.14 clusters.
22.8.2.1. Recommended cluster MachineConfig CRs for single-node OpenShift clusters
Check that the MachineConfig custom resources (CRs) that you extract from the ztp-site-generate container are applied in the cluster. The CRs can be found in the extracted out/source-crs/extra-manifest/ folder.
The following MachineConfig CRs from the ztp-site-generate container configure the cluster host:
| MachineConfig CR | Description |
|---|---|
| 01-container-mount-ns-and-kubelet-conf-master.yaml | Configures the container mount namespace and kubelet configuration. |
| 03-sctp-machine-config-master.yaml, 03-sctp-machine-config-worker.yaml | Loads the SCTP kernel module. |
| 05-kdump-config-master.yaml, 05-kdump-config-worker.yaml, 06-kdump-master.yaml, 06-kdump-worker.yaml | Configures kdump crash reporting for the cluster. |
| 07-sriov-related-kernel-args-master.yaml | Configures SR-IOV kernel arguments in the cluster. |
| 08-set-rcu-normal-master.yaml | Sets rcu_normal after the node has finished startup, disabling rcu_expedited. |
| 99-crio-disable-wipe-master.yaml, 99-crio-disable-wipe-worker.yaml | Disables the automatic CRI-O cache wipe following cluster reboot. |
| 99-sync-time-once-master.yaml, 99-sync-time-once-worker.yaml | Configures the one-time check and adjustment of the system clock by the Chrony service. |
| | Enables cgroups v1 during cluster installation and when generating RHACM cluster policies. |
In OpenShift Container Platform 4.14 and later, you configure workload partitioning with the cpuPartitioningMode field in the SiteConfig CR.
22.8.2.2. Recommended cluster Operators
The following Operators are required for clusters running virtualized distributed unit (vDU) applications and are a part of the baseline reference configuration:
- Node Tuning Operator (NTO). NTO packages functionality that was previously delivered with the Performance Addon Operator, which is now a part of NTO.
- PTP Operator
- SR-IOV Network Operator
- Red Hat OpenShift Logging Operator
- Local Storage Operator
22.8.2.3. Recommended cluster kernel configuration
Always use the latest supported real-time kernel version in your cluster. Ensure that you apply the following configurations in the cluster:
Ensure that the following additionalKernelArgs are set in the cluster performance profile:

spec:
  additionalKernelArgs:
    - "rcupdate.rcu_normal_after_boot=0"
    - "efi=runtime"
    - "module_blacklist=irdma"

Ensure that the performance-patch profile in the Tuned CR configures the correct CPU isolation set that matches the isolated CPU set in the related PerformanceProfile CR.

22.8.2.4. Checking the realtime kernel version
Always use the latest version of the realtime kernel in your OpenShift Container Platform clusters. If you are unsure about the kernel version that is in use in the cluster, you can compare the current realtime kernel version to the release version with the following procedure.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You are logged in as a user with cluster-admin privileges.
- You have installed podman.
Procedure
Run the following command to get the cluster version:
$ OCP_VERSION=$(oc get clusterversion version -o jsonpath='{.status.desired.version}{"\n"}')

Get the release image SHA number:

$ DTK_IMAGE=$(oc adm release info --image-for=driver-toolkit quay.io/openshift-release-dev/ocp-release:$OCP_VERSION-x86_64)

Run the release image container and extract the kernel version that is packaged with the cluster's current release:

$ podman run --rm $DTK_IMAGE rpm -qa | grep 'kernel-rt-core-' | sed 's#kernel-rt-core-##'

Example output

4.18.0-305.49.1.rt7.121.el8_4.x86_64

This is the default realtime kernel version that ships with the release.

Note
The realtime kernel is denoted by the string .rt in the kernel version.
Verification
Check that the kernel version listed for the cluster's current release matches the actual realtime kernel that is running in the cluster. Run the following commands to check the running realtime kernel version:
Open a remote shell connection to the cluster node:
$ oc debug node/<node_name>

Check the realtime kernel version:

sh-4.4# uname -r

Example output

4.18.0-305.49.1.rt7.121.el8_4.x86_64

22.8.3. Checking that the recommended cluster configurations are applied
You can check that clusters are running the correct configuration. The following procedure describes how to check the various configurations that you require to deploy a DU application in OpenShift Container Platform 4.14 clusters.
Prerequisites
- You have deployed a cluster and tuned it for vDU workloads.
- You have installed the OpenShift CLI (oc).
- You have logged in as a user with cluster-admin privileges.
Procedure
Check that the default OperatorHub sources are disabled. Run the following command:
$ oc get operatorhub cluster -o yaml

Example output

spec:
  disableAllDefaultSources: true

Check that all required CatalogSource resources are annotated for workload partitioning (PreferredDuringScheduling) by running the following command:

$ oc get catalogsource -A -o jsonpath='{range .items[*]}{.metadata.name}{" -- "}{.metadata.annotations.target\.workload\.openshift\.io/management}{"\n"}{end}'

Example output

certified-operators -- {"effect": "PreferredDuringScheduling"}
community-operators -- {"effect": "PreferredDuringScheduling"}
ran-operators
redhat-marketplace -- {"effect": "PreferredDuringScheduling"}
redhat-operators -- {"effect": "PreferredDuringScheduling"}

CatalogSource resources that are not annotated are also returned. In this example, the ran-operators CatalogSource resource is not annotated and does not have the PreferredDuringScheduling annotation.

Note
In a properly configured vDU cluster, only a single annotated catalog source is listed.
Check that all applicable OpenShift Container Platform Operator namespaces are annotated for workload partitioning. This includes all Operators installed with core OpenShift Container Platform and the set of additional Operators included in the reference DU tuning configuration. Run the following command:
$ oc get namespaces -A -o jsonpath='{range .items[*]}{.metadata.name}{" -- "}{.metadata.annotations.workload\.openshift\.io/allowed}{"\n"}{end}'

Example output

default --
openshift-apiserver -- management
openshift-apiserver-operator -- management
openshift-authentication -- management
openshift-authentication-operator -- management

Important
Additional Operators must not be annotated for workload partitioning. In the output from the previous command, additional Operators should be listed without any value on the right side of the -- separator.

Check that the ClusterLogging configuration is correct. Run the following commands:

Validate that the appropriate input and output logs are configured:

$ oc get -n openshift-logging ClusterLogForwarder instance -o yaml

Check that the curation schedule is appropriate for your application:

$ oc get -n openshift-logging clusterloggings.logging.openshift.io instance -o yaml
Check that the web console is disabled (managementState: Removed) by running the following command:

$ oc get consoles.operator.openshift.io cluster -o jsonpath="{ .spec.managementState }"

Example output

Removed

Check that chronyd is disabled on the cluster node by running the following commands:

$ oc debug node/<node_name>

Check the status of chronyd on the node:

sh-4.4# chroot /host

sh-4.4# systemctl status chronyd

Example output

● chronyd.service - NTP client/server
    Loaded: loaded (/usr/lib/systemd/system/chronyd.service; disabled; vendor preset: enabled)
    Active: inactive (dead)
      Docs: man:chronyd(8)
            man:chrony.conf(5)

Check that the PTP interface is successfully synchronized to the primary clock using a remote shell connection to the linuxptp-daemon container and the PTP Management Client (pmc) tool:

Set the $PTP_POD_NAME variable with the name of the linuxptp-daemon pod by running the following command:

$ PTP_POD_NAME=$(oc get pods -n openshift-ptp -l app=linuxptp-daemon -o name)

Run the following command to check the sync status of the PTP device:

$ oc -n openshift-ptp rsh -c linuxptp-daemon-container ${PTP_POD_NAME} pmc -u -f /var/run/ptp4l.0.config -b 0 'GET PORT_DATA_SET'

Run the following pmc command to check the PTP clock status:

$ oc -n openshift-ptp rsh -c linuxptp-daemon-container ${PTP_POD_NAME} pmc -u -f /var/run/ptp4l.0.config -b 0 'GET TIME_STATUS_NP'

Check that the expected master offset value corresponding to the value in /var/run/ptp4l.0.config is found in the linuxptp-daemon-container log:

$ oc logs $PTP_POD_NAME -n openshift-ptp -c linuxptp-daemon-container

Example output

phc2sys[56020.341]: [ptp4l.1.config] CLOCK_REALTIME phc offset -1731092 s2 freq -1546242 delay 497
ptp4l[56020.390]: [ptp4l.1.config] master offset -2 s2 freq -5863 path delay 541
ptp4l[56020.390]: [ptp4l.0.config] master offset -8 s2 freq -10699 path delay 533
Check that the SR-IOV configuration is correct by running the following commands:
Check that the disableDrain value in the SriovOperatorConfig resource is set to true:

$ oc get sriovoperatorconfig -n openshift-sriov-network-operator default -o jsonpath="{.spec.disableDrain}{'\n'}"

Example output

true

Check that the SriovNetworkNodeState sync status is Succeeded by running the following command:

$ oc get SriovNetworkNodeStates -n openshift-sriov-network-operator -o jsonpath="{.items[*].status.syncStatus}{'\n'}"

Example output

Succeeded

Verify that the expected number and configuration of virtual functions (Vfs) under each interface configured for SR-IOV is present and correct in the .status.interfaces field. For example:

$ oc get SriovNetworkNodeStates -n openshift-sriov-network-operator -o yaml
Check that the cluster performance profile is correct. The cpu and hugepages sections will vary depending on your hardware configuration. Run the following command:

$ oc get PerformanceProfile openshift-node-performance-profile -o yaml

Note
CPU settings are dependent on the number of cores available on the server and should align with workload partitioning settings. The hugepages configuration is server and application dependent.

Check that the PerformanceProfile was successfully applied to the cluster by running the following command:

$ oc get performanceprofile openshift-node-performance-profile -o jsonpath="{range .status.conditions[*]}{ @.type }{' -- '}{@.status}{'\n'}{end}"

Example output

Available -- True
Upgradeable -- True
Progressing -- False
Degraded -- False

Check the Tuned performance patch settings by running the following command:

$ oc get tuneds.tuned.openshift.io -n openshift-cluster-node-tuning-operator performance-patch -o yaml

The cpu list in cmdline=nohz_full= in the output will vary based on your hardware configuration.
Check that cluster networking diagnostics are disabled by running the following command:
$ oc get networks.operator.openshift.io cluster -o jsonpath='{.spec.disableNetworkDiagnostics}'
Example output
true
Check that the Kubelet housekeeping interval is tuned to a slower rate. This is set in the containerMountNS machine config. Run the following command:
$ oc describe machineconfig container-mount-namespace-and-kubelet-conf-master | grep OPENSHIFT_MAX_HOUSEKEEPING_INTERVAL_DURATION
Example output
Environment="OPENSHIFT_MAX_HOUSEKEEPING_INTERVAL_DURATION=60s"
Check that Grafana and alertManagerMain are disabled and that the Prometheus retention period is set to 24h by running the following command:
$ oc get configmap cluster-monitoring-config -n openshift-monitoring -o jsonpath="{ .data.config\.yaml }"
Use the following commands to verify that Grafana and alertManagerMain routes are not found in the cluster:
$ oc get route -n openshift-monitoring alertmanager-main
$ oc get route -n openshift-monitoring grafana
Both queries should return Error from server (NotFound) messages.
Check that there is a minimum of 4 CPUs allocated as reserved for each of the PerformanceProfile, Tuned performance-patch, workload partitioning, and kernel command-line arguments by running the following command:
$ oc get performanceprofile -o jsonpath="{ .items[0].spec.cpu.reserved }"
Example output
0-3
Note: Depending on your workload requirements, you might require additional reserved CPUs to be allocated.
22.9. Advanced managed cluster configuration with SiteConfig resources
You can use SiteConfig custom resources (CRs) to deploy custom functionality and configurations in your managed clusters at installation time.
22.9.1. Customizing extra installation manifests in the GitOps ZTP pipeline
You can define a set of extra manifests for inclusion in the installation phase of the GitOps Zero Touch Provisioning (ZTP) pipeline. These manifests are linked to the SiteConfig custom resources (CRs) and are applied to the cluster during installation. Including MachineConfig CRs at install time makes the installation process more efficient.
Prerequisites
- Create a Git repository where you manage your custom site configuration data. The repository must be accessible from the hub cluster and be defined as a source repository for the Argo CD application.
Procedure
- Create a set of extra manifest CRs that the GitOps ZTP pipeline uses to customize the cluster installs.
- In your custom /siteconfig directory, create a subdirectory /custom-manifest for your extra manifests. The following example illustrates a sample /siteconfig with a /custom-manifest folder:
  Note: The subdirectory names /custom-manifest and /extra-manifest used throughout are example names only. There is no requirement to use these names and no restriction on how you name these subdirectories. In this example, /extra-manifest refers to the Git subdirectory that stores the contents of /extra-manifest from the ztp-site-generate container.
- Add your custom extra manifest CRs to the siteconfig/custom-manifest directory.
- In your SiteConfig CR, enter the directory name in the extraManifests.searchPaths field (see the sketch after this procedure).
- Save the SiteConfig, /extra-manifest, and /custom-manifest CRs, and push them to the site configuration repo.
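The reference example is not reproduced here; the following is a minimal sketch of the extraManifests.searchPaths entry in a SiteConfig CR. The cluster name and directory names are illustrative assumptions.
apiVersion: ran.openshift.io/v1
kind: SiteConfig
metadata:
  name: "example-sno"
  namespace: "example-sno"
spec:
  # ... other site configuration fields ...
  clusters:
    - clusterName: "example-sno"
      extraManifests:
        searchPaths:
          - extra-manifest/    # Git directory holding the contents extracted from the ztp-site-generate container
          - custom-manifest/   # directory holding your user-provided extra manifest CRs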
During cluster provisioning, the GitOps ZTP pipeline appends the CRs in the /custom-manifest directory to the default set of extra manifests stored in extra-manifest/.
As of version 4.14, extraManifestPath is subject to a deprecation warning.
While extraManifestPath is still supported, we recommend that you use extraManifests.searchPaths. If you define extraManifests.searchPaths in the SiteConfig file, the GitOps ZTP pipeline does not fetch manifests from the ztp-site-generate container during site installation.
If you define both extraManifestPath and extraManifests.searchPaths in the SiteConfig CR, the setting defined for extraManifests.searchPaths takes precedence.
It is strongly recommended that you extract the contents of /extra-manifest from the ztp-site-generate container and push it to the Git repository.
22.9.2. Filtering custom resources using SiteConfig filters
By using filters, you can easily customize SiteConfig custom resources (CRs) to include or exclude other CRs for use in the installation phase of the GitOps Zero Touch Provisioning (ZTP) pipeline.
You can specify an inclusionDefault value of include or exclude for the SiteConfig CR, along with a list of the specific extraManifest RAN CRs that you want to include or exclude. Setting inclusionDefault to include makes the GitOps ZTP pipeline apply all the files in /source-crs/extra-manifest during installation. Setting inclusionDefault to exclude does the opposite.
You can exclude individual CRs from the /source-crs/extra-manifest folder that are otherwise included by default. The following example configures a custom single-node OpenShift SiteConfig CR to exclude the /source-crs/extra-manifest/03-sctp-machine-config-worker.yaml CR at installation time.
Some additional optional filtering scenarios are also described.
Prerequisites
- You configured the hub cluster for generating the required installation and policy CRs.
- You created a Git repository where you manage your custom site configuration data. The repository must be accessible from the hub cluster and be defined as a source repository for the Argo CD application.
Procedure
To prevent the GitOps ZTP pipeline from applying the 03-sctp-machine-config-worker.yaml CR file, apply the following YAML in the SiteConfig CR (see the sketch after this procedure):
The GitOps ZTP pipeline skips the 03-sctp-machine-config-worker.yaml CR during installation. All other CRs in /source-crs/extra-manifest are applied.
Save the SiteConfig CR and push the changes to the site configuration repository.
The GitOps ZTP pipeline monitors and adjusts what CRs it applies based on the SiteConfig filter instructions.
Optional: To prevent the GitOps ZTP pipeline from applying all the /source-crs/extra-manifest CRs during cluster installation, apply the following YAML in the SiteConfig CR:
- clusterName: "site1-sno-du"
  extraManifests:
    filter:
      inclusionDefault: exclude
Optional: To exclude all the /source-crs/extra-manifest RAN CRs and instead include a custom CR file during installation, edit the custom SiteConfig CR to set the custom manifests folder and the include file.
The following example illustrates the custom folder structure:
siteconfig
├── site1-sno-du.yaml
└── user-custom-manifest
    └── custom-sctp-machine-config-worker.yaml
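A minimal sketch of the exclusion filter described in the first step of this procedure; the cluster entry name follows the site1-sno-du example used above, and only the filter stanza is shown.
clusters:
  - clusterName: "site1-sno-du"
    extraManifests:
      filter:
        exclude:
          - 03-sctp-machine-config-worker.yaml   # skipped at install time; all other /source-crs/extra-manifest CRs still apply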
22.9.3. Deleting a node by using the SiteConfig CR
By using a SiteConfig custom resource (CR), you can delete and reprovision a node. This method is more efficient than manually deleting the node.
Prerequisites
- You have configured the hub cluster to generate the required installation and policy CRs.
- You have created a Git repository in which you can manage your custom site configuration data. The repository must be accessible from the hub cluster and be defined as the source repository for the Argo CD application.
Procedure
Update the SiteConfig CR to include the bmac.agent-install.openshift.io/remove-agent-and-node-on-delete=true annotation and push the changes to the Git repository (see the sketch after this procedure):
Verify that the BareMetalHost object is annotated by running the following command:
$ oc get bmh -n <managed-cluster-namespace> <bmh-object> -ojsonpath='{.metadata}' | jq -r '.annotations["bmac.agent-install.openshift.io/remove-agent-and-node-on-delete"]'
Example output
true
Suppress the generation of the BareMetalHost CR by updating the SiteConfig CR to include the crSuppression.BareMetalHost annotation:
Push the changes to the Git repository and wait for deprovisioning to start. The status of the BareMetalHost CR should change to deprovisioning. Wait for the BareMetalHost to finish deprovisioning, and be fully deleted.
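A minimal sketch of the node entry changes described in this procedure, assuming a worker node named node6.example.com; the crAnnotations and crSuppression field placement is shown as commonly used for BareMetalHost handling and should be verified against your SiteConfig schema.
nodes:
  - hostName: "node6.example.com"
    role: "worker"
    # Step 1: annotate the generated BareMetalHost CR so that deleting it also removes the agent and node
    crAnnotations:
      add:
        BareMetalHost:
          bmac.agent-install.openshift.io/remove-agent-and-node-on-delete: "true"
    # Step 3: suppress regeneration of the BareMetalHost CR so that deprovisioning can complete
    crSuppression:
      - BareMetalHost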
Verification
Verify that the BareMetalHost and Agent CRs for the worker node have been deleted from the hub cluster by running the following commands:
$ oc get bmh -n <cluster-ns>
$ oc get agent -n <cluster-ns>
Verify that the node record has been deleted from the spoke cluster by running the following command:
$ oc get nodes
Note: If you are working with secrets, deleting a secret too early can cause an issue because ArgoCD needs the secret to complete resynchronization after deletion. Delete the secret only after the node cleanup, when the current ArgoCD synchronization is complete.
Next steps
To reprovision a node, delete the changes previously added to the SiteConfig, push the changes to the Git repository, and wait for the synchronization to complete. This regenerates the BareMetalHost CR of the worker node and triggers the re-install of the node.
22.10. Advanced managed cluster configuration with PolicyGenTemplate resources
You can use PolicyGenTemplate CRs to deploy custom functionality in your managed clusters.
22.10.1. Deploying additional changes to clusters
If you require cluster configuration changes outside of the base GitOps Zero Touch Provisioning (ZTP) pipeline configuration, there are three options:
- Apply the additional configuration after the GitOps ZTP pipeline is complete
- When the GitOps ZTP pipeline deployment is complete, the deployed cluster is ready for application workloads. At this point, you can install additional Operators and apply configurations specific to your requirements. Ensure that additional configurations do not negatively affect the performance of the platform or allocated CPU budget.
- Add content to the GitOps ZTP library
- The base source custom resources (CRs) that you deploy with the GitOps ZTP pipeline can be augmented with custom content as required.
- Create extra manifests for the cluster installation
- Extra manifests are applied during installation and make the installation process more efficient.
Providing additional source CRs or modifying existing source CRs can significantly impact the performance or CPU profile of OpenShift Container Platform.
22.10.2. Using PolicyGenTemplate CRs to override source CRs content
PolicyGenTemplate custom resources (CRs) allow you to overlay additional configuration details on top of the base source CRs provided with the GitOps plugin in the ztp-site-generate container. You can think of PolicyGenTemplate CRs as a logical merge or patch to the base CR. Use PolicyGenTemplate CRs to update a single field of the base CR, or overlay the entire contents of the base CR. You can update values and insert fields that are not in the base CR.
The following example procedure describes how to update fields in the generated PerformanceProfile CR for the reference configuration based on the PolicyGenTemplate CR in the group-du-sno-ranGen.yaml file. Use the procedure as a basis for modifying other parts of the PolicyGenTemplate based on your requirements.
Prerequisites
- Create a Git repository where you manage your custom site configuration data. The repository must be accessible from the hub cluster and be defined as a source repository for Argo CD.
Procedure
Review the baseline source CR for existing content. You can review the source CRs listed in the reference PolicyGenTemplate CRs by extracting them from the GitOps Zero Touch Provisioning (ZTP) container.
Create an /out folder:
$ mkdir -p ./out
Extract the source CRs:
$ podman run --log-driver=none --rm registry.redhat.io/openshift4/ztp-site-generate-rhel8:v4.14.1 extract /home/ztp --tar | tar x -C ./out
Review the baseline PerformanceProfile CR in ./out/source-crs/PerformanceProfile.yaml:
Note: Any fields in the source CR which contain $… are removed from the generated CR if they are not provided in the PolicyGenTemplate CR.
Update the PolicyGenTemplate entry for PerformanceProfile in the group-du-sno-ranGen.yaml reference file. The example PolicyGenTemplate CR stanza supplies appropriate CPU specifications, sets the hugepages configuration, and adds a new field that sets globallyDisableIrqLoadBalancing to false (see the sketch after this procedure).
- Commit the PolicyGenTemplate change in Git, and then push to the Git repository being monitored by the GitOps ZTP Argo CD application.
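A minimal sketch of such a PolicyGenTemplate sourceFiles entry; the CPU ranges and hugepages values are illustrative assumptions and must match your hardware.
- fileName: PerformanceProfile.yaml
  policyName: "config-policy"
  metadata:
    name: openshift-node-performance-profile
  spec:
    cpu:
      # illustrative values only; align with your workload partitioning settings
      isolated: "2-19,22-39"
      reserved: "0-1,20-21"
    hugepages:
      defaultHugepagesSize: 1G
      pages:
        - size: 1G
          count: 10
    globallyDisableIrqLoadBalancing: false   # new field inserted into the generated CR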
The GitOps ZTP application generates an RHACM policy that contains the generated PerformanceProfile CR. The contents of that CR are derived by merging the metadata and spec contents from the PerformanceProfile entry in the PolicyGenTemplate onto the source CR.
In the /source-crs folder that you extract from the ztp-site-generate container, the $ syntax is not used for template substitution as implied by the syntax. Rather, if the policyGen tool sees the $ prefix for a string and you do not specify a value for that field in the related PolicyGenTemplate CR, the field is omitted from the output CR entirely.
An exception to this is the $mcp variable in /source-crs YAML files that is substituted with the specified value for mcp from the PolicyGenTemplate CR. For example, in example/policygentemplates/group-du-standard-ranGen.yaml, the value for mcp is worker:
spec:
  bindingRules:
    group-du-standard: ""
  mcp: "worker"
The policyGen tool replaces instances of $mcp with worker in the output CRs.
22.10.3. Adding custom content to the GitOps ZTP pipeline
Perform the following procedure to add new content to the GitOps ZTP pipeline.
Procedure
- Create a subdirectory named source-crs in the directory that contains the kustomization.yaml file for the PolicyGenTemplate custom resource (CR).
- Add your user-provided CRs to the source-crs subdirectory, as shown in the following example:
  Note: The source-crs subdirectory must be in the same directory as the kustomization.yaml file.
- Update the required PolicyGenTemplate CRs to include references to the content you added in the source-crs/custom-crs and source-crs/elasticsearch directories. For example:
- Commit the PolicyGenTemplate change in Git, and then push to the Git repository that is monitored by the GitOps ZTP Argo CD policies application.
- Update the ClusterGroupUpgrade CR to include the changed PolicyGenTemplate and save it as cgu-test.yaml (see the sketch after this procedure).
- Apply the updated ClusterGroupUpgrade CR by running the following command:
$ oc apply -f cgu-test.yaml
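A minimal sketch of a cgu-test.yaml file, assuming the namespace, name, and cluster shown in the verification output below; the managed policy name is illustrative and must match your generated policies.
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: custom-source-cr
  namespace: ztp-clusters
spec:
  managedPolicies:
    - custom-source-cr-config-policy   # illustrative policy name
  clusters:
    - cluster1
  remediationStrategy:
    maxConcurrency: 2
  enable: true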
Verification
Check that the updates have succeeded by running the following command:
$ oc get cgu -A
Example output
NAMESPACE      NAME               AGE   STATE        DETAILS
ztp-clusters   custom-source-cr   6s    InProgress   Remediating non-compliant policies
ztp-install    cluster1           19h   Completed    All clusters are compliant with all the managed policies
22.10.4. Configuring policy compliance evaluation timeouts for PolicyGenTemplate CRs
Use Red Hat Advanced Cluster Management (RHACM) installed on a hub cluster to monitor and report on whether your managed clusters are compliant with applied policies. RHACM uses policy templates to apply predefined policy controllers and policies. Policy controllers are Kubernetes custom resource definition (CRD) instances.
You can override the default policy evaluation intervals with PolicyGenTemplate custom resources (CRs). You configure duration settings that define how long a ConfigurationPolicy CR can be in a state of policy compliance or non-compliance before RHACM re-evaluates the applied cluster policies.
The GitOps Zero Touch Provisioning (ZTP) policy generator generates ConfigurationPolicy CR policies with pre-defined policy evaluation intervals. The default value for the noncompliant state is 10 seconds. The default value for the compliant state is 10 minutes. To disable the evaluation interval, set the value to never.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in to the hub cluster as a user with cluster-admin privileges.
- You have created a Git repository where you manage your custom site configuration data.
Procedure
To configure the evaluation interval for all policies in a PolicyGenTemplate CR, add evaluationInterval to the spec field, and then set the appropriate compliant and noncompliant values. For example:
spec:
  evaluationInterval:
    compliant: 30m
    noncompliant: 20s
To configure the evaluation interval for the spec.sourceFiles object in a PolicyGenTemplate CR, add evaluationInterval to the sourceFiles field (see the sketch after this procedure).
- Commit the PolicyGenTemplate CR files in the Git repository and push your changes.
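A minimal sketch of a per-source-file override, assuming an illustrative source file and policy name; the interval values shown are examples only.
spec:
  sourceFiles:
    - fileName: SriovSubscription.yaml      # illustrative source CR
      policyName: "sriov-sub-policy"
      evaluationInterval:
        compliant: never     # disables re-evaluation while the policy stays compliant
        noncompliant: 10s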
Verification
Check that the managed spoke cluster policies are monitored at the expected intervals.
- Log in as a user with cluster-admin privileges on the managed cluster.
- Get the pods that are running in the open-cluster-management-agent-addon namespace. Run the following command:
$ oc get pods -n open-cluster-management-agent-addon
Example output
NAME                                        READY   STATUS    RESTARTS        AGE
config-policy-controller-858b894c68-v4xdb   1/1     Running   22 (5d8h ago)   10d
- Check that the applied policies are being evaluated at the expected interval in the logs for the config-policy-controller pod:
$ oc logs -n open-cluster-management-agent-addon config-policy-controller-858b894c68-v4xdb
Example output
2022-05-10T15:10:25.280Z       info   configuration-policy-controller controllers/configurationpolicy_controller.go:166      Skipping the policy evaluation due to the policy not reaching the evaluation interval  {"policy": "compute-1-config-policy-config"}
2022-05-10T15:10:25.280Z       info   configuration-policy-controller controllers/configurationpolicy_controller.go:166      Skipping the policy evaluation due to the policy not reaching the evaluation interval  {"policy": "compute-1-common-compute-1-catalog-policy-config"}
22.10.5. Signalling GitOps ZTP cluster deployment completion with validator inform policies
Create a validator inform policy that signals when the GitOps Zero Touch Provisioning (ZTP) installation and configuration of the deployed cluster is complete. This policy can be used for deployments of single-node OpenShift clusters, three-node clusters, and standard clusters.
Procedure
Create a standalone PolicyGenTemplate custom resource (CR) that contains the source file validatorCRs/informDuValidator.yaml. You only need one standalone PolicyGenTemplate CR for each cluster type. For example, this CR applies a validator inform policy for single-node OpenShift clusters (see the sketch after the callouts below):
Example single-node cluster validator inform policy CR (group-du-sno-validator-ranGen.yaml)
1. The name of the PolicyGenTemplates object. This name is also used as part of the names for the placementBinding, placementRule, and policy that are created in the requested namespace.
2. This value should match the namespace used in the group PolicyGenTemplates.
3. The group-du-* label defined in bindingRules must exist in the SiteConfig files.
4. The label defined in bindingExcludedRules must be ztp-done:. The ztp-done label is used in coordination with the Topology Aware Lifecycle Manager.
5. mcp defines the MachineConfigPool object that is used in the source file validatorCRs/informDuValidator.yaml. It should be master for single-node and three-node cluster deployments and worker for standard cluster deployments.
6. Optional. The default value is inform.
7. This value is used as part of the name for the generated RHACM policy. The generated validator policy for the single-node example is group-du-sno-validator-du-policy.
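A minimal sketch of such a validator CR, with comments keyed to the numbered callouts above; the metadata values are illustrative assumptions.
apiVersion: ran.openshift.io/v1
kind: PolicyGenTemplate
metadata:
  name: "group-du-sno-validator"     # (1)
  namespace: "ztp-group"             # (2)
spec:
  bindingRules:
    group-du-sno: ""                 # (3)
  bindingExcludedRules:
    ztp-done: ""                     # (4)
  mcp: "master"                      # (5)
  sourceFiles:
    - fileName: validatorCRs/informDuValidator.yaml
      remediationAction: inform      # (6)
      policyName: "du-policy"        # (7)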
- Commit the PolicyGenTemplate CR file in your Git repository and push the changes.
22.10.6. Configuring power states using PolicyGenTemplates CRs
For low latency and high-performance edge deployments, it is necessary to disable or limit C-states and P-states. With this configuration, the CPU runs at a constant frequency, which is typically the maximum turbo frequency. This ensures that the CPU is always running at its maximum speed, which results in the best latency for workloads. However, this also leads to the highest power consumption, which might not be necessary for all workloads.
Workloads can be classified as critical or non-critical. Critical workloads require disabled C-state and P-state settings for high performance and low latency, while non-critical workloads use C-state and P-state settings for power savings at the expense of some latency and performance. You can configure the following three power states using GitOps Zero Touch Provisioning (ZTP):
- High-performance mode provides ultra low latency at the highest power consumption.
- Performance mode provides low latency at a relatively high power consumption.
- Power saving balances reduced power consumption with increased latency.
The default configuration is for a low latency, performance mode.
PolicyGenTemplate custom resources (CRs) allow you to overlay additional configuration details onto the base source CRs provided with the GitOps plugin in the ztp-site-generate container.
Configure the power states by updating the workloadHints fields in the generated PerformanceProfile CR for the reference configuration, based on the PolicyGenTemplate CR in the group-du-sno-ranGen.yaml.
The following common prerequisites apply to configuring all three power states.
Prerequisites
- You have created a Git repository where you manage your custom site configuration data. The repository must be accessible from the hub cluster and be defined as a source repository for Argo CD.
- You have followed the procedure described in "Preparing the GitOps ZTP site configuration repository".
22.10.6.1. Configuring performance mode using PolicyGenTemplate CRs
Follow this example to set performance mode by updating the workloadHints fields in the generated PerformanceProfile CR for the reference configuration, based on the PolicyGenTemplate CR in the group-du-sno-ranGen.yaml.
Performance mode provides low latency at a relatively high power consumption.
Prerequisites
- You have configured the BIOS with performance related settings by following the guidance in "Configuring host firmware for low latency and high performance".
Procedure
Update the PolicyGenTemplate entry for PerformanceProfile in the group-du-sno-ranGen.yaml reference file in out/argocd/example/policygentemplates as follows to set performance mode (see the sketch after this procedure).
- Commit the PolicyGenTemplate change in Git, and then push to the Git repository being monitored by the GitOps ZTP Argo CD application.
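A minimal sketch of the relevant workloadHints stanza for performance mode; the entry structure follows the PerformanceProfile example shown earlier in this section.
- fileName: PerformanceProfile.yaml
  policyName: "config-policy"
  spec:
    workloadHints:
      realTime: true
      highPowerConsumption: false
      perPodPowerManagement: false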
22.10.6.2. Configuring high-performance mode using PolicyGenTemplate CRs
Follow this example to set high performance mode by updating the workloadHints fields in the generated PerformanceProfile CR for the reference configuration, based on the PolicyGenTemplate CR in the group-du-sno-ranGen.yaml.
High performance mode provides ultra low latency at the highest power consumption.
Prerequisites
- You have configured the BIOS with performance related settings by following the guidance in "Configuring host firmware for low latency and high performance".
Procedure
Update the PolicyGenTemplate entry for PerformanceProfile in the group-du-sno-ranGen.yaml reference file in out/argocd/example/policygentemplates as follows to set high-performance mode (see the sketch after this procedure).
- Commit the PolicyGenTemplate change in Git, and then push to the Git repository being monitored by the GitOps ZTP Argo CD application.
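A minimal sketch of the relevant workloadHints stanza for high-performance mode; only the highPowerConsumption value differs from the performance mode sketch above.
- fileName: PerformanceProfile.yaml
  policyName: "config-policy"
  spec:
    workloadHints:
      realTime: true
      highPowerConsumption: true
      perPodPowerManagement: false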
22.10.6.3. Configuring power saving mode using PolicyGenTemplate CRs
Follow this example to set power saving mode by updating the workloadHints fields in the generated PerformanceProfile CR for the reference configuration, based on the PolicyGenTemplate CR in the group-du-sno-ranGen.yaml.
The power saving mode balances reduced power consumption with increased latency.
Prerequisites
- You enabled C-states and OS-controlled P-states in the BIOS.
Procedure
Update the PolicyGenTemplate entry for PerformanceProfile in the group-du-sno-ranGen.yaml reference file in out/argocd/example/policygentemplates as follows to configure power saving mode (see the sketch after this procedure). It is recommended to configure the CPU governor for the power saving mode through the additional kernel arguments object.
The schedutil governor is recommended; however, other governors that can be used include ondemand and powersave.
- Commit the PolicyGenTemplate change in Git, and then push to the Git repository being monitored by the GitOps ZTP Argo CD application.
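A minimal sketch of the relevant stanza for power saving mode; the kernel argument shown for the CPU governor is an assumption based on the callout above and should be verified against the reference file.
- fileName: PerformanceProfile.yaml
  policyName: "config-policy"
  spec:
    workloadHints:
      realTime: true
      highPowerConsumption: false
      perPodPowerManagement: true
    additionalKernelArgs:
      - "cpufreq.default_governor=schedutil"   # schedutil is recommended; ondemand and powersave can also be used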
Verification
Select a worker node in your deployed cluster from the list of nodes identified by using the following command:
$ oc get nodes
Log in to the node by using the following command:
$ oc debug node/<node-name>
Replace <node-name> with the name of the node you want to verify the power state on.
Set /host as the root directory within the debug shell. The debug pod mounts the host’s root file system in /host within the pod. By changing the root directory to /host, you can run binaries contained in the host’s executable paths as shown in the following example:
# chroot /host
Run the following command to verify the applied power state:
# cat /proc/cmdline
Expected output
- For power saving mode, the output includes intel_pstate=passive.
22.10.6.4. Maximizing power savings
Limiting the maximum CPU frequency is recommended to achieve maximum power savings. Enabling C-states on the non-critical workload CPUs without restricting the maximum CPU frequency negates much of the power savings by boosting the frequency of the critical CPUs.
Maximize power savings by updating the sysfs plugin fields and setting an appropriate value for max_perf_pct in the TunedPerformancePatch CR for the reference configuration. This example, based on the group-du-sno-ranGen.yaml, describes the procedure to restrict the maximum CPU frequency.
Prerequisites
- You have configured power savings mode as described in "Using PolicyGenTemplate CRs to configure power savings mode".
Procedure
Update the PolicyGenTemplate entry for TunedPerformancePatch in the group-du-sno-ranGen.yaml reference file in out/argocd/example/policygentemplates. To maximize power savings, add max_perf_pct (see the sketch after this procedure).
The max_perf_pct setting controls the maximum frequency the cpufreq driver is allowed to set as a percentage of the maximum supported CPU frequency. This value applies to all CPUs. You can check the maximum supported frequency in /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq. As a starting point, you can use a percentage that caps all CPUs at the All Cores Turbo frequency. The All Cores Turbo frequency is the frequency that all cores run at when the cores are all fully occupied.
Note: To maximize power savings, set a lower value. Setting a lower value for max_perf_pct limits the maximum CPU frequency, thereby reducing power consumption, but also potentially impacting performance. Experiment with different values and monitor the system’s performance and power consumption to find the optimal setting for your use case.
- Commit the PolicyGenTemplate change in Git, and then push to the Git repository being monitored by the GitOps ZTP Argo CD application.
22.10.7. Configuring LVM Storage using PolicyGenTemplate CRs
You can configure Logical Volume Manager (LVM) Storage for managed clusters that you deploy with GitOps Zero Touch Provisioning (ZTP).
You use LVM Storage to persist event subscriptions when you use PTP events or bare-metal hardware events with HTTP transport.
Use the Local Storage Operator for persistent storage that uses local volumes in distributed units.
Prerequisites
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
- Create a Git repository where you manage your custom site configuration data.
Procedure
To configure LVM Storage for new managed clusters, add the following YAML to spec.sourceFiles in the common-ranGen.yaml file:
Note: The Storage LVMO subscription is deprecated. In future releases of OpenShift Container Platform, the Storage LVMO subscription will not be available. Instead, you must use the Storage LVMS subscription.
In OpenShift Container Platform 4.14, you can use the Storage LVMS subscription instead of the LVMO subscription. The LVMS subscription does not require manual overrides in the common-ranGen.yaml file. Add the following YAML to spec.sourceFiles in the common-ranGen.yaml file to use the Storage LVMS subscription:
Add the LVMCluster CR to spec.sourceFiles in your specific group or individual site configuration file. For example, in the group-du-sno-ranGen.yaml file, add the following (see the sketch after this procedure):
This example configuration creates a volume group (vg1) with all the available devices, except the disk where OpenShift Container Platform is installed. A thin-pool logical volume is also created.
- Merge any other required changes and files with your custom site repository.
- Commit the PolicyGenTemplate changes in Git, and then push the changes to your site configuration repository to deploy LVM Storage to new sites using GitOps ZTP.
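A minimal sketch of an LVMCluster entry for spec.sourceFiles; the source file name and policy name are assumptions, and the device class values mirror the description above.
- fileName: StorageLVMCluster.yaml
  policyName: "lvms-config-policy"
  spec:
    storage:
      deviceClasses:
        - name: vg1                      # volume group built from all available devices except the installation disk
          thinPoolConfig:
            name: thin-pool-1
            sizePercent: 90
            overprovisionRatio: 10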
22.10.8. Configuring PTP events with PolicyGenTemplate CRs
You can use the GitOps ZTP pipeline to configure PTP events that use HTTP or AMQP transport.
HTTP transport is the default transport for PTP and bare-metal events. Use HTTP transport instead of AMQP for PTP and bare-metal events where possible. AMQ Interconnect is EOL from 30 June 2024. Extended life cycle support (ELS) for AMQ Interconnect ends 29 November 2029. For more information, see Red Hat AMQ Interconnect support status.
22.10.8.1. Configuring PTP events that use HTTP transport
You can configure PTP events that use HTTP transport on managed clusters that you deploy with the GitOps Zero Touch Provisioning (ZTP) pipeline.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in as a user with cluster-admin privileges.
- You have created a Git repository where you manage your custom site configuration data.
Procedure
Apply the following PolicyGenTemplate changes to the group-du-3node-ranGen.yaml, group-du-sno-ranGen.yaml, or group-du-standard-ranGen.yaml files according to your requirements:
In .sourceFiles, add the PtpOperatorConfig CR file that configures the transport host:
Note: In OpenShift Container Platform 4.13 or later, you do not need to set the transportHost field in the PtpOperatorConfig resource when you use HTTP transport with PTP events.
Configure the linuxptp and phc2sys settings for the PTP clock type and interface. For example, add the following stanza into .sourceFiles (see the sketch after this procedure):
1. Can be PtpConfigMaster.yaml or PtpConfigSlave.yaml depending on your requirements. For configurations based on group-du-sno-ranGen.yaml or group-du-3node-ranGen.yaml, use PtpConfigSlave.yaml.
2. Device specific interface name.
3. You must append the --summary_interval -4 value to ptp4lOpts in .spec.sourceFiles.spec.profile to enable PTP fast events.
4. Required phc2sysOpts values. -m prints messages to stdout. The linuxptp-daemon DaemonSet parses the logs and generates Prometheus metrics.
5. Optional. If the ptpClockThreshold stanza is not present, default values are used for the ptpClockThreshold fields. The stanza shows default ptpClockThreshold values. The ptpClockThreshold values configure how long after the PTP master clock is disconnected before PTP events are triggered. holdOverTimeout is the time value in seconds before the PTP clock event state changes to FREERUN when the PTP master clock is disconnected. The maxOffsetThreshold and minOffsetThreshold settings configure offset values in nanoseconds that compare against the values for CLOCK_REALTIME (phc2sys) or master offset (ptp4l). When the ptp4l or phc2sys offset value is outside this range, the PTP clock state is set to FREERUN. When the offset value is within this range, the PTP clock state is set to LOCKED.
- Merge any other required changes and files with your custom site repository.
- Push the changes to your site configuration repository to deploy PTP fast events to new sites using GitOps ZTP.
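A minimal sketch of such a sourceFiles stanza, with comments keyed to the numbered callouts above; the interface name, option strings, and threshold values are illustrative assumptions, and the placement of ptpClockThreshold should be checked against the source CR.
- fileName: PtpConfigSlave.yaml                    # (1)
  policyName: "config-policy"
  spec:
    profile:
      - name: "slave"
        interface: "ens5f1"                        # (2)
        ptp4lOpts: "-2 -s --summary_interval -4"   # (3)
        phc2sysOpts: "-a -r -m"                    # (4)
    ptpClockThreshold:                             # (5)
      holdOverTimeout: 30
      maxOffsetThreshold: 100
      minOffsetThreshold: -100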
22.10.8.2. Configuring PTP events that use AMQP transport
You can configure PTP events that use AMQP transport on managed clusters that you deploy with the GitOps Zero Touch Provisioning (ZTP) pipeline.
HTTP transport is the default transport for PTP and bare-metal events. Use HTTP transport instead of AMQP for PTP and bare-metal events where possible. AMQ Interconnect is EOL from 30 June 2024. Extended life cycle support (ELS) for AMQ Interconnect ends 29 November 2029. For more information, see Red Hat AMQ Interconnect support status.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in as a user with cluster-admin privileges.
- You have created a Git repository where you manage your custom site configuration data.
Procedure
Add the following YAML into .spec.sourceFiles in the common-ranGen.yaml file to configure the AMQP Operator:
Apply the following PolicyGenTemplate changes to the group-du-3node-ranGen.yaml, group-du-sno-ranGen.yaml, or group-du-standard-ranGen.yaml files according to your requirements:
In .sourceFiles, add the PtpOperatorConfig CR file that configures the AMQ transport host to the config-policy:
Configure the linuxptp and phc2sys settings for the PTP clock type and interface. For example, add the following stanza into .sourceFiles:
1. Can be PtpConfigMaster.yaml or PtpConfigSlave.yaml depending on your requirements. For configurations based on group-du-sno-ranGen.yaml or group-du-3node-ranGen.yaml, use PtpConfigSlave.yaml.
2. Device specific interface name.
3. You must append the --summary_interval -4 value to ptp4lOpts in .spec.sourceFiles.spec.profile to enable PTP fast events.
4. Required phc2sysOpts values. -m prints messages to stdout. The linuxptp-daemon DaemonSet parses the logs and generates Prometheus metrics.
5. Optional. If the ptpClockThreshold stanza is not present, default values are used for the ptpClockThreshold fields. The stanza shows default ptpClockThreshold values. The ptpClockThreshold values configure how long after the PTP master clock is disconnected before PTP events are triggered. holdOverTimeout is the time value in seconds before the PTP clock event state changes to FREERUN when the PTP master clock is disconnected. The maxOffsetThreshold and minOffsetThreshold settings configure offset values in nanoseconds that compare against the values for CLOCK_REALTIME (phc2sys) or master offset (ptp4l). When the ptp4l or phc2sys offset value is outside this range, the PTP clock state is set to FREERUN. When the offset value is within this range, the PTP clock state is set to LOCKED.
Apply the following PolicyGenTemplate changes to your specific site YAML files, for example, example-sno-site.yaml:
In .sourceFiles, add the Interconnect CR file that configures the AMQ router to the config-policy:
- fileName: AmqInstance.yaml
  policyName: "config-policy"
- Merge any other required changes and files with your custom site repository.
- Push the changes to your site configuration repository to deploy PTP fast events to new sites using GitOps ZTP.
22.10.9. Configuring bare-metal events with PolicyGenTemplate CRs
You can use the GitOps ZTP pipeline to configure bare-metal events that use HTTP or AMQP transport.
HTTP transport is the default transport for PTP and bare-metal events. Use HTTP transport instead of AMQP for PTP and bare-metal events where possible. AMQ Interconnect is EOL from 30 June 2024. Extended life cycle support (ELS) for AMQ Interconnect ends 29 November 2029. For more information, see Red Hat AMQ Interconnect support status.
22.10.9.1. Configuring bare-metal events that use HTTP transport
You can configure bare-metal events that use HTTP transport on managed clusters that you deploy with the GitOps Zero Touch Provisioning (ZTP) pipeline.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in as a user with cluster-admin privileges.
- You have created a Git repository where you manage your custom site configuration data.
Procedure
Configure the Bare Metal Event Relay Operator by adding the following YAML to spec.sourceFiles in the common-ranGen.yaml file:
Add the HardwareEvent CR to spec.sourceFiles in your specific group configuration file, for example, in the group-du-sno-ranGen.yaml file (see the sketch after this procedure):
Each baseboard management controller (BMC) requires a single HardwareEvent CR only.
Note: In OpenShift Container Platform 4.13 or later, you do not need to set the transportHost field in the HardwareEvent custom resource (CR) when you use HTTP transport with bare-metal events.
- Merge any other required changes and files with your custom site repository.
- Push the changes to your site configuration repository to deploy bare-metal events to new sites with GitOps ZTP.
Create the Redfish Secret by running the following command:
$ oc -n openshift-bare-metal-events create secret generic redfish-basic-auth \
  --from-literal=username=<bmc_username> --from-literal=password=<bmc_password> \
  --from-literal=hostaddr="<bmc_host_ip_addr>"
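A minimal sketch of a HardwareEvent entry for spec.sourceFiles; the policy name and spec values are illustrative, and transportHost is omitted because it is not needed for HTTP transport in OpenShift Container Platform 4.13 or later.
- fileName: HardwareEvent.yaml
  policyName: "config-policy"
  spec:
    nodeSelector: {}       # applies to all nodes by default
    logLevel: "info"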
22.10.9.2. Configuring bare-metal events that use AMQP transport
You can configure bare-metal events that use AMQP transport on managed clusters that you deploy with the GitOps Zero Touch Provisioning (ZTP) pipeline.
HTTP transport is the default transport for PTP and bare-metal events. Use HTTP transport instead of AMQP for PTP and bare-metal events where possible. AMQ Interconnect is EOL from 30 June 2024. Extended life cycle support (ELS) for AMQ Interconnect ends 29 November 2029. For more information, see Red Hat AMQ Interconnect support status.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in as a user with cluster-admin privileges.
- You have created a Git repository where you manage your custom site configuration data.
Procedure
To configure the AMQ Interconnect Operator and the Bare Metal Event Relay Operator, add the following YAML to spec.sourceFiles in the common-ranGen.yaml file:
Add the Interconnect CR to .spec.sourceFiles in the site configuration file, for example, the example-sno-site.yaml file:
- fileName: AmqInstance.yaml
  policyName: "config-policy"
Add the HardwareEvent CR to spec.sourceFiles in your specific group configuration file, for example, in the group-du-sno-ranGen.yaml file:
The transportHost URL is composed of the existing AMQ Interconnect CR name and namespace. For example, in transportHost: "amqp://amq-router.amq-router.svc.cluster.local", the AMQ Interconnect name and namespace are both set to amq-router.
Note: Each baseboard management controller (BMC) requires a single HardwareEvent resource only.
- Commit the PolicyGenTemplate change in Git, and then push the changes to your site configuration repository to deploy bare-metal events monitoring to new sites using GitOps ZTP.
- Create the Redfish Secret by running the following command:
$ oc -n openshift-bare-metal-events create secret generic redfish-basic-auth \
  --from-literal=username=<bmc_username> --from-literal=password=<bmc_password> \
  --from-literal=hostaddr="<bmc_host_ip_addr>"
22.10.10. Configuring the Image Registry Operator for local caching of images
OpenShift Container Platform manages image caching using a local registry. In edge computing use cases, clusters are often subject to bandwidth restrictions when communicating with centralized image registries, which might result in long image download times.
Long download times are unavoidable during initial deployment. Over time, there is a risk that CRI-O will erase the /var/lib/containers/storage directory in the case of an unexpected shutdown. To address long image download times, you can create a local image registry on remote managed clusters using GitOps Zero Touch Provisioning (ZTP). This is useful in Edge computing scenarios where clusters are deployed at the far edge of the network.
Before you can set up the local image registry with GitOps ZTP, you need to configure disk partitioning in the SiteConfig CR that you use to install the remote managed cluster. After installation, you configure the local image registry using a PolicyGenTemplate CR. Then, the GitOps ZTP pipeline creates Persistent Volume (PV) and Persistent Volume Claim (PVC) CRs and patches the imageregistry configuration.
The local image registry can only be used for user application images and cannot be used for the OpenShift Container Platform or Operator Lifecycle Manager operator images.
22.10.10.1. Configuring disk partitioning with SiteConfig
Configure disk partitioning for a managed cluster using a SiteConfig CR and GitOps Zero Touch Provisioning (ZTP). The disk partition details in the SiteConfig CR must match the underlying disk.
You must complete this procedure at installation time.
Prerequisites
- Install Butane.
Procedure
Create the storage.bu file by using the following example YAML file (see the reconstructed sketch after this procedure):
Convert the storage.bu to an Ignition file by running the following command:
$ butane storage.bu
Example output
{"ignition":{"version":"3.2.0"},"storage":{"disks":[{"device":"/dev/disk/by-path/pci-0000:01:00.0-scsi-0:2:0:0","partitions":[{"label":"var-lib-containers","sizeMiB":0,"startMiB":250000}],"wipeTable":false}],"filesystems":[{"device":"/dev/disk/by-partlabel/var-lib-containers","format":"xfs","mountOptions":["defaults","prjquota"],"path":"/var/lib/containers","wipeFilesystem":true}]},"systemd":{"units":[{"contents":"# # Generated by Butane\n[Unit]\nRequires=systemd-fsck@dev-disk-by\\x2dpartlabel-var\\x2dlib\\x2dcontainers.service\nAfter=systemd-fsck@dev-disk-by\\x2dpartlabel-var\\x2dlib\\x2dcontainers.service\n\n[Mount]\nWhere=/var/lib/containers\nWhat=/dev/disk/by-partlabel/var-lib-containers\nType=xfs\nOptions=defaults,prjquota\n\n[Install]\nRequiredBy=local-fs.target","enabled":true,"name":"var-lib-containers.mount"}]}}
- Use a tool such as JSON Pretty Print to convert the output into JSON format.
- Copy the output into the .spec.clusters.nodes.ignitionConfigOverride field in the SiteConfig CR.
Note: If the .spec.clusters.nodes.ignitionConfigOverride field does not exist, create it.
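The example file itself is not reproduced above; the following storage.bu sketch is reconstructed from the Ignition output shown in this procedure and assumes the same disk path and 250000 MiB partition offset.
variant: fcos
version: 1.3.0
storage:
  disks:
    - device: /dev/disk/by-path/pci-0000:01:00.0-scsi-0:2:0:0
      wipe_table: false
      partitions:
        - label: var-lib-containers
          start_mib: 250000    # partition starts at 250000 MiB
          size_mib: 0          # use the remaining disk space
  filesystems:
    - path: /var/lib/containers
      device: /dev/disk/by-partlabel/var-lib-containers
      format: xfs
      wipe_filesystem: true
      with_mount_unit: true    # generates the var-lib-containers.mount systemd unit seen in the output
      mount_options:
        - defaults
        - prjquota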
Verification
During or after installation, verify on the hub cluster that the BareMetalHost object shows the annotation by running the following command:
$ oc get bmh -n my-sno-ns my-sno -ojson | jq '.metadata.annotations["bmac.agent-install.openshift.io/ignition-config-overrides"]'
Example output
"{\"ignition\":{\"version\":\"3.2.0\"},\"storage\":{\"disks\":[{\"device\":\"/dev/disk/by-id/wwn-0x6b07b250ebb9d0002a33509f24af1f62\",\"partitions\":[{\"label\":\"var-lib-containers\",\"sizeMiB\":0,\"startMiB\":250000}],\"wipeTable\":false}],\"filesystems\":[{\"device\":\"/dev/disk/by-partlabel/var-lib-containers\",\"format\":\"xfs\",\"mountOptions\":[\"defaults\",\"prjquota\"],\"path\":\"/var/lib/containers\",\"wipeFilesystem\":true}]},\"systemd\":{\"units\":[{\"contents\":\"# Generated by Butane\\n[Unit]\\nRequires=systemd-fsck@dev-disk-by\\\\x2dpartlabel-var\\\\x2dlib\\\\x2dcontainers.service\\nAfter=systemd-fsck@dev-disk-by\\\\x2dpartlabel-var\\\\x2dlib\\\\x2dcontainers.service\\n\\n[Mount]\\nWhere=/var/lib/containers\\nWhat=/dev/disk/by-partlabel/var-lib-containers\\nType=xfs\\nOptions=defaults,prjquota\\n\\n[Install]\\nRequiredBy=local-fs.target\",\"enabled\":true,\"name\":\"var-lib-containers.mount\"}]}}"
After installation, check the single-node OpenShift disk status.
Enter into a debug session on the single-node OpenShift node by running the following command. This step instantiates a debug pod called <node_name>-debug:
$ oc debug node/my-sno-node
Set /host as the root directory within the debug shell by running the following command. The debug pod mounts the host’s root file system in /host within the pod. By changing the root directory to /host, you can run binaries contained in the host’s executable paths:
# chroot /host
List information about all available block devices by running the following command:
# lsblk
Display information about the file system disk space usage by running the following command:
# df -h
22.10.10.2. Configuring the image registry using PolicyGenTemplate CRs
Use PolicyGenTemplate (PGT) CRs to apply the CRs required to configure the image registry and patch the imageregistry configuration.
Prerequisites
- You have configured a disk partition in the managed cluster.
- You have installed the OpenShift CLI (oc).
- You have logged in to the hub cluster as a user with cluster-admin privileges.
- You have created a Git repository where you manage your custom site configuration data for use with GitOps Zero Touch Provisioning (ZTP).
Procedure
Configure the storage class, persistent volume claim, persistent volume, and image registry configuration in the appropriate PolicyGenTemplate CR. For example, to configure an individual site, add the following YAML to the file example-sno-site.yaml (see the sketch after this procedure):
1. Set the appropriate value for ztp-deploy-wave depending on whether you are configuring image registries at the site, common, or group level. ztp-deploy-wave: "100" is suitable for development or testing because it allows you to group the referenced source files together.
2. In ImageRegistryPV.yaml, ensure that the spec.local.path field is set to /var/imageregistry to match the value set for the mount_point field in the SiteConfig CR.
Important: Do not set complianceType: mustonlyhave for the - fileName: ImageRegistryConfig.yaml configuration. This can cause the registry pod deployment to fail.
- Commit the PolicyGenTemplate change in Git, and then push to the Git repository being monitored by the GitOps ZTP ArgoCD application.
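A minimal sketch of the sourceFiles entries implied by the callouts above; the file names ImageRegistryPV.yaml and ImageRegistryConfig.yaml come from this section, while the policy name and any omitted entries (storage class, PVC) are assumptions.
spec:
  sourceFiles:
    # callout 1: set ran.openshift.io/ztp-deploy-wave (for example "100") so the referenced source files are grouped together
    - fileName: ImageRegistryPV.yaml          # callout 2: spec.local.path must be /var/imageregistry to match the SiteConfig mount_point
      policyName: "config-policy"
    - fileName: ImageRegistryConfig.yaml      # do not set complianceType: mustonlyhave for this entry
      policyName: "config-policy"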
Verification
Use the following steps to troubleshoot errors with the local image registry on the managed clusters:
Verify successful login to the registry while logged in to the managed cluster. Run the following commands:
Export the managed cluster name:
$ cluster=<managed_cluster_name>
Get the managed cluster kubeconfig details:
$ oc get secret -n $cluster $cluster-admin-password -o jsonpath='{.data.password}' | base64 -d > kubeadmin-password-$cluster
Download and export the cluster kubeconfig:
$ oc get secret -n $cluster $cluster-admin-kubeconfig -o jsonpath='{.data.kubeconfig}' | base64 -d > kubeconfig-$cluster && export KUBECONFIG=./kubeconfig-$cluster
- Verify access to the image registry from the managed cluster. See "Accessing the registry".
Check that the Config CRD in the imageregistry.operator.openshift.io group instance is not reporting errors. Run the following command while logged in to the managed cluster:
$ oc get image.config.openshift.io cluster -o yaml
Check that the PersistentVolumeClaim on the managed cluster is populated with data. Run the following command while logged in to the managed cluster:
$ oc get pv image-registry-sc
Check that the registry* pod is running and is located under the openshift-image-registry namespace.
$ oc get pods -n openshift-image-registry | grep registry*
Example output
cluster-image-registry-operator-68f5c9c589-42cfg   1/1   Running   0   8d
image-registry-5f8987879-6nx6h                     1/1   Running   0   8d
Check that the disk partition on the managed cluster is correct:
Open a debug shell to the managed cluster:
$ oc debug node/sno-1.example.com
Run lsblk to check the host disk partitions:
A /var/imageregistry mount point in the output indicates that the disk is correctly partitioned.
22.10.11. Using hub templates in PolicyGenTemplate CRs
Topology Aware Lifecycle Manager supports partial Red Hat Advanced Cluster Management (RHACM) hub cluster template functions in configuration policies used with GitOps Zero Touch Provisioning (ZTP).
Hub-side cluster templates allow you to define configuration policies that can be dynamically customized to the target clusters. This reduces the need to create separate policies for many clusters with similar configurations but with different values.
Policy templates are restricted to the same namespace as the namespace where the policy is defined. This means that you must create the objects referenced in the hub template in the same namespace where the policy is created.
The following supported hub template functions are available for use in GitOps ZTP with TALM:
- fromConfigMap returns the value of the provided data key in the named ConfigMap resource.

Note: There is a 1 MiB size limit for ConfigMap CRs. The effective size for ConfigMap CRs is further limited by the last-applied-configuration annotation. To avoid the last-applied-configuration limitation, add the following annotation to the template ConfigMap:

argocd.argoproj.io/sync-options: Replace=true

- base64enc returns the base64-encoded value of the input string
- base64dec returns the decoded value of the base64-encoded input string
- indent returns the input string with added indent spaces
- autoindent returns the input string with added indent spaces based on the spacing used in the parent template
- toInt casts and returns the integer value of the input value
- toBool converts the input string into a boolean value, and returns the boolean
Various open source community functions are also available for use with GitOps ZTP.
22.10.11.1. Example hub templates
The following code examples are valid hub templates. Each of these templates returns values from the ConfigMap CR with the name test-config in the default namespace.
Returns the value with the key common-key:

{{hub fromConfigMap "default" "test-config" "common-key" hub}}

Returns a string by using the concatenated value of the .ManagedClusterName field and the string -name:

{{hub fromConfigMap "default" "test-config" (printf "%s-name" .ManagedClusterName) hub}}

Casts and returns a boolean value from the concatenated value of the .ManagedClusterName field and the string -name:

{{hub fromConfigMap "default" "test-config" (printf "%s-name" .ManagedClusterName) | toBool hub}}

Casts and returns an integer value from the concatenated value of the .ManagedClusterName field and the string -name:

{{hub (printf "%s-name" .ManagedClusterName) | fromConfigMap "default" "test-config" | toInt hub}}
22.10.11.2. Specifying host NICs in site PolicyGenTemplate CRs with hub cluster templates
You can manage host NICs in a single ConfigMap CR and use hub cluster templates to populate the custom NIC values in the generated policies that get applied to the cluster hosts. Using hub cluster templates in site PolicyGenTemplate (PGT) CRs means that you do not need to create multiple single site PGT CRs for each site.
The following example shows you how to use a single ConfigMap CR to manage cluster host NICs and apply them to the cluster as policies by using a single PolicyGenTemplate site CR.
When you use the fromConfigMap function, the printf variable is only available for the template resource data key fields. You cannot use it with name and namespace fields.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in to the hub cluster as a user with cluster-admin privileges.
- You have created a Git repository where you manage your custom site configuration data. The repository must be accessible from the hub cluster and be defined as a source repository for the GitOps ZTP ArgoCD application.
Procedure
Create a ConfigMap resource that describes the NICs for a group of hosts. For example:
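A minimal sketch of such a ConfigMap, assuming the name sriovdata, the namespace ztp-site, and illustrative per-host data keys; adjust these to your environment:

apiVersion: v1
kind: ConfigMap
metadata:
  name: sriovdata                                  # hypothetical name
  namespace: ztp-site                              # must match the namespace of the generated policy
  annotations:
    argocd.argoproj.io/sync-options: Replace=true  # 1
data:
  # Per-host NIC values, keyed so that hub templates can look them up with
  # (printf "%s-<suffix>" .ManagedClusterName). Keys and values are illustrative.
  example-sno-du_fh-pf: "[ens1f0]"
  example-sno-du_fh-numVfs: "8"
  example-sno-du_mh-pf: "[ens3f0]"
  example-sno-du_mh-numVfs: "8"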
1 - The argocd.argoproj.io/sync-options annotation is required only if the ConfigMap is larger than 1 MiB in size.

Note: The ConfigMap must be in the same namespace as the policy that has the hub template substitution.
Commit the ConfigMap CR in Git, and then push to the Git repository being monitored by the Argo CD application.

Create a site PGT CR that uses templates to pull the required data from the ConfigMap object. For example:
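A minimal sketch of a site PolicyGenTemplate CR that pulls NIC values from the ConfigMap with hub templates; the metadata names, the binding rule, the SriovNetworkNodePolicy.yaml source file, and the ztp-site/sriovdata ConfigMap coordinates are illustrative assumptions:

apiVersion: ran.openshift.io/v1
kind: PolicyGenTemplate
metadata:
  name: "site"                 # hypothetical name
  namespace: "ztp-site"        # same namespace as the referenced ConfigMap
spec:
  bindingRules:
    sites: "example-sno"       # illustrative binding rule
  mcp: "master"
  sourceFiles:
    - fileName: SriovNetworkNodePolicy.yaml
      policyName: "config-policy"
      metadata:
        name: "sriov-nnp-du-fh"
      spec:
        deviceType: netdevice
        nicSelector:
          # The hub template resolves the per-cluster PF name from the ConfigMap data key
          pfNames: '{{hub fromConfigMap "ztp-site" "sriovdata" (printf "%s-du_fh-pf" .ManagedClusterName) | autoindent hub}}'
        numVfs: '{{hub fromConfigMap "ztp-site" "sriovdata" (printf "%s-du_fh-numVfs" .ManagedClusterName) | toInt hub}}'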
Commit the site PolicyGenTemplate CR in Git, and then push to the Git repository that is monitored by the Argo CD application.

Note: Subsequent changes to the referenced ConfigMap CR are not automatically synced to the applied policies. You need to manually sync the new ConfigMap changes to update existing PolicyGenTemplate CRs. See "Syncing new ConfigMap changes to existing PolicyGenTemplate CRs".
22.10.11.3. Specifying VLAN IDs in group PolicyGenTemplate CRs with hub cluster templates
You can manage VLAN IDs for managed clusters in a single ConfigMap CR and use hub cluster templates to populate the VLAN IDs in the generated policies that get applied to the clusters.
The following example shows how you can manage VLAN IDs in a single ConfigMap CR and apply them in individual cluster policies by using a single PolicyGenTemplate group CR.
When using the fromConfigMap function, the printf variable is only available for the template resource data key fields. You cannot use it with name and namespace fields.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in to the hub cluster as a user with cluster-admin privileges.
- You have created a Git repository where you manage your custom site configuration data. The repository must be accessible from the hub cluster and be defined as a source repository for the Argo CD application.
Procedure
Create a ConfigMap CR that describes the VLAN IDs for a group of cluster hosts. For example:
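A minimal sketch, assuming a ConfigMap named site-data in the ztp-group namespace with per-cluster keys of the form <cluster_name>-vlan; all names and values are illustrative:

apiVersion: v1
kind: ConfigMap
metadata:
  name: site-data                                  # hypothetical name
  namespace: ztp-group                             # must match the namespace of the generated policy
  annotations:
    argocd.argoproj.io/sync-options: Replace=true  # 1
data:
  # One VLAN ID per managed cluster; keys follow the (printf "%s-vlan" .ManagedClusterName) pattern.
  example-sno-1-vlan: "140"
  example-sno-2-vlan: "150"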
1 - The argocd.argoproj.io/sync-options annotation is required only if the ConfigMap is larger than 1 MiB in size.

Note: The ConfigMap must be in the same namespace as the policy that has the hub template substitution.
Commit the ConfigMap CR in Git, and then push to the Git repository being monitored by the Argo CD application.

Create a group PGT CR that uses a hub template to pull the required VLAN IDs from the ConfigMap object. For example, add the following YAML snippet to the group PGT CR:
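A minimal sketch of such a snippet under spec.sourceFiles in the group PGT CR; the SriovNetwork.yaml source file, the policy and resource names, and the ztp-group/site-data ConfigMap coordinates are illustrative assumptions:

- fileName: SriovNetwork.yaml
  policyName: "config-policy"
  metadata:
    name: "sriov-nw-du-fh"
  spec:
    resourceName: du_fh
    # The hub template resolves the per-cluster VLAN ID from the ConfigMap and casts it to an integer
    vlan: '{{hub fromConfigMap "ztp-group" "site-data" (printf "%s-vlan" .ManagedClusterName) | toInt hub}}'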
Commit the group PolicyGenTemplate CR in Git, and then push to the Git repository being monitored by the Argo CD application.

Note: Subsequent changes to the referenced ConfigMap CR are not automatically synced to the applied policies. You need to manually sync the new ConfigMap changes to update existing PolicyGenTemplate CRs. See "Syncing new ConfigMap changes to existing PolicyGenTemplate CRs".
22.10.11.4. Syncing new ConfigMap changes to existing PolicyGenTemplate CRs
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in to the hub cluster as a user with cluster-admin privileges.
- You have created a PolicyGenTemplate CR that pulls information from a ConfigMap CR using hub cluster templates.
Procedure
Update the contents of your ConfigMap CR, and apply the changes in the hub cluster.

To sync the contents of the updated ConfigMap CR to the deployed policy, do either of the following:

Option 1: Delete the existing policy. ArgoCD uses the PolicyGenTemplate CR to immediately recreate the deleted policy. For example, run the following command:

$ oc delete policy <policy_name> -n <policy_namespace>

Option 2: Apply the special annotation policy.open-cluster-management.io/trigger-update to the policy with a different value every time that you update the ConfigMap. For example:

$ oc annotate policy <policy_name> -n <policy_namespace> policy.open-cluster-management.io/trigger-update="1"

Note: You must apply the updated policy for the changes to take effect. For more information, see Special annotation for reprocessing.

Optional: If it exists, delete the ClusterGroupUpgrade CR that contains the policy. For example:

$ oc delete clustergroupupgrade <cgu_name> -n <cgu_namespace>

Create a new ClusterGroupUpgrade CR that includes the policy to apply with the updated ConfigMap changes. For example, add the following YAML to the file cgr-example.yaml:
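A minimal sketch of such a ClusterGroupUpgrade CR, assuming the ran.openshift.io/v1alpha1 API version; replace the policy and cluster placeholders with your own values:

apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: cgr-example                  # hypothetical name
  namespace: <policy_namespace>
spec:
  managedPolicies:
    - <managed_policy>               # the policy that consumes the updated ConfigMap
  enable: true
  clusters:
    - <managed_cluster_1>
    - <managed_cluster_2>
  remediationStrategy:
    maxConcurrency: 2
    timeout: 240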
Apply the updated policy:

$ oc apply -f cgr-example.yaml
22.11. Updating managed clusters with the Topology Aware Lifecycle Manager
You can use the Topology Aware Lifecycle Manager (TALM) to manage the software lifecycle of multiple clusters. TALM uses Red Hat Advanced Cluster Management (RHACM) policies to perform changes on the target clusters.
22.11.1. About the Topology Aware Lifecycle Manager configuration
The Topology Aware Lifecycle Manager (TALM) manages the deployment of Red Hat Advanced Cluster Management (RHACM) policies for one or more OpenShift Container Platform clusters. Using TALM in a large network of clusters allows the phased rollout of policies to the clusters in limited batches. This helps to minimize possible service disruptions when updating. With TALM, you can control the following actions:
- The timing of the update
- The number of RHACM-managed clusters
- The subset of managed clusters to apply the policies to
- The update order of the clusters
- The set of policies remediated to the cluster
- The order of policies remediated to the cluster
- The assignment of a canary cluster
For single-node OpenShift, the Topology Aware Lifecycle Manager (TALM) offers the following features:
- Create a backup of a deployment before an upgrade
- Pre-caching images for clusters with limited bandwidth
TALM supports the orchestration of the OpenShift Container Platform y-stream and z-stream updates, and day-two operations on y-streams and z-streams.
22.11.2. About managed policies used with Topology Aware Lifecycle Manager
The Topology Aware Lifecycle Manager (TALM) uses RHACM policies for cluster updates.
TALM can be used to manage the rollout of any policy CR where the remediationAction field is set to inform. Supported use cases include the following:
- Manual user creation of policy CRs
- Automatically generated policies from the PolicyGenTemplate custom resource definition (CRD)
For policies that update an Operator subscription with manual approval, TALM provides additional functionality that approves the installation of the updated Operator.
For more information about managed policies, see Policy Overview in the RHACM documentation.
For more information about the PolicyGenTemplate CRD, see the "About the PolicyGenTemplate CRD" section in "Configuring managed clusters with policies and PolicyGenTemplate resources".
22.11.3. Installing the Topology Aware Lifecycle Manager by using the web console
You can use the OpenShift Container Platform web console to install the Topology Aware Lifecycle Manager.
Prerequisites
- Install the latest version of the RHACM Operator.
- Set up a hub cluster with a disconnected registry.
- Log in as a user with cluster-admin privileges.
Procedure
- In the OpenShift Container Platform web console, navigate to Operators → OperatorHub.
- Search for the Topology Aware Lifecycle Manager from the list of available Operators, and then click Install.
- Keep the default selection of Installation mode ["All namespaces on the cluster (default)"] and Installed Namespace ("openshift-operators") to ensure that the Operator is installed properly.
- Click Install.
Verification
To confirm that the installation is successful:
- Navigate to the Operators → Installed Operators page.
- Check that the Operator is installed in the All Namespaces namespace and its status is Succeeded.
If the Operator is not installed successfully:
- Navigate to the Operators → Installed Operators page and inspect the Status column for any errors or failures.
- Navigate to the Workloads → Pods page and check the logs in any containers in the cluster-group-upgrades-controller-manager pod that are reporting issues.
22.11.4. Installing the Topology Aware Lifecycle Manager by using the CLI
You can use the OpenShift CLI (oc) to install the Topology Aware Lifecycle Manager (TALM).
Prerequisites
- Install the OpenShift CLI (oc).
- Install the latest version of the RHACM Operator.
- Set up a hub cluster with a disconnected registry.
- Log in as a user with cluster-admin privileges.
Procedure
Create a Subscription CR:

Define the Subscription CR and save the YAML file, for example, talm-subscription.yaml:
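A minimal sketch of an OLM Subscription for TALM; the metadata name, channel, and catalog source are assumptions and should be adjusted to your catalog, especially in a disconnected environment:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: openshift-topology-aware-lifecycle-manager-subscription  # hypothetical name
  namespace: openshift-operators
spec:
  channel: "stable"                         # assumed channel
  name: topology-aware-lifecycle-manager
  source: redhat-operators                  # use your mirrored catalog source with a disconnected registry
  sourceNamespace: openshift-marketplace
  installPlanApproval: Automatic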
Create the Subscription CR by running the following command:

$ oc create -f talm-subscription.yaml
Verification
Verify that the installation succeeded by inspecting the CSV resource:
$ oc get csv -n openshift-operators

Example output

NAME                                      DISPLAY                            VERSION   REPLACES   PHASE
topology-aware-lifecycle-manager.4.14.x   Topology Aware Lifecycle Manager   4.14.x               Succeeded

Verify that the TALM is up and running:

$ oc get deploy -n openshift-operators

Example output

NAMESPACE             NAME                                        READY   UP-TO-DATE   AVAILABLE   AGE
openshift-operators   cluster-group-upgrades-controller-manager   1/1     1            1           14s
22.11.5. About the ClusterGroupUpgrade CR
The Topology Aware Lifecycle Manager (TALM) builds the remediation plan from the ClusterGroupUpgrade CR for a group of clusters. You can define the following specifications in a ClusterGroupUpgrade CR:
- Clusters in the group
- Blocking ClusterGroupUpgrade CRs
- Applicable list of managed policies
- Number of concurrent updates
- Applicable canary updates
- Actions to perform before and after the update
- Update timing
You can control the start time of an update using the enable field in the ClusterGroupUpgrade CR. For example, if you have a scheduled maintenance window of four hours, you can prepare a ClusterGroupUpgrade CR with the enable field set to false.
You can set the timeout by configuring the spec.remediationStrategy.timeout setting as follows:
spec:
  remediationStrategy:
    maxConcurrency: 1
    timeout: 240
You can use the batchTimeoutAction to determine what happens if an update fails for a cluster. You can specify continue to skip the failing cluster and continue to upgrade other clusters, or abort to stop policy remediation for all clusters. Once the timeout elapses, TALM removes all enforce policies to ensure that no further updates are made to clusters.
To apply the changes, you set the enable field to true.

For more information, see the "Applying update policies to managed clusters" section.
As TALM works through remediation of the policies to the specified clusters, the ClusterGroupUpgrade CR can report true or false statuses for a number of conditions.
After TALM completes a cluster update, the cluster does not update again under the control of the same ClusterGroupUpgrade CR. You must create a new ClusterGroupUpgrade CR in the following cases:
- When you need to update the cluster again
- When the cluster changes to non-compliant with the inform policy after being updated
22.11.5.1. Selecting clusters
TALM builds a remediation plan and selects clusters based on the following fields:
- The clusterLabelSelector field specifies the labels of the clusters that you want to update. This consists of a list of the standard label selectors from k8s.io/apimachinery/pkg/apis/meta/v1. Each selector in the list uses either label value pairs or label expressions. Matches from each selector are added to the final list of clusters along with the matches from the clusterSelector field and the cluster field.
- The clusters field specifies a list of clusters to update.
- The canaries field specifies the clusters for canary updates.
- The maxConcurrency field specifies the number of clusters to update in a batch.
- The actions field specifies beforeEnable actions that TALM takes as it begins the update process, and afterCompletion actions that TALM takes as it completes policy remediation for each cluster.
You can use the clusters, clusterLabelSelector, and clusterSelector fields together to create a combined list of clusters.
The remediation plan starts with the clusters listed in the canaries field. Each canary cluster forms a single-cluster batch.
Sample ClusterGroupUpgrade CR with the enabled field set to false
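A sketch reconstructed from the callout descriptions that follow, showing roughly where each numbered item sits; field names such as actions.afterCompletion.deleteObjects, actions.beforeEnable.addClusterLabels, and clusterLabelSelectors, as well as the condition messages, are assumptions that might differ from the published sample:

apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: cgu-example
  namespace: default
spec:
  actions:
    afterCompletion:                 # 1
      deleteObjects: true
    beforeEnable:                    # 2
      addClusterLabels:
        upgrade-started: ""
  clusters:                          # 3
    - spoke1
    - spoke2
  enable: false                      # 4
  managedPolicies:                   # 5
    - policy1-common-cluster-version-policy
    - policy2-common-nto-sub-policy
  remediationStrategy:               # 6
    canaries:                        # 7
      - spoke1
    maxConcurrency: 1                # 8
    timeout: 240
  clusterLabelSelectors:             # 9
    - matchLabels:
        upgrade: "true"
  batchTimeoutAction: continue       # 10
status:                              # 11
  conditions:
    - type: ClustersSelected         # 12
      status: "True"
      message: All selected clusters are valid
    - type: Validated                # 13
      status: "True"
      message: Completed validation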
1 - Specifies the action that TALM takes when it completes policy remediation for each cluster.
2 - Specifies the action that TALM takes as it begins the update process.
3 - Defines the list of clusters to update.
4 - The enable field is set to false.
5 - Lists the user-defined set of policies to remediate.
6 - Defines the specifics of the cluster updates.
7 - Defines the clusters for canary updates.
8 - Defines the maximum number of concurrent updates in a batch. The number of remediation batches is the number of canary clusters, plus the number of clusters, excluding the canary clusters, divided by the maxConcurrency value. The clusters that are already compliant with all the managed policies are excluded from the remediation plan.
9 - Displays the parameters for selecting clusters.
10 - Controls what happens if a batch times out. Possible values are abort or continue. If unspecified, the default is continue.
11 - Displays information about the status of the updates.
12 - The ClustersSelected condition shows that all selected clusters are valid.
13 - The Validated condition shows that all selected clusters have been validated.
Any failures during the update of a canary cluster stop the update process.
When the remediation plan is successfully created, you can set the enable field to true and TALM starts to update the non-compliant clusters with the specified managed policies.
You can only make changes to the spec fields if the enable field of the ClusterGroupUpgrade CR is set to false.
22.11.5.2. Validating
TALM checks that all specified managed policies are available and correct, and uses the Validated condition to report the status and reasons as follows:
true: Validation is completed.
false: Policies are missing or invalid, or an invalid platform image has been specified.
22.11.5.3. Pre-caching
Clusters might have limited bandwidth to access the container image registry, which can cause a timeout before the updates are completed. On single-node OpenShift clusters, you can use pre-caching to avoid this. The container image pre-caching starts when you create a ClusterGroupUpgrade CR with the preCaching field set to true. TALM compares the available disk space with the estimated OpenShift Container Platform image size to ensure that there is enough space. If a cluster has insufficient space, TALM cancels pre-caching for that cluster and does not remediate policies on it.
TALM uses the PrecacheSpecValid condition to report status information as follows:
true: The pre-caching spec is valid and consistent.
false: The pre-caching spec is incomplete.
TALM uses the PrecachingSucceeded condition to report status information as follows:
true: TALM has concluded the pre-caching process. If pre-caching fails for any cluster, the update fails for that cluster but proceeds for all other clusters. A message informs you if pre-caching has failed for any clusters.
false: Pre-caching is still in progress for one or more clusters or has failed for all clusters.
For more information, see the "Using the container image pre-cache feature" section.
22.11.5.4. Creating a backup
For single-node OpenShift, TALM can create a backup of a deployment before an update. If the update fails, you can recover the previous version and restore a cluster to a working state without requiring a reprovision of applications. To use the backup feature you first create a ClusterGroupUpgrade CR with the backup field set to true. To ensure that the contents of the backup are up to date, the backup is not taken until you set the enable field in the ClusterGroupUpgrade CR to true.
TALM uses the BackupSucceeded condition to report the status and reasons as follows:
true: Backup is completed for all clusters or the backup run has completed but failed for one or more clusters. If backup fails for any cluster, the update fails for that cluster but proceeds for all other clusters.
false: Backup is still in progress for one or more clusters or has failed for all clusters.
For more information, see the "Creating a backup of cluster resources before upgrade" section.
22.11.5.5. Updating clusters
TALM enforces the policies following the remediation plan. Enforcing the policies for subsequent batches starts immediately after all the clusters of the current batch are compliant with all the managed policies. If the batch times out, TALM moves on to the next batch. The timeout value of a batch is the spec.timeout field divided by the number of batches in the remediation plan.
TALM uses the Progressing condition to report the status and reasons as follows:
true: TALM is remediating non-compliant policies.
false: The update is not in progress. Possible reasons for this are:
- All clusters are compliant with all the managed policies.
- The update has timed out as policy remediation took too long.
- Blocking CRs are missing from the system or have not yet completed.
- The ClusterGroupUpgrade CR is not enabled.
- Backup is still in progress.
The managed policies apply in the order that they are listed in the managedPolicies field in the ClusterGroupUpgrade CR. One managed policy is applied to the specified clusters at a time. When a cluster complies with the current policy, the next managed policy is applied to it.
Sample ClusterGroupUpgrade CR in the Progressing state
1 - The Progressing fields show that TALM is in the process of remediating policies.
22.11.5.6. Update status
TALM uses the Succeeded condition to report the status and reasons as follows:
true: All clusters are compliant with the specified managed policies.
false: Policy remediation failed as there were no clusters available for remediation, or because policy remediation took too long for one of the following reasons:
- The current batch contains canary updates and the cluster in the batch does not comply with all the managed policies within the batch timeout.
- Clusters did not comply with the managed policies within the timeout value specified in the remediationStrategy field.
Sample ClusterGroupUpgrade CR in the Succeeded state
2 - In the Progressing fields, the status is false as the update has completed; clusters are compliant with all the managed policies.
3 - The Succeeded fields show that the validations completed successfully.
1 - The status field includes a list of clusters and their respective statuses. The status of a cluster can be complete or timedout.
Sample ClusterGroupUpgrade CR in the timedout state
22.11.5.7. Blocking ClusterGroupUpgrade CRs
You can create multiple ClusterGroupUpgrade CRs and control their order of application.
For example, if you create ClusterGroupUpgrade CR C that blocks the start of ClusterGroupUpgrade CR A, then ClusterGroupUpgrade CR A cannot start until the status of ClusterGroupUpgrade CR C becomes UpgradeComplete.
One ClusterGroupUpgrade CR can have multiple blocking CRs. In this case, all the blocking CRs must complete before the upgrade for the current CR can start.
Prerequisites
- Install the Topology Aware Lifecycle Manager (TALM).
- Provision one or more managed clusters.
- Log in as a user with cluster-admin privileges.
- Create RHACM policies in the hub cluster.
Procedure
Save the content of the ClusterGroupUpgrade CRs in the cgu-a.yaml, cgu-b.yaml, and cgu-c.yaml files.
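A condensed sketch of the three files, showing only the fields relevant to blocking; the API version, policy names, and cluster names are illustrative assumptions, and the blockingCRs entries reflect the relationships described in the callouts below:

# cgu-a.yaml (blocked by cgu-c)
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: cgu-a
  namespace: default
spec:
  blockingCRs:
    - name: cgu-c
      namespace: default
  clusters:
    - spoke1
  enable: false
  managedPolicies:
    - policy1-common-cluster-version-policy
  remediationStrategy:
    maxConcurrency: 1
    timeout: 240
---
# cgu-b.yaml (blocked by cgu-a)
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: cgu-b
  namespace: default
spec:
  blockingCRs:
    - name: cgu-a
      namespace: default
  clusters:
    - spoke2
  enable: false
  managedPolicies:
    - policy2-common-nto-sub-policy
  remediationStrategy:
    maxConcurrency: 1
    timeout: 240
---
# cgu-c.yaml (no blocking CRs)
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: cgu-c
  namespace: default
spec:
  clusters:
    - spoke3
  enable: false
  managedPolicies:
    - policy3-common-ptp-sub-policy
  remediationStrategy:
    maxConcurrency: 1
    timeout: 240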
cgu-a: Defines the blocking CRs. The cgu-a update cannot start until cgu-c is complete.
cgu-b: The cgu-b update cannot start until cgu-a is complete.
cgu-c: The cgu-c update does not have any blocking CRs. TALM starts the cgu-c update when the enable field is set to true.
Create the ClusterGroupUpgrade CRs by running the following command for each relevant CR:

$ oc apply -f <name>.yaml

Start the update process by running the following command for each relevant CR:

$ oc --namespace=default patch clustergroupupgrade.ran.openshift.io/<name> \
  --type merge -p '{"spec":{"enable":true}}'

The following examples show ClusterGroupUpgrade CRs where the enable field is set to true:

Example for cgu-a with blocking CRs

1 - Shows the list of blocking CRs.

Example for cgu-b with blocking CRs

1 - Shows the list of blocking CRs.

Example for cgu-c with blocking CRs

1 - The cgu-c update does not have any blocking CRs.
22.11.6. Update policies on managed clusters
The Topology Aware Lifecycle Manager (TALM) remediates a set of inform policies for the clusters specified in the ClusterGroupUpgrade CR. TALM remediates inform policies by making enforce copies of the managed RHACM policies. Each copied policy has its own corresponding RHACM placement rule and RHACM placement binding.
One by one, TALM adds each cluster from the current batch to the placement rule that corresponds with the applicable managed policy. If a cluster is already compliant with a policy, TALM skips applying that policy on the compliant cluster. TALM then moves on to applying the next policy to the non-compliant cluster. After TALM completes the updates in a batch, all clusters are removed from the placement rules associated with the copied policies. Then, the update of the next batch starts.
If a spoke cluster does not report any compliant state to RHACM, the managed policies on the hub cluster can be missing status information that TALM needs. TALM handles these cases in the following ways:
- If a policy’s status.compliant field is missing, TALM ignores the policy and adds a log entry. Then, TALM continues looking at the policy’s status.status field.
- If a policy’s status.status is missing, TALM produces an error.
- If a cluster’s compliance status is missing in the policy’s status.status field, TALM considers that cluster to be non-compliant with that policy.
The ClusterGroupUpgrade CR’s batchTimeoutAction determines what happens if an upgrade fails for a cluster. You can specify continue to skip the failing cluster and continue to upgrade other clusters, or specify abort to stop the policy remediation for all clusters. Once the timeout elapses, TALM removes all enforce policies to ensure that no further updates are made to clusters.
Example upgrade policy
For more information about RHACM policies, see Policy overview.
22.11.6.1. Configuring Operator subscriptions for managed clusters that you install with TALM
Topology Aware Lifecycle Manager (TALM) can only approve the install plan for an Operator if the Subscription custom resource (CR) of the Operator contains the status.state.AtLatestKnown field.
Procedure
Add the status.state.AtLatestKnown field to the Subscription CR of the Operator:

Example Subscription CR
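A minimal sketch, assuming the cluster-logging Operator that is used in later examples in this section; the channel and catalog source are illustrative:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: cluster-logging
  namespace: openshift-logging
spec:
  channel: "stable"
  name: cluster-logging
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  installPlanApproval: Manual
status:
  state: AtLatestKnown   # 1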
1 - The status.state: AtLatestKnown field is used for the latest Operator version available from the Operator catalog.
Note: When a new version of the Operator is available in the registry, the associated policy becomes non-compliant.

- Apply the changed Subscription policy to your managed clusters with a ClusterGroupUpgrade CR.
22.11.6.2. Applying update policies to managed clusters
You can update your managed clusters by applying your policies.
Prerequisites
- Install the Topology Aware Lifecycle Manager (TALM).
- Provision one or more managed clusters.
- Log in as a user with cluster-admin privileges.
- Create RHACM policies in the hub cluster.
Procedure
Save the contents of the ClusterGroupUpgrade CR in the cgu-1.yaml file.
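A minimal sketch aligned with the callouts that follow; the API version and the policy and cluster names are illustrative assumptions:

apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: cgu-1
  namespace: default
spec:
  managedPolicies:                            # 1
    - policy1-common-cluster-version-policy
    - policy2-common-nto-sub-policy
    - policy3-common-ptp-sub-policy
    - policy4-common-sriov-sub-policy
  enable: false
  clusters:                                   # 2
    - spoke1
    - spoke2
    - spoke5
    - spoke6
  remediationStrategy:
    maxConcurrency: 2                         # 3
    timeout: 240                              # 4
  batchTimeoutAction: continue                # 5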
1 - The name of the policies to apply.
2 - The list of clusters to update.
3 - The maxConcurrency field signifies the number of clusters updated at the same time.
4 - The update timeout in minutes.
5 - Controls what happens if a batch times out. Possible values are abort or continue. If unspecified, the default is continue.
Create the ClusterGroupUpgrade CR by running the following command:

$ oc create -f cgu-1.yaml

Check if the ClusterGroupUpgrade CR was created in the hub cluster by running the following command:

$ oc get cgu --all-namespaces

Example output

NAMESPACE   NAME    AGE    STATE        DETAILS
default     cgu-1   8m55   NotEnabled   Not Enabled

Check the status of the update by running the following command:

$ oc get cgu -n default cgu-1 -ojsonpath='{.status}' | jq

1 - The spec.enable field in the ClusterGroupUpgrade CR is set to false.
Check the status of the policies by running the following command:
$ oc get policies -A

1 - The spec.remediationAction field of policies currently applied on the clusters is set to enforce. The managed policies in inform mode from the ClusterGroupUpgrade CR remain in inform mode during the update.

Change the value of the spec.enable field to true by running the following command:

$ oc --namespace=default patch clustergroupupgrade.ran.openshift.io/cgu-1 \
  --patch '{"spec":{"enable":true}}' --type=merge
Verification
Check the status of the update again by running the following command:
$ oc get cgu -n default cgu-1 -ojsonpath='{.status}' | jq

1 - Reflects the update progress of the current batch. Run this command again to receive updated information about the progress.
If the policies include Operator subscriptions, you can check the installation progress directly on the single-node cluster.
Export the KUBECONFIG file of the single-node cluster you want to check the installation progress for by running the following command:

$ export KUBECONFIG=<cluster_kubeconfig_absolute_path>

Check all the subscriptions present on the single-node cluster and look for the one in the policy you are trying to install through the ClusterGroupUpgrade CR by running the following command:

$ oc get subs -A | grep -i <subscription_name>

Example output for cluster-logging policy

NAMESPACE           NAME              PACKAGE           SOURCE             CHANNEL
openshift-logging   cluster-logging   cluster-logging   redhat-operators   stable
If one of the managed policies includes a ClusterVersion CR, check the status of platform updates in the current batch by running the following command against the spoke cluster:

$ oc get clusterversion

Example output

NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.5    True        True          43s     Working towards 4.14.7: 71 of 735 done (9% complete)

Check the Operator subscription by running the following command:

$ oc get subs -n <operator-namespace> <operator-subscription> -ojsonpath="{.status}"

Check the install plans present on the single-node cluster that is associated with the desired subscription by running the following command:

$ oc get installplan -n <subscription_namespace>

Example output for cluster-logging Operator

NAMESPACE           NAME            CSV                       APPROVAL   APPROVED
openshift-logging   install-6khtw   cluster-logging.5.3.3-4   Manual     true 1

1 - The install plans have their Approval field set to Manual and their Approved field changes from false to true after TALM approves the install plan.
Note: When TALM is remediating a policy containing a subscription, it automatically approves any install plans attached to that subscription. Where multiple install plans are needed to get the Operator to the latest known version, TALM might approve multiple install plans, upgrading through one or more intermediate versions to get to the final version.
Check if the cluster service version for the Operator of the policy that the ClusterGroupUpgrade is installing reached the Succeeded phase by running the following command:

$ oc get csv -n <operator_namespace>

Example output for OpenShift Logging Operator

NAME                    DISPLAY                     VERSION   REPLACES   PHASE
cluster-logging.5.4.2   Red Hat OpenShift Logging   5.4.2                Succeeded
22.11.7. Creating a backup of cluster resources before upgrade
For single-node OpenShift, the Topology Aware Lifecycle Manager (TALM) can create a backup of a deployment before an upgrade. If the upgrade fails, you can recover the previous version and restore a cluster to a working state without requiring a reprovision of applications.
To use the backup feature you first create a ClusterGroupUpgrade CR with the backup field set to true. To ensure that the contents of the backup are up to date, the backup is not taken until you set the enable field in the ClusterGroupUpgrade CR to true.
TALM uses the BackupSucceeded condition to report the status and reasons as follows:
true: Backup is completed for all clusters or the backup run has completed but failed for one or more clusters. If backup fails for any cluster, the update does not proceed for that cluster.
false: Backup is still in progress for one or more clusters or has failed for all clusters. The backup process running in the spoke clusters can have the following statuses:
PreparingToStart: The first reconciliation pass is in progress. TALM deletes any spoke backup namespace and hub view resources that have been created in a failed upgrade attempt.
Starting: The backup prerequisites and backup job are being created.
Active: The backup is in progress.
Succeeded: The backup succeeded.
BackupTimeout: Artifact backup is partially done.
UnrecoverableError: The backup has ended with a non-zero exit code.
If the backup of a cluster fails and enters the BackupTimeout or UnrecoverableError state, the cluster update does not proceed for that cluster. Updates to other clusters are not affected and continue.
22.11.7.1. Creating a ClusterGroupUpgrade CR with backup
You can create a backup of a deployment before an upgrade on single-node OpenShift clusters. If the upgrade fails you can use the upgrade-recovery.sh script generated by Topology Aware Lifecycle Manager (TALM) to return the system to its preupgrade state. The backup consists of the following items:
- Cluster backup: A snapshot of etcd and static pod manifests.
- Content backup: Backups of folders, for example, /etc, /usr/local, /var/lib/kubelet.
- Changed files backup: Any files managed by machine-config that have been changed.
- Deployment: A pinned ostree deployment.
- Images (Optional): Any container images that are in use.
Prerequisites
- Install the Topology Aware Lifecycle Manager (TALM).
- Provision one or more managed clusters.
- Log in as a user with cluster-admin privileges.
- Install Red Hat Advanced Cluster Management (RHACM).
It is highly recommended that you create a recovery partition. The following is an example SiteConfig custom resource (CR) for a recovery partition of 50 GB:
Procedure
Save the contents of the ClusterGroupUpgrade CR with the backup and enable fields set to true in the clustergroupupgrades-group-du.yaml file:
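A minimal sketch; the name and namespace match the commands used later in this procedure, while the cluster and policy names are illustrative assumptions:

apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: du-upgrade-4918
  namespace: ztp-group-du-sno
spec:
  backup: true
  enable: true
  clusters:
    - cnfdb1
    - cnfdb2
  managedPolicies:
    - du-upgrade-platform-upgrade
  remediationStrategy:
    maxConcurrency: 2
    timeout: 240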
To start the update, apply the ClusterGroupUpgrade CR by running the following command:

$ oc apply -f clustergroupupgrades-group-du.yaml
Verification
Check the status of the upgrade in the hub cluster by running the following command:
$ oc get cgu -n ztp-group-du-sno du-upgrade-4918 -o jsonpath='{.status}'
22.11.7.2. Recovering a cluster after a failed upgrade
If an upgrade of a cluster fails, you can manually log in to the cluster and use the backup to return the cluster to its preupgrade state. There are two stages:
- Rollback
- If the attempted upgrade included a change to the platform OS deployment, you must roll back to the previous version before running the recovery script.
A rollback is only applicable to upgrades from TALM and single-node OpenShift. This process does not apply to rollbacks from any other upgrade type.
- Recovery
- The recovery shuts down containers and uses files from the backup partition to relaunch containers and restore clusters.
Prerequisites
- Install the Topology Aware Lifecycle Manager (TALM).
- Provision one or more managed clusters.
- Install Red Hat Advanced Cluster Management (RHACM).
- Log in as a user with cluster-admin privileges.
- Run an upgrade that is configured for backup.
Procedure
Delete the previously created ClusterGroupUpgrade custom resource (CR) by running the following command:

$ oc delete cgu/du-upgrade-4918 -n ztp-group-du-sno

- Log in to the cluster that you want to recover.

Check the status of the platform OS deployment by running the following command:

$ ostree admin status

Example output

[root@lab-test-spoke2-node-0 core]# ostree admin status
* rhcos c038a8f08458bbed83a77ece033ad3c55597e3f64edad66ea12fda18cbdceaf9.0
    Version: 49.84.202202230006-0
    Pinned: yes 1
    origin refspec: c038a8f08458bbed83a77ece033ad3c55597e3f64edad66ea12fda18cbdceaf9

1 - The current deployment is pinned. A platform OS deployment rollback is not necessary.

To trigger a rollback of the platform OS deployment, run the following command:
$ rpm-ostree rollback -r

The first phase of the recovery shuts down containers and restores files from the backup partition to the targeted directories. To begin the recovery, run the following command:

$ /var/recovery/upgrade-recovery.sh

When prompted, reboot the cluster by running the following command:

$ systemctl reboot

After the reboot, restart the recovery by running the following command:

$ /var/recovery/upgrade-recovery.sh --resume
If the recovery utility fails, you can retry with the --restart option:
$ /var/recovery/upgrade-recovery.sh --restart
Verification
To check the status of the recovery, run the following command:

$ oc get clusterversion,nodes,clusteroperator
22.11.8. Using the container image pre-cache feature
Single-node OpenShift clusters might have limited bandwidth to access the container image registry, which can cause a timeout before the updates are completed.
The time of the update is not set by TALM. You can apply the ClusterGroupUpgrade CR at the beginning of the update by manual application or by external automation.
The container image pre-caching starts when the preCaching field is set to true in the ClusterGroupUpgrade CR.
TALM uses the PrecacheSpecValid condition to report status information as follows:
true: The pre-caching spec is valid and consistent.
false: The pre-caching spec is incomplete.
TALM uses the PrecachingSucceeded condition to report status information as follows:
true: TALM has concluded the pre-caching process. If pre-caching fails for any cluster, the update fails for that cluster but proceeds for all other clusters. A message informs you if pre-caching has failed for any clusters.
false: Pre-caching is still in progress for one or more clusters or has failed for all clusters.
After a successful pre-caching process, you can start remediating policies. The remediation actions start when the enable field is set to true. If there is a pre-caching failure on a cluster, the upgrade fails for that cluster. The upgrade process continues for all other clusters that have a successful pre-cache.
The pre-caching process can be in the following statuses:
NotStarted: This is the initial state all clusters are automatically assigned to on the first reconciliation pass of the ClusterGroupUpgrade CR. In this state, TALM deletes any pre-caching namespace and hub view resources of spoke clusters that remain from previous incomplete updates. TALM then creates a new ManagedClusterView resource for the spoke pre-caching namespace to verify its deletion in the PrecachePreparing state.
PreparingToStart: Cleaning up any remaining resources from previous incomplete updates is in progress.
Starting: Pre-caching job prerequisites and the job are created.
Active: The job is in "Active" state.
Succeeded: The pre-cache job succeeded.
PrecacheTimeout: The artifact pre-caching is partially done.
UnrecoverableError: The job ends with a non-zero exit code.
22.11.8.1. Using the container image pre-cache filter
The pre-cache feature typically downloads more images than a cluster needs for an update. You can control which pre-cache images are downloaded to a cluster. This decreases download time, and saves bandwidth and storage.
You can see a list of all images to be downloaded using the following command:
$ oc adm release info <ocp-version>
The following ConfigMap example shows how you can exclude images using the excludePrecachePatterns field.
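A minimal sketch, assuming a ConfigMap named cluster-group-upgrade-overrides in the namespace of the ClusterGroupUpgrade CR with an excludePrecachePatterns data key; verify the expected name against your TALM version:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-group-upgrade-overrides   # assumed name
  namespace: <cgu_namespace>
data:
  excludePrecachePatterns: |              # 1
    azure
    aws
    vsphere
    alibaba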
1 - TALM excludes all images with names that include any of the patterns listed here.
22.11.8.2. Creating a ClusterGroupUpgrade CR with pre-caching
For single-node OpenShift, the pre-cache feature allows the required container images to be present on the spoke cluster before the update starts.
For pre-caching, TALM uses the spec.remediationStrategy.timeout value from the ClusterGroupUpgrade CR. You must set a timeout value that allows sufficient time for the pre-caching job to complete. When you enable the ClusterGroupUpgrade CR after pre-caching has completed, you can change the timeout value to a duration that is appropriate for the update.
Prerequisites
- Install the Topology Aware Lifecycle Manager (TALM).
- Provision one or more managed clusters.
- Log in as a user with cluster-admin privileges.
Procedure
Save the contents of the ClusterGroupUpgrade CR with the preCaching field set to true in the clustergroupupgrades-group-du.yaml file:
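A minimal sketch; the name and namespace match the commands later in this procedure, the cluster and policy names are illustrative, and callout 1 marks the preCaching field:

apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: du-upgrade-4918
  namespace: ztp-group-du-sno
spec:
  preCaching: true                  # 1
  enable: false
  clusters:
    - cnfdb1
    - cnfdb2
  managedPolicies:
    - du-upgrade-platform-upgrade
  remediationStrategy:
    maxConcurrency: 2
    timeout: 240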
1 - The preCaching field is set to true, which enables TALM to pull the container images before starting the update.
When you want to start pre-caching, apply the ClusterGroupUpgrade CR by running the following command:

$ oc apply -f clustergroupupgrades-group-du.yaml
Verification
Check if the ClusterGroupUpgrade CR exists in the hub cluster by running the following command:

$ oc get cgu -A

Example output

NAMESPACE          NAME              AGE   STATE        DETAILS
ztp-group-du-sno   du-upgrade-4918   10s   InProgress   Precaching is required and not done 1

1 - The CR is created.
Check the status of the pre-caching task by running the following command:
$ oc get cgu -n ztp-group-du-sno du-upgrade-4918 -o jsonpath='{.status}'

1 - Displays the list of identified clusters.
Check the status of the pre-caching job by running the following command on the spoke cluster:
$ oc get jobs,pods -n openshift-talo-pre-cache

Example output

NAME                  COMPLETIONS   DURATION   AGE
job.batch/pre-cache   0/1           3m10s      3m10s

NAME                     READY   STATUS    RESTARTS   AGE
pod/pre-cache--1-9bmlr   1/1     Running   0          3m10s

Check the status of the ClusterGroupUpgrade CR by running the following command:

$ oc get cgu -n ztp-group-du-sno du-upgrade-4918 -o jsonpath='{.status}'

1 - The pre-cache tasks are done.
22.11.9. Troubleshooting the Topology Aware Lifecycle Manager
The Topology Aware Lifecycle Manager (TALM) is an OpenShift Container Platform Operator that remediates RHACM policies. When issues occur, use the oc adm must-gather command to gather details and logs and to take steps in debugging the issues.
For more information about related topics, see the following documentation:
- Red Hat Advanced Cluster Management for Kubernetes 2.4 Support Matrix
- Red Hat Advanced Cluster Management Troubleshooting
- The "Troubleshooting Operator issues" section
22.11.9.1. General troubleshooting
You can determine the cause of the problem by reviewing the following questions:
Is the configuration that you are applying supported?
- Are the RHACM and the OpenShift Container Platform versions compatible?
- Are the TALM and RHACM versions compatible?
Which of the following components is causing the problem?
To ensure that the ClusterGroupUpgrade configuration is functional, you can do the following:
- Create the ClusterGroupUpgrade CR with the spec.enable field set to false.
- Wait for the status to be updated and go through the troubleshooting questions.
- If everything looks as expected, set the spec.enable field to true in the ClusterGroupUpgrade CR.
After you set the spec.enable field to true in the ClusterGroupUpgrade CR, the update procedure starts and you cannot edit the CR’s spec fields anymore.
22.11.9.2. Cannot modify the ClusterGroupUpgrade CR
- Issue
- You cannot edit the ClusterGroupUpgrade CR after enabling the update.
- Resolution
Restart the procedure by performing the following steps:
Remove the old ClusterGroupUpgrade CR by running the following command:

$ oc delete cgu -n <ClusterGroupUpgradeCR_namespace> <ClusterGroupUpgradeCR_name>

Check and fix the existing issues with the managed clusters and policies.

- Ensure that all the clusters are managed clusters and available.
- Ensure that all the policies exist and have the spec.remediationAction field set to inform.

Create a new ClusterGroupUpgrade CR with the correct configurations.

$ oc apply -f <ClusterGroupUpgradeCR_YAML>
22.11.9.3. Managed policies
Checking managed policies on the system
- Issue
- You want to check if you have the correct managed policies on the system.
- Resolution
Run the following command:
$ oc get cgu lab-upgrade -ojsonpath='{.spec.managedPolicies}'

Example output

["group-du-sno-validator-du-validator-policy", "policy2-common-nto-sub-policy", "policy3-common-ptp-sub-policy"]
Checking remediationAction mode
- Issue
- You want to check if the remediationAction field is set to inform in the spec of the managed policies.
- Resolution
Run the following command:
$ oc get policies --all-namespaces

Example output

NAMESPACE   NAME                                     REMEDIATION ACTION   COMPLIANCE STATE   AGE
default     policy1-common-cluster-version-policy    inform               NonCompliant       5d21h
default     policy2-common-nto-sub-policy            inform               Compliant          5d21h
default     policy3-common-ptp-sub-policy            inform               NonCompliant       5d21h
default     policy4-common-sriov-sub-policy          inform               NonCompliant       5d21h
Checking policy compliance state
- Issue
- You want to check the compliance state of policies.
- Resolution
Run the following command:
$ oc get policies --all-namespaces

Example output

NAMESPACE   NAME                                     REMEDIATION ACTION   COMPLIANCE STATE   AGE
default     policy1-common-cluster-version-policy    inform               NonCompliant       5d21h
default     policy2-common-nto-sub-policy            inform               Compliant          5d21h
default     policy3-common-ptp-sub-policy            inform               NonCompliant       5d21h
default     policy4-common-sriov-sub-policy          inform               NonCompliant       5d21h
22.11.9.4. Clusters
Checking if managed clusters are present
- Issue
- You want to check if the clusters in the ClusterGroupUpgrade CR are managed clusters.
- Resolution
Run the following command:
$ oc get managedclusters

Example output

NAME            HUB ACCEPTED   MANAGED CLUSTER URLS                  JOINED   AVAILABLE   AGE
local-cluster   true           https://api.hub.example.com:6443      True     Unknown     13d
spoke1          true           https://api.spoke1.example.com:6443   True     True        13d
spoke3          true           https://api.spoke3.example.com:6443   True     True        27h

Alternatively, check the TALM manager logs:
Get the name of the TALM manager by running the following command:
$ oc get pod -n openshift-operators

Example output

NAME                                                          READY   STATUS    RESTARTS   AGE
cluster-group-upgrades-controller-manager-75bcc7484d-8k8xp    2/2     Running   0          45m

Check the TALM manager logs by running the following command:

$ oc logs -n openshift-operators \
  cluster-group-upgrades-controller-manager-75bcc7484d-8k8xp -c manager

Example output

ERROR controller-runtime.manager.controller.clustergroupupgrade Reconciler error {"reconciler group": "ran.openshift.io", "reconciler kind": "ClusterGroupUpgrade", "name": "lab-upgrade", "namespace": "default", "error": "Cluster spoke5555 is not a ManagedCluster"} 1
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem

1 - The error message shows that the cluster is not a managed cluster.
Checking if managed clusters are available
- Issue
- You want to check if the managed clusters specified in the ClusterGroupUpgrade CR are available.
- Resolution
Run the following command:

$ oc get managedclusters

Example output

NAME            HUB ACCEPTED   MANAGED CLUSTER URLS                  JOINED   AVAILABLE   AGE
local-cluster   true           https://api.hub.testlab.com:6443      True     Unknown     13d
spoke1          true           https://api.spoke1.testlab.com:6443   True     True        13d
spoke3          true           https://api.spoke3.testlab.com:6443   True     True        27h
Checking clusterLabelSelector
- Issue
- You want to check if the clusterLabelSelector field specified in the ClusterGroupUpgrade CR matches at least one of the managed clusters.
- Resolution
Run the following command:

$ oc get managedcluster --selector=upgrade=true 1

- 1
- The label for the clusters you want to update is upgrade:true.

Example output

NAME     HUB ACCEPTED   MANAGED CLUSTER URLS                  JOINED   AVAILABLE   AGE
spoke1   true           https://api.spoke1.testlab.com:6443   True     True        13d
spoke3   true           https://api.spoke3.testlab.com:6443   True     True        27h
Checking if canary clusters are present
- Issue
- You want to check if the canary clusters are present in the list of clusters.
Example ClusterGroupUpgrade CR
- Resolution
Run the following commands:

$ oc get cgu lab-upgrade -ojsonpath='{.spec.clusters}'

Example output

["spoke1", "spoke3"]

Check if the canary clusters are present in the list of clusters that match clusterLabelSelector labels by running the following command:

$ oc get managedcluster --selector=upgrade=true

Example output

NAME     HUB ACCEPTED   MANAGED CLUSTER URLS                  JOINED   AVAILABLE   AGE
spoke1   true           https://api.spoke1.testlab.com:6443   True     True        13d
spoke3   true           https://api.spoke3.testlab.com:6443   True     True        27h

A cluster can be present in spec.clusters and also be matched by the spec.clusterLabelSelector label.
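For orientation only, the following hedged sketch shows where the fields referenced by these checks sit in a ClusterGroupUpgrade CR; the cluster names, canary choice, and policy list are placeholders rather than values taken from a real deployment.

# Hedged sketch: illustrates spec.clusters, the canaries, and the remediation
# strategy checked in this section. All values are placeholders.
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: lab-upgrade
  namespace: default
spec:
  managedPolicies:
  - policy1-common-cluster-version-policy
  clusters:                # queried with '{.spec.clusters}'
  - spoke1
  - spoke3
  remediationStrategy:
    canaries:              # canary clusters are remediated first
    - spoke1
    maxConcurrency: 2      # queried with '{.spec.remediationStrategy.maxConcurrency}'
    timeout: 240
  enable: false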
Checking the pre-caching status on spoke clusters
Check the status of pre-caching by running the following command on the spoke cluster:

$ oc get jobs,pods -n openshift-talo-pre-cache
22.11.9.5. Remediation Strategy
Checking if remediationStrategy is present in the ClusterGroupUpgrade CR
- Issue
- You want to check if the remediationStrategy is present in the ClusterGroupUpgrade CR.
- Resolution
Run the following command:

$ oc get cgu lab-upgrade -ojsonpath='{.spec.remediationStrategy}'

Example output

{"maxConcurrency":2, "timeout":240}

Checking if maxConcurrency is specified in the ClusterGroupUpgrade CR
- Issue
- You want to check if maxConcurrency is specified in the ClusterGroupUpgrade CR.
- Resolution
Run the following command:

$ oc get cgu lab-upgrade -ojsonpath='{.spec.remediationStrategy.maxConcurrency}'

Example output

2
22.11.9.6. Topology Aware Lifecycle Manager
Checking condition message and status in the ClusterGroupUpgrade CR
- Issue
- You want to check the value of the status.conditions field in the ClusterGroupUpgrade CR.
- Resolution
Run the following command:

$ oc get cgu lab-upgrade -ojsonpath='{.status.conditions}'

Example output

{"lastTransitionTime":"2022-02-17T22:25:28Z", "message":"Missing managed policies:[policyList]", "reason":"NotAllManagedPoliciesExist", "status":"False", "type":"Validated"}

Checking corresponding copied policies
- Issue
- You want to check if every policy from status.managedPoliciesForUpgrade has a corresponding policy in status.copiedPolicies.
- Resolution
Run the following command:

$ oc get cgu lab-upgrade -oyaml

Checking if status.remediationPlan was computed
- Issue
- You want to check if status.remediationPlan is computed.
- Resolution
Run the following command:

$ oc get cgu lab-upgrade -ojsonpath='{.status.remediationPlan}'

Example output

[["spoke2", "spoke3"]]
Errors in the TALM manager container
- Issue
- You want to check the logs of the manager container of TALM.
- Resolution
Run the following command:

$ oc logs -n openshift-operators \
  cluster-group-upgrades-controller-manager-75bcc7484d-8k8xp -c manager

Example output

ERROR	controller-runtime.manager.controller.clustergroupupgrade	Reconciler error	{"reconciler group": "ran.openshift.io", "reconciler kind": "ClusterGroupUpgrade", "name": "lab-upgrade", "namespace": "default", "error": "Cluster spoke5555 is not a ManagedCluster"} 1
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem

- 1
- Displays the error.
Clusters are not compliant to some policies after a ClusterGroupUpgrade CR has completed
- Issue
The policy compliance status that TALM uses to decide if remediation is needed has not yet fully updated for all clusters. This may be because:
- The CGU was run too soon after a policy was created or updated.
- The remediation of a policy affects the compliance of subsequent policies in the ClusterGroupUpgrade CR.
- Resolution
- Create and apply a new ClusterGroupUpgrade CR with the same specification.
Auto-created ClusterGroupUpgrade CR in the GitOps ZTP workflow has no managed policies
- Issue
- If there are no policies for the managed cluster when the cluster becomes Ready, a ClusterGroupUpgrade CR with no policies is auto-created. Upon completion of the ClusterGroupUpgrade CR, the managed cluster is labeled as ztp-done. If the PolicyGenTemplate CRs were not pushed to the Git repository within the required time after SiteConfig resources were pushed, this might result in no policies being available for the target cluster when the cluster became Ready.
- Resolution
- Verify that the policies you want to apply are available on the hub cluster, then create a ClusterGroupUpgrade CR with the required policies.
You can either manually create the ClusterGroupUpgrade CR or trigger auto-creation again. To trigger auto-creation of the ClusterGroupUpgrade CR, remove the ztp-done label from the cluster and delete the empty ClusterGroupUpgrade CR that was previously created in the ztp-install namespace.
Pre-caching has failed
- Issue
Pre-caching might fail for one of the following reasons:
- There is not enough free space on the node.
- For a disconnected environment, the pre-cache image has not been properly mirrored.
- There was an issue when creating the pod.
- Resolution
To check if pre-caching has failed due to insufficient space, check the log of the pre-caching pod in the node.
Find the name of the pod using the following command:

$ oc get pods -n openshift-talo-pre-cache

Check the logs to see if the error is related to insufficient space using the following command:

$ oc logs -n openshift-talo-pre-cache <pod name>

If there is no log, check the pod status using the following command:

$ oc describe pod -n openshift-talo-pre-cache <pod name>

If the pod does not exist, check the job status to see why it could not create a pod using the following command:

$ oc describe job -n openshift-talo-pre-cache pre-cache
22.12. Updating managed clusters in a disconnected environment with the Topology Aware Lifecycle Manager
You can use the Topology Aware Lifecycle Manager (TALM) to manage the software lifecycle of OpenShift Container Platform managed clusters. TALM uses Red Hat Advanced Cluster Management (RHACM) policies to perform changes on the target clusters.
22.12.1. Updating clusters in a disconnected environment
You can upgrade managed clusters and Operators for managed clusters that you have deployed using GitOps Zero Touch Provisioning (ZTP) and Topology Aware Lifecycle Manager (TALM).
22.12.1.1. Setting up the environment
TALM can perform both platform and Operator updates.
Before you can use TALM to update your disconnected clusters, you must mirror both the platform image and the Operator images that you want to update to in your mirror registry. Complete the following steps to mirror the images:
For platform updates, you must perform the following steps:
Mirror the desired OpenShift Container Platform image repository. Ensure that the desired platform image is mirrored by following the "Mirroring the OpenShift Container Platform image repository" procedure linked in the Additional resources. Save the contents of the imageContentSources section in the imageContentSources.yaml file:
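As a hedged sketch, an imageContentSources.yaml saved from the mirroring procedure generally takes the following shape; the mirror registry host and repository are placeholders rather than values from a real environment.

# Hedged sketch of imageContentSources.yaml; mirror values are placeholders.
imageContentSources:
- mirrors:
  - mirror.example.com:5000/ocp4/openshift-release-dev
  source: quay.io/openshift-release-dev/ocp-release
- mirrors:
  - mirror.example.com:5000/ocp4/openshift-release-dev
  source: quay.io/openshift-release-dev/ocp-v4.0-art-dev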
Save the image signature of the desired platform image that was mirrored. You must add the image signature to the PolicyGenTemplate CR for platform updates. To get the image signature, perform the following steps:
Specify the desired OpenShift Container Platform tag by running the following command:

$ OCP_RELEASE_NUMBER=<release_version>

Specify the architecture of the cluster by running the following command:

$ ARCHITECTURE=<cluster_architecture> 1

- 1
- Specify the architecture of the cluster, such as x86_64, aarch64, s390x, or ppc64le.

Get the release image digest from Quay by running the following command:

$ DIGEST="$(oc adm release info quay.io/openshift-release-dev/ocp-release:${OCP_RELEASE_NUMBER}-${ARCHITECTURE} | sed -n 's/Pull From: .*@//p')"

Set the digest algorithm by running the following command:

$ DIGEST_ALGO="${DIGEST%%:*}"

Set the digest signature by running the following command:

$ DIGEST_ENCODED="${DIGEST#*:}"

Get the image signature from the mirror.openshift.com website by running the following command:

$ SIGNATURE_BASE64=$(curl -s "https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release/${DIGEST_ALGO}=${DIGEST_ENCODED}/signature-1" | base64 -w0 && echo)

Save the image signature to the checksum-<OCP_RELEASE_NUMBER>.yaml file by running the following commands:

$ cat >checksum-${OCP_RELEASE_NUMBER}.yaml <<EOF
${DIGEST_ALGO}-${DIGEST_ENCODED}: ${SIGNATURE_BASE64}
EOF
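Based on the heredoc above, the resulting checksum-<OCP_RELEASE_NUMBER>.yaml file contains a single key built from the digest algorithm and the encoded digest, mapped to the base64 signature; the values below are placeholders.

# Hedged illustration of the generated file; values are placeholders.
sha256-<encoded_digest>: <signature_base64>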
Prepare the update graph. You have two options to prepare the update graph:
Use the OpenShift Update Service.
For more information about how to set up the graph on the hub cluster, see Deploy the operator for OpenShift Update Service and Build the graph data init container.
Make a local copy of the upstream graph. Host the update graph on an http or https server in the disconnected environment that has access to the managed cluster. To download the update graph, use the following command:

$ curl -s https://api.openshift.com/api/upgrades_info/v1/graph?channel=stable-4.14 -o ~/upgrade-graph_stable-4.14

For Operator updates, you must perform the following task:
- Mirror the Operator catalogs. Ensure that the desired Operator images are mirrored by following the procedure in the "Mirroring Operator catalogs for use with disconnected clusters" section.
22.12.1.2. Performing a platform update
You can perform a platform update with the TALM.
Prerequisites
- Install the Topology Aware Lifecycle Manager (TALM).
- Update GitOps Zero Touch Provisioning (ZTP) to the latest version.
- Provision one or more managed clusters with GitOps ZTP.
- Mirror the desired image repository.
- Log in as a user with cluster-admin privileges.
- Create RHACM policies in the hub cluster.
Procedure
Create a PolicyGenTemplate CR for the platform update:
Save the following contents of the PolicyGenTemplate CR in the du-upgrade.yaml file.
Example of PolicyGenTemplate for platform update

- 1
- The ConfigMap CR contains the signature of the desired release image to update to.
- 2
- Shows the image signature of the desired OpenShift Container Platform release. Get the signature from the checksum-${OCP_RELEASE_NUMBER}.yaml file you saved when following the procedures in the "Setting up the environment" section.
- 3
- Shows the mirror repository that contains the desired OpenShift Container Platform image. Get the mirrors from the imageContentSources.yaml file that you saved when following the procedures in the "Setting up the environment" section.
- 4
- Shows the ClusterVersion CR to trigger the update. The channel, upstream, and desiredVersion fields are all required for image pre-caching.
The PolicyGenTemplate CR generates two policies:
- The du-upgrade-platform-upgrade-prep policy does the preparation work for the platform update. It creates the ConfigMap CR for the desired release image signature, creates the image content source of the mirrored release image repository, and updates the cluster version with the desired update channel and the update graph reachable by the managed cluster in the disconnected environment.
- The du-upgrade-platform-upgrade policy is used to perform the platform upgrade.
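To make the intent of callout 4 concrete, the following is a hedged sketch of the standard ClusterVersion fields that the platform update policy drives; the channel, upstream graph URL, and version are placeholders, and the PolicyGenTemplate wrapping used in the reference configuration is not shown.

# Hedged sketch only: standard ClusterVersion fields driven by the platform
# update policy. Placeholder values, not the reference configuration.
apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
  name: version
spec:
  channel: stable-4.14
  upstream: http://upgrade-graph.example.com/upgrade-graph_stable-4.14
  desiredUpdate:
    version: 4.14.3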
Add the du-upgrade.yaml file contents to the kustomization.yaml file located in the GitOps ZTP Git repository for the PolicyGenTemplate CRs and push the changes to the Git repository.
ArgoCD pulls the changes from the Git repository and generates the policies on the hub cluster.
Check the created policies by running the following command:

$ oc get policies -A | grep platform-upgrade

Create the ClusterGroupUpgrade CR for the platform update with the spec.enable field set to false.
Save the content of the platform update ClusterGroupUpgrade CR with the du-upgrade-platform-upgrade-prep and the du-upgrade-platform-upgrade policies and the target clusters to the cgu-platform-upgrade.yml file, as shown in the following example:
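A hedged sketch of such a cgu-platform-upgrade.yml follows; the cluster names, concurrency, and timeout are placeholders and should be adapted to your environment.

# Hedged sketch of cgu-platform-upgrade.yml; values are placeholders.
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: cgu-platform-upgrade
  namespace: default
spec:
  managedPolicies:
  - du-upgrade-platform-upgrade-prep
  - du-upgrade-platform-upgrade
  preCaching: false
  clusters:
  - spoke1
  remediationStrategy:
    maxConcurrency: 1
    timeout: 240
  enable: false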
Apply the ClusterGroupUpgrade CR to the hub cluster by running the following command:

$ oc apply -f cgu-platform-upgrade.yml

Optional: Pre-cache the images for the platform update.
Enable pre-caching in the ClusterGroupUpgrade CR by running the following command:

$ oc --namespace=default patch clustergroupupgrade.ran.openshift.io/cgu-platform-upgrade \
  --patch '{"spec":{"preCaching": true}}' --type=merge

Monitor the update process and wait for the pre-caching to complete. Check the status of pre-caching by running the following command on the hub cluster:

$ oc get cgu cgu-platform-upgrade -o jsonpath='{.status.precaching.status}'

Start the platform update:
Enable the cgu-platform-upgrade ClusterGroupUpgrade CR and disable pre-caching by running the following command:

$ oc --namespace=default patch clustergroupupgrade.ran.openshift.io/cgu-platform-upgrade \
  --patch '{"spec":{"enable":true, "preCaching": false}}' --type=merge

Monitor the process. Upon completion, ensure that the policy is compliant by running the following command:

$ oc get policies --all-namespaces
22.12.1.3. Performing an Operator update
You can perform an Operator update with the TALM.
Prerequisites
- Install the Topology Aware Lifecycle Manager (TALM).
- Update GitOps Zero Touch Provisioning (ZTP) to the latest version.
- Provision one or more managed clusters with GitOps ZTP.
- Mirror the desired index image, bundle images, and all Operator images referenced in the bundle images.
- Log in as a user with cluster-admin privileges.
- Create RHACM policies in the hub cluster.
Procedure
Update the PolicyGenTemplate CR for the Operator update.
Update the du-upgrade PolicyGenTemplate CR with the following additional contents in the du-upgrade.yaml file:

- 1
- The index image URL contains the desired Operator images. If the index images are always pushed to the same image name and tag, this change is not needed.
- 2
- Set how frequently the Operator Lifecycle Manager (OLM) polls the index image for new Operator versions with the registryPoll.interval field. This change is not needed if a new index image tag is always pushed for y-stream and z-stream Operator updates. The registryPoll.interval field can be set to a shorter interval to expedite the update; however, shorter intervals increase computational load. To counteract this, you can restore registryPoll.interval to the default value once the update is complete.
- 3
- Last observed state of the catalog connection. The READY value ensures that the CatalogSource policy is ready, indicating that the index pod is pulled and is running. This way, TALM upgrades the Operators based on up-to-date policy compliance states.
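For reference, a hedged sketch of a standalone CatalogSource with the fields these callouts describe follows; the catalog source name, index image, and poll interval are placeholders, and the PolicyGenTemplate wrapping used in the reference configuration is omitted.

# Hedged sketch: CatalogSource fields referenced by callouts 1 and 2.
# Name, image, and interval are placeholders.
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: redhat-operators-disconnected
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: registry.example.com:5000/olm/redhat-operator-index:v4.14   # callout 1
  updateStrategy:
    registryPoll:
      interval: 1h                                                   # callout 2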
This update generates one policy, du-upgrade-operator-catsrc-policy, to update the redhat-operators-disconnected catalog source with the new index images that contain the desired Operator images.
Note
If you want to use image pre-caching for Operators and there are Operators from a catalog source other than redhat-operators-disconnected, you must perform the following tasks:
- Prepare a separate catalog source policy with the new index image or registry poll interval update for the different catalog source.
- Prepare a separate subscription policy for the desired Operators that are from the different catalog source.
For example, the desired SRIOV-FEC Operator is available in the certified-operators catalog source. To update the catalog source and the Operator subscription, add the following contents to generate two policies, du-upgrade-fec-catsrc-policy and du-upgrade-subscriptions-fec-policy:

Remove the specified subscription channels in the common PolicyGenTemplate CR, if they exist. The default subscription channels from the GitOps ZTP image are used for the update.
Note
The default channel for the Operators applied through GitOps ZTP 4.14 is stable, except for the performance-addon-operator. As of OpenShift Container Platform 4.11, the performance-addon-operator functionality was moved to the node-tuning-operator. For the 4.10 release, the default channel for PAO is v4.10. You can also specify the default channels in the common PolicyGenTemplate CR.
Push the PolicyGenTemplate CR updates to the GitOps ZTP Git repository.
ArgoCD pulls the changes from the Git repository and generates the policies on the hub cluster.
Check the created policies by running the following command:

$ oc get policies -A | grep -E "catsrc-policy|subscription"
Apply the required catalog source updates before starting the Operator update.
Save the content of the ClusterGroupUpgrade CR named operator-upgrade-prep with the catalog source policies and the target managed clusters to the cgu-operator-upgrade-prep.yml file:

Apply the policy to the hub cluster by running the following command:

$ oc apply -f cgu-operator-upgrade-prep.yml

Monitor the update process. Upon completion, ensure that the policy is compliant by running the following command:

$ oc get policies -A | grep -E "catsrc-policy"

Create the ClusterGroupUpgrade CR for the Operator update with the spec.enable field set to false.
Save the content of the Operator update ClusterGroupUpgrade CR with the du-upgrade-operator-catsrc-policy policy and the subscription policies created from the common PolicyGenTemplate and the target clusters to the cgu-operator-upgrade.yml file, as shown in the following example:

- 1
- The policy is needed by the image pre-caching feature to retrieve the Operator images from the catalog source.
- 2
- The policy contains Operator subscriptions. If you have followed the structure and content of the reference PolicyGenTemplates, all Operator subscriptions are grouped into the common-subscriptions-policy policy.
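A hedged sketch of such a cgu-operator-upgrade.yml follows; the cluster names and the remediation strategy values are placeholders.

# Hedged sketch of cgu-operator-upgrade.yml; placeholder values only.
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: cgu-operator-upgrade
  namespace: default
spec:
  managedPolicies:
  - du-upgrade-operator-catsrc-policy   # callout 1
  - common-subscriptions-policy         # callout 2
  preCaching: false
  clusters:
  - spoke1
  remediationStrategy:
    maxConcurrency: 1
    timeout: 240
  enable: false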
Note
One ClusterGroupUpgrade CR can only pre-cache the images of the desired Operators defined in the subscription policy from one catalog source included in the ClusterGroupUpgrade CR. If the desired Operators are from different catalog sources, such as in the example of the SRIOV-FEC Operator, another ClusterGroupUpgrade CR must be created with the du-upgrade-fec-catsrc-policy and du-upgrade-subscriptions-fec-policy policies for the SRIOV-FEC Operator image pre-caching and update.
Apply the ClusterGroupUpgrade CR to the hub cluster by running the following command:

$ oc apply -f cgu-operator-upgrade.yml

Optional: Pre-cache the images for the Operator update.
Before starting image pre-caching, verify that the subscription policy is NonCompliant at this point by running the following command:

$ oc get policy common-subscriptions-policy -n <policy_namespace>

Example output

NAME                          REMEDIATION ACTION   COMPLIANCE STATE   AGE
common-subscriptions-policy   inform               NonCompliant       27d

Enable pre-caching in the ClusterGroupUpgrade CR by running the following command:

$ oc --namespace=default patch clustergroupupgrade.ran.openshift.io/cgu-operator-upgrade \
  --patch '{"spec":{"preCaching": true}}' --type=merge

Monitor the process and wait for the pre-caching to complete. Check the status of pre-caching by running the following command on the managed cluster:

$ oc get cgu cgu-operator-upgrade -o jsonpath='{.status.precaching.status}'

Check if the pre-caching is completed before starting the update by running the following command:

$ oc get cgu -n default cgu-operator-upgrade -ojsonpath='{.status.conditions}' | jq

Start the Operator update.
Enable the cgu-operator-upgrade ClusterGroupUpgrade CR and disable pre-caching to start the Operator update by running the following command:

$ oc --namespace=default patch clustergroupupgrade.ran.openshift.io/cgu-operator-upgrade \
  --patch '{"spec":{"enable":true, "preCaching": false}}' --type=merge

Monitor the process. Upon completion, ensure that the policy is compliant by running the following command:

$ oc get policies --all-namespaces
22.12.1.3.1. Troubleshooting missed Operator updates due to out-of-date policy compliance states
In some scenarios, Topology Aware Lifecycle Manager (TALM) might miss Operator updates due to an out-of-date policy compliance state.
After a catalog source update, it takes time for the Operator Lifecycle Manager (OLM) to update the subscription status. The status of the subscription policy might continue to show as compliant while TALM decides whether remediation is needed. As a result, the Operator specified in the subscription policy does not get upgraded.
To avoid this scenario, add another catalog source configuration to the PolicyGenTemplate and specify this configuration in the subscription for any Operators that require an update.
Procedure
Add a catalog source configuration in the PolicyGenTemplate resource:

Update the Subscription resource to point to the new configuration for Operators that require an update:

- 1
- Enter the name of the additional catalog source configuration that you defined in the PolicyGenTemplate resource.
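A hedged sketch of the Subscription change that this callout describes follows; the Operator name, namespace, channel, and the additional catalog source name are placeholders.

# Hedged sketch: pointing a Subscription at the additional catalog source.
# All names are placeholders.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: example-operator-subscription
  namespace: example-operator-namespace
spec:
  channel: "stable"
  name: example-operator-package
  source: redhat-operators-disconnected-v2   # name defined in the PolicyGenTemplate (callout 1)
  sourceNamespace: openshift-marketplace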
22.12.1.4. Performing a platform and an Operator update together
You can perform a platform and an Operator update at the same time.
Prerequisites
- Install the Topology Aware Lifecycle Manager (TALM).
- Update GitOps Zero Touch Provisioning (ZTP) to the latest version.
- Provision one or more managed clusters with GitOps ZTP.
- Log in as a user with cluster-admin privileges.
- Create RHACM policies in the hub cluster.
Procedure
- Create the PolicyGenTemplate CR for the updates by following the steps described in the "Performing a platform update" and "Performing an Operator update" sections.
Apply the prep work for the platform and the Operator update.
Save the content of the ClusterGroupUpgrade CR with the policies for platform update preparation work, catalog source updates, and target clusters to the cgu-platform-operator-upgrade-prep.yml file, for example:

Apply the cgu-platform-operator-upgrade-prep.yml file to the hub cluster by running the following command:

$ oc apply -f cgu-platform-operator-upgrade-prep.yml

Monitor the process. Upon completion, ensure that the policy is compliant by running the following command:

$ oc get policies --all-namespaces

Create the ClusterGroupUpgrade CR for the platform and the Operator update with the spec.enable field set to false.
Save the contents of the platform and Operator update ClusterGroupUpgrade CR with the policies and the target clusters to the cgu-platform-operator-upgrade.yml file, as shown in the following example:

Apply the cgu-platform-operator-upgrade.yml file to the hub cluster by running the following command:

$ oc apply -f cgu-platform-operator-upgrade.yml

Optional: Pre-cache the images for the platform and the Operator update.
Enable pre-caching in the ClusterGroupUpgrade CR by running the following command:

$ oc --namespace=default patch clustergroupupgrade.ran.openshift.io/cgu-du-upgrade \
  --patch '{"spec":{"preCaching": true}}' --type=merge

Monitor the update process and wait for the pre-caching to complete. Check the status of pre-caching by running the following command on the managed cluster:

$ oc get jobs,pods -n openshift-talo-pre-cache

Check if the pre-caching is completed before starting the update by running the following command:

$ oc get cgu cgu-du-upgrade -ojsonpath='{.status.conditions}'

Start the platform and Operator update.
Enable the cgu-du-upgrade ClusterGroupUpgrade CR to start the platform and the Operator update by running the following command:

$ oc --namespace=default patch clustergroupupgrade.ran.openshift.io/cgu-du-upgrade \
  --patch '{"spec":{"enable":true, "preCaching": false}}' --type=merge

Monitor the process. Upon completion, ensure that the policy is compliant by running the following command:

$ oc get policies --all-namespaces

Note
The CRs for the platform and Operator updates can be created from the beginning with spec.enable set to true. In this case, the update starts immediately after pre-caching completes and there is no need to manually enable the CR.
Both pre-caching and the update create extra resources, such as policies, placement bindings, placement rules, managed cluster actions, and managed cluster views, to help complete the procedures. Setting the afterCompletion.deleteObjects field to true deletes all these resources after the updates complete.
22.12.1.5. Removing Performance Addon Operator subscriptions from deployed clusters
In earlier versions of OpenShift Container Platform, the Performance Addon Operator provided automatic, low latency performance tuning for applications. In OpenShift Container Platform 4.11 or later, these functions are part of the Node Tuning Operator.
Do not install the Performance Addon Operator on clusters running OpenShift Container Platform 4.11 or later. If you upgrade to OpenShift Container Platform 4.11 or later, the Node Tuning Operator automatically removes the Performance Addon Operator.
You need to remove any policies that create Performance Addon Operator subscriptions to prevent a re-installation of the Operator.
The reference DU profile includes the Performance Addon Operator in the PolicyGenTemplate CR common-ranGen.yaml. To remove the subscription from deployed managed clusters, you must update common-ranGen.yaml.
If you install Performance Addon Operator 4.10.3-5 or later on OpenShift Container Platform 4.11 or later, the Performance Addon Operator detects the cluster version and automatically hibernates to avoid interfering with the Node Tuning Operator functions. However, to ensure best performance, remove the Performance Addon Operator from your OpenShift Container Platform 4.11 clusters.
Prerequisites
- Create a Git repository where you manage your custom site configuration data. The repository must be accessible from the hub cluster and be defined as a source repository for ArgoCD.
- Update to OpenShift Container Platform 4.11 or later.
- Log in as a user with cluster-admin privileges.
Procedure
- Change the complianceType to mustnothave for the Performance Addon Operator namespace, Operator group, and subscription in the common-ranGen.yaml file, as illustrated in the hedged sketch after this procedure.
- Merge the changes with your custom site repository and wait for the ArgoCD application to synchronize the change to the hub cluster. The status of the common-subscriptions-policy policy changes to Non-Compliant.
- Apply the change to your target clusters by using the Topology Aware Lifecycle Manager. For more information about rolling out configuration changes, see the "Additional resources" section.
- Monitor the process. When the status of the common-subscriptions-policy policy for a target cluster is Compliant, the Performance Addon Operator has been removed from the cluster. Get the status of the common-subscriptions-policy by running the following command:

$ oc get policy -n ztp-common common-subscriptions-policy

- Delete the Performance Addon Operator namespace, Operator group, and subscription CRs from .spec.sourceFiles in the common-ranGen.yaml file.
- Merge the changes with your custom site repository and wait for the ArgoCD application to synchronize the change to the hub cluster. The policy remains compliant.
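The following is a hedged sketch of the kind of sourceFiles entries that the first step describes; the fileName and policyName values follow the reference naming convention but are assumptions here, not a verbatim copy of common-ranGen.yaml.

# Hedged sketch: marking the Performance Addon Operator objects as mustnothave
# in common-ranGen.yaml. fileName and policyName values are assumptions.
- fileName: PaoSubscriptionNS.yaml
  policyName: "subscriptions-policy"
  complianceType: mustnothave
- fileName: PaoSubscriptionOperGroup.yaml
  policyName: "subscriptions-policy"
  complianceType: mustnothave
- fileName: PaoSubscription.yaml
  policyName: "subscriptions-policy"
  complianceType: mustnothave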
22.12.1.6. Pre-caching user-specified images with TALM on single-node OpenShift clusters
You can pre-cache application-specific workload images on single-node OpenShift clusters before upgrading your applications.
You can specify the configuration options for the pre-caching jobs using the following custom resources (CRs):
- PreCachingConfig CR
- ClusterGroupUpgrade CR
All fields in the PreCachingConfig CR are optional.
Example PreCachingConfig CR
- 1
- By default, TALM automatically populates the platformImage, operatorsIndexes, and the operatorsPackagesAndChannels fields from the policies of the managed clusters. You can specify values to override the default TALM-derived values for these fields.
- 2
- Specifies the minimum required disk space on the cluster. If unspecified, TALM defines a default value for OpenShift Container Platform images. The disk space field must include an integer value and the storage unit. For example: 40 GiB, 200 MB, 1 TiB.
- 3
- Specifies the images to exclude from pre-caching based on image name matching.
- 4
- Specifies the list of additional images to pre-cache.
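As a hedged sketch, a PreCachingConfig CR covering the fields that the callouts above describe might look like the following; the field layout and all values are assumptions to be confirmed against the installed CRD.

# Hedged sketch of a PreCachingConfig CR; field layout and values are assumptions.
apiVersion: ran.openshift.io/v1alpha1
kind: PreCachingConfig
metadata:
  name: example-precachingconfig
  namespace: default
spec:
  overrides:                         # callout 1: optional overrides of TALM-derived values
    platformImage: quay.io/openshift-release-dev/ocp-release@sha256:<digest>
    operatorsIndexes:
    - registry.example.com:5000/olm/redhat-operator-index:v4.14
    operatorsPackagesAndChannels:
    - local-storage-operator: stable
  spaceRequired: 45 GiB              # callout 2: minimum required disk space
  excludePrecachePatterns:           # callout 3: images to exclude by name matching
  - aws
  additionalImages:                  # callout 4: extra images to pre-cache
  - quay.io/example-org/example-workload@sha256:<digest>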
Example ClusterGroupUpgrade CR with PreCachingConfig CR reference
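As a hedged sketch, a ClusterGroupUpgrade CR that references a PreCachingConfig CR might look like the following; the reference field name and all values are assumptions to be checked against your installed TALM version.

# Hedged sketch: ClusterGroupUpgrade referencing a PreCachingConfig CR.
# Field names and values are assumptions, not the reference example.
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: cgu
  namespace: default
spec:
  preCaching: true
  preCachingConfigRef:       # assumed reference field; points at the PreCachingConfig CR
    name: example-precachingconfig
    namespace: default
  clusters:
  - sno-worker-01
  enable: false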
22.12.1.6.1. Creating the custom resources for pre-caching
You must create the PreCachingConfig CR before or concurrently with the ClusterGroupUpgrade CR.
Create the PreCachingConfig CR with the list of additional images you want to pre-cache.
Create a ClusterGroupUpgrade CR with the preCaching field set to true and specify the PreCachingConfig CR created in the previous step:
Warning
Once you install the images on the cluster, you cannot change or delete them.
When you want to start pre-caching the images, apply the ClusterGroupUpgrade CR by running the following command:

$ oc apply -f cgu.yaml
TALM verifies the ClusterGroupUpgrade CR.
From this point, you can continue with the TALM pre-caching workflow.
All sites are pre-cached concurrently.
Verification
Check the pre-caching status on the hub cluster where the ClusterGroupUpgrade CR is applied by running the following command:

$ oc get cgu <cgu_name> -n <cgu_namespace> -oyaml

The pre-caching configurations are validated by checking if the managed policies exist. Valid configurations of the ClusterGroupUpgrade and the PreCachingConfig CRs result in the following statuses:

Example of an invalid PreCachingConfig CR

Type:    "PrecacheSpecValid"
Status:  False,
Reason:  "PrecacheSpecIncomplete"
Message: "Precaching spec is incomplete: failed to get PreCachingConfig resource due to PreCachingConfig.ran.openshift.io "<pre-caching_cr_name>" not found"

You can find the pre-caching job by running the following command on the managed cluster:

$ oc get jobs -n openshift-talo-pre-cache

Example of pre-caching job in progress

NAME        COMPLETIONS   DURATION   AGE
pre-cache   0/1           1s         1s

You can check the status of the pod created for the pre-caching job by running the following command:

$ oc describe pod pre-cache -n openshift-talo-pre-cache

Example of pre-caching job in progress

Type    Reason            Age   From            Message
Normal  SuccessfulCreate  19s   job-controller  Created pod: pre-cache-abcd1

You can get live updates on the status of the job by running the following command:

$ oc logs -f pre-cache-abcd1 -n openshift-talo-pre-cache

To verify the pre-cache job is successfully completed, run the following command:

$ oc describe pod pre-cache -n openshift-talo-pre-cache

Example of completed pre-cache job

Type    Reason            Age    From            Message
Normal  SuccessfulCreate  5m19s  job-controller  Created pod: pre-cache-abcd1
Normal  Completed         19s    job-controller  Job completed

To verify that the images are successfully pre-cached on the single-node OpenShift, do the following:
Enter into the node in debug mode:

$ oc debug node/cnfdf00.example.lab

Change root to host:

$ chroot /host/

Search for the desired images:

$ sudo podman images | grep <operator_name>
22.12.2. About the auto-created ClusterGroupUpgrade CR for GitOps ZTP
TALM has a controller called ManagedClusterForCGU that monitors the Ready state of the ManagedCluster CRs on the hub cluster and creates the ClusterGroupUpgrade CRs for GitOps Zero Touch Provisioning (ZTP).
For any managed cluster in the Ready state without a ztp-done label applied, the ManagedClusterForCGU controller automatically creates a ClusterGroupUpgrade CR in the ztp-install namespace with its associated RHACM policies that are created during the GitOps ZTP process. TALM then remediates the set of configuration policies that are listed in the auto-created ClusterGroupUpgrade CR to push the configuration CRs to the managed cluster.
If there are no policies for the managed cluster at the time when the cluster becomes Ready, a ClusterGroupUpgrade CR with no policies is created. Upon completion of the ClusterGroupUpgrade the managed cluster is labeled as ztp-done. If there are policies that you want to apply for that managed cluster, manually create a ClusterGroupUpgrade as a day-2 operation.
Example of an auto-created ClusterGroupUpgrade CR for GitOps ZTP
22.13. Expanding single-node OpenShift clusters with GitOps ZTP
You can expand single-node OpenShift clusters with GitOps Zero Touch Provisioning (ZTP). When you add worker nodes to single-node OpenShift clusters, the original single-node OpenShift cluster retains the control plane node role. Adding worker nodes does not require any downtime for the existing single-node OpenShift cluster.
Although there is no specified limit on the number of worker nodes that you can add to a single-node OpenShift cluster, you must re-evaluate the reserved CPU allocation on the control plane node for the additional worker nodes.
If you require workload partitioning on the worker node, you must deploy and remediate the managed cluster policies on the hub cluster before installing the node. This way, the workload partitioning MachineConfig objects are rendered and associated with the worker machine config pool before the GitOps ZTP workflow applies the MachineConfig ignition file to the worker node.
It is recommended that you first remediate the policies, and then install the worker node. If you create the workload partitioning manifests after installing the worker node, you must drain the node manually and delete all the pods managed by daemon sets. When the managing daemon sets create the new pods, the new pods undergo the workload partitioning process.
Adding worker nodes to single-node OpenShift clusters with GitOps ZTP is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
22.13.1. Applying profiles to the worker node
You can configure the additional worker node with a DU profile.
You can apply a RAN distributed unit (DU) profile to the worker node cluster using the GitOps Zero Touch Provisioning (ZTP) common, group, and site-specific PolicyGenTemplate resources. The GitOps ZTP pipeline that is linked to the ArgoCD policies application includes the following CRs that you can find in the out/argocd/example/policygentemplates folder when you extract the ztp-site-generate container:
- common-ranGen.yaml
- group-du-sno-ranGen.yaml
- example-sno-site.yaml
- ns.yaml
- kustomization.yaml
Configuring the DU profile on the worker node is considered an upgrade. To initiate the upgrade flow, you must update the existing policies or create additional ones. Then, you must create a ClusterGroupUpgrade CR to reconcile the policies in the group of clusters.
22.13.2. (Optional) Ensuring PTP and SR-IOV daemon selector compatibility
If the DU profile was deployed using the GitOps Zero Touch Provisioning (ZTP) plugin version 4.11 or earlier, the PTP and SR-IOV Operators might be configured to place the daemons only on nodes labelled as master. This configuration prevents the PTP and SR-IOV daemons from operating on the worker node. If the PTP and SR-IOV daemon node selectors are incorrectly configured on your system, you must change the daemons before proceeding with the worker DU profile configuration.
Procedure
Check the daemon node selector settings of the PTP Operator on one of the spoke clusters:

$ oc get ptpoperatorconfig/default -n openshift-ptp -ojsonpath='{.spec}' | jq

Example output for PTP Operator

{"daemonNodeSelector":{"node-role.kubernetes.io/master":""}} 1

- 1
- If the node selector is set to master, the spoke was deployed with the version of the GitOps ZTP plugin that requires changes.

Check the daemon node selector settings of the SR-IOV Operator on one of the spoke clusters:

$ oc get sriovoperatorconfig/default -n \
  openshift-sriov-network-operator -ojsonpath='{.spec}' | jq

Example output for SR-IOV Operator

{"configDaemonNodeSelector":{"node-role.kubernetes.io/worker":""},"disableDrain":false,"enableInjector":true,"enableOperatorWebhook":true} 1

- 1
- If the node selector is set to master, the spoke was deployed with the version of the GitOps ZTP plugin that requires changes.

In the group policy, add the following complianceType and spec entries:

Important
Changing the daemonNodeSelector field causes temporary PTP synchronization loss and SR-IOV connectivity loss.
- Commit the changes in Git, and then push to the Git repository being monitored by the GitOps ZTP ArgoCD application.
22.13.3. PTP and SR-IOV node selector compatibility
The PTP configuration resources and SR-IOV network node policies use node-role.kubernetes.io/master: "" as the node selector. If the additional worker nodes have the same NIC configuration as the control plane node, the policies used to configure the control plane node can be reused for the worker nodes. However, the node selector must be changed to select both node types, for example with the "node-role.kubernetes.io/worker" label.
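As a hedged sketch of the change this section describes, the node selection in an SR-IOV network node policy and a PTP configuration can be pointed at the worker role as follows; the resource names are placeholders and only the selector-related fields are shown.

# Hedged sketch: selecting worker nodes in an SriovNetworkNodePolicy and a
# PtpConfig. Names are placeholders; unrelated fields are omitted.
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: example-sriov-policy
  namespace: openshift-sriov-network-operator
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ""
---
apiVersion: ptp.openshift.io/v1
kind: PtpConfig
metadata:
  name: example-ptp-config
  namespace: openshift-ptp
spec:
  recommend:
  - profile: example-profile
    priority: 4
    match:
    - nodeLabel: node-role.kubernetes.io/worker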
22.13.4. Using PolicyGenTemplate CRs to apply worker node policies to worker nodes
You can create policies for worker nodes.
Procedure
Create the following policy template:
- The policies are applied to all clusters with this label.
- 2
- The
MCPfield must be set toworker. - 3
- This generic
MachineConfigCR is used to configure workload partitioning on the worker node. - 4
- The
cpu.isolatedandcpu.reservedfields must be configured for each particular hardware platform. - 5
- The
cmdline_crashCPU set must match thecpu.isolatedset in thePerformanceProfilesection.
A generic
MachineConfigCR is used to configure workload partitioning on the worker node. You can generate the content ofcrioandkubeletconfiguration files.-
Add the created policy template to the Git repository monitored by the ArgoCD
policiesapplication. -
Add the policy in the
kustomization.yamlfile. - Commit the changes in Git, and then push to the Git repository being monitored by the GitOps ZTP ArgoCD application.
To remediate the new policies to your spoke cluster, create a TALM custom resource:
22.13.5. Adding worker nodes to single-node OpenShift clusters with GitOps ZTP
You can add one or more worker nodes to existing single-node OpenShift clusters to increase available CPU resources in the cluster.
Prerequisites
- Install and configure RHACM 2.6 or later in an OpenShift Container Platform 4.11 or later bare-metal hub cluster
- Install Topology Aware Lifecycle Manager in the hub cluster
- Install Red Hat OpenShift GitOps in the hub cluster
- Use the GitOps ZTP ztp-site-generate container image version 4.12 or later
- Deploy a managed single-node OpenShift cluster with GitOps ZTP
- Configure the Central Infrastructure Management as described in the RHACM documentation
- Configure the DNS serving the cluster to resolve the internal API endpoint api-int.<cluster_name>.<base_domain>
Procedure
If you deployed your cluster by using the example-sno.yaml SiteConfig manifest, add your new worker node to the spec.clusters['example-sno'].nodes list:

Create a BMC authentication secret for the new host, as referenced by the bmcCredentialsName field in the spec.nodes section of your SiteConfig file:
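A hedged sketch of such a secret follows; the secret name must match the bmcCredentialsName value in your SiteConfig node entry, and the credentials shown are placeholders.

# Hedged sketch of the BMC credentials Secret; name and values are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: example-node2-bmh-secret
  namespace: example-sno
type: Opaque
data:
  username: <base64_encoded_username>
  password: <base64_encoded_password>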
Commit the changes in Git, and then push to the Git repository that is being monitored by the GitOps ZTP ArgoCD application.
When the ArgoCD cluster application synchronizes, two new manifests appear on the hub cluster, generated by the GitOps ZTP plugin:
- BareMetalHost
- NMStateConfig
Important
The cpuset field should not be configured for the worker node. Workload partitioning for worker nodes is added through management policies after the node installation is complete.
You can monitor the installation process in several ways.
Check if the preprovisioning images are created by running the following command:

$ oc get ppimg -n example-sno

Example output

NAMESPACE     NAME            READY   REASON
example-sno   example-sno     True    ImageCreated
example-sno   example-node2   True    ImageCreated

Check the state of the bare-metal hosts:

$ oc get bmh -n example-sno

Example output

NAME            STATE          CONSUMER   ONLINE   ERROR   AGE
example-sno     provisioned               true             69m
example-node2   provisioning              true             4m50s 1

- 1
- The provisioning state indicates that node booting from the installation media is in progress.

Continuously monitor the installation process:
Watch the agent install process by running the following command:

$ oc get agent -n example-sno --watch

When the worker node installation is finished, the worker node certificates are approved automatically. At this point, the worker appears in the ManagedClusterInfo status. Run the following command to see the status:

$ oc get managedclusterinfo/example-sno -n example-sno -o \
  jsonpath='{range .status.nodeList[*]}{.name}{"\t"}{.conditions}{"\t"}{.labels}{"\n"}{end}'

Example output

example-sno     [{"status":"True","type":"Ready"}]   {"node-role.kubernetes.io/master":"","node-role.kubernetes.io/worker":""}
example-node2   [{"status":"True","type":"Ready"}]   {"node-role.kubernetes.io/worker":""}
22.14. Pre-caching images for single-node OpenShift deployments
In environments with limited bandwidth where you use the GitOps Zero Touch Provisioning (ZTP) solution to deploy a large number of clusters, you want to avoid downloading all the images that are required for bootstrapping and installing OpenShift Container Platform. The limited bandwidth at remote single-node OpenShift sites can cause long deployment times. The factory-precaching-cli tool allows you to pre-stage servers before shipping them to the remote site for ZTP provisioning.
The factory-precaching-cli tool does the following:
- Downloads the RHCOS rootfs image that is required by the minimal ISO to boot.
- Creates a partition from the installation disk labelled as data.
- Formats the partition as xfs.
- Creates a GUID Partition Table (GPT) data partition at the end of the disk, where the size of the partition is configurable by the tool.
- Copies the container images required to install OpenShift Container Platform.
- Copies the container images required by ZTP to install OpenShift Container Platform.
- Optional: Copies Day-2 Operators to the partition.
The factory-precaching-cli tool is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
22.14.1. Getting the factory-precaching-cli tool
The factory-precaching-cli tool Go binary is publicly available in the Telco RAN distributed unit (DU) tools container image. You run the factory-precaching-cli tool Go binary in the container image with podman, on a server that is running an RHCOS live image. If you are working in a disconnected environment or have a private registry, you need to copy the image there so that you can download the image to the server.
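For example, assuming a hypothetical private registry at registry.example.com:5000 with credentials already configured, you can mirror the tools image with skopeo before shipping the server:

$ skopeo copy \
  docker://quay.io/openshift-kni/telco-ran-tools:latest \
  docker://registry.example.com:5000/openshift-kni/telco-ran-tools:latest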
Procedure
Pull the factory-precaching-cli tool image by running the following command:
# podman pull quay.io/openshift-kni/telco-ran-tools:latest
Verification
To check that the tool is available, query the current version of the factory-precaching-cli tool Go binary:
# podman run quay.io/openshift-kni/telco-ran-tools:latest -- factory-precaching-cli -v

Example output

factory-precaching-cli version 20221018.120852+main.feecf17
22.14.2. Booting from a live operating system image
You can use the factory-precaching-cli tool to boot servers where only one disk is available and an external disk drive cannot be attached to the server.
RHCOS requires that the disk is not in use when it is about to be written with an RHCOS image.
Depending on the server hardware, you can mount the RHCOS live ISO on the blank server using one of the following methods:
- Using the Dell RACADM tool on a Dell server.
- Using the HPONCFG tool on an HP server.
- Using the Redfish BMC API.
It is recommended to automate the mounting procedure. To automate the procedure, you need to pull the required images and host them on a local HTTP server.
Prerequisites
- You powered up the host.
- You have network connectivity to the host.
This example procedure uses the Redfish BMC API to mount the RHCOS live ISO.
Mount the RHCOS live ISO:
Check virtual media status:
$ curl --globoff -H "Content-Type: application/json" -H \
  "Accept: application/json" -k -X GET --user ${username_password} \
  https://$BMC_ADDRESS/redfish/v1/Managers/Self/VirtualMedia/1 | python -m json.tool

Mount the ISO file as a virtual media:
$ curl --globoff -L -w "%{http_code} %{url_effective}\\n" -ku ${username_password} \
  -H "Content-Type: application/json" -H "Accept: application/json" \
  -d '{"Image": "http://[$HTTPd_IP]/RHCOS-live.iso"}' \
  -X POST https://$BMC_ADDRESS/redfish/v1/Managers/Self/VirtualMedia/1/Actions/VirtualMedia.InsertMedia

Set the boot order to boot from the virtual media once:
$ curl --globoff -L -w "%{http_code} %{url_effective}\\n" -ku ${username_password} \
  -H "Content-Type: application/json" -H "Accept: application/json" \
  -d '{"Boot":{ "BootSourceOverrideEnabled": "Once", "BootSourceOverrideTarget": "Cd", "BootSourceOverrideMode": "UEFI"}}' \
  -X PATCH https://$BMC_ADDRESS/redfish/v1/Systems/Self
- Reboot and ensure that the server is booting from virtual media.
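For example, on BMCs that expose the standard Redfish ComputerSystem.Reset action, you can trigger the restart with a call similar to the following; the /redfish/v1/Systems/Self path follows the convention of the earlier examples and might differ on your hardware:

$ curl --globoff -L -w "%{http_code} %{url_effective}\\n" -ku ${username_password} \
  -H "Content-Type: application/json" -H "Accept: application/json" \
  -d '{"ResetType": "ForceRestart"}' \
  -X POST https://$BMC_ADDRESS/redfish/v1/Systems/Self/Actions/ComputerSystem.Reset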
22.14.3. Partitioning the disk
To run the full pre-caching process, you have to boot from a live ISO and use the factory-precaching-cli tool from a container image to partition and pre-cache all the artifacts required.
A live ISO or RHCOS live ISO is required because the disk must not be in use when the operating system (RHCOS) is written to the device during provisioning. This procedure also enables servers that have only a single disk.
Prerequisites
- You have a disk that is not partitioned.
- You have access to the quay.io/openshift-kni/telco-ran-tools:latest image.
- You have enough storage to install OpenShift Container Platform and pre-cache the required images.
Procedure
Verify that the disk is cleared:
# lsblk

Example output

NAME      MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop0       7:0    0  93.8G  0 loop /run/ephemeral
loop1       7:1    0 897.3M  1 loop /sysroot
sr0        11:0    1   999M  0 rom  /run/media/iso
nvme0n1   259:1    0   1.5T  0 disk

Erase any file system, RAID or partition table signatures from the device:
# wipefs -a /dev/nvme0n1

Example output

/dev/nvme0n1: 8 bytes were erased at offset 0x00000200 (gpt): 45 46 49 20 50 41 52 54
/dev/nvme0n1: 8 bytes were erased at offset 0x1749a955e00 (gpt): 45 46 49 20 50 41 52 54
/dev/nvme0n1: 2 bytes were erased at offset 0x000001fe (PMBR): 55 aa
The tool fails if the disk is not empty because it uses partition number 1 of the device for pre-caching the artifacts.
22.14.3.1. Creating the partition
When the device is ready, you create a single partition and a GPT partition table. The partition is automatically labelled as data and created at the end of the device. Otherwise, the partition is overwritten by coreos-installer.
The coreos-installer requires the partition to be created at the end of the device and to be labelled as data. Both requirements are necessary to save the partition when writing the RHCOS image to the disk.
Prerequisites
- The container must run as privileged, because it formats host devices.
- You have to mount the /dev folder so that the process can be executed inside the container.
Procedure
In the following example, the size of the partition is 250 GiB to allow pre-caching the DU profile, including the Day 2 Operators.
Run the container as privileged and partition the disk:

# podman run -v /dev:/dev --privileged \
  --rm quay.io/openshift-kni/telco-ran-tools:latest -- \
  factory-precaching-cli partition \
  -d /dev/nvme0n1 \
  -s 250

Check the storage information:

# lsblk
Verification
You must verify that the following requirements are met:
- The device has a GPT partition table.
- The partition uses the last sectors of the device.
- The partition is correctly labeled as data.
Query the disk status to verify that the disk is partitioned as expected:
# gdisk -l /dev/nvme0n1
22.14.3.2. Mounting the partition
After verifying that the disk is partitioned correctly, you can mount the device into /mnt.
It is recommended to mount the device into /mnt because that mounting point is used during GitOps ZTP preparation.
Verify that the partition is formatted as xfs:

# lsblk -f /dev/nvme0n1

Example output

NAME        FSTYPE LABEL UUID                                 MOUNTPOINT
nvme0n1
└─nvme0n1p1 xfs          1bee8ea4-d6cf-4339-b690-a76594794071

Mount the partition:
# mount /dev/nvme0n1p1 /mnt/
Verification
Check that the partition is mounted:
# lsblk

In the output, the mount point is /var/mnt because the /mnt folder in RHCOS is a link to /var/mnt.
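As an optional extra check, findmnt resolves the mount point and shows the backing device and file system type:

# findmnt /var/mnt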
22.14.4. Downloading the images
The factory-precaching-cli tool allows you to download the following images to your partitioned server:
- OpenShift Container Platform images
- Operator images that are included in the distributed unit (DU) profile for 5G RAN sites
- Operator images from disconnected registries
The list of available Operator images can vary in different OpenShift Container Platform releases.
22.14.4.1. Downloading with parallel workers
The factory-precaching-cli tool uses parallel workers to download multiple images simultaneously. You can configure the number of workers with the --parallel or -p option. The default is 80% of the CPUs available to the server.
Your login shell may be restricted to a subset of CPUs, which reduces the CPUs available to the container. To remove this restriction, you can precede your commands with taskset 0xffffffff, for example:
# taskset 0xffffffff podman run --rm quay.io/openshift-kni/telco-ran-tools:latest factory-precaching-cli download --help
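For example, a download invocation that caps the worker count explicitly might look like the following sketch; the -p value and the release and version flags shown here are illustrative placeholders, not values required by this procedure:

# taskset 0xffffffff podman run -v /mnt:/mnt -v /root/.docker:/root/.docker --privileged --rm \
  quay.io/openshift-kni/telco-ran-tools:latest -- \
  factory-precaching-cli download -r 4.14.0 \
  --acm-version 2.6.3 --mce-version 2.1.4 -f /mnt \
  --parallel 4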
22.14.4.2. Preparing to download the OpenShift Container Platform images
To download OpenShift Container Platform container images, you need to know the multicluster engine version. When you use the --du-profile flag, you also need to specify the Red Hat Advanced Cluster Management (RHACM) version running in the hub cluster that is going to provision the single-node OpenShift.
Prerequisites
- You have RHACM and the multicluster engine Operator installed.
- You partitioned the storage device.
- You have enough space for the images on the partitioned device.
- You connected the bare-metal server to the Internet.
- You have a valid pull secret.
Procedure
Check the RHACM version and the multicluster engine version by running the following commands in the hub cluster:
$ oc get csv -A | grep -i advanced-cluster-management

Example output

open-cluster-management   advanced-cluster-management.v2.6.3   Advanced Cluster Management for Kubernetes   2.6.3   advanced-cluster-management.v2.6.3   Succeeded

$ oc get csv -A | grep -i multicluster-engine

Example output

multicluster-engine   cluster-group-upgrades-operator.v0.0.3    cluster-group-upgrades-operator      0.0.3   Pending
multicluster-engine   multicluster-engine.v2.1.4                multicluster engine for Kubernetes   2.1.4   multicluster-engine.v2.0.3                        Succeeded
multicluster-engine   openshift-gitops-operator.v1.5.7          Red Hat OpenShift GitOps             1.5.7   openshift-gitops-operator.v1.5.6-0.1664915551.p   Succeeded
multicluster-engine   openshift-pipelines-operator-rh.v1.6.4    Red Hat OpenShift Pipelines          1.6.4   openshift-pipelines-operator-rh.v1.6.3            Succeeded

To access the container registry, copy a valid pull secret to the server to be installed:
Create the .docker folder:

$ mkdir /root/.docker

Copy the valid pull secret in the config.json file to the previously created .docker/ folder:

$ cp config.json /root/.docker/config.json

/root/.docker/config.json is the default path where podman checks for the login credentials for the registry.
If you use a different registry to pull the required artifacts, you need to copy the proper pull secret. If the local registry uses TLS, you need to include the certificates from the registry as well.
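For example, assuming a hypothetical private registry at registry.example.com, you can append its credentials to the same file with podman login:

$ podman login --authfile /root/.docker/config.json registry.example.com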
22.14.4.3. Downloading the OpenShift Container Platform images
The factory-precaching-cli tool allows you to pre-cache all the container images required to provision a specific OpenShift Container Platform release.
Procedure
Pre-cache the release by running the following command:
# podman run -v /mnt:/mnt -v /root/.docker:/root/.docker --privileged --rm quay.io/openshift-kni/telco-ran-tools:latest -- \
  factory-precaching-cli download \
  -r 4.14.0 \
  --acm-version 2.6.3 \
  --mce-version 2.1.4 \
  -f /mnt \
  --img quay.io/custom/repository

where:

download - Specifies the downloading function of the factory-precaching-cli tool.
-r - Defines the OpenShift Container Platform release version.
--acm-version - Defines the RHACM version.
--mce-version - Defines the multicluster engine version.
-f - Defines the folder where you want to download the images on the disk.
--img - Optional. Defines the repository where you store your additional images. These images are downloaded and pre-cached on the disk.
Verification
Check that all the images are compressed in the target folder of the server:

$ ls -l /mnt

It is recommended that you pre-cache the images in the /mnt folder.
22.14.4.4. Downloading the Operator images
You can also pre-cache Day-2 Operators used in the 5G Radio Access Network (RAN) Distributed Unit (DU) cluster configuration. The Day-2 Operators depend on the installed OpenShift Container Platform version.
You need to include the RHACM hub and multicluster engine Operator versions by using the --acm-version and --mce-version flags so that the factory-precaching-cli tool can pre-cache the appropriate container images for RHACM and the multicluster engine Operator.
Procedure
Pre-cache the Operator images:
# podman run -v /mnt:/mnt -v /root/.docker:/root/.docker --privileged --rm quay.io/openshift-kni/telco-ran-tools:latest -- \
  factory-precaching-cli download \
  -r 4.14.0 \
  --acm-version 2.6.3 \
  --mce-version 2.1.4 \
  -f /mnt \
  --img quay.io/custom/repository \
  --du-profile -s

where:

download - Specifies the downloading function of the factory-precaching-cli tool.
-r - Defines the OpenShift Container Platform release version.
--acm-version - Defines the RHACM version.
--mce-version - Defines the multicluster engine version.
-f - Defines the folder where you want to download the images on the disk.
--img - Optional. Defines the repository where you store your additional images. These images are downloaded and pre-cached on the disk.
--du-profile - Specifies pre-caching the Operators included in the DU configuration.
22.14.4.5. Pre-caching custom images in disconnected environments
The --generate-imageset argument stops the factory-precaching-cli tool after the ImageSetConfiguration custom resource (CR) is generated. This allows you to customize the ImageSetConfiguration CR before downloading any images. After you customized the CR, you can use the --skip-imageset argument to download the images that you specified in the ImageSetConfiguration CR.
You can customize the ImageSetConfiguration CR in the following ways:
- Add Operators and additional images
- Remove Operators and additional images
- Change Operator and catalog sources to local or disconnected registries
Procedure
Pre-cache the images:
# podman run -v /mnt:/mnt -v /root/.docker:/root/.docker --privileged --rm quay.io/openshift-kni/telco-ran-tools:latest -- \
  factory-precaching-cli download \
  -r 4.14.0 \
  --acm-version 2.6.3 \
  --mce-version 2.1.4 \
  -f /mnt \
  --img quay.io/custom/repository \
  --du-profile -s \
  --generate-imageset

where:

download - Specifies the downloading function of the factory-precaching-cli tool.
-r - Defines the OpenShift Container Platform release version.
--acm-version - Defines the RHACM version.
--mce-version - Defines the multicluster engine version.
-f - Defines the folder where you want to download the images on the disk.
--img - Optional. Defines the repository where you store your additional images. These images are downloaded and pre-cached on the disk.
--du-profile - Specifies pre-caching the Operators included in the DU configuration.
--generate-imageset - Generates the ImageSetConfiguration CR only, which allows you to customize the CR before downloading any images.
Example output
Generated /mnt/imageset.yaml
Example ImageSetConfiguration CR
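The following is a minimal sketch of the ImageSetConfiguration structure that the oc-mirror plugin consumes; the channel, catalog, and package entries are placeholders for illustration rather than the exact content that the tool generates:

apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
mirror:
  platform:
    channels:
    - name: stable-4.14              # placeholder release channel
      minVersion: 4.14.0
      maxVersion: 4.14.0
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.14
    packages:
    - name: sriov-network-operator   # example Day-2 Operator entry
  additionalImages: []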
Customize the catalog resource in the CR:
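For instance, assuming a hypothetical disconnected registry at registry.example.com:5000 that mirrors the Operator catalog, the operators entry could be changed to point at it:

  operators:
  - catalog: registry.example.com:5000/redhat/redhat-operator-index:v4.14
    packages:
    - name: sriov-network-operator   # example entry; keep the packages that you need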
When you download images by using a local or disconnected registry, you have to first add certificates for the registries that you want to pull the content from.
To avoid any errors, copy the registry certificate into your server:
# cp /tmp/eko4-ca.crt /etc/pki/ca-trust/source/anchors/.

Then, update the certificates trust store:

# update-ca-trust

Mount the host /etc/pki folder into the factory-cli image:
# podman run -v /mnt:/mnt -v /root/.docker:/root/.docker -v /etc/pki:/etc/pki --privileged -it --rm quay.io/openshift-kni/telco-ran-tools:latest -- \
  factory-precaching-cli download \
  -r 4.14.0 \
  --acm-version 2.6.3 \
  --mce-version 2.1.4 \
  -f /mnt \
  --img quay.io/custom/repository \
  --du-profile -s \
  --skip-imageset

where:

download - Specifies the downloading function of the factory-precaching-cli tool.
-r - Defines the OpenShift Container Platform release version.
--acm-version - Defines the RHACM version.
--mce-version - Defines the multicluster engine version.
-f - Defines the folder where you want to download the images on the disk.
--img - Optional. Defines the repository where you store your additional images. These images are downloaded and pre-cached on the disk.
--du-profile - Specifies pre-caching the Operators included in the DU configuration.
--skip-imageset - Downloads the images that you specified in your customized ImageSetConfiguration CR.
Download the images without generating a new ImageSetConfiguration CR:

# podman run -v /mnt:/mnt -v /root/.docker:/root/.docker --privileged --rm quay.io/openshift-kni/telco-ran-tools:latest -- \
  factory-precaching-cli download -r 4.14.0 \
  --acm-version 2.6.3 --mce-version 2.1.4 -f /mnt \
  --img quay.io/custom/repository \
  --du-profile -s \
  --skip-imageset
22.14.5. Pre-caching images in GitOps ZTP
The SiteConfig manifest defines how an OpenShift cluster is to be installed and configured. In the GitOps Zero Touch Provisioning (ZTP) provisioning workflow, the factory-precaching-cli tool requires the following additional fields in the SiteConfig manifest:
- clusters.ignitionConfigOverride
- nodes.installerArgs
- nodes.ignitionConfigOverride
Example SiteConfig with additional fields
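The following is a structural sketch only: it shows roughly where the extra fields sit in the SiteConfig spec. The cluster and node names are placeholders, other required SiteConfig fields are omitted, the Ignition override payloads are elided, and the installerArgs value is illustrative rather than the exact manifest used by this procedure.

apiVersion: ran.openshift.io/v1
kind: SiteConfig
metadata:
  name: example-sno
  namespace: example-sno
spec:
  clusters:
  - clusterName: example-sno
    # Ignition override applied during the GitOps ZTP discovery stage
    ignitionConfigOverride: '{"ignition": {"version": "3.2.0"}}'
    nodes:
    - hostName: example-node.example.com
      # Extra arguments passed to coreos-installer, for example to save the data partition
      installerArgs: '["--save-partlabel", "data"]'
      # Ignition override applied at the OpenShift Container Platform installation stage
      ignitionConfigOverride: '{"ignition": {"version": "3.2.0"}}'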
22.14.5.1. Understanding the clusters.ignitionConfigOverride field
The clusters.ignitionConfigOverride field adds a configuration in Ignition format during the GitOps ZTP discovery stage. The configuration includes systemd services in the ISO mounted in virtual media. This way, the scripts are part of the discovery RHCOS live ISO and they can be used to load the Assisted Installer (AI) images.
systemd services
- The systemd services are var-mnt.mount and precache-images.service. The precache-images.service depends on the disk partition to be mounted in /var/mnt by the var-mnt.mount unit. The service calls a script called extract-ai.sh.

extract-ai.sh
- The extract-ai.sh script extracts and loads the required images from the disk partition to the local container storage. When the script finishes successfully, you can use the images locally.

agent-fix-bz1964591
- The agent-fix-bz1964591 script is a workaround for an AI issue. To prevent AI from removing the images, which can force the agent.service to pull the images again from the registry, the agent-fix-bz1964591 script checks whether the requested container images exist.
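As a rough illustration of the mechanism, a mount unit that exposes the pre-cache partition at /var/mnt could look like the following. This is a hedged sketch that assumes the partition keeps the data label; it is not the exact unit that GitOps ZTP injects through the Ignition override.

# systemd requires the unit to be named var-mnt.mount to match the /var/mnt mount point
[Unit]
Description=Mount the pre-cache data partition at /var/mnt

[Mount]
What=/dev/disk/by-partlabel/data
Where=/var/mnt
Type=xfs

[Install]
WantedBy=multi-user.target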
22.14.5.2. Understanding the nodes.installerArgs field
The nodes.installerArgs field allows you to configure how the coreos-installer utility writes the RHCOS live ISO to disk. You need to indicate that the disk partition labeled as data must be saved, because the artifacts saved in the data partition are needed during the OpenShift Container Platform installation stage.
The extra parameters are passed directly to the coreos-installer utility that writes the live RHCOS to disk. On the next reboot, the operating system starts from the disk.
You can pass several options to the coreos-installer utility.
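As an illustration, not an exhaustive list, the coreos-installer install options most relevant here are --save-partlabel <glob>, which preserves partitions whose label matches the glob, and --save-partindex <id>, which preserves partitions by index. Preserving the data partition therefore amounts to passing arguments such as the following sketch:

installerArgs: '["--save-partlabel", "data"]'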
22.14.5.3. Understanding the nodes.ignitionConfigOverride field
Similarly to clusters.ignitionConfigOverride, the nodes.ignitionConfigOverride field allows the addition of configurations in Ignition format to the coreos-installer utility, but at the OpenShift Container Platform installation stage. When the RHCOS is written to disk, the extra configuration included in the GitOps ZTP discovery ISO is no longer available. During the discovery stage, the extra configuration is stored in the memory of the live OS.
At this stage, the number of container images extracted and loaded is bigger than in the discovery stage. Depending on the OpenShift Container Platform release and whether you install the Day-2 Operators, the installation time can vary.
At the installation stage, the var-mnt.mount and precache-ocp.services systemd services are used.
precache-ocp.service
- The precache-ocp.service depends on the disk partition to be mounted in /var/mnt by the var-mnt.mount unit. The precache-ocp.service service calls a script called extract-ocp.sh.

Important: To extract all the images before the OpenShift Container Platform installation, you must execute precache-ocp.service before executing the machine-config-daemon-pull.service and nodeip-configuration.service services.

extract-ocp.sh
- The extract-ocp.sh script extracts and loads the required images from the disk partition to the local container storage. When the script finishes successfully, you can use the images locally.
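A hedged sketch of how that ordering constraint can be expressed in a systemd unit follows; the actual service injected through the Ignition override may differ in details such as the script path, which is assumed here.

[Unit]
Description=Extract pre-cached OpenShift Container Platform images
Requires=var-mnt.mount
After=var-mnt.mount
# Run before these services so that the images are already in local container storage
Before=machine-config-daemon-pull.service nodeip-configuration.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/extract-ocp.sh

[Install]
WantedBy=multi-user.target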
When you upload the SiteConfig and the optional PolicyGenTemplate custom resources (CRs) to the Git repository that Argo CD monitors, you can start the GitOps ZTP workflow by syncing the CRs with the hub cluster.
22.14.6. Troubleshooting
22.14.6.1. Rendered catalog is invalid
When you download images by using a local or disconnected registry, you might see the following error: The rendered catalog is invalid. This means that you are missing the certificates of the new registry that you want to pull content from.
The factory-precaching-cli tool image is built on a UBI RHEL image. Certificate paths and locations are the same on RHCOS.
Procedure
Copy the registry certificate into your server:
# cp /tmp/eko4-ca.crt /etc/pki/ca-trust/source/anchors/.

Update the certificates truststore:

# update-ca-trust

Mount the host /etc/pki folder into the factory-cli image:

# podman run -v /mnt:/mnt -v /root/.docker:/root/.docker -v /etc/pki:/etc/pki --privileged -it --rm quay.io/openshift-kni/telco-ran-tools:latest -- \
  factory-precaching-cli download -r 4.14.0 --acm-version 2.5.4 \
  --mce-version 2.0.4 -f /mnt \
  --img quay.io/custom/repository \
  --du-profile -s --skip-imageset
Legal Notice
Copyright © 2025 Red Hat
OpenShift documentation is licensed under the Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0).
Modified versions must remove all Red Hat trademarks.
Portions adapted from https://github.com/kubernetes-incubator/service-catalog/ with modifications by Red Hat.
Red Hat, Red Hat Enterprise Linux, the Red Hat logo, the Shadowman logo, JBoss, OpenShift, Fedora, the Infinity logo, and RHCE are trademarks of Red Hat, Inc., registered in the United States and other countries.
Linux® is the registered trademark of Linus Torvalds in the United States and other countries.
Java® is a registered trademark of Oracle and/or its affiliates.
XFS® is a trademark of Silicon Graphics International Corp. or its subsidiaries in the United States and/or other countries.
MySQL® is a registered trademark of MySQL AB in the United States, the European Union and other countries.
Node.js® is an official trademark of Joyent. Red Hat Software Collections is not formally related to or endorsed by the official Joyent Node.js open source or commercial project.
The OpenStack® Word Mark and OpenStack logo are either registered trademarks/service marks or trademarks/service marks of the OpenStack Foundation, in the United States and other countries and are used with the OpenStack Foundation’s permission. We are not affiliated with, endorsed or sponsored by the OpenStack Foundation, or the OpenStack community.
All other trademarks are the property of their respective owners.