Chapter 15. Performing latency tests for platform verification
You can use the Cloud-native Network Functions (CNF) tests image to run latency tests on a CNF-enabled OpenShift Container Platform cluster, where all the components required for running CNF workloads are installed. Run the latency tests to validate node tuning for your workload.
The cnf-tests container image is available at registry.redhat.io/openshift4/cnf-tests-rhel8:v4.9.
The cnf-tests image also includes several tests that are not supported by Red Hat at this time. Only the latency tests are supported by Red Hat.
15.1. Prerequisites for running latency tests
Your cluster must meet the following requirements before you can run the latency tests:
- You have configured a performance profile with the Performance Addon Operator.
- You have applied all the required CNF configurations in the cluster.
- You have a pre-existing MachineConfigPool CR applied in the cluster. The default worker pool is worker-cnf.
Additional resources
- For more information about creating the cluster performance profile, see Provisioning real-time and low latency workloads.
15.2. About discovery mode for latency tests
Use discovery mode to validate the functionality of a cluster without altering its configuration. Existing environment configurations are used for the tests. The tests can find the configuration items needed and use those items to execute the tests. If resources needed to run a specific test are not found, the test is skipped, providing an appropriate message to the user. After the tests are finished, no cleanup of the pre-configured configuration items is done, and the test environment can be immediately used for another test run.
When running the latency tests, always run the tests with -e DISCOVERY_MODE=true and -ginkgo.focus set to the appropriate latency test. If you do not run the latency tests in discovery mode, your existing live cluster performance profile configuration will be modified by the test run.
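One way to avoid forgetting the discovery-mode flag is to generate the command from a small wrapper. The following is a minimal sketch, not part of the cnf-tests image: latency_cmd is a hypothetical helper name, and it only prints the command so that you can review it before running it.

```shell
#!/bin/sh
# Hypothetical helper (not part of the cnf-tests image): print the podman
# command for one latency test, always with discovery mode enabled.
IMAGE="registry.redhat.io/openshift4/cnf-tests-rhel8:v4.9"

latency_cmd() {
  # $1 is the -ginkgo.focus value, for example hwlatdetect, cyclictest, or oslat.
  printf '%s %s %s\n' \
    "podman run -v \$(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig" \
    "-e LATENCY_TEST_RUN=true -e DISCOVERY_MODE=true" \
    "$IMAGE /usr/bin/test-run.sh -ginkgo.focus=\"$1\""
}

# Print the command for the oslat test.
latency_cmd oslat
```

Because the flag is hard-coded in the helper, every generated command carries -e DISCOVERY_MODE=true regardless of which test is selected.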
Limiting the nodes used during tests
The nodes on which the tests are executed can be limited by specifying a NODES_SELECTOR environment variable, for example, -e NODES_SELECTOR=node-role.kubernetes.io/worker-cnf. Any resources created by the test are limited to nodes with matching labels.
If you want to override the default worker pool, pass the -e ROLE_WORKER_CNF=<custom_worker_pool> variable to the command specifying an appropriate label.
15.3. Measuring latency
The cnf-tests image uses three tools to measure the latency of the system:
- hwlatdetect
- cyclictest
- oslat
Each tool has a specific use. Use the tools in sequence to achieve reliable test results.
- hwlatdetect
- Measures the baseline that the bare-metal hardware can achieve. Before proceeding with the next latency test, ensure that the latency reported by hwlatdetect meets the required threshold because you cannot fix hardware latency spikes by operating system tuning.
- cyclictest
- Verifies the real-time kernel scheduler latency after hwlatdetect passes validation. The cyclictest tool schedules a repeated timer and measures the difference between the desired and the actual trigger times. The difference can uncover basic issues with the tuning caused by interrupts or process priorities. The tool must run on a real-time kernel.
- oslat
- Behaves similarly to a CPU-intensive DPDK application and measures all the interruptions and disruptions to the busy loop that simulates CPU heavy data processing.
The tests introduce the following environment variables:
Environment variables | Description |
---|---|
LATENCY_TEST_DELAY | Specifies the amount of time in seconds after which the test starts running. You can use the variable to allow the CPU manager reconcile loop to update the default CPU pool. The default value is 0. |
LATENCY_TEST_CPUS | Specifies the number of CPUs that the pod running the latency tests uses. If you do not set the variable, the default configuration includes all isolated CPUs. |
LATENCY_TEST_RUNTIME | Specifies the amount of time in seconds that the latency test must run. The default value is 300 seconds. |
HWLATDETECT_MAXIMUM_LATENCY | Specifies the maximum acceptable hardware latency in microseconds for the workload and operating system. If you do not set the value of … |
CYCLICTEST_MAXIMUM_LATENCY | Specifies the maximum latency in microseconds that all threads expect before waking up during the cyclictest run. |
OSLAT_MAXIMUM_LATENCY | Specifies the maximum acceptable latency in microseconds for the oslat test results. |
MAXIMUM_LATENCY | Unified variable that specifies the maximum acceptable latency in microseconds. Applicable for all available latency tools. |
LATENCY_TEST_RUN | Boolean parameter that indicates whether the tests should run. |
Variables that are specific to a latency tool take precedence over unified variables. For example, if OSLAT_MAXIMUM_LATENCY is set to 30 microseconds and MAXIMUM_LATENCY is set to 10 microseconds, the oslat test will run with maximum acceptable latency of 30 microseconds.
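The precedence rule can be pictured with ordinary shell parameter expansion. This is only an illustration of the rule as described above; the test suite resolves the values internally, and effective_latency is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: resolve the threshold that applies to one tool.
# $1 is the tool-specific value (may be empty), $2 is the unified value.
effective_latency() {
  # ${1:-$2} expands to $1 when it is set and non-empty, otherwise to $2.
  echo "${1:-$2}"
}

OSLAT_MAXIMUM_LATENCY=30
MAXIMUM_LATENCY=10

# The tool-specific value wins over the unified value.
effective_latency "$OSLAT_MAXIMUM_LATENCY" "$MAXIMUM_LATENCY"   # prints 30
```

With an empty first argument, the helper falls back to the unified value.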
15.4. Running the latency tests
Run the cluster latency tests to validate node tuning for your Cloud-native Network Functions (CNF) workload.
Always run the latency tests with DISCOVERY_MODE=true set. If you don’t, the test suite will make changes to the running cluster configuration.
When executing podman commands as a non-root or non-privileged user, mounting paths can fail with permission denied errors. To make the podman command work, append :Z to the volume definition; for example, -v $(pwd)/:/kubeconfig:Z. This allows podman to do the proper SELinux relabeling.
Procedure
1. Open a shell prompt in the directory containing the kubeconfig file.
   You provide the test image with a kubeconfig file in the current directory and its related $KUBECONFIG environment variable, mounted through a volume. This allows the running container to use the kubeconfig file from inside the container.
2. Run the latency tests by entering the following command:
        $ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
        -e LATENCY_TEST_RUN=true -e DISCOVERY_MODE=true registry.redhat.io/openshift4/cnf-tests-rhel8:v4.9 \
        /usr/bin/test-run.sh -ginkgo.focus="\[performance\]\ Latency\ Test"
   - Optional: Append -ginkgo.dryRun to run the latency tests in dry-run mode. This is useful for checking which tests the command would run.
   - Optional: Append -ginkgo.v to run the tests with increased verbosity.
3. Optional: To run the latency tests against a specific performance profile, run the following command, substituting appropriate values:
        $ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
        -e LATENCY_TEST_RUN=true -e LATENCY_TEST_RUNTIME=600 -e MAXIMUM_LATENCY=20 \
        -e PERF_TEST_PROFILE=<performance_profile> registry.redhat.io/openshift4/cnf-tests-rhel8:v4.9 \
        /usr/bin/test-run.sh -ginkgo.focus="\[performance\]\ Latency\ Test"
   where:
   - <performance_profile>
   - Is the name of the performance profile you want to run the latency tests against.
Important: For valid latency test results, run the tests for at least 12 hours.
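Because LATENCY_TEST_RUNTIME is expressed in seconds, a 12-hour run needs the value converted. A small sketch of the arithmetic follows; the variable names hours and runtime_seconds are illustrative only.

```shell
#!/bin/sh
# Convert the recommended minimum run length to seconds.
hours=12
runtime_seconds=$((hours * 60 * 60))

# Print the flag to pass to podman run.
echo "-e LATENCY_TEST_RUNTIME=${runtime_seconds}"   # prints -e LATENCY_TEST_RUNTIME=43200
```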
15.4.1. Running hwlatdetect
The hwlatdetect tool is available in the rt-kernel package with a regular subscription of Red Hat Enterprise Linux (RHEL) 8.x.
Always run the latency tests with DISCOVERY_MODE=true set. If you don’t, the test suite will make changes to the running cluster configuration.
When executing podman commands as a non-root or non-privileged user, mounting paths can fail with permission denied errors. To make the podman command work, append :Z to the volume definition; for example, -v $(pwd)/:/kubeconfig:Z. This allows podman to do the proper SELinux relabeling.
Prerequisites
- You have installed the real-time kernel in the cluster.
- You have logged in to registry.redhat.io with your Customer Portal credentials.
Procedure
- To run the hwlatdetect tests, run the following command, substituting variable values as appropriate:
        $ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
        -e LATENCY_TEST_RUN=true -e DISCOVERY_MODE=true -e ROLE_WORKER_CNF=worker-cnf \
        -e LATENCY_TEST_RUNTIME=600 -e MAXIMUM_LATENCY=20 \
        registry.redhat.io/openshift4/cnf-tests-rhel8:v4.9 \
        /usr/bin/test-run.sh -ginkgo.v -ginkgo.focus="hwlatdetect"
  The hwlatdetect test runs for 10 minutes (600 seconds). The test runs successfully when the maximum observed latency is lower than MAXIMUM_LATENCY (20 μs).
  If the results exceed the latency threshold, the test fails.
  Important: For valid results, the test should run for at least 12 hours.
Example failure output
    running /usr/bin/validationsuite -ginkgo.v -ginkgo.focus=hwlatdetect
    I0210 17:08:38.607699 7 request.go:668] Waited for 1.047200253s due to client-side throttling, not priority and fairness, request: GET:https://api.ocp.demo.lab:6443/apis/apps.openshift.io/v1?timeout=32s
    Running Suite: CNF Features e2e validation
    ==========================================
    Random Seed: 1644512917
    Will run 0 of 48 specs
    SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
    Ran 0 of 48 Specs in 0.001 seconds
    SUCCESS! -- 0 Passed | 0 Failed | 0 Pending | 48 Skipped
    PASS
    Discovery mode enabled, skipping setup
    running /usr/bin/cnftests -ginkgo.v -ginkgo.focus=hwlatdetect
    I0210 17:08:41.179269 40 request.go:668] Waited for 1.046001096s due to client-side throttling, not priority and fairness, request: GET:https://api.ocp.demo.lab:6443/apis/storage.k8s.io/v1beta1?timeout=32s
    Running Suite: CNF Features e2e integration tests
    =================================================
    Random Seed: 1644512920
    Will run 1 of 151 specs
    SSSSSSS
    ------------------------------
    [performance] Latency Test with the hwlatdetect image
      should succeed
      /remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:221
    STEP: Waiting two minutes to download the latencyTest image
    STEP: Waiting another two minutes to give enough time for the cluster to move the pod to Succeeded phase
    Feb 10 17:10:56.045: [INFO]: found mcd machine-config-daemon-dzpw7 for node ocp-worker-0.demo.lab
    Feb 10 17:10:56.259: [INFO]: found mcd machine-config-daemon-dzpw7 for node ocp-worker-0.demo.lab
    Feb 10 17:11:56.825: [ERROR]: timed out waiting for the condition
    • Failure [193.903 seconds]
    [performance] Latency Test
    /remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:60
      with the hwlatdetect image
      /remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:213
        should succeed [It]
        /remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:221
        Log file created at: 2022/02/10 17:08:45
        Running on machine: hwlatdetect-cd8b6
        Binary: Built with gc go1.16.6 for linux/amd64
        Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
        I0210 17:08:45.716288 1 node.go:37] Environment information: /proc/cmdline: BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-56fabc639a679b757ebae30e5f01b2ebd38e9fde9ecae91c41be41d3e89b37f8/vmlinuz-4.18.0-305.34.2.rt7.107.el8_4.x86_64 random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ignition.platform.id=qemu ostree=/ostree/boot.0/rhcos/56fabc639a679b757ebae30e5f01b2ebd38e9fde9ecae91c41be41d3e89b37f8/0 root=UUID=56731f4f-f558-46a3-85d3-d1b579683385 rw rootflags=prjquota skew_tick=1 nohz=on rcu_nocbs=3-5 tuned.non_isolcpus=ffffffc7 intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt isolcpus=managed_irq,3-5 systemd.cpu_affinity=0,1,2,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31
        I0210 17:08:45.716782 1 node.go:44] Environment information: kernel version 4.18.0-305.34.2.rt7.107.el8_4.x86_64
        I0210 17:08:45.716861 1 main.go:50] running the hwlatdetect command with arguments [/usr/bin/hwlatdetect --threshold 1 --hardlimit 1 --duration 10 --window 10000000us --width 950000us]
        F0210 17:08:56.815204 1 main.go:53] failed to run hwlatdetect command; out: hwlatdetect: test duration 10 seconds
        detector: tracer
        parameters:
        Latency threshold: 1us
        Sample window: 10000000us
        Sample width: 950000us
        Non-sampling period: 9050000us
        Output File: None
        Starting test
        test finished
        Max Latency: 24us
        Samples recorded: 1
        Samples exceeding threshold: 1
        ts: 1644512927.163556381, inner:20, outer:24
        ; err: exit status 1
        goroutine 1 [running]:
        k8s.io/klog.stacks(0xc000010001, 0xc00012e000, 0x25b, 0x2710)
            /remote-source/app/vendor/k8s.io/klog/klog.go:875 +0xb9
        k8s.io/klog.(*loggingT).output(0x5bed00, 0xc000000003, 0xc0000121c0, 0x53ea81, 0x7, 0x35, 0x0)
            /remote-source/app/vendor/k8s.io/klog/klog.go:829 +0x1b0
        k8s.io/klog.(*loggingT).printf(0x5bed00, 0x3, 0x5082da, 0x33, 0xc000113f58, 0x2, 0x2)
            /remote-source/app/vendor/k8s.io/klog/klog.go:707 +0x153
        k8s.io/klog.Fatalf(...)
            /remote-source/app/vendor/k8s.io/klog/klog.go:1276
        main.main()
            /remote-source/app/cnf-tests/pod-utils/hwlatdetect-runner/main.go:53 +0x897
        goroutine 6 [chan receive]:
        k8s.io/klog.(*loggingT).flushDaemon(0x5bed00)
            /remote-source/app/vendor/k8s.io/klog/klog.go:1010 +0x8b
        created by k8s.io/klog.init.0
            /remote-source/app/vendor/k8s.io/klog/klog.go:411 +0xd8
        goroutine 7 [chan receive]:
        k8s.io/klog/v2.(*loggingT).flushDaemon(0x5bede0)
            /remote-source/app/vendor/k8s.io/klog/v2/klog.go:1169 +0x8b
        created by k8s.io/klog/v2.init.0
            /remote-source/app/vendor/k8s.io/klog/v2/klog.go:420 +0xdf
        Unexpected error:
            <*errors.errorString | 0xc000418ed0>: {
                s: "timed out waiting for the condition",
            }
            timed out waiting for the condition
        occurred
        /remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:433
    ------------------------------
    SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
    JUnit report was created: /junit.xml/cnftests-junit.xml
    Summarizing 1 Failure:
    [Fail] [performance] Latency Test with the hwlatdetect image [It] should succeed
    /remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:433
    Ran 1 of 151 Specs in 222.254 seconds
    FAIL! -- 0 Passed | 1 Failed | 0 Pending | 150 Skipped
    --- FAIL: TestTest (222.45s)
    FAIL
Example hwlatdetect test results
You can capture the following types of results:
- Rough results that are gathered after each run to create a history of impact on any changes made throughout the test.
- The combined set of the rough tests with the best results and configuration settings.
Example of good results
    hwlatdetect: test duration 3600 seconds
    detector: tracer
    parameters:
    Latency threshold: 10us
    Sample window: 1000000us
    Sample width: 950000us
    Non-sampling period: 50000us
    Output File: None
    Starting test
    test finished
    Max Latency: Below threshold
    Samples recorded: 0
The hwlatdetect tool only provides output if the sample exceeds the specified threshold.
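When you keep a history of runs, it can help to pull the summary value out of saved hwlatdetect output. The following is a minimal sketch that assumes the output has been saved with one field per line, as the tool prints it; hwlat_max is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the "Max Latency" value from hwlatdetect output
# read on stdin. A numeric result is printed without the "us" suffix.
hwlat_max() {
  awk -F': *' '/Max Latency:/ { sub(/us$/, "", $2); print $2 }'
}

# Sample lines taken from the failure output in this section.
hwlat_max <<'EOF'
Latency threshold: 1us
Max Latency: 24us
Samples recorded: 1
Samples exceeding threshold: 1
EOF
```

For a run that stays within the threshold, the same helper prints the text Below threshold instead of a number, so a caller should check for that case before comparing values.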
Example of bad results
    hwlatdetect: test duration 3600 seconds
    detector: tracer
    parameters:
    Latency threshold: 10us
    Sample window: 1000000us
    Sample width: 950000us
    Non-sampling period: 50000us
    Output File: None
    Starting test
    ts: 1610542421.275784439, inner:78, outer:81
    ts: 1610542444.330561619, inner:27, outer:28
    ts: 1610542445.332549975, inner:39, outer:38
    ts: 1610542541.568546097, inner:47, outer:32
    ts: 1610542590.681548531, inner:13, outer:17
    ts: 1610543033.818801482, inner:29, outer:30
    ts: 1610543080.938801990, inner:90, outer:76
    ts: 1610543129.065549639, inner:28, outer:39
    ts: 1610543474.859552115, inner:28, outer:35
    ts: 1610543523.973856571, inner:52, outer:49
    ts: 1610543572.089799738, inner:27, outer:30
    ts: 1610543573.091550771, inner:34, outer:28
    ts: 1610543574.093555202, inner:116, outer:63
The output of hwlatdetect shows that multiple samples exceed the threshold. However, the same output can indicate different results based on the following factors:
- The duration of the test
- The number of CPU cores
- The host firmware settings
Before proceeding with the next latency test, ensure that the latency reported by hwlatdetect meets the required threshold. Fixing latencies introduced by hardware might require you to contact the system vendor support.
Not all latency spikes are hardware related. Ensure that you tune the host firmware to meet your workload requirements. For more information, see Setting firmware parameters for system tuning.
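To compare runs before and after a firmware change, you can summarize the per-sample lines that hwlatdetect prints. The sketch below assumes unindented lines in the form ts: <timestamp>, inner:<n>, outer:<n>, as in the examples in this section; hwlat_spikes is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: count the samples and report the largest inner and
# outer values from hwlatdetect "ts:" lines read on stdin.
hwlat_spikes() {
  # Splitting on runs of ":", "," and " " gives:
  # $1=ts $2=<timestamp> $3=inner $4=<n> $5=outer $6=<n>
  awk -F'[:, ]+' '$1 == "ts" && $3 == "inner" && $5 == "outer" {
      if ($4 + 0 > max_inner) max_inner = $4 + 0
      if ($6 + 0 > max_outer) max_outer = $6 + 0
      count++
    }
    END { printf "samples=%d max_inner=%d max_outer=%d\n", count, max_inner, max_outer }'
}

# Three of the sample lines from the bad results shown earlier.
hwlat_spikes <<'EOF'
ts: 1610542421.275784439, inner:78, outer:81
ts: 1610542444.330561619, inner:27, outer:28
ts: 1610543574.093555202, inner:116, outer:63
EOF
```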
15.4.2. Running cyclictest
The cyclictest tool measures the real-time kernel scheduler latency on the specified CPUs.
Always run the latency tests with DISCOVERY_MODE=true set. If you don’t, the test suite will make changes to the running cluster configuration.
When executing podman commands as a non-root or non-privileged user, mounting paths can fail with permission denied errors. To make the podman command work, append :Z to the volume definition; for example, -v $(pwd)/:/kubeconfig:Z. This allows podman to do the proper SELinux relabeling.
Prerequisites
- You have logged in to registry.redhat.io with your Customer Portal credentials.
- You have installed the real-time kernel in the cluster.
- You have applied a cluster performance profile by using the Performance Addon Operator.
Procedure
- To perform the cyclictest, run the following command, substituting variable values as appropriate:
        $ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
        -e LATENCY_TEST_RUN=true -e DISCOVERY_MODE=true -e ROLE_WORKER_CNF=worker-cnf \
        -e LATENCY_TEST_CPUS=10 -e LATENCY_TEST_RUNTIME=600 -e MAXIMUM_LATENCY=20 \
        registry.redhat.io/openshift4/cnf-tests-rhel8:v4.9 \
        /usr/bin/test-run.sh -ginkgo.v -ginkgo.focus="cyclictest"
  The command runs the cyclictest tool for 10 minutes (600 seconds). The test runs successfully when the maximum observed latency is lower than MAXIMUM_LATENCY (in this example, 20 μs). Latency spikes of 20 μs and above are generally not acceptable for telco RAN workloads.
  If the results exceed the latency threshold, the test fails.
  Important: For valid results, the test should run for at least 12 hours.
Example failure output
    Discovery mode enabled, skipping setup
    running /usr/bin//cnftests -ginkgo.v -ginkgo.focus=cyclictest
    I0811 15:02:36.350033 20 request.go:668] Waited for 1.049965918s due to client-side throttling, not priority and fairness, request: GET:https://api.cnfdc8.t5g.lab.eng.bos.redhat.com:6443/apis/machineconfiguration.openshift.io/v1?timeout=32s
    Running Suite: CNF Features e2e integration tests
    =================================================
    Random Seed: 1628694153
    Will run 1 of 138 specs
    SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
    ------------------------------
    [performance] Latency Test with the cyclictest image
      should succeed
      /go/src/github.com/openshift-kni/cnf-features-deploy/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:200
    STEP: Waiting two minutes to download the latencyTest image
    STEP: Waiting another two minutes to give enough time for the cluster to move the pod to Succeeded phase
    Aug 11 15:03:06.826: [INFO]: found mcd machine-config-daemon-wf4w8 for node cnfdc8.clus2.t5g.lab.eng.bos.redhat.com
    • Failure [22.527 seconds]
    [performance] Latency Test
    /go/src/github.com/openshift-kni/cnf-features-deploy/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:84
      with the cyclictest image
      /go/src/github.com/openshift-kni/cnf-features-deploy/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:188
        should succeed [It]
        /go/src/github.com/openshift-kni/cnf-features-deploy/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:200
        The current latency 27 is bigger than the expected one 20
        Expected
            <bool>: false
        to be true
        /go/src/github.com/openshift-kni/cnf-features-deploy/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:219
    Log file created at: 2021/08/11 15:02:51
    Running on machine: cyclictest-knk7d
    Binary: Built with gc go1.16.6 for linux/amd64
    Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
    I0811 15:02:51.092254 1 node.go:37] Environment information: /proc/cmdline: BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-612d89f4519a53ad0b1a132f4add78372661bfb3994f5fe115654971aa58a543/vmlinuz-4.18.0-305.10.2.rt7.83.el8_4.x86_64 ip=dhcp random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ostree=/ostree/boot.1/rhcos/612d89f4519a53ad0b1a132f4add78372661bfb3994f5fe115654971aa58a543/0 ignition.platform.id=openstack root=UUID=5a4ddf16-9372-44d9-ac4e-3ee329e16ab3 rw rootflags=prjquota skew_tick=1 nohz=on rcu_nocbs=1-3 tuned.non_isolcpus=000000ff,ffffffff,ffffffff,fffffff1 intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt isolcpus=managed_irq,1-3 systemd.cpu_affinity=0,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103 default_hugepagesz=1G hugepagesz=2M hugepages=128 nmi_watchdog=0 audit=0 mce=off processor.max_cstate=1 idle=poll intel_idle.max_cstate=0
    I0811 15:02:51.092427 1 node.go:44] Environment information: kernel version 4.18.0-305.10.2.rt7.83.el8_4.x86_64
    I0811 15:02:51.092450 1 main.go:48] running the cyclictest command with arguments \
    [-D 600 -95 1 -t 10 -a 2,4,6,8,10,54,56,58,60,62 -h 30 -i 1000 --quiet]
    I0811 15:03:06.147253 1 main.go:54] succeeded to run the cyclictest command: # /dev/cpu_dma_latency set to 0us
    # Histogram
    000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000
    000001 000000 005561 027778 037704 011987 000000 120755 238981 081847 300186
    000002 587440 581106 564207 554323 577416 590635 474442 357940 513895 296033
    000003 011751 011441 006449 006761 008409 007904 002893 002066 003349 003089
    000004 000527 001079 000914 000712 001451 001120 000779 000283 000350 000251
    More histogram entries ...
    # Min Latencies: 00002 00001 00001 00001 00001 00002 00001 00001 00001 00001
    # Avg Latencies: 00002 00002 00002 00001 00002 00002 00001 00001 00001 00001
    # Max Latencies: 00018 00465 00361 00395 00208 00301 02052 00289 00327 00114
    # Histogram Overflows: 00000 00220 00159 00128 00202 00017 00069 00059 00045 00120
    # Histogram Overflow at cycle number:
    # Thread 0:
    # Thread 1: 01142 01439 05305 … # 00190 others
    # Thread 2: 20895 21351 30624 … # 00129 others
    # Thread 3: 01143 17921 18334 … # 00098 others
    # Thread 4: 30499 30622 31566 ... # 00172 others
    # Thread 5: 145221 170910 171888 ...
    # Thread 6: 01684 26291 30623 ...# 00039 others
    # Thread 7: 28983 92112 167011 … 00029 others
    # Thread 8: 45766 56169 56171 ...# 00015 others
    # Thread 9: 02974 08094 13214 ... # 00090 others
Example cyclictest results
The same output can indicate different results for different workloads. For example, spikes up to 18μs are acceptable for 4G DU workloads, but not for 5G DU workloads.
Example of good results
    running cmd: cyclictest -q -D 10m -p 1 -t 16 -a 2,4,6,8,10,12,14,16,54,56,58,60,62,64,66,68 -h 30 -i 1000 -m
    # Histogram
    000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000
    000001 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000
    000002 579506 535967 418614 573648 532870 529897 489306 558076 582350 585188 583793 223781 532480 569130 472250 576043
    More histogram entries ...
    # Total: 000600000 000600000 000600000 000599999 000599999 000599999 000599998 000599998 000599998 000599997 000599997 000599996 000599996 000599995 000599995 000599995
    # Min Latencies: 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002
    # Avg Latencies: 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002
    # Max Latencies: 00005 00005 00004 00005 00004 00004 00005 00005 00006 00005 00004 00005 00004 00004 00005 00004
    # Histogram Overflows: 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000
    # Histogram Overflow at cycle number:
    # Thread 0:
    # Thread 1:
    # Thread 2:
    # Thread 3:
    # Thread 4:
    # Thread 5:
    # Thread 6:
    # Thread 7:
    # Thread 8:
    # Thread 9:
    # Thread 10:
    # Thread 11:
    # Thread 12:
    # Thread 13:
    # Thread 14:
    # Thread 15:
Example of bad results
    running cmd: cyclictest -q -D 10m -p 1 -t 16 -a 2,4,6,8,10,12,14,16,54,56,58,60,62,64,66,68 -h 30 -i 1000 -m
    # Histogram
    000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000
    000001 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000 000000
    000002 564632 579686 354911 563036 492543 521983 515884 378266 592621 463547 482764 591976 590409 588145 589556 353518
    More histogram entries ...
    # Total: 000599999 000599999 000599999 000599997 000599997 000599998 000599998 000599997 000599997 000599996 000599995 000599996 000599995 000599995 000599995 000599993
    # Min Latencies: 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002
    # Avg Latencies: 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002
    # Max Latencies: 00493 00387 00271 00619 00541 00513 00009 00389 00252 00215 00539 00498 00363 00204 00068 00520
    # Histogram Overflows: 00001 00001 00001 00002 00002 00001 00000 00001 00001 00001 00002 00001 00001 00001 00001 00002
    # Histogram Overflow at cycle number:
    # Thread 0: 155922
    # Thread 1: 110064
    # Thread 2: 110064
    # Thread 3: 110063 155921
    # Thread 4: 110063 155921
    # Thread 5: 155920
    # Thread 6:
    # Thread 7: 110062
    # Thread 8: 110062
    # Thread 9: 155919
    # Thread 10: 110061 155919
    # Thread 11: 155918
    # Thread 12: 155918
    # Thread 13: 110060
    # Thread 14: 110060
    # Thread 15: 110059 155917
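The per-thread maxima are the quickest place to spot a bad run. As a minimal sketch, the following hypothetical helper cyclictest_max reads saved cyclictest output and prints the largest value on the # Max Latencies: line. It assumes the line layout shown in the examples above.

```shell
#!/bin/sh
# Hypothetical helper: print the largest per-thread value from the
# "# Max Latencies:" line of cyclictest output read on stdin.
cyclictest_max() {
  awk '/^# Max Latencies:/ {
         max = 0
         # Fields 1-3 are "#", "Max", "Latencies:"; the values start at field 4.
         # Adding 0 converts zero-padded strings such as 00619 to decimal numbers.
         for (i = 4; i <= NF; i++) if ($i + 0 > max) max = $i + 0
         print max
       }'
}

# The line from the bad results example above.
cyclictest_max <<'EOF'
# Max Latencies: 00493 00387 00271 00619 00541 00513 00009 00389 00252 00215 00539 00498 00363 00204 00068 00520
EOF
```

The conversion is done in awk rather than in shell arithmetic because shell arithmetic would interpret the zero-padded values as octal.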
15.4.3. Running oslat
The oslat test simulates a CPU-intensive DPDK application and measures all the interruptions and disruptions to test how the cluster handles CPU heavy data processing.
Always run the latency tests with DISCOVERY_MODE=true set. If you don’t, the test suite will make changes to the running cluster configuration.
When executing podman commands as a non-root or non-privileged user, mounting paths can fail with permission denied errors. To make the podman command work, append :Z to the volume definition; for example, -v $(pwd)/:/kubeconfig:Z. This allows podman to do the proper SELinux relabeling.
Prerequisites
- You have logged in to registry.redhat.io with your Customer Portal credentials.
- You have applied a cluster performance profile by using the Performance Addon Operator.
Procedure
- To perform the oslat test, run the following command, substituting variable values as appropriate:
        $ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
        -e LATENCY_TEST_RUN=true -e DISCOVERY_MODE=true -e ROLE_WORKER_CNF=worker-cnf \
        -e LATENCY_TEST_CPUS=7 -e LATENCY_TEST_RUNTIME=600 -e MAXIMUM_LATENCY=20 \
        registry.redhat.io/openshift4/cnf-tests-rhel8:v4.9 \
        /usr/bin/test-run.sh -ginkgo.v -ginkgo.focus="oslat"
  LATENCY_TEST_CPUS specifies the number of CPUs to test with the oslat command.
  The command runs the oslat tool for 10 minutes (600 seconds). The test runs successfully when the maximum observed latency is lower than MAXIMUM_LATENCY (20 μs).
  If the results exceed the latency threshold, the test fails.
  Important: For valid results, the test should run for at least 12 hours.
Example failure output
running /usr/bin//validationsuite -ginkgo.v -ginkgo.focus=oslat I0829 12:36:55.386776 8 request.go:668] Waited for 1.000303471s due to client-side throttling, not priority and fairness, request: GET:https://api.cnfdc8.t5g.lab.eng.bos.redhat.com:6443/apis/authentication.k8s.io/v1?timeout=32s Running Suite: CNF Features e2e validation ========================================== Discovery mode enabled, skipping setup running /usr/bin//cnftests -ginkgo.v -ginkgo.focus=oslat I0829 12:37:01.219077 20 request.go:668] Waited for 1.050010755s due to client-side throttling, not priority and fairness, request: GET:https://api.cnfdc8.t5g.lab.eng.bos.redhat.com:6443/apis/snapshot.storage.k8s.io/v1beta1?timeout=32s Running Suite: CNF Features e2e integration tests ================================================= Random Seed: 1630240617 Will run 1 of 142 specs SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS ------------------------------ [performance] Latency Test with the oslat image should succeed /go/src/github.com/openshift-kni/cnf-features-deploy/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:134 STEP: Waiting two minutes to download the latencyTest image STEP: Waiting another two minutes to give enough time for the cluster to move the pod to Succeeded phase Aug 29 12:37:59.324: [INFO]: found mcd machine-config-daemon-wf4w8 for node cnfdc8.clus2.t5g.lab.eng.bos.redhat.com • Failure [49.246 seconds] [performance] Latency Test /go/src/github.com/openshift-kni/cnf-features-deploy/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:59 with the oslat image /go/src/github.com/openshift-kni/cnf-features-deploy/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:112 should succeed [It] /go/src/github.com/openshift-kni/cnf-features-deploy/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:134 The current latency 27 is 
bigger than the expected one 20 1 Expected <bool>: false to be true /go/src/github.com/openshift-kni/cnf-features-deploy/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:168 Log file created at: 2021/08/29 13:25:21 Running on machine: oslat-57c2g Binary: Built with gc go1.16.6 for linux/amd64 Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg I0829 13:25:21.569182 1 node.go:37] Environment information: /proc/cmdline: BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-612d89f4519a53ad0b1a132f4add78372661bfb3994f5fe115654971aa58a543/vmlinuz-4.18.0-305.10.2.rt7.83.el8_4.x86_64 ip=dhcp random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ostree=/ostree/boot.0/rhcos/612d89f4519a53ad0b1a132f4add78372661bfb3994f5fe115654971aa58a543/0 ignition.platform.id=openstack root=UUID=5a4ddf16-9372-44d9-ac4e-3ee329e16ab3 rw rootflags=prjquota skew_tick=1 nohz=on rcu_nocbs=1-3 tuned.non_isolcpus=000000ff,ffffffff,ffffffff,fffffff1 intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt isolcpus=managed_irq,1-3 systemd.cpu_affinity=0,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103 default_hugepagesz=1G hugepagesz=2M hugepages=128 nmi_watchdog=0 audit=0 mce=off processor.max_cstate=1 idle=poll intel_idle.max_cstate=0 I0829 13:25:21.569345 1 node.go:44] Environment information: kernel version 4.18.0-305.10.2.rt7.83.el8_4.x86_64 I0829 13:25:21.569367 1 main.go:53] Running the oslat command with arguments \ [--duration 600 --rtprio 1 --cpu-list 4,6,52,54,56,58 --cpu-main-thread 2] I0829 13:35:22.632263 1 main.go:59] Succeeded to run the oslat command: oslat V 2.00 Total runtime: 600 seconds Thread priority: SCHED_FIFO:1 CPU list: 4,6,52,54,56,58 CPU for main thread: 2 Workload: no 
Workload mem: 0 (KiB) Preheat cores: 6 Pre-heat for 1 seconds... Test starts... Test completed. Core: 4 6 52 54 56 58 CPU Freq: 2096 2096 2096 2096 2096 2096 (Mhz) 001 (us): 19390720316 19141129810 20265099129 20280959461 19391991159 19119877333 002 (us): 5304 5249 5777 5947 6829 4971 003 (us): 28 14 434 47 208 21 004 (us): 1388 853 123568 152817 5576 0 005 (us): 207850 223544 103827 91812 227236 231563 006 (us): 60770 122038 277581 323120 122633 122357 007 (us): 280023 223992 63016 25896 214194 218395 008 (us): 40604 25152 24368 4264 24440 25115 009 (us): 6858 3065 5815 810 3286 2116 010 (us): 1947 936 1452 151 474 361 ... Minimum: 1 1 1 1 1 1 (us) Average: 1.000 1.000 1.000 1.000 1.000 1.000 (us) Maximum: 37 38 49 28 28 19 (us) Max-Min: 36 37 48 27 27 18 (us) Duration: 599.667 599.667 599.667 599.667 599.667 599.667 (sec)
- 1
- In this example, the measured latency is outside the maximum allowed value.
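When you review a saved copy of the test output, it can help to pull the worst per-core value out of the oslat Maximum line and compare it with the limit you expect. The following is a minimal sketch, assuming the oslat output is stored with one field per line, as oslat prints it. The function names oslat_worst_max and oslat_within_limit are hypothetical helpers and are not part of the cnf-tests image.

```shell
#!/bin/sh
# Hypothetical helpers for post-processing saved oslat output.

# Print the largest per-core value on the "Maximum:" line read from stdin.
oslat_worst_max() {
  awk '
    $1 == "Maximum:" {
      for (i = 2; i <= NF; i++) {
        # Skip the trailing unit field "(us)"; keep whole-number values only.
        if ($i ~ /^[0-9]+$/ && $i + 0 > worst) worst = $i + 0
      }
    }
    END { print worst + 0 }
  '
}

# Succeed (exit 0) only if the worst value is within the limit given in $1.
oslat_within_limit() {
  limit=$1
  worst=$(oslat_worst_max)
  [ "$worst" -le "$limit" ]
}

# Usage with the Maximum line from the example output above:
printf 'Maximum: 37 38 49 28 28 19 (us)\n' | oslat_worst_max   # prints 49
```

With the values from the example, a limit of 20 microseconds is exceeded, so oslat_within_limit 20 returns a non-zero exit status.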
15.5. Generating a latency test failure report
Use the following procedure to generate a test failure report for the latency tests.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in as a user with cluster-admin privileges.
Procedure
Create a test failure report with information about the cluster state and resources for troubleshooting by passing the --report parameter with the path to where the report is dumped:
$ podman run -v $(pwd)/:/kubeconfig:Z -v $(pwd)/reportdest:<report_folder_path> \
    -e KUBECONFIG=/kubeconfig/kubeconfig -e DISCOVERY_MODE=true \
    registry.redhat.io/openshift4/cnf-tests-rhel8:v4.9 \
    /usr/bin/test-run.sh --report <report_folder_path> \
    -ginkgo.focus="\[performance\]\ Latency\ Test"
where:
- <report_folder_path>
- Is the path to the folder where the report is generated.
15.6. Generating a JUnit latency test report
Use the following procedure to generate a JUnit-compliant report of the latency test results.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in as a user with cluster-admin privileges.
Procedure
Create a JUnit-compliant XML report by passing the --junit parameter together with the path to where the report is dumped:
$ podman run -v $(pwd)/:/kubeconfig:Z -v $(pwd)/junitdest:<junit_folder_path> \
    -e KUBECONFIG=/kubeconfig/kubeconfig -e DISCOVERY_MODE=true \
    registry.redhat.io/openshift4/cnf-tests-rhel8:v4.9 \
    /usr/bin/test-run.sh --junit <junit_folder_path> \
    -ginkgo.focus="\[performance\]\ Latency\ Test"
where:
- <junit_folder_path>
- Is the path to the folder where the JUnit report is generated.
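After the run, a JUnit XML file can be checked from a script, for example in a CI job. The following is a minimal sketch that counts failure elements with grep, assuming the standard JUnit XML layout in which each failed test case contains a failure element. The function name count_junit_failures is hypothetical, and the name of the XML file written by the test suite is not stated here, so the path in the usage comment is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: count <failure> elements in a JUnit XML file.
# $1: path to the JUnit XML report.
count_junit_failures() {
  # -o prints each match on its own line, so several failures on one
  # physical line are still counted separately. Closing </failure> tags
  # and the failures="N" attribute do not match this pattern.
  grep -o '<failure[ >]' "$1" | wc -l | tr -d '[:space:]'
}

# Usage (placeholder path, adjust to the file found in your junitdest folder):
#   count_junit_failures ./junitdest/<report_file>.xml
```

A simple text match keeps the sketch free of extra dependencies. For anything beyond counting, an XML-aware tool is the safer choice.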
15.7. Running latency tests on a single-node OpenShift cluster
You can run latency tests on single-node OpenShift clusters.
Always run the latency tests with DISCOVERY_MODE=true set. If you do not, the test suite changes the running cluster configuration.
When you run podman commands as a non-root or non-privileged user, mounting paths can fail with permission denied errors. To make the podman command work, append :Z to the volume definition; for example, -v $(pwd)/:/kubeconfig:Z. This allows podman to do the proper SELinux relabeling.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in as a user with cluster-admin privileges.
Procedure
To run the latency tests on a single-node OpenShift cluster, run the following command:
$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
    -e DISCOVERY_MODE=true -e ROLE_WORKER_CNF=master \
    registry.redhat.io/openshift4/cnf-tests-rhel8:v4.9 \
    /usr/bin/test-run.sh -ginkgo.focus="\[performance\]\ Latency\ Test"
Note
ROLE_WORKER_CNF=master is required because master is the only machine pool to which the node belongs. For more information about setting the required MachineConfigPool for the latency tests, see "Prerequisites for running latency tests".
After running the test suite, all the dangling resources are cleaned up.
15.8. Running latency tests in a disconnected cluster
The CNF tests image can run tests in a disconnected cluster that is not able to reach external registries. This requires two steps:
- Mirroring the cnf-tests image to the custom disconnected registry.
- Instructing the tests to consume the images from the custom disconnected registry.
Mirroring the images to a custom registry accessible from the cluster
A mirror executable is shipped in the image to provide the input required by oc to mirror the test image to a local registry.
Run this command from an intermediate machine that has access to the cluster and registry.redhat.io:
$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
    registry.redhat.io/openshift4/cnf-tests-rhel8:v4.9 \
    /usr/bin/mirror -registry <disconnected_registry> | oc image mirror -f -
where:
- <disconnected_registry>
- Is the disconnected mirror registry you have configured, for example, my.local.registry:5000/.
After you have mirrored the cnf-tests image into the disconnected registry, you must override the original registry used to fetch the images when running the tests, for example:
$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
    -e DISCOVERY_MODE=true -e IMAGE_REGISTRY="<disconnected_registry>" \
    -e CNF_TESTS_IMAGE="cnf-tests-rhel8:v4.9" \
    <disconnected_registry>/cnf-tests-rhel8:v4.9 \
    /usr/bin/test-run.sh -ginkgo.focus="\[performance\]\ Latency\ Test"
Configuring the tests to consume images from a custom registry
You can run the latency tests with a custom test image and image registry by setting the CNF_TESTS_IMAGE and IMAGE_REGISTRY variables.
To configure the latency tests to use a custom test image and image registry, run the following command:
$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
    -e IMAGE_REGISTRY="<custom_image_registry>" \
    -e CNF_TESTS_IMAGE="<custom_cnf-tests_image>" \
    registry.redhat.io/openshift4/cnf-tests-rhel8:v4.9 /usr/bin/test-run.sh
where:
- <custom_image_registry>
- Is the custom image registry, for example, custom.registry:5000/.
- <custom_cnf-tests_image>
- Is the custom cnf-tests image, for example, custom-cnf-tests-image:latest.
Mirroring images to the cluster internal registry
OpenShift Container Platform provides a built-in container image registry, which runs as a standard workload on the cluster.
Procedure
Gain external access to the registry by exposing it with a route:
$ oc patch configs.imageregistry.operator.openshift.io/cluster --patch '{"spec":{"defaultRoute":true}}' --type=merge
Fetch the registry endpoint by running the following command:
$ REGISTRY=$(oc get route default-route -n openshift-image-registry --template='{{ .spec.host }}')
Create a namespace for exposing the images:
$ oc create ns cnftests
Make the image stream available to all the namespaces used for tests. This is required to allow the tests namespaces to fetch the images from the cnf-tests image stream. Run the following commands:
$ oc policy add-role-to-user system:image-puller system:serviceaccount:cnf-features-testing:default --namespace=cnftests
$ oc policy add-role-to-user system:image-puller system:serviceaccount:performance-addon-operators-testing:default --namespace=cnftests
Retrieve the docker secret name and auth token by running the following commands:
$ SECRET=$(oc -n cnftests get secret | grep builder-docker | awk '{print $1}')
$ TOKEN=$(oc -n cnftests get secret $SECRET -o jsonpath="{.data['\.dockercfg']}" | base64 --decode | jq '.["image-registry.openshift-image-registry.svc:5000"].auth')
Create a dockerauth.json file, for example:
$ echo "{\"auths\": { \"$REGISTRY\": { \"auth\": $TOKEN } }}" > dockerauth.json
Do the image mirroring:
$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
    registry.redhat.io/openshift4/cnf-tests-rhel8:v4.9 \
    /usr/bin/mirror -registry $REGISTRY/cnftests | oc image mirror --insecure=true \
    -a=$(pwd)/dockerauth.json -f -
Run the tests:
$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
    -e DISCOVERY_MODE=true -e IMAGE_REGISTRY=image-registry.openshift-image-registry.svc:5000/cnftests \
    cnf-tests-local:latest /usr/bin/test-run.sh -ginkgo.focus="\[performance\]\ Latency\ Test"
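In the procedure above, the echo command that creates dockerauth.json inserts $TOKEN without quotation marks. That works because jq, when run without the -r option, prints the auth value as a JSON string that already includes its quotation marks. The following is a minimal sketch of an alternative that takes a raw, unquoted token (for example, one extracted with jq -r) and adds the quotation marks itself. The function name write_dockerauth is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: write a registry auth file in the {"auths": ...} layout.
# $1: registry host, for example the value of $REGISTRY
# $2: raw base64 auth token WITHOUT surrounding quotation marks
# $3: output file path
write_dockerauth() {
  printf '{"auths": {"%s": {"auth": "%s"}}}\n' "$1" "$2" > "$3"
}

# Usage, assuming TOKEN was extracted with "jq -r" so that it is unquoted:
#   write_dockerauth "$REGISTRY" "$TOKEN" dockerauth.json
```

Quoting in one place only makes it easier to see why the resulting file is valid JSON, whichever way the token was extracted.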
Mirroring a different set of test images
You can optionally change the default upstream images that are mirrored for the latency tests.
Procedure
The mirror command tries to mirror the upstream images by default. This can be overridden by passing a file with the following format to the image:
[
    {
        "registry": "public.registry.io:5000",
        "image": "imageforcnftests:4.9"
    }
]
Pass the file to the mirror command, for example by saving it locally as images.json. With the following command, the local path is mounted in /kubeconfig inside the container, so the file can be passed to the mirror command:
$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
    registry.redhat.io/openshift4/cnf-tests-rhel8:v4.9 /usr/bin/mirror \
    --registry "my.local.registry:5000/" --images "/kubeconfig/images.json" \
    | oc image mirror -f -
15.9. Troubleshooting errors with the cnf-tests container
To run latency tests, the cluster must be accessible from within the cnf-tests container.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in as a user with cluster-admin privileges.
Procedure
Verify that the cluster is accessible from inside the cnf-tests container by running the following command:
$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
    registry.redhat.io/openshift4/cnf-tests-rhel8:v4.9 \
    oc get nodes
If this command does not work, the container probably cannot reach the cluster API. Possible causes include DNS resolution problems, an MTU size mismatch, or firewall rules that block access.
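As a first step when checking DNS, you can extract the API server host name from the kubeconfig file and try to resolve it. The following is a minimal sketch, assuming the kubeconfig contains a line of the form server: https://<host>:<port>. The function name api_host_from_kubeconfig is hypothetical, and the sketch does not handle IPv6 literal addresses.

```shell
#!/bin/sh
# Hypothetical helper: print the host part of the first "server:" URL
# found in the kubeconfig file given as $1.
api_host_from_kubeconfig() {
  sed -n 's|^ *server: *https\{0,1\}://\([^:/]*\).*|\1|p' "$1" | head -n 1
}

# Usage: resolve the API host from the machine where podman runs.
#   host=$(api_host_from_kubeconfig ./kubeconfig)
#   getent hosts "$host" || echo "cannot resolve $host"
```

A successful lookup on the host does not prove that the container can resolve the same name, but a failed lookup narrows the problem to DNS before you look at MTU size or firewall rules.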