Chapter 11. High availability for pod-level bonds on SR-IOV networks
For workloads using pod-level bonding with SR-IOV virtual functions (VFs), despite an upstream switch failure, an underlying physical function (PF) might still report an up state. This creates a silent failure, as attached VFs remain up and pods continue to send traffic to a dead endpoint, causing packet loss.
The PF Status Relay Operator solves this issue by using Link Aggregation Control Protocol (LACP) as an active health check. In this configuration, each physical function (PF) is placed in its own single-member LACP bond with the upstream switch. When the Operator detects an LACP failure on a PF’s bond, it changes the link state of the attached VFs from auto to disabled. This action triggers the pod’s active-backup bond to fail over to its backup network path, maintaining high availability.
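The relay decision described above can be pictured with a short shell sketch. This is an illustration only and not the Operator's implementation: the `vf_state_cmd` helper is hypothetical, and it only prints the iproute2 command that would set the same VF link state by hand.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Operator): map the LACP state of a PF
# to the VF link state described above and print the equivalent iproute2 command.
vf_state_cmd() {
  pf="$1"     # PF interface name, for example ens5f0
  vf="$2"     # VF index on that PF
  lacp="$3"   # observed LACP state: "up" or anything else
  if [ "$lacp" = "up" ]; then
    state=auto      # the VF follows the PF link state again
  else
    state=disable   # force the VF link down so the pod-level bond fails over
  fi
  printf 'ip link set dev %s vf %s state %s\n' "$pf" "$vf" "$state"
}

# LACP failed on ens5f0, so VF 0 would be disabled.
vf_state_cmd ens5f0 0 down
```

Printing the command instead of running it keeps the sketch safe to try on any host; changing a VF link state for real requires root privileges on the node.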
Configuring LACP state monitoring for SR-IOV networks is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
11.1. Installing the PF Status Relay Operator using the CLI
Install the PF Status Relay Operator to enable OpenShift Container Platform to use Link Aggregation Control Protocol (LACP) as an active health check on physical functions (PFs).
Prerequisites
- You configured LACP on your upstream switch.
- You configured pod-level bonding for your SR-IOV networks.
- You installed the OpenShift CLI (oc).
- You have cluster-admin privileges.
Procedure
Create the openshift-pf-status-relay-operator namespace by entering the following command:
Create an OperatorGroup custom resource (CR) by entering the following command:
Create a Subscription CR for the PF Status Relay Operator by entering the following command:
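The OperatorGroup and Subscription CRs from the preceding steps might look like the following sketch, which writes both manifests to files. Only the namespace comes from this chapter. The metadata names, the channel, the package name, and the catalog source are assumptions and might differ in your cluster's catalog.

```shell
# Sketch only: the metadata names, channel, package name, and catalog source
# below are assumptions; confirm them against your cluster before applying.
cat > pf-status-relay-operatorgroup.yaml <<'EOF'
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: pf-status-relay-operators
  namespace: openshift-pf-status-relay-operator
spec:
  targetNamespaces:
  - openshift-pf-status-relay-operator
EOF

cat > pf-status-relay-subscription.yaml <<'EOF'
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: pf-status-relay-operator-subscription
  namespace: openshift-pf-status-relay-operator
spec:
  channel: stable
  name: pf-status-relay-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
```

After you create the namespace, for example with `oc create namespace openshift-pf-status-relay-operator`, you can apply each file with `oc apply -f <file_name>`. To confirm the package name and channel, you can list the available packages with `oc get packagemanifests -n openshift-marketplace`.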
Verification
To verify that the Operator is installed, enter the following command and then check that the output shows Succeeded for the Operator:
$ oc get csv -n openshift-pf-status-relay-operator -o custom-columns=Name:.metadata.name,Phase:.status.phase
11.2. Installing the PF Status Relay Operator using the web console
Install the PF Status Relay Operator to enable OpenShift Container Platform to use Link Aggregation Control Protocol (LACP) as an active health check on physical functions (PFs).
Prerequisites
- You configured LACP on your upstream switch.
- You configured pod-level bonding for your SR-IOV networks.
- You have cluster-admin privileges.
Procedure
Install the PF Status Relay Operator:
- In the OpenShift Container Platform web console, click Ecosystem → Software Catalog.
- Select PF Status Relay Operator from the list of available Operators, and then click Install.
- On the Install Operator page, under Installed Namespace, select Operator recommended Namespace.
- Click Install.
Verification
- Verify that the PF Status Relay Operator shows the Status as Succeeded on the Installed Operators dashboard.
11.3. Configuring the PF Status Relay Operator for LACP state monitoring on SR-IOV networks
Use the PF Status Relay Operator to enable Link Aggregation Control Protocol (LACP) state monitoring for workloads that use pod-level bonding with SR-IOV networks. The Operator monitors the LACP state on physical functions (PFs) and changes the link state of the attached virtual functions (VFs) when it detects an upstream failure. Because the failure becomes visible on the VFs attached to a PF, the pod-level bond fails over to a backup network path in a timely manner, which maintains high availability for your workloads.
The following scenario demonstrates how to configure and verify LACP state monitoring for SR-IOV networks:
- Create host-level NIC bonds on worker nodes and configure LACP.
- Define SR-IOV network policies to create virtual functions (VFs) on the bonded interfaces.
- Deploy the PF Status Relay Operator to monitor the LACP state on the PFs.
- Verify that pods using these VFs automatically fail over to a backup network path in case of upstream switch failure.
This scenario uses SR-IOV network cards with two ports on each node, worker-0 and worker-1, with both ports connected to a shared switch to support LACP bonding.
Prerequisites
- Nodes must have a NIC that supports SR-IOV.
- The SR-IOV Network Operator is installed.
- The PF Status Relay Operator is installed.
- The physical switch ports connected to the worker nodes are configured for LACP with a fast polling rate.
- The linkState is set to auto or disable for the SR-IOV VFs that you want to monitor. The Operator ignores VFs with the linkState set to enable. The default value for SR-IOV VFs is linkState: auto.
Procedure
Create the project namespace by creating a namespace.yaml file such as the following example:
Example namespace.yaml file
The namespace that this file defines is where you deploy the high-availability pod.
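A minimal namespace.yaml file might look like the following sketch. The sriov-operator-tests name is the namespace that later steps in this procedure use. The manifest is otherwise a plain Kubernetes Namespace with no extra labels, which is an assumption about what your cluster requires.

```shell
# Write a minimal Namespace manifest to namespace.yaml.
cat > namespace.yaml <<'EOF'
apiVersion: v1
kind: Namespace
metadata:
  name: sriov-operator-tests   # namespace used by the later steps in this procedure
EOF
```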
Apply the namespace by running the following command:
$ oc apply -f namespace.yaml
Configure host-level LACP bonds:
- Create a YAML file that defines the NodeNetworkConfigurationPolicy resource for the ens5f0 interface on the worker-0 node:
Example nncpBondF0Worker0.yaml file
- Create a YAML file that defines the NodeNetworkConfigurationPolicy resource for the ens5f1 interface on the worker-0 node:
Example nncpBondF1Worker0.yaml file
- Apply the resources by running the following commands:
$ oc apply -f nncpBondF0Worker0.yaml
$ oc apply -f nncpBondF1Worker0.yaml
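The two NodeNetworkConfigurationPolicy files from this step might look like the following sketch, which generates one single-member LACP (802.3ad) bond per port. The policy names, the bond-f0 and bond-f1 interface names, and the miimon value are assumptions chosen for illustration. The fast LACP rate mirrors the switch prerequisite.

```shell
# Sketch only: generate one NodeNetworkConfigurationPolicy per PF port.
# Policy names, bond names, and the miimon value are assumptions.
for port in 0 1; do
  cat > "nncpBondF${port}Worker0.yaml" <<EOF
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: bond-f${port}-worker-0
spec:
  nodeSelector:
    kubernetes.io/hostname: worker-0
  desiredState:
    interfaces:
    - name: bond-f${port}
      type: bond
      state: up
      link-aggregation:
        mode: 802.3ad          # LACP
        options:
          miimon: "100"
          lacp_rate: fast
        port:
        - ens5f${port}         # single-member bond: one PF per bond
EOF
done
```

Keeping each PF in its own bond means that an LACP failure on one port does not affect the state reported for the other port.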
Create SR-IOV network VFs for the bonded interfaces:
- Create a YAML file that defines the SriovNetworkNodePolicy resource for the ens5f0 interface on the worker-0 node:
Example sriovnetworkpolicy-port1.yaml file
- Create a YAML file that defines the SriovNetworkNodePolicy resource for the ens5f1 interface on the worker-0 node:
Example sriovnetworkpolicy-port2.yaml file
- Apply the resources by running the following commands:
$ oc apply -f sriovnetworkpolicy-port1.yaml
$ oc apply -f sriovnetworkpolicy-port2.yaml
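The two SriovNetworkNodePolicy files from this step might look like the following sketch. The policy names, the sriov_port1 and sriov_port2 resource names, the number of VFs, and the netdevice device type are assumptions. The sketch also assumes that the SR-IOV Network Operator runs in the openshift-sriov-network-operator namespace.

```shell
# Sketch only: generate one SriovNetworkNodePolicy per PF port.
# Policy names, resource names, and numVfs are assumptions.
for n in 1 2; do
  pf="ens5f$((n - 1))"   # port1 maps to ens5f0, port2 maps to ens5f1
  cat > "sriovnetworkpolicy-port${n}.yaml" <<EOF
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: sriov-policy-port${n}
  namespace: openshift-sriov-network-operator
spec:
  resourceName: sriov_port${n}
  nodeSelector:
    kubernetes.io/hostname: worker-0
  numVfs: 4
  nicSelector:
    pfNames:
    - ${pf}
  deviceType: netdevice
EOF
done
```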
Configure the PF Status Relay Operator:
- Create a YAML file that defines the PFLACPMonitor resource. This example file configures the Operator to monitor the LACP status of the ens5f0 and ens5f1 bonded interfaces on the worker-0 node:
Example pflacpmonitor.yaml file
Important: Use only one PFLACPMonitor custom resource to monitor each network interface on a node. If you create multiple resources that target the same interface, the PF Status Relay Operator will not process the conflicting configurations.
- Apply the PFLACPMonitor resource by running the following command:
$ oc apply -f pflacpmonitor.yaml
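A pflacpmonitor.yaml file for this scenario might look like the following sketch. This chapter names only the PFLACPMonitor kind, the monitored interfaces, and the node. The API group and version, the resource name, and every field under spec are assumptions that you should verify against the CRD installed in your cluster.

```shell
# Sketch only: apiVersion and all spec field names are assumptions.
# Verify them against the installed CRD before applying this file.
cat > pflacpmonitor.yaml <<'EOF'
apiVersion: pfstatusrelay.openshift.io/v1alpha1   # assumed API group and version
kind: PFLACPMonitor
metadata:
  name: pflacpmonitor
  namespace: openshift-pf-status-relay-operator
spec:
  interfaces:            # assumed field: PFs whose LACP state is monitored
  - ens5f0
  - ens5f1
  pollingInterval: 1000  # assumed field: polling interval in milliseconds
  nodeSelector:          # assumed field: restricts monitoring to worker-0
    matchLabels:
      kubernetes.io/hostname: worker-0
EOF
```

To check the actual schema, you can run `oc explain pflacpmonitor.spec` after the Operator is installed.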
Verification
Check the logs of the PF Status Relay Operator to verify that it is monitoring the LACP state:
$ oc logs -n openshift-pf-status-relay-operator <pf_status_relay_operator_pod_name>
Example output
{"time":"2025-07-24T13:35:54.653201692Z","level":"INFO","msg":"lacp is up","interface":"ens5f0"}
{"time":"2025-07-24T13:35:54.65347273Z","level":"INFO","msg":"vf link state was set","id":0,"state":"auto","interface":"ens5f0"}
...
Apply the SriovNetwork resources to make the VFs available for use within the sriov-operator-tests namespace:
- Create a YAML file that defines the SriovNetwork resource for the VFs created on ens5f0:
Example sriovnetwork-port1.yaml file
- Create a YAML file that defines the SriovNetwork resource for the VFs created on ens5f1:
Example sriovnetwork-port2.yaml file
- Apply the resources by running the following commands:
$ oc apply -f sriovnetwork-port1.yaml
$ oc apply -f sriovnetwork-port2.yaml
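The two SriovNetwork files from this step might look like the following sketch. The network names and the resource names are assumptions, and each resourceName value must match the resource name in your corresponding SriovNetworkNodePolicy resource. The linkState value is left at auto so that the Operator does not ignore the VFs.

```shell
# Sketch only: generate one SriovNetwork per port.
# Network names and resource names are assumptions.
for n in 1 2; do
  cat > "sriovnetwork-port${n}.yaml" <<EOF
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: sriov-network-port${n}
  namespace: openshift-sriov-network-operator
spec:
  resourceName: sriov_port${n}            # must match your node policy
  networkNamespace: sriov-operator-tests  # namespace where the pod runs
  linkState: auto                         # VFs set to "enable" are ignored
EOF
done
```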
Define a high-availability pod that uses the SR-IOV VFs:
- Create a YAML file that defines the NetworkAttachmentDefinition resource to create an active-backup bond using the two SR-IOV networks:
Example nad-bond.yaml file
  - linksInContainer: true creates the bond inside the pod’s network namespace.
  - mode: active-backup configures the bond to use active-backup mode.
  - links specifies the pod-level interfaces to include in the bond.
Important: The PF Status Relay Operator provides LACP state monitoring for pod-level bonding with the mode: active-backup configuration only.
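A nad-bond.yaml file that uses the three settings described above might look like the following sketch. The bond-net1 name, the miimon, mtu, and failOverMac values, and the static address from the 192.0.2.0/24 documentation range are assumptions for illustration.

```shell
# Sketch only: the attachment name, timing values, and IP address are assumptions.
# linksInContainer, mode, and links are the settings described in this step.
cat > nad-bond.yaml <<'EOF'
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: bond-net1
  namespace: sriov-operator-tests
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "bond",
      "name": "bond-net1",
      "mode": "active-backup",
      "failOverMac": 1,
      "linksInContainer": true,
      "miimon": "100",
      "mtu": 1500,
      "links": [
        {"name": "net1"},
        {"name": "net2"}
      ],
      "ipam": {
        "type": "static",
        "addresses": [{"address": "192.0.2.10/24"}]
      }
    }
EOF
```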
- Apply the NetworkAttachmentDefinition resource by running the following command:
$ oc apply -f nad-bond.yaml
- Create a YAML file that defines the Pod resource that uses the VFs from the bonded interfaces in active-backup mode:
Example client-bond.yaml file
The pod annotation requests three networks: two SR-IOV VFs, net1 and net2, and one bond, bond0, which uses them.
- Apply the Pod resource by running the following command:
$ oc apply -f client-bond.yaml
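A client-bond.yaml file for this step might look like the following sketch. The pod name, the namespace, and the net1, net2, and bond0 interface names come from this procedure. The network attachment names, the container name, the image, and the command are assumptions.

```shell
# Sketch only: attachment names, container name, image, and command are assumptions.
cat > client-bond.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: client-bond
  namespace: sriov-operator-tests
  annotations:
    # Two SR-IOV VFs (net1, net2) and the bond (bond0) that is built on them.
    k8s.v1.cni.cncf.io/networks: sriov-network-port1@net1,sriov-network-port2@net2,bond-net1@bond0
spec:
  containers:
  - name: client
    image: registry.access.redhat.com/ubi9/ubi:latest
    command: ["sleep", "infinity"]
EOF
```

The `<network_name>@<interface_name>` form in the annotation fixes the interface names inside the pod, so they match the links that the bond configuration expects.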
Check the failover mechanism:
- Log in to the client-bond pod by running the following command:
$ oc rsh -n sriov-operator-tests client-bond
- Check the initial status of the pod-level bond by running the following command:
sh-4.4# cat /proc/net/bonding/bond0
The output shows that both the net1 and net2 interfaces are up.
- Exit the pod shell.
- Simulate an LACP failure on your upstream physical switch. To simulate this scenario, you can filter LACP traffic on the switch port where you want to test the failure. This ensures that the physical link remains up while LACP polling fails. The command to do this is vendor-dependent.
- Verify the failover inside the pod by logging back in to the client-bond pod and checking the bond status again:
sh-4.4# cat /proc/net/bonding/bond0
The output shows that the net1 interface is down and that the net2 interface is now the active interface. The client-bond pod detects the link state change and switches to the backup network path.
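To compare the active interface before and after the simulated failure without reading the full bond status, you can filter the output with a small helper such as the following sketch. The `active_slave` function is hypothetical. It relies on the "Currently Active Slave:" line that the Linux bonding driver prints in /proc/net/bonding/<bond_name>.

```shell
# Hypothetical helper: read the contents of /proc/net/bonding/<bond_name> on
# standard input and print the name of the currently active bond member.
active_slave() {
  awk -F': ' '/^Currently Active Slave:/ { print $2 }'
}

# Usage from outside the pod (requires access to the cluster):
#   oc exec -n sriov-operator-tests client-bond -- cat /proc/net/bonding/bond0 | active_slave
```

In this scenario, the helper would be expected to print net1 before the failure and net2 after the failover.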