Chapter 6. Deploying SR-IOV technologies
In your Red Hat OpenStack Platform NFV deployment, you can achieve higher performance with single root I/O virtualization (SR-IOV) when you configure direct access from your instances to a shared PCIe resource through virtual resources.
6.1. Prerequisites
- For details on how to install and configure the undercloud before deploying the overcloud, see the Director Installation and Usage Guide.
Do not manually edit any values in /etc/tuned/cpu-partitioning-variables.conf that director heat templates modify.
6.2. Configuring SR-IOV
To deploy Red Hat OpenStack Platform (RHOSP) with single root I/O virtualization (SR-IOV), configure the shared PCIe resources with SR-IOV capabilities that instances can request direct access to.
The following CPU assignments, memory allocation, and NIC configurations are examples, and might be different from your use case.
Procedure
- Log in to the undercloud as the stack user. Source the stackrc file:

    [stack@director ~]$ source ~/stackrc

- Generate a new roles data file named roles_data_compute_sriov.yaml that includes the Controller and ComputeSriov roles:

    (undercloud)$ openstack overcloud roles \
      generate -o /home/stack/templates/roles_data_compute_sriov.yaml \
      Controller ComputeSriov

  ComputeSriov is a custom role provided with your RHOSP installation that includes the NeutronSriovAgent and NeutronSriovHostConfig services, in addition to the default compute services.

- To prepare the SR-IOV containers, include the neutron-sriov.yaml and roles_data_compute_sriov.yaml files when you generate the overcloud_images.yaml file:

    $ sudo openstack tripleo container image prepare \
      --roles-file ~/templates/roles_data_compute_sriov.yaml \
      -e /usr/share/openstack-tripleo-heat-templates/environments/services/neutron-sriov.yaml \
      -e ~/containers-prepare-parameter.yaml \
      --output-env-file=/home/stack/templates/overcloud_images.yaml
For more information on container image preparation, see Preparing container images in the Director Installation and Usage guide.
- Create a copy of the /usr/share/openstack-tripleo-heat-templates/environments/network-environment.yaml file in your environment file directory:

    $ cp /usr/share/openstack-tripleo-heat-templates/environments/network-environment.yaml /home/stack/templates/network-environment-sriov.yaml

- Add the following parameters under parameter_defaults in your network-environment-sriov.yaml file to configure the SR-IOV nodes for your cluster and your hardware configuration:

    NeutronNetworkType: 'vlan'
    NeutronNetworkVLANRanges:
      - tenant:22:22
      - tenant:25:25
    NeutronTunnelTypes: ''
- To determine the vendor_id and product_id for each PCI device type, use one of the following commands on the physical server that has the PCI cards:

  - To return the vendor_id and product_id from a deployed overcloud, use the following command:

      # lspci -nn -s <pci_device_address>
      3b:00.0 Ethernet controller [0200]: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ [<vendor_id>: <product_id>] (rev 02)

  - To return the vendor_id and product_id of a physical function (PF) if you have not yet deployed the overcloud, use the following command:

      (undercloud) [stack@undercloud-0 ~]$ openstack baremetal introspection data save <baremetal_node_name> | jq '.inventory.interfaces[] | .name, .vendor, .product'
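  If you do not already know the PCI bus address, you can first list all Ethernet controllers on the server and read the [vendor_id:product_id] pair from the bracketed values. This is a generic lspci invocation, and the output line shown is illustrative only:

      # lspci -nn | grep -i ethernet
      3b:00.0 Ethernet controller [0200]: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ [8086:1572] (rev 02)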
- Configure role specific parameters for SR-IOV compute nodes in your network-environment-sriov.yaml file:

    ComputeSriovParameters:
      IsolCpusList: "1-19,21-39"
      KernelArgs: "default_hugepagesz=1GB hugepagesz=1G hugepages=32 iommu=pt intel_iommu=on isolcpus=1-19,21-39"
      TunedProfileName: "cpu-partitioning"
      NeutronBridgeMappings:
        - tenant:br-link0
      NeutronPhysicalDevMappings:
        - tenant:p7p1
      NovaComputeCpuDedicatedSet: '1-19,21-39'
      NovaReservedHostMemory: 4096

  Note: The NovaVcpuPinSet parameter is now deprecated, and is replaced by NovaComputeCpuDedicatedSet for dedicated, pinned workloads.

- Configure the PCI passthrough devices for the SR-IOV compute nodes in your network-environment-sriov.yaml file:

    ComputeSriovParameters:
      ...
      NovaPCIPassthrough:
        - vendor_id: "<vendor_id>"
          product_id: "<product_id>"
          address: <NIC_address>
          physical_network: "<physical_network>"
      ...
  - Replace <vendor_id> with the vendor ID of the PCI device.
  - Replace <product_id> with the product ID of the PCI device.
  - Replace <NIC_address> with the address of the PCI device. For information about how to configure the address parameter, see Guidelines for configuring NovaPCIPassthrough in the Configuring the Compute Service for Instance Creation guide.
  - Replace <physical_network> with the name of the physical network the PCI device is located on.

  Note: Do not use the devname parameter when you configure PCI passthrough because the device name of a NIC can change. To create a Networking service (neutron) port on a PF, specify the vendor_id, the product_id, and the PCI device address in NovaPCIPassthrough, and create the port with the --vnic-type direct-physical option. To create a Networking service port on a virtual function (VF), specify the vendor_id and product_id in NovaPCIPassthrough, and create the port with the --vnic-type direct option. The values of the vendor_id and product_id parameters might be different between physical function (PF) and VF contexts. For more information about how to configure NovaPCIPassthrough, see Guidelines for configuring NovaPCIPassthrough in the Configuring the Compute Service for Instance Creation guide.

- Configure the SR-IOV enabled interfaces in the compute.yaml network configuration template. To create SR-IOV VFs, configure the interfaces as standalone NICs:

    - type: sriov_pf
      name: p7p3
      mtu: 9000
      numvfs: 10
      use_dhcp: false
      defroute: false
      nm_controlled: true
      hotplug: true
      promisc: false
    - type: sriov_pf
      name: p7p4
      mtu: 9000
      numvfs: 10
      use_dhcp: false
      defroute: false
      nm_controlled: true
      hotplug: true
      promisc: false

  Note: The numvfs parameter replaces the NeutronSriovNumVFs parameter in the network configuration templates. Red Hat does not support modification of the NeutronSriovNumVFs parameter or the numvfs parameter after deployment. If you modify either parameter after deployment, it might cause a disruption for the running instances that have an SR-IOV port on that PF. In this case, you must hard reboot these instances to make the SR-IOV PCI device available again.

- Ensure that the list of default filters includes the value AggregateInstanceExtraSpecsFilter:

    NovaSchedulerDefaultFilters: ['AvailabilityZoneFilter','ComputeFilter','ComputeCapabilitiesFilter','ImagePropertiesFilter','ServerGroupAntiAffinityFilter','ServerGroupAffinityFilter','PciPassthroughFilter','AggregateInstanceExtraSpecsFilter']

- Run the overcloud_deploy.sh script.
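  The exact contents of overcloud_deploy.sh depend on your deployment. As a rough sketch only, a deployment command for this configuration might resemble the following; the file names match the examples earlier in this procedure, and you must also include any other environment files that your deployment already uses:

    #!/bin/bash
    # Illustrative sketch: adjust paths and add your existing environment files.
    openstack overcloud deploy --templates \
      -r /home/stack/templates/roles_data_compute_sriov.yaml \
      -e /usr/share/openstack-tripleo-heat-templates/environments/services/neutron-sriov.yaml \
      -e /home/stack/templates/network-environment-sriov.yaml \
      -e /home/stack/templates/overcloud_images.yaml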
6.3. NIC partitioning
This feature is generally available from Red Hat OpenStack Platform (RHOSP) 16.1.2, and is validated on Intel Fortville NICs and Mellanox CX-5 NICs.
You can configure single root I/O virtualization (SR-IOV) so that a RHOSP host can use virtual functions (VFs).
When you partition a single, high-speed NIC into multiple VFs, you can use the NIC for both control and data plane traffic.
Procedure
- Open the NIC config file for your chosen role.
- Add an entry for the interface type sriov_pf to configure a physical function that the host can use:

    - type: sriov_pf
      name: <interface name>
      use_dhcp: false
      numvfs: <number of vfs>
      promisc: <true/false> # optional (defaults to true)
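  Before you choose a value for numvfs, you can check the maximum number of VFs that the PF supports; the interface name here is a placeholder for your device:

    # cat /sys/class/net/<interface name>/device/sriov_totalvfs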
  Note: The numvfs parameter replaces the NeutronSriovNumVFs parameter in the network configuration templates. Red Hat does not support modification of the NeutronSriovNumVFs parameter or the numvfs parameter after deployment. If you modify either parameter after deployment, it might cause a disruption for the running instances that have an SR-IOV port on that physical function (PF). In this case, you must hard reboot these instances to make the SR-IOV PCI device available again.

- Add an entry for the interface type sriov_vf to configure virtual functions that the host can use:

    - type: <bond_type>
      name: internal_bond
      bonding_options: mode=<bonding_option>
      use_dhcp: false
      members:
        - type: sriov_vf
          device: <pf_device_name>
          vfid: <vf_id>
        - type: sriov_vf
          device: <pf_device_name>
          vfid: <vf_id>
    - type: vlan
      vlan_id:
        get_param: InternalApiNetworkVlanID
      spoofcheck: false
      device: internal_bond
      addresses:
        - ip_netmask:
            get_param: InternalApiIpSubnet
      routes:
        list_concat_unique:
          - get_param: InternalApiInterfaceRoutes
  - Replace <bond_type> with the required bond type, for example, linux_bond. You can apply VLAN tags on the bond for other bonds, such as ovs_bond.
  - Replace <bonding_option> with one of the following supported bond modes:
    - active-backup
    - balance-slb

    Note: LACP bonds are not supported.
  - Specify sriov_vf as the interface type to bond in the members section.

    Note: If you are using an OVS bridge as the interface type, you can configure only one OVS bridge on the sriov_vf of a sriov_pf device. More than one OVS bridge on a single sriov_pf device can result in packet duplication across VFs, and decreased performance.

  - Replace <pf_device_name> with the name of the PF device.
  - If you use a linux_bond, you must assign VLAN tags.
  - Replace <vf_id> with the ID of the VF. The applicable VF ID range starts at zero, and ends at the maximum number of VFs minus one.
  - Disable spoof checking, and apply VLAN tags on the sriov_vf for linux_bond over VFs.

- To reserve VFs for instances, include the NovaPCIPassthrough parameter in an environment file, for example:

    NovaPCIPassthrough:
      - address: "0000:19:0e.3"
        trusted: "true"
        physical_network: "sriov1"
      - address: "0000:19:0e.0"
        trusted: "true"
        physical_network: "sriov2"
Director identifies the host VFs, and derives the PCI addresses of the VFs that are available to the instance.
- Enable IOMMU on all nodes that require NIC partitioning. For example, if you want NIC partitioning for Compute nodes, enable IOMMU using the KernelArgs parameter for that role:

    parameter_defaults:
      ComputeParameters:
        KernelArgs: "intel_iommu=on iommu=pt"
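  After the node reboots with the new kernel arguments, you can confirm on the host that IOMMU is active. These are standard Linux checks, and the exact dmesg wording varies by kernel version:

    # cat /proc/cmdline
    # dmesg | grep -i -e DMAR -e IOMMU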
- Add your role file and environment files to the stack with your other environment files and deploy the overcloud:

    (undercloud)$ openstack overcloud deploy --templates \
      -r os-net-config.yaml -e [your environment files] \
      -e /home/stack/templates/<compute_environment_file>.yaml
Example NIC Partitioning configurations
To configure a Linux bond over VFs, disable spoofcheck, and apply VLAN tags to sriov_vf:

    - type: linux_bond
      name: bond_api
      bonding_options: "mode=active-backup"
      members:
        - type: sriov_vf
          device: eno2
          vfid: 1
          vlan_id:
            get_param: InternalApiNetworkVlanID
          spoofcheck: false
        - type: sriov_vf
          device: eno3
          vfid: 1
          vlan_id:
            get_param: InternalApiNetworkVlanID
          spoofcheck: false
      addresses:
        - ip_netmask:
            get_param: InternalApiIpSubnet
      routes:
        list_concat_unique:
          - get_param: InternalApiInterfaceRoutes
Use the following example to configure an OVS bridge on VFs:
    - type: ovs_bridge
      name: br-bond
      use_dhcp: true
      members:
        - type: vlan
          vlan_id:
            get_param: TenantNetworkVlanID
          addresses:
            - ip_netmask:
                get_param: TenantIpSubnet
          routes:
            list_concat_unique:
              - get_param: ControlPlaneStaticRoutes
        - type: ovs_bond
          name: bond_vf
          ovs_options: "bond_mode=active-backup"
          members:
            - type: sriov_vf
              device: p2p1
              vfid: 2
            - type: sriov_vf
              device: p2p2
              vfid: 2
To configure an OVS user bridge on VFs, apply VLAN tags to the ovs_user_bridge parameter:

    - type: ovs_user_bridge
      name: br-link0
      use_dhcp: false
      mtu: 9000
      ovs_extra:
        - str_replace:
            template: set port br-link0 tag=_VLAN_TAG_
            params:
              _VLAN_TAG_:
                get_param: TenantNetworkVlanID
      addresses:
        - ip_netmask:
            get_param: TenantIpSubnet
      routes:
        list_concat_unique:
          - get_param: TenantInterfaceRoutes
      members:
        - type: ovs_dpdk_bond
          name: dpdkbond0
          mtu: 9000
          ovs_extra:
            - set port dpdkbond0 bond_mode=balance-slb
          members:
            - type: ovs_dpdk_port
              name: dpdk0
              members:
                - type: sriov_vf
                  device: eno2
                  vfid: 3
            - type: ovs_dpdk_port
              name: dpdk1
              members:
                - type: sriov_vf
                  device: eno3
                  vfid: 3
Validation
Check the number of VFs.
    [root@overcloud-compute-0 heat-admin]# cat /sys/class/net/p4p1/device/sriov_numvfs
    10
    [root@overcloud-compute-0 heat-admin]# cat /sys/class/net/p4p2/device/sriov_numvfs
    10
Check Linux bonds.
    [root@overcloud-compute-0 heat-admin]# cat /proc/net/bonding/intapi_bond
    Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

    Bonding Mode: fault-tolerance (active-backup)
    Primary Slave: None
    Currently Active Slave: p4p1_1
    MII Status: up
    MII Polling Interval (ms): 0
    Up Delay (ms): 0
    Down Delay (ms): 0

    Slave Interface: p4p1_1
    MII Status: up
    Speed: 10000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: 16:b4:4c:aa:f0:a8
    Slave queue ID: 0

    Slave Interface: p4p2_1
    MII Status: up
    Speed: 10000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: b6:be:82:ac:51:98
    Slave queue ID: 0

    [root@overcloud-compute-0 heat-admin]# cat /proc/net/bonding/st_bond
    Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

    Bonding Mode: fault-tolerance (active-backup)
    Primary Slave: None
    Currently Active Slave: p4p1_3
    MII Status: up
    MII Polling Interval (ms): 0
    Up Delay (ms): 0
    Down Delay (ms): 0

    Slave Interface: p4p1_3
    MII Status: up
    Speed: 10000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: 9a:86:b7:cc:17:e4
    Slave queue ID: 0

    Slave Interface: p4p2_3
    MII Status: up
    Speed: 10000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: d6:07:f8:78:dd:5b
    Slave queue ID: 0
List OVS bonds.
    [root@overcloud-compute-0 heat-admin]# ovs-appctl bond/show
    ---- bond_prov ----
    bond_mode: active-backup
    bond may use recirculation: no, Recirc-ID : -1
    bond-hash-basis: 0
    updelay: 0 ms
    downdelay: 0 ms
    lacp_status: off
    lacp_fallback_ab: false
    active slave mac: f2:ad:c7:00:f5:c7(dpdk2)

    slave dpdk2: enabled
      active slave
      may_enable: true

    slave dpdk3: enabled
      may_enable: true

    ---- bond_tnt ----
    bond_mode: active-backup
    bond may use recirculation: no, Recirc-ID : -1
    bond-hash-basis: 0
    updelay: 0 ms
    downdelay: 0 ms
    lacp_status: off
    lacp_fallback_ab: false
    active slave mac: b2:7e:b8:75:72:e8(dpdk0)

    slave dpdk0: enabled
      active slave
      may_enable: true

    slave dpdk1: enabled
      may_enable: true
Show OVS connections.
    [root@overcloud-compute-0 heat-admin]# ovs-vsctl show
    cec12069-9d4c-4fa8-bfe4-decfdf258f49
        Manager "ptcp:6640:127.0.0.1"
            is_connected: true
        Bridge br-tenant
            fail_mode: standalone
            Port br-tenant
                Interface br-tenant
                    type: internal
            Port bond_tnt
                Interface "dpdk0"
                    type: dpdk
                    options: {dpdk-devargs="0000:82:02.2"}
                Interface "dpdk1"
                    type: dpdk
                    options: {dpdk-devargs="0000:82:04.2"}
        Bridge "sriov2"
            Controller "tcp:127.0.0.1:6633"
                is_connected: true
            fail_mode: secure
            Port "phy-sriov2"
                Interface "phy-sriov2"
                    type: patch
                    options: {peer="int-sriov2"}
            Port "sriov2"
                Interface "sriov2"
                    type: internal
        Bridge br-int
            Controller "tcp:127.0.0.1:6633"
                is_connected: true
            fail_mode: secure
            Port "int-sriov2"
                Interface "int-sriov2"
                    type: patch
                    options: {peer="phy-sriov2"}
            Port br-int
                Interface br-int
                    type: internal
            Port "vhu93164679-22"
                tag: 4
                Interface "vhu93164679-22"
                    type: dpdkvhostuserclient
                    options: {vhost-server-path="/var/lib/vhost_sockets/vhu93164679-22"}
            Port "vhu5d6b9f5a-0d"
                tag: 3
                Interface "vhu5d6b9f5a-0d"
                    type: dpdkvhostuserclient
                    options: {vhost-server-path="/var/lib/vhost_sockets/vhu5d6b9f5a-0d"}
            Port patch-tun
                Interface patch-tun
                    type: patch
                    options: {peer=patch-int}
            Port "int-sriov1"
                Interface "int-sriov1"
                    type: patch
                    options: {peer="phy-sriov1"}
            Port int-br-vfs
                Interface int-br-vfs
                    type: patch
                    options: {peer=phy-br-vfs}
        Bridge br-vfs
            Controller "tcp:127.0.0.1:6633"
                is_connected: true
            fail_mode: secure
            Port phy-br-vfs
                Interface phy-br-vfs
                    type: patch
                    options: {peer=int-br-vfs}
            Port bond_prov
                Interface "dpdk3"
                    type: dpdk
                    options: {dpdk-devargs="0000:82:04.5"}
                Interface "dpdk2"
                    type: dpdk
                    options: {dpdk-devargs="0000:82:02.5"}
            Port br-vfs
                Interface br-vfs
                    type: internal
        Bridge "sriov1"
            Controller "tcp:127.0.0.1:6633"
                is_connected: true
            fail_mode: secure
            Port "sriov1"
                Interface "sriov1"
                    type: internal
            Port "phy-sriov1"
                Interface "phy-sriov1"
                    type: patch
                    options: {peer="int-sriov1"}
        Bridge br-tun
            Controller "tcp:127.0.0.1:6633"
                is_connected: true
            fail_mode: secure
            Port br-tun
                Interface br-tun
                    type: internal
            Port patch-int
                Interface patch-int
                    type: patch
                    options: {peer=patch-tun}
            Port "vxlan-0a0a7315"
                Interface "vxlan-0a0a7315"
                    type: vxlan
                    options: {df_default="true", in_key=flow, local_ip="10.10.115.10", out_key=flow, remote_ip="10.10.115.21"}
        ovs_version: "2.10.0"
If you used NovaPCIPassthrough to pass VFs to instances, test by deploying an SR-IOV instance.
6.4. Configuring OVS hardware offload
The procedure for OVS hardware offload configuration shares many of the same steps as configuring SR-IOV.
Procedure
- Generate an overcloud role for OVS hardware offload that is based on the Compute role:

    openstack overcloud roles generate -o roles_data.yaml Controller Compute:ComputeOvsHwOffload

- Optional: Change the HostnameFormatDefault: '%stackname%-compute-%index%' name for the ComputeOvsHwOffload role.
- Add the OvsHwOffload parameter under role-specific parameters with a value of true.
- To configure neutron to use the iptables/hybrid firewall driver implementation, include the line NeutronOVSFirewallDriver: iptables_hybrid. For more information about NeutronOVSFirewallDriver, see Using the Open vSwitch Firewall in the Advanced Overcloud Customization Guide.
- Configure the physical_network parameter to match your environment.

  - For VLAN, set the physical_network parameter to the name of the network you create in neutron after deployment. This value should also be in NeutronBridgeMappings.
  - For VXLAN, set the physical_network parameter to null.

  Example:

    parameter_defaults:
      NeutronOVSFirewallDriver: iptables_hybrid
      ComputeSriovParameters:
        IsolCpusList: 2-9,21-29,11-19,31-39
        KernelArgs: "default_hugepagesz=1GB hugepagesz=1G hugepages=128 intel_iommu=on iommu=pt"
        OvsHwOffload: true
        TunedProfileName: "cpu-partitioning"
        NeutronBridgeMappings:
          - tenant:br-tenant
        NovaPCIPassthrough:
          - vendor_id: <vendor-id>
            product_id: <product-id>
            address: <address>
            physical_network: "tenant"
          - vendor_id: <vendor-id>
            product_id: <product-id>
            address: <address>
            physical_network: "null"
        NovaReservedHostMemory: 4096
        NovaComputeCpuDedicatedSet: 1-9,21-29,11-19,31-39
  - Replace <vendor-id> with the vendor ID of the physical NIC.
  - Replace <product-id> with the product ID of the NIC VF.
  - Replace <address> with the address of the physical NIC.

  For more information about how to configure NovaPCIPassthrough, see Guidelines for configuring NovaPCIPassthrough.

- Ensure that the list of default filters includes NUMATopologyFilter:

    NovaSchedulerDefaultFilters: ['AvailabilityZoneFilter','ComputeFilter','ComputeCapabilitiesFilter','ImagePropertiesFilter','ServerGroupAntiAffinityFilter','ServerGroupAffinityFilter','PciPassthroughFilter','NUMATopologyFilter']
- Configure one or more network interfaces intended for hardware offload in the compute-sriov.yaml configuration file:

    - type: ovs_bridge
      name: br-tenant
      mtu: 9000
      members:
        - type: sriov_pf
          name: p7p1
          numvfs: 5
          mtu: 9000
          primary: true
          promisc: true
          use_dhcp: false
          link_mode: switchdev

  Note:
  - Do not use the NeutronSriovNumVFs parameter when configuring Open vSwitch hardware offload. The number of virtual functions is specified using the numvfs parameter in a network configuration file used by os-net-config. Red Hat does not support modifying the numvfs setting during update or redeployment.
  - Do not configure Mellanox network interfaces as a nic-config interface type ovs-vlan because this prevents tunnel endpoints such as VXLAN from passing traffic due to driver limitations.

- Include the ovs-hw-offload.yaml file in the overcloud deploy command:

    TEMPLATES_HOME="/usr/share/openstack-tripleo-heat-templates"
    CUSTOM_TEMPLATES="/home/stack/templates"

    openstack overcloud deploy --templates \
      -r ${CUSTOM_TEMPLATES}/roles_data.yaml \
      -e ${TEMPLATES_HOME}/environments/ovs-hw-offload.yaml \
      -e ${CUSTOM_TEMPLATES}/network-environment.yaml \
      -e ${CUSTOM_TEMPLATES}/neutron-ovs.yaml
6.4.1. Verifying OVS hardware offload
- Confirm that a PCI device is in switchdev mode:

    # devlink dev eswitch show pci/0000:03:00.0
    pci/0000:03:00.0: mode switchdev inline-mode none encap enable

- Verify that offload is enabled in OVS:

    # ovs-vsctl get Open_vSwitch . other_config:hw-offload
    "true"
6.5. Tuning examples for OVS hardware offload
For optimal performance, you must complete additional configuration steps.
Adjusting the number of channels for each network interface to improve performance
A channel includes an interrupt request (IRQ) and the set of queues that trigger the IRQ. When you set the mlx5_core driver to switchdev mode, the mlx5_core driver defaults to one combined channel, which might not deliver optimal performance.
Procedure
On the PF representors, enter the following command to adjust the number of channels to match the number of CPUs available on the host. Replace $(nproc) with the number of channels you want to make available:
$ sudo ethtool -L enp3s0f0 combined $(nproc)
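You can view the current and maximum channel counts before and after the change; the interface name follows the example above:

    $ ethtool -l enp3s0f0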
CPU pinning
To prevent performance degradation from cross-NUMA operations, locate NICs, their applications, the VF guest, and OVS in the same NUMA node. For more information, see Configuring CPU pinning on the Compute node in the Configuring the Compute Service for Instance Creation guide.
6.6. Components of OVS hardware offload
This section is a reference for configuring and troubleshooting the components of OVS HW Offload with Mellanox smart NICs.
Nova
Configure the Nova scheduler to use the NovaPCIPassthrough filter with the NUMATopologyFilter and DerivePciWhitelistEnabled parameters. When you enable OVS HW Offload, the Nova scheduler operates similarly to SR-IOV passthrough for instance spawning.
Neutron
When you enable OVS HW Offload, use the devlink CLI tool to set the NIC e-switch mode to switchdev. Switchdev mode establishes representor ports on the NIC that are mapped to the VFs.
Procedure
- To allocate a port from a switchdev-enabled NIC, create a neutron port with a binding-profile value of capabilities, and disable port security:

    $ openstack port create --network private --vnic-type=direct --binding-profile '{"capabilities": ["switchdev"]}' direct_port1 --disable-port-security

Pass this port information when you create the instance. You associate the representor port with the instance VF interface and connect the representor port to OVS bridge br-int for one-time OVS datapath processing. A VF port representor functions like a software version of a physical "patch panel" front-end. For more information about new instance creation, see Deploying an instance for SR-IOV.
OVS
In an environment with hardware offload configured, the first packet transmitted traverses the OVS kernel path, and this packet journey establishes the ml2 OVS rules for incoming and outgoing traffic for the instance traffic. When the flows of the traffic stream are established, OVS uses the traffic control (TC) Flower utility to push these flows on the NIC hardware.
Procedure
- Use director to apply the following configuration on OVS:

    $ sudo ovs-vsctl set Open_vSwitch . other_config:hw-offload=true

- Restart OVS to enable HW Offload.
Traffic Control (TC) subsystems
When you enable the hw-offload flag, OVS uses the TC datapath. TC Flower is an iproute2 utility that writes datapath flows on hardware. This ensures that the flow is programmed on both the hardware and software datapaths, for redundancy.
Procedure
- Apply the following configuration. This is the default option if you do not explicitly configure tc-policy:

    $ sudo ovs-vsctl set Open_vSwitch . other_config:tc-policy=none
- Restart OVS.
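You can confirm the value that is currently configured. If the key has never been set, ovs-vsctl reports that it does not exist, which corresponds to the default behavior described above:

    $ sudo ovs-vsctl get Open_vSwitch . other_config:tc-policy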
NIC PF and VF drivers
The mlx5_core driver is the PF and VF driver for the Mellanox ConnectX-5 NIC. It performs the following tasks:
- Creates routing tables on hardware.
- Manages network flows.
- Configures the Ethernet switch device driver model, switchdev.
- Creates block devices.
Procedure
- Use the following devlink commands to set and query the mode of the PCI device:

    $ sudo devlink dev eswitch set pci/0000:03:00.0 mode switchdev
    $ sudo devlink dev eswitch show pci/0000:03:00.0
    pci/0000:03:00.0: mode switchdev inline-mode none encap enable
NIC firmware
The NIC firmware performs the following tasks:
- Maintains routing tables and rules.
- Fixes the pipelines of the tables.
- Manages hardware resources.
- Creates VFs.
The firmware works with the driver for optimal performance.
Although the NIC firmware is non-volatile and persists after you reboot, you can modify the configuration during run time.
Procedure
- Apply the following configuration on the interfaces, and the representor ports, to ensure that TC Flower pushes the flow programming at the port level:

    $ sudo ethtool -K enp3s0f0 hw-tc-offload on
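  To confirm that the setting took effect, you can query the feature flag on the same interface; the interface name follows the example above:

    $ ethtool -k enp3s0f0 | grep hw-tc-offload
    hw-tc-offload: on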
Ensure that you keep the firmware updated. Yum or dnf updates might not complete the firmware update. For more information, see your vendor documentation.
6.7. Troubleshooting OVS hardware offload
Prerequisites
- Linux Kernel 4.13 or newer
- OVS 2.8 or newer
- RHOSP 12 or newer
- Iproute 4.12 or newer
- Mellanox NIC firmware, for example FW ConnectX-5 16.21.0338 or newer
For more information about supported prerequisites, see the Red Hat Knowledgebase solution Network Adapter Fast Datapath Feature Support Matrix.
Configuring the network in an OVS HW offload deployment
In a HW offload deployment, you can choose one of the following scenarios for your network configuration according to your requirements:
- You can base guest VMs on VXLAN and VLAN by using either the same set of interfaces attached to a bond, or a different set of NICs for each type.
- You can bond two ports of a Mellanox NIC by using Linux bond.
- You can host tenant VXLAN networks on VLAN interfaces on top of a Mellanox Linux bond.
Ensure that individual NICs and bonds are members of an ovs-bridge.
Refer to the following example network configuration:
    - type: ovs_bridge
      name: br-offload
      mtu: 9000
      use_dhcp: false
      members:
        - type: linux_bond
          name: bond-pf
          bonding_options: "mode=active-backup miimon=100"
          members:
            - type: sriov_pf
              name: p5p1
              numvfs: 3
              primary: true
              promisc: true
              use_dhcp: false
              defroute: false
              link_mode: switchdev
            - type: sriov_pf
              name: p5p2
              numvfs: 3
              promisc: true
              use_dhcp: false
              defroute: false
              link_mode: switchdev
    - type: vlan
      vlan_id:
        get_param: TenantNetworkVlanID
      device: bond-pf
      addresses:
        - ip_netmask:
            get_param: TenantIpSubnet
Refer to the following validated bonding configurations:
- active-backup - mode=1
- active-active or balance-xor - mode=2
- 802.3ad (LACP) - mode=4
Verifying the interface configuration
Verify the interface configuration with the following procedure.
Procedure
- During deployment, use the host network configuration tool os-net-config to enable hw-tc-offload.
- Enable hw-tc-offload on the sriov_config service any time you reboot the Compute node.
- Set the hw-tc-offload parameter to on for the NICs that are attached to the bond:

    [root@overcloud-computesriov-0 ~]# ethtool -k ens1f0 | grep tc-offload
    hw-tc-offload: on
Verifying the interface mode
Verify the interface mode with the following procedure.
Procedure
- Set the eswitch mode to switchdev for the interfaces you use for HW offload.
- Use the host network configuration tool os-net-config to enable eswitch during deployment.
- Enable eswitch on the sriov_config service any time you reboot the Compute node:

    [root@overcloud-computesriov-0 ~]# devlink dev eswitch show pci/$(ethtool -i ens1f0 | grep bus-info | cut -d ':' -f 2,3,4 | awk '{$1=$1};1')

The driver of the PF interface is set to "mlx5e_rep", to show that it is a representor of the e-switch uplink port. This does not affect the functionality.
Verifying the offload state in OVS
Verify the offload state in OVS with the following procedure.
Confirm that hardware offload is enabled in OVS on the Compute node:

    [root@overcloud-computesriov-0 ~]# ovs-vsctl get Open_vSwitch . other_config:hw-offload
    "true"
Verifying the name of the VF representor port
To ensure consistent naming of VF representor ports, os-net-config uses udev rules to rename the ports in the <PF-name>_<VF_id> format.
Procedure
After deployment, verify that the VF representor ports are named correctly.
    [root@overcloud-computesriov-0 ~]# cat /etc/udev/rules.d/80-persistent-os-net-config.rules
    # This file is autogenerated by os-net-config

    SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}!="", ATTR{phys_port_name}=="pf*vf*", ENV{NM_UNMANAGED}="1"
    SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", KERNELS=="0000:65:00.0", NAME="ens1f0"
    SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="98039b7f9e48", ATTR{phys_port_name}=="pf0vf*", IMPORT{program}="/etc/udev/rep-link-name.sh $attr{phys_port_name}", NAME="ens1f0_$env{NUMBER}"
    SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", KERNELS=="0000:65:00.1", NAME="ens1f1"
    SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="98039b7f9e49", ATTR{phys_port_name}=="pf1vf*", IMPORT{program}="/etc/udev/rep-link-name.sh $attr{phys_port_name}", NAME="ens1f1_$env{NUMBER}"
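With these rules in place, you can list the renamed representor ports; the interface name below assumes the ens1f0 PF from the example rules:

    [root@overcloud-computesriov-0 ~]# ip -br link show | grep ens1f0_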
Examining network traffic flow
HW offloaded network flow functions in a similar way to physical switches or routers with application-specific integrated circuit (ASIC) chips. You can access the ASIC shell of a switch or router to examine the routing table and for other debugging. The following procedure uses a Broadcom chipset from a Cumulus Linux switch as an example. Replace the values that are appropriate to your environment.
Procedure
- To get Broadcom chip table content, use the bcmcmd command:

    root@dni-7448-26:~# cl-bcmcmd l2 show

    mac=00:02:00:00:00:08 vlan=2000 GPORT=0x2 modid=0 port=2/xe1
    mac=00:02:00:00:00:09 vlan=2000 GPORT=0x2 modid=0 port=2/xe1 Hit

- Inspect the Traffic Control (TC) layer:

    # tc -s filter show dev p5p1_1 ingress
    …
    filter block 94 protocol ip pref 3 flower chain 5
    filter block 94 protocol ip pref 3 flower chain 5 handle 0x2
      eth_type ipv4
      src_ip 172.0.0.1
      ip_flags nofrag
      in_hw in_hw_count 1
      action order 1: mirred (Egress Redirect to device eth4) stolen
        index 3 ref 1 bind 1 installed 364 sec used 0 sec
        Action statistics:
        Sent 253991716224 bytes 169534118 pkt (dropped 0, overlimits 0 requeues 0)
        Sent software 43711874200 bytes 30161170 pkt
        Sent hardware 210279842024 bytes 139372948 pkt
        backlog 0b 0p requeues 0
        cookie 8beddad9a0430f0457e7e78db6e0af48
        no_percpu

- Examine the in_hw flags and the statistics in this output. The word hardware indicates that the hardware processes the network traffic. If you use tc-policy=none, you can check this output or a tcpdump to investigate when hardware or software handles the packets. You can see a corresponding log message in dmesg or in ovs-vswitch.log when the driver is unable to offload packets.
- For Mellanox, as an example, the log entries resemble syndrome messages in dmesg:

    [13232.860484] mlx5_core 0000:3b:00.0: mlx5_cmd_check:756:(pid 131368): SET_FLOW_TABLE_ENTRY(0x936) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x6b1266)

  In this example, the error code (0x6b1266) represents the following behavior:

    0x6B1266 | set_flow_table_entry: pop vlan and forward to uplink is not allowed
Validating systems
Validate your system with the following procedure.
Procedure
- Ensure SR-IOV and VT-d are enabled on the system.
- Enable IOMMU in Linux by adding intel_iommu=on to the kernel parameters, for example, using GRUB.
Limitations
You cannot use the OVS firewall driver with HW offload because the connection tracking properties of the flows are unsupported in the offload path in OVS 2.11.
6.8. Debugging HW Offload flow
You can use the following procedure if you encounter the following message in the ovs-vswitch.log file:
2020-01-31T06:22:11.257Z|00473|dpif_netlink(handler402)|ERR|failed to offload flow: Operation not supported: p6p1_5
Procedure
To enable logging on the offload modules and to get additional log information for this failure, use the following commands on the Compute node:

    ovs-appctl vlog/set dpif_netlink:file:dbg
    # Module name changed recently (check based on the version used):
    ovs-appctl vlog/set netdev_tc_offloads:file:dbg [OR] ovs-appctl vlog/set netdev_offload_tc:file:dbg
    ovs-appctl vlog/set tc:file:dbg
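You can verify which modules are now set to debug level; vlog/list is a standard ovs-appctl subcommand that prints the current log levels for each module:

    ovs-appctl vlog/list | grep -e dpif_netlink -e tc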
Inspect the ovs-vswitchd logs again to see additional details about the issue.

In the following example logs, the offload failed because of an unsupported attribute mark.

    2020-01-31T06:22:11.218Z|00471|dpif_netlink(handler402)|DBG|system@ovs-system: put[create] ufid:61bd016e-eb89-44fc-a17e-958bc8e45fda recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(7),skb_mark(0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=fa:16:3e:d2:f5:f3,dst=fa:16:3e:c4:a3:eb),eth_type(0x0800),ipv4(src=10.1.1.8/0.0.0.0,dst=10.1.1.31/0.0.0.0,proto=1/0,tos=0/0x3,ttl=64/0,frag=no),icmp(type=0/0,code=0/0), actions:set(tunnel(tun_id=0x3d,src=10.10.141.107,dst=10.10.141.124,ttl=64,tp_dst=4789,flags(df|key))),6
    2020-01-31T06:22:11.253Z|00472|netdev_tc_offloads(handler402)|DBG|offloading attribute pkt_mark isn't supported
    2020-01-31T06:22:11.257Z|00473|dpif_netlink(handler402)|ERR|failed to offload flow: Operation not supported: p6p1_5
Debugging Mellanox NICs
Mellanox has provided a system information script, similar to a Red Hat SOS report.
https://github.com/Mellanox/linux-sysinfo-snapshot/blob/master/sysinfo-snapshot.py
When you run this command, you create a zip file of the relevant log information, which is useful for support cases.
Procedure
You can run this system information script with the following command:
# ./sysinfo-snapshot.py --asap --asap_tc --ibdiagnet --openstack
You can also install Mellanox Firmware Tools (MFT), mlxconfig, mlxlink and the OpenFabrics Enterprise Distribution (OFED) drivers.
Useful CLI commands
Use the ethtool utility with the following options to gather diagnostic information:
- ethtool -l <uplink representor> : View the number of channels
- ethtool -S <uplink/VFs> : Check statistics
- ethtool -i <uplink rep> : View driver information
- ethtool -g <uplink rep> : Check ring sizes
- ethtool -k <uplink/VFs> : View enabled features
Use the tcpdump utility at the representor and PF ports to similarly check traffic flow.
- Any changes you make to the link state of the representor port also affect the VF link state.
- Representor port statistics also present VF statistics.
Use the following commands to get useful diagnostic information:

    $ ovs-appctl dpctl/dump-flows -m type=offloaded
    $ ovs-appctl dpctl/dump-flows -m
    $ tc filter show dev ens1_0 ingress
    $ tc -s filter show dev ens1_0 ingress
    $ tc monitor
6.9. Deploying an instance for SR-IOV
Use host aggregates to separate high performance compute hosts. For information on creating host aggregates and associated flavors for scheduling, see Creating host aggregates.
Pinned CPU instances can be located on the same Compute node as unpinned instances. For more information, see Configuring CPU pinning on the Compute node in the Configuring the Compute Service for Instance Creation guide.
Deploy an instance for single root I/O virtualization (SR-IOV) by performing the following steps:
Create a flavor:

    # openstack flavor create <flavor> --ram <MB> --disk <GB> --vcpus <#>

Tip: You can specify the NUMA affinity policy for PCI passthrough devices and SR-IOV interfaces by adding the extra spec hw:pci_numa_affinity_policy to your flavor. For more information, see Flavor metadata in the Configuring the Compute Service for Instance Creation guide.
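For example, to request, where possible, that the PCI device and the instance CPUs share a NUMA node, you can set the policy to preferred; required and legacy are the other available values, so choose the one that matches your workload:

    # openstack flavor set --property hw:pci_numa_affinity_policy=preferred <flavor>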
Create the network:

    # openstack network create net1 --provider-physical-network tenant --provider-network-type vlan --provider-segment <VLAN-ID>
    # openstack subnet create subnet1 --network net1 --subnet-range 192.0.2.0/24 --dhcp
Create the port.

- Use vnic-type direct to create an SR-IOV virtual function (VF) port:

    # openstack port create --network net1 --vnic-type direct sriov_port

- Use the following command to create a virtual function with hardware offload:

    # openstack port create --network net1 --vnic-type direct --binding-profile '{"capabilities": ["switchdev"]}' sriov_hwoffload_port

- Use vnic-type direct-physical to create an SR-IOV physical function (PF) port that is dedicated to a single instance. This PF port is a Networking service (neutron) port but is not controlled by the Networking service, and is not visible as a network adapter because it is a PCI device that is passed through to the instance:

    # openstack port create --network net1 --vnic-type direct-physical sriov_port
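After you create a port, you can check its vNIC type and, after an instance uses it, its binding details; the port name matches the examples above and the columns are standard openstack port show output:

    # openstack port show sriov_port -c binding_vnic_type -c binding_vif_type -c binding_profile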
Deploy an instance:

    # openstack server create --flavor <flavor> --image <image> --nic port-id=<id> <instance name>
6.10. Creating host aggregates
For better performance, deploy guests that have CPU pinning and hugepages. You can schedule high performance instances on a subset of hosts by matching aggregate metadata with flavor metadata.

You can configure the AggregateInstanceExtraSpecsFilter value, and other necessary filters, through the heat parameter NovaSchedulerDefaultFilters under parameter_defaults in your deployment templates:

    parameter_defaults:
      NovaSchedulerDefaultFilters: ['AggregateInstanceExtraSpecsFilter','AvailabilityZoneFilter','ComputeFilter','ComputeCapabilitiesFilter','ImagePropertiesFilter','ServerGroupAntiAffinityFilter','ServerGroupAffinityFilter','PciPassthroughFilter','NUMATopologyFilter']

Note: To add this parameter to the configuration of an existing cluster, you can add it to the heat templates, and run the original deployment script again.
Create an aggregate group for SR-IOV, and add relevant hosts. Define metadata, for example, sriov=true, that matches defined flavor metadata:

    # openstack aggregate create sriov_group
    # openstack aggregate add host sriov_group compute-sriov-0.localdomain
    # openstack aggregate set --property sriov=true sriov_group

Create a flavor:

    # openstack flavor create <flavor> --ram <MB> --disk <GB> --vcpus <#>

Set additional flavor properties. Note that the defined metadata, sriov=true, matches the defined metadata on the SR-IOV aggregate:

    # openstack flavor set --property sriov=true --property hw:cpu_policy=dedicated --property hw:mem_page_size=1GB <flavor>
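To confirm that the aggregate metadata and the flavor properties match, you can review both objects; the names follow the examples in this section:

    # openstack aggregate show sriov_group
    # openstack flavor show <flavor> -c properties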