Chapter 9. Configuring PCI passthrough
You can use PCI passthrough to attach a physical PCI device, such as a graphics card or a network device, to an instance. If you use PCI passthrough for a device, the instance reserves exclusive access to the device for performing tasks, and the device is not available to the host.
The Compute service (nova) does not support single networks that span multiple provider networks. When a network contains multiple physical networks, the Compute service only uses the first physical network. Therefore, if you are using routed provider networks you must use the same physical_network name across all the Compute nodes.
If you use routed provider networks with VLAN or flat networks, you must use the same physical_network name for all segments. You then create multiple segments for the network and map the segments to the appropriate subnets.
To enable your cloud users to create instances with PCI devices attached, you must complete the following tasks:
- Designate Compute nodes to use for PCI passthrough.
- Configure the Compute nodes for PCI passthrough that have the required PCI devices.
- Deploy the data plane.
- Create a flavor for launching instances with PCI devices attached.
9.1. Prerequisites
- The Compute nodes have the required PCI devices.
- The oc command line tool is installed on your workstation.
- You are logged in to Red Hat OpenStack Services on OpenShift (RHOSO) as a user with cluster-admin privileges.
9.2. PCI passthrough device type field
The Compute service (nova) categorizes PCI devices into one of three types, depending on the capabilities the devices report.
You can set the PCI device device_type field to one of the following values:
- type-PF: The device supports SR-IOV and is the parent or root device. Specify this device type to pass through, in its entirety, a device that supports SR-IOV.
- type-VF: The device is a child device of a device that supports SR-IOV.
- type-PCI: The device does not support SR-IOV. This is the default device type if the device_type field is not set.
The device_spec configuration on the Compute nodes and the alias configuration of the Compute services on the control plane must use the same device_type when referring to the same device.
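The consistency rule above can be checked mechanically before you apply a configuration. The following Python snippet is a minimal sketch, not part of the Compute service: the helper names effective_device_type and device_types_match are hypothetical, and the sketch assumes that both the alias and the device_spec values are available as JSON strings and that an unset device_type means type-PCI, as described in the list above.

```python
import json

# Default documented above: used when device_type is not set.
DEFAULT_DEVICE_TYPE = "type-PCI"


def effective_device_type(entry: dict) -> str:
    """Return the device_type of an alias or device_spec entry (hypothetical helper)."""
    return entry.get("device_type", DEFAULT_DEVICE_TYPE)


def device_types_match(alias_json: str, device_spec_json: str) -> bool:
    """Return True when both entries resolve to the same device_type."""
    alias = json.loads(alias_json)
    device_spec = json.loads(device_spec_json)
    return effective_device_type(alias) == effective_device_type(device_spec)


# Example values reuse the vendor and product IDs from this chapter.
alias = '{"name": "a1", "product_id": "1572", "vendor_id": "8086", "device_type": "type-PF"}'
spec_ok = '{"vendor_id": "8086", "product_id": "1572", "device_type": "type-PF"}'
spec_unset = '{"vendor_id": "8086", "product_id": "1572"}'

print(device_types_match(alias, spec_ok))
print(device_types_match(alias, spec_unset))
```

The second call compares type-PF with the implicit type-PCI default, which is the kind of mismatch that the rule is meant to prevent.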
9.3. Guidelines for configuring Nova PCI passthrough
- Do not use the devname parameter when configuring PCI passthrough, as the device name of a NIC can change. Instead, use vendor_id and product_id because they are more stable, or use the PCI device address of the NIC.
- Use the address parameter or product_id to pass through a specific Physical Function (PF). If you have multiple PFs of the same product_id, then the Compute service uses any of those devices when an alias with the same product_id is requested in the flavor. The address parameter is always unique.
- To pass through all the Virtual Functions (VFs), specify only the product_id and vendor_id of the VFs that you want to use for PCI passthrough. You must also specify the address of the VF if you are using SR-IOV for NIC partitioning and you are running OVS on a VF.
- To pass through only the VFs for a PF but not the PF itself, you can use the address parameter to specify the PCI address of the PF and product_id to specify the product ID of the VF.

Configuring the PCI device address parameter

The address parameter specifies the PCI address of the device. You can set the value of the address parameter by using either a string or a dict mapping.

- String format: If you specify the address using a string, you can include wildcards (*), as shown in the following example:

      alias = {"name": "a1", "address": "*:0a:00.*", "physical_network": "physnet1"}

- Dictionary format: If you specify the address using the dictionary format, you can include regular expression syntax, as shown in the following example:

      [pci]
      device_spec = {"address": {"domain": ".*", "bus": "02", "slot": "01", "function": "[0-2]"}, "physical_network": "net1"}

The Compute service restricts the configuration of address fields to the following maximum values:
| Address field | Maximum value |
|---|---|
| domain | 0xFFFF |
| bus | 0xFF |
| slot | 0x1F |
| function | 0x7 |
The Compute service supports PCI devices with a 16-bit address domain. The Compute service ignores PCI devices with a 32-bit address domain.
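To make the limits concrete, the following Python snippet is a minimal sketch that checks a fully specified PCI address against the maximum values listed above. It is an illustration only and does not reproduce the validation code of the Compute service. The helper names parse_pci_address and out_of_range_fields are hypothetical, and the sketch assumes an address in the domain:bus:slot.function form with hexadecimal fields and no wildcards.

```python
# Maximum values from the table above.
MAX_VALUES = {"domain": 0xFFFF, "bus": 0xFF, "slot": 0x1F, "function": 0x7}


def parse_pci_address(address: str) -> dict:
    """Split 'domain:bus:slot.function' into integer fields (hypothetical helper)."""
    domain, bus, rest = address.split(":")
    slot, function = rest.split(".")
    return {
        "domain": int(domain, 16),
        "bus": int(bus, 16),
        "slot": int(slot, 16),
        "function": int(function, 16),
    }


def out_of_range_fields(address: str) -> list:
    """Return the names of the fields that exceed their maximum value."""
    fields = parse_pci_address(address)
    return [name for name, value in fields.items() if value > MAX_VALUES[name]]


print(out_of_range_fields("0000:06:00.1"))   # every field is within range
print(out_of_range_fields("10000:06:00.1"))  # domain needs more than 16 bits
```

An address whose domain does not fit in 16 bits, such as the second example, falls into the category of devices that the Compute service ignores.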
You can optionally specify a default NUMA affinity policy for PCI passthrough devices by adding numa_policy to the configuration. For example:
alias = {"name":"a1", "product_id":"1572", "vendor_id": "8086", "device_type": "type-PF", "numa_policy": "preferred"}
You can choose one of four values for the numa_policy.
| Value | Description |
|---|---|
| required | The Compute service creates an instance that requests a PCI device only when at least one of the NUMA nodes of the instance has affinity with the PCI device. This option provides the best performance. |
| preferred | The Compute service attempts a best effort selection of PCI devices based on NUMA affinity. If this is not possible, then the Compute service schedules the instance on a NUMA node that has no affinity with the PCI device. |
| legacy | (Default) The Compute service creates instances that request a PCI device in one of the following cases: the PCI device has affinity with at least one of the NUMA nodes, or the PCI devices do not provide information about their NUMA affinities. |
| socket | The Compute service creates an instance that requests a PCI device only when at least one of the instance NUMA nodes has affinity with a NUMA node in the same host socket as the PCI device. For example, the following host architecture has two sockets, each socket has two NUMA nodes, and a PCI device is connected to one of the nodes in one of the sockets. (Figure: NUMA node affinity with NUMA node in the same host socket as the PCI device.) The Compute service can pin an instance with two NUMA nodes and the socket PCI NUMA affinity policy to any combination of host NUMA nodes that includes at least one node on the same socket as the PCI device. The only combination of host nodes that the instance cannot be pinned to is node 2 and node 3, as neither of those nodes is on the same socket as the PCI device. If the other nodes are consumed by other instances and only nodes 2 and 3 are available, the instance does not boot. |
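The socket example can be worked through by enumerating every pair of host NUMA nodes. The following Python snippet is a minimal sketch of that reasoning and not the scheduler implementation. The node-to-socket layout and the choice of node 0 as the node that the PCI device is attached to are assumptions for illustration; the text states only that the device is on the socket that does not contain nodes 2 and 3.

```python
from itertools import combinations

# Assumed layout: two sockets, two NUMA nodes per socket.
NODE_TO_SOCKET = {0: 0, 1: 0, 2: 1, 3: 1}
# Assumption: the PCI device is attached to node 0 (socket 0).
PCI_DEVICE_NODE = 0


def allowed_by_socket_policy(instance_nodes) -> bool:
    """True if at least one chosen host node shares a socket with the PCI device."""
    pci_socket = NODE_TO_SOCKET[PCI_DEVICE_NODE]
    return any(NODE_TO_SOCKET[node] == pci_socket for node in instance_nodes)


# Every way to pin a two-node instance onto the four host NUMA nodes.
pairs = list(combinations(NODE_TO_SOCKET, 2))
allowed = [pair for pair in pairs if allowed_by_socket_policy(pair)]
rejected = [pair for pair in pairs if not allowed_by_socket_policy(pair)]

print("allowed:", allowed)
print("rejected:", rejected)
```

Under these assumptions, five of the six pairs are allowed and only the pair of node 2 and node 3 is rejected, which matches the description in the table.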
9.4. Updating the control plane for PCI passthrough
To enable your cloud users to create instances with PCI devices attached, start by configuring the control plane. Configure the alias field with the correct product ID, vendor ID, and device type to pass through.
Prerequisites
- You have selected the OpenStackDataPlaneNodeSet CR that defines the nodes that you can configure PCI passthrough on. For more information about creating an OpenStackDataPlaneNodeSet CR, see Creating an OpenStackDataPlaneNodeSet CR with pre-provisioned nodes in the Deploying Red Hat OpenStack Services on OpenShift guide.
- The PCIPassthroughFilter and NUMATopologyFilter filters are enabled. These filters are enabled by default. You can verify whether they have been changed by checking the OpenStackControlPlane CR:

      oc exec nova-scheduler-0 -- grep "enabled_filters" /etc/nova/nova.conf.d/ -R
Procedure
- Open your OpenStackControlPlane custom resource (CR) file, openstack_control_plane.yaml, on your workstation.
- Add the customServiceConfig field to the nova template to specify the PCI alias for the PCI devices on the Compute nodes:

      apiVersion: core.openstack.org/v1beta1
      kind: OpenStackControlPlane
      spec:
        nova:
          apiOverride:
            route: {}
          template:
            secret: osp-secret
            apiServiceTemplate:
              replicas: 3
              customServiceConfig: |
                [pci]
                alias = {"name":"a1", "product_id":"<prod_id>", "vendor_id": "<vendor_id>", "device_type": "<device_type>"}

  - Replace <prod_id> with the product ID for the PCI device, for example, 1572.
  - Replace <vendor_id> with the vendor ID for the PCI device, for example, 8086.
  - Replace <device_type> with the type of PCI device, for example, type-PF.

  Note: You can find the product ID and vendor ID by using the lspci -nn command on a system with the PCI device installed. For more information about configuring the device_type field, see PCI passthrough device type field.
- Optional: To set a default NUMA affinity policy for PCI passthrough devices, add numa_policy to the configuration:

      [pci]
      alias = {"name":"a1", "product_id":"<prod_id>", "vendor_id": "<vendor_id>", "device_type": "<device_type>", "numa_policy": "<pci_numa_policy>"}

  - Replace <prod_id> with the product ID for the PCI device, for example, 1572.
  - Replace <vendor_id> with the vendor ID for the PCI device, for example, 8086.
  - Replace <device_type> with the type of PCI device, for example, type-PF.
  - Replace <pci_numa_policy> with a value of required, socket, preferred, or legacy. For more information, see Guidelines for configuring Nova PCI passthrough.
- Update the control plane:

      $ oc apply -f openstack_control_plane.yaml -n openstack

- Wait until RHOCP creates the resources related to the OpenStackControlPlane CR. Run the following command to check the status:

      $ oc get openstackcontrolplane -n openstack

  Example output:

      NAME                      STATUS    MESSAGE
      openstack-control-plane   Unknown   Setup started

  The OpenStackControlPlane resources are created when the status is "Setup complete".

  Tip: Append the -w option to the end of the get command to track deployment progress.

- Optional: Confirm that the control plane is deployed by reviewing the pods in the openstack namespace for each of your cells:

      $ oc get pods -n openstack

  The control plane is deployed when all the pods are either completed or running.
9.5. Creating an OpenStackDataPlaneNodeSet CR for PCI passthrough
To enable your cloud users to create instances with PCI devices attached, you must create an OpenStackDataPlaneNodeSet custom resource (CR) that groups and configures the Compute nodes that have the PCI devices to use for PCI passthrough.
This procedure applies to new data plane nodes that have not yet been provisioned. To configure, or to reconfigure, PCI devices on a data plane node that has already been provisioned, you must use the scale down procedure to unprovision the node, then use the scale up procedure to reprovision the node with the PCI device configuration. For more information, see Scaling data plane nodes in Maintaining the Red Hat OpenStack Services on OpenShift deployment.
You cannot reconfigure a subset of the nodes within a node set. If you need to do this, you must scale the node set down, and create a new node set from the previously removed nodes.
Prerequisites
- You have selected the OpenStackDataPlaneNodeSet CR that defines the nodes that you can configure PCI passthrough on. For more information about creating an OpenStackDataPlaneNodeSet CR, see Creating an OpenStackDataPlaneNodeSet CR with pre-provisioned nodes in the Deploying Red Hat OpenStack Services on OpenShift guide.
Procedure
- Create a copy of the PCI alias on the Compute node for instance migration and resize operations. To specify the PCI alias for the devices on the PCI passthrough Compute node, create or update the ConfigMap CR named nova-extra-config and set the value of the [pci] alias parameter:

      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: nova-extra-config
        namespace: openstack
      data:
        32-nova-pci-alias.conf: |
          [pci]
          alias = {"name":"a1", "product_id":"1572", "vendor_id": "8086", "device_type": "type-PF", "numa_policy": "preferred"}

  For more information about creating ConfigMap objects, see Creating and using config maps in Nodes.

  Note: The Compute node aliases must be identical to the aliases on the control plane. Therefore, if you added numa_policy to the customServiceConfig of the apiServiceTemplate in the OpenStackControlPlane custom resource (CR) file, openstack_control_plane.yaml, then you must also add it to the PCI alias in nova-extra-config.

- Under the alias parameter, set the device_spec parameter to allow nova access to your PCI device:

      alias = {"name":"a1", "product_id":"1572", "vendor_id": "8086", "device_type": "type-PF", "numa_policy": "preferred"}
      device_spec = {"vendor_id":"8086", "product_id":"1572", "address": "0000:06:"}

  Note: Ensure that you use the vendor ID specific to your PCI device type.
- Create a new OpenStackDataPlaneDeployment CR to configure the services on the data plane nodes and deploy the data plane, and save it to a file named compute_pci_alias_deploy.yaml on your workstation:

      apiVersion: dataplane.openstack.org/v1beta1
      kind: OpenStackDataPlaneDeployment
      metadata:
        name: compute-pci-alias

- In the compute_pci_alias_deploy.yaml CR, specify nodeSets to include all the OpenStackDataPlaneNodeSet CRs that you want to deploy. Ensure that you include the OpenStackDataPlaneNodeSet CR that you selected as a prerequisite. That OpenStackDataPlaneNodeSet CR defines the nodes that you want to configure:

      apiVersion: dataplane.openstack.org/v1beta1
      kind: OpenStackDataPlaneDeployment
      metadata:
        name: compute-pci-alias
      spec:
        nodeSets:
          - openstack-edpm
          - compute-pci-alias
          - ...
          - <nodeSet_name>

  - Replace <nodeSet_name> with the names of the OpenStackDataPlaneNodeSet CRs that you want to include in your data plane deployment.

  Warning: If your deployment has more than one node set, changes to the nova-extra-config.yaml ConfigMap might directly affect more than one node set, depending on how the node sets and the DataPlaneServices are configured. To check if a node set uses the nova-extra-config ConfigMap and therefore will be affected by the reconfiguration, complete the following steps:

  - Check the services list of the node set and find the name of the DataPlaneService that points to nova.
  - Ensure that the value of the edpmServiceType field of the DataPlaneService is set to nova.

  If the dataSources list of the DataPlaneService contains a configMapRef named nova-extra-config, then this node set uses this ConfigMap and therefore will be affected by the configuration changes in this ConfigMap. If some of the node sets that are affected should not be reconfigured, you must create a new DataPlaneService pointing to a separate ConfigMap for these node sets.
- Save the compute_pci_alias_deploy.yaml deployment file.
- Deploy the data plane:

      $ oc create -f compute_pci_alias_deploy.yaml

- Verify that the data plane is deployed:

      $ oc get openstackdataplanenodeset
      NAME                STATUS   MESSAGE
      compute-pci-alias   True     Deployed

- Access the remote shell for openstackclient and verify that the deployed Compute nodes are visible on the control plane:

      $ oc rsh -n openstack openstackclient
      $ openstack hypervisor list
- To enable IOMMU in the server BIOS of the Compute nodes to support PCI passthrough, open the OpenStackDataPlaneNodeSet CR definition file for the node set you want to update, for example, my_data_plane_node_set.yaml.
- Add the required configuration or modify the existing configuration in my_data_plane_node_set.yaml. Place the configuration under ansibleVars. The following example enables an Intel IOMMU:

      apiVersion: dataplane.openstack.org/v1beta1
      kind: OpenStackDataPlaneNodeSet
      metadata:
        name: my-data-plane-node-set
      spec:
        …
        nodeTemplate:
          …
          ansible:
            ansibleVars:
              edpm_kernel_args: "default_hugepagesz=1GB hugepagesz=1G hugepages=64 intel_iommu=on iommu=pt tsx=off isolcpus=2-11,14-23 vfio-pci.ids=<pci_device_id> rd.driver.pre=vfio-pci"

  - Replace <pci_device_id> with the PCI device ID for the GPU you are using, for example, 10de:1eb8. Ensure that you use the device ID specific to the GPU.

  Note: When you first add the KernelArgs parameter to the configuration of a role, the control plane nodes are automatically rebooted. If required, you can disable the automatic rebooting of nodes and instead perform node reboots manually after each deployment.
- Save the OpenStackDataPlaneNodeSet CR definition file.
- Apply the updated OpenStackDataPlaneNodeSet CR configuration:

      $ oc apply -f my_data_plane_node_set.yaml -n openstack

- Verify that the data plane resource has been updated:

      $ oc get openstackdataplanenodeset

  Sample output:

      NAME                     STATUS   MESSAGE
      my-data-plane-node-set   False    Deployment not started

- Create a file on your workstation to define the OpenStackDataPlaneDeployment CR, for example, my_data_plane_deploy.yaml:

      apiVersion: dataplane.openstack.org/v1beta1
      kind: OpenStackDataPlaneDeployment
      metadata:
        name: my-data-plane-deploy

  Tip: Give the definition file and the OpenStackDataPlaneDeployment CR a unique and descriptive name that indicates the purpose of the modified node set.

- Add the OpenStackDataPlaneNodeSet CR that you modified:

      spec:
        nodeSets:
          - my-data-plane-node-set

- Save the OpenStackDataPlaneDeployment CR deployment file.
- Deploy the modified OpenStackDataPlaneNodeSet CR:

      $ oc create -f my_data_plane_deploy.yaml -n openstack

  You can view the Ansible logs while the deployment executes:

      $ oc get pod -l app=openstackansibleee -n openstack -w
      $ oc logs -l app=openstackansibleee -n openstack -f \
          --max-log-requests 10

- Verify that the modified OpenStackDataPlaneNodeSet CR is deployed:

      $ oc get openstackdataplanedeployment -n openstack

  Sample output:

      NAME                     STATUS   MESSAGE
      my-data-plane-node-set   True     Setup Complete

- Repeat the oc get command until you see the NodeSet Ready message:

      $ oc get openstackdataplanenodeset -n openstack

  Sample output:

      NAME                     STATUS   MESSAGE
      my-data-plane-node-set   True     NodeSet Ready

  For more information on the meaning of the returned status, see Data plane conditions and states in Deploying Red Hat OpenStack Services on OpenShift.
- Create and configure the flavors that your cloud users can use to request the PCI devices. The following example requests two devices, each with a vendor ID of 8086 and a product ID of 1572, using the a1 alias defined earlier in this procedure:

      $ openstack --os-compute-api=2.86 flavor set \
          --property "pci_passthrough:alias"="a1:2" device_passthrough

- Optional: To override the default NUMA affinity policy for PCI passthrough devices, you can add the NUMA affinity policy property key to the flavor or the image:

  - To override the default NUMA affinity policy by using the flavor, add the hw:pci_numa_affinity_policy property key:

        $ openstack --os-compute-api=2.86 flavor set \
            --property "hw:pci_numa_affinity_policy"="required" \
            device_passthrough

    For more information about the valid values for hw:pci_numa_affinity_policy, see Flavor metadata.

  - To override the default NUMA affinity policy by using the image, add the hw_pci_numa_affinity_policy property key:

        $ openstack image set \
            --property hw_pci_numa_affinity_policy=required \
            device_passthrough_image

    Note: If you set the NUMA affinity policy on both the image and the flavor, the property values must match. The flavor setting takes precedence over the image and default settings. Therefore, the configuration of the NUMA affinity policy on the image only takes effect if the property is not set on the flavor.
Verification
To verify that PCI passthrough is working, you must instruct an OpenStack user to create an instance with an attached PCI device, and then log directly into the instance to see that the PCI device is accessible. You can provide the following instructions:
- Create an instance with a PCI passthrough device:

      $ openstack server create --flavor device_passthrough \
          --image <image> --wait test-pci

- Log in to the instance as a cloud user. For more information, see Connecting to an instance in Creating and managing instances.
- To verify that the PCI device is accessible from the instance, enter the following command from the instance:

      $ lspci -nn | grep <device_name>
9.6. Configuring One Time Use devices
The Compute service (nova) supports the marking of devices as One Time Use (OTU) to reserve them for a single use of a single instance.
Prerequisites
- You have the oc command line tool installed on your workstation.
- You are logged on to a workstation that has access to the RHOSO control plane as a user with cluster-admin privileges.
- You have selected the OpenStackDataPlaneNodeSet CR that defines the nodes on which you want to configure One Time Use PCI devices. For more information about creating an OpenStackDataPlaneNodeSet CR, see Creating an OpenStackDataPlaneNodeSet CR with pre-provisioned nodes in the Deploying Red Hat OpenStack Services on OpenShift guide.
- You have configured PCI device tracking in the Placement service. For more information, see Enabling PCI device tracking in the Placement service.
Procedure
- Create or update the ConfigMap custom resource (CR) named nova-extra-config.yaml.
- Add or edit the device_spec of the device you want to tag as an OTU device by adding the one_time_use tag to it.

  The following is an example of device_spec with this tag added:

      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: nova-extra-config
        namespace: openstack
      data:
        32-nova-pci-alias.conf: |
          [pci]
          alias = {"name":"a1", "product_id":"1572", "vendor_id": "8086", "device_type": "type-PF", "numa_policy": "preferred"}
          device_spec = {"vendor_id":"8086", "product_id":"1572", "address": "0000:06:", "one_time_use": true}

  Note: The device_spec configuration option can be defined multiple times, and Red Hat OpenStack Services on OpenShift (RHOSO) merges each of these definitions into a single list of device_spec values. This means a device_spec value cannot be overwritten by subsequent device_spec definitions. When you are configuring a device to be an OTU device, the one_time_use tag must be defined in the configuration file that originally defined the device_spec.

  For example, Creating an OpenStackDataPlaneNodeSet CR for PCI passthrough defines how to enable cloud users to create instances with PCI devices attached. You would typically add the tag to the device_spec at that stage.

  For more information about creating ConfigMap objects, see Creating and using config maps in Nodes.
- Save the nova-extra-config.yaml file.
- Create a new OpenStackDataPlaneDeployment CR to configure the services on the data plane nodes and deploy the data plane. Save the CR to a file named compute_otu_devices_deploy.yaml on your workstation:

      apiVersion: dataplane.openstack.org/v1beta1
      kind: OpenStackDataPlaneDeployment
      metadata:
        name: compute-otu-devices

- In the compute_otu_devices_deploy.yaml file, specify nodeSets to include all the OpenStackDataPlaneNodeSet CRs you want to deploy. Ensure that you include the OpenStackDataPlaneNodeSet CR that you selected as a prerequisite. That OpenStackDataPlaneNodeSet CR defines the nodes that you want to configure as OTU devices.

  Warning: You cannot reconfigure a subset of the nodes within a node set. If you need to do this, you must scale the node set down, and create a new node set from the previously removed nodes.

  Warning: If your deployment has more than one node set, changes to the nova-extra-config.yaml ConfigMap might directly affect more than one node set, depending on how the node sets and the DataPlaneServices are configured. To check if a node set uses the nova-extra-config ConfigMap and therefore will be affected by the reconfiguration, complete the following steps:

  - Check the services list of the node set and find the name of the DataPlaneService that points to nova. Ensure that the value of the edpmServiceType field of the DataPlaneService is set to nova.
  - If the dataSources list of the DataPlaneService contains a configMapRef named nova-extra-config, then this node set uses this ConfigMap and therefore will be affected by the configuration changes in this ConfigMap. If some of the node sets that are affected should not be reconfigured, you must create a new DataPlaneService pointing to a separate ConfigMap for these node sets and use that custom service in the required node sets instead.

      apiVersion: dataplane.openstack.org/v1beta1
      kind: OpenStackDataPlaneDeployment
      metadata:
        name: compute-otu-devices
      spec:
        nodeSets:
          - openstack-edpm
          - ...
          - <nodeSet_name>

  - Replace <nodeSet_name> with the names of the OpenStackDataPlaneNodeSet CRs that you want to include in your data plane deployment.
- Save the compute_otu_devices_deploy.yaml deployment file.
- Deploy the data plane:

      $ oc create -f compute_otu_devices_deploy.yaml

- Verify that the data plane is deployed:

      $ oc get openstackdataplanenodeset
      NAME                  STATUS   MESSAGE
      compute-otu-devices   True     Deployed

- Access the remote shell for openstackclient and verify that the deployed Compute nodes are visible on the control plane:

      $ oc rsh -n openstack openstackclient
      $ openstack hypervisor list
9.7. Removing One Time Use device reservation
Devices in a One Time Use (OTU) reserved state cannot be allocated to another instance until the reserved state is cleared. Devices reserved as OTU devices have the HW_PCI_ONE_TIME_USE trait. You can use this trait to find and clear the reserved state.
Prerequisites
- You have the oc command line tool installed on your workstation.
- You are logged on to a workstation that has access to the RHOSO control plane as a user with cluster-admin privileges.
Procedure
- Determine the devices that have the HW_PCI_ONE_TIME_USE trait:

      $ openstack resource provider list --required HW_PCI_ONE_TIME_USE

  The following is an example output for this command:

      $ openstack resource provider list --required HW_PCI_ONE_TIME_USE
      +--------------------------------------+--------------------+------------+--------------------------------------+--------------------------------------+
      | uuid                                 | name               | generation | root_provider_uuid                   | parent_provider_uuid                 |
      +--------------------------------------+--------------------+------------+--------------------------------------+--------------------------------------+
      | b9e67d7d-43db-49c7-8ce8-803cad08e656 | compute-01:00:01.0 | 39         | 2ee402e8-c5c6-4586-9ac7-58e7594d27d1 | 2ee402e8-c5c6-4586-9ac7-58e7594d27d1 |
      +--------------------------------------+--------------------+------------+--------------------------------------+--------------------------------------+

- For each device in the list, perform the following tasks:

  - Confirm that the value of the reserved attribute is 1 and the value of the used attribute is 0:

        $ openstack resource provider inventory list <device_uuid>

    Replace <device_uuid> with the UUID of the device.

    The following is an example output for this command:

        $ openstack resource provider inventory list b9e67d7d-43db-49c7-8ce8-803cad08e656
        +----------------------+------------------+----------+----------+----------+-----------+-------+------+
        | resource_class       | allocation_ratio | min_unit | max_unit | reserved | step_size | total | used |
        +----------------------+------------------+----------+----------+----------+-----------+-------+------+
        | CUSTOM_PCI_1B36_0100 | 1.0              | 1        | 1        | 1        | 1         | 1     | 0    |
        +----------------------+------------------+----------+----------+----------+-----------+-------+------+

    Important: Do not clear the reserved state of the device if the value of the used attribute is not 0.

  - Set the value of the reserved attribute to 0:

        $ openstack resource provider inventory set --amend \
            --resource <device_resource_class>:reserved=0 \
            <device_uuid>

    - Replace <device_resource_class> with the resource_class of the device.
    - Replace <device_uuid> with the UUID of the device.
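If you handle many devices, the check-then-clear logic of this procedure can be expressed as a small helper. The following Python snippet is a minimal sketch under stated assumptions: the function names can_clear_reservation and clear_reservation_command are hypothetical, the inventory row is assumed to be available as a dictionary with the column names from the example output, and the helper only builds the command shown in this procedure without running it.

```python
def can_clear_reservation(inventory: dict) -> bool:
    """True only when the device is reserved (1) and not in use (0)."""
    return inventory["reserved"] == 1 and inventory["used"] == 0


def clear_reservation_command(device_uuid: str, inventory: dict) -> list:
    """Build the 'inventory set --amend' command from this procedure as an argument list."""
    if not can_clear_reservation(inventory):
        raise ValueError("device is still in use or is not reserved; do not clear it")
    return [
        "openstack", "resource", "provider", "inventory", "set", "--amend",
        "--resource", f"{inventory['resource_class']}:reserved=0",
        device_uuid,
    ]


# Values taken from the example output in this section.
row = {"resource_class": "CUSTOM_PCI_1B36_0100", "reserved": 1, "used": 0}
uuid = "b9e67d7d-43db-49c7-8ce8-803cad08e656"

print(" ".join(clear_reservation_command(uuid, row)))
```

Raising an error when the used attribute is not 0 mirrors the Important note above, so a device that is still attached to an instance is never cleared by mistake.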