Chapter 13. Configuring virtual GPUs for instances
To support GPU-based rendering on your instances, you can define and manage virtual GPU (vGPU) resources according to your available physical GPU devices and your hypervisor type.
You can use this configuration to effectively spread the rendering workloads between all your physical GPU devices, and to control the scheduling of your vGPU-enabled instances.
To enable vGPU in the Compute service (nova), perform the following tasks:
- Identify the nodes on which you want to configure vGPUs.
- Retrieve the PCI address for each physical GPU on each Compute node, or for each SR-IOV virtual function (VF) if the GPU supports SR-IOV.
- Configure the GPU profiles on each Compute node.
Each instance hosted on the configured Compute nodes can support GPU workloads with vGPU devices that correspond to the physical GPU devices.
The Compute service (nova) tracks the number of vGPU devices that are available for each GPU profile you define on each host. The Compute service schedules instances to these hosts, attaches the devices, and monitors the use of vGPU. When an instance is deleted, the Compute service adds the vGPU devices back to the available pool.
Red Hat enables the use of NVIDIA vGPU in RHOSO without the requirement for support exceptions. However, Red Hat does not provide technical support for the NVIDIA vGPU drivers. The NVIDIA vGPU drivers are shipped and supported by NVIDIA. You require an NVIDIA Certified Support Services subscription to obtain NVIDIA Enterprise Support for NVIDIA vGPU software. For issues that result from the use of NVIDIA vGPUs where you are unable to reproduce the issue on a supported component, the following support policies apply:
- When Red Hat does not suspect that the third-party component is involved in the issue, the normal Scope of Support and Red Hat SLA apply.
- When Red Hat suspects that the third-party component is involved in the issue, the customer will be directed to NVIDIA in line with the Red Hat third party support and certification policies. For more information, see the Knowledge Base article Obtaining Support from NVIDIA.
13.1. Supported configurations and limitations for vGPU devices
Deploying virtual GPUs (vGPUs) enables high-performance graphics and compute capabilities for cloud instances. Successful implementation requires verifying hardware compatibility with NVIDIA’s supported products and adhering to specific operational constraints regarding instance migration and host reboots.
- Supported GPU cards
- For a list of supported NVIDIA GPU cards, see Virtual GPU Software Supported Products on the NVIDIA website.
- Limitations when using vGPU devices
- Each instance can use only one vGPU resource.
- Evacuation of vGPU instances is not supported.
  If you need to reboot the Compute node that hosts the vGPU instances, the vGPUs are not automatically reassigned to the recreated instances. You must either cold migrate the instances before you reboot the Compute node, or manually allocate each vGPU to the correct instance after the reboot. To manually allocate each vGPU, you must retrieve the mdev UUID from the instance XML for each vGPU instance that runs on the Compute node before you reboot. Use the following command to discover the mdev UUID for each instance:

  ```console
  # virsh dumpxml <instance_name> | grep mdev
  ```

  Replace `<instance_name>` with the libvirt instance name, `OS-EXT-SRV-ATTR:instance_name`, returned in a `/servers` request to the Compute API.
- By default, vGPU types on Compute hosts are not exposed to API users. To expose the vGPU types on Compute hosts to API users, you must configure resource provider traits and create flavors that require the traits. Alternatively, if you only have one vGPU type, you can grant access by adding the hosts to a host aggregate. For more information, see Creating and managing host aggregates.
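The virsh lookup above returns the whole hostdev element; extracting just the UUID can be scripted. The following is a minimal sketch, assuming libvirt emits the mdev hostdev in its usual form with the UUID on a nested address element; the helper name is illustrative, and on a Compute node you would pipe real `virsh dumpxml` output into it:

```shell
# extract_mdev_uuids: read libvirt domain XML on stdin and print the UUID of
# each mdev hostdev (illustrative helper, not part of the product tooling).
extract_mdev_uuids() {
  grep -A 3 "type='mdev'" | grep -o "uuid='[^']*'" | cut -d"'" -f2
}

# Example with a minimal XML fragment of the expected shape:
xml="<hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci'>
  <source>
    <address uuid='c2177883-f1bb-47f0-914d-32a22e3a8804'/>
  </source>
</hostdev>"
printf '%s\n' "$xml" | extract_mdev_uuids
```

On a node, `virsh dumpxml <instance_name> | extract_mdev_uuids` prints one UUID per vGPU attached to the instance.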
- If you use NVIDIA accelerator hardware, you must comply with the NVIDIA licensing requirements. For example, NVIDIA vGPU GRID requires a licensing server. For more information about the NVIDIA licensing requirements, see NVIDIA License Server Release Notes on the NVIDIA website.
13.2. Preparing to configure the Compute service for vGPU
Before you configure the Compute service (nova) for vGPU, you must prepare the data plane nodes that you want to use for vGPU, and you must download and install the NVIDIA device driver.
Procedure
Access the remote shell for `openstackclient`:

```console
$ oc rsh openstackclient
```

Identify a node that you want to use for vGPU:

Retrieve the IP address of the Compute node that you want to use for vGPU:

```console
$ openstack hypervisor list
```

Use SSH to connect to the data plane node:

```console
$ ssh <node_ipaddress>
```

Create the file `/etc/modprobe.d/blacklist-nouveau.conf`. Disable the nouveau driver by adding the following configuration to `blacklist-nouveau.conf`:

```
blacklist nouveau
options nouveau modeset=0
```

Regenerate the `initramfs`:

```console
$ dracut --force
$ grub2-mkconfig -o /boot/grub2/grub.cfg --update-bls-cmdline
```

- Download and install the NVIDIA driver from the NVIDIA portal. For more information, see NVIDIA DOCS HUB.

Reboot the node:

```console
$ sudo reboot
```
- Repeat this procedure for all nodes that you want to allocate for vGPU instances.
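After the reboot, you can sanity-check that nouveau stayed out of the kernel. The following sketch writes the check as a small stdin filter so it can be demonstrated on canned output; the function name is illustrative, and on the node you would pipe real `lsmod` output into it:

```shell
# module_loaded: succeed if the named module appears in lsmod-style output
# read from stdin (the first column is the module name).
module_loaded() {
  awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Canned sample output from a node where the NVIDIA driver replaced nouveau:
sample="Module                  Size  Used by
nvidia              39059456  12
drm                   577536   3"

# On a real node: lsmod | module_loaded nouveau && echo "nouveau still loaded"
printf '%s\n' "$sample" | module_loaded nouveau || echo "nouveau not loaded"
```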
13.3. Configuring the Compute service for vGPUs
To configure the Compute service (nova) for virtual GPUs (vGPUs), you must retrieve the vGPU types that correspond to the physical GPU devices in your environment, and then configure the Compute service to enable those vGPU types.
You can configure only whole node sets. Reconfiguring a subset of the nodes within a node set is not supported. If you need to reconfigure a subset of nodes within a node set, you must scale the node set down, and create a new node set from the previously removed nodes.
Prerequisites

- The `oc` command line tool is installed on your workstation.
- You are logged in to Red Hat OpenStack Services on OpenShift (RHOSO) as a user with `cluster-admin` privileges.
- You have selected the `OpenStackDataPlaneNodeSet` CR that defines the nodes that you can configure vGPU on. For more information about creating an `OpenStackDataPlaneNodeSet` CR, see Creating an OpenStackDataPlaneNodeSet CR with pre-provisioned nodes in the Deploying Red Hat OpenStack Services on OpenShift guide.
Procedure
Virtual GPUs are mediated devices. Retrieve the PCI address for each device that can create mediated devices on each Compute node:
```console
$ ls /sys/class/mdev_bus/
```

Note: The PCI address of the GPU, or of the GPU SR-IOV virtual function (VF) that can create vGPUs, is used as the device driver directory name, for example, 0000:84:00.0. In this procedure, the vGPU-capable resource is called an mdev device.

Note: Recent generations of NVIDIA cards support SR-IOV. Refer to the NVIDIA documentation to discover whether your GPU is SR-IOV-capable.
Review the supported mdev types for each available pGPU device on each Compute node to discover the available vGPU types:
```console
$ ls /sys/class/mdev_bus/<mdev_device>/mdev_supported_types
```

Replace `<mdev_device>` with the PCI address of the mdev device, for example, 0000:84:00.0. For example, the following Compute node has 4 pGPUs, and each pGPU supports the same 11 vGPU types:

```console
[root@computegpu-0 ~]# ls /sys/class/mdev_bus/0000:84:00.0/mdev_supported_types
NVIDIA-35  NVIDIA-36  NVIDIA-37  NVIDIA-38  NVIDIA-39  NVIDIA-40  NVIDIA-41  NVIDIA-42  NVIDIA-43  NVIDIA-44  NVIDIA-45
[root@computegpu-0 ~]# ls /sys/class/mdev_bus/0000:85:00.0/mdev_supported_types
NVIDIA-35  NVIDIA-36  NVIDIA-37  NVIDIA-38  NVIDIA-39  NVIDIA-40  NVIDIA-41  NVIDIA-42  NVIDIA-43  NVIDIA-44  NVIDIA-45
[root@computegpu-0 ~]# ls /sys/class/mdev_bus/0000:86:00.0/mdev_supported_types
NVIDIA-35  NVIDIA-36  NVIDIA-37  NVIDIA-38  NVIDIA-39  NVIDIA-40  NVIDIA-41  NVIDIA-42  NVIDIA-43  NVIDIA-44  NVIDIA-45
[root@computegpu-0 ~]# ls /sys/class/mdev_bus/0000:87:00.0/mdev_supported_types
NVIDIA-35  NVIDIA-36  NVIDIA-37  NVIDIA-38  NVIDIA-39  NVIDIA-40  NVIDIA-41  NVIDIA-42  NVIDIA-43  NVIDIA-44  NVIDIA-45
```
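Enumerating the supported types can also be scripted. The following sketch reads the same `mdev_supported_types` layout; the helper name is illustrative, and it is demonstrated here against a temporary fake sysfs tree, whereas on a Compute node you would point it at a device under `/sys/class/mdev_bus`:

```shell
# list_mdev_types: print the mdev types advertised under one device directory
# that follows the /sys/class/mdev_bus/<pci>/mdev_supported_types layout.
list_mdev_types() {
  ls "$1/mdev_supported_types" 2>/dev/null
}

# Demonstration against a fake sysfs tree:
root=$(mktemp -d)
mkdir -p "$root/0000:84:00.0/mdev_supported_types/NVIDIA-35" \
         "$root/0000:84:00.0/mdev_supported_types/NVIDIA-36"
types=$(list_mdev_types "$root/0000:84:00.0")
echo "$types"
```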
Create or update the `ConfigMap` CR named `nova-extra-config.yaml` and set the values of the parameters under `[devices]`:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nova-extra-config
  namespace: openstack
data:
  34-nova-vgpu.conf: |
    [devices]
    enabled_mdev_types = nvidia-35, nvidia-36
```

For more information about creating `ConfigMap` objects, see Creating and using config maps.

Optional: To configure more than one vGPU type, map the supported vGPU types to the pGPUs:

```ini
[devices]
enabled_mdev_types = nvidia-35, nvidia-36

[mdev_nvidia-35]
device_addresses = 0000:84:00.0,0000:85:00.0

[mdev_nvidia-36]
device_addresses = 0000:86:00.0
```

- The nvidia-35 vGPU type is supported by the pGPUs at PCI addresses 0000:84:00.0 and 0000:85:00.0.
- The nvidia-36 vGPU type is supported only by the pGPU at PCI address 0000:86:00.0.
Create a new `OpenStackDataPlaneDeployment` CR to configure the services on the data plane nodes and deploy the data plane, and save it to a file named `compute_vgpu_deploy.yaml` on your workstation:

```yaml
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: compute-vgpu
```

In the `compute_vgpu_deploy.yaml` CR, specify `nodeSets` to include all the `OpenStackDataPlaneNodeSet` CRs you want to deploy. Ensure that you include the `OpenStackDataPlaneNodeSet` CR that you selected as a prerequisite. That `OpenStackDataPlaneNodeSet` CR defines the nodes that you want to use for vGPU.

Warning: If your deployment has more than one node set, changes to the `nova-extra-config.yaml` `ConfigMap` might directly affect more than one node set, depending on how the node sets and the `DataPlaneServices` are configured. To check whether a node set uses the `nova-extra-config` `ConfigMap`, and therefore will be affected by the reconfiguration, complete the following steps:

- Check the services list of the node set and find the name of the `DataPlaneService` that points to nova. Ensure that the value of the `edpmServiceType` field of the `DataPlaneService` is set to `nova`.
- If the `dataSources` list of the `DataPlaneService` contains a `configMapRef` named `nova-extra-config`, then this node set uses this `ConfigMap` and therefore will be affected by the configuration changes in this `ConfigMap`. If some of the affected node sets should not be reconfigured, you must create a new `DataPlaneService` that points to a separate `ConfigMap` for these node sets.

```yaml
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: compute-vgpu
spec:
  nodeSets:
    - openstack-edpm
    - compute-vgpu
    - ...
    - <nodeSet_name>
```

Replace `<nodeSet_name>` with the names of the `OpenStackDataPlaneNodeSet` CRs that you want to include in your data plane deployment.
Save the `compute_vgpu_deploy.yaml` deployment file.

Deploy the data plane:

```console
$ oc create -f compute_vgpu_deploy.yaml
```

Verify that the data plane is deployed:

```console
$ oc get openstackdataplanenodeset
NAME           STATUS   MESSAGE
compute-vgpu   True     Deployed
```

Tip: Append the `-w` option to the end of the get command to track deployment progress.

Access the remote shell for `openstackclient` and verify that the deployed Compute nodes are visible on the control plane:

```console
$ oc rsh -n openstack openstackclient
$ openstack hypervisor list
```

- Optional: Enable SR-IOV VFs of the GPUs. For more information, see Preparing virtual function for SRIOV vGPU on the NVIDIA DOCS HUB.
13.4. Configuring NVIDIA Multi-Instance GPU (MIG) with RHOSO
You can partition a single physical GPU into multiple isolated GPU instances for predictable performance and fault isolation for those partitions on Red Hat OpenStack Services on OpenShift (RHOSO) Compute nodes.
Prerequisites
- SR-IOV is enabled in the BIOS.
- You have completed the steps in Preparing to configure the Compute service for vGPU and Configuring the Compute service for vGPUs.
Procedure
Access the remote shell for `openstackclient`:

```console
$ oc rsh openstackclient
```

Identify the Compute node that you want to use for MIG:

Retrieve the IP address of the Compute node that you want to use for MIG:

```console
$ openstack hypervisor list
```

Use SSH to connect to the data plane node:

```console
$ ssh <node_ipaddress>
```
Identify the PCI address of your NVIDIA GPU:

```console
$ sudo nvidia-smi -L
```

Enable MIG mode on your GPU:

```console
$ sudo nvidia-smi -i <gpu_id> -mig 1
```

Optional: You can query the current MIG mode of your GPU:

```console
$ sudo nvidia-smi -i <gpu_id> --query-gpu=pci.bus_id,mig.mode.current --format=csv
```

where:

`<gpu_id>`: Specifies the PCI ID of your NVIDIA card.

Example output:

```
pci.bus_id, mig.mode.current
00000000:17:00.0, Enabled
```
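The CSV form of this query is convenient for scripting. The following sketch flags GPUs where MIG is not yet enabled; it is demonstrated on canned output of the same two-column shape, with a hypothetical second GPU added:

```shell
# migs_not_enabled: read the two-column CSV produced by the query above on
# stdin and print the bus ID of every GPU whose MIG mode is not "Enabled".
migs_not_enabled() {
  awk -F', ' 'NR > 1 && $2 != "Enabled" { print $1 }'
}

csv="pci.bus_id, mig.mode.current
00000000:17:00.0, Enabled
00000000:65:00.0, Disabled"
printf '%s\n' "$csv" | migs_not_enabled
```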
Check the partitions that are available on your NVIDIA GPU:
```console
$ sudo nvidia-smi mig -lgip
```

Example output:

```
+-------------------------------------------------------------------------------+
| GPU instance profiles:                                                        |
| GPU   Name             ID    Instances   Memory     P2P    SM    DEC   ENC    |
|                              Free/Total   GiB              CE    JPEG  OFA    |
|===============================================================================|
|   0  MIG 1g.6gb        14     4/4        5.81       No     14     1     0     |
|                                                             1     0     0     |
+-------------------------------------------------------------------------------+
|   0  MIG 1g.6gb+me     21     1/1        5.81       No     14     1     0     |
|                                                             1     1     1     |
+-------------------------------------------------------------------------------+
|   0  MIG 2g.12gb        5     2/2       11.75       No     28     2     0     |
|                                                             2     0     0     |
+-------------------------------------------------------------------------------+
|   0  MIG 2g.12gb+me     6     1/1       11.75       No     28     2     0     |
|                                                             2     1     1     |
+-------------------------------------------------------------------------------+
|   0  MIG 4g.24gb        0     1/1       23.62       No     56     4     0     |
|                                                             4     1     1     |
+-------------------------------------------------------------------------------+
```

Create MIG partitions based on available profile IDs:

```console
$ sudo nvidia-smi mig -cgi <ID>,<ID> -C
```

where:

`<ID>,<ID>`: Specifies a comma-separated list of the profiles that you want to create.

Example:

```console
$ sudo nvidia-smi mig -cgi 14,14,14,14 -C
```

Example output:

```
Successfully created GPU instance ID  3 on GPU  0 using profile MIG 1g.6gb (ID 14)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID  3 using profile MIG 1g.6gb (ID  0)
Successfully created GPU instance ID  4 on GPU  0 using profile MIG 1g.6gb (ID 14)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID  4 using profile MIG 1g.6gb (ID  0)
Successfully created GPU instance ID  5 on GPU  0 using profile MIG 1g.6gb (ID 14)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID  5 using profile MIG 1g.6gb (ID  0)
Successfully created GPU instance ID  6 on GPU  0 using profile MIG 1g.6gb (ID 14)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID  6 using profile MIG 1g.6gb (ID  0)
```
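When scripting partition creation, the GPU instance IDs reported in that output can be captured for later use. The following is a sketch that matches the message format shown above; the helper name is illustrative:

```shell
# created_gi_ids: read `nvidia-smi mig -cgi ... -C` output on stdin and print
# the IDs of the GPU instances that were created (one per line). Lines about
# compute instances do not match the anchored pattern and are ignored.
created_gi_ids() {
  sed -n 's/^Successfully created GPU instance ID *\([0-9][0-9]*\).*/\1/p'
}

out="Successfully created GPU instance ID  3 on GPU  0 using profile MIG 1g.6gb (ID 14)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID  3 using profile MIG 1g.6gb (ID  0)
Successfully created GPU instance ID  4 on GPU  0 using profile MIG 1g.6gb (ID 14)"
printf '%s\n' "$out" | created_gi_ids
```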
Create a systemd unit file. Changes made directly to the Compute node are non-persistent, so a unit file is required to persistently recreate the partitions on reboot.
Create a script called `nvidia_mig_setup.sh` in the `/usr/local/bin` directory that enables MIG and the partitions required for your environment.

For example:

```bash
#!/bin/bash
udevadm settle
sleep 5
/usr/lib/nvidia/sriov-manage -e <gpu_id>
nvidia-smi -i 0 -mig 1
nvidia-smi mig -dci && nvidia-smi mig -dgi
nvidia-smi mig -cgi 14,14,14,14 -C
```

where:

`<gpu_id>`: Specifies the PCI ID of your NVIDIA card.

Make the script executable:

```console
$ chmod +x /usr/local/bin/nvidia_mig_setup.sh
```

Create a systemd unit file called `nvidia-mig.service` in `/etc/systemd/system` and add the following service configuration:

```ini
[Unit]
Description=NVIDIA MIG Partitioning Setup
Before=edpm_libvirt.target
After=systemd-udevd.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/nvidia_mig_setup.sh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
```

Reload systemd to recognize the new service:

```console
$ sudo systemctl daemon-reload
```

Enable the service so that it starts on boot:

```console
$ sudo systemctl enable nvidia-mig.service
```

Note: To avoid warnings such as `persistence mode is disabled`, run the command `sudo nvidia-smi -i <gpu_id> -pm 1`.
Note the virtual GPU type name for your instance profile.

| Virtual GPU type | Corresponding GPU instance profile |
| ---------------- | ---------------------------------- |
| A30-4-24C        | MIG 4g.24gb                        |
| A30-2-12C        | MIG 2g.12gb                        |
| A30-1-6C         | MIG 1g.6gb                         |
| A30-1-6CME       | MIG 1g.6gb+me                      |

For more information, see NVIDIA A30 and NVIDIA A30X Virtual GPU Types.
Use the virtual GPU type name and the PCI address of your GPU that you identified in previous steps to list the mdev types:

```bash
VGPU_TYPE=<vgpu_type>
PF=<gpu_id>
for v in $(ls -d /sys/bus/pci/devices/$PF/virtfn*); do
  VF=$(readlink -f $v | xargs basename)
  for m in $(grep -El "${VGPU_TYPE}\$" /sys/bus/pci/devices/$VF/mdev_supported_types/nvidia-*/name); do
    d=$(dirname $m)
    if (( $(cat $d/available_instances) > 0 )); then
      echo $d
    fi
  done
done
```

where:

`<vgpu_type>`: Specifies a human-readable virtual GPU type name, for example `A30-1-6C`.

`<gpu_id>`: Specifies the PCI ID of your NVIDIA card.
Update the Nova `ConfigMap` CR file named `nova-extra-config.yaml` and add the mdev types.

For example:

```ini
[devices]
enabled_mdev_types = <mdev_type>

[mdev_nvidia_689]
device_addresses = <sriov_ids>
```

where:

`<mdev_type>`: Specifies a comma-delimited list of the mdev types that match the virtual GPU type that you want to enable, for example `nvidia-689`.

`<sriov_ids>`: Specifies a comma-delimited list of the SR-IOV device addresses that you obtained in the previous step.
13.5. Setting the maximum number of vGPUs that an SR-IOV NVIDIA GPU can create
If you are using NVIDIA SR-IOV GPUs, the Compute service (nova) cannot discover the maximum number of virtual GPUs (vGPUs) those GPUs can create. You must retrieve this number from NVIDIA and set the max_instances option to define the maximum number of vGPUs your SR-IOV NVIDIA GPU can create.
You cannot reconfigure a subset of the nodes within a node set. If you need to do this, you must scale the node set down, and create a new node set from the previously removed nodes.
Prerequisites
- You know whether your NVIDIA GPU supports SR-IOV and how many Virtual Functions (VFs) it supports. For example, the NVIDIA L4 GPU Accelerator provides SR-IOV support for 32 VFs. For more information, see www.nvidia.com.
- You have the `oc` command line tool installed on your workstation.
- You are logged on to a workstation that has access to the RHOSO control plane as a user with `cluster-admin` privileges.
- You have selected the `OpenStackDataPlaneNodeSet` CR that defines the nodes on which you want to configure the maximum number of vGPUs for your SR-IOV NVIDIA GPU. For more information about creating an `OpenStackDataPlaneNodeSet` CR, see Creating an OpenStackDataPlaneNodeSet CR with pre-provisioned nodes in Deploying Red Hat OpenStack Services on OpenShift.
Procedure
To define the maximum number of vGPUs that your SR-IOV NVIDIA GPU can create for a specific vGPU type, create or update the `ConfigMap` CR named `nova-extra-config.yaml`. You must set the value of the `enabled_mdev_types` parameter and the `max_instances` parameter under the specific `mdev` section for the vGPU type. This example configuration is for the A40-2Q NVIDIA vGPU type, which can create up to 24 vGPUs:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nova-extra-config
  namespace: openstack
data:
  36-nova-max-instances.conf: |
    [devices]
    enabled_mdev_types = nvidia-558
    [mdev_nvidia-558]
    max_instances = 24
```

For more information about creating `ConfigMap` objects, see Creating and using config maps in Nodes.

Save the `nova-extra-config.yaml` file.

Create a new `OpenStackDataPlaneDeployment` CR to configure the services on the data plane nodes and deploy the data plane, and save it to a file named `compute_vgpus_max_instance_deploy.yaml` on your workstation:

```yaml
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: compute-vgpus-max-instance
```

In `compute_vgpus_max_instance_deploy.yaml`, specify `nodeSets` to include all the `OpenStackDataPlaneNodeSet` CRs you want to deploy. Ensure that you include the `OpenStackDataPlaneNodeSet` CR that you selected as a prerequisite.

Warning: If your deployment has more than one node set, changes to the `nova-extra-config.yaml` `ConfigMap` might directly affect more than one node set, depending on how the node sets and the `DataPlaneServices` are configured. To check whether a node set uses the `nova-extra-config` `ConfigMap`, and therefore will be affected by the reconfiguration, complete the following steps:

- Check the services list of the node set and find the name of the `DataPlaneService` that points to `nova`. Ensure that the value of the `edpmServiceType` field of the `DataPlaneService` is set to `nova`.
- If the `dataSources` list of the `DataPlaneService` contains a `configMapRef` named `nova-extra-config`, then this node set uses this `ConfigMap` and therefore will be affected by the configuration changes in this `ConfigMap`. If some of the affected node sets should not be reconfigured, you must create a new `DataPlaneService` that points to a separate `ConfigMap` for these node sets, and use that custom service in the required node sets instead.

```yaml
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: compute-vgpus-max-instance
spec:
  nodeSets:
    - openstack-edpm
    - compute-vgpus-max-instance
    - ...
    - <nodeSet_name>
```

Replace `<nodeSet_name>` with the names of the `OpenStackDataPlaneNodeSet` CRs that you want to include in your data plane deployment.
Save the `compute_vgpus_max_instance_deploy.yaml` deployment file.

Deploy the data plane:

```console
$ oc create -f compute_vgpus_max_instance_deploy.yaml
```

Verify that the data plane is deployed:

```console
$ oc get openstackdataplanenodeset
NAME                         STATUS   MESSAGE
compute-vgpus-max-instance   True     Deployed
```

Access the remote shell for `openstackclient` and verify that the deployed Compute nodes are visible on the control plane:

```console
$ oc rsh -n openstack openstackclient
$ openstack hypervisor list
```
13.6. Configuring Compute nodes for NVIDIA GPU passthrough
You can use PCI passthrough (NVIDIA GPU passthrough) to attach a physical PCI device, such as a graphics card, to an instance. If you use PCI passthrough for a device, the instance reserves exclusive access to the device for performing tasks, and the device is not available to the host.
To use an NVIDIA GPU as a PCI passthrough device, you must prepare the data plane nodes that you want to use for NVIDIA GPU passthrough, and you must download and install the NVIDIA device driver.
Prerequisites
- You have configured PCI passthrough. For more information, see Configuring PCI passthrough.
- You have enabled IOMMU in the server BIOS of the Compute nodes to support PCI passthrough as part of the PCI passthrough configuration. For more information, see Creating an OpenStackDataPlaneNodeSet CR for PCI passthrough.
Procedure
Access the remote shell for `openstackclient`:

```console
$ oc rsh openstackclient
```

Create an instance and install the NVIDIA device driver:

```console
$ openstack server create --flavor <flavor> \
 --image <image> --network <network> \
 --wait myInstanceFromImage
```

- Replace `<flavor>` with the name or ID of the flavor.
- Replace `<image>` with the name or ID of the image.
- Replace `<network>` with the name or ID of the network. You can use the `--network` option more than once to connect your instance to several networks, as required.

For more information about creating an instance, see Creating an instance in Creating and managing instances.

Create the file `/etc/modprobe.d/blacklist-nouveau.conf`. Disable the nouveau device driver by adding the following configuration to `blacklist-nouveau.conf`:

```
blacklist nouveau
options nouveau modeset=0
```

Regenerate the `initramfs`:

```console
$ dracut --force
$ grub2-mkconfig -o /boot/grub2/grub.cfg --update-bls-cmdline
```

- Download and install the NVIDIA device driver from the product portal. For more information, see NVIDIA DOCS HUB.

Reboot the node:

```console
$ sudo reboot
```
- Repeat this procedure for all instances that you want to use for GPU passthrough.
Verification
- To verify that the GPU is correctly configured for PCI passthrough, see Creating a nodeset for PCI passthrough.
13.7. Creating a custom vGPU resource provider trait
You can create custom resource provider traits for each vGPU type that your RHOSO environment supports. You can then create flavors that your cloud users can use to launch instances on hosts that have those custom traits.
Custom traits are defined in uppercase letters, and must begin with the prefix CUSTOM_. For more information on resource provider traits, see Filtering by resource provider traits.
Procedure
Create a new trait:
```console
$ openstack --os-placement-api-version 1.6 trait \
 create CUSTOM_<TRAIT_NAME>
```

Replace `<TRAIT_NAME>` with the name of the trait. The name can contain only the letters A through Z, the numbers 0 through 9, and the underscore "_" character.
Collect the existing resource provider traits of each host:
```console
$ existing_traits=$(openstack --os-placement-api-version 1.6 resource provider trait list -f value <host_uuid> | sed 's/^/--trait /')
```

Check the existing resource provider traits for the traits you require a host or host aggregate to have:

```console
$ echo $existing_traits
```

If the traits you require are not already added to the resource provider, then add the existing traits and your required traits to the resource providers for each host:

```console
$ openstack --os-placement-api-version 1.6 \
 resource provider trait set $existing_traits \
 --trait CUSTOM_<TRAIT_NAME> \
 <host_uuid>
```

Replace `<TRAIT_NAME>` with the name of the trait that you want to add to the resource provider. You can use the `--trait` option more than once to add additional traits, as required.

Note: This command performs a full replacement of the traits on the resource provider. Therefore, you must retrieve the list of existing resource provider traits on the host and set them again to prevent them from being removed.
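Because the trait set is replaced in full, it can help to assemble the final argument list in one step. The following sketch only does text assembly, so it works the same with any trait source; the helper name is illustrative, and `HW_GPU_API_VULKAN` stands in for an existing trait on the host:

```shell
# build_trait_args: read existing traits one per line on stdin, add the traits
# passed as arguments, and print a deduplicated "--trait NAME" argument list.
build_trait_args() {
  { cat; printf '%s\n' "$@"; } | sort -u | sed 's/^/--trait /' | paste -sd' ' -
}

# Example: merge one existing trait with a new custom trait; the result can be
# spliced into an `openstack resource provider trait set` command line.
printf 'HW_GPU_API_VULKAN\n' | build_trait_args CUSTOM_NVIDIA_11
```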
13.8. Creating a custom GPU instance image
To enable your cloud users to create instances that use a virtual GPU (vGPU), you can create a custom vGPU-enabled image for launching instances. Use the following procedure to create a custom vGPU-enabled instance image with the NVIDIA GRID guest driver and license file.
Prerequisites
- You have configured and deployed the overcloud with GPU-enabled Compute nodes.
Procedure
Create an instance with the hardware and software profile that your vGPU instances require:
```console
$ openstack server create --flavor <flavor> \
 --image <image> temp_vgpu_instance
```

- Replace `<flavor>` with the name or ID of the flavor that has the hardware profile that your vGPU instances require.
- Replace `<image>` with the name or ID of the image that has the software profile that your vGPU instances require. For information about downloading RHEL cloud images, see Creating operating system images for instances in Performing storage operations.
- Log in to the instance as a cloud user.
- Create the `gridd.conf` NVIDIA GRID license file on the instance, following the NVIDIA guidance: Licensing an NVIDIA vGPU on Linux by Using a Configuration File.
- Install the GPU driver on the instance. For more information about installing an NVIDIA driver, see Installing the NVIDIA vGPU Software Graphics Driver on Linux.

Note: Use the `hw_video_model` image property to define the GPU driver type. You can choose `none` if you want to disable the emulated GPUs for your vGPU instances.

Create an image snapshot of the instance:

```console
$ openstack server image create \
 --name vgpu_image temp_vgpu_instance
```

- Optional: Delete the instance.
13.9. Creating a vGPU flavor for instances
To enable your cloud users to create instances for GPU workloads, you can create a GPU flavor that can be used to launch vGPU instances, and assign the vGPU resource to that flavor.
Prerequisites
- You have configured and deployed the overcloud with GPU-designated Compute nodes.
Procedure
Create an NVIDIA GPU flavor, for example:
```console
$ openstack --os-compute-api=2.86 flavor create --vcpus 6 \
 --ram 8192 --disk 100 m1.small-gpu
```

Assign a vGPU resource to the flavor:

```console
$ openstack --os-compute-api=2.86 flavor set m1.small-gpu \
 --property "resources:VGPU=1"
```

Note: You can assign only one vGPU for each instance.
Optional: To customize the flavor for a specific vGPU type, add a required trait to the flavor:
```console
$ openstack --os-compute-api=2.86 flavor set m1.small-gpu \
 --property trait:CUSTOM_NVIDIA_11=required
```

For information on how to create custom resource provider traits for each vGPU type, see Creating a custom vGPU resource provider trait.
13.10. Launching a vGPU instance
Create a GPU-enabled instance for GPU workloads.
Procedure
Create an instance using a GPU flavor and image, for example:
```console
$ openstack --os-compute-api=2.86 server create --flavor m1.small-gpu \
 --image vgpu_image --security-group web --nic net-id=internal0 \
 --key-name lambda vgpu-instance
```

- Log in to the instance as a cloud user.

To verify that the GPU is accessible from the instance, enter the following command from the instance:

```console
$ lspci -nn | grep <gpu_name>
```