Chapter 11. Configuring virtual GPUs for instances


To support GPU-based rendering on your instances, you can define and manage virtual GPU (vGPU) resources according to your available physical GPU devices and your hypervisor type. You can use this configuration to effectively spread the rendering workloads between all your physical GPU devices, and to control the scheduling of your vGPU-enabled instances.

To enable vGPU in the Compute service (nova), perform the following tasks:

  1. Identify the nodes on which you want to configure vGPUs.
  2. Retrieve the PCI address for each physical GPU on each Compute node, or for each SR-IOV virtual function (VF) if the GPU supports SR-IOV.
  3. Configure the GPU profiles on each Compute node.

Each instance hosted on the configured Compute nodes can support GPU workloads with vGPU devices that correspond to the physical GPU devices.

The Compute service (nova) tracks the number of vGPU devices that are available for each GPU profile you define on each host. The Compute service schedules instances to these hosts, attaches the devices, and monitors the use of vGPU. When an instance is deleted, the Compute service adds the vGPU devices back to the available pool.

Important

Red Hat enables the use of NVIDIA vGPU in RHOSO without the requirement for support exceptions. However, Red Hat does not provide technical support for the NVIDIA vGPU drivers. The NVIDIA vGPU drivers are shipped and supported by NVIDIA. You require an NVIDIA Certified Support Services subscription to obtain NVIDIA Enterprise Support for NVIDIA vGPU software. For issues that result from the use of NVIDIA vGPUs where you are unable to reproduce the issue on a supported component, the following support policies apply:

Supported GPU cards
For a list of supported NVIDIA GPU cards, see Virtual GPU Software Supported Products on the NVIDIA website.
Limitations when using vGPU devices
  • Each instance can use only one vGPU resource.
  • Live migration of vGPU instances between hosts is not supported.
  • Evacuation of vGPU instances is not supported.
  • If you need to reboot the Compute node that hosts the vGPU instances, the vGPUs are not automatically reassigned to the recreated instances. You must either cold migrate the instances before you reboot the Compute node, or manually allocate each vGPU to the correct instance after reboot. To manually allocate each vGPU, you must retrieve the mdev UUID from the instance XML for each vGPU instance that runs on the Compute node before you reboot. You can use the following command to discover the mdev UUID for each instance:

    # virsh dumpxml <instance_name> | grep mdev

    Replace <instance_name> with the libvirt instance name, OS-EXT-SRV-ATTR:instance_name, returned in a /servers request to the Compute API. For an example of manually recreating the mdev devices after the reboot, see the sketch after this list.

  • By default, vGPU types on Compute hosts are not exposed to API users. To expose the vGPU types on Compute hosts to API users, you must configure resource provider traits and create flavors that require the traits. Alternatively, if you only have one vGPU type, you can grant access by adding the hosts to a host aggregate. For more information, see Creating and managing host aggregates.
  • If you use NVIDIA accelerator hardware, you must comply with the NVIDIA licensing requirements. For example, NVIDIA vGPU GRID requires a licensing server. For more information about the NVIDIA licensing requirements, see NVIDIA License Server Release Notes on the NVIDIA website.
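
The following is a minimal sketch of manually recreating an mdev device on the rebooted Compute node. The PCI address, vGPU type, and mdev UUID are placeholders; use the values that you recorded from the instance XML before the reboot:

    # echo <mdev_uuid> > /sys/class/mdev_bus/<pci_address>/mdev_supported_types/<vgpu_type>/create
    # ls /sys/bus/mdev/devices/

After the mdev devices exist again with their original UUIDs, you can hard reboot the affected instances so that their vGPU devices are reattached.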

Before you configure the Compute service for vGPU, you must prepare the data plane nodes that you want to use for vGPU and you must download and install the NVIDIA device driver.

Procedure

  1. Access the remote shell for openstackclient:

    $ oc rsh -n openstack openstackclient
  2. Identify a node that you want to use for vGPU:

    1. Retrieve the IP address of the Compute node that you want to use for vGPU:

      $ openstack hypervisor list
    2. Use SSH to connect to the data plane node:

      $ ssh <node_ipaddress>
    3. Create the file /etc/modprobe.d/blacklist-nouveau.conf.
    4. Disable the nouveau driver by adding the following configuration to blacklist-nouveau.conf:

      blacklist nouveau
      options nouveau modeset=0
    5. Regenerate the initramfs:

      $ sudo dracut --force
      $ sudo grub2-mkconfig -o /boot/grub2/grub.cfg --update-bls-cmdline
    6. Download and install the NVIDIA driver from the NVIDIA portal. For more information, see NVIDIA DOCS HUB.
    7. Reboot the node:

      $ sudo reboot
  3. Repeat this procedure for all nodes that you want to allocate for vGPU instances.
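
Optionally, after the node reboots, you can confirm that the nouveau driver is no longer loaded, that the NVIDIA driver is active, and that the GPU can create mediated devices. The following is a minimal check, assuming the NVIDIA vGPU manager driver installed successfully:

    $ lsmod | grep nouveau
    $ nvidia-smi
    $ ls /sys/class/mdev_bus/

The first command returns no output when nouveau is disabled. On SR-IOV-capable GPUs, /sys/class/mdev_bus/ is populated only after you enable the virtual functions.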

11.3. Configuring the Compute service for vGPU

To configure the Compute service (nova) for vGPU, retrieve the vGPU types that correspond to the physical GPU devices in your environment and configure the Compute service to use them.

Note

You can configure only whole node sets. Reconfiguring a subset of the nodes within a node set is not supported. If you need to reconfigure a subset of nodes within a node set, you must scale the node set down, and create a new node set from the previously removed nodes.

Prerequisites

  • The oc command line tool is installed on your workstation.
  • You are logged in to Red Hat OpenStack Services on OpenShift (RHOSO) as a user with cluster-admin privileges.
  • You have selected the OpenStackDataPlaneNodeSet CR that defines the nodes that you can configure vGPU on. For more information about creating an OpenStackDataPlaneNodeSet CR, see Creating an OpenStackDataPlaneNodeSet CR with pre-provisioned nodes in the Deploying Red Hat OpenStack Services on OpenShift guide.

Procedure

  1. Virtual GPUs are mediated devices. Retrieve the PCI address for each device that can create mediated devices on each Compute node:

    $ ls /sys/class/mdev_bus/
    Note

    The PCI address of the GPU, or of the GPU SR-IOV virtual function (VF) that can create vGPUs, is used as the directory name under /sys/class/mdev_bus/, for example, 0000:84:00.0. In this procedure, the vGPU-capable resource is called an mdev device.

    Note

    Recent generations of NVIDIA GPUs support SR-IOV. Refer to the NVIDIA documentation to determine whether your GPU is SR-IOV-capable.

  2. Review the supported mdev types for each available pGPU device on each Compute node to discover the available vGPU types:

    $ ls /sys/class/mdev_bus/<mdev_device>/mdev_supported_types
    • Replace <mdev_device> with the PCI address of the mdev device, for example, 0000:84:00.0. In the following example, the Compute node has four pGPUs, and each pGPU supports the same 11 vGPU types:

      [root@computegpu-0 ~]# ls /sys/class/mdev_bus/0000:84:00.0/mdev_supported_types
      nvidia-35  nvidia-36  nvidia-37  nvidia-38  nvidia-39  nvidia-40  nvidia-41  nvidia-42  nvidia-43  nvidia-44  nvidia-45
      [root@computegpu-0 ~]# ls /sys/class/mdev_bus/0000:85:00.0/mdev_supported_types
      nvidia-35  nvidia-36  nvidia-37  nvidia-38  nvidia-39  nvidia-40  nvidia-41  nvidia-42  nvidia-43  nvidia-44  nvidia-45
      [root@computegpu-0 ~]# ls /sys/class/mdev_bus/0000:86:00.0/mdev_supported_types
      nvidia-35  nvidia-36  nvidia-37  nvidia-38  nvidia-39  nvidia-40  nvidia-41  nvidia-42  nvidia-43  nvidia-44  nvidia-45
      [root@computegpu-0 ~]# ls /sys/class/mdev_bus/0000:87:00.0/mdev_supported_types
      nvidia-35  nvidia-36  nvidia-37  nvidia-38  nvidia-39  nvidia-40  nvidia-41  nvidia-42  nvidia-43  nvidia-44  nvidia-45
  3. Create or update the ConfigMap CR named nova-extra-config in the nova-extra-config.yaml file, and set the values of the parameters under [devices]:

    apiVersion: v1
    kind: ConfigMap
    metadata:
       name: nova-extra-config
       namespace: openstack
    data:
       34-nova-vgpu.conf: |
          [devices]
          enabled_mdev_types = nvidia-35, nvidia-36

    For more information about creating ConfigMap objects, see Creating and using config maps.

  4. Optional: To configure more than one vGPU type, map the supported vGPU types to the pGPUs:

    [devices]
    enabled_mdev_types = nvidia-35, nvidia-36
    [mdev_nvidia-35]
    device_addresses = 0000:84:00.0,0000:85:00.0
    [mdev_nvidia-36]
    device_addresses = 0000:86:00.0

    The nvidia-35 vGPU type is supported by the pGPUs that are in the PCI addresses 0000:84:00.0 and 0000:85:00.0. The nvidia-36 vGPU type is supported only by the pGPUs that are in the PCI address 0000:86:00.0.

  5. Create a new OpenStackDataPlaneDeployment CR to configure the services on the data plane nodes and deploy the data plane, and save it to a file named compute_vgpu_deploy.yaml on your workstation:

    apiVersion: dataplane.openstack.org/v1beta1
    kind: OpenStackDataPlaneDeployment
    metadata:
       name: compute-vgpu
  6. In the compute_vgpu_deploy.yaml CR, specify nodeSets to include all the OpenStackDataPlaneNodeSet CRs you want to deploy. Ensure that you include the OpenStackDataPlaneNodeSet CR that you selected as a prerequisite. That OpenStackDataPlaneNodeSet CR defines the nodes that you want to use for vGPU.

    Warning

    If your deployment has more than one node set, changes to the nova-extra-config.yaml ConfigMap might directly affect more than one node set, depending on how the node sets and the DataPlaneServices are configured. To check if a node set uses the nova-extra-config ConfigMap and therefore will be affected by the reconfiguration, complete the following steps:

    1. Check the services list of the node set and find the name of the DataPlaneService that points to nova.
    2. Ensure that the value of the edpmServiceType field of the DataPlaneService is set to nova.

      If the dataSources list of the DataPlaneService contains a configMapRef named nova-extra-config, then this node set uses this ConfigMap and therefore will be affected by the configuration changes in this ConfigMap. If some of the node sets that are affected should not be reconfigured, you must create a new DataPlaneService pointing to a separate ConfigMap for these node sets.

    apiVersion: dataplane.openstack.org/v1beta1
    kind: OpenStackDataPlaneDeployment
    metadata:
      name: compute-vgpu
    spec:
      nodeSets:
        - openstack-edpm
        - compute-vgpu
        - ...
        - <nodeSet_name>
    • Replace <nodeSet_name> with the names of the OpenStackDataPlaneNodeSet CRs that you want to include in your data plane deployment.
  7. Save the compute_vgpu_deploy.yaml deployment file.
  8. Deploy the data plane:

    $ oc create -f compute_vgpu_deploy.yaml
  9. Verify that the data plane is deployed:

    $ oc get openstackdataplanenodeset
    
    NAME           STATUS   MESSAGE
    compute-vgpu   True     Deployed
    Tip

    Append the -w option to the end of the get command to track deployment progress.

  10. Access the remote shell for openstackclient and verify that the deployed Compute nodes are visible on the control plane:

    $ oc rsh -n openstack openstackclient
    
    $ openstack hypervisor list
  11. Optional: Enable SR-IOV VFs of the GPUs. For more information, see Preparing virtual function for SRIOV vGPU on the NVIDIA DOCS HUB.
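
Optionally, you can also confirm that the Placement service reports VGPU inventory for the configured GPUs. The following is a minimal sketch, run from the openstackclient pod, that assumes the placement command line plugin is available. Replace <provider_uuid> with the UUID of one of the nested resource providers that the Compute service creates for each vGPU-capable pGPU or VF:

    $ openstack --os-placement-api-version 1.6 resource provider list
    $ openstack --os-placement-api-version 1.6 resource provider inventory list <provider_uuid>

The inventory of a vGPU-capable resource provider includes the VGPU resource class.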

11.4. Configuring the maximum number of vGPUs for SR-IOV NVIDIA GPUs

If you are using NVIDIA SR-IOV GPUs, the Compute service (nova) cannot discover the maximum number of virtual GPUs (vGPUs) those GPUs can create. Therefore, you must retrieve this number manually from NVIDIA and then set the max_instances configuration option to define the maximum number of vGPUs your SR-IOV NVIDIA GPU can create.

Warning

You cannot reconfigure a subset of the nodes within a node set. If you need to do this, you must scale the node set down, and create a new node set from the previously removed nodes.

Prerequisites

  • You know whether your NVIDIA GPU supports SR-IOV and how many Virtual Functions (VFs) it supports. For example, the NVIDIA L4 GPU Accelerator provides SR-IOV support for 32 VFs. For more information, see www.nvidia.com.
  • You have the oc command line tool installed on your workstation.
  • You are logged on to a workstation that has access to the RHOSO control plane as a user with cluster-admin privileges.
  • You have selected the OpenStackDataPlaneNodeSet CR that defines the nodes on which you want to configure the maximum number of vGPUs for your SR-IOV NVIDIA GPU. For more information about creating an OpenStackDataPlaneNodeSet CR, see Creating an OpenStackDataPlaneNodeSet CR with pre-provisioned nodes in Deploying Red Hat OpenStack Services on OpenShift.

Procedure

  1. To define the maximum number of vGPUs your SR-IOV NVIDIA GPU can create for a specific vGPU type, create or update the ConfigMap CR named nova-extra-config in the nova-extra-config.yaml file. Set the enabled_mdev_types parameter under [devices] and the max_instances parameter under the mdev section for the specific vGPU type. This example configuration is for the NVIDIA A40-2Q vGPU type, which can create up to 24 vGPUs:

    apiVersion: v1
    kind: ConfigMap
    metadata:
       name: nova-extra-config
       namespace: openstack
    data:
       36-nova-max-instances.conf: |
          [devices]
          enabled_mdev_types = nvidia-558
    
          [mdev_nvidia-558]
          max_instances = 24

    For more information about creating ConfigMap objects, see Creating and using config maps in Nodes.

  2. Save the nova-extra-config.yaml file.
  3. Create a new OpenStackDataPlaneDeployment CR to configure the services on the data plane nodes and deploy the data plane, and save it to a file named compute_vgpus_max_instance_deploy.yaml on your workstation:

    apiVersion: dataplane.openstack.org/v1beta1
    kind: OpenStackDataPlaneDeployment
    metadata:
      name: compute-vgpus-max-instance
  4. In the compute_vgpus_max_instance_deploy.yaml, specify nodeSets to include all the OpenStackDataPlaneNodeSet CRs you want to deploy. Ensure that you include the OpenStackDataPlaneNodeSet CR that you selected as a prerequisite.

    Warning

    If your deployment has more than one node set, changes to the nova-extra-config.yaml ConfigMap might directly affect more than one node set, depending on how the node sets and the DataPlaneServices are configured. To check if a node set uses the nova-extra-config ConfigMap and therefore will be affected by the reconfiguration, complete the following steps:

    1. Check the services list of the node set and find the name of the DataPlaneService that points to nova. Ensure that the value of the edpmServiceType field of the DataPlaneService is set to nova.
    2. If the dataSources list of the DataPlaneService contains a configMapRef named nova-extra-config, then this node set uses this ConfigMap and therefore will be affected by the configuration changes in this ConfigMap. If some of the node sets that are affected should not be reconfigured, you must create a new DataPlaneService pointing to a separate ConfigMap for these node sets and use that custom service in the required node sets instead.

    apiVersion: dataplane.openstack.org/v1beta1
    kind: OpenStackDataPlaneDeployment
    metadata:
      name: compute-vgpus-max-instance
    spec:
      nodeSets:
        - openstack-edpm
        - compute-vgpus-max-instance
        - ...
        - <nodeSet_name>
    • Replace <nodeSet_name> with the names of the OpenStackDataPlaneNodeSet CRs that you want to include in your data plane deployment.
  5. Save the compute_vgpus_max_instance_deploy.yaml deployment file.
  6. Deploy the data plane:

    $ oc create -f compute_vgpus_max_instance_deploy.yaml
  7. Verify that the data plane is deployed:

    $ oc get openstackdataplanenodeset
    
    NAME                         STATUS   MESSAGE
    compute-vgpus-max-instance   True     Deployed
  8. Access the remote shell for openstackclient and verify that the deployed Compute nodes are visible on the control plane:

    $ oc rsh -n openstack openstackclient
    $ openstack hypervisor list
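
Optionally, if you enabled SR-IOV VFs on the GPU, you can confirm on the data plane node that each VF is exposed as an mdev-capable device and reports capacity for the enabled vGPU type. The following is a minimal sketch; the PCI address and vGPU type are placeholders:

    $ ls /sys/class/mdev_bus/
    $ cat /sys/class/mdev_bus/<vf_pci_address>/mdev_supported_types/<vgpu_type>/available_instances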

11.5. Preparing data plane nodes for NVIDIA GPU passthrough

You can use PCI passthrough (NVIDIA GPU passthrough) to attach a physical PCI device, such as a graphics card, to an instance. If you use PCI passthrough for a device, the instance reserves exclusive access to the device for performing tasks, and the device is not available to the host. To use NVIDIA GPU passthrough as PCI passthrough, you must prepare the data plane nodes that you want to use for NVIDIA GPU passthrough, and you must download and install the NVIDIA device driver.

Prerequisites

Procedure

  1. Access the remote shell for openstackclient:

    $ oc rsh -n openstack openstackclient
  2. Create an instance and install the NVIDIA device driver:

    $ openstack server create --flavor <flavor> \
      --image <image> --network <network> \
      --wait myInstanceFromImage
    • Replace <flavor> with the name or ID of the flavor.
    • Replace <image> with the name or ID of the image.
    • Replace <network> with the name or ID of the network. You can use the --network option more than once to connect your instance to several networks, as required.

      For more information about creating an instance, see Creating an instance in Creating and managing instances.

      1. Create the file /etc/modprobe.d/blacklist-nouveau.conf.
      2. Disable the nouveau device driver by adding the following configuration to blacklist-nouveau.conf:

        blacklist nouveau
        options nouveau modeset=0
      3. Regenerate the initramfs:

        $ sudo dracut --force
        $ sudo grub2-mkconfig -o /boot/grub2/grub.cfg --update-bls-cmdline
      4. Download and install the NVIDIA device driver from the product portal. For more information, see NVIDIA DOCS HUB.
      5. Reboot the node:

        $ sudo reboot
  3. Repeat this procedure for all instances that you want to allocate for GPU passthrough.

Verification

  1. To verify that the GPU is correctly configured for PCI passthrough, see Creating a nodeset for PCI passthrough.
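
In addition, you can check directly on the data plane node that the GPU is visible on the PCI bus and is not claimed by the nouveau driver. The following is a minimal sketch; the exact output depends on your card:

    $ lspci -nnk | grep -A 3 -i nvidia

The Kernel driver in use line in the output shows which driver, if any, is currently bound to the device.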

11.6. Creating a custom vGPU resource provider trait

You can create custom resource provider traits for each vGPU type that your RHOSO environment supports. You can then create flavors that your cloud users can use to launch instances on hosts that have those custom traits. Custom traits are defined in uppercase letters, and must begin with the prefix CUSTOM_. For more information about resource provider traits, see Filtering by resource provider traits.

Procedure

  1. Create a new trait:

     $ openstack --os-placement-api-version 1.6 trait \
     create CUSTOM_<TRAIT_NAME>
    • Replace <TRAIT_NAME> with the name of the trait. The name can contain only the letters A through Z, the numbers 0 through 9 and the underscore "_" character.
  2. Collect the existing resource provider traits of each host. If you do not know the resource provider UUID to use as <host_uuid>, see the sketch after this procedure:

    $ existing_traits=$(openstack --os-placement-api-version 1.6 resource provider trait list -f value <host_uuid> | sed 's/^/--trait /')
  3. Check the existing resource provider traits for the traits you require a host or host aggregate to have:

     $ echo $existing_traits
  4. If the traits you require are not already added to the resource provider, then add the existing traits and your required traits to the resource providers for each host:

     $ openstack --os-placement-api-version 1.6 \
     resource provider trait set $existing_traits \
     --trait CUSTOM_<TRAIT_NAME> \
     <host_uuid>
    • Replace <TRAIT_NAME> with the name of the trait that you want to add to the resource provider. You can use the --trait option more than once to add additional traits, as required.

      Note

      This command performs a full replacement of the traits for the resource provider. Therefore, you must retrieve the list of existing resource provider traits on the host and set them again to prevent them from being removed.
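
If you do not know the resource provider UUID to use as <host_uuid>, you can list the available resource providers from the openstackclient pod. The following is a minimal sketch, assuming the placement command line plugin is available:

    $ openstack --os-placement-api-version 1.6 resource provider list

For vGPU deployments, the Compute service creates a nested resource provider for each vGPU-capable pGPU or VF under the Compute node provider. Apply the custom trait to the providers that expose the VGPU inventory for the vGPU type that you want to distinguish.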

11.7. Creating a custom GPU instance image

To enable your cloud users to create instances that use a virtual GPU (vGPU), you can create a custom vGPU-enabled image for launching instances. Use the following procedure to create a custom vGPU-enabled instance image with the NVIDIA GRID guest driver and license file.

Prerequisites

  • You have configured and deployed a RHOSO environment with vGPU-enabled Compute nodes.

Procedure

  1. Create an instance with the hardware and software profile that your vGPU instances require:

    $ openstack server create --flavor <flavor> \
     --image <image> temp_vgpu_instance
    • Replace <flavor> with the name or ID of the flavor that has the hardware profile that your vGPU instances require.
    • Replace <image> with the name or ID of the image that has the software profile that your vGPU instances require. For information about downloading RHEL cloud images, see Creating RHEL KVM or RHOSP-compatible images in Creating and managing images.
  2. Log in to the instance as a cloud user.
  3. Create the gridd.conf NVIDIA GRID license file on the instance, following the NVIDIA guidance: Licensing an NVIDIA vGPU on Linux by Using a Configuration File.
  4. Install the GPU driver on the instance. For more information about installing an NVIDIA driver, see Installing the NVIDIA vGPU Software Graphics Driver on Linux.

    Note

    Use the hw_video_model image property to define the GPU driver type. You can choose none if you want to disable the emulated GPUs for your vGPU instances. For more information about supported drivers, see Image configuration parameters. For a sketch of setting this property on the image snapshot, see the example after this procedure.

  5. Create an image snapshot of the instance:

    $ openstack server image create \
     --name vgpu_image temp_vgpu_instance
  6. Optional: Delete the instance.
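
The following is a minimal sketch of setting the hw_video_model image property on the snapshot image, as described in the note in step 4. Setting the property to none disables the emulated GPU for instances launched from this image:

    $ openstack image set --property hw_video_model=none vgpu_image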

11.8. Creating a vGPU flavor for instances

To enable your cloud users to create instances for GPU workloads, you can create a GPU flavor that can be used to launch vGPU instances, and assign the vGPU resource to that flavor.

Prerequisites

  • You have configured and deployed a RHOSO environment with GPU-designated Compute nodes.

Procedure

  1. Create an NVIDIA GPU flavor, for example:

    $ openstack --os-compute-api-version 2.86 flavor create --vcpus 6 \
     --ram 8192 --disk 100 m1.small-gpu
  2. Assign a vGPU resource to the flavor:

    $ openstack --os-compute-api-version 2.86 flavor set m1.small-gpu \
     --property "resources:VGPU=1"
    Note

    You can assign only one vGPU for each instance.

  3. Optional: To customize the flavor for a specific vGPU type, add a required trait to the flavor:

    $ openstack --os-compute-api-version 2.86 flavor set m1.small-gpu \
     --property trait:CUSTOM_NVIDIA_11=required

    For information on how to create custom resource provider traits for each vGPU type, see Creating a custom vGPU resource provider trait.
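
To confirm that the vGPU resource and any required trait are set on the flavor, you can inspect the flavor properties. The following is a minimal check:

    $ openstack flavor show m1.small-gpu -c properties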

11.9. Launching a vGPU instance

You can create a GPU-enabled instance for GPU workloads.

Procedure

  1. Create an instance using a GPU flavor and image, for example:

    $ openstack --os-compute-api-version 2.86 server create --flavor m1.small-gpu \
     --image vgpu_image --security-group web --nic net-id=internal0 \
     --key-name lambda vgpu-instance
  2. Log in to the instance as a cloud user.
  3. To verify that the GPU is accessible from the instance, enter the following command from the instance:

    $ lspci -nn | grep <gpu_name>
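
You can also confirm from the openstackclient pod that the Placement service recorded a VGPU allocation for the instance. The following is a minimal sketch, assuming the placement command line plugin is available; replace <server_uuid> with the UUID of the vgpu-instance server:

    $ openstack resource provider allocation show <server_uuid>

The output should include a VGPU allocation of 1 against the resource provider of the pGPU or VF that the instance uses.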