Chapter 8. Configuring Virtual GPU for guest instances
To support GPU-based rendering on your guest instances, you can define and manage virtual GPU (vGPU) resources according to your available physical GPU devices and your hypervisor type. This configuration allows you to divide the rendering workloads between all your physical GPU devices more effectively, and to have more control over scheduling, tuning, and monitoring your vGPU-enabled guest instances.
To enable vGPU in OpenStack Compute, you create flavors that you can use to request Red Hat Enterprise Linux guests with vGPU devices, and then you assign those flavors to Compute instances. Each instance can then support GPU workloads with virtual GPU devices that correspond to the physical GPU devices.
The OpenStack Compute service tracks the number and size of the vGPU devices that are available on each host, schedules guests to these hosts based on the flavor, attaches the devices, and monitors usage on an ongoing basis. When a guest is deleted or is no longer available, OpenStack Compute returns the vGPU devices to the available pool.
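For example, after deployment you can check the vGPU inventory that the Compute service reports to the Placement service for each GPU-enabled host. This is a minimal sketch; it assumes that the osc-placement client plugin is installed and that you substitute the resource provider UUID of your own Compute node:
(overcloud) $ openstack resource provider list
(overcloud) $ openstack resource provider inventory list <resource_provider_uuid>
Look for the VGPU resource class in the inventory output.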
8.1. Supported configurations and limitations
This section lists currently supported virtual GPU (vGPU) graphics cards, as well as considerations and limitations for setting up vGPU devices in OpenStack Compute.
Supported GPU cards
For a list of supported NVIDIA GPU cards, see Virtual GPU Software Supported Products on the NVIDIA website.
Limitations and considerations
- You can use only one vGPU type for each Compute host.
- You can use only one vGPU resource for each Compute instance.
- Live migration of vGPU between hosts is not supported.
- Suspend operations on a vGPU-enabled guest are not supported due to a libvirt limitation. Instead, you can snapshot or shelve the instance.
- Resize and cold migration operations on an instance with a vGPU flavor do not automatically re-allocate the vGPU resources to the instance. After you resize or migrate the instance, you must rebuild it manually to re-allocate the vGPU resources.
- By default, vGPU types on Compute hosts are not exposed to API users. To allow access, you can add the hosts to a host aggregate, as shown in the example after this list. For general information about host aggregates, see Section 4.4, “Manage Host Aggregates”.
- If you use NVIDIA accelerator hardware, you must comply with the NVIDIA licensing requirements. For example, NVIDIA vGPU GRID requires a licensing server. For more information about the NVIDIA licensing requirements, see the NVIDIA License Server Release Notes web page.
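The following minimal sketch shows how you might group vGPU hosts in a host aggregate, as mentioned in the list above. The aggregate name vgpu-hosts and the host name overcloud-computegpu-0.localdomain are placeholder values for your own environment:
(overcloud) $ openstack aggregate create vgpu-hosts
(overcloud) $ openstack aggregate add host vgpu-hosts overcloud-computegpu-0.localdomain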
8.2. Deploying NVIDIA GRID vGPU
This section describes how to deploy virtual GPU (vGPU) for NVIDIA devices on your Compute node hosts and on your guest instances. This end-to-end process includes the following steps:
- Building a custom GPU-enabled overcloud image
- Preparing the GPU role, profile, and flavor
- Configuring and deploying the overcloud
- Building a custom vGPU-enabled guest image
- Preparing the vGPU flavor for the instances
- Launching and configuring the vGPU-enabled instances
Prerequisites
Before you deploy NVIDIA GRID vGPU on your overcloud, make sure that your environment meets the following requirements:
- Your deployment must meet the requirements for vGPU devices, as described in Section 8.1, “Supported configurations and limitations”.
- Your undercloud must be deployed and the default overcloud image must be uploaded to Glance.
- You must comply with the NVIDIA GRID licensing requirements and you must have the URL of your self-hosted license server. For more information about the NVIDIA licensing requirements and self-hosted server installation, see the NVIDIA License Server Release Notes web page.
8.2.1. Build a custom GPU overcloud image
Perform the following steps on the undercloud to install the NVIDIA GRID host driver on an overcloud Compute image and upload the image to Glance.
Copy the overcloud image and add the gpu suffix to the copied image.
$ cp overcloud-full.qcow2 overcloud-full-gpu.qcow2
Install an ISO image generator tool by using yum.
$ sudo yum install genisoimage -y
Download the NVIDIA GRID host driver RPM package that corresponds to your GPU device from the NVIDIA website. To determine which driver you need, see the NVIDIA Driver Downloads Portal.
Note: You must be a registered NVIDIA customer to download the drivers from the portal.
Copy the driver RPM package into a directory named nvidia-guest, and then create an ISO image from that directory. You will use this ISO image to install the driver on your Compute nodes in subsequent steps.
$ genisoimage -o nvidia-guest.iso -R -J -V NVIDIA nvidia-guest/
I: -input-charset not specified, using utf-8 (detected in locale settings)
  9.06% done, estimate finish Wed Oct 31 11:24:46 2018
 18.08% done, estimate finish Wed Oct 31 11:24:46 2018
 27.14% done, estimate finish Wed Oct 31 11:24:46 2018
 36.17% done, estimate finish Wed Oct 31 11:24:46 2018
 45.22% done, estimate finish Wed Oct 31 11:24:46 2018
 54.25% done, estimate finish Wed Oct 31 11:24:46 2018
 63.31% done, estimate finish Wed Oct 31 11:24:46 2018
 72.34% done, estimate finish Wed Oct 31 11:24:46 2018
 81.39% done, estimate finish Wed Oct 31 11:24:46 2018
 90.42% done, estimate finish Wed Oct 31 11:24:46 2018
 99.48% done, estimate finish Wed Oct 31 11:24:46 2018
Total translation table size: 0
Total rockridge attributes bytes: 358
Total directory bytes: 0
Path table size(bytes): 10
Max brk space used 0
55297 extents written (108 MB)
Create a driver installation script for your Compute nodes. This script installs the NVIDIA GRID host driver on each Compute node that you run it on. In this example the script is named install_nvidia.sh.
#!/bin/bash
# NVIDIA GRID package
# Mount the attached NVIDIA ISO and install the host driver RPM
mkdir /tmp/mount
mount LABEL=NVIDIA /tmp/mount
rpm -ivh /tmp/mount/NVIDIA-vGPU-rhel-8.0-430.27.x86_64.rpm
Customize the overcloud image by attaching the ISO image that you generated and running the driver installation script that you created. For example:
$ virt-customize --attach nvidia-guest.iso -a overcloud-full-gpu.qcow2 -v --run install_nvidia.sh
[   0.0] Examining the guest ...
libguestfs: launch: program=virt-customize
libguestfs: launch: version=1.36.10rhel=8,release=6.el8_5.2,libvirt
libguestfs: launch: backend registered: unix
libguestfs: launch: backend registered: uml
libguestfs: launch: backend registered: libvirt
Relabel the customized image with SELinux.
$ virt-customize -a overcloud-full-gpu.qcow2 --selinux-relabel
[   0.0] Examining the guest ...
[   2.2] Setting a random seed
[   2.2] SELinux relabelling
[  27.4] Finishing off
Prepare the custom image files for a Glance upload. For example:
$ mkdir /var/image/x86_64/image
$ guestmount -a overcloud-full-gpu.qcow2 -i --ro image
$ cp image/boot/vmlinuz-3.10.0-862.14.4.el8.x86_64 ./overcloud-full-gpu.vmlinuz
$ cp image/boot/initramfs-3.10.0-862.14.4.el8.x86_64.img ./overcloud-full-gpu.initrd
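When you finish copying the kernel and ramdisk files, you can unmount the image. This is an optional cleanup step and assumes the guestunmount utility from the libguestfs tools:
$ guestunmount image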
From the undercloud, upload the custom image to Glance.
(undercloud) $ openstack overcloud image upload --update-existing --os-image-name overcloud-full-gpu.qcow2
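Optionally, verify that the custom image, kernel, and ramdisk are now available in Glance. This is a minimal check; the exact image names depend on your upload:
(undercloud) $ openstack image list | grep gpu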
8.2.2. Configure the vGPU role, profile, and flavor
After you build the custom GPU overcloud image, you prepare the Compute nodes for GPU-enabled overcloud deployment. This section describes how to configure the role, profile, and flavor for the GPU-enabled Compute nodes.
Create the new ComputeGpu role file by copying the file /home/stack/templates/roles/Compute.yaml to /home/stack/templates/roles/ComputeGpu.yaml and editing the following file sections:

Table 8.1. ComputeGpu role file edits

Section                      Current value              New value
Role comment                 Role: Compute              Role: ComputeGpu
Role name                    name: Compute              name: ComputeGpu
Description                  Basic Compute Node role    GPU role
CountDefault                 1                          0
ImageDefault                 overcloud-full             overcloud-gpu
HostnameFormatDefault        -compute-                  -computegpu-
deprecated_nic_config_name   compute.yaml               compute-gpu.yaml
Generate a new roles data file named gpu_roles_data.yaml that includes the Controller, Compute, and ComputeGpu roles.
(undercloud) [stack@director templates]$ openstack overcloud roles generate -o /home/stack/templates/gpu_roles_data.yaml Controller Compute ComputeGpu
The following example shows the ComputeGpu role details:
#####################################################################
# Role: ComputeGpu                                                  #
#####################################################################
- name: ComputeGpu
  description: |
    GPU Compute Node role
  CountDefault: 1
  ImageDefault: overcloud-gpu
  networks:
    - InternalApi
    - Tenant
    - Storage
  HostnameFormatDefault: '%stackname%-computegpu-%index%'
  RoleParametersDefault:
    TunedProfileName: "virtual-host"
  # Deprecated & backward-compatible values (FIXME: Make parameters consistent)
  # Set uses_deprecated_params to True if any deprecated params are used.
  uses_deprecated_params: True
  deprecated_param_image: 'NovaImage'
  deprecated_param_extraconfig: 'NovaComputeExtraConfig'
  deprecated_param_metadata: 'NovaComputeServerMetadata'
  deprecated_param_scheduler_hints: 'NovaComputeSchedulerHints'
  deprecated_param_ips: 'NovaComputeIPs'
  deprecated_server_resource_name: 'NovaCompute'
  deprecated_nic_config_name: 'compute-gpu.yaml'
  ServicesDefault:
    - OS::TripleO::Services::Aide
    - OS::TripleO::Services::AuditD
    - OS::TripleO::Services::CACerts
    - OS::TripleO::Services::CephClient
    - OS::TripleO::Services::CephExternal
    - OS::TripleO::Services::CertmongerUser
    - OS::TripleO::Services::Collectd
    - OS::TripleO::Services::ComputeCeilometerAgent
    - OS::TripleO::Services::ComputeNeutronCorePlugin
    - OS::TripleO::Services::ComputeNeutronL3Agent
    - OS::TripleO::Services::ComputeNeutronMetadataAgent
    - OS::TripleO::Services::ComputeNeutronOvsAgent
    - OS::TripleO::Services::Docker
    - OS::TripleO::Services::Fluentd
    - OS::TripleO::Services::Ipsec
    - OS::TripleO::Services::Iscsid
    - OS::TripleO::Services::Kernel
    - OS::TripleO::Services::LoginDefs
    - OS::TripleO::Services::MetricsQdr
    - OS::TripleO::Services::MySQLClient
    - OS::TripleO::Services::NeutronBgpVpnBagpipe
    - OS::TripleO::Services::NeutronLinuxbridgeAgent
    - OS::TripleO::Services::NeutronVppAgent
    - OS::TripleO::Services::NovaCompute
    - OS::TripleO::Services::NovaLibvirt
    - OS::TripleO::Services::NovaLibvirtGuests
    - OS::TripleO::Services::NovaMigrationTarget
    - OS::TripleO::Services::Ntp
    - OS::TripleO::Services::ContainersLogrotateCrond
    - OS::TripleO::Services::OpenDaylightOvs
    - OS::TripleO::Services::Rhsm
    - OS::TripleO::Services::RsyslogSidecar
    - OS::TripleO::Services::Securetty
    - OS::TripleO::Services::SensuClient
    - OS::TripleO::Services::SkydiveAgent
    - OS::TripleO::Services::Snmp
    - OS::TripleO::Services::Sshd
    - OS::TripleO::Services::Timezone
    - OS::TripleO::Services::TripleoFirewall
    - OS::TripleO::Services::TripleoPackages
    - OS::TripleO::Services::Tuned
    - OS::TripleO::Services::Vpp
    - OS::TripleO::Services::OVNController
    - OS::TripleO::Services::OVNMetadataAgent
    - OS::TripleO::Services::Ptp
Create the compute-vgpu-nvidia flavor to tag nodes that you want to designate for vGPU workloads.
(undercloud) [stack@director templates]$ openstack flavor create --id auto --ram 6144 --disk 40 --vcpus 4 compute-vgpu-nvidia
+----------------------------+--------------------------------------+
| Field                      | Value                                |
+----------------------------+--------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                |
| OS-FLV-EXT-DATA:ephemeral  | 0                                    |
| disk                       | 40                                   |
| id                         | 9cb47954-be00-47c6-a57f-44db35be3e69 |
| name                       | compute-vgpu-nvidia                  |
| os-flavor-access:is_public | True                                 |
| properties                 |                                      |
| ram                        | 6144                                 |
| rxtx_factor                | 1.0                                  |
| swap                       |                                      |
| vcpus                      | 4                                    |
+----------------------------+--------------------------------------+
Tag each node that you want to designate for GPU workloads with the compute-vgpu-nvidia profile.
(undercloud) [stack@director templates]$ openstack baremetal node set --property capabilities='profile:compute-vgpu-nvidia,boot_option:local' 9d07a673-b6bf-4a20-a538-3b05e8fa2c13
- Register the overcloud and run the standard hardware introspection on your nodes.
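After you register and introspect the nodes, you can optionally verify that the tagged node is matched to the compute-vgpu-nvidia profile. This sketch assumes the default director tooling:
(undercloud) [stack@director templates]$ openstack overcloud profiles list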
8.2.3. Prepare configuration files and deploy the overcloud
After you prepare your overcloud for vGPU, you retrieve and assign the vGPU type that corresponds to the physical GPU device in your environment and prepare the configuration templates.
Configure the vGPU type for your NVIDIA device
To determine the vGPU type for your physical GPU device, you must check the device types that it exposes from a running machine. You can perform these steps on a temporary Red Hat Enterprise Linux machine, such as an unused Compute node, and delete the node afterwards. You do not need to deploy the overcloud to perform these steps.
- Install Red Hat Enterprise Linux and the NVIDIA GRID driver on one Compute node and launch the node. For information on installing the NVIDIA GRID driver, see Section 8.2.1, “Build a custom GPU overcloud image”.
On the Compute node, locate the vGPU type of the physical GPU device that you want to enable. For libvirt, virtual GPUs are seen as mediated devices, or mdev type devices. To discover the supported mdev devices, run the following commands:
[root@overcloud-computegpu-0 ~]# ls /sys/class/mdev_bus/0000\:06\:00.0/mdev_supported_types/
nvidia-11  nvidia-12  nvidia-13  nvidia-14  nvidia-15  nvidia-16  nvidia-17  nvidia-18  nvidia-19  nvidia-20  nvidia-21  nvidia-210  nvidia-22

[root@overcloud-computegpu-0 ~]# cat /sys/class/mdev_bus/0000\:06\:00.0/mdev_supported_types/nvidia-18/description
num_heads=4, frl_config=60, framebuffer=2048M, max_resolution=4096x2160, max_instance=4
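To compare all of the vGPU types that the device supports at once, you can print the name and description of each mdev type. The following loop is a minimal sketch that assumes the same PCI address as the previous example:
[root@overcloud-computegpu-0 ~]# for t in /sys/class/mdev_bus/0000\:06\:00.0/mdev_supported_types/*; do echo "$(basename $t): $(cat $t/name) - $(cat $t/description)"; done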
Prepare the configuration templates
Add the compute-gpu.yaml file to the network-environment.yaml file. For example:
resource_registry:
  OS::TripleO::Compute::Net::SoftwareConfig: /home/stack/templates/nic-configs/compute.yaml
  OS::TripleO::ComputeGpu::Net::SoftwareConfig: /home/stack/templates/nic-configs/compute-gpu.yaml
  OS::TripleO::Controller::Net::SoftwareConfig: /home/stack/templates/nic-configs/controller.yaml
  #OS::TripleO::AllNodes::Validation: OS::Heat::None
Add the OvercloudComputeGpuFlavor flavor to the node-info.yaml file. For example:
parameter_defaults:
  OvercloudControllerFlavor: control
  OvercloudComputeFlavor: compute
  OvercloudComputeGpuFlavor: compute-vgpu-nvidia
  ControllerCount: 1
  ComputeCount: 0
  ComputeGpuCount: 1
  NtpServer: `NTP_SERVER_URL`
  NeutronNetworkType: vxlan,vlan
  NeutronTunnelTypes: vxlan
Replace the NTP_SERVER_URL variable with the address of your NTP server.
Create a gpu.yaml file with the vGPU type that you retrieved for your GPU device. For example:
parameter_defaults:
  ComputeGpuExtraConfig:
    nova::compute::vgpu::enabled_vgpu_types:
      - nvidia-18
Note: Only one virtual GPU type is supported per physical GPU. If you specify multiple vGPU types in this property, only the first type is used.
Deploy the overcloud
Run the overcloud deploy command with the custom GPU image and the configuration templates that you prepared.
$ openstack overcloud deploy -r /home/stack/templates/nvidia/gpu_roles_data.yaml -e /home/stack/templates/nvidia/gpu.yaml
8.2.4. Build a custom GPU guest image
After you deploy the overcloud with GPU-enabled Compute nodes, you build a custom vGPU-enabled instance image with the NVIDIA GRID guest driver and license file.
Create the NVIDIA GRID license file
On the overcloud host, create a gridd.conf file that contains the NVIDIA GRID license information. Use the license server information from the self-hosted NVIDIA GRID license server that you installed previously. For example:
# /etc/nvidia/gridd.conf.template - Configuration file for NVIDIA Grid Daemon

# This is a template for the configuration file for NVIDIA Grid Daemon.
# For details on the file format, please refer to the nvidia-gridd(1)
# man page.

# Description: Set License Server Address
# Data type: string
# Format:  "<address>"
ServerAddress=[NVIDIA_LICENSE_SERVER_URL]

# Description: Set License Server port number
# Data type: integer
# Format:  <port>, default is 7070
ServerPort=[PORT_NUMBER]

# Description: Set Backup License Server Address
# Data type: string
# Format:  "<address>"
#BackupServerAddress=

# Description: Set Backup License Server port number
# Data type: integer
# Format:  <port>, default is 7070
#BackupServerPort=

# Description: Set Feature to be enabled
# Data type: integer
# Possible values:
# 0 => for unlicensed state
# 1 => for GRID vGPU
# 2 => for Quadro Virtual Datacenter Workstation
FeatureType=[TYPE_ID]

# Description: Parameter to enable or disable Grid Licensing tab in nvidia-settings
# Data type: boolean
# Possible values: TRUE or FALSE, default is FALSE
EnableUI=TRUE

# Description: Set license borrow period in minutes
# Data type: integer
# Possible values: 10 to 10080 mins(7 days), default is 1440 mins(1 day)
#LicenseInterval=1440

# Description: Set license linger period in minutes
# Data type: integer
# Possible values: 0 to 10080 mins(7 days), default is 0 mins
#LingerInterval=10
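For example, a completed gridd.conf for a GRID vGPU guest might contain only the following active settings. The server address and port here are placeholder values; replace them with the details of your own license server:
ServerAddress=gridlicense.example.com
ServerPort=7070
FeatureType=1
EnableUI=TRUE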
Prepare the guest image and the NVIDIA GRID guest driver
Download the NVIDIA GRID guest driver RPM package that corresponds to your GPU device from the NVIDIA website. To determine which driver you need, see the NVIDIA Driver Downloads Portal.
Note: You must be a registered NVIDIA customer to download the drivers from the portal.
Copy the guest driver package and the gridd.conf license file that you created into a directory named nvidia-guest, and then create an ISO image from that directory. You will use this ISO image to install the driver and the license file on your guest instances in subsequent steps.
[root@virtlab607 guest]# genisoimage -o nvidia-guest.iso -R -J -V NVIDIA nvidia-guest/
I: -input-charset not specified, using utf-8 (detected in locale settings)
  9.06% done, estimate finish Wed Oct 31 10:59:50 2018
 18.08% done, estimate finish Wed Oct 31 10:59:50 2018
 27.14% done, estimate finish Wed Oct 31 10:59:50 2018
 36.17% done, estimate finish Wed Oct 31 10:59:50 2018
 45.22% done, estimate finish Wed Oct 31 10:59:50 2018
 54.25% done, estimate finish Wed Oct 31 10:59:50 2018
 63.31% done, estimate finish Wed Oct 31 10:59:50 2018
 72.34% done, estimate finish Wed Oct 31 10:59:50 2018
 81.39% done, estimate finish Wed Oct 31 10:59:50 2018
 90.42% done, estimate finish Wed Oct 31 10:59:50 2018
 99.48% done, estimate finish Wed Oct 31 10:59:50 2018
Total translation table size: 0
Total rockridge attributes bytes: 358
Total directory bytes: 0
Path table size(bytes): 10
Max brk space used 0
55297 extents written (108 MB)
Copy the guest image that you want to customize for GPU instances. For example:
[root@virtlab607 guest]# cp rhel-server-8.0-update-4-x86_64-kvm.qcow2 rhel-server-8.0-update-4-x86_64-kvm-gpu.qcow2
Create and run the customization script
By default, you must install the NVIDIA GRID drivers on each instance that you want to designate for GPU workloads. This process involves modifying the guest image, rebooting, and then installing the guest drivers. You can create a script to automate this process for the guest instances.
Create a script named nvidia-prepare-guest.sh to enable the required repositories, update the instance to the latest kernel, install the NVIDIA GRID guest driver, and attach the gridd.conf license file to the instance.
#!/bin/bash
# Add build tooling
subscription-manager register --username [USERNAME] --password [PASSWORD]
subscription-manager attach --pool=8a85f98c651a88990165399d8eea03e7
subscription-manager repos --disable='*'
subscription-manager repos --enable=rhel-8-server-rpms
dnf upgrade -y
dnf install -y gcc make kernel-devel cpp glibc-devel glibc-headers kernel-headers libmpc mpfr elfutils-libelf-devel

# NVIDIA GRID guest script
# Mount the attached NVIDIA ISO, run the guest driver installer,
# and copy the gridd.conf license file into place
mkdir /tmp/mount
mount LABEL=NVIDIA /tmp/mount
/bin/sh /tmp/mount/NVIDIA-Linux-x86_64-430.24-grid.run
mkdir -p /etc/nvidia
cp /tmp/mount/gridd.conf /etc/nvidia
Run the script on the guest image that you copied previously. For example:
$ virt-customize --attach nvidia-guest.iso -a rhel-server-8.0-update-4-x86_64-kvm-gpu.qcow2 -v --run nvidia-prepare-guest.sh
Upload the custom guest image to Glance.
(overcloud) [stack@director ~]$ openstack image create rhelgpu --file /var/images/x86_64/rhel-server-8.0-update-4-x86_64-kvm-gpu.qcow2 --disk-format qcow2 --container-format bare --public
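Optionally, confirm that the image is active before you launch instances:
(overcloud) [stack@director ~]$ openstack image show rhelgpu -c name -c status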
8.2.5. Create a vGPU profile for instances
After you build the custom guest image, you create a GPU flavor and assign a vGPU resource to that flavor. When you later launch instances with this flavor, the vGPU resource will be available to each instance.
You can assign only one vGPU resource for each instance.
Create an NVIDIA GPU flavor to tag each instance that you want to designate for GPU workloads. For example:
(overcloud) [stack@virtlab-director2 ~]$ openstack flavor create --vcpus 6 --ram 8192 --disk 100 m1.small-gpu
+----------------------------+--------------------------------------+
| Field                      | Value                                |
+----------------------------+--------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                |
| OS-FLV-EXT-DATA:ephemeral  | 0                                    |
| disk                       | 100                                  |
| id                         | a27b14dd-c42d-4084-9b6a-225555876f68 |
| name                       | m1.small-gpu                         |
| os-flavor-access:is_public | True                                 |
| properties                 |                                      |
| ram                        | 8192                                 |
| rxtx_factor                | 1.0                                  |
| swap                       |                                      |
| vcpus                      | 6                                    |
+----------------------------+--------------------------------------+
Assign a vGPU resource to the flavor that you created. Currently you can assign only one vGPU for each instance.
(overcloud) [stack@virtlab-director2 ~]$ openstack flavor set m1.small-gpu --property "resources:VGPU=1"

(overcloud) [stack@virtlab-director2 ~]$ openstack flavor show m1.small-gpu
+----------------------------+--------------------------------------+
| Field                      | Value                                |
+----------------------------+--------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                |
| OS-FLV-EXT-DATA:ephemeral  | 0                                    |
| access_project_ids         | None                                 |
| disk                       | 100                                  |
| id                         | a27b14dd-c42d-4084-9b6a-225555876f68 |
| name                       | m1.small-gpu                         |
| os-flavor-access:is_public | True                                 |
| properties                 | resources:VGPU='1'                   |
| ram                        | 8192                                 |
| rxtx_factor                | 1.0                                  |
| swap                       |                                      |
| vcpus                      | 6                                    |
+----------------------------+--------------------------------------+
8.2.6. Launch and test a vGPU instance
After you prepare the guest image and create the GPU flavor, you launch the GPU-enabled instance and install the NVIDIA guest driver from the ISO that you attached to the custom image in Section 8.2.4, “Build a custom GPU guest image”.
Launch a new instance with the GPU flavor that you created in Section 8.2.5, “Create a vGPU profile for instances”. For example:
(overcloud) [stack@virtlab-director2 ~]$ openstack server create --flavor m1.small-gpu --image rhelgpu --security-group web --nic net-id=internal0 --key-name lambda instance0
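Optionally, wait for the instance to reach the ACTIVE state before you log in:
(overcloud) [stack@virtlab-director2 ~]$ openstack server show instance0 -c status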
Log in to the instance and install the NVIDIA GRID driver. The exact installer name is available from the files that you attached to the guest image. For example:
[root@instance0 tmp]# sh NVIDIA-Linux-x86_64-430.24-grid.run
Check the status of the NVIDIA GRID daemon.
[root@instance0 nvidia]# systemctl status nvidia-gridd.service
● nvidia-gridd.service - NVIDIA Grid Daemon
   Loaded: loaded (/usr/lib/systemd/system/nvidia-gridd.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2018-10-31 20:00:41 EDT; 15s ago
  Process: 18143 ExecStopPost=/bin/rm -rf /var/run/nvidia-gridd (code=exited, status=0/SUCCESS)
  Process: 18145 ExecStart=/usr/bin/nvidia-gridd (code=exited, status=0/SUCCESS)
 Main PID: 18146 (nvidia-gridd)
   CGroup: /system.slice/nvidia-gridd.service
           └─18146 /usr/bin/nvidia-gridd

Oct 31 20:00:41 instance0 systemd[1]: Stopped NVIDIA Grid Daemon.
Oct 31 20:00:41 instance0 systemd[1]: Starting NVIDIA Grid Daemon...
Oct 31 20:00:41 instance0 systemd[1]: Started NVIDIA Grid Daemon.
Oct 31 20:00:41 instance0 nvidia-gridd[18146]: Started (18146)
Oct 31 20:00:41 instance0 nvidia-gridd[18146]: Ignore Service Provider Licensing.
Oct 31 20:00:41 instance0 nvidia-gridd[18146]: Calling load_byte_array(tra)
Oct 31 20:00:42 instance0 nvidia-gridd[18146]: Acquiring license for GRID vGPU Edition.
Oct 31 20:00:42 instance0 nvidia-gridd[18146]: Calling load_byte_array(tra)
Oct 31 20:00:45 instance0 nvidia-gridd[18146]: License acquired successfully. (Info: http://dhcp158-15.virt.lab.eng.bos.redhat.com:7070/request; GRID-Virtual-WS,2.0)
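As a final check, you can confirm that the guest driver detects the vGPU device inside the instance. This assumes that the driver installation completed successfully:
[root@instance0 ~]# nvidia-smi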