Chapter 2. Deploying Red Hat AI inference server
Verify your GPU passthrough configuration by launching Red Hat AI inference server on an OpenStack instance.
2.1. Creating the RHEL image in OpenStack Glance
Prepare a RHEL-based image in the Image service (glance) for launching GPU-enabled instances. You can use this image to deploy GPU-accelerated workloads while maintaining compatibility with your OpenStack environment.
Prerequisites
- The administrator has created a project for you and has provided you with a clouds.yaml file to access the cloud.
- You have installed the python-openstackclient package.
Procedure
- Download the latest Red Hat Enterprise Linux 9.x KVM guest image from the RHEL Download page.
Create the image:
$ openstack image create \
    --disk-format qcow2 \
    --container-format bare \
    --file <download_name>.qcow2 \
    <rhel9.x>

- Replace <download_name> with the specific name of your downloaded image. For example: "rhel-9.7-x86_64-kvm".
- Replace <rhel9.x> with the name and version number of your downloaded image. For example: "rhel9.7".
Verification
Ensure that the image is uploaded. The status should initially show "importing" and then change to "active" when the upload is complete.
$ watch openstack image show -c status -f value <rhel9.x>

- Replace <rhel9.x> with the name and version number of your downloaded image.
2.2. Launching an instance for GPU workloads
Deploy a GPU-enabled instance and install the Red Hat AI inference server to validate your configuration. This deployment verifies that instances can successfully access physical GPU devices and demonstrates a working inference service.
Prerequisites
- You have created an Image service (glance) image for the GPU workload.
- A flavor is available to you for creating instances for GPU workloads.
- An appropriate security group with SSH access is provisioned for you to use.
Procedure
Optional: If you do not have one already, create an SSH key pair for use with your instances for GPU workloads:

$ openstack keypair create --private-key <private_key_file> <name>

- Replace <private_key_file> with the file path and name that you want the private key saved to. The public key is automatically saved to your user account.
- Replace <name> with the name of your new key.
Populate environment variables for when you create the server instance:

$ instance_name=rhaiis-inference
$ image_name=<image_name>
$ flavor=<flavor_name>
$ network=<net_name>
$ public_network=public
$ security_group=<sg_id>
$ key_name=<key_pair_name>

- Replace <image_name> with the name you used for your Image service (glance) image.
- Replace <flavor_name> with the name of the flavor for GPU workloads, for example, "nvidia-gpu".
- Replace all other parameters with names specific to your environment.
Use the variables you have created to create the instance:

$ openstack server create --image ${image_name} --flavor ${flavor} \
    --nic net-id=${network} --security-group ${security_group} \
    --key-name ${key_name} ${instance_name} --wait

Create a floating IP for the instance:

$ fip=$(openstack floating ip create ${public_network} -f value -c floating_ip_address)

Link the floating IP to the instance:

$ openstack server add floating ip ${instance_name} ${fip}

Note: The default user account in the RHEL image is cloud-user, which is a user with passwordless sudo privileges.

Use SSH to log in to your instance:

$ ssh -i <private_key> cloud-user@${fip}

- Replace <private_key> with the private key file that you created.
Register your RHEL instance:

$ sudo rhc connect --organization <org> --activation-key <key>

- Replace <org> with your Red Hat organization ID.
- Replace <key> with your Red Hat activation key.
Install the appropriate drivers and Container Toolkit:
For NVIDIA GPUs:
For AMD GPUs:
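The exact driver and toolkit commands are not reproduced here because they depend on your GPU generation and RHEL minor release. The following is a minimal sketch of typical steps, assuming RHEL 9 on x86_64 and the vendor repositories named in the comments; treat the repository URLs and package names as assumptions and follow the current NVIDIA or AMD installation documentation for your exact versions.

```shell
# Hypothetical sketch -- verify repo URLs and package names against vendor docs.

# NVIDIA: kernel driver plus NVIDIA Container Toolkit (assumed CUDA repo for RHEL 9)
sudo dnf config-manager --add-repo \
    https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
sudo dnf module install -y nvidia-driver:latest-dkms
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
    sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo dnf install -y nvidia-container-toolkit
# Generate a CDI spec so that podman can resolve --device nvidia.com/gpu=all
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# AMD: install the ROCm stack by following AMD's ROCm guide for RHEL 9,
# then add the login user to the GPU device groups:
sudo usermod -a -G video,render cloud-user
```

Reboot the instance after installing a kernel driver so that the new modules are loaded before you start the container.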
Install podman:

$ sudo dnf install podman

Log in to the Red Hat container registry:

$ podman login registry.redhat.io

In the instance that you created, deploy Red Hat AI Inference Server using podman:

$ mkdir ./rhaiis-cache
$ podman run --device <device> \
    --security-opt=label=disable \
    --shm-size=4g -p 8000:8000 \
    --env "HF_HUB_OFFLINE=0" \
    --env=VLLM_NO_USAGE_STATS=1 \
    -v ./rhaiis-cache:/opt/app-root/src/cache:Z \
    registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.0.0 \
    --model RedHatAI/Llama-3.2-1B-Instruct-FP8 \
    --tensor-parallel-size <size>

- If you are using NVIDIA, replace <device> with nvidia.com/gpu=all. If you are using AMD, use --device /dev/kfd --device /dev/dri.
- Replace <size> with the number of GPUs to use when running the AI Inference Server container on multiple GPUs.
Verification
Check that the GPU device is available inside the RHEL instance:
- NVIDIA:

  $ nvidia-smi -L

- AMD:

  $ /opt/rocm/bin/rocm-smi

Example output:

GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-4b9f8464-ad90-cdde-8510-8006fc2772b7)
GPU 1: NVIDIA A100-PCIE-40GB (UUID: GPU-7a2e3526-44f0-4d2f-7283-6114a8b82a32)
Make a request to your model using the API:
$ curl -X POST -H "Content-Type: application/json" \
    -d '{ "prompt": "What is the capital of France?", "max_tokens": 50 }' \
    http://<floating_ip>:8000/v1/completions | jq

- Replace <floating_ip> with the address of your instance that you retrieved in a previous step.

Example output:

{
  "id": "cmpl-b84aeda1d5a4485c9cb9ed4a13072fca",
  "object": "text_completion",
  "created": 1746555421,
  "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
  "choices": [
    {
      "index": 0,
      "text": " Paris.\nThe capital of France is Paris.",
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 18,
    "completion_tokens": 10,
    "prompt_tokens_details": null
  }
}
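Before sending completion requests, you can also confirm which model the server is serving. This sketch uses the /v1/models endpoint, which is part of vLLM's standard OpenAI-compatible API surface; replace <floating_ip> with the instance address from the earlier step.

```shell
# List the models served by the inference server; the id should match the
# value passed with --model, e.g. RedHatAI/Llama-3.2-1B-Instruct-FP8
curl -s http://<floating_ip>:8000/v1/models | jq '.data[].id'
```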