Chapter 2. Deploying Red Hat AI inference server
Verify your GPU passthrough configuration by launching Red Hat AI inference server on an OpenStack instance.
2.1. Creating the RHEL image in OpenStack Glance
Prepare a RHEL-based image in the Image service (glance) for launching GPU-enabled instances. You can use this image to deploy GPU-accelerated workloads while maintaining compatibility with your OpenStack environment.
Prerequisites
- The administrator has created a project for you and has provided you with a clouds.yaml file to access the cloud.
- You have installed the python-openstackclient package.
Procedure
- Download the latest Red Hat Enterprise Linux 9.x KVM guest image from the RHEL Download page.
Create the image:
$ openstack image create \
    --disk-format qcow2 \
    --container-format bare \
    --file <download_name>.qcow2 \
    <rhel9.x>

- Replace <download_name> with the specific name of your downloaded image. For example: "rhel-9.7-x86_64-kvm".
- Replace <rhel9.x> with the name and version number of your downloaded image. For example: "rhel9.7".
Verification
Ensure that the image is uploaded. The status should initially show "importing" and then change to "active" when the upload is complete.
$ watch openstack image show -c status -f value <rhel9.x>

- Replace <rhel9.x> with the name and version number of your downloaded image.
2.2. Launching an instance for GPU workloads
Deploy a GPU-enabled instance and install the Red Hat AI inference server to validate your configuration. This deployment verifies that instances can successfully access physical GPU devices and demonstrates a working inference service.
Prerequisites
- You have created an Image service (glance) image for the GPU workload.
- A flavor is available to you for creating instances for GPU workloads.
- An appropriate security group with SSH access is provisioned for you to use.
Procedure
Optional: If you do not have one already, create an SSH key pair for use with your instances for GPU workloads:

$ openstack keypair create --private-key <private_key_file> <name>

- Replace <private_key_file> with the file path and name that you want the private key saved to. The public key is automatically saved to your user account.
- Replace <name> with the name of your new key.
Populate environment variables for when you create the server instance:

$ instance_name=rhaiis-inference
$ image_name=<image_name>
$ flavor=<flavor_name>
$ network=<net_name>
$ public_network=public
$ security_group=<sg_id>
$ key_name=<key_pair_name>

- Replace <image_name> with the name you used for your Image service (glance) image.
- Replace <flavor_name> with the name of the flavor for GPU workloads, for example, "nvidia-gpu".
- Replace all other parameters with names specific to your environment.
Use the variables you have created to create the instance:

$ openstack server create --image ${image_name} --flavor ${flavor} \
    --nic net-id=${network} --security-group ${security_group} \
    --key-name ${key_name} ${instance_name} --wait

Create a floating IP for the instance:

$ fip=$(openstack floating ip create ${public_network} -f value -c floating_ip_address)

Link the floating IP to the instance:

$ openstack server add floating ip ${instance_name} ${fip}

Note: The default user account in the RHEL image is cloud-user, which is a user with passwordless sudo privileges.

Use SSH to log in to your instance:

$ ssh -i <private_key> cloud-user@${fip}

- Replace <private_key> with the private key file that you created.
Register your RHEL instance:

$ sudo rhc connect --organization <org> --activation-key <key>

- Replace <org> with your Red Hat organization ID.
- Replace <key> with your Red Hat activation key.
Install the appropriate drivers and Container Toolkit:
For NVIDIA GPUs:
For AMD GPUs:
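The exact driver and toolkit commands are not reproduced here because they depend on your GPU generation and RHEL minor release. The following is a minimal sketch of typical steps, assuming RHEL 9 on x86_64 and the vendor repositories named in the comments; treat the repository URLs and package names as assumptions and follow the current NVIDIA or AMD installation documentation for your exact versions.

```shell
# Hypothetical sketch -- verify repo URLs and package names against vendor docs.

# NVIDIA: kernel driver plus NVIDIA Container Toolkit (assumed CUDA repo for RHEL 9)
sudo dnf config-manager --add-repo \
    https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
sudo dnf module install -y nvidia-driver:latest-dkms
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
    sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo dnf install -y nvidia-container-toolkit
# Generate a CDI spec so that podman can resolve --device nvidia.com/gpu=all
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# AMD: install the ROCm stack by following AMD's ROCm guide for RHEL 9,
# then add the login user to the GPU device groups:
sudo usermod -a -G video,render cloud-user
```

Reboot the instance after installing a kernel driver so that the new modules are loaded before you start the container.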
Install podman:

$ sudo dnf install podman

Log in to the Red Hat container registry:

$ podman login registry.redhat.io

In the instance that you created, deploy Red Hat AI Inference Server using podman:

$ mkdir ./rhaiis-cache
$ podman run --device <device> \
    --security-opt=label=disable \
    --shm-size=4g -p 8000:8000 \
    --env "HF_HUB_OFFLINE=0" \
    --env=VLLM_NO_USAGE_STATS=1 \
    -v ./rhaiis-cache:/opt/app-root/src/cache:Z \
    registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.0.0 \
    --model RedHatAI/Llama-3.2-1B-Instruct-FP8 \
    --tensor-parallel-size <size>

- If you are using NVIDIA, replace <device> with nvidia.com/gpu=all. If you are using AMD, use --device /dev/kfd --device /dev/dri.
- Replace <size> with the number of GPUs to use when running the AI Inference Server container on multiple GPUs.
Verification
Check that the GPU device is available inside the RHEL instance:
- NVIDIA:

  $ nvidia-smi -L

- AMD:

  $ /opt/rocm/bin/rocm-smi

Example output:

GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-4b9f8464-ad90-cdde-8510-8006fc2772b7)
GPU 1: NVIDIA A100-PCIE-40GB (UUID: GPU-7a2e3526-44f0-4d2f-7283-6114a8b82a32)
Make a request to your model using the API:
$ curl -X POST -H "Content-Type: application/json" \
    -d '{ "prompt": "What is the capital of France?", "max_tokens": 50 }' \
    http://<floating_ip>:8000/v1/completions | jq

- Replace <floating_ip> with the address of your instance that you retrieved in a previous step.

Example output:

{
  "id": "cmpl-b84aeda1d5a4485c9cb9ed4a13072fca",
  "object": "text_completion",
  "created": 1746555421,
  "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
  "choices": [
    {
      "index": 0,
      "text": " Paris.\nThe capital of France is Paris.",
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 18,
    "completion_tokens": 10,
    "prompt_tokens_details": null
  }
}
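Before sending completion requests, you can also confirm which model the server is serving. This sketch uses the /v1/models endpoint, which is part of vLLM's standard OpenAI-compatible API surface; replace <floating_ip> with the instance address from the earlier step.

```shell
# List the models served by the inference server; the id should match the
# value passed with --model, e.g. RedHatAI/Llama-3.2-1B-Instruct-FP8
curl -s http://<floating_ip>:8000/v1/models | jq '.data[].id'
```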