Chapter 2. Deploying Red Hat AI inference server


Verify your GPU passthrough configuration by launching the Red Hat AI inference server on an OpenStack instance.

2.1. Creating the RHEL image in OpenStack Glance

Prepare a RHEL-based image in the Image service (glance) for launching GPU-enabled instances. You can use this image to deploy GPU-accelerated workloads while maintaining compatibility with your OpenStack environment.

Prerequisites

  • The administrator has created a project for you and has provided you with a clouds.yaml file to access the cloud.
  • You have installed the python-openstackclient package.

Procedure

  1. Download the latest Red Hat Enterprise Linux 9.x KVM guest image from the RHEL Download page.
  2. Create the image:

    $ openstack image create \
    --disk-format qcow2 \
    --container-format bare \
    --file <download_name>.qcow2 \
    <rhel9.x>
    • Replace <download_name> with the specific name of your downloaded image. For example: "rhel-9.7-x86_64-kvm"
    • Replace <rhel9.x> with the name and version number of your downloaded image. For example: "rhel9.7"
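Before you upload the image, it is good practice to verify the download against the SHA-256 checksum published on the download page. A minimal sketch, using a temporary stand-in file because the real checksum value depends on your download:

```shell
# Stand-in for the downloaded qcow2 file; with the real image, use the
# SHA-256 value published on the RHEL download page instead of
# computing it locally.
f=$(mktemp)
printf 'demo image bytes' > "$f"
sum=$(sha256sum "$f" | awk '{print $1}')

# Verify in the "<sha256>  <file>" format that sha256sum -c expects.
echo "$sum  $f" | sha256sum -c -
```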

Verification

  1. Ensure that the image is uploaded. The status should initially show "importing" and then change to "active" when the upload is complete.

    $ watch openstack image show -c status -f value <rhel9.x>
    • Replace <rhel9.x> with the name and version number of your downloaded image.
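If you would rather block in a script than watch interactively, a small polling loop does the same job. A sketch, with a stub function standing in for `openstack image show -c status -f value <rhel9.x>` so that it runs without a cloud:

```shell
# Poll the given status command until it reports "active" or the
# retry budget runs out.
wait_for_active() {
  tries=0
  while [ "$tries" -lt 60 ]; do
    [ "$($1)" = "active" ] && return 0
    tries=$((tries + 1))
    sleep 1
  done
  return 1
}

# Stub: reports "importing" on the first call, "active" afterwards.
# In a real run, pass a wrapper around `openstack image show` instead.
fake_status() {
  if [ -f /tmp/.img_demo_active ]; then echo active
  else touch /tmp/.img_demo_active; echo importing; fi
}

rm -f /tmp/.img_demo_active
wait_for_active fake_status && echo "image is active"
```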

2.2. Launching an instance for GPU workloads

Deploy a GPU-enabled instance and install the Red Hat AI inference server to validate your configuration. This deployment verifies that instances can successfully access physical GPU devices and demonstrates a working inference service.

Prerequisites

  • You have created an Image service (glance) image for the GPU workload.
  • A flavor is available to you for creating instances for GPU workloads.
  • An appropriate security group with SSH access is provisioned for you to use.

Procedure

  1. Optional: If you do not have one already, create an SSH key-pair for use with your instances for GPU workloads:

    $ openstack keypair create --private-key <private_key_file> <name>
    • Replace <private_key_file> with the file path and name that you want the private key saved to. The public key is automatically saved to your user account.
    • Replace <name> with the name of your new key.
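SSH refuses private keys that other users can read, so restrict the saved file immediately with `chmod 600 <private_key_file>`. A runnable sketch with a temporary file standing in for your key:

```shell
# Temporary file standing in for <private_key_file>; run the same
# chmod on the real path you passed to `openstack keypair create`.
key=$(mktemp)
chmod 600 "$key"
stat -c '%a' "$key"
```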
  2. Set environment variables to use when you create the server instance.

    $ instance_name=rhaiis-inference
    $ image_name=<image_name>
    $ flavor=<flavor_name>
    $ network=<net_name>
    $ public_network=public
    $ security_group=<sg_id>
    $ key_name=<key_pair_name>
    • Replace <image_name> with the name you used for your Image service (glance) image.
    • Replace <flavor_name> with the name of the flavor for GPU workloads, for example, "nvidia-gpu".
    • Replace all other parameters with names specific to your environment.
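An unset variable in the next step would silently produce a malformed `openstack server create` command, so it can help to fail fast first. A sketch with placeholder values standing in for your environment-specific names:

```shell
# Placeholder values; substitute the names from your environment.
instance_name=rhaiis-inference
image_name=rhel9.7
flavor=nvidia-gpu
network=demo-net
security_group=demo-sg
key_name=demo-key

# Abort with a clear message if any required variable is empty or unset.
for var in instance_name image_name flavor network security_group key_name; do
  eval "val=\${$var}"
  [ -n "$val" ] || { echo "error: $var is not set" >&2; exit 1; }
done
echo "all variables set"
```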
  3. Use the variables you have created to create the instance:

    $ openstack server create --image ${image_name} --flavor ${flavor} \
        --nic net-id=${network} --security-group ${security_group} \
        --key-name ${key_name} ${instance_name} --wait
  4. Create a floating IP for the instance:

    $ fip=$(openstack floating ip create ${public_network} -f value -c floating_ip_address)
  5. Link the floating IP to the instance:

    $ openstack server add floating ip ${instance_name} ${fip}
    Note

    The default user account in the RHEL image is cloud-user, which is a user with passwordless sudo privileges.

  6. Use SSH to log in to your instance:

    $ ssh -i <private_key> cloud-user@${fip}
    • Replace <private_key> with the path to the private key file that you created.
  7. Register your RHEL instance:

    $ sudo rhc connect --organization <org> --activation-key <key>
    • Replace <org> with your organization ID.
    • Replace <key> with your Red Hat activation key.
  8. Install the appropriate GPU drivers and container toolkit for your hardware.

  9. Install podman:

    $ sudo dnf install podman
  10. Log in to the Red Hat container registry:

    $ podman login registry.redhat.io
  11. In the instance that you created, deploy Red Hat AI Inference Server using podman:

    $ mkdir ./rhaiis-cache
    
    $ podman run --device <device> \
      --security-opt=label=disable \
      --shm-size=4g -p 8000:8000 \
      --env "HF_HUB_OFFLINE=0" \
      --env=VLLM_NO_USAGE_STATS=1 \
      -v ./rhaiis-cache:/opt/app-root/src/cache:Z \
      registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.0.0 \
      --model RedHatAI/Llama-3.2-1B-Instruct-FP8 \
      --tensor-parallel-size <size>
    • If you are using NVIDIA GPUs, replace <device> with nvidia.com/gpu=all. If you are using AMD GPUs, replace the --device option with --device /dev/kfd --device /dev/dri.
    • Replace <size> with the number of GPUs to use when you run the Red Hat AI Inference Server container on multiple GPUs.
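`--tensor-parallel-size` should normally match the number of GPUs visible to the container. A sketch that derives the value from `nvidia-smi -L` output, shown here with a captured sample because this host has no GPU; on the instance, pipe the real command instead:

```shell
# Sample `nvidia-smi -L` output; on the GPU instance you would run:
#   tp_size=$(nvidia-smi -L | wc -l)
sample='GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-4b9f8464-ad90-cdde-8510-8006fc2772b7)
GPU 1: NVIDIA A100-PCIE-40GB (UUID: GPU-7a2e3526-44f0-4d2f-7283-6114a8b82a32)'

# nvidia-smi -L prints one line per GPU, so the line count is the
# number of GPUs to pass as --tensor-parallel-size.
tp_size=$(printf '%s\n' "$sample" | wc -l)
echo "tensor-parallel-size: $tp_size"
```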

Verification

  1. Check that the GPU devices are available inside the RHEL instance:

    NVIDIA:

    $ nvidia-smi -L

    AMD:

    $ /opt/rocm/bin/rocm-smi

    Example output

    GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-4b9f8464-ad90-cdde-8510-8006fc2772b7)
    GPU 1: NVIDIA A100-PCIE-40GB (UUID: GPU-7a2e3526-44f0-4d2f-7283-6114a8b82a32)
  2. Make a request to your model using the API:

    $ curl -X POST -H "Content-Type: application/json" -d '{
        "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
        "prompt": "What is the capital of France?",
        "max_tokens": 50
    }' http://<floating_ip>:8000/v1/completions | jq
    • Replace <floating_ip> with the address of your instance that you retrieved in a previous step.

      Example output

      {
          "id": "cmpl-b84aeda1d5a4485c9cb9ed4a13072fca",
          "object": "text_completion",
          "created": 1746555421,
          "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
          "choices": [
              {
                  "index": 0,
                  "text": " Paris.\nThe capital of France is Paris.",
                  "logprobs": null,
                  "finish_reason": "stop",
                  "stop_reason": null,
                  "prompt_logprobs": null
              }
          ],
          "usage": {
              "prompt_tokens": 8,
              "total_tokens": 18,
              "completion_tokens": 10,
              "prompt_tokens_details": null
          }
      }
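To script against the endpoint rather than read the full JSON by eye, jq can extract just the generated text. A sketch using an abbreviated copy of the response above in place of a live call; in practice you would pipe the curl output from the previous step into the same filter:

```shell
# Abbreviated completions response (only the field we need).
response='{"choices":[{"text":" Paris.\nThe capital of France is Paris."}]}'

# -r prints the raw string, so the embedded \n becomes a line break.
printf '%s' "$response" | jq -r '.choices[0].text'
```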