Chapter 6. Deploying Red Hat AI Inference Server on IBM Z with IBM Spyre accelerators
Deploy a language model on OpenShift Container Platform running on IBM Z with IBM Spyre AI accelerators. You configure secrets, persistent storage, and a deployment custom resource (CR) that pulls the model from Hugging Face and uses Red Hat AI Inference Server to serve the model for inference.
For more information about installing the Spyre Operator, see the Spyre Operator for Z and LinuxONE User’s Guide.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in as a user with cluster-admin privileges.
- Your cluster deployed on IBM Z has worker nodes with IBM Spyre AI accelerators installed.
- You have installed the IBM Spyre Operator in the cluster. For more information, see Installing the Spyre Operator.
- You have a Hugging Face account and have generated a Hugging Face access token.
- You have access to registry.redhat.io and the cluster can pull images from this registry.
IBM Spyre AI accelerator cards support FP16 format model weights only. For compatible models, the Red Hat AI Inference Server inference engine automatically converts weights to FP16 at startup. No additional configuration is needed.
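The precision that a model's weights are published in is typically listed in the torch_dtype field of its config.json file on Hugging Face. As an optional check, and assuming the model repository is public (gated repositories require an Authorization header with your access token), you can inspect that field for the Granite model used later in this procedure:

   $ curl -s https://huggingface.co/ibm-granite/granite-3.3-8b-instruct/resolve/main/config.json | grep torch_dtype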
Procedure
1. Create the Secret custom resource (CR) for the Hugging Face token. The cluster uses the Secret CR to pull models from Hugging Face.

   Set the HF_TOKEN variable using the token that you generated in Hugging Face:

   $ HF_TOKEN=<your_huggingface_token>

   Set the cluster namespace to match where you deployed the Red Hat AI Inference Server image, for example:

   $ NAMESPACE=rhaii-namespace

   Create the Secret CR in the cluster:

   $ oc create secret generic hf-secret --from-literal=HF_TOKEN=$HF_TOKEN -n $NAMESPACE
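Optionally, confirm that the token is valid before you create the Secret CR. The Hugging Face whoami-v2 API returns your account details for a valid token and an error for an invalid one:

   $ curl -s -H "Authorization: Bearer $HF_TOKEN" https://huggingface.co/api/whoami-v2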
2. Create the Docker secret so that the cluster can download the Red Hat AI Inference Server image from the container registry. Because your local ~/.docker/config.json file uses the current Docker configuration format, create the Secret CR with the kubernetes.io/dockerconfigjson type, for example:

   $ oc create secret generic docker-secret --from-file=.dockerconfigjson=$HOME/.docker/config.json --type=kubernetes.io/dockerconfigjson -n rhaii-namespace
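The example Deployment later in this procedure does not set spec.imagePullSecrets. If the global cluster pull secret does not already provide credentials for registry.redhat.io, one way to make the new secret available is to link it, for image pulls, to the service account that the pod uses (default in this example):

   $ oc secrets link default docker-secret --for=pull -n rhaii-namespace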
3. Create a PersistentVolumeClaim (PVC) custom resource (CR) and apply it in the cluster. The following example PVC CR uses a default IBM VPC Block persistent volume. You use the PVC as the location where you store the models that you download.

   apiVersion: v1
   kind: PersistentVolumeClaim
   metadata:
     name: model-cache
     namespace: rhaii-namespace
   spec:
     accessModes:
       - ReadWriteOnce
     resources:
       requests:
         storage: 20Gi
     storageClassName: <STORAGE_CLASS_NAME>

   Note: Configuring cluster storage to meet your requirements is outside the scope of this procedure. For more detailed information, see Configuring persistent storage.
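You can check the status of the claim after you apply it. Depending on the volume binding mode of the storage class, the claim might stay in the Pending state until the first pod mounts it:

   $ oc get pvc model-cache -n rhaii-namespace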
4. Create a Deployment custom resource (CR) that pulls the model from Hugging Face and deploys the Red Hat AI Inference Server container, and apply it in the cluster. Reference the following example Deployment CR, which uses AI Inference Server to serve a Granite model with IBM Spyre AI accelerators.

   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: granite-spyre
     namespace: rhaii-namespace
     labels:
       app: granite-spyre
   spec:
     replicas: 1
     selector:
       matchLabels:
         app: granite-spyre
     template:
       metadata:
         labels:
           app: granite-spyre
       spec:
         serviceAccountName: default
         volumes:
           - name: model-volume
             persistentVolumeClaim:
               claimName: model-cache
           - name: shm
             emptyDir:
               medium: Memory
               sizeLimit: "2Gi"
         initContainers:
           - name: fetch-model
             image: registry.redhat.io/ubi9/python-311:latest
             env:
               - name: HF_TOKEN
                 valueFrom:
                   secretKeyRef:
                     name: hf-secret
                     key: HF_TOKEN
               - name: HF_HOME
                 value: /tmp/hf_home
               - name: HF_REPO_ID
                 value: "ibm-granite/granite-3.3-8b-instruct"
             command:
               - /bin/bash
               - -lc
             args:
               - |
                 set -euo pipefail
                 mkdir -p /tmp/model
                 if [ -z "$(ls -A /tmp/model 2>/dev/null)" ]; then
                   echo "Installing huggingface_hub..."
                   pip install --no-cache-dir -U huggingface_hub
                   echo "Downloading model from Hugging Face: ${HF_REPO_ID}"
                   echo "Using HF_HOME=${HF_HOME}"
                   python -c 'import os; from huggingface_hub import snapshot_download; snapshot_download(repo_id=os.environ["HF_REPO_ID"], local_dir="/tmp/model", local_dir_use_symlinks=False, token=os.environ.get("HF_TOKEN"), resume_download=True); print("Model download completed:", os.environ["HF_REPO_ID"])'
                 else
                   echo "Model already present in /tmp/model, skipping download."
                 fi
             volumeMounts:
               - name: model-volume
                 mountPath: /tmp/model
         containers:
           - name: vllm
             image: registry.redhat.io/{rhaii-registry-namespace}/vllm-spyre-rhel9:{rhaiis-version}
             command:
               - /bin/bash
               - -lc
               - |
                 source /opt/rh/gcc-toolset-14/enable
                 source /etc/profile.d/ibm-aiu-setup.sh
                 exec python3 -m vllm.entrypoints.openai.api_server \
                   --model=/tmp/model \
                   --port=8000 \
                   --served-model-name=spyre-model \
                   --max-model-len=32768 \
                   --max-num-seqs=32 \
                   --tensor-parallel-size=4 \
                   --enable-prefix-caching
             env:
               - name: HF_HOME
                 value: /tmp/hf_home
               - name: FLEX_DEVICE
                 value: VF
               - name: TOKENIZERS_PARALLELISM
                 value: "false"
               - name: DTLOG_LEVEL
                 value: error
               - name: TORCH_SENDNN_LOG
                 value: CRITICAL
               - name: VLLM_SPYRE_USE_CB
                 value: "1"
               - name: VLLM_SPYRE_REQUIRE_PRECOMPILED_DECODERS
                 value: "1"
               - name: TORCH_SENDNN_CACHE_ENABLE
                 value: "1"
               - name: VLLM_DT_CHUNK_LEN
                 value: "512"
             ports:
               - name: http
                 containerPort: 8000
             resources:
               requests:
                 cpu: "16"
                 memory: "160Gi"
                 ibm.com/spyre_vf: "4"
               limits:
                 cpu: "23"
                 memory: "200Gi"
                 ibm.com/spyre_vf: "4"
             volumeMounts:
               - name: model-volume
                 mountPath: /tmp/model
                 readOnly: true
               - name: shm
                 mountPath: /dev/shm

   Where:
   namespace: rhaii-namespace
      Specifies the deployment namespace. The value of metadata.namespace must match the namespace where you configured the Hugging Face Secret CR.
   claimName: model-cache
      Specifies the persistent volume claim name. The value of spec.template.spec.volumes.persistentVolumeClaim.claimName must match the name of the PVC that you created.
   initContainers
      Defines a container that runs before the main application container and downloads the required model from Hugging Face by using the huggingface_hub Python library. The model download step is skipped if the model directory has already been populated, for example, from a previous deployment.
   FLEX_DEVICE
      Specifies the device type for IBM Spyre accelerators. Set to VF for virtual function mode.
   TOKENIZERS_PARALLELISM
      Disables tokenizer parallelism to prevent resource conflicts.
   VLLM_SPYRE_USE_CB
      Enables continuous batching for improved throughput on IBM Spyre accelerators.
   VLLM_SPYRE_REQUIRE_PRECOMPILED_DECODERS
      Requires precompiled decoders for optimal performance on Spyre accelerators.
   TORCH_SENDNN_CACHE_ENABLE
      Enables caching for the SendNN backend to improve model loading times.
   ibm.com/spyre_vf
      Requests IBM Spyre virtual function devices from the cluster. The number specifies how many Spyre AI accelerator devices to allocate.
   mountPath: /dev/shm
      Mounts the shared memory volume that is required for tensor parallel inference across multiple Spyre accelerators.
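Before you apply the Deployment CR, you can confirm that your worker nodes advertise the ibm.com/spyre_vf extended resource that the container requests. Replace <node_name> with a worker node that has Spyre accelerators installed:

   $ oc describe node <node_name> | grep spyre_vf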
5. Scale the deployment to the required number of replicas, for example:

   $ oc scale deployment granite-spyre -n rhaii-namespace --replicas=1

6. Optional: Watch the deployment and ensure that it succeeds, for example:

   $ oc get deployment -n rhaii-namespace --watch

   Example output:

   NAME            READY   UP-TO-DATE   AVAILABLE   AGE
   granite-spyre   0/1     1            0           2s
   granite-spyre   1/1     1            1           5m
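If the deployment does not become available, you can inspect the logs of the model download init container and the serving container that are defined in the example Deployment CR:

   $ oc logs deployment/granite-spyre -c fetch-model -n rhaii-namespace
   $ oc logs deployment/granite-spyre -c vllm -n rhaii-namespace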
7. Create a Service CR for the model inference. For example:

   apiVersion: v1
   kind: Service
   metadata:
     name: granite-spyre
     namespace: rhaii-namespace
     labels:
       app: granite-spyre
   spec:
     selector:
       app: granite-spyre
     ports:
       - name: http
         protocol: TCP
         port: 8000
         targetPort: 8000
     type: ClusterIP

   Note: The value of spec.selector.app must match the label in your Deployment pod.
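Before you expose the service outside the cluster, you can optionally test it from your workstation. Run the port-forward command in a separate terminal, then call the health endpoint of the vLLM-based OpenAI API server; an HTTP 200 response indicates that the server is ready:

   $ oc port-forward -n rhaii-namespace svc/granite-spyre 8000:8000
   $ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health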
8. Optional: Create a Route CR to enable public access to the model with TLS encryption. For example:

   apiVersion: route.openshift.io/v1
   kind: Route
   metadata:
     name: granite-spyre
     namespace: rhaii-namespace
     annotations:
       haproxy.router.openshift.io/timeout: 600s
   spec:
     to:
       kind: Service
       name: granite-spyre
     port:
       targetPort: http
     tls:
       termination: edge
       insecureEdgeTerminationPolicy: Redirect

   Get the URL for the exposed route. Run the following command:
   $ oc get route granite-spyre -n rhaii-namespace -o jsonpath='{.spec.host}'

   Example output:

   granite-spyre-rhaii-namespace.apps.example.com
Verification
Ensure that the deployment is successful by querying the model. Run the following command:
$ curl -X POST https://granite-spyre-rhaii-namespace.apps.example.com/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "spyre-model",
        "messages": [{"role": "user", "content": "What is AI?"}],
        "temperature": 0.1
    }'

Example output:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "spyre-model",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "AI, or Artificial Intelligence, refers to the simulation of human intelligence in machines..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 50,
    "total_tokens": 62
  }
}