Chapter 6. Deploying Red Hat AI Inference Server on IBM Z with IBM Spyre accelerators
Deploy a language model on OpenShift Container Platform running on IBM Z with IBM Spyre AI accelerators. You configure secrets, persistent storage, and a deployment custom resource (CR) that pulls the model from Hugging Face and uses Red Hat AI Inference Server to serve the model for inference.
For more information about installing the Spyre Operator, see the Spyre Operator for Z and LinuxONE User’s Guide.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in as a user with cluster-admin privileges.
- Your cluster deployed on IBM Z has worker nodes with IBM Spyre AI accelerators installed.
- You have installed the IBM Spyre Operator in the cluster. For more information, see Installing the Spyre Operator.
- You have a Hugging Face account and have generated a Hugging Face access token.
- You have access to registry.redhat.io and the cluster can pull images from this registry.
IBM Spyre AI accelerator cards support FP16 format model weights only. For compatible models, the Red Hat AI Inference Server inference engine automatically converts weights to FP16 at startup. No additional configuration is needed.
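The precision that a model's weights are published in is typically listed in the torch_dtype field of its config.json file on Hugging Face. As an optional check, and assuming the model repository is public (gated repositories require an Authorization header with your access token), you can inspect that field for the Granite model used later in this procedure:

   $ curl -s https://huggingface.co/ibm-granite/granite-3.3-8b-instruct/resolve/main/config.json | grep torch_dtype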
Procedure
1. Create the Secret custom resource (CR) for the Hugging Face token. The cluster uses the Secret CR to pull models from Hugging Face.

   Set the HF_TOKEN variable using the token that you generated in Hugging Face:

   $ HF_TOKEN=<your_huggingface_token>

   Set the cluster namespace to match where you deployed the Red Hat AI Inference Server image, for example:

   $ NAMESPACE=rhaii-namespace

   Create the Secret CR in the cluster:

   $ oc create secret generic hf-secret --from-literal=HF_TOKEN=$HF_TOKEN -n $NAMESPACE
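Optionally, confirm that the token is valid before you create the Secret CR. The Hugging Face whoami-v2 API returns your account details for a valid token and an error for an invalid one:

   $ curl -s -H "Authorization: Bearer $HF_TOKEN" https://huggingface.co/api/whoami-v2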
2. Create the Docker secret so that the cluster can download the Red Hat AI Inference Server image from the container registry. Because your local ~/.docker/config.json file uses the current Docker configuration format, create the Secret CR with the kubernetes.io/dockerconfigjson type, for example:

   $ oc create secret generic docker-secret --from-file=.dockerconfigjson=$HOME/.docker/config.json --type=kubernetes.io/dockerconfigjson -n rhaii-namespace
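The example Deployment later in this procedure does not set spec.imagePullSecrets. If the global cluster pull secret does not already provide credentials for registry.redhat.io, one way to make the new secret available is to link it, for image pulls, to the service account that the pod uses (default in this example):

   $ oc secrets link default docker-secret --for=pull -n rhaii-namespace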
3. Create a PersistentVolumeClaim (PVC) custom resource (CR) and apply it in the cluster. The following example PVC CR uses a default IBM VPC Block persistent volume. You use the PVC as the location where you store the models that you download.

   apiVersion: v1
   kind: PersistentVolumeClaim
   metadata:
     name: model-cache
     namespace: rhaii-namespace
   spec:
     accessModes:
       - ReadWriteOnce
     resources:
       requests:
         storage: 20Gi
     storageClassName: <STORAGE_CLASS_NAME>

   Note: Configuring cluster storage to meet your requirements is outside the scope of this procedure. For more detailed information, see Configuring persistent storage.
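You can check the status of the claim after you apply it. Depending on the volume binding mode of the storage class, the claim might stay in the Pending state until the first pod mounts it:

   $ oc get pvc model-cache -n rhaii-namespace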
4. Create a Deployment custom resource (CR) that pulls the model from Hugging Face and deploys the Red Hat AI Inference Server container, and apply it in the cluster. Reference the following example Deployment CR, which uses AI Inference Server to serve a Granite model with IBM Spyre AI accelerators.

   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: granite-spyre
     namespace: rhaii-namespace
     labels:
       app: granite-spyre
   spec:
     replicas: 1
     selector:
       matchLabels:
         app: granite-spyre
     template:
       metadata:
         labels:
           app: granite-spyre
       spec:
         serviceAccountName: default
         volumes:
           - name: model-volume
             persistentVolumeClaim:
               claimName: model-cache
           - name: shm
             emptyDir:
               medium: Memory
               sizeLimit: "2Gi"
         initContainers:
           - name: fetch-model
             image: registry.redhat.io/ubi9/python-311:latest
             env:
               - name: HF_TOKEN
                 valueFrom:
                   secretKeyRef:
                     name: hf-secret
                     key: HF_TOKEN
               - name: HF_HOME
                 value: /tmp/hf_home
               - name: HF_REPO_ID
                 value: "ibm-granite/granite-3.3-8b-instruct"
             command:
               - /bin/bash
               - -lc
             args:
               - |
                 set -euo pipefail
                 mkdir -p /tmp/model
                 if [ -z "$(ls -A /tmp/model 2>/dev/null)" ]; then
                   echo "Installing huggingface_hub..."
                   pip install --no-cache-dir -U huggingface_hub
                   echo "Downloading model from Hugging Face: ${HF_REPO_ID}"
                   echo "Using HF_HOME=${HF_HOME}"
                   python -c 'import os; from huggingface_hub import snapshot_download; snapshot_download(repo_id=os.environ["HF_REPO_ID"], local_dir="/tmp/model", local_dir_use_symlinks=False, token=os.environ.get("HF_TOKEN"), resume_download=True); print("Model download completed:", os.environ["HF_REPO_ID"])'
                 else
                   echo "Model already present in /tmp/model, skipping download."
                 fi
             volumeMounts:
               - name: model-volume
                 mountPath: /tmp/model
         containers:
           - name: vllm
             image: registry.redhat.io/{rhaii-registry-namespace}/vllm-spyre-rhel9:{rhaiis-version}
             command:
               - /bin/bash
               - -lc
               - |
                 source /opt/rh/gcc-toolset-14/enable
                 source /etc/profile.d/ibm-aiu-setup.sh
                 exec python3 -m vllm.entrypoints.openai.api_server \
                   --model=/tmp/model \
                   --port=8000 \
                   --served-model-name=spyre-model \
                   --max-model-len=32768 \
                   --max-num-seqs=32 \
                   --tensor-parallel-size=4 \
                   --enable-prefix-caching
             env:
               - name: HF_HOME
                 value: /tmp/hf_home
               - name: FLEX_DEVICE
                 value: VF
               - name: TOKENIZERS_PARALLELISM
                 value: "false"
               - name: DTLOG_LEVEL
                 value: error
               - name: TORCH_SENDNN_LOG
                 value: CRITICAL
               - name: VLLM_SPYRE_USE_CB
                 value: "1"
               - name: VLLM_SPYRE_REQUIRE_PRECOMPILED_DECODERS
                 value: "1"
               - name: TORCH_SENDNN_CACHE_ENABLE
                 value: "1"
               - name: VLLM_DT_CHUNK_LEN
                 value: "512"
             ports:
               - name: http
                 containerPort: 8000
             resources:
               requests:
                 cpu: "16"
                 memory: "160Gi"
                 ibm.com/spyre_vf: "4"
               limits:
                 cpu: "23"
                 memory: "200Gi"
                 ibm.com/spyre_vf: "4"
             volumeMounts:
               - name: model-volume
                 mountPath: /tmp/model
                 readOnly: true
               - name: shm
                 mountPath: /dev/shm

   Where:
   namespace: rhaii-namespace
      Specifies the deployment namespace. The value of metadata.namespace must match the namespace where you configured the Hugging Face Secret CR.
   claimName: model-cache
      Specifies the persistent volume claim name. The value of spec.template.spec.volumes.persistentVolumeClaim.claimName must match the name of the PVC that you created.
   initContainers
      Defines a container that runs before the main application container and downloads the required model from Hugging Face by using the huggingface_hub Python library. The model download step is skipped if the model directory has already been populated, for example, from a previous deployment.
   FLEX_DEVICE
      Specifies the device type for IBM Spyre accelerators. Set to VF for virtual function mode.
   TOKENIZERS_PARALLELISM
      Disables tokenizer parallelism to prevent resource conflicts.
   VLLM_SPYRE_USE_CB
      Enables continuous batching for improved throughput on IBM Spyre accelerators.
   VLLM_SPYRE_REQUIRE_PRECOMPILED_DECODERS
      Requires precompiled decoders for optimal performance on Spyre accelerators.
   TORCH_SENDNN_CACHE_ENABLE
      Enables caching for the SendNN backend to improve model loading times.
   ibm.com/spyre_vf
      Requests IBM Spyre virtual function devices from the cluster. The number specifies how many Spyre AI accelerator devices to allocate.
   mountPath: /dev/shm
      Mounts the shared memory volume that is required for tensor parallel inference across multiple Spyre accelerators.
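Before you apply the Deployment CR, you can confirm that your worker nodes advertise the ibm.com/spyre_vf extended resource that the container requests. Replace <node_name> with a worker node that has Spyre accelerators installed:

   $ oc describe node <node_name> | grep spyre_vf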
5. Scale the deployment to the required number of replicas, for example:

   $ oc scale deployment granite-spyre -n rhaii-namespace --replicas=1

6. Optional: Watch the deployment and ensure that it succeeds, for example:

   $ oc get deployment -n rhaii-namespace --watch

   Example output:

   NAME            READY   UP-TO-DATE   AVAILABLE   AGE
   granite-spyre   0/1     1            0           2s
   granite-spyre   1/1     1            1           5m
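If the deployment does not become available, you can inspect the logs of the model download init container and the serving container that are defined in the example Deployment CR:

   $ oc logs deployment/granite-spyre -c fetch-model -n rhaii-namespace
   $ oc logs deployment/granite-spyre -c vllm -n rhaii-namespace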
7. Create a Service CR for the model inference. For example:

   apiVersion: v1
   kind: Service
   metadata:
     name: granite-spyre
     namespace: rhaii-namespace
     labels:
       app: granite-spyre
   spec:
     selector:
       app: granite-spyre
     ports:
       - name: http
         protocol: TCP
         port: 8000
         targetPort: 8000
     type: ClusterIP

   Note: The value of spec.selector.app must match the label in your Deployment pod.
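Before you expose the service outside the cluster, you can optionally test it from your workstation. Run the port-forward command in a separate terminal, then call the health endpoint of the vLLM-based OpenAI API server; an HTTP 200 response indicates that the server is ready:

   $ oc port-forward -n rhaii-namespace svc/granite-spyre 8000:8000
   $ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health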
8. Optional: Create a Route CR to enable public access to the model with TLS encryption. For example:

   apiVersion: route.openshift.io/v1
   kind: Route
   metadata:
     name: granite-spyre
     namespace: rhaii-namespace
     annotations:
       haproxy.router.openshift.io/timeout: 600s
   spec:
     to:
       kind: Service
       name: granite-spyre
     port:
       targetPort: http
     tls:
       termination: edge
       insecureEdgeTerminationPolicy: Redirect

   Get the URL for the exposed route. Run the following command:
   $ oc get route granite-spyre -n rhaii-namespace -o jsonpath='{.spec.host}'

   Example output:

   granite-spyre-rhaii-namespace.apps.example.com
Verification
Ensure that the deployment is successful by querying the model. Run the following command:
$ curl -X POST https://granite-spyre-rhaii-namespace.apps.example.com/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "spyre-model",
        "messages": [{"role": "user", "content": "What is AI?"}],
        "temperature": 0.1
    }'

Example output:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "spyre-model",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "AI, or Artificial Intelligence, refers to the simulation of human intelligence in machines..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 50,
    "total_tokens": 62
  }
}