Chapter 5. Deploying Red Hat AI Inference Server and inference serving the model
Deploy a language model with OpenShift Container Platform by configuring secrets, persistent storage, and a deployment custom resource (CR) that pulls the model from Hugging Face and uses Red Hat AI Inference Server to inference serve the model.
Prerequisites
- You have installed the OpenShift CLI (`oc`).
- You have logged in as a user with `cluster-admin` privileges.
- You have installed the Node Feature Discovery (NFD) Operator and the required GPU Operator for your underlying AI accelerator hardware.
Procedure
Create the `Secret` custom resource (CR) for the Hugging Face token. The cluster uses the `Secret` CR to pull models from Hugging Face.

Set the `HF_TOKEN` variable using the token you set in Hugging Face:

$ HF_TOKEN=<your_huggingface_token>

Set the cluster namespace to match where you deployed the Red Hat AI Inference Server image, for example:

$ NAMESPACE=rhaiis-namespace

Create the `Secret` CR in the cluster:

$ oc create secret generic hf-secret --from-literal=HF_TOKEN=$HF_TOKEN -n $NAMESPACE
Create the Docker secret so that the cluster can download the Red Hat AI Inference Server image from the container registry. For example, to create a `Secret` CR that contains the contents of your local `~/.docker/config.json` file, run the following command:

$ oc create secret generic docker-secret --from-file=.dockercfg=$HOME/.docker/config.json --type=kubernetes.io/dockercfg -n rhaiis-namespace
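The command above packages your local registry credentials into a `kubernetes.io/dockercfg` secret. The following Python sketch mirrors what that packaging looks like; the registry entry and user token are hypothetical placeholders, not real values:

```python
import base64
import json

def build_dockercfg_secret(name: str, namespace: str, dockercfg: dict) -> dict:
    """Build a kubernetes.io/dockercfg Secret manifest: the credentials file is
    base64-encoded and stored under the .dockercfg data key."""
    encoded = base64.b64encode(json.dumps(dockercfg).encode()).decode()
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "kubernetes.io/dockercfg",
        "data": {".dockercfg": encoded},
    }

# Hypothetical credentials for illustration only.
cfg = {"registry.redhat.io": {"auth": base64.b64encode(b"user:token").decode()}}
secret = build_dockercfg_secret("docker-secret", "rhaiis-namespace", cfg)
```

Applying a manifest like this with `oc apply -f` is equivalent to the `oc create secret` command above.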
Create a `PersistentVolumeClaim` (PVC) custom resource (CR) and apply it in the cluster. The following example PVC CR uses a default IBM VPC Block persistent volume. You use the PVC as the location where you store the models that you download.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: rhaiis-namespace
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: ibmc-vpc-block-10iops-tier

Note: Configuring cluster storage to meet your requirements is outside the scope of this procedure. For more detailed information, see Configuring persistent storage.
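The 20Gi request in the example leaves headroom for the quantized Granite 8B model. A rough sizing rule is parameter count multiplied by bytes per parameter (about one byte per parameter for w8a8 weights), plus a safety margin; the 1.2 overhead factor below is an assumption for illustration, not a vLLM requirement:

```python
import math

def estimate_model_storage_gib(num_params: float, bytes_per_param: float,
                               overhead: float = 1.2) -> int:
    """Rough PVC sizing: weight bytes times a safety factor, rounded up to whole GiB."""
    raw_bytes = num_params * bytes_per_param * overhead
    return math.ceil(raw_bytes / 2**30)

# 8B parameters quantized to 8-bit weights (w8a8 => ~1 byte per parameter)
print(estimate_model_storage_gib(8e9, 1.0))  # 9 GiB, comfortably under the 20Gi request
```

For an unquantized 16-bit checkpoint of the same model, the same rule suggests roughly double the storage.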
Create a `Deployment` custom resource (CR) that pulls the model and deploys the Red Hat AI Inference Server container. Reference the following example `Deployment` CR, which uses AI Inference Server to serve a Granite model on a CUDA accelerator.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: granite
  namespace: rhaiis-namespace
  labels:
    app: granite
spec:
  replicas: 1
  selector:
    matchLabels:
      app: granite
  template:
    metadata:
      labels:
        app: granite
    spec:
      imagePullSecrets:
        - name: docker-secret
      volumes:
        - name: model-volume
          persistentVolumeClaim:
            claimName: model-cache
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "2Gi"
        - name: oci-auth
          secret:
            secretName: docker-secret
            items:
              - key: .dockercfg
                path: config.json
      serviceAccountName: default
      initContainers:
        - name: fetch-model
          image: ghcr.io/oras-project/oras:v1.2.0
          env:
            - name: DOCKER_CONFIG
              value: /auth
          command: ["/bin/sh", "-c"]
          args:
            - |
              set -e
              # Only pull if /model is empty
              if [ -z "$(ls -A /model)" ]; then
                echo "Pulling model..."
                oras pull registry.redhat.io/rhelai1/granite-3-1-8b-instruct-quantized-w8a8:1.5 \
                  --output /model
              else
                echo "Model already present, skipping model pull"
              fi
          volumeMounts:
            - name: model-volume
              mountPath: /model
            - name: oci-auth
              mountPath: /auth
              readOnly: true
      containers:
        - name: granite
          image: 'registry.redhat.io/rhaiis/vllm-cuda-rhel9@sha256:a6645a8e8d7928dce59542c362caf11eca94bb1b427390e78f0f8a87912041cd'
          imagePullPolicy: IfNotPresent
          env:
            - name: VLLM_SERVER_DEV_MODE
              value: '1'
          command:
            - python
            - '-m'
            - vllm.entrypoints.openai.api_server
          args:
            - '--port=8000'
            - '--model=/model'
            - '--served-model-name=granite-3-1-8b-instruct-quantized-w8a8'
            - '--tensor-parallel-size=1'
          resources:
            limits:
              cpu: '10'
              nvidia.com/gpu: '1'
              memory: 16Gi
            requests:
              cpu: '2'
              memory: 6Gi
              nvidia.com/gpu: '1'
          volumeMounts:
            - name: model-volume
              mountPath: /model
            - name: shm
              mountPath: /dev/shm
      restartPolicy: Always
Where:

- `namespace: rhaiis-namespace`: Specifies the deployment namespace. The value of `metadata.namespace` must match the namespace where you configured the Hugging Face `Secret` CR.
- `claimName: model-cache`: Specifies the persistent volume claim name. The value of `spec.template.spec.volumes.persistentVolumeClaim.claimName` must match the name of the PVC that you created.
- `initContainers`: Defines a container that runs before the main application container to download the required model from the container registry. The model pull step is skipped if the model directory has already been populated, for example, from a previous deployment.
- `mountPath: /dev/shm`: Mounts the shared memory volume required by the NVIDIA Collective Communications Library (NCCL). Tensor parallel vLLM deployments fail without this volume mount.
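One consistency check worth noting: for single-node tensor parallelism, the `nvidia.com/gpu` limit on the container should equal the `--tensor-parallel-size` argument, because vLLM spawns one worker per tensor-parallel rank. A minimal sketch of that check, assuming the container spec has been loaded as a Python dict (for example, from `oc get deployment -o json`):

```python
def check_tp_matches_gpus(container: dict) -> bool:
    """True when the container's GPU limit equals its --tensor-parallel-size argument."""
    gpus = int(container["resources"]["limits"].get("nvidia.com/gpu", "0"))
    tp = 1  # vLLM default when the flag is absent
    for arg in container.get("args", []):
        if arg.startswith("--tensor-parallel-size="):
            tp = int(arg.split("=", 1)[1])
    return gpus == tp

# Abridged container spec matching the example Deployment
container = {
    "args": ["--port=8000", "--model=/model", "--tensor-parallel-size=1"],
    "resources": {"limits": {"nvidia.com/gpu": "1"}},
}
print(check_tp_matches_gpus(container))  # True
```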
Increase the deployment replica count to the required number. For example, run the following command:

$ oc scale deployment granite -n rhaiis-namespace --replicas=1

Optional: Watch the deployment and ensure that it succeeds:

$ oc get deployment -n rhaiis-namespace --watch

Example output

NAME      READY   UP-TO-DATE   AVAILABLE   AGE
granite   0/1     1            0           2s
granite   1/1     1            1           14s
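If you prefer to script the readiness check rather than watch interactively, you can evaluate the same fields that the READY column summarizes. A minimal sketch, assuming the Deployment has been fetched with `oc get deployment granite -n rhaiis-namespace -o json` and parsed into a dict:

```python
def is_deployment_ready(deployment: dict) -> bool:
    """True when every desired replica reports ready (the READY column shows n/n)."""
    desired = deployment.get("spec", {}).get("replicas", 0)
    ready = deployment.get("status", {}).get("readyReplicas", 0)
    return desired > 0 and ready == desired

# Abridged shapes of the JSON output during and after the rollout
rolling_out = {"spec": {"replicas": 1}, "status": {}}
available = {"spec": {"replicas": 1}, "status": {"readyReplicas": 1}}
print(is_deployment_ready(rolling_out), is_deployment_ready(available))  # False True
```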
Create a `Service` CR for the model inference. For example:

apiVersion: v1
kind: Service
metadata:
  name: granite
  namespace: rhaiis-namespace
spec:
  selector:
    app: granite
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8000

Optional: Create a `Route` CR to enable public access to the model. For example:

apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: granite
  namespace: rhaiis-namespace
spec:
  to:
    kind: Service
    name: granite
  port:
    targetPort: 80

Get the URL for the exposed route. Run the following command:

$ oc get route granite -n rhaiis-namespace -o jsonpath='{.spec.host}'

Example output

granite-rhaiis-namespace.apps.example.com
Verification
Ensure that the deployment is successful by querying the model. Run the following command:
$ curl -X POST http://granite-rhaiis-namespace.apps.example.com/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "granite-3-1-8b-instruct-quantized-w8a8",
        "messages": [{"role": "user", "content": "What is AI?"}],
        "temperature": 0.1
    }'
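Because AI Inference Server exposes an OpenAI-compatible API, the same verification request can be scripted. A minimal standard-library sketch that builds the identical chat completion request (the route host is the example value from above; actually sending the request requires the route to be reachable):

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str,
                       temperature: float = 0.1) -> urllib.request.Request:
    """Build an OpenAI-compatible chat completion request for the served model."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request(
    "http://granite-rhaiis-namespace.apps.example.com",
    "granite-3-1-8b-instruct-quantized-w8a8",
    "What is AI?",
)
# response = urllib.request.urlopen(req)  # uncomment when the route is reachable
```

The model name must match the `--served-model-name` argument in the Deployment, or the server returns a model-not-found error.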