Chapter 5. Deploying Red Hat AI Inference Server and inference serving the model
Deploy a language model with OpenShift Container Platform by configuring secrets, persistent storage, and a deployment custom resource (CR) that pulls the model from Hugging Face and uses Red Hat AI Inference Server to inference serve the model.
Prerequisites
- You have installed the OpenShift CLI (`oc`).
- You have logged in as a user with `cluster-admin` privileges.
- You have installed the Node Feature Discovery (NFD) Operator and the required GPU Operator for your underlying AI accelerator hardware.
Procedure
Create the `Secret` custom resource (CR) for the Hugging Face token. The cluster uses the `Secret` CR to pull models from Hugging Face.

Set the `HF_TOKEN` variable using the token you set in Hugging Face:

$ HF_TOKEN=<your_huggingface_token>

Set the cluster namespace to match where you deployed the Red Hat AI Inference Server image, for example:

$ NAMESPACE=rhaiis-namespace

Create the `Secret` CR in the cluster:

$ oc create secret generic hf-secret --from-literal=HF_TOKEN=$HF_TOKEN -n $NAMESPACE
Create the Docker secret so that the cluster can download the Red Hat AI Inference Server image from the container registry. For example, to create a `Secret` CR that contains the contents of your local `~/.docker/config.json` file, run the following command:

$ oc create secret generic docker-secret --from-file=.dockercfg=$HOME/.docker/config.json --type=kubernetes.io/dockercfg -n rhaiis-namespace
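The command above packages your local registry credentials into a `kubernetes.io/dockercfg` secret. The following Python sketch mirrors what that packaging looks like; the registry entry and user token are hypothetical placeholders, not real values:

```python
import base64
import json

def build_dockercfg_secret(name: str, namespace: str, dockercfg: dict) -> dict:
    """Build a kubernetes.io/dockercfg Secret manifest: the credentials file is
    base64-encoded and stored under the .dockercfg data key."""
    encoded = base64.b64encode(json.dumps(dockercfg).encode()).decode()
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "kubernetes.io/dockercfg",
        "data": {".dockercfg": encoded},
    }

# Hypothetical credentials for illustration only.
cfg = {"registry.redhat.io": {"auth": base64.b64encode(b"user:token").decode()}}
secret = build_dockercfg_secret("docker-secret", "rhaiis-namespace", cfg)
```

Applying a manifest like this with `oc apply -f` is equivalent to the `oc create secret` command above.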
Create a `PersistentVolumeClaim` (PVC) custom resource (CR) and apply it in the cluster. The following example PVC CR uses a default IBM VPC Block persistent volume. You use the PVC as the location where you store the models that you download.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: rhaiis-namespace
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: ibmc-vpc-block-10iops-tier

Note: Configuring cluster storage to meet your requirements is outside the scope of this procedure. For more detailed information, see Configuring persistent storage.
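The 20Gi request in the example leaves headroom for the quantized Granite 8B model. A rough sizing rule is parameter count multiplied by bytes per parameter (about one byte per parameter for w8a8 weights), plus a safety margin; the 1.2 overhead factor below is an assumption for illustration, not a vLLM requirement:

```python
import math

def estimate_model_storage_gib(num_params: float, bytes_per_param: float,
                               overhead: float = 1.2) -> int:
    """Rough PVC sizing: weight bytes times a safety factor, rounded up to whole GiB."""
    raw_bytes = num_params * bytes_per_param * overhead
    return math.ceil(raw_bytes / 2**30)

# 8B parameters quantized to 8-bit weights (w8a8 => ~1 byte per parameter)
print(estimate_model_storage_gib(8e9, 1.0))  # 9 GiB, comfortably under the 20Gi request
```

For an unquantized 16-bit checkpoint of the same model, the same rule suggests roughly double the storage.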
Create a `Deployment` custom resource (CR) that pulls the model and deploys the Red Hat AI Inference Server container. Reference the following example `Deployment` CR, which uses AI Inference Server to serve a Granite model on a CUDA accelerator.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: granite
  namespace: rhaiis-namespace
  labels:
    app: granite
spec:
  replicas: 1
  selector:
    matchLabels:
      app: granite
  template:
    metadata:
      labels:
        app: granite
    spec:
      imagePullSecrets:
        - name: docker-secret
      volumes:
        - name: model-volume
          persistentVolumeClaim:
            claimName: model-cache
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "2Gi"
        - name: oci-auth
          secret:
            secretName: docker-secret
            items:
              - key: .dockercfg
                path: config.json
      serviceAccountName: default
      initContainers:
        - name: fetch-model
          image: ghcr.io/oras-project/oras:v1.2.0
          env:
            - name: DOCKER_CONFIG
              value: /auth
          command: ["/bin/sh", "-c"]
          args:
            - |
              set -e
              # Only pull if /model is empty
              if [ -z "$(ls -A /model)" ]; then
                echo "Pulling model..."
                oras pull registry.redhat.io/rhelai1/granite-3-1-8b-instruct-quantized-w8a8:1.5 \
                  --output /model
              else
                echo "Model already present, skipping model pull"
              fi
          volumeMounts:
            - name: model-volume
              mountPath: /model
            - name: oci-auth
              mountPath: /auth
              readOnly: true
      containers:
        - name: granite
          image: 'registry.redhat.io/rhaiis/vllm-cuda-rhel9@sha256:a6645a8e8d7928dce59542c362caf11eca94bb1b427390e78f0f8a87912041cd'
          imagePullPolicy: IfNotPresent
          env:
            - name: VLLM_SERVER_DEV_MODE
              value: '1'
          command:
            - python
            - '-m'
            - vllm.entrypoints.openai.api_server
          args:
            - '--port=8000'
            - '--model=/model'
            - '--served-model-name=granite-3-1-8b-instruct-quantized-w8a8'
            - '--tensor-parallel-size=1'
          resources:
            limits:
              cpu: '10'
              nvidia.com/gpu: '1'
              memory: 16Gi
            requests:
              cpu: '2'
              memory: 6Gi
              nvidia.com/gpu: '1'
          volumeMounts:
            - name: model-volume
              mountPath: /model
            - name: shm
              mountPath: /dev/shm
      restartPolicy: Always
Where:

- `namespace: rhaiis-namespace`: Specifies the deployment namespace. The value of `metadata.namespace` must match the namespace where you configured the Hugging Face `Secret` CR.
- `claimName: model-cache`: Specifies the persistent volume claim name. The value of `spec.template.spec.volumes.persistentVolumeClaim.claimName` must match the name of the PVC that you created.
- `initContainers`: Defines a container that runs before the main application container to download the required model from the container registry. The model pull step is skipped if the model directory has already been populated, for example, from a previous deployment.
- `mountPath: /dev/shm`: Mounts the shared memory volume required by the NVIDIA Collective Communications Library (NCCL). Tensor parallel vLLM deployments fail without this volume mount.
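One consistency check worth noting: for single-node tensor parallelism, the `nvidia.com/gpu` limit on the container should equal the `--tensor-parallel-size` argument, because vLLM spawns one worker per tensor-parallel rank. A minimal sketch of that check, assuming the container spec has been loaded as a Python dict (for example, from `oc get deployment -o json`):

```python
def check_tp_matches_gpus(container: dict) -> bool:
    """True when the container's GPU limit equals its --tensor-parallel-size argument."""
    gpus = int(container["resources"]["limits"].get("nvidia.com/gpu", "0"))
    tp = 1  # vLLM default when the flag is absent
    for arg in container.get("args", []):
        if arg.startswith("--tensor-parallel-size="):
            tp = int(arg.split("=", 1)[1])
    return gpus == tp

# Abridged container spec matching the example Deployment
container = {
    "args": ["--port=8000", "--model=/model", "--tensor-parallel-size=1"],
    "resources": {"limits": {"nvidia.com/gpu": "1"}},
}
print(check_tp_matches_gpus(container))  # True
```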
Increase the deployment replica count to the required number. For example, run the following command:

$ oc scale deployment granite -n rhaiis-namespace --replicas=1

Optional: Watch the deployment and ensure that it succeeds:

$ oc get deployment -n rhaiis-namespace --watch

Example output

NAME      READY   UP-TO-DATE   AVAILABLE   AGE
granite   0/1     1            0           2s
granite   1/1     1            1           14s
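If you prefer to script the readiness check rather than watch interactively, you can evaluate the same fields that the READY column summarizes. A minimal sketch, assuming the Deployment has been fetched with `oc get deployment granite -n rhaiis-namespace -o json` and parsed into a dict:

```python
def is_deployment_ready(deployment: dict) -> bool:
    """True when every desired replica reports ready (the READY column shows n/n)."""
    desired = deployment.get("spec", {}).get("replicas", 0)
    ready = deployment.get("status", {}).get("readyReplicas", 0)
    return desired > 0 and ready == desired

# Abridged shapes of the JSON output during and after the rollout
rolling_out = {"spec": {"replicas": 1}, "status": {}}
available = {"spec": {"replicas": 1}, "status": {"readyReplicas": 1}}
print(is_deployment_ready(rolling_out), is_deployment_ready(available))  # False True
```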
Create a `Service` CR for the model inference. For example:

apiVersion: v1
kind: Service
metadata:
  name: granite
  namespace: rhaiis-namespace
spec:
  selector:
    app: granite
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8000

Optional: Create a `Route` CR to enable public access to the model. For example:

apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: granite
  namespace: rhaiis-namespace
spec:
  to:
    kind: Service
    name: granite
  port:
    targetPort: 80

Get the URL for the exposed route. Run the following command:

$ oc get route granite -n rhaiis-namespace -o jsonpath='{.spec.host}'

Example output

granite-rhaiis-namespace.apps.example.com
Verification
Ensure that the deployment is successful by querying the model. Run the following command:
$ curl -X POST http://granite-rhaiis-namespace.apps.example.com/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "granite-3-1-8b-instruct-quantized-w8a8",
        "messages": [{"role": "user", "content": "What is AI?"}],
        "temperature": 0.1
    }'
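Because AI Inference Server exposes an OpenAI-compatible API, the same verification request can be scripted. A minimal standard-library sketch that builds the identical chat completion request (the route host is the example value from above; actually sending the request requires the route to be reachable):

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str,
                       temperature: float = 0.1) -> urllib.request.Request:
    """Build an OpenAI-compatible chat completion request for the served model."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request(
    "http://granite-rhaiis-namespace.apps.example.com",
    "granite-3-1-8b-instruct-quantized-w8a8",
    "What is AI?",
)
# response = urllib.request.urlopen(req)  # uncomment when the route is reachable
```

The model name must match the `--served-model-name` argument in the Deployment, or the server returns a model-not-found error.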