Chapter 5. Deploying Red Hat AI Inference Server and inference serving the model


Deploy a language model on OpenShift Container Platform by configuring secrets, persistent storage, and a Deployment custom resource (CR) that pulls the model and uses Red Hat AI Inference Server to inference serve it.

Prerequisites

  • You have installed the OpenShift CLI (oc).
  • You have logged in as a user with cluster-admin privileges.
  • You have installed the Node Feature Discovery (NFD) Operator and the required GPU Operator for your underlying AI accelerator hardware.

Procedure

  1. Create the Secret custom resource (CR) for the Hugging Face token. The cluster uses the Secret CR to pull models from Hugging Face.

    1. Set the HF_TOKEN variable to the value of the user access token that you created in Hugging Face.

      $ HF_TOKEN=<your_huggingface_token>
    2. Set the NAMESPACE variable to the namespace where you deployed the Red Hat AI Inference Server image, for example:

      $ NAMESPACE=rhaiis-namespace
    3. Create the Secret CR in the cluster:

      $ oc create secret generic hf-secret --from-literal=HF_TOKEN=$HF_TOKEN -n $NAMESPACE
  2. Create the Docker secret so that the cluster can download the Red Hat AI Inference Server image from the container registry. For example, to create a Secret CR that contains the contents of your local ~/.docker/config.json file, run the following command:

    $ oc create secret generic docker-secret --from-file=.dockercfg=$HOME/.docker/config.json --type=kubernetes.io/dockercfg -n rhaiis-namespace
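Optionally, you can confirm that both secrets exist before continuing. The names below assume the `hf-secret` and `docker-secret` names used in this procedure:

```shell
# Confirm that both secrets exist in the target namespace
oc get secret hf-secret docker-secret -n rhaiis-namespace
```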
  3. Create a PersistentVolumeClaim (PVC) CR and apply it in the cluster. The following example PVC CR uses a default IBM VPC Block persistent volume. The PVC provides the location where the downloaded models are stored.

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: model-cache
      namespace: rhaiis-namespace
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 20Gi
      storageClassName: ibmc-vpc-block-10iops-tier
    Note

    Configuring cluster storage to meet your requirements is outside the scope of this procedure. For more detailed information, see Configuring persistent storage.
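Assuming you saved the example PVC manifest as `pvc.yaml`, you can apply it and check the claim status before moving on:

```shell
# Apply the PVC manifest and check its status
oc apply -f pvc.yaml
# STATUS reports Pending or Bound; with WaitForFirstConsumer storage
# classes, the claim binds only after the first pod mounts it
oc get pvc model-cache -n rhaiis-namespace
```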

  4. Create a Deployment custom resource (CR) that pulls the model and deploys the Red Hat AI Inference Server container. Reference the following example Deployment CR, in which an init container pulls a Granite model from the Red Hat registry and AI Inference Server serves the model on a CUDA accelerator.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: granite
      namespace: rhaiis-namespace
      labels:
        app: granite
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: granite
      template:
        metadata:
          labels:
            app: granite
        spec:
          imagePullSecrets:
            - name: docker-secret
          volumes:
            - name: model-volume
              persistentVolumeClaim:
                claimName: model-cache
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: "2Gi"
            - name: oci-auth
              secret:
                secretName: docker-secret
                items:
                  - key: .dockercfg
                    path: config.json
          serviceAccountName: default
          initContainers:
            - name: fetch-model
              image: ghcr.io/oras-project/oras:v1.2.0
              env:
                - name: DOCKER_CONFIG
                  value: /auth
              command: ["/bin/sh","-c"]
              args:
                - |
                  set -e
                  # Only pull if /model is empty
                  if [ -z "$(ls -A /model)" ]; then
                    echo "Pulling model..."
                    oras pull registry.redhat.io/rhelai1/granite-3-1-8b-instruct-quantized-w8a8:1.5 \
                      --output /model
                  else
                    echo "Model already present, skipping model pull"
                  fi
              volumeMounts:
                - name: model-volume
                  mountPath: /model
                - name: oci-auth
                  mountPath: /auth
                  readOnly: true
          containers:
            - name: granite
              image: 'registry.redhat.io/rhaiis/vllm-cuda-rhel9@sha256:a6645a8e8d7928dce59542c362caf11eca94bb1b427390e78f0f8a87912041cd'
              imagePullPolicy: IfNotPresent
              env:
                - name: VLLM_SERVER_DEV_MODE
                  value: '1'
              command:
                - python
                - '-m'
                - vllm.entrypoints.openai.api_server
              args:
                - '--port=8000'
                - '--model=/model'
                - '--served-model-name=granite-3-1-8b-instruct-quantized-w8a8'
                - '--tensor-parallel-size=1'
              resources:
                limits:
                  cpu: '10'
                  nvidia.com/gpu: '1'
                  memory: 16Gi
                requests:
                  cpu: '2'
                  memory: 6Gi
                  nvidia.com/gpu: '1'
              volumeMounts:
                - name: model-volume
                  mountPath: /model
                - name: shm
                  mountPath: /dev/shm
          restartPolicy: Always

    Where:

    namespace: rhaiis-namespace
    Specifies the deployment namespace. The value of metadata.namespace must match the namespace where you configured the Hugging Face Secret CR.
    claimName: model-cache
    Specifies the persistent volume claim name. The value of spec.template.spec.volumes.persistentVolumeClaim.claimName must match the name of the PVC that you created.
    initContainers:
    Defines a container that runs before the main application container and downloads the required model into the persistent volume. The model pull step is skipped if the model directory has already been populated, for example, from a previous deployment.
    mountPath: /dev/shm
    Mounts the shared memory volume required by the NVIDIA Collective Communications Library (NCCL). Tensor parallel vLLM deployments fail without this volume mount.
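Assuming you saved the example Deployment CR as `deployment.yaml`, you can apply it and follow the model download performed by the `fetch-model` init container:

```shell
# Apply the Deployment and follow the fetch-model init container logs
oc apply -f deployment.yaml
oc logs deployment/granite -c fetch-model -n rhaiis-namespace --follow
```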
  5. Set the deployment replica count to the required number. For example, run the following command:

    $ oc scale deployment granite -n rhaiis-namespace --replicas=1
  6. Optional: Watch the deployment and ensure that it succeeds:

    $ oc get deployment -n rhaiis-namespace --watch

    Example output

    NAME      READY   UP-TO-DATE   AVAILABLE   AGE
    granite   0/1     1            0           2s
    granite   1/1     1            1           14s
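If the deployment does not become available, inspecting the pods usually reveals the cause, such as an unschedulable `nvidia.com/gpu` request or a failed image pull:

```shell
# List the pods behind the Deployment and show recent details and events
oc get pods -n rhaiis-namespace -l app=granite
oc describe pods -n rhaiis-namespace -l app=granite | tail -n 30
```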

  7. Create a Service CR for model inference. For example:

    apiVersion: v1
    kind: Service
    metadata:
      name: granite
      namespace: rhaiis-namespace
    spec:
      selector:
        app: granite
      ports:
        - protocol: TCP
          port: 80
          targetPort: 8000
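Before creating a Route, you can reach the Service from your workstation with a port-forward. The `/v1/models` path is part of the OpenAI-compatible API that AI Inference Server serves:

```shell
# Forward local port 8080 to the Service, query the API, then stop the forward
oc port-forward service/granite 8080:80 -n rhaiis-namespace &
PF_PID=$!
sleep 2
curl -s http://localhost:8080/v1/models
kill "$PF_PID"
```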
  8. Optional: Create a Route CR to enable public access to the model. For example:

    apiVersion: route.openshift.io/v1
    kind: Route
    metadata:
      name: granite
      namespace: rhaiis-namespace
    spec:
      to:
        kind: Service
        name: granite
      port:
        targetPort: 80
  9. Get the URL for the exposed route. Run the following command:

    $ oc get route granite -n rhaiis-namespace -o jsonpath='{.spec.host}'

    Example output

    granite-rhaiis-namespace.apps.example.com
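You can capture the hostname in a shell variable and confirm that the model is reachable through the route. The `/v1/models` endpoint lists the models that the server currently serves:

```shell
# Capture the route hostname and list the served models through it
HOST=$(oc get route granite -n rhaiis-namespace -o jsonpath='{.spec.host}')
curl -s "http://${HOST}/v1/models"
```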

Verification

Ensure that the deployment is successful by querying the model. Run the following command:

curl -X POST http://granite-rhaiis-namespace.apps.example.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "granite-3-1-8b-instruct-quantized-w8a8",
    "messages": [{"role": "user", "content": "What is AI?"}],
    "temperature": 0.1
  }'
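To extract just the generated text from the completion response, you can pipe the JSON through a short python3 one-liner, which avoids a jq dependency. The sample body below is illustrative; the field paths follow the OpenAI-compatible chat completion schema:

```shell
# Illustrative sample body; in practice, pipe the curl output from the previous step
RESPONSE='{"choices":[{"message":{"role":"assistant","content":"AI is the simulation of human intelligence by machines."}}]}'
echo "$RESPONSE" | python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
```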